PPOL 5203 - Data Science I: Foundations

Week 8: Parsing Unstructured Data: Scraping Static Websites

Author

Professor: Tiago Ventura

Published

October 29, 2024

Final Project: https://tiagoventura.github.io/ppol5203/finalproject.html.
Take 2 min to read through.

Your Project Proposal

Requirement        Due           Length          Percentage
Project Proposal   November 15   2 pages         5%
Presentation       December 10   10-15 minutes   10%
Project Report     December 17   10 pages        25%


Questions?

Where we are…

  • We started with the basics of being a data scientist

  • We moved over to the primitives of Python as your main DS tool

  • Then we started our journey working with tabular data.

  • Now we will learn how to collect and parse unstructured data sources:

    • Scraping websites (today)

    • Working with APIs and Dynamic Websites (next week)

Plans for Today

  • Scraping Static Websites

    • Different strategies to acquire digital data
    • HTML data structure
    • Scrape static websites
    • Build a scraper programmatically

Acquiring Digital Data

Why? Digital Era

Two ways to access data from the web

  • Scraping: grab information available on websites (Today)

    • Leveraging the structure of a website to grab its contents

    • Using a programming environment (such as R, Python, Java, etc.) to systematically extract that content.

    • Accomplishing the above in an “unobtrusive” and legal way.

  • APIs: Data transfer portals built by companies to allow people to access data programmatically. (Next week)

    • When available, we will always prefer APIs

Post-API Era

Consider the following developments from the past year or so:

What have I scraped?

  • Electoral data from many different countries;

  • Composition of elites around the world;

  • Wikipedia;

  • Toutiao, a news aggregator from China;

  • Political Manifestos in Brazil

  • Fact-Checking News

  • Facebook and YouTube Live Chats.

  • Property Prices from Zillow.

  • News in Latin America

What is a website?

In general, a website is a combination of HTML, CSS, JavaScript, and server-side code such as PHP.

  • HTML provides the structure of the website.

  • CSS makes that structure visually appealing by controlling the design and layout.

  • JavaScript adds interactivity and makes the website dynamic on the client side.

  • PHP handles server-side tasks like generating dynamic HTML content or interacting with a database.

A simple example

<html>
<head>
  <title> Michael Cohen's Email </title>
  <script>
    var foot = bar;
  </script>
</head>
<body>
  <div id="payments">
  <h2>Second heading</h2>
  <p class='slick'>information about <br/><i>payments</i></p>
  <p>Just <a href="http://www.google.com">google it!</a></p>
  </div>
</body>
</html>

We will care mostly about HTML and CSS for static websites.

Scraping is all about finding tags and collecting the data associated with them
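To make "finding tags and collecting the data associated with them" concrete, here is a minimal sketch that pulls the link targets and paragraph text out of a trimmed copy of the example page. It uses only Python's built-in `html.parser` so that it runs without extra installs; the class name `TagCollector` is a hypothetical helper, and the course notebook may well use a different library.

```python
from html.parser import HTMLParser

# A trimmed copy of the example page, with every tag properly closed.
PAGE = """
<html>
<head>
  <title> Michael Cohen's Email </title>
</head>
<body>
  <div id="payments">
  <h2>Second heading</h2>
  <p class='slick'>information about <br/><i>payments</i></p>
  <p>Just <a href="http://www.google.com">google it!</a></p>
  </div>
</body>
</html>
"""


class TagCollector(HTMLParser):
    """Hypothetical helper: collects <a> link targets and the text inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.links = []        # href value of every <a> tag
        self.paragraphs = []   # text found inside every <p> tag
        self._in_p = False     # True while the parser is inside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text nested in <i> or <a> still counts as part of the paragraph.
        if self._in_p:
            self.paragraphs[-1] += data


collector = TagCollector()
collector.feed(PAGE)

links = collector.links
# Collapse runs of whitespace so each paragraph is one clean string.
paragraphs = [" ".join(p.split()) for p in collector.paragraphs]

print(links)
print(paragraphs)
```

The tag name tells the parser where to look, and the attributes (`href`) or the enclosed text are the data we keep.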

Tags + HTML Elements

Most common tags

  • p – paragraphs
  • a href – links
  • div – divisions
  • h1 to h6 – headings
  • table – tables

See here for more about HTML tags

Scraping Routine

Scraping often involves the following routine:

  • Step 1: Find a website with information you want to collect

  • Step 2: Understand and decipher the website

  • Step 3: Write code to collect one realization of the data

  • Step 4: Build a scraper – generalize your code into a function.

We will cover these steps with code!!
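As a rough illustration of Steps 3 and 4, the sketch below first extracts the `h2` headings from one page and then wraps that logic in a function that loops over several pages. Everything named here (`HeadingParser`, `parse_headings`, `scrape_many`, `fetch_html`, and the `example.com` URLs) is hypothetical, and the pages are in-memory strings standing in for real downloads so the example runs offline.

```python
import time
import urllib.request
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collects the text of every <h2> tag on a page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headings[-1] += data


def parse_headings(html):
    """Step 3: collect one realization of the data from a single page."""
    parser = HeadingParser()
    parser.feed(html)
    return [h.strip() for h in parser.headings]


def fetch_html(url):
    """Download a page as text (defined for real use, not called in this sketch)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def scrape_many(urls, fetch, delay=0.0):
    """Step 4: generalize the single-page code to a list of pages."""
    results = {}
    for url in urls:
        results[url] = parse_headings(fetch(url))
        time.sleep(delay)  # pause between requests; use a positive delay on real sites
    return results


# Stand-in for a real website: URL -> HTML text (made-up content).
FAKE_SITE = {
    "https://example.com/a": "<html><body><h2>First story</h2><p>text</p></body></html>",
    "https://example.com/b": "<html><body><h2>Second story</h2><h2>Third story</h2></body></html>",
}

results = scrape_many(FAKE_SITE, FAKE_SITE.get)
print(results)
```

Passing the download function in as an argument keeps the parsing logic testable on saved pages; for live scraping one could pass `fetch_html` and a positive `delay` instead.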

Ethical Challenges of Scraping

Web scraping is generally considered legal as long as the scraped data is publicly available and the scraping activity does not harm the website or the people whose information is being scraped.

Here is a list of good practices for scraping:

  • Don’t hit servers too often or during peak hours

  • Slow your code down to roughly the speed at which a human would browse manually

  • Use data responsibly (as academics often do)

  • Respect robots.txt

# Put the system to sleep for a random number of seconds between 1 and 5
import random
import time
time.sleep(random.uniform(1, 5))
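Respecting robots.txt can also be checked in code. The sketch below uses the standard library's `urllib.robotparser` on a made-up robots.txt (the rules, the `example.com` URLs, and the helper `polite_delay` are all assumptions for illustration) to decide whether a page may be fetched and how long to wait before the next request.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def polite_delay(robots, user_agent="*", low=1.0, high=5.0):
    """Return a random wait in seconds, never shorter than the site's Crawl-delay."""
    crawl_delay = robots.crawl_delay(user_agent) or 0
    return max(crawl_delay, random.uniform(low, high))


# Ask whether a generic crawler ("*") may fetch each URL.
allowed = robots.can_fetch("*", "https://example.com/news/today.html")
blocked = robots.can_fetch("*", "https://example.com/private/data.html")
wait = polite_delay(robots)

print(allowed, blocked)
# Before the next request, a scraper would call time.sleep(wait).
```

For a live site, `set_url(...)` followed by `read()` on the same `RobotFileParser` object downloads the real file instead of parsing a string.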

Notebook on Scraping