PPOL 5203 - Data Science I: Foundations

Week 8: Parsing Unstructured Data: Scraping Static Websites

Professor: Tiago Ventura

Plans for Today

  • Scraping Static Websites (Lecture Notes 1)

    • Different strategies to acquire digital data
    • HTML data structure
    • Scrape static websites
    • Build a scraper programmatically
  • Accessing data via APIs (Lecture Notes 2)

  • Optional: Scraping Dynamic Websites with Selenium (Lecture Notes 3)

Acquiring Digital Data

Two ways to access data from the web

  • Scraping: grab information available on websites

    • Leveraging the structure of a website to grab its contents

    • Using a programming environment (such as R, Python, Java, etc.) to systematically extract that content.

    • Accomplishing the above in an “unobtrusive” and legal way.

  • APIs: Data transfer portals built by companies to allow people to access data programmatically.

    • When available, we will always prefer APIs

Post-API Era

What is a website?

In general, a website is a combination of HTML, CSS, JavaScript, and PHP.

  • HTML provides the structure of the website.

  • CSS makes that structure visually appealing by controlling the design and layout.

  • JavaScript adds interactivity and makes the website dynamic on the client side.

  • PHP handles server-side tasks like generating dynamic HTML content or interacting with a database.

A simple example

<html>
<head>
  <title> Michael Cohen's Email </title>
  <script>
    var foo = bar;
  </script>
</head>
<body>
  <div id="payments">
  <h2>Second heading</h2>
  <p class='slick'>information about <br/><i>payments</i></p>
  <p>Just <a href="http://www.google.com">google it!</a></p>
  </div>
</body>
</html>

We will care mostly about HTML and CSS for static websites.

Scraping is all about finding tags and collecting the data associated with them

Tags + HTML Elements

Most common tags

  • p – paragraphs
  • a href – links
  • div – divisions
  • h – headings
  • table – tables

See here for more about HTML tags
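
To make tags concrete, here is a minimal sketch that parses the toy HTML above with BeautifulSoup (the bs4 library) and pulls out elements by their tags. The page string is just the example from the previous slide, stored in a variable.

from bs4 import BeautifulSoup

# The toy page from the previous slide, stored as a string
html = """
<div id="payments">
  <h2>Second heading</h2>
  <p class="slick">information about <br/><i>payments</i></p>
  <p>Just <a href="http://www.google.com">google it!</a></p>
</div>
"""

# Parse the raw HTML into a searchable tree
soup = BeautifulSoup(html, "html.parser")

# Find elements by their tags
print(soup.find("h2").text)                    # the heading
print([p.text for p in soup.find_all("p")])    # all paragraphs
print(soup.find("a")["href"])                  # the link inside the a href tag
print(soup.find("p", class_="slick").text)     # a paragraph selected by its CSS class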

Scraping Routine

Scraping often involves the following routine:

  • Step 1: Find a website with information you want to collect

  • Step 2: Understand and decipher the website

  • Step 3: Write code to collect one realization of the data

  • Step 4: Build a scraper – generalize your code into a function.

We will cover these steps with code!!
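
As a preview of Steps 3 and 4, here is a minimal sketch: first write code that collects one realization of the data, then wrap it in a function and loop it over many pages. The URLs and the choice of paragraph tags are placeholders, not the website we will scrape in the notebook.

import requests
from bs4 import BeautifulSoup

# Step 3: write code that collects one realization of the data
def scrape_page(url):
    """Download one page and return the text of all its paragraphs."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    return [p.text for p in soup.find_all("p")]

# Step 4: generalize – the same function now runs over many pages
urls = ["https://example.com/page1", "https://example.com/page2"]
results = {url: scrape_page(url) for url in urls}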

Ethical Challenges on scraping

Web scraping is legal as long as the scraped data is publicly available and the scraping activity does not harm the website or the people from whom information is being scraped.

Here is a list of good practices for scraping:

  • Don’t hit servers too often, and avoid peak hours

  • Slow down your code to the speed at which a human would browse

  • Use data responsibly (As academics often do)

# Put the scraper to sleep for a random interval (in seconds)
import time
import random
time.sleep(random.uniform(1, 5))
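
In practice, the pause goes inside the loop that visits each page, so every request is separated by a human-scale delay. The URLs below are placeholders.

import time
import random
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]
pages = []
for url in urls:
    pages.append(requests.get(url).text)   # grab one page
    time.sleep(random.uniform(1, 5))       # wait 1 to 5 seconds before the next request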

Notebook on Scraping

APIs

APIs 101

  • APIs: Application Programming Interface

  • How it works: an online server created to facilitate information exchange between data users and data holders

  • Use Cases:

    • Access data shared by Companies and NGOs
    • Process our data with models developed by third-party organizations
  • We will focus today on the first use case. We will see the second later when we cover LLMs

API Components

An API call is just a URL. See the example below:

http://mywebsite.com/endpoint?key&param_1&param_2

  • Main Components:

  • http://mywebsite.com/: the API root, i.e., the domain of your API

  • endpoint: An endpoint is a server route for retrieving specific data from an API

  • key: credentials that some APIs require you to create before you can query them.

  • ?param_1&param_2: filters that you can add to API requests. We call these filters parameters
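
To see how these components map onto code, here is a minimal sketch with the requests library: the root and the endpoint go into the URL, while the key and the parameters are passed as a dictionary that requests converts into the ?key=...&param_1=... query string. All the names below (mywebsite.com, endpoint, the parameter names) are the placeholders from the slide, not a real API.

import requests

# API root + endpoint
url = "http://mywebsite.com/endpoint"

# key and parameters become the query string: ?key=...&param_1=...&param_2=...
params = {"key": "MY_API_KEY", "param_1": "value1", "param_2": "value2"}

response = requests.get(url, params=params)
print(response.url)   # the full URL that was actually requested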

Making an API Request

In order to work with APIs, we need tools to access the web. In Python, the most common library for making requests and working with APIs is the requests library.

  • get(): to receive information from the API

  • post(): to send information to the API – think of sending text to ChatGPT for classification.
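
A minimal sketch of the two verbs, using placeholder URLs and payloads rather than a real API:

import requests

# get(): ask the API for data, with parameters as filters
r = requests.get("http://mywebsite.com/endpoint", params={"param_1": "value1"})

# post(): send data to the API, e.g., text you want a model to classify
r = requests.post("http://mywebsite.com/classify", json={"text": "classify this sentence"})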

Step-by-Step

When using an API, our work will often involve the following steps:

  • Step 1: Look at the API documentation and endpoints, and construct a query of interest
  • Step 2: Use requests.get(querystring) to call the API
  • Step 3: Examine the response
  • Step 4: Extract your data and save it.
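
Putting the four steps together, here is a minimal sketch. It assumes the Open Trivia Database (opentdb.com) as a no-authentication Trivia API; the endpoint and parameters follow that API's public documentation and may differ from the example used in the notebook.

import requests
import pandas as pd

# Step 1: construct the query from the API documentation
querystring = "https://opentdb.com/api.php?amount=10&type=multiple"

# Step 2: call the API
response = requests.get(querystring)

# Step 3: examine the response
print(response.status_code)    # 200 means the request succeeded
data = response.json()         # parse the JSON payload into a dictionary

# Step 4: extract the data and save it
df = pd.DataFrame(data["results"])
df.to_csv("trivia_questions.csv", index=False)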

In-Class Example

We will work with three different types of APIs today:

  • Trivia API: API that does not require authentication and does not have a wrapper

  • Yelp API: API that does require authentication and does not have a wrapper

  • Youtube API: API that does require authentication and does have a wrapper

  • Wrapper: a set of functions or methods (a full library) that provides a simplified interface to interact with an underlying API.
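
To make the idea of a wrapper concrete, here is a sketch of the YouTube API accessed through its Python wrapper (the google-api-python-client library): the wrapper builds the URL, attaches the key, and parses the JSON for you. The search query and parameters are illustrative, and you need your own API key.

from googleapiclient.discovery import build

# The wrapper builds the request URL, attaches the key, and parses the JSON
youtube = build("youtube", "v3", developerKey="MY_API_KEY")
request = youtube.search().list(part="snippet", q="data science", maxResults=5)
response = request.execute()   # a plain Python dictionary

for item in response["items"]:
    print(item["snippet"]["title"])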

Notebook for APIs