Week 8: Parsing Unstructured Data: Scraping Static Websites
Scraping Static Websites (Lecture Notes 1)
Accessing data via APIs (Lecture Notes 2)
Optional: Scraping Dynamic Website with Selenium (Lecture Notes 3)
Scraping: grab information available on websites
Leveraging the structure of a website to grab its contents
Using a programming environment (such as R, Python, Java, etc.) to systematically extract that content.
Accomplishing the above in an “unobtrusive” and legal way.
APIs: Data transfer portals built by companies to allow people to access data programmatically.
Elon Musk has eliminated free access to Twitter’s API, and the only academically useful paid tiers far exceed most researchers’ budgets.
Musk has also demanded that Decahose users delete all Twitter data acquired under previous agreements; whether this demand will be extended to Academic API users is currently unknown.
Reddit has denied access to its API for Pushshift, a popular service used by researchers to collect Reddit data. Popular Reddit app Apollo is facing API charges of $1.7M per month to continue operating.
TikTok released a new API for researchers, which among other things requires them “to regularly refresh TikTok Research API Data at least every fifteen (15) days, and delete data that is not available from the TikTok Research API at the time of each refresh.”
Crowdtangle, Meta’s researcher tool for acquiring data from Facebook and Instagram, still exists as of this writing. But rumors of its imminent demise have been reported in multiple reputable outlets.
A website is, in general, a combination of HTML, CSS, JavaScript, and PHP.
HTML provides the structure of the website.
CSS makes that structure visually appealing by controlling the design and layout.
JavaScript adds interactivity and makes the website dynamic on the client side.
PHP handles server-side tasks like generating dynamic HTML content or interacting with a database.
<html>
  <head>
    <title>Michael Cohen's Email</title>
    <script>
      var foo = bar;
    </script>
  </head>
  <body>
    <div id="payments">
      <h2>Second heading</h2>
      <p class='slick'>information about <br/><i>payments</i></p>
      <p>Just <a href="http://www.google.com">google it!</a></p>
    </div>
  </body>
</html>
For static websites, we will care mostly about HTML and CSS.
Scraping often involves the following routine:
Step 1: Find a website with information you want to collect
Step 2: Understand and decipher the website
Step 3: Write code to collect *one* realization of the data
Step 4: Build a scraper by generalizing your code into a function
We will cover these steps with code!!
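As a sketch of Step 3, one might parse a page like the example above with the BeautifulSoup library (an assumption about tooling; install with `pip install beautifulsoup4`). Here the HTML is inlined rather than fetched over the network, so the parsing step stands on its own:

```python
from bs4 import BeautifulSoup

# The example page from above, inlined as a string
html = """
<html>
  <head><title>Michael Cohen's Email</title></head>
  <body>
    <div id="payments">
      <h2>Second heading</h2>
      <p class='slick'>information about <br/><i>payments</i></p>
      <p>Just <a href="http://www.google.com">google it!</a></p>
    </div>
  </body>
</html>
"""

# Parse the HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

# Leverage the structure (tags, classes, attributes) to pull out pieces
print(soup.title.get_text())                      # Michael Cohen's Email
print(soup.find("p", class_="slick").get_text())  # information about payments
print(soup.find("a")["href"])                     # http://www.google.com
```

Step 4 would wrap this logic in a function that takes a URL, fetches the page (e.g. with `requests.get`), and returns the extracted fields.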
Web scraping is legal as long as the scraped data is publicly available and the scraping activity does not harm the website or the people whose information is being scraped.
Here is a list of good practices for scraping:
Don't hit servers too often or during peak hours
Slow your code down to the speed at which a human would browse
Use data responsibly (as academics often do)
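One simple way to follow the slow-down practice is to pause for a random interval between requests:

```python
import random
import time

# Draw a random wait between 1 and 5 seconds
delay = random.uniform(1, 5)

# Put the system to sleep for that random interval before the next request
time.sleep(delay)
```

Randomizing the delay (rather than sleeping a fixed amount) makes the request pattern look less mechanical and spreads load on the server.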
API: Application Programming Interface
How it works: an online server created to facilitate information exchange between data users and data holders
Use Cases:
We will focus today on the first component. We will see the second later when we cover LLMs.
An API is just a URL. See the example below:
http://mywebsite.com/endpoint?key&param_1&param_2
Main Components:
http://mywebsite.com/: the API root, i.e. the domain of your API
endpoint: a server route for retrieving specific data from an API
key: credentials that some websites ask you to create before you can query the API
&param_1&param_2: filters that you can supply in API requests. We call these filters parameters
In order to work with APIs, we need tools to access the web. In Python, the most common library for making requests and working with APIs is the requests library.
get(): to receive information from the API
post(): to send information to the API (think about the use of ChatGPT for classification of text)
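A minimal sketch of how the URL components come together in the requests library. The domain, endpoint, and parameter names are the placeholders from above, not a real API; using `.prepare()` lets us inspect the assembled URL without actually sending a request:

```python
import requests

# Build (but do not send) a GET request; the params dict becomes the
# ?key=...&param_1=...&param_2=... query string appended to the endpoint
req = requests.Request(
    "GET",
    "http://mywebsite.com/endpoint",        # API root + endpoint
    params={"key": "MY_KEY", "param_1": "a", "param_2": "b"},
).prepare()

print(req.url)  # http://mywebsite.com/endpoint?key=MY_KEY&param_1=a&param_2=b
```

In practice you would call `requests.get(url, params=...)` directly and read the response with `.json()`.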
When using an API, our work will often involve the following steps:
We will work with three different types of APIs today:
Trivia API: API that does not require authentication and does not have a wrapper
Yelp API: API that does require authentication and does not have a wrapper
YouTube API: API that does require authentication and does have a wrapper
Wrapper: a set of functions or methods (full library) that provide a simplified interface to interact with an underlying API.
Data science I: Foundations