PPOL 5203 - Data Science I: Foundations
Week 8: Parsing Unstructured Data: Scraping Static Websites
Final Project: https://tiagoventura.github.io/ppol5203/finalproject.html.
Take 2 min to read through.
You Project Proposal
Requirement | Due | Length | Percentage |
---|---|---|---|
Project Proposal | November 15 | 2 pages | 5% |
Presentation | December 10 | 10-15 minutes | 10% |
Project Report | December 17 | 10 pages | 25% |
Groups of three students. You pick your groups. Add your groups here: https://docs.google.com/spreadsheets/d/1tXCNbAV-vpCMA96OZ0yy_E1quHUhSPuwSrFZ6_SafDk/edit?gid=0#gid=0
You should have meet with me to discuss your proposal.
At lest one hour before our meeting, send me a draft of your proposal.
Send me an email with your group and when you are going to my office hours.
Our meeting should happen before November 12. You have three office hours to go!
Questions?
Where we are…
We started with the basics of being a data scientist
We moved over to the primitives of Python as your main DS tool
Then we started our journey working with tabular data.
Now we will learn how to collect/parse with unstructure data sources:
Scrapping websites (today)
Working with APIs and Dynamic Websites (next week)
Plans for Today
Scraping Static Websites
- Different strategies to acquire digital data
- Html data structure
- Scrape static websites
- Build a scraper programmatically
Acquiring Digital Data
Why? Digital Era
Two ways to access data from the web
Scraping: grab information available on websites (Today)
Leveraging the structure of a website to grab it’s contents
Using a programming environment (such as R, Python, Java, etc.) to systematically extract that content.
Accomplishing the above in an “unobtrusive” and legal way.
APIs: Data transfer portals built by companies to allow people to access data programmatically. (Next week)
- When available, we will always prefer APIs
Post-Api Era
Consider the following developments from the past year or so months:
Elon Musk has eliminated free access to Twitter’s API, and the only academically useful paid tiers far exceed most researchers’ budgets.
Musk has also demanded that Decahose users delete all Twitter data acquired under previous agreements–whether this demand will be extended to Academic API users is currently unknown.
Reddit has denied access to its API for Pushshift, a popular service used by researchers to collect Reddit data. Popular Reddit app Apollo is facing API charges of $1.7M per month to continue operating.
TikTok released a new API for researchers, which among other things requires them “to regularly refresh TikTok Research API Data at least every fifteen (15) days, and delete data that is not available from the TikTok Research API at the time of each refresh.”
Crowdtangle, Meta’s researcher tool for acquiring data from Facebook and Instagram, still exists as of this writing. But rumors of its imminent demise have been reported in multiple reputable outlets.
- This was last year. Crowdtangle has been closed this year.
What have I scraped?
Electoral data from many different countries;
Composition of elites around the world;
Wikipedia;
Toutiao, a news aggregation from China;
Political Manifestos in Brazil
Fact-Checking News
Facebook and Youtube Live Chats.
Property Prices from Zillow.
News in Latin American
What is a website?
A website in general is a combination of HTML, CSS, Javascript and PHP.
HTML provides the structure of the website.
CSS makes that structure visually appealing by controlling the design and layout.
JavaScript adds interactivity and makes the website dynamic on the client side.
PHP handles server-side tasks like generating dynamic HTML content or interacting with a database.
A simple example
<html>
<head>
<title> Michael Cohen's Email </title>
<script>
var foot = bar;
<script>
</head>
<body>
<div id="payments">
<h2>Second heading</h2>
<p class='slick'>information about <br/><i>payments</i></p>
<p>Just <a href="http://www.google.com">google it!</a></p>
</body>
</html>
We will care mostly about HTMLs and CSSs for static websites.