In today's class, we will focus on:
We start our first lecture by looking at this graph. It shows two things:
Not all of this data is available in digital spaces (like websites, social media apps, and digital archives), but much of it is. As data scientists, a primary skill expected from you is the ability to acquire, process, store, and analyze this data. Today, we will focus on acquiring data in the digital information era.
There are three primary techniques through which you can acquire digital data:
Scraping consists of automatically collecting data available on websites. In theory, you could collect website data by hand, or ask a couple of friends to help you. However, in a world of abundant data, this is rarely feasible, and it will feel unnecessary once you have learned how to collect the data automatically.
Let me give you some examples of websites I have already scraped:
Scraping can be summarized as:
- leveraging the structure of a website to grab its contents;
- using a programming environment (such as R, Python, or Java) to systematically extract that content;
- accomplishing the above in an "unobtrusive" and legal way.
An API is a set of rules and protocols that allows software applications to communicate with each other. APIs provide a front door for developers to interact with a website.
APIs are used for many different types of online communication and information sharing. Among them, many APIs have been developed to provide an easy and official way for developers and data scientists to access data.
Because these APIs are developed by the data owners, they are often more secure, practical, and organized than acquiring data through scraping.
Scraping is a back door for when there is no API, or when we need content beyond the structured fields the API returns. If you can use an API to access a dataset, that is where you will want to go.
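To make the contrast concrete, here is a minimal sketch of what pulling data from an API typically looks like with the requests library. The endpoint and parameters below are hypothetical placeholders, not a real service; the point is simply that an API returns structured data (usually JSON) instead of raw HTML.

import requests

# hypothetical API endpoint: replace with a real service and its documented parameters
url = "https://api.example.com/v1/posts"
params = {"query": "elections", "limit": 10}
response = requests.get(url, params=params)

# APIs typically return structured JSON rather than raw HTML
if response.status_code == 200:
    data = response.json()   # parsed into Python lists/dicts
    print(data)
else:
    print("Request failed with status:", response.status_code)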
Web scraping is legal as long as the scraped data is publicly available and the scraping activity does not harm the website being scraped. These are two hugely relevant conditions. For this reason, before we start coding, it is important to carefully understand what each of them entails.
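A practical first check related to both conditions is the website's robots.txt file, which states which parts of a site automated clients are allowed to fetch. Here is a minimal sketch using Python's standard library module urllib.robotparser (the CNN url is just an example):

from urllib import robotparser

# read the site's robots.txt, which lists what automated clients may fetch
rp = robotparser.RobotFileParser()
rp.set_url("https://www.cnn.com/robots.txt")
rp.read()

# can a generic crawler ("*") fetch the politics section?
print(rp.can_fetch("*", "https://www.cnn.com/politics"))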
Each call to a web server takes time, server cycles, and memory. Most servers can handle significant traffic, but they can't necessarily handle the strain induced by massive automated requests. Your code can overload the site, taking it offline, or causing the site administrator to ban your IP. See Denial-of-service attack (DoS).
We do not want to compromise the functioning of a website just because of our research. First, this overload can crash a server and prevent other users from accessing the site. Second, servers and hosts can, and do, implement countermeasures (e.g., blocking requests from our IP address).
In addition, take it as a best practice to only collect public information. Think about Facebook. In my personal view, it is okay to collect public posts or data from public groups. If you somehow manage to get into private groups, where members have an expectation of privacy, it is not okay to collect their data.
Here is a list of good practices for scraping (several of them are illustrated in the sketch below):
- Prefer an official API whenever one is available.
- Only collect information that is publicly available.
- Space out your requests (e.g., with time.sleep() and a random delay) so you do not overload the server.
- Save the HTML locally while developing your code, so you do not hit the website repeatedly.
- Wrap your requests in try/except blocks and keep a log of errors, so a single failure does not stop the whole job.
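As a concrete illustration, here is a minimal sketch of a "polite" request helper that applies several of these practices: it checks the status code, pauses for a random amount of time, and saves the page locally. The helper name and the delay values are illustrative choices, not a fixed convention.

import time
import random
import requests

def polite_get(url, min_wait=0.5, max_wait=3.0):
    """Request a page, check the connection, and pause before returning."""
    page = requests.get(url)
    # 200 means the connection worked; anything else deserves attention
    if page.status_code != 200:
        print(f"Warning: status {page.status_code} for {url}")
    # be kind to the server: pause for a random draw of time
    time.sleep(random.uniform(min_wait, max_wait))
    return page

# usage: download the page once and save it locally
page = polite_get("https://www.cnn.com/politics")
with open("cnn_politics.html", "wb") as f:
    f.write(page.content)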
Scraping often involves the following routine:
1. Find a website with the information you want to collect.
2. Understand the website: inspect its source code and locate the tags associated with the information you need.
3. Write code to collect one realization of the data.
4. Build a scraper: generalize your code into a function and apply it to multiple cases.
And repeat!
A website is, in general, a combination of HTML, CSS, XML, PHP, and JavaScript. We will care mostly about HTML and CSS.
HTML forms what we call static websites: everything you see is there in the source code behind the page. JavaScript produces dynamic websites: sites that you browse and click through without the URL changing, typically powered by a database working behind the scenes.
Today we will deal with static websites using the Python library Beautiful Soup. Next class, we will learn how to work with dynamic websites using Selenium in Python.
HTML stands for HyperText Markup Language. As is explicit from the name, it is a markup language used to create web pages and is a cornerstone technology of the internet. It is not a programming language like Python, R, or Java. Web browsers read HTML documents and render them into visible or audible web pages.
See an example of an html file:
<html>
<head>
<title> Michael Cohen's Email </title>
<script>
var foot = bar;
</script>
</head>
<body>
<div id="payments">
<h2>Second heading</h2>
<p class='slick'>information about <br/><i>payments</i></p>
<p>Just <a href="http://www.google.com">google it!</a></p>
</body>
</html>
HTML code is structured using tags, and information is organized hierarchically (nested from top to bottom, like a tree).
Some of the most important tags we will use for scraping are:
- <p>: paragraphs of text
- <a href="...">: links to other pages
- <div>: divisions or containers that structure the page
- <h1> ... <h6>: headings
- <span>: inline containers (often wrapping headlines or dates)
<div class="alert alert-block alert-danger" style="font-size: 20px;"> Scraping is all about finding tags and collecting the data associated with them </div>
The tags are the target. The information we need from HTML usually comes from the text and the attributes of a tag. Very often your work will consist of finding the tag and then capturing the information you need. The figure below summarizes this distinction in HTML files.
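To make this distinction concrete, here is a minimal sketch (using the Beautiful Soup library we introduce below) that prints the tag name, the text, and one attribute for a single link from the example HTML above.

from bs4 import BeautifulSoup

# one line taken from the example html above
html = '<p>Just <a href="http://www.google.com">google it!</a></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")        # the tag is the target
print(link.name)             # tag name: 'a'
print(link.get_text())       # text between the tags: 'google it!'
print(link["href"])          # attribute: 'http://www.google.com'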
As you can anticipate, a huge part of the scraping work is to understand your website and find the tags/information you are interested in. There are two ways to go about it:
### Inspect the website: press command + shift + i, or right-click on the element and select Inspect.
### Use the SelectorGadget extension, available here: https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb
To do webscraping, we will use two main libraries:
- requests: to make a get() request and access the HTML behind the pages
- BeautifulSoup: to parse the HTML

See this nice tutorial here to understand the difference between parsing HTML with BeautifulSoup and using text mining methods to scrape a website.
# install libraries - Take the # out if this is the first time you are installing these packages.
#!pip install requests
#!pip install beautifulsoup4
Let's scrape our first website. We will start by scraping some news from CNN.
# setup
import pandas as pd
import requests # For downloading the website
from bs4 import BeautifulSoup # For parsing the website
import time # To put the system to sleep
import random # for random numbers
# Get access to the website
url = "https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc"
page = requests.get(url)
# check object type - object class requests
type(page)
# check if you got a connection
page.status_code # 200 == Connection
# See the content.
# notice we downloaded the entire website.
# Do inspect to make sure of this in the web browser
page.content[0:1000]
After we make a request and retrieve a web page's content, we can store that content locally with Python's open() function. Saving the HTML source avoids hitting the website multiple times while you work on your code.
# save html locally
with open("cnn_news1", 'wb') as f:
f.write(page.content)
And here is how to open:
# open a locally saved html
with open("cnn_news1", 'rb') as f:
html = f.read()
# see it
print(html[0:1000])
Next, you will create a BeautifulSoup object. A BeautifulSoup object is just a parser: it allows us to easily access elements from the raw HTML.
# create a bs object.
# input 1: request content; input 2: tell you need an html parser
soup = BeautifulSoup(page.content, 'html.parser')
# Let's look at the raw code of the downloaded website
print(soup.prettify()[:1000])
With the parser, we can start looking at the data. The functions we will use the most are:
- .find_all(): to find tags by their names
- .select(): to select tags using a CSS selector
- .get_text(): to access the text inside a tag
- ["attr"]: to access the attributes of a tag

Let's start trying to grab all the textual information of the news. These are often under the tag <p>, for paragraph.
## find paragraph
cnn_par = soup.find_all('p')
cnn_par
## let's see how many we got
len(cnn_par)
## let's print one
cnn_par[3]
You see that you just parsed the full tag for all paragraphs of the text. Let's remove the HTML tags using the .get_text() method.
# get the text.
# This is what is in between the tags <p> TEXT </p>
cnn_par[0].get_text()
# use our friend list comprehension to parse all
all_par = [par.get_text() for par in cnn_par]
all_par
Notice, nevertheless, that you also collected some junk that is not the paragraph information you are looking for.
This happens because the tag <p> is used in multiple places (under different parent tags) across the page. For example, if you look at the last element of the all_par list, you will see that your scraper is collecting the footer of the webpage.
Be more specific. Work with a CSS selector.
A CSS selector is a pattern used to select and style one or more elements in an HTML document. It is a way to chain together tags, classes, and attributes of an HTML file.
Another way to do this is using XPath, which can be super useful to learn, but is a bit more complicated for beginners.
See this tutorial here to understand the concept of a css selector
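Before applying this to CNN, here is a minimal sketch of how .select() works with a few common selector patterns, using the small example HTML from earlier (the class and id names come from that example).

from bs4 import BeautifulSoup

html = """
<div id="payments">
  <h2>Second heading</h2>
  <p class='slick'>information about <br/><i>payments</i></p>
  <p>Just <a href="http://www.google.com">google it!</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p"))           # by tag name: every <p>
print(soup.select(".slick"))      # by class: elements with class="slick"
print(soup.select("#payments"))   # by id: the element with id="payments"
print(soup.select("div p a"))     # descendant chain: <a> inside a <p> inside a <div>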
Let's use the SelectorGadget tool (see tutorial) to get a CSS selector for all the paragraphs.
# open your webbrowser and use the selector gadget
# website: https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc
# Use a css selector to target specific content
cnn_par = soup.select(".vossi-paragraph")
cnn_par[-1]
story_content = [i.get_text() for i in cnn_par]
story_content
story_content[0].strip()
# Clean and join together with string methods
story_text = "\n".join([i.strip() for i in story_content])
print(story_text)
# title
css_loc = "#maincontent"
story_title = soup.select(css_loc)
story_title[0]
story_title = story_title[0].get_text()
print(story_title)
# story date
story_date = soup.select(".vossi-timestamp")[0].get_text()
print(story_date)
# story authors
story_author = soup.select(".byline__name")[0].get_text()
print(story_author)
# let's nest all in a list
entry = [url, story_title.strip(),story_date.strip(),story_text]
entry
Your task: Look at this website: https://www.latinnews.com/latinnews-country-database.html?country=2156
Do the following tasks:
# write your solution here
After you have your scraper working for a single case, you will generalize your work. As we learned before, we do this by creating a function (or a full class with different methods; see the sketch further below).
Notice that here it is also important to add, inside our functions, the good practices that try to imitate human behavior.
Let's start with a function to scrape news from CNN.
# Building a scraper
# The idea here is to just wrap the above in a function.
# Input: url
# Output: relevant content
def cnn_scraper(url=None):
    '''
    This function scrapes relevant content from a CNN story page.
    input: str, url of a CNN story
    output: dict with the url, title, date, and text (None if the connection fails)
    '''
    # Get access to the website
    page = requests.get(url)

    # check if you got a connection
    if page.status_code != 200:
        return None

    # create a bs object.
    # input 1: request content; input 2: tell you need an html parser
    soup = BeautifulSoup(page.content, 'html.parser')

    # parse text
    cnn_par = soup.select(".vossi-paragraph")
    story_content = [i.get_text() for i in cnn_par]
    story_text = "\n".join([i.strip() for i in story_content])

    # parse title
    story_title = soup.select("#maincontent")[0].get_text()

    # story date
    story_date = soup.select(".vossi-timestamp")[0].get_text()

    # story authors
    story_author = soup.select(".byline__name")[0].get_text()

    # nest all in a dictionary
    entry = {"url": url,
             "story_title": story_title.strip(),
             "story_date": story_date.strip(),
             "text": story_text}

    # return
    return entry
Let's see if our function works:
# Test on the same case
url = "https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc"
scrap_news = cnn_scraper(url=url)
scrap_news
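As mentioned above, you could also organize this work as a class with different methods instead of a standalone function. Here is a minimal sketch of what that could look like; the class and method names are illustrative choices, and the parsing simply reuses the cnn_scraper function defined above.

class CNNScraper:
    """Small illustrative wrapper around the scraping steps above."""

    def __init__(self, min_wait=0.5, max_wait=3.0):
        # how long to sleep between requests (be kind to the server)
        self.min_wait = min_wait
        self.max_wait = max_wait

    def scrape_one(self, url):
        """Scrape a single story, reusing cnn_scraper."""
        return cnn_scraper(url=url)

    def scrape_many(self, urls):
        """Scrape a list of stories, sleeping between requests."""
        entries = []
        for url in urls:
            entries.append(self.scrape_one(url))
            time.sleep(random.uniform(self.min_wait, self.max_wait))
        return entries

# usage
scraper = CNNScraper()
scraper.scrape_one("https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc")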
Let's now assume you actually have a list of urls. We will iterate our scraper through this list.
# create a list of urls
urls = ["https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc",
"https://www.cnn.com/2025/10/26/politics/jessica-tisch-zohran-mamdani-election",
"https://www.cnn.com/2025/10/24/politics/fact-check-ballroom-press-secretary",
"https://www.cnn.com/2025/09/26/politics/cnn-independents-poll-methodology"]
# Then just loop through and collect
scraped_data = []
for url in urls:
# Scrape the content
scraped_data.append(cnn_scraper(url))
# Put the system to sleep for a random draw of time (be kind)
time.sleep(random.uniform(.5,1))
print(url)
# Look at the data object
scraped_data[3]
# Organize as a pandas data frame
dat = pd.DataFrame(scraped_data)
dat.head()
This completes all the steps of scraping: we found a website, understood its structure, collected one realization of the data, generalized our code into a function, and applied it to multiple cases.
It is unlikely you will ever start with a complete list of the urls you want to scrape. Most likely, collecting the full list of sources will itself be a step in your scraping task. Remember, urls usually come embedded as tag attributes. So let's write a function to collect multiple urls from the CNN website, following all our pre-determined steps.
# Step 1: Find a website with information you want to collect
## let's get links on cnn politics
url = "https://www.cnn.com/politics"
url
# Step 2: Understand the website
# links are embedded across multiple titles.
# these titles can be targeted with the following CSS selector: .container__headline span
# Step 3: Write code to collect one realization of the data
# Get access to the website
page = requests.get(url)
# create a bs object.
soup = BeautifulSoup(page.content, 'html.parser') # input 1: request content; input 2: tell you need an html parser
# Step 3: Write code to collect one realization of the data
# with a css selector
links = soup.select(".container_lead-plus-headlines__item--type-section")
#links = soup.select(".container_lead-plus-headlines__headline")
links
links[0]
links[0].attrs["data-open-link"]
# grab links
links_from_cnn = []
# iterate
for link in links:
links_from_cnn.append(link["data-open-link"])
# print
print(links_from_cnn)
## another way to do this is by using href attributes of a tag
links_from_cnn = []
# Extract the href attribute of every <a> tag
for tag in soup.find_all("a"):
href = tag.attrs.get("href")
links_from_cnn.append(href)
# much more extensive set of links
print(links_from_cnn)
import re
## clean the output.
# Keep only stories whose path starts with "/" followed by four digits (the year)
# combine with the base url
links_from_cnn_reduced = ["https://www.cnn.com" + l for l in links_from_cnn if re.match(r'^/(\d{4})', str(l))]
links_from_cnn_reduced
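The same story link can appear more than once on a page (for example, in a headline and in a teaser), so you may also want to deduplicate the list while preserving its order. A minimal sketch:

# remove duplicate links while keeping their original order
links_from_cnn_unique = list(dict.fromkeys(links_from_cnn_reduced))
len(links_from_cnn_reduced), len(links_from_cnn_unique)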
## Step 4: Build a scraper -- generalize your code into a function.
# Let's write the above as a single function
def collect_links_cnn(url=None):
    """Collect story links from a CNN section page.
    Args:
        url (str): a valid CNN section page (e.g., https://www.cnn.com/politics).
    Returns:
        list: story urls found on the page.
    """
    # Get access to the website
    page = requests.get(url)

    # create a bs object.
    # input 1: request content; input 2: tell you need an html parser
    soup = BeautifulSoup(page.content, 'html.parser')

    # collect the href attribute of every <a> tag
    links_from_cnn = []
    for tag in soup.find_all("a"):
        href = tag.attrs.get("href")
        links_from_cnn.append(href)

    # clean the output:
    # keep only stories whose path starts with "/" followed by four digits (the year)
    # and combine them with the base url
    links_from_cnn_reduced = ["https://www.cnn.com" + l
                              for l in links_from_cnn
                              if re.match(r'^/(\d{4})', str(l))]

    return links_from_cnn_reduced
links_cnn = collect_links_cnn("https://www.cnn.com/politics")
links_cnn[:10]
With this list, you can apply your scraper function to multiple links:
len(links_cnn)
# let's get some cases
links_cnn_ = links_cnn[10:15]
# Then just loop through and collect
scraped_data = []
for url in links_cnn_:
# check what is going on
print(url)
# Scrape the content
scraped_data.append(cnn_scraper(url))
# Put the system to sleep for a random draw of time (be kind)
time.sleep(random.uniform(.5,3))
# save as pandas df
# Organize as a pandas data frame
dat = pd.DataFrame(scraped_data)
dat.head()
# add a url that will break our scraper (it is not a CNN story page)
links_cnn_ = links_cnn[10:15]
links_cnn_.append("https://www.latinnews.com/component/k2/item/107755.html?period=2025&archive=3&Itemid=6&cat_id=837700:in-brief-brazil-s-current-account-deficit-widens-in-september")
# run the loop in a safe setup that catches errors
# Then just loop through and collect
scraped_data = []
list_of_errors = []
for url in links_cnn_:
# check what is going on
print(url)
# Scrape the content
try:
scraped_data.append(cnn_scraper(url))
# Put the system to sleep for a random draw of time (be kind)
time.sleep(random.uniform(.5,3))
except Exception as e:
list_of_errors.append([url, e])
dat = pd.DataFrame(scraped_data)
dat.head()
list_of_errors
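Because list_of_errors stores both the url and the exception, you can inspect what went wrong and, if the failures look transient, retry those urls with the same try/except pattern. A minimal sketch:

# retry the urls that failed, reusing the same safe pattern
still_failing = []
for url, error in list_of_errors:
    print("retrying:", url, "| previous error:", error)
    try:
        scraped_data.append(cnn_scraper(url))
        time.sleep(random.uniform(.5, 3))
    except Exception as e:
        still_failing.append([url, e])
still_failing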
Now, return to https://www.latinnews.com/latinnews-country-database.html?country=2156
Do the following:
# setup
import pandas as pd
import requests # For downloading the website
from bs4 import BeautifulSoup # For parsing the website
import time # To put the system to sleep
import random # for random numbers
import re # regular expressions
!jupyter nbconvert _week-07_scraping_static.ipynb --to html --template classic