<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> 
<br>Scraping static websites<br>
Tiago Ventura </font></center></h1> 

---

# Learning Goals

In the class today, we will focus on:

- Understanding different strategies to acquire digital data
- Understanding html structure to look up content on a website
- Scraping content from a static website
- Building a scraper to systematically draw content from similarly organized webpages

# The Digital information age

We started our first lecture looking at this graph. It shows two things: 

- In the past few years we have produced and stored an enormous amount of data.
- Most of this data is produced and stored in digital environments. 

<div>
<img src="http://media3.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.jpg" width="60%"/>
</div>


Not all of this data is available in digital spaces (like websites, social media apps, and digital archives), but some of it is. As data scientists, a primary skill expected from you is the ability to acquire, process, store, and analyze this data. Today, we will focus on **acquiring data in the digital information era.** 

There are three primary techniques through which you can acquire digital data: 

- Scrape data from self-contained (static) websites
- Scrape data from dynamic (javascript powered) websites
- Access data through Application Programming Interfaces

## What is scraping? 

**Scraping** consists of automatically collecting data available on websites. In theory, you can collect website data by hand, or by asking a couple of friends to help you. However, in a world of abundant data, this is rarely feasible, and manual collection becomes even harder to justify once you have learned to collect data automatically.

Let me give you some **examples of websites** I have already scraped: 

- Electoral data from many different countries
- Composition of elites around the world
- Wikipedia
- Toutiao, a news aggregator from China
- Political manifestos in Brazil
- Fact-checking news
- Facebook and YouTube live chats
- Property prices from Zillow

Scraping can be summarized as: 

- leveraging the structure of a website to **grab its contents**

- using a programming environment (such as R, Python, Java, etc.) to **systematically extract** that content.

- accomplishing the above in an "unobtrusive" and **legal** way.



## Scraping vs APIs


An API is a set of rules and protocols that allows software applications to communicate with each other. APIs provide a front door for a developer to interact with a website. 

APIs are used for many different types of online communication and information sharing, among those, many **APIs have been developed to provide an easy and official way for developers and data scientists to access data**. 

As these APIs are developed by data owners, they are often secure, practical, and more organized than acquiring data through scraping. 

Scraping is a back door for when there’s no API or when we need content beyond the structured fields the API returns.

**If you can use an API to access a dataset, that is where you will want to go.**

## Ethical Challenges with Scraping

Web scraping is legal **as long as the scraped data is publicly available and the scraping activity does not harm the website being scraped**. These are two hugely relevant conditions. For this reason, before we start coding, it is important to carefully understand what each entails. 

Each call to a web server takes time, server cycles, and memory. Most servers can handle significant traffic, but they can't necessarily handle the strain induced by massive automated requests. Your code can overload the site, taking it offline, or causing the site administrator to ban your IP. See [Denial-of-service attack (DoS)](https://en.wikipedia.org/wiki/Denial-of-service_attack).

We do not want to compromise the functioning of a website just because of our research. First, this overload can crash a server and prevent other users from accessing the site. Second, servers and hosts can, and do, implement countermeasures (e.g., blocking access from our IP). 

In addition, make it a best practice to collect only public information. Think about Facebook. In my personal view, it is okay to collect public posts or data from public groups. If you somehow manage to get into private groups, and group members have an expectation of privacy, it is not okay to collect their data. 

Here is a list of good practices for scraping:

- Respect robots.txt
- Don't hit servers too often
- Slow down your code to the speed at which a human would browse manually
- Find trusted source sites
- Do not scrape during peak hours
- Improve your code speed
- Use data responsibly (as academics often do)
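
The first two practices in the list above can be checked in code. Below is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` urls, and the helper names `build_robot_rules` and `polite_delay` are made up for illustration; a real scraper would read the file published by the site it targets.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the sketch runs offline.
# A real scraper would read it from https://<site>/robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser object."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

def polite_delay(rules, user_agent="*", default=1.0):
    """Return the Crawl-delay declared by the site, or a default in seconds."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

rules = build_robot_rules(ROBOTS_TXT)

for page_url in ["https://example.com/news/story.html",
                 "https://example.com/private/members.html"]:
    if rules.can_fetch("*", page_url):
        print("allowed:", page_url)
    else:
        print("skipping (disallowed by robots.txt):", page_url)

print("seconds to wait between requests:", polite_delay(rules))
```

In a real run, the value returned by `polite_delay` would be passed to `time.sleep()` between requests.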

## Scraping Routine

Scraping often involves the following routine: 

- **Step 1:** Find a website with information you want to collect
- **Step 2:** Understand the website
- **Step 3:** Write code to collect one realization of the data
- **Step 4:** Build a scraper -- generalize your code into a function.

And repeat!

## Step 1: Find a Website... but what is a website? 

A website in general is a combination of **HTML, CSS, XML, PHP, and Javascript**. We will care mostly about HTML and CSS. 


### Static vs Dynamic Websites

HTML alone forms what we call **static websites**: everything you see in the browser is already in the page source. Javascript produces **dynamic websites**: pages where you can browse and click while the url does not change, and where the content is typically loaded from a database behind the scenes. 

Today we will deal with static websites using the Python library `Beautiful Soup`. For dynamic websites, we will learn next class about working with `selenium` in Python. 

### HTML Website

HTML stands for **HyperText Markup Language**. As the name makes explicit, it is a markup language used to create web pages and is a cornerstone technology of the internet. It is not a programming language like Python, R, or Java. Web browsers read HTML documents and render them into visible or audible web pages.

See an example of an html file: 


```
<html>
<head>
  <title> Michael Cohen's Email </title>
  <script>
    var foot = bar;
  </script>
</head>
<body>
  <div id="payments">
    <h2>Second heading</h2>
    <p class='slick'>information about <br/><i>payments</i></p>
    <p>Just <a href="http://www.google.com">google it!</a></p>
  </div>
</body>
</html>
```

HTML code is structured using tags, and information is organized hierarchically (like a nested list) from top to bottom. 

Some of the most important tags we will use for scraping are: 


- **p** – paragraphs
- **a href** – links
- **div** – divisions
- **h** – headings
- **table** – tables

See [here for more about html tags](https://betterprogramming.pub/understanding-html-basics-for-web-scraping-ae351ee0b3f9)

<div class="alert alert-block alert-danger" style="font-size: 20px;">
Scraping is all about finding tags and collecting the data associated with them
</div>

### What else exists on HTML beyond tags?

The tags are the target. The information we need from html usually comes from the text and the attributes of a tag. Very often your work will consist of finding the tag and then capturing the information you need. The figure below summarizes this difference well. 

<div>
<img src="https://static.semrush.com/blog/uploads/media/59/fc/59fc528eecc00e43b1a3ed5d9b9933ee/4YA3vCJ_Hw6DucoVZ40FbKFRppAReJVOkLKHcZlDkO-9geydLO6tw9uzFJFZf5nam3QcT7p0hRdpFyL2uPhoDISD8CPZwfPE5GTqgpH53q9M99QWgDVhjgQrCMOlQI9fA1T2dCxJ5T2goCV3k1wo-Jc.webp" width="60%"/>
</div>

Source:[https://www.semrush.com/blog/html-anchor/](https://www.semrush.com/blog/html-anchor/)
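
To make the distinction between tag, attribute, and text concrete, here is a small sketch that walks through one line of the example html and records each piece as it is encountered. It uses `html.parser.HTMLParser` from the Python standard library only as an illustration, not the `BeautifulSoup` workflow used in the rest of the lecture; the class name `AnatomyParser` is made up.

```python
from html.parser import HTMLParser

class AnatomyParser(HTMLParser):
    """Record each opening tag with its attributes, and each piece of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.events.append(("text", text))

snippet = '<p>Just <a href="http://www.google.com">google it!</a></p>'

parser = AnatomyParser()
parser.feed(snippet)
for event in parser.events:
    print(event)
```

The `<a>` tag is the target, `href` is its attribute, and `google it!` is its text.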

## Step 2: Understand the website

As you might anticipate, a huge part of the scraping work is to understand your website and find the tags/information you are interested in. There are two ways to go about it: 

- ### Inspect the website: press `command` + `option` + `i` (`ctrl` + `shift` + `i` on Windows/Linux), or right-click an element and choose Inspect. 

    - See [an example in practice](https://storage.googleapis.com/lds-media/documents/css_selector_vs_xpath.mp4) (*Source: Ultimate Guide to Web Scraping with Python by Brenda Marting*) 

<br>

- ### Use selector gadget: selector gadget is a tool that allows us to find CSS selectors to scrape websites. 
    - See the [documentation](https://selectorgadget.com/) and a [tutorial here](https://www.youtube.com/watch?v=YdIWI6K64zo).

## Break : Install selector gadget

Here: https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb

## Step 3: Collect a realization of the data

To do webscraping, we will use two main libraries: 

- `requests`: to make a `get()` request and access the html behind the pages
- `BeautifulSoup`: to parse the html


See this [nice tutorial here](https://realpython.com/python-web-scraping-practical-introduction/) to understand the difference between parsing html with `BeautifulSoup` and using text mining methods to scrape a website. 

In [1]:
# install libraries - Take the # out if this is the first time you are installing these packages. 
#!pip install requests
#!pip install beautifulsoup4

### Scraping: CNN Politics

Let's scrape our first website. We will start by scraping some news from CNN.

In [1]:
# setup
import pandas as pd
import requests # For downloading the website
from bs4 import BeautifulSoup # For parsing the website
import time # To put the system to sleep
import random # for random numbers

#### Get Request to collect html data

In [2]:
# Get access to the website
url = "https://www.cnn.com/2023/09/18/politics/iran-american-prisoners-us-return"
page = requests.get(url)

In [3]:
# check object type - object class requests
type(page)

requests.models.Response

In [5]:
# check if you got a connection
page.status_code # 200 == Connection

200
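
A status code of 200 means the request succeeded, but other codes are common when scraping. The sketch below uses the standard library's `http.HTTPStatus` to translate a code into words; the helper `describe_status` and its wording are illustrative assumptions. In practice, `requests` also offers `page.raise_for_status()`, which raises an error for 4xx and 5xx responses.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. "OK", "Not Found"
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {phrase}: {outcome}"

for code in (200, 404, 429, 503):
    print(describe_status(code))
```

A 429 ("Too Many Requests") is a signal that the scraper should slow down.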

In [6]:
# See the content. 
# notice we downloaded the entire page.
# Use inspect in the web browser to confirm this
page.content[0:1000]

b'  <!DOCTYPE html>\n<html lang="en" data-uri="cms.cnn.com/_pages/h_c8726845aa4216f1c265db0eed40b250@published" data-layout-uri="cms.cnn.com/_layouts/layout-with-rail/instances/politics-article-v1@published" >\n  <head><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}:root{--base-space-48:48px;--base-font-letter-spacing-100:1px;--base-size-36:36px;--base-color-transparent-black-20:#0c0c0c33;--base-font-text-transform-uppercase:uppercase;--base-font-text-decoration-none:none;--base-font-line-height-12:12px;--base-color-transparent-black-70:#0c0c0cb3;--base-font-line-height-14:14px;--base-font-letter-spacing-50:0.5;--base-color-transparent-black-10:#0c0c0c1a;--base-font-line-height-10:10px;--base-color-transparent-black-60:#0c0c0c99;--base-font-letter-spacing-1200:12px;--base-size-04:4px;--base-color-transparent-black-50:#0c0c0c80;--base-font-letter-spacing-150:1.5;--base-color-transparent-black-40:#0c0c0c66;--base-font-letter-spacing

#### Saving and Loading a HTML locally

After we make a request and retrieve a web page's content, we can store that content locally with Python's `open()` function. Saving the html source spares you from hitting the website multiple times. 

In [7]:
# save html locally
with open("cnn_news1", 'wb') as f:
    f.write(page.content)

And here is how to open:

In [8]:
# open a locally saved html
with open("cnn_news1", 'rb') as f:
    html = f.read()
# see it
print(html[0:1000])

b'  <!DOCTYPE html>\n<html lang="en" data-uri="cms.cnn.com/_pages/h_c8726845aa4216f1c265db0eed40b250@published" data-layout-uri="cms.cnn.com/_layouts/layout-with-rail/instances/politics-article-v1@published" >\n  <head><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}:root{--base-space-48:48px;--base-font-letter-spacing-100:1px;--base-size-36:36px;--base-color-transparent-black-20:#0c0c0c33;--base-font-text-transform-uppercase:uppercase;--base-font-text-decoration-none:none;--base-font-line-height-12:12px;--base-color-transparent-black-70:#0c0c0cb3;--base-font-line-height-14:14px;--base-font-letter-spacing-50:0.5;--base-color-transparent-black-10:#0c0c0c1a;--base-font-line-height-10:10px;--base-color-transparent-black-60:#0c0c0c99;--base-font-letter-spacing-1200:12px;--base-size-04:4px;--base-color-transparent-black-50:#0c0c0c80;--base-font-letter-spacing-150:1.5;--base-color-transparent-black-40:#0c0c0c66;--base-font-letter-spacing
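
The save and load steps above can be combined into a small "fetch once, then read from disk" helper. This is a minimal sketch: the name `load_or_fetch` is made up, and `fake_fetch` stands in for a real call such as `lambda: requests.get(url).content` so the example runs without touching any website.

```python
import tempfile
from pathlib import Path

def load_or_fetch(path, fetch):
    """Return the bytes saved at `path`; call `fetch()` only when the file is missing."""
    path = Path(path)
    if path.exists():
        return path.read_bytes()
    content = fetch()           # the only place where the website is contacted
    path.write_bytes(content)
    return content

# Stand-in for `lambda: requests.get(url).content`, so the sketch runs offline
calls = []
def fake_fetch():
    calls.append("hit")
    return b"<html><body>hello</body></html>"

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "cnn_news1"
    first = load_or_fetch(target, fake_fetch)    # file missing: fetches and saves
    second = load_or_fetch(target, fake_fetch)   # file present: reads from disk

print(len(calls), first == second)
```

The fetch function runs only once, even though the content is requested twice.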

### Here comes BeautifulSoup

Next, you will create a `BeautifulSoup` object. A `BeautifulSoup` object is just a parser. It allows us to easily access elements from the raw html.  

In [9]:
# create a bs object.
# input 1: request content; input 2: tell it you need an html parser
soup = BeautifulSoup(page.content, 'html.parser') 

# Let's look at the raw code of the downloaded website
print(soup.prettify())

<!DOCTYPE html>
<html data-layout-uri="cms.cnn.com/_layouts/layout-with-rail/instances/politics-article-v1@published" data-uri="cms.cnn.com/_pages/h_c8726845aa4216f1c265db0eed40b250@published" lang="en">
 <head>
  <style>
   body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}:root{--base-space-48:48px;--base-font-letter-spacing-100:1px;--base-size-36:36px;--base-color-transparent-black-20:#0c0c0c33;--base-font-text-transform-uppercase:uppercase;--base-font-text-decoration-none:none;--base-font-line-height-12:12px;--base-color-transparent-black-70:#0c0c0cb3;--base-font-line-height-14:14px;--base-font-letter-spacing-50:0.5;--base-color-transparent-black-10:#0c0c0c1a;--base-font-line-height-10:10px;--base-color-transparent-black-60:#0c0c0c99;--base-font-letter-spacing-1200:12px;--base-size-04:4px;--base-color-transparent-black-50:#0c0c0c80;--base-font-letter-spacing-150:1.5;--base-color-transparent-black-40:#0c0c0c66;--base-font-letter-spacing-

With the parser, we can start looking at the data. The functions we will use the most are: 

- `.find_all()`: to find tags by their names
- `.select()`: to select tags by using the CSS selector 
- `.get_text()`: to access the text in the tag
- `["attr"]`: to access attributes of a tag
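
Before applying these to the CNN page, here is a minimal sketch of the four tools on a tiny, made-up html string (the headline, class names, and link are invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny, made-up page so each result is easy to verify by eye
html = """
<div class="story">
  <h1 id="headline">A made-up headline</h1>
  <p class="body">First paragraph with a <a href="/more.html">link</a>.</p>
  <p class="body">Second paragraph.</p>
</div>
<p class="footer">Site footer</p>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")              # every <p> tag, found by name
body_p = soup.select("p.body")          # only <p> tags with class "body"
texts = [p.get_text() for p in body_p]  # the text between the tags
href = soup.find_all("a")[0]["href"]    # an attribute of a tag

print(len(all_p), len(body_p))
print(texts)
print(href)
```

Both `.find_all()` and `.select()` return a list of tags, which is why the results are indexed or looped over before calling `.get_text()`.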

Let's start by trying to grab all the textual information of the news. This is often under the tag `<p>`, for paragraph.

In [10]:
## find paragraph
cnn_par = soup.find_all('p')

In [11]:
cnn_par

[<p class="paragraph inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/paragraph_89A7A42D-49B5-80B7-E23F-A9426FD403FC@published">
 <a href="https://www.cnn.com/politics/live-news/iran-prisoner-release-americans-feed" target="_blank">The release on Monday</a> of the Americans who were wrongfully detained in Iran ends a years-long saga that included lengthy detentions in Tehran’s notorious Evin Prison, which is known for its long record of human rights abuses.
     </p>,
 <p class="paragraph inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/paragraph_521FE784-7A95-E43A-E2DA-A8D949B204BC@published">
 <a href="https://www.cnn.com/2023/09/18/politics/iran-us-prisoner-release-intl/index.html" target="_blank">Emad Shargi, Morad Tahbaz and Siamak Namazi</a> 

In [12]:
## let's see how many paragraphs we found
len(cnn_par)

17

In [16]:
## let's print one
cnn_par[3]

<p class="paragraph inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/paragraph_3A148553-E2C2-BE8C-0A9F-A8D9C137DF84@published">
            The 51-year-old was arrested when he was on a business trip to Iran in what the UN has described as an “arbitrary detention.” The Dubai-based businessman was charged with having “relations with a hostile state,” referring to the US. He was sentenced to 10 years in prison. 
    </p>

You can see that you just parsed the full tag for each paragraph of the text. Let's remove all html tags using the `.get_text()` method.

In [17]:
# get the text. 
# This is what is in between the tags <p> TEXT </p>
cnn_par[15].get_text()

'\nCNN’s Jennifer Hansler contributed to this report. \n'

In [18]:
# use our friend, the list comprehension, to parse all
all_par = [par.get_text() for par in cnn_par]
all_par

['\nThe release on Monday of the Americans who were wrongfully detained in Iran ends a years-long saga that included lengthy detentions in Tehran’s notorious Evin Prison, which is known for its long record of human rights abuses.\n    ',
 '\nEmad Shargi, Morad Tahbaz and Siamak Namazi are among the five Americans whose release is part of a deal that included the transfer of $6 billion in Iranian funds from South Korea to Qatar and the release of five Iranians in US custody. Two additional Americans in the deal have not yet been publicly identified.\n    ',
 '\n            Namazi, Iran’s longest-held Iranian-American prisoner, had been detained since 2015. \n    ',
 '\n            The 51-year-old was arrested when he was on a business trip to Iran in what the UN has described as an “arbitrary detention.” The Dubai-based businessman was charged with having “relations with a hostile state,” referring to the US. He was sentenced to 10 years in prison. \n    ',
 '\n            The Internati

You can see, nevertheless, that you collected some junk that is not the paragraph information you are looking for. 

This happens because the tag `<p>` is used in multiple places on the page, not only in the body of the story. For example, if you look at the last element of the `all_par` list, you will see your scraper is collecting the footer of the webpage. 

### Solution

**Be more specific. Work with a CSS selector.**

A CSS selector is a pattern used to select and style one or more elements in an HTML document. It lets you chain together multiple tags, classes, and attributes of an html file. 

Another way to do this is using XPath, which can be super useful to learn, but is a bit more complicated for beginners.

See this tutorial [here](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors) to understand the concept of a CSS selector.
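
As a quick reference, the sketch below runs a few common selector patterns against a small, made-up html string (all class names and ids are invented) and counts how many elements each one matches:

```python
from bs4 import BeautifulSoup

# Made-up page: a story inside <article>, plus a footer that also uses <p>
html = """
<article id="main">
  <h1 class="title">Made-up title</h1>
  <div class="content">
    <p class="par">Story paragraph one.</p>
    <p class="par">Story paragraph two.</p>
  </div>
</article>
<footer><p>Terms of use</p></footer>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = {
    "p": "by tag name (also matches the footer)",
    ".par": "by class",
    "#main": "by id",
    ".content p": "descendant: <p> inside class 'content'",
    "article > h1.title": "child combinator plus tag and class",
}
for css, meaning in selectors.items():
    matches = soup.select(css)
    print(f"{css!r:22} {len(matches)} match(es)  # {meaning}")
```

The plain `p` selector picks up the footer, while `.par` and `.content p` return only the story paragraphs.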

Let's use the selector gadget tool ([see tutorial](https://www.youtube.com/watch?v=YdIWI6K64zo)) to get a css selector for all the paragraphs. 

In [None]:
# open your webbrowser and use the selector gadget
# website: https://www.cnn.com/2023/09/18/politics/iran-american-prisoners-us-return

In [21]:
# Use a css selector to target specific content
cnn_par = soup.select(".vossi-paragraph")

In [22]:
cnn_par[-1]

<p class="paragraph inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/paragraph_A8A29554-2CA8-A4F9-688A-A97E58F9320B@published">
            Both Shargi and Tahbaz were convicted “on charges that international human rights organization stated were lacking evidence and were tried lacking fair trail guarantees,” according to the US State Department.
    </p>

In [23]:
story_content = [i.get_text() for i in cnn_par]
story_content 

['\nThe release on Monday of the Americans who were wrongfully detained in Iran ends a years-long saga that included lengthy detentions in Tehran’s notorious Evin Prison, which is known for its long record of human rights abuses.\n    ',
 '\nEmad Shargi, Morad Tahbaz and Siamak Namazi are among the five Americans whose release is part of a deal that included the transfer of $6 billion in Iranian funds from South Korea to Qatar and the release of five Iranians in US custody. Two additional Americans in the deal have not yet been publicly identified.\n    ',
 '\n            Namazi, Iran’s longest-held Iranian-American prisoner, had been detained since 2015. \n    ',
 '\n            The 51-year-old was arrested when he was on a business trip to Iran in what the UN has described as an “arbitrary detention.” The Dubai-based businessman was charged with having “relations with a hostile state,” referring to the US. He was sentenced to 10 years in prison. \n    ',
 '\n            The Internati

In [24]:
story_content[0].strip()

'The release on Monday of the Americans who were wrongfully detained in Iran ends a years-long saga that included lengthy detentions in Tehran’s notorious Evin Prison, which is known for its long record of human rights abuses.'

In [25]:
# Clean and join together with string methods
story_text = "\n".join([i.strip() for i in story_content])
print(story_text)

The release on Monday of the Americans who were wrongfully detained in Iran ends a years-long saga that included lengthy detentions in Tehran’s notorious Evin Prison, which is known for its long record of human rights abuses.
Emad Shargi, Morad Tahbaz and Siamak Namazi are among the five Americans whose release is part of a deal that included the transfer of $6 billion in Iranian funds from South Korea to Qatar and the release of five Iranians in US custody. Two additional Americans in the deal have not yet been publicly identified.
Namazi, Iran’s longest-held Iranian-American prisoner, had been detained since 2015.
The 51-year-old was arrested when he was on a business trip to Iran in what the UN has described as an “arbitrary detention.” The Dubai-based businessman was charged with having “relations with a hostile state,” referring to the US. He was sentenced to 10 years in prison.
The International Campaign for Human Rights in Iran said the country does not recognize dual citizenshi

### What else can we collect from this news? 

- title
- author
- date

Let's do it. 

In [26]:
# title
css_loc = "#maincontent"
story_title = soup.select(css_loc)
story_title[0]

<h1 class="headline__text inline-placeholder vossi-headline-text" data-editable="headlineText" id="maincontent">
      What we know about 3 of the Americans who were released from Iranian detention
    </h1>

In [27]:
story_title = story_title[0].get_text()
print(story_title)


      What we know about 3 of the Americans who were released from Iranian detention
    


In [28]:
# story date
story_date = soup.select(".vossi-timestamp")[0].get_text()
print(story_date)


  Updated
        6:02 PM EDT, Mon September 18, 2023
    


In [29]:
# story authors
story_author = soup.select(".byline__name")[0].get_text()
print(story_author)

Shawna Mizelle


In [30]:
# let's nest all in a list
entry = [url, story_title.strip(),story_date.strip(),story_text]
entry

['https://www.cnn.com/2023/09/18/politics/iran-american-prisoners-us-return',
 'What we know about 3 of the Americans who were released from Iranian detention',
 'Updated\n        6:02 PM EDT, Mon September 18, 2023',
 'The release on Monday of the Americans who were wrongfully detained in Iran ends a years-long saga that included lengthy detentions in Tehran’s notorious Evin Prison, which is known for its long record of human rights abuses.\nEmad Shargi, Morad Tahbaz and Siamak Namazi are among the five Americans whose release is part of a deal that included the transfer of $6 billion in Iranian funds from South Korea to Qatar and the release of five Iranians in US custody. Two additional Americans in the deal have not yet been publicly identified.\nNamazi, Iran’s longest-held Iranian-American prisoner, had been detained since 2015.\nThe 51-year-old was arrested when he was on a business trip to Iran in what the UN has described as an “arbitrary detention.” The Dubai-based businessman
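
The date in the entry above is still a raw string with a label, a line break, and extra spaces. A possible way to turn it into a proper `datetime` is sketched below. The function name `parse_cnn_timestamp` is made up, the regular expression assumes the exact layout shown in the output above, and parsing the month name and AM/PM assumes an English locale.

```python
import re
from datetime import datetime

def parse_cnn_timestamp(raw):
    """Split a scraped timestamp into (label, datetime, timezone abbreviation)."""
    # Collapse newlines and runs of spaces into single spaces
    cleaned = " ".join(raw.split())
    # Assumed layout: "<label> <h:mm AM/PM> <TZ>, <weekday> <Month d, yyyy>"
    pattern = r"^(\w+) (\d{1,2}:\d{2} [AP]M) (\w+), \w+ (\w+ \d{1,2}, \d{4})$"
    match = re.match(pattern, cleaned)
    if match is None:
        return None  # the layout differs from the one assumed here
    label, clock, tz, date = match.groups()
    when = datetime.strptime(f"{date} {clock}", "%B %d, %Y %I:%M %p")
    return label, when, tz

raw_date = "Updated\n        6:02 PM EDT, Mon September 18, 2023"
print(parse_cnn_timestamp(raw_date))
```

The timezone abbreviation is returned as plain text rather than attached to the `datetime`, since abbreviations like EDT are not parsed reliably by `strptime`.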

## Practice with Latin News Website

Your task: Look at this website: https://www.latinnews.com/latinnews-country-database.html?country=2156

Do the following tasks: 

- Select a single link for you to explore. 
- Write code to scrape the link. 
- Collect the following information:
    - text of the news
    - title of the news
    - url associated with the news
    - date of the post
- Return all of them as a pandas dataframe.     

In [34]:
# full solution
import re # regular expressions

# Get access to the website
url = "https://www.latinnews.com/component/k2/item/103448.html?period=2024&archive=3&Itemid=6&cat_id=835004:in-brief-vale-and-bhp-sign-compensation-deal-for-brazil-mining-disaster"
page = requests.get(url)
page.status_code # 200 == Connection

# parse
soup = BeautifulSoup(page.content, 'html.parser') 

# title: keep the text of the first match, not the list of tags
title = soup.select(".article-title-single")[0].get_text().strip()

# text
text_ = soup.select(".itemFullText")[0].get_text()

# clean the text: drop newlines, tabs and carriage returns
text_ = re.sub(r'[\n\t\r]', '', text_)

# date
date = soup.select("h1")[0].get_text().replace("LatinNews Daily - ", "")

# one row per article: wrap each value in a list
pd.DataFrame({"url":[url], 
              "title":[title],
              "text":[text_], 
              "date":[date]})

| | url | title | text | date |
|---|---|---|---|---|
| 0 | https://www.latinnews.com/component/k2/item/10... | In brief: Vale and BHP sign compensation deal... | *The Brazilian mining company Vale and Austral... | 28 October 2024 |


## Step 4: Build a scraper

After you have your scraper working for a single case, you will generalize your work. As we learned before, we do this by creating a function (or a full class with different methods). 

Note that it is also important to build the good practices that imitate human behavior into our functions. 

Let's start with a function to scrape news from CNN.

In [35]:
# Building a scraper
# The idea here is to just wrap the above in a function.
# Input: url
# Output: relevant content

def cnn_scraper(url):
    '''
    Scrape relevant content from a CNN article.
    input: str, url of a CNN article
    output: dict with url, title, date and text; None if the request fails
    '''

    # Get access to the website
    page = requests.get(url)

    # stop early if we did not get a connection
    if page.status_code != 200:
        return None

    # create a bs object.
    # input 1: request content; input 2: tell it you need an html parser
    soup = BeautifulSoup(page.content, 'html.parser')

    # parse text
    cnn_par = soup.select(".paragraph")
    story_content = [i.get_text() for i in cnn_par]
    story_text = "\n".join([i.strip() for i in story_content])

    # parse title
    css_loc = "#maincontent"
    story_title = soup.select(css_loc)[0].get_text()

    # story date
    story_date = soup.select(".timestamp")[0].get_text()

    # story authors (collected here, not stored in the entry)
    story_author = soup.select(".byline__name")[0].get_text()

    # let's nest all in a dictionary
    entry = {"url":url, "story_title":story_title.strip(),
              "story_date":story_date.strip(),"text":story_text}

    # return 
    return entry
   

Let's see if our function works:

In [36]:
# Test on a different article
url = "https://www.cnn.com/2023/09/20/politics/fact-check-house-judiciary-committee-merrick-garland-hunter-biden/?dicbo=v2-5NBpStm&hpt=ob_blogfooterold"
scrap_news = cnn_scraper(url=url)
scrap_news

{'url': 'https://www.cnn.com/2023/09/20/politics/fact-check-house-judiciary-committee-merrick-garland-hunter-biden/?dicbo=v2-5NBpStm&hpt=ob_blogfooterold',
 'story_title': 'Fact check: Jim Jordan makes false claims about Trump, Hunter Biden to begin hearing on handling of the federal cases against them',
 'story_date': 'Updated\n        11:44 PM EDT, Wed September 20, 2023',
 'text': 'House Judiciary Committee chairman Rep. Jim Jordan made false claims in his opening remarks at a Wednesday hearing at which Jordan and other Republicans pressed Attorney General Merrick Garland about the Justice Department’s handling of investigations into former President Donald Trump and President Joe Biden’s son Hunter Biden.\nHere is a fact check of two inaccurate remarks from Jordan, plus one from Rep. Thomas Massie and another from Rep. Chip Roy.\nCriticizing the FBI search of Trump’s home in Florida in August 2022, Jordan, a Republican from Ohio, falsely claimed in his opening statement at Wednesda
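
One fragile point in the function above is the `[0]` indexing: if a page has no byline or no timestamp, `soup.select(...)[0]` raises an `IndexError` and the whole scrape stops. A possible guard is sketched below with a hypothetical helper, `first_text`, built on BeautifulSoup's `select_one()`, which returns `None` when nothing matches. The html string is made up for the example.

```python
from bs4 import BeautifulSoup

def first_text(soup, css, default=None):
    """Text of the first element matching `css`, or `default` if nothing matches."""
    tag = soup.select_one(css)
    return tag.get_text().strip() if tag is not None else default

# Made-up page with a title but no byline
html = '<h1 id="maincontent"> A made-up title </h1>'
soup = BeautifulSoup(html, "html.parser")

print(first_text(soup, "#maincontent"))
print(first_text(soup, ".byline__name", default="unknown"))
```

With a helper like this, a missing field becomes a missing value in the final data frame instead of a crash.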

### beautiful!

Let's now assume you actually have a list of urls. We will iterate our scraper through this list.



In [37]:
# create a list of urls
urls = ["https://www.cnn.com/2023/09/20/politics/senate-republicans-dress-code-letter/index.html", 
       "https://www.cnn.com/2023/09/20/politics/student-loan-payment-restart/index.html", 
       "https://www.cnn.com/2023/09/19/politics/un-speech-biden-what-matters/index.html", 
       "https://www.cnn.com/2023/09/18/politics/fact-check-trump-raffensperger-phone-call-didnt-do-wrong/index.html"]

In [38]:
# Then just loop through and collect
scraped_data = []

for url in urls:

    # Scrape the content
    scraped_data.append(cnn_scraper(url))

    # Put the system to sleep for a random draw of time (be kind)
    time.sleep(random.uniform(.5,1))
    
    print(url)

https://www.cnn.com/2023/09/20/politics/senate-republicans-dress-code-letter/index.html
https://www.cnn.com/2023/09/20/politics/student-loan-payment-restart/index.html
https://www.cnn.com/2023/09/19/politics/un-speech-biden-what-matters/index.html
https://www.cnn.com/2023/09/18/politics/fact-check-trump-raffensperger-phone-call-didnt-do-wrong/index.html
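
With longer lists of urls, one broken page should not stop the whole run. The sketch below wraps the same loop in a hypothetical helper, `scrape_all`, that records failures and keeps going. The `fake_scraper` and the `example.com` urls are made up so the example runs offline; in the lecture's setting, `cnn_scraper` would take its place.

```python
import random
import time

def scrape_all(urls, scraper, min_wait=0.5, max_wait=1.0, sleep=time.sleep):
    """Apply `scraper` to each url, pausing between calls and recording failures."""
    results, failures = [], []
    for url in urls:
        try:
            entry = scraper(url)
            if entry is None:
                failures.append((url, "no content returned"))
            else:
                results.append(entry)
        except Exception as err:  # broad on purpose: one bad page should not stop the run
            failures.append((url, repr(err)))
        # Put the system to sleep for a random draw of time (be kind)
        sleep(random.uniform(min_wait, max_wait))
    return results, failures

# Made-up scraper that fails on one url, standing in for cnn_scraper
def fake_scraper(url):
    if "broken" in url:
        raise ValueError("selector not found")
    return {"url": url}

results, failures = scrape_all(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_scraper,
    sleep=lambda seconds: None,   # skip the real pause in this demo
)
print(len(results), len(failures))
```

The `failures` list can be inspected afterwards to decide which urls to retry.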


In [42]:
# Look at the data object
scraped_data[3]

{'url': 'https://www.cnn.com/2023/09/18/politics/fact-check-trump-raffensperger-phone-call-didnt-do-wrong/index.html',
 'story_title': 'Fact check: Trump falsely claims Raffensperger said former president ‘didn’t do anything wrong’ on their 2021 phone call',
 'story_date': 'Updated\n        12:31 PM EDT, Tue September 19, 2023',
 'text': 'Georgia Secretary of State Brad Raffensperger has long been a pointed critic of former President Donald Trump’s conduct on a January 2, 2021, phone call in which Trump told numerous lies about supposed election fraud and pressured Raffensperger to somehow “find” enough votes to overturn his defeat in Georgia in the 2020 election.\nIn an interview that aired Sunday on NBC’s “Meet the Press,” though, Trump claimed Raffensperger had recently declared Trump’s conduct on the call perfectly acceptable.\n“That was a phone call made in front of, I guess seven or eight lawyers. Brad Raffensperger, the head – who, by the way, last week said I didn’t do anything

In [43]:
# Organize as a pandas data frame
dat = pd.DataFrame(scraped_data)
dat.head()

| | url | story_title | story_date | text |
|---|---|---|---|---|
| 0 | https://www.cnn.com/2023/09/20/politics/senate... | Senate Republicans urge Schumer to enforce mor... | Updated\n 1:31 PM EDT, Wed September 20... | Nearly every Senate Republican signed a letter... |
| 1 | https://www.cnn.com/2023/09/20/politics/studen... | Are you ready to start repaying your student l... | Published\n 1:23 PM EDT, Wed September ... | Roughly 28 million borrowers will soon be requ... |
| 2 | https://www.cnn.com/2023/09/19/politics/un-spe... | Biden acknowledges the old world order needs a... | Published\n 5:55 PM EDT, Tue September ... | President Joe Biden addressed the United Natio... |
| 3 | https://www.cnn.com/2023/09/18/politics/fact-c... | Fact check: Trump falsely claims Raffensperger... | Updated\n 12:31 PM EDT, Tue September 1... | Georgia Secretary of State Brad Raffensperger ... |


This completes all the steps of scraping: 

- Step 1: Find a website with information you want to collect
- Step 2: Understand the website
- Step 3: Write code to collect one realization of the data
- Step 4: Build a scraper -- generalize your code into a function.
- Step 5: Save your data
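
The last step, saving, can be as simple as writing the list of dictionaries to a CSV file. Below is a minimal sketch using the standard `csv` module; the function names, the file name `scraped_news.csv`, and the example entry are made up. With the pandas data frame built earlier, `dat.to_csv("cnn_news.csv", index=False)` would be a one-line alternative.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["url", "story_title", "story_date", "text"]

def save_entries(entries, path):
    """Write a list of dictionaries (one per article) to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(entries)

def load_entries(path):
    """Read the CSV file back into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Made-up entry with the same keys the scraper returns
entries = [{"url": "https://example.com/story",
            "story_title": "A made-up title",
            "story_date": "Mon September 18, 2023",
            "text": "First line.\nSecond line."}]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "scraped_news.csv"
    save_entries(entries, path)
    restored = load_entries(path)

print(restored == entries)
```

Opening the file with `newline=""` lets the `csv` module handle the line breaks inside the article text correctly.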


### Collecting multiple urls

It is unlikely you will ever have a complete list of the urls you want to scrape. Most likely, collecting the full list of sources will be a step in your scraping task. Remember, urls usually come embedded as tag attributes. So let's write a function to collect multiple urls from the CNN website, following all our pre-determined steps.
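
As a preview of the idea, the sketch below pulls the `href` attribute out of a few `<a>` tags in a small, made-up html string and turns relative paths into full urls with `urllib.parse.urljoin`. The story paths and link text are invented; the class name only mimics the kind of markup found on a news index page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.cnn.com/politics"

# Made-up markup: one relative link, one absolute link, one tag without href
html = """
<div class="card">
  <a class="container__link" href="/2024/10/29/politics/a-made-up-story/index.html">Story A</a>
  <a class="container__link" href="https://www.cnn.com/2024/10/28/politics/another-story/index.html">Story B</a>
  <a class="container__link">No href here</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.select("a.container__link"):
    href = a.get("href")                       # .get returns None instead of raising KeyError
    if href:
        links.append(urljoin(base_url, href))  # relative paths become absolute urls

links = list(dict.fromkeys(links))             # drop duplicates, keep order
print(links)
```

Using `a.get("href")` instead of `a["href"]` avoids an error on tags that have no link.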

In [49]:
# Step 1: Find a website with information you want to collect
## let's get links on cnn politics
url = "https://www.cnn.com/politics"

In [48]:
url

'https://www.cnn.com/politics'

In [3]:
# Step 2: Understand the website
# links are embedded across multiple titles. 
# these titles match the following css selector: .container__headline span

In [50]:
# Step 3: Write code to collect one realization of the data
# Get access to the website
page = requests.get(url)
# create a bs object.
soup = BeautifulSoup(page.content, 'html.parser') # input 1: request content; input 2: tell you need an html parser

In [61]:
# Step 3: Write code to collect one realization of the data
# with a css selector
links = soup.select(".container_lead-plus-headlines__item--type-section")
#links = soup.select(".container_lead-plus-headlines__headline")
links

[<div class="card container__item container__item--type-media-image container__item--type-section container_lead-plus-headlines__item container_lead-plus-headlines__item--type-section container_lead-plus-headlines__selected" data-component-name="card" data-created-updated-by="true" data-open-link="/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html" data-page="cms.cnn.com/_pages/cm2tgzuzo00002cntbjjtez2a@published" data-unselectable="true" data-uri="cms.cnn.com/_components/card/instances/clbdmol44002m3d6euos0g5zg_fill_1@published" data-word-count="2067">
 <a class="container__link container__link--type-article container_lead-plus-headlines__link" data-link-type="article" href="/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html">
 <div class="container__item-media-wrapper container_lead-plus-headlines__item-media-wrapper" data-breakpoints='{"card--media-large": 525, "card--media-extra-large": 660, "card--media-card-label-show": 200}'>
 <div class

In [64]:
links[0]

<div class="card container__item container__item--type-media-image container__item--type-section container_lead-plus-headlines__item container_lead-plus-headlines__item--type-section container_lead-plus-headlines__selected" data-component-name="card" data-created-updated-by="true" data-open-link="/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html" data-page="cms.cnn.com/_pages/cm2tgzuzo00002cntbjjtez2a@published" data-unselectable="true" data-uri="cms.cnn.com/_components/card/instances/clbdmol44002m3d6euos0g5zg_fill_1@published" data-word-count="2067">
<a class="container__link container__link--type-article container_lead-plus-headlines__link" data-link-type="article" href="/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html">
<div class="container__item-media-wrapper container_lead-plus-headlines__item-media-wrapper" data-breakpoints='{"card--media-large": 525, "card--media-extra-large": 660, "card--media-card-label-show": 200}'>
<div class="co

In [65]:
links[0].attrs["data-open-link"]

'/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html'
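
The card printed above carries the link twice: once as the `data-open-link` attribute of the `div`, and once as the `href` of the nested `a` tag. Here is a small sketch of both routes on a hand-written HTML snippet. The snippet and the class names `card` and `card__link` are simplified stand-ins, not the real CNN markup.

```python
from bs4 import BeautifulSoup

# Hand-written snippet that mimics the structure of a card (not real CNN markup)
html = """
<div class="card" data-open-link="/2024/10/29/politics/story-one/index.html">
  <a class="card__link" href="/2024/10/29/politics/story-one/index.html">Story one</a>
</div>
<div class="card" data-open-link="/2024/10/28/politics/story-two/index.html">
  <a class="card__link" href="/2024/10/28/politics/story-two/index.html">Story two</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
cards = soup.select(".card")

# Option 1: attribute on the card itself; .get() returns None if it is missing
from_card = [card.get("data-open-link") for card in cards]

# Option 2: href of the first <a> nested inside each card
from_anchor = [card.select_one("a").get("href") for card in cards]

print(from_card == from_anchor)
```

Using `.get()` instead of square brackets avoids a `KeyError` when a tag lacks the attribute, which is common on real pages.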

In [66]:
# grab links
links_from_cnn = []

# iterate
for link in links:
    links_from_cnn.append(link["data-open-link"])
    
# print
print(links_from_cnn)

['/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html', '/2024/10/29/politics/joe-biden-campaign-trail/index.html', '/2024/10/29/politics/early-voting-turnout/index.html', '/2024/10/28/politics/russia-china-cuba-hurricane-misinformation/index.html', '/2024/10/29/politics/kamala-harris-ellipse-speech/index.html', '/2024/10/13/politics/donald-trump-tariffs/index.html', '/2024/10/28/politics/bernie-sanders-kamala-harris-israel-gaza/index.html', '/2024/10/29/politics/steve-bannon-released-prison/index.html', '/2024/10/29/politics/2024-election-explained-what-matters/index.html', '/2024/10/28/politics/hispanic-voters-trump-election-rally/index.html', '/2024/10/28/politics/trump-extreme-closing-argument/index.html', '/2024/10/27/politics/red-mirage-blue-shift-what-matters/index.html', '/2024/10/19/politics/election-questions-answered-what-matters/index.html', '/2024/10/25/politics/obama-harris-springsteen-election-analysis/index.html', '/2024/10/24/politics/fascism-trump-

In [67]:
## another way to do this is by using href attributes of a tag
links_from_cnn = []

# Collect the href of every <a> tag (None when the tag has no href)
for tag in soup.find_all("a"):
    href = tag.attrs.get("href")
    links_from_cnn.append(href)

# much more extensive set of links
print(links_from_cnn)

['https://www.cnn.com', 'https://www.cnn.com/politics', 'https://www.cnn.com/politics/supreme-court', 'https://www.cnn.com/politics/congress', 'https://www.cnn.com/politics/fact-check', 'https://www.cnn.com/election/2024', None, 'https://www.cnn.com/politics/supreme-court', 'https://www.cnn.com/politics/congress', 'https://www.cnn.com/politics/fact-check', 'https://www.cnn.com/election/2024', 'https://www.cnn.com/video', 'https://www.cnn.com/audio', 'https://www.cnn.com/live-tv', '/account/settings', '/follow?iid=fw_var-nav', '#', '#', '/account/settings', '/follow?iid=fw_var-nav', '#', '#', 'https://www.cnn.com/live-tv', 'https://www.cnn.com/audio', 'https://www.cnn.com/video', 'https://us.cnn.com?hpt=header_edition-picker', 'https://edition.cnn.com?hpt=header_edition-picker', 'https://arabic.cnn.com?hpt=header_edition-picker', 'https://cnnespanol.cnn.com/?hpt=header_edition-picker', 'https://us.cnn.com?hpt=header_edition-picker', 'https://edition.cnn.com?hpt=header_edition-picker', '

In [70]:
import re
## clean the output. 
# Keep only links that start with "/" followed by four digits (the year)
# and combine them with the base url
links_from_cnn_reduced = ["https://www.cnn.com" + l for l in links_from_cnn if re.match(r'^/(\d{4})', str(l))]
links_from_cnn_reduced

['https://www.cnn.com/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html',
 'https://www.cnn.com/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html',
 'https://www.cnn.com/2024/10/29/politics/joe-biden-campaign-trail/index.html',
 'https://www.cnn.com/2024/10/29/politics/early-voting-turnout/index.html',
 'https://www.cnn.com/2024/10/28/politics/russia-china-cuba-hurricane-misinformation/index.html',
 'https://www.cnn.com/2024/10/29/politics/kamala-harris-ellipse-speech/index.html',
 'https://www.cnn.com/2024/10/13/politics/donald-trump-tariffs/index.html',
 'https://www.cnn.com/2024/10/28/politics/bernie-sanders-kamala-harris-israel-gaza/index.html',
 'https://www.cnn.com/2024/10/29/politics/steve-bannon-released-prison/index.html',
 'https://www.cnn.com/2024/10/29/politics/2024-election-explained-what-matters/index.html',
 'https://www.cnn.com/2024/10/29/politics/2024-election-explained-what-matters/index.html',
 'https://www.cnn.com/2024/10/28
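
String concatenation works here because every link we keep is a relative path. As an alternative sketch, `urljoin` from the standard library resolves relative paths against a base and leaves absolute URLs unchanged, so the same helper works for both kinds of `href`. The `hrefs` list below is a hand-picked sample of values like the ones printed earlier, and `to_absolute` is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.cnn.com"

# A hand-picked sample of href values like the ones printed earlier
hrefs = [
    "/2024/10/29/politics/early-voting-turnout/index.html",  # relative story link
    "https://www.cnn.com/politics/congress",                  # already absolute
    "/account/settings",                                      # relative, not a story
    None,                                                     # <a> tag without href
]

def to_absolute(href, base=BASE_URL):
    """Resolve a relative href against the base; absolute URLs pass through unchanged."""
    return urljoin(base, href)

# Story links only: "/" followed by a four-digit year
story_links = [to_absolute(h) for h in hrefs if h and re.match(r"^/\d{4}/", h)]

# Every usable link, relative or absolute
all_links = [to_absolute(h) for h in hrefs if h]

print(story_links)
```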

In [71]:
## Step 4: Build a scraper -- generalize your code into a function.

# Let's write the above as a single function
def collect_links_cnn(url=None):
    """Collect story links from a CNN page.

    Args:
        url (str): URL of a valid CNN page from which to collect links.
    Returns:
        list: absolute URLs of the stories linked on the page
    """
    
    # Get access to the website
    page = requests.get(url)
    # create a bs object.
    soup = BeautifulSoup(page.content, 'html.parser') # input 1: request content; input 2: tell you need an html parser

    # collect the href attribute of every <a> tag
    links_from_cnn = []
    for tag in soup.find_all("a"):
        href = tag.attrs.get("href")
        links_from_cnn.append(href)
        
    ## clean the output. 
    # Keep only links that start with "/" followed by four digits (the year)
    # and combine them with the base url
    links_from_cnn_reduced = ["https://www.cnn.com" + l for l in links_from_cnn if re.match(r'^/(\d{4})', str(l))]

    return links_from_cnn_reduced

In [72]:
links_cnn = collect_links_cnn("https://www.cnn.com/politics")
links_cnn[:10]

['https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html',
 'https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html',
 'https://www.cnn.com/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html',
 'https://www.cnn.com/2024/10/29/politics/joe-biden-campaign-trail/index.html',
 'https://www.cnn.com/2024/10/29/politics/early-voting-turnout/index.html',
 'https://www.cnn.com/2024/10/28/politics/russia-china-cuba-hurricane-misinformation/index.html',
 'https://www.cnn.com/2024/10/29/politics/kamala-harris-ellipse-speech/index.html',
 'https://www.cnn.com/2024/10/13/politics/donald-trump-tariffs/index.html',
 'https://www.cnn.com/2024/10/28/politics/bernie-sanders-kamala-harris-israel-gaza/index.html',
 'https://www.cnn.com/2024/10/29/politics/2024-election-explained-what-matters/index.html']
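
Notice that the same story can appear more than once, because a page often links to a story from both its image and its headline. A minimal sketch for dropping duplicates while keeping the original order is shown below. The `links` list is a shortened copy of the output above.

```python
# Shortened copy of the list returned by collect_links_cnn()
links = [
    "https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html",
    "https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html",
    "https://www.cnn.com/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html",
    "https://www.cnn.com/2024/10/29/politics/joe-biden-campaign-trail/index.html",
]

# dict keys are unique and keep insertion order, so this removes repeats
# without shuffling the links the way set() could
unique_links = list(dict.fromkeys(links))

print(len(links), len(unique_links))
```

Deduplicating before you scrape saves requests and keeps repeated rows out of your data frame.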

With this list, you can apply your scraper function to multiple links:

In [73]:
len(links_cnn)

66

In [74]:
# let's get the first 9
links_cnn_ = links_cnn[:9]

# Then just loop through and collect
scraped_data = []

for url in links_cnn_:

    # check what is going on
    print(url)
    
    # Scrape the content
    scraped_data.append(cnn_scraper(url))

    # Put the system to sleep for a random draw of time (be kind)
    time.sleep(random.uniform(.5,3))

# Organize as a pandas data frame
dat = pd.DataFrame(scraped_data)
dat.head()

https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html
https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html
https://www.cnn.com/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html
https://www.cnn.com/2024/10/29/politics/joe-biden-campaign-trail/index.html
https://www.cnn.com/2024/10/29/politics/early-voting-turnout/index.html
https://www.cnn.com/2024/10/28/politics/russia-china-cuba-hurricane-misinformation/index.html
https://www.cnn.com/2024/10/29/politics/kamala-harris-ellipse-speech/index.html
https://www.cnn.com/2024/10/13/politics/donald-trump-tariffs/index.html
https://www.cnn.com/2024/10/28/politics/bernie-sanders-kamala-harris-israel-gaza/index.html


Unnamed: 0,url,story_title,story_date,text
0,https://www.cnn.com/2024/10/29/politics/fact-c...,Fact check: Four deceptive quotes in Trump’s w...,"Published\n 11:18 AM EDT, Tue October 2...","On Friday, we published an article about how f..."
1,https://www.cnn.com/2024/10/29/politics/fact-c...,Fact check: Four deceptive quotes in Trump’s w...,"Published\n 11:18 AM EDT, Tue October 2...","On Friday, we published an article about how f..."
2,https://www.cnn.com/2024/10/29/politics/noncit...,How voter purge disputes have fueled the GOP ‘...,"Updated\n 11:00 AM EDT, Tue October 29,...","The letter that Jona Hilario, a mother of two ..."
3,https://www.cnn.com/2024/10/29/politics/joe-bi...,Biden comes to grips with a diminished role on...,"Published\n 10:45 AM EDT, Tue October 2...",President Joe Biden’s role in the 2024 preside...
4,https://www.cnn.com/2024/10/29/politics/early-...,"One week from Election Day, early voters look ...","Updated\n 11:24 AM EDT, Tue October 29,...","With one week until Election Day, more than 43..."


In [75]:
# add a url that the CNN scraper cannot parse, to force an error
links_cnn_.append("https://www.latinnews.com/latinnews-country-database.html?country=2156")

# run the loop inside a try/except, so one failure does not stop the collection
# failed urls are stored with their error for later inspection
scraped_data = []
list_of_errors = []

for url in links_cnn_:

    # check what is going on
    print(url)
    
    # Scrape the content
    try: 
        scraped_data.append(cnn_scraper(url))

        # Put the system to sleep for a random draw of time (be kind)
        time.sleep(random.uniform(.5,3))
    except Exception as e:
        list_of_errors.append([url, e])

https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html
https://www.cnn.com/2024/10/29/politics/fact-check-donald-trump-television-ad/index.html
https://www.cnn.com/2024/10/29/politics/noncitizen-voting-narrative-republicans/index.html
https://www.cnn.com/2024/10/29/politics/joe-biden-campaign-trail/index.html
https://www.cnn.com/2024/10/29/politics/early-voting-turnout/index.html
https://www.cnn.com/2024/10/28/politics/russia-china-cuba-hurricane-misinformation/index.html
https://www.cnn.com/2024/10/29/politics/kamala-harris-ellipse-speech/index.html
https://www.cnn.com/2024/10/13/politics/donald-trump-tariffs/index.html
https://www.cnn.com/2024/10/28/politics/bernie-sanders-kamala-harris-israel-gaza/index.html
https://www.latinnews.com/latinnews-country-database.html?country=2156


In [76]:
dat = pd.DataFrame(scraped_data)
dat.head()

Unnamed: 0,url,story_title,story_date,text
0,https://www.cnn.com/2024/10/29/politics/fact-c...,Fact check: Four deceptive quotes in Trump’s w...,"Published\n 11:18 AM EDT, Tue October 2...","On Friday, we published an article about how f..."
1,https://www.cnn.com/2024/10/29/politics/fact-c...,Fact check: Four deceptive quotes in Trump’s w...,"Published\n 11:18 AM EDT, Tue October 2...","On Friday, we published an article about how f..."
2,https://www.cnn.com/2024/10/29/politics/noncit...,How voter purge disputes have fueled the GOP ‘...,"Updated\n 11:00 AM EDT, Tue October 29,...","The letter that Jona Hilario, a mother of two ..."
3,https://www.cnn.com/2024/10/29/politics/joe-bi...,Biden comes to grips with a diminished role on...,"Published\n 10:45 AM EDT, Tue October 2...",President Joe Biden’s role in the 2024 preside...
4,https://www.cnn.com/2024/10/29/politics/early-...,"One week from Election Day, early voters look ...","Updated\n 11:24 AM EDT, Tue October 29,...","With one week until Election Day, more than 43..."


In [77]:
list_of_errors

[['https://www.latinnews.com/latinnews-country-database.html?country=2156',
  IndexError('list index out of range')]]
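
The loop above can be wrapped in a reusable helper. The sketch below is one way to do it, not the only one: `scrape_many` and `fake_scraper` are hypothetical names, and `fake_scraper` is a stand-in that makes no web request, so the example runs offline. In your own code you would pass `cnn_scraper` instead. Unlike the loop above, this version also pauses after a failed request.

```python
import random
import time

def scrape_many(urls, scraper, pause=(0.5, 3.0)):
    """Apply `scraper` to each url, collecting results and failures separately."""
    results, errors = [], []
    for url in urls:
        try:
            results.append(scraper(url))
        except Exception as e:
            # keep the url and a readable message so failures can be retried later
            errors.append({"url": url, "error": repr(e)})
        finally:
            # be kind: pause after every request, including failed ones
            time.sleep(random.uniform(*pause))
    return results, errors

# Stand-in scraper for illustration only: fails on any non-CNN url
def fake_scraper(url):
    if "cnn.com" not in url:
        raise IndexError("list index out of range")
    return {"url": url, "story_title": "placeholder"}

urls = [
    "https://www.cnn.com/2024/10/29/politics/joe-biden-campaign-trail/index.html",
    "https://www.cnn.com/2024/10/29/politics/early-voting-turnout/index.html",
    "https://www.latinnews.com/latinnews-country-database.html?country=2156",
]

# pause=(0, 0) keeps the demo fast; use the default when hitting a real site
results, errors = scrape_many(urls, fake_scraper, pause=(0, 0))
print(len(results), len(errors))
```

Storing the error as text rather than as an exception object makes the list easy to save alongside your data.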

## Practice II with Latin News Website

Now, return to https://www.latinnews.com/latinnews-country-database.html?country=2156

Do the following: 

- Build a function encapsulating your code from Practice I
- Build another function to collect all links from Brazil in Latin News 
- Scrape all news for a single country using these two functions

In [5]:
# setup
import pandas as pd
import requests # For downloading the website
from bs4 import BeautifulSoup # For parsing the website
import time # To put the system to sleep
import random # for random numbers
import re # regular expressions

In [6]:
def scrape_latin_news(url):
    
    '''
    this function scrapes relevant content from the latin news website
    input: str, url from latinnews
    output: dict with url, title, text and date (None if the request fails)
    '''

    # Get access to the website
    page = requests.get(url)
    # create a bs object.
    soup = BeautifulSoup(page.content, 'html.parser') # input 1: request content; input 2: tell you need an html parser

    # check if you got a connection
    if page.status_code == 200:
        
        #title
        title = soup.select(".article-title-single")
        title = title[0].text

        # text
        text_ = soup.select(".itemFullText")
        text_ = text_[0].text

        # clean the text
        text_ = re.sub(r'[\n\t\r]', '', text_)
        
        # date
        date = soup.select("h1")
        date = date[0].text.replace("LatinNews Daily - ", "")

        # organize as a dictionary
        out = {"url":url, 
              "title":title,
             "text":text_, 
             "date":date}      
        # return 
        return out


In [7]:
# Let's write the above as a single function
def collect_links_latin(url=None):
    """Collect article links from a Latin News country page.

    Args:
        url (str): URL of a valid Latin News page from which to collect links.
    Returns:
        list: URLs of the articles listed on the page
    """
    
    # Get access to the website (use the url argument, not a global variable)
    page = requests.get(url)

    # create a bs object.
    soup = BeautifulSoup(page.content, 'html.parser') # input 1: request content; input 2: tell you need an html parser

    # collect the href attribute of every archive item
    links_from_lnews = []
    for tag in soup.select(".archive_item"):
        href = tag.attrs.get("href")
        links_from_lnews.append(href)

    # combine with the base url
    links_from_lnews = ["https://www.latinnews.com/" + l for l in links_from_lnews]
    return links_from_lnews


In [3]:
url_brazil ="https://www.latinnews.com/latinnews-country-database.html?country=2156"

In [8]:
links_brazil = collect_links_latin(url=url_brazil)

In [10]:
len(links_brazil)

593

In [11]:
# iterate
out = []
for i in range(10):
    out.append(scrape_latin_news(links_brazil[i]))

In [12]:
out

[{'url': 'https://www.latinnews.com//component/k2/item/103556.html?period=2024&archive=3&Itemid=6&cat_id=835066:brazil-police-concludes-probe-on-high-profile-amazon-murders',
  'title': 'BRAZIL: Police concludes probe on high-profile Amazon murders',
  'text': 'On 4 November Brazil’s federal police (PF) announced it had concluded its investigations into the 2022 murders of Brazilian indigenous expert Bruno Pereira and British journalist Dom Phillips in the Vale do Javari demarcated indigenous territory, Amazonas state.Analysis:The murder of Pereira and Phillips on 5 June 2022 was one of the most high-profile incidents in recent years of an organised crime group in Brazil assassinating defenders of human rights and the environment. The conclusion of the PF’s probe marks some progress in the trudge towards justice for the victims, given the police force has recommended charges against the alleged mastermind of the killings and his suspected accomplices. However, some reports in the natio

In [13]:
# convert to a dataframe
pd.DataFrame(out)

Unnamed: 0,url,title,text,date
0,https://www.latinnews.com//component/k2/item/1...,BRAZIL: Police concludes probe on high-profile...,On 4 November Brazil’s federal police (PF) ann...,5 November 2024
1,https://www.latinnews.com//component/k2/item/1...,In brief: Brazil’s Itaú posts higher profits i...,*Brazilian private bank Itaú Unibanco has rele...,5 November 2024
2,https://www.latinnews.com//component/k2/item/1...,BRAZIL: Tensions with Venezuela escalate,On 2 November the Venezuelan foreign ministry ...,4 November 2024
3,https://www.latinnews.com//component/k2/item/1...,In brief: Brazil’s industrial production up in...,*Brazil’s national statistics institute (Ibge)...,4 November 2024
4,https://www.latinnews.com//component/k2/item/1...,In brief: Brazil’s unemployment down in Q3,*Brazil’s national statistics institute (Ibge)...,1 November 2024
5,https://www.latinnews.com//component/k2/item/1...,BRAZIL: Court convicts killers of Marielle Franco,On 31 October former policemen Ronnie Lessa an...,1 November 2024
6,https://www.latinnews.com//component/k2/item/1...,Violence and crime mark Brazil’s local elections,Ahead of the first round of Brazil’s municipal...,Security & Strategic Review - November 2024
7,https://www.latinnews.com//component/k2/item/1...,BRAZIL: Bolsonaro and Lula suffer electoral se...,Since the first round of Brazil’s municipal el...,Weekly Report - 31 October 2024 (WR-24-43)
8,https://www.latinnews.com//component/k2/item/1...,BRAZIL: Alleged killers of Marielle Franco sta...,On 30 October a court in Brazil’s Rio de Janei...,31 October 2024
9,https://www.latinnews.com//component/k2/item/1...,In brief: Brazil’s gov’t boosts investment for...,*Brazil’s President Luiz Inácio Lula da Silva ...,31 October 2024
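
The `date` column is not uniform: some rows keep a report name in front of the date, and monthly reviews carry no day at all. One possible cleaning step, sketched with the standard library, is shown below. The helper name `parse_latinnews_date` is hypothetical, and treating a missing day as the 1st of the month is an assumption you may want to change.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# optional day, then a month name, then a four-digit year
DATE_RE = re.compile(r"(?:(\d{1,2}) )?(" + "|".join(MONTH_NAMES) + r") (\d{4})")

def parse_latinnews_date(raw):
    """Return a date object for strings like those in the `date` column, or None."""
    m = DATE_RE.search(raw)
    if m is None:
        return None
    day, month, year = m.groups()
    # Assumption: entries without a day (monthly reviews) are set to the 1st
    return date(int(year), MONTH_NAMES.index(month) + 1, int(day or 1))

# Values copied from the data frame above
samples = [
    "5 November 2024",
    "Weekly Report - 31 October 2024 (WR-24-43)",
    "Security & Strategic Review - November 2024",
]
parsed = [parse_latinnews_date(s) for s in samples]
print(parsed)
```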


