In today's class, we will learn how to use Selenium for research purposes. Selenium will allow us to collect data from dynamic websites and to interact with them programmatically.
# setup
import requests
import os
import pandas as pd
Static web pages: when the browser content and the source code match each other. Everything you see in your browser also appears in the HTML source code. For these types of pages, scraping can be accomplished with libraries such as requests and BeautifulSoup, as in the short sketch below.
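Here is a minimal sketch of scraping a static page, assuming the bs4 (BeautifulSoup) package is installed; the URL and the h1 tag are just placeholders:
# minimal sketch: scraping a static page with requests + BeautifulSoup
# (the URL and the h1 tag below are placeholders)
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# for a static page, everything we need is already in the raw HTML
print(soup.find("h1").text)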
Dynamic web pages: when the content we are viewing in our browser does not match the content we see in the HTML source code we retrieve from the site. There are two approaches to scraping a dynamic webpage:
Definition: Selenium is an open-source tool for automating web browsers, originally developed for browser testing. It allows us to write scripts in many popular programming languages, such as Java, Python, and C#, and it works across all major operating systems and web browsers.
Selenium works by automating a browser to execute JavaScript and render a web page just as we would normally interact with it. By doing so, we can scrape web pages as we see them, even when the content is not present in the raw HTML source!
We will use Selenium for data science purposes in two different ways:
Collect data from dynamic websites (such as YouTube, Zillow, Toutiao, among others)
Interact with websites, and potentially conduct algorithmic studies of recommendation systems.
Since you already know the basics of HTML, we will jump straight to using Selenium to scrape and interact with dynamic websites.
Setting up Selenium in your environment can be a bit tricky. First, you need to install the selenium library in Python. Second, you need to download a Selenium WebDriver, a browser-dependent executable file that acts as a bridge between your script and the browser.
See the instructions to set up your Selenium environment here: https://selenium-python.readthedocs.io/installation.html
Most importantly, we can use the Web Driver Manager library in Python (https://pypi.org/project/webdriver-manager/) to help us set up our Selenium environment.
# setup
#!pip install selenium
#!pip install webdriver-manager
# open selenium from a path
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
# import other packages
import json
import time
import re
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
To scrape data with Selenium, our workflow will usually involve two steps:
Navigate to a web page with the .get() method.
Find elements on the page with the find_element() methods, which locate an element based on the attribute/selector criteria we supply in our script. This is the basic structure of this method: driver.find_element(By.<attribute>, <selector>)
The find_element() method of the driver object receives two inputs: a By.<ATTRIBUTE> locator for the attribute/tag, and a string with the selector that identifies the element. You can use the following attributes with the By locator:
find_element(By.ID, "id")
find_element(By.NAME, "name")
find_element(By.XPATH, "xpath")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.TAG_NAME, "tag name")
find_element(By.CLASS_NAME, "class name")
find_element(By.CSS_SELECTOR, "css selector")
See the Selenium documentation page for in-depth coverage of the find_element and By methods.
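For example, here is a minimal sketch of the locator API, assuming a driver object has already been created; the selectors below are hypothetical:
# minimal sketch of the locator API (the selectors below are hypothetical)
# a single element (raises NoSuchElementException if nothing matches)
headline = driver.find_element(By.CSS_SELECTOR, "h1.headline")
# all matching elements (returns an empty list if nothing matches)
links = driver.find_elements(By.TAG_NAME, "a")
# a single element located by its id attribute
search_box = driver.find_element(By.ID, "search")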
We will start with an example of using Selenium to scrape a Brazilian fact-checking agency (Lupa). You do not need to know Portuguese for this; just use a translation tool such as Google Translate to understand what is going on.
# call driver
from selenium.webdriver.chrome.options import Options
options = Options()
#options.add_argument('--headless')
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
options=options)
# close your driver
driver.close()
from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
driver.close()
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
# send your driver to a website
driver.get("https://lupa.uol.com.br/jornalismo/categoria/verifica%C3%A7%C3%A3o")
# put it to sleep a bit to give the page time to load
time.sleep(15)
title = driver.find_elements(By.CSS_SELECTOR,".eRlKZk")
title[0]
Notice the output is a selenium.webdriver.remote.webelement.WebElement. This is a WebElement object: it represents an HTML element within an HTML document and, similarly to BeautifulSoup, it provides us with methods to interact with the HTML element.
See a full description of the WebElement methods in this tutorial: https://www.geeksforgeeks.org/element-methods-in-selenium-python/
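Beyond .text and get_attribute(), which we use below, a few other WebElement methods are worth knowing. A minimal sketch; the nested lookup is hypothetical and may not exist for this element:
# a few other WebElement methods (sketch)
elem = title[0]
print(elem.tag_name)        # the element's tag name, e.g. 'h2'
print(elem.is_displayed())  # whether the element is visible on the page
# a WebElement can also be searched within (nested lookup); this child
# selector is hypothetical and will fail if the headline has no <a> child
# link = elem.find_element(By.TAG_NAME, "a")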
# see the outer HTML
title[0].get_attribute('outerHTML')
# Get the text
title[0].text
# list comprehension for all cases
dict_ = {"title":[t.text for t in title]}
df = pd.DataFrame(dict_)
df.head()
Let's now write a full function to collect all the content we are interested in.
# function to scrape the headlines
def scrap_lupa(driver):
"""
function to scrape the Lupa headlines
input:
driver: selenium driver
"""
time.sleep(15)
# collect data
try:
title = driver.find_elements(By.CSS_SELECTOR,".eRlKZk")
except:
title= ""
try:
text = driver.find_elements('css selector',"#init p:nth-child(1)")
except:
text=""
try:
date = driver.find_elements('css selector', '.cuaKEv .kTcxJC')
except:
date = ""
try:
url = driver.find_elements('xpath','//*[@id="init"]/div/div/div[2]/div/div/div/div/div[1]/div/div[2]/div/a')
except:
url= ""
# get information
dict_ = {"title":[t.text for t in title],
"text":[t.text for t in title],
"date":[b.text for b in date],
"url":[u.get_attribute('href')for u in url]}
# make dataframe
df = pd.DataFrame(dict_)
return(df)
# run the function
df = scrap_lupa(driver)
# see outputs
df.head()
df.shape
More than a scraping tool, Selenium provides us with a complete API to interact with webpages and mimic user behavior online. Let's see some examples of this.
When an element is on screen, it means it is embedded in the DOM structure of the web page, and we can therefore find it when inspecting the page. If the element we act upon is not in the DOM structure, Selenium will raise an error (for example, a NoSuchElementException).
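Instead of fixed time.sleep() calls, we can also ask Selenium to wait explicitly until an element is present in the DOM and handle the error if it never appears. A minimal sketch, reusing the same Lupa headline selector as above:
# minimal sketch: explicit wait instead of a fixed sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # wait up to 15 seconds for at least one headline to be in the DOM
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".eRlKZk")))
except TimeoutException:
    print("The element never appeared in the DOM")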
In the previous example, you could see we only collected 30 links. However, the website has a button at the bottom of the page to load more news. We can load more headlines by scrolling to the bottom of the page and asking Selenium to click that button.
Let's see how it works.
# import some additional selenium helpers
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# function to load more data
def load_more_headlines(driver, n):
"""
function to load more headlines on the Lupa website
"""
# first iteration you actually need to click on load more
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.execute_script("window.scrollBy(0, -200);")
# find element
element = driver.find_element(By.CSS_SELECTOR,'.bGDmDu')
element.click()
# keep scrolling down
for i in range(0,n):
try:
# goes down on the webpage
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.execute_script("window.scrollBy(0, -200);")
# sleep
time.sleep(5)
print(f"Click to upload more Headlines: {str(i)}")
except:
print("something wrong")
# let's see how it works
options = Options()
#options.add_argument('--headless=new')
options.add_argument("window-size=1920,1080")
# create a new driver
#driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
options=options)
Take a look at the browser window that Selenium just opened to see where the driver is.
# go to lupa
driver.get("https://lupa.uol.com.br/jornalismo/categoria/verifica%C3%A7%C3%A3o")
# load more pages
load_more_headlines(driver, 5)
# collect data
df = scrap_lupa(driver)
# see outputs
print(df.shape)
# bottom
df.tail()
driver.close()
Another useful feature of Selenium is that it allows us to input text into web forms. Let's see a toy example with Google:
# create a new driver
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
# go to google
driver.get("https://www.google.com/")
# get all the searches for Data Science and Public Policy
input_ = driver.find_element(By.CSS_SELECTOR, "#APjFqb")
# send text
input_.send_keys("Data Science and Public Policy")
# click
input_.send_keys(Keys.ENTER)
#!pip install pillow
# see
from PIL import Image
import matplotlib.pyplot as plt
# screenshot
driver.save_screenshot("google.png")
img = Image.open('google.png')
plt.imshow(img)
One of my research projects focuses on understanding social media recommendation systems in China. The project investigates when and how digital platform recommendation systems in China promote certain types of content. To do so, we study the Chinese news aggregator Toutiao, and we are currently scraping data from Toutiao to answer our research question.
Selenium is the main tool we are working with. With Selenium, we do the following:
This is a complex research question, and creating sock puppets is particularly challenging. However, with today's class, you can understand the data collection step.
Below, you will see the class I created to collect data from Toutiao. Your in-class/additional exercise is to work through the code and try to use it to collect content from at least one page on Toutiao.
Let me know how it goes!
#import libraries from selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# import other libraries
import math
import time
import string
import pandas as pd
import numpy as np
# utils functions
def sleep_random_time(range_):
'''
:param range_: a tuple of the range of wait times
ex: (0,30) would randomly sleep between 0 and 30 seconds
'''
if isinstance(range_, int):
time.sleep(np.random.choice(range(range_)))
elif isinstance(range_, tuple):
min_time, max_time = range_
if isinstance(min_time, float):
min_time = int(math.ceil(min_time))
if isinstance(max_time, float):
max_time = int(math.floor(max_time))
time.sleep(np.random.choice(range(min_time, max_time)))
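# usage sketch for sleep_random_time (the ranges below are arbitrary examples):
# sleep_random_time(5)        # sleeps between 0 and 4 seconds
# sleep_random_time((3, 7))   # sleeps between 3 and 6 seconds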
# create a class
class ToutiaoBot():
'''
Returns a Bot to interact with the Toutiao Webpage
:param
        headless (bool): if True, spins up a headless browser. Defaults to False.
'''
# create instance attributes -----------------
def __init__(self, headless=False):
if headless==True:
options = Options()
# HEADLESS OPTIONS
options.add_argument('--headless=new')
options.add_argument("window-size=1920,1080")
# bypass OS security
options.add_argument('--no-sandbox')
# overcome limited resources
options.add_argument('--disable-dev-shm-usage')
# don't tell chrome that it is automated
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# disable images
prefs = {"profile.managed_default_content_settings.images": 2}
options.add_experimental_option("prefs", prefs)
            # accept insecure/SSL certificates (in Selenium 4, capabilities are set on the options object)
            options.set_capability('acceptInsecureCerts', True)
self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
options=options)
else:
self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
# method to close the bot -----------------------------
def close(self):
'''
close the bot
'''
self.driver.close()
# method to visit the home feed -----------------------------
def go_home_feed(self):
'''
Sends the bot to the main toutiao homepage
'''
url = "https://www.toutiao.com/"
        # go to the Toutiao home page
self.driver.get(url)
# add some wait time
self.element = WebDriverWait(self.driver, 10).until(
EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div[5]/div[2]/div[1]/div/a')))
# method to open a article -----------------
def go_article(self, article_url, time_read):
'''
sends the bot to a seed article
:param
        article_url: str, string with the full url for the article
time_read: int, time to spend "reading" the article
'''
        # go to the article
self.driver.get(article_url)
# add some wait time
#self.element = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,
# '.media-info .user-info')))
# time to stay in the article
time.sleep(time_read)
# method to collect articles metadata. -------------------------------
def collect_metadata_article(self, article_url, time_read):
'''
        sends the bot to an article and collects the article metadata
        :param
        article_url: str, string with the full url for the article
time_read: int, time to spend "reading" the article
'''
# navigate to the article
self.go_article(article_url, time_read)
# create a dictionary
collector = dict()
collector["video_url"] = article_url
# scrape article information
# author info
try:
author_info = self.driver.find_element(By.CSS_SELECTOR, ".media-info .user-info")
author_info_ = author_info.text.split("\n")
collector["author_info_abbrv"] = author_info_[0]
collector["author_info_full_name"] = author_info_[1]
except:
collector["author_info_abbrv"] = ""
collector["author_info_full_name"] = ""
# link to author
try:
collector["author_link"] = author_info.find_element(By.CLASS_NAME, "user-avatar").get_attribute("href")
except:
collector["author_link"] = ""
# title
try:
            title = self.driver.find_element(By.CSS_SELECTOR, ".article-content h1")
collector["title"] = title.text
except:
collector["title"] =""
# text
try:
text_boxes= self.driver.find_elements(By.CSS_SELECTOR, 'p[data-track]')
collector["text"] = " ".join([t.text for t in text_boxes])
except:
collector["text"] = ""
# reactions
try:
likes = self.driver.find_element(By.CLASS_NAME, "detail-like")
comments = self.driver.find_element(By.CLASS_NAME, "detail-interaction-comment")
collector["likes"] = likes.text
collector["n_comments"] = comments.text
except:
collector["likes"] = ""
collector["n_comments"] = ""
# add time of the publication
try:
time_= self.driver.find_element(By.CSS_SELECTOR, '.original-tag+ span')
collector["time"] = time_.text
except:
collector["time"] = ""
# return
return collector
### method to collect related articles --------------------------------
def collect_related_articles(self, article_url, time_read):
'''
        sends the bot to an article and collects the related articles recommended by the same author
        :param
        article_url: str, string with the full url for the article
time_read: int, time to spend "reading" the article
'''
# navigate to the article
self.go_article(article_url, time_read)
# create a dictionary
collector = dict()
collector["video_url"] = article_url
# collect related article
try:
related = self.driver.find_elements(By.CSS_SELECTOR, ".related-list-item")
collector["link_related"]=[r.find_element(By.TAG_NAME, "a").get_attribute("href") for r in related]
collector["text_related"] = [r.find_element(By.CLASS_NAME, "title").text for r in related]
except:
collector["link_related"] =""
collector["text_related"] =""
return collector
### method to collect hot topics from a article --------------------------------
def collect_hot_topic_from_article(self, article_url, time_read):
'''
        sends the bot to an article and collects the hot topics recommended on the article's page
        :param
        article_url: str, string with the full url for the article
time_read: int, time to spend "reading" the article
'''
# navigate to the article
self.go_article(article_url, time_read)
# create a dictionary
collector = dict()
collector["video_url"] = article_url
# collect related article
try:
related = self.driver.find_elements(By.CSS_SELECTOR, ".article-item")
collector["link_hot_topic"]=[r.get_attribute("href") for r in related]
collector["text_hot_topic"] = [r.get_attribute("aria-label") for r in related]
except:
collector["link_hot_topic"] =""
collector["text_text_topic"] =""
return collector
### method to collect reccomendations from a article --------------------------------
def collect_rec_from_article(self, article_url, time_read):
'''
        sends the bot to an article and collects the recommendations from the article's page
        :param
        article_url: str, string with the full url for the article
time_read: int, time to spend "reading" the article
'''
# navigate to the article
self.go_article(article_url, time_read)
# create a dictionary
collector = dict()
collector["video_url"] = article_url
        # scroll down to the bottom of the page
time.sleep(np.random.choice(range(3, 7)))
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(np.random.choice(range(3, 7)))
try:
rec = self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")
            # collect recommendations
if len(rec)>0:
pass
else:
time.sleep(np.random.choice(range(3, 7)))
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
rec= self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")
try:
collector["links"]=[r.get_attribute("href") for r in rec]
collector["text"]=[r.get_attribute("aria-label") for r in rec]
collector["title"]=[r.get_attribute("title") for r in rec]
# code to clean later. some of the links are coming without text
# clean text
#text_=[r.get_attribute("aria-label") for r in rec]
#text_=["" if text is None else str(text) for text in text_]
# clean title
#title_=[r.get_attribute("title") for r in rec]
#title_=["" if text is None else str(text) for text in title_]
# combine text and title
#z = zip(text_, title_)
#collector["title"]=["".join(z_) for z_ in z]
except:
collector["links"]=""
collector["text"]=""
collector["title"]=""
except:
pass
return collector
### method to collect reccomendations from a article --------------------------------
def collect_rec_from_home(self, user_id):
        '''
        sends the bot to the home page and collects the recommendations
        :param
        user_id: identifier used to tag the collected recommendations
        '''
        # navigate to the home feed
self.go_home_feed()
# create a dictionary
collector = dict()
collector["user_id"] = user_id
        # scroll down to the bottom of the page
time.sleep(np.random.choice(range(3, 7)))
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(np.random.choice(range(3, 7)))
rec = self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")
        # collect recommendations
if len(rec)>0:
pass
else:
time.sleep(np.random.choice(range(3, 7)))
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
rec = self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")
try:
collector["links"]=[r.get_attribute("href") for r in rec]
collector["title"]=[r.get_attribute("aria-label") for r in rec]
collector["text"]=[r.text for r in rec]
# code to clean later. some of the links are coming without text
# clean text
#text_=[r.get_attribute("aria-label") for r in rec]
#text_=["" if text is None else str(text) for text in text_]
# clean title
#title_=[r.get_attribute("title") for r in rec]
#title_=["" if text is None else str(text) for text in title_]
# combine text and title
#z = zip(text_, title_)
#collector["title"]=["".join(z_) for z_ in z]
except:
collector["links"]=""
collector["text"]=""
collector["title"]=""
return collector
##### method to read articles on toutiao -------------------------
def action_read_article(self, article_url, time_read):
'''
        sends the bot to an article and mimics a user reading it
        :param
        article_url: str, string with the full url for the article
time_read: int, time to spend "reading" the article
'''
# navigate to the article
self.go_article(article_url, time_read)
# replicate some common user behavior
# 1 - move mouse to the author of the article
try:
element = self.driver.find_element(By.CSS_SELECTOR, ".media-info .user-info")
action = ActionChains(self.driver)
action.move_to_element(element).pause(np.random.choice(range(10))).perform()
except Exception as error:
print("Error occured when trying to move the mouse to the source", error)
# 2 - move the mouse back to the main title
try:
            title_elem = self.driver.find_element(By.CSS_SELECTOR, ".article-content h1")
action = ActionChains(self.driver)
action.move_to_element(title_elem).pause(np.random.choice(range(10))).perform()
except Exception as error:
print("Error occured when trying to move the mouse to the source", error)
# 3 - scroll down through the article
try:
# get a proxy for the length of the article
text_boxes= self.driver.find_elements(By.CSS_SELECTOR, 'p[data-track]')
len_text = len(text_boxes)
#scroll down and spend some time in the article
for idx, par in enumerate(text_boxes):
time.sleep(np.random.choice(range(3, 7)))
self.driver.execute_script("arguments[0].scrollIntoView();", par)
print(f'Reading the paragraph:{idx}')
except Exception as error:
print("An exception occurred:", error)
# 4 - scroll back to the title
try:
self.driver.execute_script("arguments[0].scrollIntoView();", title_elem)
except Exception as error:
print("An exception occurred:", error)
# instantiate
bot = ToutiaoBot()
# add code executing at least one method of this class.
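# one possible starting point for the exercise (a sketch, not a full solution);
# the URL below is a placeholder and must be replaced with a real Toutiao article link
# seed_url = "https://www.toutiao.com/article/<article-id>/"
# metadata = bot.collect_metadata_article(seed_url, time_read=10)
# related = bot.collect_related_articles(seed_url, time_read=5)
# print(metadata)
# bot.close()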
!jupyter nbconvert _week-08_selenium.ipynb --to html --template classic