PPOL 5203 Data Science I: Foundations

Parsing Unstructure Digital Data. Scraping Part II - Selenium

Tiago Ventura


Learning Goals

In the class today, we will learn how to use selenium for research purposes. Selenium will allow us to:

  • Webscrap dynamic websites
  • Imitate human behavior online
In [223]:
# setup
import requests
import os
import pandas as pd

Static vs Dynamic web pages

Static web pages: when the browser and the source code content match each other. Everything you see in your browser matchs as source html code. For these type of pages, scrapping can be accomplished using:

  • string methods and regex
  • beautifulsoup
  • scrapy

Dynamic web pages: when the content we are viewing in our browser does not match the content we see in the HTML source code we are retrieving from the site/ There are two approaches to scraping a dynamic webpage:

  • Scrape the content directly from the JavaScript
  • Scrape the website as we view it in our browser — using Python packages capable of executing the JavaScript.

Selenium

Definition: Selenium is an open source tool which is used for automating web browser testing. It allows us to write script in any known programming languages like Java, Python, C# etc. It works across all major OS and also works on all major web browsers.

Selenium works by automating browsers to execute JavaScript to display a web page as we would normally interact with it. By doing so, we can scrape the web pages as we see it, even though their HTML source is not there!

We will use selenium for data science purposes with two different approaches:

  • Collect data from dynamic websites (such as Youtube, Zillow, Toutiao, among others)

  • Interact with websites, and potentially conduct algorithmic studies of recommendations systems.

Since you already know the basics of html, we will jump straight to using selenium to scrap and interact with dynamic websites

Installing Selenium

Setting up selenium on your environment can be a bit tricky. First, you need to install the selenium library in Python. Second, you need to download the selenium webdriver. It is a browser-dependent executable file that acts as a bridge between your script and the browser.

See here instructions to set up your selenium environment: https://selenium-python.readthedocs.io/installation.html

Most importantly, we can use the Web Driver Manager library in Python (https://pypi.org/project/webdriver-manager/) to help us set our selenium environment up.

In [224]:
# setup
#!pip install selenium
#!pip install webdriver-manager

# open selenium from a path
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

# import other packages
import json
import time
import re
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

Scraping Data with Selenium

To scrape data with selenium, our workflow will usually involve:

  • Navigate to a page using .get() method
  • Define which elements we are interested in collecting. Elements are all things inside of html tags (<> element </>)
  • Find the htmls attributes or tags that identify these elements. For this task, we can use:
  • Use the find_element methods to find an element based on its attributes/value criteria or selector value that we supply in our script. This is the basic structure of this method:
    • driver.find_element(By.<attribute>, <selector>)

The find_element() methods from the driver object receives two inputs: By.<> methods for the attributes/tags, and a string with the selector to identify the attributtes/tags. You can use the following attribute with the By. method:

  • find_element(By.ID, "id")
  • find_element(By.NAME, "name")
  • find_element(By.XPATH, "xpath")
  • find_element(By.LINK_TEXT, "link text")
  • find_element(By.PARTIAL_LINK_TEXT, "partial link text")
  • find_element(By.TAG_NAME, "tag name")
  • find_element(By.CLASS_NAME, "class name")
  • find_element(By.CSS_SELECTOR, "css selector")

See the selenium documentation page for a in-depth coverage of the find_element and By methods

Scraping a Brazilian Fact-Checking Agency

We will start with an example of using Selenium to Scrape a Brazilian Fact-checking Agency. You do not need to know portuguese for this, just you your translator tool on google to understand what is going on.

Step 1 : Create your web driver

In [225]:
# call driver
from selenium.webdriver.chrome.options import Options
options = Options()
#options.add_argument('--headless')
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                                      options=options)
In [226]:
# close your drives
driver.close()
In [7]:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
In [8]:
driver.close()

Step 2 : Navigate to your webpage of interest

In [12]:
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

# send your driver to an website
driver.get("https://lupa.uol.com.br/jornalismo/categoria/verifica%C3%A7%C3%A3o")

# put it to sleep a bit. Give some time to load
time.sleep(15)

Step 3: Collect the Data

In [14]:
title = driver.find_elements(By.CSS_SELECTOR,".eRlKZk")
title[0]
Out[14]:
<selenium.webdriver.remote.webelement.WebElement (session="d447852af98c523e6ef5490f8b401afa", element="f.A5D0610EF615F60EA7824C7D1D454416.d.889A95CB3C409F4C1EE2BB7984B6140F.e.5")>

Notice the output is a selenium.webdriver.remote.webelement.WebElement. This is an WebElement class. It symbolizes an HTML element within an HTML document, and similarly to BeautifulSoup, it provides us with methods to interact with the html element

See a full descriptions of the WebElements methods in this tutorial: https://www.geeksforgeeks.org/element-methods-in-selenium-python/

In [15]:
# see the outer HTML
title[0].get_attribute('outerHTML')
Out[15]:
'<div size="12" class="sc-hLBbgP eRlKZk"><span class="sc-eDvSVe hAIjQn">Voos de repatriados do Oriente Médio incluem homens, mulheres e crianças</span></div>'
In [16]:
# Get the text
title[0].text
Out[16]:
'Voos de repatriados do Oriente Médio incluem homens, mulheres e crianças'
In [17]:
# list comprehension for all cases
dict_ = {"title":[t.text for t in title]}
df = pd.DataFrame(dict_)
df.head()
Out[17]:
title
0 Voos de repatriados do Oriente Médio incluem h...
1 Médico que teria descoberto cura para fungos n...
2 É falso que urnas de Kentucky estejam trocando...
3 Post engana ao dizer que Lula não feriu a cabe...
4 Jornais não ocultaram tatuagem de Lula em caso...

Let's now write a full function to collect all the content we are are interested in.

In [18]:
# function to scrap
def scrap_lupa(driver):
  
  """
  function to scrap the lupa headlines
  input:
    driver: selenium driver
  """    
  time.sleep(15)
  
  # collect data
  try:
    title = driver.find_elements(By.CSS_SELECTOR,".eRlKZk")
  except:
    title= ""
  
  try:  
    text = driver.find_elements('css selector',"#init p:nth-child(1)")
  except:
    text=""
  
  try:  
    date = driver.find_elements('css selector', '.cuaKEv .kTcxJC')
  except:
    date = ""
  
  try:
    url = driver.find_elements('xpath','//*[@id="init"]/div/div/div[2]/div/div/div/div/div[1]/div/div[2]/div/a')
  except:
    url= ""
  
  
  # get information  
  dict_ = {"title":[t.text for t in title], 
            "text":[t.text for t in title], 
             "date":[b.text for b in date], 
             "url":[u.get_attribute('href')for u in url]}
  
  # make dataframe
  df = pd.DataFrame(dict_)
  
  return(df)
In [19]:
# run the function
df = scrap_lupa(driver)

# see outputs
df.head()
Out[19]:
title text date url
0 Voos de repatriados do Oriente Médio incluem h... Voos de repatriados do Oriente Médio incluem h... 04.11.2024 - 11h02 https://lupa.uol.com.br/jornalismo/2024/11/04/...
1 Médico que teria descoberto cura para fungos n... Médico que teria descoberto cura para fungos n... 01.11.2024 - 20h34 https://lupa.uol.com.br/jornalismo/2024/11/01/...
2 É falso que urnas de Kentucky estejam trocando... É falso que urnas de Kentucky estejam trocando... 01.11.2024 - 18h58 https://lupa.uol.com.br/jornalismo/2024/11/01/...
3 Post engana ao dizer que Lula não feriu a cabe... Post engana ao dizer que Lula não feriu a cabe... 01.11.2024 - 18h35 https://lupa.uol.com.br/jornalismo/2024/11/01/...
4 Jornais não ocultaram tatuagem de Lula em caso... Jornais não ocultaram tatuagem de Lula em caso... 01.11.2024 - 17h54 https://lupa.uol.com.br/jornalismo/2024/11/01/...
In [20]:
df.shape
Out[20]:
(30, 4)

Interacting with the Webpage

More than a scraping tool, selenium provides us with a complete API to interact with webpages and mimic user behavior online. Let's see here some examples of this.

Scroling Down + Clicking

When an element is on screen it means that it is embedded in the DOM structure of the web page. Therefore, we can find its DOM structure when inspecting a web page. If the element we act upon is not in the DOM structure, then we will see some kind of error message.

In the previous example, you could see we only collected 30 links. However, the website allowed us at the bottom of the page to load more news. We can do that by scrolling down to the bottom of the website, and ask selenium to click in a particular button.

Let's see how it works.

In [228]:
# call some more methods here
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# function to load more data
def load_more_headlines(driver, n):
    """
      function to load more pages in the lupa website
    """
  
  # first iteration you actually need to click on load more
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    driver.execute_script("window.scrollBy(0, -200);")
    
  # find element
    element = driver.find_element(By.CSS_SELECTOR,'.bGDmDu') 
    element.click()

    # keep scrolling down

    for i in range(0,n):
        try:
      # goes down on the webpage  
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            driver.execute_script("window.scrollBy(0, -200);")
    
      # sleep  
            time.sleep(5)
            print(f"Click to upload more Headlines: {str(i)}")
        except:
            print("something wrong")
In [232]:
# let's see how it works
#options.add_argument('--headless=new')
options = Options()
options.add_argument("window-size=1920,1080")

# create a new driver
#driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), 
                                       options=options)

See where the driver is ....

In [233]:
# go to lupa
driver.get("https://lupa.uol.com.br/jornalismo/categoria/verifica%C3%A7%C3%A3o")
In [234]:
# load more pages
load_more_headlines(driver, 5)
Click to upload more Headlines: 0
Click to upload more Headlines: 1
Click to upload more Headlines: 2
Click to upload more Headlines: 3
Click to upload more Headlines: 4
In [213]:
# collect data
df = scrap_lupa(driver)
In [214]:
# see outputs
print(df.shape)

# bottom
df.tail()
(150, 4)
Out[214]:
title text date url
145 É falso que mpox seja efeito colateral da vaci... É falso que mpox seja efeito colateral da vaci... 12.09.2024 - 12h56 https://lupa.uol.com.br/jornalismo/2024/09/12/...
146 Governo Lula não comprou drone que provoca que... Governo Lula não comprou drone que provoca que... 11.09.2024 - 17h35 https://lupa.uol.com.br/jornalismo/2024/09/11/...
147 É falso que governo criou 'Dops' para censurar... É falso que governo criou 'Dops' para censurar... 11.09.2024 - 17h14 https://lupa.uol.com.br/jornalismo/2024/09/11/...
148 É falso texto que orienta usuário a publicar d... É falso texto que orienta usuário a publicar d... 11.09.2024 - 16h54 https://lupa.uol.com.br/jornalismo/2024/09/11/...
149 É falso que ‘bloqueio da Starlink’ atrapalha c... É falso que ‘bloqueio da Starlink’ atrapalha c... 11.09.2024 - 15h54 https://lupa.uol.com.br/jornalismo/2024/09/11/...
In [215]:
driver.close()

Inputing Text

Another useful functionality of selenium is to allow us to input text on webforms. Let's see a toy example with google:

In [216]:
# create a new driver
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
In [217]:
# go to google
driver.get("https://www.google.com/")
In [218]:
# get all the searches for Data Science and Public Policy
input_ = driver.find_element(By.CSS_SELECTOR, "#APjFqb")
In [219]:
# send text
input_.send_keys("Data Science and Public Policy")

# click
input_.send_keys(Keys.ENTER)
In [150]:
#!pip install screenshot
Requirement already satisfied: screenshot in /Users/tb186/anaconda3/lib/python3.11/site-packages (1.0.0)
Requirement already satisfied: click in /Users/tb186/anaconda3/lib/python3.11/site-packages (from screenshot) (8.0.4)
Requirement already satisfied: pyobjc-framework-Quartz in /Users/tb186/anaconda3/lib/python3.11/site-packages (from screenshot) (10.0)
Requirement already satisfied: pyobjc-core>=10.0 in /Users/tb186/anaconda3/lib/python3.11/site-packages (from pyobjc-framework-Quartz->screenshot) (10.0)
Requirement already satisfied: pyobjc-framework-Cocoa>=10.0 in /Users/tb186/anaconda3/lib/python3.11/site-packages (from pyobjc-framework-Quartz->screenshot) (10.0)
In [220]:
# see
from PIL import Image
import matplotlib.pyplot as plt
In [221]:
# screenshot
driver.save_screenshot("google.png")
img = Image.open('google.png')
plt.imshow(img)
Out[221]:
<matplotlib.image.AxesImage at 0x17920bcd0>

Practice: Understanding Chinese Social Media Recommendation Systems

One of my research projects focuses on understanding social media recommendation systems in China. The project focuses on when and how digital platform recommendation systems in China promote certain types content. To do so, we investigate the news aggregator in China called Toutiao, and we are currently scraping data from toutiao to answer our research question.

Selenium is the main tool we are working with. With selenium we do the following:

  • Create a few sock puppets from selenium, which are basically a bunch of headless browsers in selenium
  • Direct these account to read different articles on Toutiao
  • Investigate if reading more article of content X leads to more or less recommendations from accounts from the topic X.
  • Collect data from reccomendations

This is a super complex research question. Creating sock puppets is particularly challenging. However, with the class you had today, you could understand the collecting data step.

Below, you will see the class I created to collect data from toutiao. Your in-class/additional exercise is to work through the code, and try to use it to collect content from at least one page in toutiao.

Let me know how it goes!

In [120]:
#import libraries from selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# import other libraries
import time
import string
import pandas as pd
import numpy as np
In [121]:
# utils functions
def sleep_random_time(range_):
    '''
    :param range_: a tuple of the range of wait times
    ex: (0,30) would randomly sleep between 0 and 30 seconds
    '''
    if isinstance(range_, int):
        time.sleep(np.random.choice(range(range_)))
    elif isinstance(range_, tuple):
        min_time, max_time = range_
        if isinstance(min_time, float):
            min_time = int(math.ceil(min_time))
        if isinstance(max_time, float):
            max_time = int(math.floor(max_time))

        time.sleep(np.random.choice(range(min_time, max_time)))
In [122]:
# create a class
class ToutiaoBot():
    '''
    Returns a Bot to interact with the Toutiao Webpage
    
        :param 
            headless (bool): True spins up a headless browser. Default equals to False.  
    '''
    # create instance attributes -----------------

    def __init__(self, headless=False):
        if headless==True:
            options = Options()
            
            # HEADLESS OPTIONS
            options.add_argument('--headless=new')
            options.add_argument("window-size=1920,1080")
            self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), 
                                      options=options)
                
            # bypass OS security
            options.add_argument('--no-sandbox')
            # overcome limited resources
            options.add_argument('--disable-dev-shm-usage')
            # don't tell chrome that it is automated
            options.add_experimental_option("excludeSwitches", ["enable-automation"])
            options.add_experimental_option('useAutomationExtension', False)
            # disable images
            prefs = {"profile.managed_default_content_settings.images": 2}
            options.add_experimental_option("prefs", prefs)

            # Setting Capabilities
            capabilities = webdriver.DesiredCapabilities.CHROME.copy()
            capabilities['acceptSslCerts'] = True
            capabilities['acceptInsecureCerts'] = True

            self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), 
                                      options=options)
        else:
            self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
    
    # method to close the bot -----------------------------
        
    def close(self):
        '''
        close the bot
        '''
        self.driver.close()
    
    # method to visit the home feed -----------------------------

    def go_home_feed(self):
        '''
        Sends the bot to the main toutiao homepage
        '''
        url = "https://www.toutiao.com/"
        # go to home toutia
        self.driver.get(url)
        # add some wait time
        self.element = WebDriverWait(self.driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div[5]/div[2]/div[1]/div/a')))
    
    
    # method to open a article -----------------
    
    def go_article(self, article_url, time_read):
        '''
        sends the bot to a seed article
            :param 
                article_url: str, string wiht the full url for the article
                time_read: int, time to spend "reading" the article
        '''
        # go to seed vide
        self.driver.get(article_url)
        
        # add some wait time
        #self.element = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 
         #                                                                               '.media-info .user-info')))
        
        # time to stay in the article
        time.sleep(time_read)
            
    
    # method to collect articles metadata. -------------------------------
    
    def collect_metadata_article(self, article_url, time_read):
        '''
        sends the bot to a see article and collects the url metadata
            :param 
                article_url: str, string wiht the full url for the article
                time_read: int, time to spend "reading" the article

        '''
        
        # navigate to the article
        self.go_article(article_url, time_read)
            
        
        # create a dictionary
        collector = dict()
        collector["video_url"] = article_url
        
        # scrape article information
        
        # author info
        try: 
            author_info = self.driver.find_element(By.CSS_SELECTOR, ".media-info .user-info")
            author_info_ = author_info.text.split("\n")
            collector["author_info_abbrv"] = author_info_[0]
            collector["author_info_full_name"] = author_info_[1]
        except:
            collector["author_info_abbrv"] = ""
            collector["author_info_full_name"] = ""
        # link to author
        try: 
            collector["author_link"] = author_info.find_element(By.CLASS_NAME, "user-avatar").get_attribute("href")
        except:
            collector["author_link"] = ""
        # title
        try:
            title = self.driver.find_element(By.CLASS_NAME, "article-content h1")
            collector["title"] = title.text
        except:
            collector["title"] =""
        # text
        try:
            text_boxes= self.driver.find_elements(By.CSS_SELECTOR, 'p[data-track]')
            collector["text"] = " ".join([t.text for t in text_boxes])
        except:
            collector["text"] = ""
        # reactions
        try:
            likes = self.driver.find_element(By.CLASS_NAME, "detail-like")
            comments = self.driver.find_element(By.CLASS_NAME, "detail-interaction-comment")
            collector["likes"] = likes.text
            collector["n_comments"] = comments.text
        except:
            collector["likes"] = ""
            collector["n_comments"] = ""  
        # add time of the publication
        try:
            time_= self.driver.find_element(By.CSS_SELECTOR, '.original-tag+ span')
            collector["time"] = time_.text
        except:
            collector["time"] = ""

        # return
        return collector
    
    
    ### method to collect related articles --------------------------------
    
    def collect_related_articles(self, article_url, time_read):
        '''
        sends the bot to a see article and collects the related articles reccomended from the same author
            :param 
                article_url: str, string wiht the full url for the article
                time_read: int, time to spend "reading" the article

        '''

        
        # navigate to the article
        self.go_article(article_url, time_read)
            
        
        # create a dictionary
        collector = dict()
        collector["video_url"] = article_url
        
        # collect related article
        try:
            related = self.driver.find_elements(By.CSS_SELECTOR, ".related-list-item")
            collector["link_related"]=[r.find_element(By.TAG_NAME, "a").get_attribute("href") for r in related]
            collector["text_related"] = [r.find_element(By.CLASS_NAME, "title").text for r in related]
        except: 
            collector["link_related"] =""
            collector["text_related"] =""
            
        return collector

    ### method to collect hot topics from a article --------------------------------
    
    def collect_hot_topic_from_article(self, article_url, time_read):
        '''
        sends the bot to a see article and collects the hot topics reccomended in the articles page
            :param 
                article_url: str, string wiht the full url for the article
                time_read: int, time to spend "reading" the article

        '''

        
        # navigate to the article
        self.go_article(article_url, time_read)
            
        
        # create a dictionary
        collector = dict()
        collector["video_url"] = article_url
        
        # collect related article
        try:
            related = self.driver.find_elements(By.CSS_SELECTOR, ".article-item")
            collector["link_hot_topic"]=[r.get_attribute("href") for r in related]
            collector["text_hot_topic"] = [r.get_attribute("aria-label") for r in related]
        except: 
            collector["link_hot_topic"] =""
            collector["text_text_topic"] =""
            
        return collector

    ### method to collect reccomendations from a article --------------------------------
    def collect_rec_from_article(self, article_url, time_read):
        '''
        sends the bot to a see article and collects the recommendations from the articles page
            :param 
                article_url: str, string wiht the full url for the article
                time_read: int, time to spend "reading" the article

        '''

        
        # navigate to the article
        self.go_article(article_url, time_read)
            
        
        # create a dictionary
        collector = dict()
        collector["video_url"] = article_url
        
        # scrool down to the bottom of the page
        time.sleep(np.random.choice(range(3, 7)))
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
        time.sleep(np.random.choice(range(3, 7)))
        
        try: 
            rec = self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")
            # collect reccomend
            if len(rec)>0:
                pass
            else:
                time.sleep(np.random.choice(range(3, 7)))
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
                rec= self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")

            try:
                collector["links"]=[r.get_attribute("href") for r in rec]
                collector["text"]=[r.get_attribute("aria-label") for r in rec]
                collector["title"]=[r.get_attribute("title") for r in rec]

                # code to clean later. some of the links are coming without text

                # clean text
                #text_=[r.get_attribute("aria-label") for r in rec]
                #text_=["" if text is None else str(text) for text in text_]

                # clean title
                #title_=[r.get_attribute("title") for r in rec]
                #title_=["" if text is None else str(text) for text in title_]

                # combine text and title
                #z = zip(text_, title_)
                #collector["title"]=["".join(z_) for z_ in z]
            except:
                collector["links"]=""
                collector["text"]=""
                collector["title"]=""
        except:
            pass
        
        return collector
    
     ### method to collect reccomendations from a article --------------------------------
    def collect_rec_from_home(self, user_id):
        '''
        sends the bot to home page and collects the recommendations

        '''

        # navigate to the article
        self.go_home_feed()
            
        
        # create a dictionary
        collector = dict()
        collector["user_id"] = user_id
        
        # scrool down to the bottom of the page
        time.sleep(np.random.choice(range(3, 7)))
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
        time.sleep(np.random.choice(range(3, 7)))
        rec = self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")

        
        # collect reccomend
        if len(rec)>0:
            pass
        else:
            time.sleep(np.random.choice(range(3, 7)))
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
            rec = self.driver.find_elements(By.CSS_SELECTOR, ".feed-card-article-l .title")
                         
        try:
            collector["links"]=[r.get_attribute("href") for r in rec]
            collector["title"]=[r.get_attribute("aria-label") for r in rec]
            collector["text"]=[r.text for r in rec]
            
            # code to clean later. some of the links are coming without text
           
            # clean text
            #text_=[r.get_attribute("aria-label") for r in rec]
            #text_=["" if text is None else str(text) for text in text_]

            # clean title
            #title_=[r.get_attribute("title") for r in rec]
            #title_=["" if text is None else str(text) for text in title_]

            # combine text and title
            #z = zip(text_, title_)
            #collector["title"]=["".join(z_) for z_ in z]
        except:
            collector["links"]=""
            collector["text"]=""
            collector["title"]=""
        return collector
   
    ##### method to read articles on toutiao -------------------------
    
    def action_read_article(self, article_url, time_read):   
        '''
        sends the bot to a see article and collects the url metadata
            :param 
                article_url: str, string wiht the full url for the article
                time_read: int, time to spend "reading" the article

        '''
        
        # navigate to the article
        self.go_article(article_url, time_read)
        
        # replicate some common user behavior
        
        # 1 - move mouse to the author of the article
        try: 
            element = self.driver.find_element(By.CSS_SELECTOR, ".media-info .user-info")
            action = ActionChains(self.driver)
            action.move_to_element(element).pause(np.random.choice(range(10))).perform()
        except Exception as error: 
            print("Error occured when trying to move the mouse to the source", error)
        
        # 2 - move the mouse back to the main title
        try: 
            title_elem = self.driver.find_element(By.CLASS_NAME, "article-content h1")
            action = ActionChains(self.driver)
            action.move_to_element(title_elem).pause(np.random.choice(range(10))).perform()
        except Exception as error:
            print("Error occured when trying to move the mouse to the source", error)
        
        # 3 - scroll down through the article
        try:
        # get a proxy for the length of the article
            text_boxes= self.driver.find_elements(By.CSS_SELECTOR, 'p[data-track]')
            len_text = len(text_boxes)
            #scroll down and spend some time in the article
            for idx, par in enumerate(text_boxes):
                time.sleep(np.random.choice(range(3, 7)))
                self.driver.execute_script("arguments[0].scrollIntoView();", par)
                print(f'Reading the paragraph:{idx}')
        except Exception as error:
            print("An exception occurred:", error)         
        
        # 4 - scroll back to the title
        try:
            self.driver.execute_script("arguments[0].scrollIntoView();", title_elem)
        except Exception as error:
            print("An exception occurred:", error)         
            
            
In [123]:
# instantiate
bot = ToutiaoBot()
In [124]:
# add code executing at least one method of this class.
In [227]:
!jupyter nbconvert _week-08_selenium.ipynb --to html --template classic
[NbConvertApp] Converting notebook _week-08_selenium.ipynb to html
[NbConvertApp] Writing 471230 bytes to _week-08_selenium.html