Week 09: Acquiring Digital Data: APIs and Selenium
2024-11-05
Solution to last week's in-class exercise
APIs
A simple API - Trivia API
More complex with Credentials: Yelp
Credentials + Wrapper: YouTube
Selenium for Dynamic Websites and Audit Studies
Scraping: grab information available on websites (Last Week)
Leveraging the structure of a website to grab its contents
Using a programming environment (such as R, Python, Java, etc.) to systematically extract that content.
Accomplishing the above in an “unobtrusive” and legal way.
APIs: Data transfer portals built by companies to allow people to access data programmatically. (Today)
API: Application Programming Interface
How does it work: an online server created to facilitate information exchange between data users and data holders
Use Cases:
We will focus today on the first component.
An API is just a URL. See the example below:
http://mywebsite.com/endpoint?key&param_1&param_2
Main Components:
http://mywebsite.com/: the API root, i.e., the domain of your API
endpoint: a server route for retrieving specific data from the API
key: credentials that some websites require you to create before you can query the API
?param_1&param_2: filters that you can pass in API requests. We call these filters parameters (see the sketch after this list)
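To make this concrete, here is a minimal Python sketch of how these pieces fit together; the root, endpoint, and parameter names are the hypothetical placeholders from the URL above, and the key is a stand-in for a real credential.

    import requests

    root = "http://mywebsite.com"    # API root: the domain of the API (placeholder)
    endpoint = "/endpoint"           # endpoint: the server route we want
    params = {
        "key": "MY_API_KEY",         # credential, if the API requires one (placeholder)
        "param_1": "value_1",        # filter parameters
        "param_2": "value_2",
    }

    # Build (without sending) the request to inspect the full URL requests would use
    prepared = requests.Request("GET", root + endpoint, params=params).prepare()
    print(prepared.url)  # http://mywebsite.com/endpoint?key=MY_API_KEY&param_1=value_1&param_2=value_2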
In order to work with APIs, we need tools to access the web. In Python, the most common library for making requests and working with APIs is the requests library.
get(): to retrieve information from the API
post(): to send information to the API – think of sending text to ChatGPT for classification (examples of both below)
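A minimal sketch of both methods. It uses httpbin.org, a public request-echo service that is not part of the course material, so the calls can be run as-is.

    import requests

    # get(): retrieve information; parameters travel in the query string
    r = requests.get("https://httpbin.org/get", params={"q": "hello"})
    print(r.status_code)     # 200 if the request succeeded
    print(r.json()["args"])  # the parameters the server received

    # post(): send information; here we send a JSON body to the server
    r = requests.post("https://httpbin.org/post", json={"text": "classify me"})
    print(r.json()["json"])  # the body the server received back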
When using an API, our work will often involve the following steps: build the request (root + endpoint + parameters), send it, check that it succeeded, and parse the returned JSON. The sketch below walks through these steps.
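Here is a sketch of these steps end to end, assuming the Open Trivia Database (opentdb.com), one free trivia API that requires no authentication.

    import requests

    # Step 1: build the request (root + endpoint + parameters)
    url = "https://opentdb.com/api.php"
    params = {"amount": 3, "type": "multiple"}  # three multiple-choice questions

    # Step 2: send the request
    response = requests.get(url, params=params)

    # Step 3: check that the request succeeded (raises an error otherwise)
    response.raise_for_status()

    # Step 4: parse the JSON payload into Python objects
    data = response.json()
    for question in data["results"]:
        print(question["question"])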
We will work with three different types of APIs today:
Trivia API: API that does not require authentication and does not have a wrapper
Yelp API: API that does require authentication and does not have a wrapper (see the credentialed-request sketch after this list)
YouTube API: API that does require authentication and does have a wrapper
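For the credentialed case, here is a sketch assuming the Yelp Fusion API's v3 business-search endpoint; the key is a placeholder for a credential you would create on Yelp's developer site.

    import requests

    API_KEY = "YOUR_YELP_API_KEY"  # placeholder: register with Yelp to obtain a real key
    headers = {"Authorization": f"Bearer {API_KEY}"}  # credentials travel in a header here
    params = {"term": "coffee", "location": "New York, NY", "limit": 5}

    response = requests.get("https://api.yelp.com/v3/businesses/search",
                            headers=headers, params=params)
    response.raise_for_status()
    for business in response.json()["businesses"]:
        print(business["name"], business["rating"])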
Wrapper: a set of functions or methods (a full library) that provides a simplified interface for interacting with an underlying API.
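For instance, the google-api-python-client library is one common wrapper for the YouTube API (assuming that is the wrapper used in class): instead of building URLs by hand, we call Python methods, and the library builds the requests and parses the JSON for us.

    from googleapiclient.discovery import build  # pip install google-api-python-client

    # The wrapper hides the raw URLs behind Python method calls (key is a placeholder)
    youtube = build("youtube", "v3", developerKey="YOUR_YOUTUBE_API_KEY")

    # Search for videos matching a query
    response = youtube.search().list(part="snippet", q="data science", maxResults=3).execute()
    for item in response["items"]:
        print(item["snippet"]["title"])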
Static web pages: when the content in the browser matches the HTML source code. Collect data via requests and an HTML parser, as we did last week.
Dynamic web pages: when the content we are viewing in our browser does not match the content we see in the HTML source code we are retrieving from the site. How to scrape?
Selenium is an open-source tool for automating web browsers.
works by automating a browser to execute JavaScript and render the page
allows us to interact with web pages programmatically
collects data as we see it in the browser (after the corresponding JavaScript has run)
Drawback: Selenium is a bit of a pain to install, so part of your homework will simply be to go over the Selenium setup! A minimal usage sketch follows.
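Once the setup works, a minimal sketch looks like this (Selenium 4+, which can download a Chrome driver automatically; example.com is just a stand-in page).

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()        # launches a real, automated Chrome browser
    driver.get("https://example.com")  # the browser fetches the page and runs its JavaScript

    # Grab an element as it appears in the rendered page, not the raw source
    heading = driver.find_element(By.TAG_NAME, "h1")
    print(heading.text)

    driver.quit()                      # always close the browser when finished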