# Call packages using pacman
#install.packages("pacman")
pacman::p_load(here, jsonlite, tidyverse, academictwitteR)
Workshop Analyzing Social Media Data I: Twitter Data
Introduction
Social media data come in many flavors and from many different sources. For this reason, it is not possible to cover many different types of social media data in depth in a one-hour workshop. In addition, social media companies give researchers different levels of access to their data.
There is no “one way to rule them all” when it comes to working with social media data.
For this reason, I decided to start this workshop with the most used and most easily accessible social media data for researchers: Twitter data. However, even though we will spend most of our time working with Twitter data, several of the techniques I hope to cover here are not restricted to Twitter. They are general techniques and can be applied to pretty much any type of social media data.
With Twitter data, we will cover:
- Analyzing and understanding network structure of social media data.
- Some extra endpoints from the Twitter API (timelines, friends list, among others).
To save some time, I stored all the data I am using in this tutorial here. This notebook should run if you place all the data in the same working directory.
Getting Access to the Twitter APIs.
In order to get access to Twitter data, you need to first apply for a Twitter developer account. Once your developer application has been approved, you get access to the standard product track by default. However, if you are an academic researcher and meet certain requirements, you can apply to the academic research product track which will give you elevated access to the Twitter API v2 including access to historical public Tweets for free.
We will be using the academic research access to the Twitter V2 API.
Standard Access
- Search for Tweets from the last 7 days by specifying queries using supported operators (more on building queries in later sections)
- Stream Tweets in real-time as they are happening by specifying rules to filter for Tweets that you are interested in.
- Get Tweets from a user’s timeline (up to 3200 most recent Tweets)
- Build the full Tweet objects from a Tweet ID, or a set of Tweet IDs
- Look up follower relationships
These are just some examples of what you can get from the standard product track, relevant to academics.
Currently, you can get up to 500,000 Tweets per month using the standard product track. This limit does not apply to the sampled stream endpoint, which gives a 1% sample of public Tweets in real time.
Academic Research product track
This track includes:
- Ability to get historical Tweets from the entire archive of public conversation on Twitter, dating back to 2006 (using the full-archive search endpoint)
- Higher monthly Tweet volume cap of 10 million Tweets per month
- More advanced filter options to return relevant data, including a longer query length, support for more concurrent rules (for filtered stream endpoint), and additional operators that are only supported in this product track (more on this later)
For a complete list of available endpoints in the V2 API, check out the Twitter API documentation.
For a complete course on accessing the Twitter V2 API, I suggest you take a look at the 101 course prepared by the Twitter API team.
Twitter Data: Presidential Elections in Brazil.
To collect Twitter data from the V2 API, we will use the academictwitteR package developed by Chris Barrie. For R users, this is an amazing package because it allows you to easily query the API and process the data into an easily readable format for R.
If you prefer Python, I suggest you check the Twarc library.
Access tweets from the archive
Let’s start collecting some data using a textual query through the search endpoint. The academic access to the V2 API allows you to query the Twitter archive and get access to data way back in time.
Let’s start collecting some data about the recent presidential elections in Brazil.
# Using academictwitteR to add your key
set_bearer()

# Collect data
tweets <- get_all_tweets(
  query = "(eleicoes2022 OR lula OR bolsonaro OR ciro OR tebet)", # query
  start_tweets = "2022-10-01T00:00:00Z", # start time
  end_tweets = "2022-10-04T00:00:00Z", # end time
  file = "br_elections", # file to save
  data_path = "data_br/", # folder where all data will be stored as jsons
  n = 200000, # number of tweets
  lang = "pt"
)
This data is stored as a series of smaller jsons. The academictwitteR package has a specific function, bind_tweets, to easily combine these json files into a single object in the tidy format.
If you are collecting data through other packages or accessing the API directly, you get long json files as responses. Jsons are basically sets of nested lists and can be tricky to clean, so the bind_tweets function is very handy.
Another option, which is very common if you have a consistent data pipeline, is to build your own cleaning function that extracts the data and variables in the format your project needs, as in the sketch below.
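Here is a minimal sketch of such a custom cleaning step, assuming the raw json responses were saved to data_br/ as above; the file-name pattern and the selected columns are illustrative, so adapt them to your own pipeline.
library(jsonlite)
library(dplyr)
library(purrr)

# data files written by academictwitteR to the data_path folder
# (the "data_" file-name pattern is an assumption; check your folder)
json_files <- list.files("data_br/", pattern = "^data_", full.names = TRUE)

# keep only the fields this hypothetical project needs
my_clean <- function(file) {
  fromJSON(file) %>%
    as_tibble() %>%
    select(id, author_id, created_at, text, lang)
}

tweets_small <- map_dfr(json_files, my_clean)
For this workshop, we will stick with bind_tweets, as below.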
# data processing
tweets_tidy <- bind_tweets("./data_br", output_format = "tidy")

glimpse(tweets_tidy)
Rows: 200,062
Columns: 31
$ tweet_id <chr> "1577066499042222080", "1577066498882908161", "…
$ user_username <chr> "cainsworts", "nandamattosbh", "fran51995877", …
$ text <chr> "RT @jinS2me: NAO SE MATEM o lula precisa de vo…
$ possibly_sensitive <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ conversation_id <chr> "1577066499042222080", "1577066498882908161", "…
$ lang <chr> "pt", "pt", "pt", "pt", "pt", "pt", "pt", "pt",…
$ source <chr> "Twitter for Android", "Twitter for Android", "…
$ created_at <chr> "2022-10-03T22:42:08.000Z", "2022-10-03T22:42:0…
$ author_id <chr> "809471355116781568", "42269111", "133561842785…
$ in_reply_to_user_id <chr> NA, NA, NA, NA, "1524416437091192832", NA, NA, …
$ user_name <chr> "felicité", "Fernanda Mattos", "fran", "LULA 13…
$ user_created_at <chr> "2016-12-15T18:53:05.000Z", "2009-05-24T19:47:0…
$ user_location <chr> "⚠️ edtwt", "BH", NA, "konoha", NA, NA, NA, NA, …
$ user_verified <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ user_description <chr> "all I see is what I should be. happier, pretti…
$ user_protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ user_pinned_tweet_id <chr> "1578393098018820098", "1386402147642707968", N…
$ user_profile_image_url <chr> "https://pbs.twimg.com/profile_images/156429933…
$ user_url <chr> NA, NA, NA, "https://t.co/ZfQtIL3QFx", NA, NA, …
$ retweet_count <int> 127, 2, 3666, 42709, 0, 3562, 0, 18460, 1376, 0…
$ like_count <int> 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ quote_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ user_tweet_count <int> 2197, 38394, 2943, 12704, 5244, 58983, 360, 279…
$ user_list_count <int> 1, 16, 0, 0, 0, 0, 0, 10, 0, 2, 2, 0, 0, 1, 0, …
$ user_followers_count <int> 220, 1386, 31, 243, 1389, 530, 54, 384, 707, 28…
$ user_following_count <int> 148, 219, 78, 214, 2088, 181, 81, 195, 1720, 45…
$ sourcetweet_type <chr> "retweeted", "retweeted", "retweeted", "retweet…
$ sourcetweet_id <chr> "1576742666754134017", "1576984204713172992", "…
$ sourcetweet_text <chr> "NAO SE MATEM o lula precisa de votos no SEGUND…
$ sourcetweet_lang <chr> "pt", "pt", "pt", "pt", NA, "pt", NA, "pt", "pt…
$ sourcetweet_author_id <chr> "1534722153819643906", "18880621", "26752656", …
This is a lot of data, but it is still only a portion of what comes through the API. If you need everything, you can process the full raw responses, as below. This is a really nice feature of the package: the json data is stored in smaller pieces, which makes it easier for you to process later.
# examining the data
tweets_raw <- bind_tweets("./data_br", output_format = "raw")

str(tweets_raw, max.level = 1)
List of 27
$ tweet.entities.mentions : tibble [215,630 × 5] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.annotations : tibble [386,283 × 6] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.urls : tibble [32,703 × 12] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.hashtags : tibble [10,405 × 4] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.cashtags : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.retweet_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.reply_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.like_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.quote_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.attachments.media_keys : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.attachments.poll_ids : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.geo.place_id : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.geo.coordinates : tibble [200,062 × 3] (S3: tbl_df/tbl/data.frame)
$ tweet.withheld.country_codes : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.withheld.copyright : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.edit_history_tweet_ids : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.referenced_tweets : tibble [184,799 × 3] (S3: tbl_df/tbl/data.frame)
$ tweet.main :'data.frame': 200062 obs. of 9 variables:
$ user.public_metrics.followers_count: tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.public_metrics.following_count: tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.public_metrics.tweet_count : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.public_metrics.listed_count : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.entities.url : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.entities.description : tibble [337,098 × 5] (S3: tbl_df/tbl/data.frame)
$ user.withheld.country_codes : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.main :'data.frame': 337098 obs. of 11 variables:
$ sourcetweet.main :'data.frame': 132813 obs. of 16 variables:
Network Analysis with Twitter Data
There are many different ways you can analyze Twitter data. You can analyze the text, the images, the geolocations, the links shared, among many other things.
A favorite way for computational social scientists to analyze social media data is to look at user connections using some sort of network model. This is not limited to Twitter data: pretty much any social media application comprises some sort of network structure.
So let’s start with some basics of network analysis in R.
A network has two core elements: nodes and edges. On Twitter this means:
- Nodes are Twitter users.
- Edges are any sort of connection these users make: a reply, a friendship, or, most commonly, a retweet.
We will use retweets and quote tweets to give some examples of network analysis in R, relying on the igraph package. This package has many different features, but it stores data in a very distinct way, so let’s work through it.
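If you have never used igraph before, here is a tiny toy example (not part of the workshop data) of how it represents a directed network built from a two-column edge list; this is exactly the structure we will create from retweets in the steps below.
library(igraph)

# toy edge list: who retweets whom (made-up accounts)
toy_edges <- data.frame(from = c("ana", "bia", "bia"),
                        to   = c("bia", "carla", "ana"))

# nodes are inferred from the edge list; edges keep their direction
toy_net <- graph_from_data_frame(toy_edges, directed = TRUE)
summary(toy_net)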
Building a Network
Step 1: Filter Nodes
# Filter retweets
tweets_tidy_rt <- tweets_tidy %>%
  filter(!is.na(sourcetweet_type))
dim(tweets_tidy_rt)
[1] 154583 31
# Visualize the data
tweets_tidy_rt %>%
  select(user_username, sourcetweet_author_id) %>%
head()
# A tibble: 6 × 2
user_username sourcetweet_author_id
<chr> <chr>
1 cainsworts 1534722153819643906
2 nandamattosbh 18880621
3 fran51995877 26752656
4 juliam3ndes 863806721696858112
5 caralho_modesti 2876592790
6 carolfcarneiro 44481447
Step 2: Create an edge list
# Create an edge list
# using the user id on both sides here to keep the same unit
data <- cbind(tweets_tidy_rt$author_id, tweets_tidy_rt$sourcetweet_author_id)

dim(data)
[1] 154583 2
head(data)
[,1] [,2]
[1,] "809471355116781568" "1534722153819643906"
[2,] "42269111" "18880621"
[3,] "1335618427852124163" "26752656"
[4,] "839520909807521793" "863806721696858112"
[5,] "136448124" "2876592790"
[6,] "108719485" "44481447"
Notice that we have two different types of users here. Hubs are the users who retweet, and authorities are the users who receive retweets.
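For instance, a quick way to see both roles in the edge list we just built (a small sketch; column 1 holds the hubs, column 2 the authorities):
# hubs retweet (first column); authorities are retweeted (second column)
length(unique(data[, 1])) # number of distinct hubs
length(unique(data[, 2])) # number of distinct authorities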
Step 3: Create your network structure
pacman::p_load(igraph)

# Create an empty network
net <- graph.empty()

# Add nodes
net <- add.vertices(net,
                    length(unique(c(data))), # number of nodes
                    name = as.character(unique(c(data)))) # unique names

# Add edges
net <- add.edges(net, t(data))
# summary
summary(net)
IGRAPH d8e2b34 DN-- 79886 154583 --
+ attr: name (v/c)
Your output:
- Igraph object
- 79886 unique nodes
- 154583 edges
Step 4: Add information to your network object
Information comes in two flavors: at the edge level (E(object)) and at the node level (V(object)). Let’s see how it works.
library(urltools)
# Edges
E(net)$text <- tweets_tidy_rt$text
E(net)$idauth <- tweets_tidy_rt$sourcetweet_author_id
E(net)$namehub <- tweets_tidy_rt$user_username
# Capturing hashtags
E(net)$hash <- str_extract_all(tweets_tidy_rt$text, "#\\S+")
# grab expanded and unwound_url
entities <- tweets_raw$tweet.entities.urls

tidy_entities <- entities %>%
  # get columns we need
  select(tweet_id, unwound_url) %>%
  # extract domains
  mutate(unwound_url = domain(unwound_url)) %>%
  # remove NAs and combine multiple links
  filter(!is.na(unwound_url)) %>%
  group_by(tweet_id) %>%
  summarise(domain = paste0(unwound_url, collapse = " -- "))

# Merge back with id
tweets_tidy_rt <- left_join(tweets_tidy_rt, tidy_entities)
# add to the network
E(net)$domain <- tweets_tidy_rt$domain
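Before moving on, a quick sanity check on these edge attributes can be helpful. For example, a small sketch tabulating the most common hashtags and domains circulating in the retweet network:
# most frequent hashtags attached to the retweets (hash is stored as a list attribute)
sort(table(unlist(E(net)$hash)), decreasing = TRUE)[1:10]

# most frequent domains shared in the retweeted tweets
sort(table(E(net)$domain), decreasing = TRUE)[1:10]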
Network Statistics, Communities and Layout
Two very common concepts in network science are in-degree and out-degree. In-degree refers to how many incoming links a user has; in our case, it shows how many retweets the user has received. Out-degree is the opposite: here it means how many retweets the user has given.
A user is called an authority when their in-degree is high, that is, when they receive many retweets from others. We call a user a hub when their out-degree is high, meaning they retweet very often.
Accounts that behave like bots usually show a huge imbalance between in-degree and out-degree: nobody retweets them, they just retweet a lot, and usually very quickly.
# Calculate in degree and out degree
V(net)$outdegree<-degree(net, mode="out")
V(net)$indegree<-degree(net, mode="in")
summary(net)
IGRAPH d8e2b34 DN-- 79886 154583 --
+ attr: name (v/c), outdegree (v/n), indegree (v/n), text (e/c), idauth
| (e/c), namehub (e/c), hash (e/x), domain (e/c)
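A small sketch of how to use these attributes: list the accounts with the highest in-degree (the main authorities) and the highest out-degree (the main hubs). Keep in mind that the node names here are author ids, since that is what we used to build the edge list.
# top 10 authorities: most retweeted accounts (node names are author ids)
V(net)$name[order(V(net)$indegree, decreasing = TRUE)][1:10]

# top 10 hubs: accounts that retweet the most
V(net)$name[order(V(net)$outdegree, decreasing = TRUE)][1:10]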
Layout
Networks live in a latent space. To visualize them, we usually rely on layout algorithms that optimize some property of the network and give us coordinates for plotting. You can try out different algorithms, but Fruchterman-Reingold is a popular choice.
l <- layout_with_fr(net, grid = c("nogrid"))
#saveRDS(l, "layout.rds")

head(l)
[,1] [,2]
[1,] -102.96401 216.91269
[2,] -178.82523 158.25089
[3,] 52.34920 81.01076
[4,] -11.32539 -139.13169
[5,] 51.89335 -56.94802
[6,] 29.44408 -99.62837
Communities
Community detection is a big part of network analysis. The idea of these techniques is to find clusters of densely connected nodes across your entire network. Community detection is a huge subfield of network science; here is a nice review piece by Porter et al.
My take is that for large networks, like the ones we usually deal with in social media, most of the algorithms will do the job you need, which in general is to identify the core communities in a network. An important point is to always validate these communities with some qualitative analysis of the results (we do a quick check of this kind after computing the community sizes below).
We will use a random-walk algorithm (walktrap) for community detection.
my.com.fast <- walktrap.community(net)

str(my.com.fast, max.level = 1)
Class 'communities' hidden list of 6
$ merges : num [1:77186, 1:2] 58675 61627 60095 58720 58731 ...
$ modularity: num [1:79886] 0 -0.00127 -0.00127 -0.00126 -0.00125 ...
$ membership: num [1:79886] 1689 2004 11 169 175 ...
$ names : chr [1:79886] "809471355116781568" "42269111" "1335618427852124163" "839520909807521793" ...
$ vcount : int 79886
$ algorithm : chr "walktrap"
Add the layout and membership to your igraph object.
V(net)$l1 <- l[,1]
V(net)$l2 <- l[,2]
V(net)$membership <- my.com.fast$membership
What are the largest communities?
comunidades <- data_frame(membership = V(net)$membership)

comunidades %>%
  count(membership) %>%
  arrange(desc(n)) %>%
  top_n(5)
# A tibble: 5 × 2
membership n
<dbl> <int>
1 11 18272
2 4 18077
3 8 7951
4 13 2923
5 2 1165
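As a quick, informal validation of what these communities represent, one option is to look at the most retweeted accounts inside each large community and inspect their usernames qualitatively. The sketch below uses the community ids from the table above (11, 4, and 8); usernames may come back as NA for accounts that only appear as retweet sources in our sample.
# node-level table: community membership and in-degree (names are author ids)
node_info <- tibble(author_id = V(net)$name,
                    membership = V(net)$membership,
                    indegree = V(net)$indegree)

# top 5 most retweeted accounts in each of the three largest communities
node_info %>%
  filter(membership %in% c(11, 4, 8)) %>%
  group_by(membership) %>%
  slice_max(indegree, n = 5) %>%
  left_join(distinct(tweets_tidy, author_id, user_username),
            by = "author_id")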
Visualizing communities
# A function to plot the network with a density overlay. Nice to visualize as well.
my.den.plot <- function(l = l, new.color = new.color, ind = ind, legend, color){
  library(KernSmooth)
  est <- bkde2D(l, bandwidth = c(10, 10))
  plot(l, cex = log(ind + 1)/4, col = new.color, pch = 16, xlim = c(-160, 140), ylim = c(-140, 160), xlab = "", ylab = "", axes = FALSE)
  legend("topright", c(legend[1], legend[2], legend[3]), pch = 17:19, col = c(color[1], color[2], color[3]))
  contour(est$x1, est$x2, est$fhat, col = gray(.6), add = TRUE)
  #text(-140, 115, paste("ENCG: ", ENCG, sep = ""), cex = 1, srt = 0)
}
# Add colors for each community
# Building an empty container
temp <- rep(1, length(V(net)$membership))
new.color <- rep("white", length(V(net)$membership))
new.color[V(net)$membership==11] <- "Yellow" ####
new.color[V(net)$membership==8] <- "pink" ####
new.color[V(net)$membership==4] <- "red" ####
# Save as a variable in the network object
V(net)$new.color <- new.color
# Plot
my.den.plot(l = cbind(V(net)$l1, V(net)$l2), new.color = V(net)$new.color, ind = V(net)$indegree,
            legend = c("Pro-Bolsonaro", "Anti-Bolsonaro I", "Anti-Bolsonaro II"),
            color = c("Yellow", "red", "pink"))
Other API endpoints
Most of our work with the Twitter API relies on querying it with search terms. For this reason, the search endpoint (and the filter endpoint for live data collection) is the most popular.
However, there are a few other endpoints from the Twitter API that can also be very useful for research purposes. Let’s walk through them briefly.
Getting user id
Imagine a research project in which you have the Twitter handles of political elites and you want to collect their Twitter data. The first step is to collect their user ids.
# getting some Twitter ids
pelosi <- get_user_id("SpeakerPelosi")
Getting whom a user follows
pelosi_network <- get_user_following(pelosi)
Processing 15764644
Total data points: 429
This is the last page for 15764644 : finishing collection.
glimpse(pelosi_network)
Rows: 429
Columns: 14
$ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ profile_image_url <chr> "https://pbs.twimg.com/profile_images/14707589214261…
$ username <chr> "RepShontelBrown", "RepAlGreen", "RepJoeGarcia", "Re…
$ id <chr> "1456381091598700556", "156333623", "937801969", "11…
$ public_metrics <df[,4]> <data.frame[26 x 4]>
$ entities <df[,2]> <data.frame[26 x 2]>
$ name <chr> "Rep. Shontel Brown", "Congressman Al Green", "Re…
$ url <chr> "https://t.co/v695zCnmxN", "https://t.co/4xG26ktT…
$ pinned_tweet_id <chr> "1463532439456952323", "1422610928756043778", NA, NA…
$ verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ created_at <chr> "2021-11-04T22:01:33.000Z", "2010-06-16T17:20:23.000…
$ description <chr> "Representative for Ohio’s Eleventh Congressional Di…
$ location <chr> NA, "Houston, TX & Washington, DC", "Miami, Florida"…
$ from_id <chr> "15764644", "15764644", "15764644", "15764644", "157…
Estimate user ideology
#devtools::install_github("pablobarbera/twitter_ideology/pkg/tweetscores")
library(tweetscores)
results <- estimateIdeology("SpeakerPelosi", pelosi_network$id, verbose = FALSE)
plot(results)
User timeline
pelosi_tl <- get_user_timeline(pelosi,
                               start_tweets = "2022-01-01T00:00:00Z",
                               end_tweets = "2022-10-22T00:00:00Z",
                               n = 100) #limit
user: 15764644
Total pages queried: 1 (tweets captured this page: 100).
Total tweets captured now reach 100 : finishing collection.
glimpse(pelosi_tl)
Rows: 100
Columns: 15
$ created_at <chr> "2022-10-19T02:18:33.000Z", "2022-10-18T22:05:0…
$ text <chr> "Anna May Wong was a dazzling, trailblazing tal…
$ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en",…
$ edit_history_tweet_ids <list> "1582556778608218118", "1582492989015805953", …
$ conversation_id <chr> "1582556778608218118", "1582492989015805953", "…
$ context_annotations <list> [<data.frame[7 x 2]>], [<data.frame[14 x 2]>],…
$ entities <df[,4]> <data.frame[26 x 4]>
$ id <chr> "1582556778608218118", "1582492989015805953"…
$ author_id <chr> "15764644", "15764644", "15764644", "15764644",…
$ referenced_tweets <list> [<data.frame[1 x 2]>], <NULL>, <NULL>, <NULL>, …
$ public_metrics <df[,4]> <data.frame[26 x 4]>
$ possibly_sensitive <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ source <chr> "Twitter for iPhone", "Twitter Web App", "Tw…
$ attachments <df[,1]> <data.frame[26 x 1]>
$ in_reply_to_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "15764644", NA,…
Tweets liked by a user
pelosi_likes <- get_liked_tweets(pelosi) #limit
Processing 15764644
Total data points: 20
Total data points: 21
This is the last page for 15764644 : finishing collection.
glimpse(pelosi_likes) # she mostly liked her own tweets
Rows: 21
Columns: 16
$ id <chr> "1554482274430844928", "1554897362299981824", "…
$ edit_history_tweet_ids <list> "1554482274430844928", "1554897362299981824", …
$ created_at <chr> "2022-08-02T15:00:29.000Z", "2022-08-03T18:29:5…
$ public_metrics <df[,4]> <data.frame[21 x 4]>
$ text <chr> "Our delegation’s visit to Taiwan honors Ame…
$ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en",…
$ context_annotations <list> [<data.frame[5 x 2]>], [<data.frame[9 x 2]>], […
$ author_id <chr> "15764644", "15764644", "15764644", "15764644"…
$ conversation_id <chr> "1554482274430844928", "1554897362299981824", "…
$ source <chr> "Twitter for iPhone", "Twitter Media Studio", "…
$ entities <df[,4]> <data.frame[21 x 4]>
$ possibly_sensitive <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ attachments <df[,1]> <data.frame[21 x 1]>
$ in_reply_to_user_id <chr> NA, NA, "15764644", NA, NA, "15764644", "157…
$ referenced_tweets <list> <NULL>, <NULL>, [<data.frame[1 x 2]>], <NULL>, …
$ from_id <chr> "15764644", "15764644", "15764644", "1576464…