# Call packages using pacman
#install.packages("pacman")
pacman::p_load(here, jsonlite, tidyverse, academictwitteR)
Workshop Analyzing Social Media Data I: Twitter Data
Introduction
Social media data come in many flavors and from many different sources. For this reason, it is not possible to cover many different types of social media data in depth in a one-hour workshop. In addition, social media companies give researchers different levels of access to their data.
There is no “one way to rule them all” when it comes to working with social media data.
For this reason, I decided to start this workshop with the most used and most easily accessible social media data for researchers: Twitter data. However, even though we will spend most of our time working with Twitter data, several of the techniques I hope to cover here are not restricted to Twitter. They are general techniques and can be applied to pretty much any type of social media data.
With Twitter data, we will cover:
- Analyzing and understanding network structure of social media data.
- Some extra endpoints from the Twitter API (timelines, friends list, among others).
To save some time, I stored all the data I am using in this tutorial here. This notebook should run if you place all the data in the same working directory.
Getting Access to the Twitter APIs.
In order to get access to Twitter data, you need to first apply for a Twitter developer account. Once your developer application has been approved, you get access to the standard product track by default. However, if you are an academic researcher and meet certain requirements, you can apply to the academic research product track which will give you elevated access to the Twitter API v2 including access to historical public Tweets for free.
We will be using the academic research access to the Twitter V2 API.
Standard Access
- Search for Tweets from the last 7 days by specifying queries using supported operators (more on building queries in later sections)
- Stream Tweets in real-time as they are happening by specifying rules to filter for Tweets that you are interested in.
- Get Tweets from a user’s timeline (up to 3200 most recent Tweets)
- Build the full Tweet objects from a Tweet ID, or a set of Tweet IDs
- Look up follower relationships
These are just some examples of what you can get from the standard product track, relevant to academics.
Currently, you can get up to 500,000 Tweets per month using the standard product track. This limit does not apply to the sampled stream endpoint, which gives a 1% sample of public Tweets in real time.
Academic Research product track
This track includes:
- Ability to get historical Tweets from the entire archive of public conversation on Twitter, dating back to 2006 (using the full-archive search endpoint)
- Higher monthly Tweet volume cap of 10 million Tweets per month
- More advanced filter options to return relevant data, including a longer query length, support for more concurrent rules (for filtered stream endpoint), and additional operators that are only supported in this product track (more on this later)
For a complete list of available endpoints in the V2 API, check out the Twitter API documentation.
For a complete course on accessing the Twitter V2 API, I suggest you take a look at the 101 course prepared by the Twitter API team.
Twitter Data: Presidential Elections in Brazil.
To collect Twitter data from the V2 API, we will use the academictwitteR package developed by Chris Barrie. For R users, this is an amazing package because it allows you to easily query the API and process the data into an easily readable format for R.
If you prefer Python, I suggest you check the Twarc library.
Access tweets from the archive
Let’s start collecting some data using a textual query through the search endpoint. The academic access to the V2 API allows you to query the Twitter archive and get access to data way back in time.
Let’s start collecting some data about the recent presidential elections in Brazil.
# Using academictwitteR to add your key
set_bearer()

# Collect data
tweets <- get_all_tweets(
  query = "(eleicoes2022 OR lula OR bolsonaro OR ciro OR tebet)", # query
  start_tweets = "2022-10-01T00:00:00Z", # start time
  end_tweets = "2022-10-04T00:00:00Z", # end time
  file = "br_elections", # file to save
  data_path = "data_br/", # folder where all data will be stored as jsons
  n = 200000, # number of tweets
  lang = "pt"
)
This data is stored as a series of smaller jsons. The academictwitteR package has a specific function, bind_tweets, to easily combine these json files into a single object in the tidy format.
If you are collecting data through other packages or accessing the API directly, you get long json files as responses. Jsons are basically sets of nested lists and can be tricky to clean, so the bind_tweets function is very handy.
Another option, which is very common if you have a consistent data pipeline, is to build your own cleaning function that extracts the data and variables in the format your project needs, as in the sketch below.
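Here is a minimal sketch of such a custom cleaning step, assuming the raw json responses were saved to data_br/ as above; the file-name pattern and the selected columns are illustrative, so adapt them to your own pipeline.
library(jsonlite)
library(dplyr)
library(purrr)

# data files written by academictwitteR to the data_path folder
# (the "data_" file-name pattern is an assumption; check your folder)
json_files <- list.files("data_br/", pattern = "^data_", full.names = TRUE)

# keep only the fields this hypothetical project needs
my_clean <- function(file) {
  fromJSON(file) %>%
    as_tibble() %>%
    select(id, author_id, created_at, text, lang)
}

tweets_small <- map_dfr(json_files, my_clean)
For this workshop, we will stick with bind_tweets, as below.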
# data processing
tweets_tidy <- bind_tweets("./data_br", output_format = "tidy")

glimpse(tweets_tidy)
Rows: 200,062
Columns: 31
$ tweet_id <chr> "1577066499042222080", "1577066498882908161", "…
$ user_username <chr> "cainsworts", "nandamattosbh", "fran51995877", …
$ text <chr> "RT @jinS2me: NAO SE MATEM o lula precisa de vo…
$ possibly_sensitive <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ conversation_id <chr> "1577066499042222080", "1577066498882908161", "…
$ lang <chr> "pt", "pt", "pt", "pt", "pt", "pt", "pt", "pt",…
$ source <chr> "Twitter for Android", "Twitter for Android", "…
$ created_at <chr> "2022-10-03T22:42:08.000Z", "2022-10-03T22:42:0…
$ author_id <chr> "809471355116781568", "42269111", "133561842785…
$ in_reply_to_user_id <chr> NA, NA, NA, NA, "1524416437091192832", NA, NA, …
$ user_name <chr> "felicité", "Fernanda Mattos", "fran", "LULA 13…
$ user_created_at <chr> "2016-12-15T18:53:05.000Z", "2009-05-24T19:47:0…
$ user_location <chr> "⚠️ edtwt", "BH", NA, "konoha", NA, NA, NA, NA, …
$ user_verified <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ user_description <chr> "all I see is what I should be. happier, pretti…
$ user_protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ user_pinned_tweet_id <chr> "1578393098018820098", "1386402147642707968", N…
$ user_profile_image_url <chr> "https://pbs.twimg.com/profile_images/156429933…
$ user_url <chr> NA, NA, NA, "https://t.co/ZfQtIL3QFx", NA, NA, …
$ retweet_count <int> 127, 2, 3666, 42709, 0, 3562, 0, 18460, 1376, 0…
$ like_count <int> 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ quote_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ user_tweet_count <int> 2197, 38394, 2943, 12704, 5244, 58983, 360, 279…
$ user_list_count <int> 1, 16, 0, 0, 0, 0, 0, 10, 0, 2, 2, 0, 0, 1, 0, …
$ user_followers_count <int> 220, 1386, 31, 243, 1389, 530, 54, 384, 707, 28…
$ user_following_count <int> 148, 219, 78, 214, 2088, 181, 81, 195, 1720, 45…
$ sourcetweet_type <chr> "retweeted", "retweeted", "retweeted", "retweet…
$ sourcetweet_id <chr> "1576742666754134017", "1576984204713172992", "…
$ sourcetweet_text <chr> "NAO SE MATEM o lula precisa de votos no SEGUND…
$ sourcetweet_lang <chr> "pt", "pt", "pt", "pt", NA, "pt", NA, "pt", "pt…
$ sourcetweet_author_id <chr> "1534722153819643906", "18880621", "26752656", …
This is a lot of data, but it is still only a portion of what comes through the API. If you need everything, you can process the full raw responses, as below. This is a really nice feature of the package: the json data is stored in smaller pieces, which makes it easier for you to process later.
# examining the data
tweets_raw <- bind_tweets("./data_br", output_format = "raw")

str(tweets_raw, max.level = 1)
List of 27
$ tweet.entities.mentions : tibble [215,630 × 5] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.annotations : tibble [386,283 × 6] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.urls : tibble [32,703 × 12] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.hashtags : tibble [10,405 × 4] (S3: tbl_df/tbl/data.frame)
$ tweet.entities.cashtags : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.retweet_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.reply_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.like_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.public_metrics.quote_count : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.attachments.media_keys : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.attachments.poll_ids : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.geo.place_id : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.geo.coordinates : tibble [200,062 × 3] (S3: tbl_df/tbl/data.frame)
$ tweet.withheld.country_codes : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.withheld.copyright : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.edit_history_tweet_ids : tibble [200,062 × 2] (S3: tbl_df/tbl/data.frame)
$ tweet.referenced_tweets : tibble [184,799 × 3] (S3: tbl_df/tbl/data.frame)
$ tweet.main :'data.frame': 200062 obs. of 9 variables:
$ user.public_metrics.followers_count: tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.public_metrics.following_count: tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.public_metrics.tweet_count : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.public_metrics.listed_count : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.entities.url : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.entities.description : tibble [337,098 × 5] (S3: tbl_df/tbl/data.frame)
$ user.withheld.country_codes : tibble [337,098 × 2] (S3: tbl_df/tbl/data.frame)
$ user.main :'data.frame': 337098 obs. of 11 variables:
$ sourcetweet.main :'data.frame': 132813 obs. of 16 variables:
Network Analysis with Twitter Data
There are many different ways you can analyze Twitter data. You can analyze the text, the images, the geolocations, the links shared, among many other things.
A favorite way for computational social scientists to analyze social media data is to look at user connections using some sort of network model. This is not limited to Twitter data: pretty much any social media application comprises some sort of network structure.
So let’s start with some basics of network analysis in R.
A network has two core elements: nodes and edges. On Twitter this means:
- Nodes are Twitter users.
- Edges are any sort of connection these users make: a reply, a friendship, or, most commonly, a retweet.
We will use retweets and quote tweets to give some examples of network analysis in R, relying on the igraph package. This package has many different features, but it stores data in a very distinct way, so let’s work through it.
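If you have never used igraph before, here is a tiny toy example (not part of the workshop data) of how it represents a directed network built from a two-column edge list; this is exactly the structure we will create from retweets in the steps below.
library(igraph)

# toy edge list: who retweets whom (made-up accounts)
toy_edges <- data.frame(from = c("ana", "bia", "bia"),
                        to   = c("bia", "carla", "ana"))

# nodes are inferred from the edge list; edges keep their direction
toy_net <- graph_from_data_frame(toy_edges, directed = TRUE)
summary(toy_net)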
Building a Network
Step 1: Filter Nodes
# Filter retweets
tweets_tidy_rt <- tweets_tidy %>%
  filter(!is.na(sourcetweet_type))
dim(tweets_tidy_rt)
[1] 154583 31
# Visualize the data
tweets_tidy_rt %>%
  select(user_username, sourcetweet_author_id) %>%
head()
# A tibble: 6 × 2
user_username sourcetweet_author_id
<chr> <chr>
1 cainsworts 1534722153819643906
2 nandamattosbh 18880621
3 fran51995877 26752656
4 juliam3ndes 863806721696858112
5 caralho_modesti 2876592790
6 carolfcarneiro 44481447
Step 2: Create an edge list
# Create an edge list
# using the user id on both sides here to keep the same unit
data <- cbind(tweets_tidy_rt$author_id, tweets_tidy_rt$sourcetweet_author_id)

dim(data)
[1] 154583 2
head(data)
[,1] [,2]
[1,] "809471355116781568" "1534722153819643906"
[2,] "42269111" "18880621"
[3,] "1335618427852124163" "26752656"
[4,] "839520909807521793" "863806721696858112"
[5,] "136448124" "2876592790"
[6,] "108719485" "44481447"
Notice that we have two different types of users here. Hubs are the users who retweet, and authorities are the users who receive retweets.
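For instance, a quick way to see both roles in the edge list we just built (a small sketch; column 1 holds the hubs, column 2 the authorities):
# hubs retweet (first column); authorities are retweeted (second column)
length(unique(data[, 1])) # number of distinct hubs
length(unique(data[, 2])) # number of distinct authorities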
Step 3: Create your network structure
pacman::p_load(igraph)

# Create an empty network
net <- graph.empty()

# Add nodes
net <- add.vertices(net,
                    length(unique(c(data))), # number of nodes
                    name = as.character(unique(c(data)))) # unique names

# Add edges
net <- add.edges(net, t(data))
# summary
summary(net)
IGRAPH d8e2b34 DN-- 79886 154583 --
+ attr: name (v/c)
Your output:
- Igraph object
- 79886 unique nodes
- 154583 edges
Step 4: Add information to your network object
Information comes in two flavors: at the edge level (E(object)) and at the node level (V(object)). Let’s see how it works.
library(urltools)
# Edges
E(net)$text <- tweets_tidy_rt$text
E(net)$idauth <- tweets_tidy_rt$sourcetweet_author_id
E(net)$namehub <- tweets_tidy_rt$user_username
# Capturing hashtags
E(net)$hash <- str_extract_all(tweets_tidy_rt$text, "#\\S+")
# grab expanded and unwound_url
entities <- tweets_raw$tweet.entities.urls

tidy_entities <- entities %>%
  # get columns we need
  select(tweet_id, unwound_url) %>%
  # extract domains
  mutate(unwound_url = domain(unwound_url)) %>%
  # remove NAs and combine multiple links
  filter(!is.na(unwound_url)) %>%
  group_by(tweet_id) %>%
  summarise(domain = paste0(unwound_url, collapse = " -- "))

# Merge back with id
tweets_tidy_rt <- left_join(tweets_tidy_rt, tidy_entities)
# add to the network
E(net)$domain <- tweets_tidy_rt$domain
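Before moving on, a quick sanity check on these edge attributes can be helpful. For example, a small sketch tabulating the most common hashtags and domains circulating in the retweet network:
# most frequent hashtags attached to the retweets (hash is stored as a list attribute)
sort(table(unlist(E(net)$hash)), decreasing = TRUE)[1:10]

# most frequent domains shared in the retweeted tweets
sort(table(E(net)$domain), decreasing = TRUE)[1:10]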
Network Statistics, Communities and Layout
Two very common concepts in network science are in-degree and out-degree. In-degree refers to how many incoming links a user has; in our case, it shows how many retweets the user has received. Out-degree is the opposite: here it means how many retweets the user has given.
A user is called an authority when their in-degree is high, that is, when they receive many retweets from others. We call a user a hub when their out-degree is high, meaning they retweet very often.
Accounts that behave like bots usually show a huge imbalance between in-degree and out-degree: nobody retweets them, they just retweet a lot, and usually very quickly.
# Calculate in degree and out degree
V(net)$outdegree<-degree(net, mode="out")
V(net)$indegree<-degree(net, mode="in")
summary(net)
IGRAPH d8e2b34 DN-- 79886 154583 --
+ attr: name (v/c), outdegree (v/n), indegree (v/n), text (e/c), idauth
| (e/c), namehub (e/c), hash (e/x), domain (e/c)
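A small sketch of how to use these attributes: list the accounts with the highest in-degree (the main authorities) and the highest out-degree (the main hubs). Keep in mind that the node names here are author ids, since that is what we used to build the edge list.
# top 10 authorities: most retweeted accounts (node names are author ids)
V(net)$name[order(V(net)$indegree, decreasing = TRUE)][1:10]

# top 10 hubs: accounts that retweet the most
V(net)$name[order(V(net)$outdegree, decreasing = TRUE)][1:10]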
Layout
Networks live in a latent space. To visualize them, we usually rely on layout algorithms that optimize some property of the network and give us coordinates for plotting. You can try out different algorithms, but Fruchterman-Reingold is a popular choice.
l <- layout_with_fr(net, grid = c("nogrid"))
#saveRDS(l, "layout.rds")

head(l)
[,1] [,2]
[1,] -102.96401 216.91269
[2,] -178.82523 158.25089
[3,] 52.34920 81.01076
[4,] -11.32539 -139.13169
[5,] 51.89335 -56.94802
[6,] 29.44408 -99.62837
Communities
Community detection is a big part of network analysis. The idea of these techniques is to find clusters of densely connected nodes across your entire network. Community detection is a huge subfield of network science; here is a nice review piece by Porter et al.
My take is that for large networks, like the ones we usually deal with in social media, most of the algorithms will do the job you need, which in general is to identify the core communities in a network. An important point is to always validate these communities with some qualitative analysis of the results (we do a quick check of this kind after computing the community sizes below).
We will use a random-walk algorithm (walktrap) for community detection.
my.com.fast <- walktrap.community(net)

str(my.com.fast, max.level = 1)
Class 'communities' hidden list of 6
$ merges : num [1:77186, 1:2] 58675 61627 60095 58720 58731 ...
$ modularity: num [1:79886] 0 -0.00127 -0.00127 -0.00126 -0.00125 ...
$ membership: num [1:79886] 1689 2004 11 169 175 ...
$ names : chr [1:79886] "809471355116781568" "42269111" "1335618427852124163" "839520909807521793" ...
$ vcount : int 79886
$ algorithm : chr "walktrap"
Add the layout and membership to your igraph object.
V(net)$l1 <- l[,1]
V(net)$l2 <- l[,2]
V(net)$membership <- my.com.fast$membership
What are the largest communities?
comunidades <- data_frame(membership = V(net)$membership)

comunidades %>%
  count(membership) %>%
  arrange(desc(n)) %>%
  top_n(5)
# A tibble: 5 × 2
membership n
<dbl> <int>
1 11 18272
2 4 18077
3 8 7951
4 13 2923
5 2 1165
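As a quick, informal validation of what these communities represent, one option is to look at the most retweeted accounts inside each large community and inspect their usernames qualitatively. The sketch below uses the community ids from the table above (11, 4, and 8); usernames may come back as NA for accounts that only appear as retweet sources in our sample.
# node-level table: community membership and in-degree (names are author ids)
node_info <- tibble(author_id = V(net)$name,
                    membership = V(net)$membership,
                    indegree = V(net)$indegree)

# top 5 most retweeted accounts in each of the three largest communities
node_info %>%
  filter(membership %in% c(11, 4, 8)) %>%
  group_by(membership) %>%
  slice_max(indegree, n = 5) %>%
  left_join(distinct(tweets_tidy, author_id, user_username),
            by = "author_id")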
Visualizing communities
# A function to plot the network with a density overlay. Nice to visualize as well.
my.den.plot <- function(l = l, new.color = new.color, ind = ind, legend, color){
  library(KernSmooth)
  est <- bkde2D(l, bandwidth = c(10, 10))
  plot(l, cex = log(ind + 1)/4, col = new.color, pch = 16, xlim = c(-160, 140), ylim = c(-140, 160), xlab = "", ylab = "", axes = FALSE)
  legend("topright", c(legend[1], legend[2], legend[3]), pch = 17:19, col = c(color[1], color[2], color[3]))
  contour(est$x1, est$x2, est$fhat, col = gray(.6), add = TRUE)
  #text(-140, 115, paste("ENCG: ", ENCG, sep = ""), cex = 1, srt = 0)
}
# Add colors for each community
# Building an empty container
temp <- rep(1, length(V(net)$membership))
new.color <- rep("white", length(V(net)$membership))
new.color[V(net)$membership==11] <- "Yellow" ####
new.color[V(net)$membership==8] <- "pink" ####
new.color[V(net)$membership==4] <- "red" ####
# Save as a variable in the network object
V(net)$new.color <- new.color
# Plot
my.den.plot(l = cbind(V(net)$l1, V(net)$l2), new.color = V(net)$new.color, ind = V(net)$indegree,
            legend = c("Pro-Bolsonaro", "Anti-Bolsonaro I", "Anti-Bolsonaro II"),
            color = c("Yellow", "red", "pink"))
Other API endpoints
Most of our work with the Twitter API relies on querying it with search terms. For this reason, the search endpoint (and the filter endpoint for live data collection) is the most popular.
However, there are a few other endpoints from the Twitter API that can also be very useful for research purposes. Let’s walk through them briefly.
Getting user id
Imagine a research project in which you have the Twitter handles of political elites and you want to collect their Twitter data. The first step is to collect their user ids.
# getting some Twitter ids
pelosi <- get_user_id("SpeakerPelosi")
Getting whom a user follows
pelosi_network <- get_user_following(pelosi)
Processing 15764644
Total data points: 429
This is the last page for 15764644 : finishing collection.
glimpse(pelosi_network)
Rows: 429
Columns: 14
$ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ profile_image_url <chr> "https://pbs.twimg.com/profile_images/14707589214261…
$ username <chr> "RepShontelBrown", "RepAlGreen", "RepJoeGarcia", "Re…
$ id <chr> "1456381091598700556", "156333623", "937801969", "11…
$ public_metrics <df[,4]> <data.frame[26 x 4]>
$ entities <df[,2]> <data.frame[26 x 2]>
$ name <chr> "Rep. Shontel Brown", "Congressman Al Green", "Re…
$ url <chr> "https://t.co/v695zCnmxN", "https://t.co/4xG26ktT…
$ pinned_tweet_id <chr> "1463532439456952323", "1422610928756043778", NA, NA…
$ verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ created_at <chr> "2021-11-04T22:01:33.000Z", "2010-06-16T17:20:23.000…
$ description <chr> "Representative for Ohio’s Eleventh Congressional Di…
$ location <chr> NA, "Houston, TX & Washington, DC", "Miami, Florida"…
$ from_id <chr> "15764644", "15764644", "15764644", "15764644", "157…
Estimate user ideology
#devtools::install_github("pablobarbera/twitter_ideology/pkg/tweetscores")
library(tweetscores)
results <- estimateIdeology("SpeakerPelosi", pelosi_network$id, verbose = FALSE)
plot(results)
User timeline
pelosi_tl <- get_user_timeline(pelosi,
                               start_tweets = "2022-01-01T00:00:00Z",
                               end_tweets = "2022-10-22T00:00:00Z",
                               n = 100) #limit
user: 15764644
Total pages queried: 1 (tweets captured this page: 100).
Total tweets captured now reach 100 : finishing collection.
glimpse(pelosi_tl)
Rows: 100
Columns: 15
$ created_at <chr> "2022-10-19T02:18:33.000Z", "2022-10-18T22:05:0…
$ text <chr> "Anna May Wong was a dazzling, trailblazing tal…
$ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en",…
$ edit_history_tweet_ids <list> "1582556778608218118", "1582492989015805953", …
$ conversation_id <chr> "1582556778608218118", "1582492989015805953", "…
$ context_annotations <list> [<data.frame[7 x 2]>], [<data.frame[14 x 2]>],…
$ entities <df[,4]> <data.frame[26 x 4]>
$ id <chr> "1582556778608218118", "1582492989015805953"…
$ author_id <chr> "15764644", "15764644", "15764644", "15764644",…
$ referenced_tweets <list> [<data.frame[1 x 2]>], <NULL>, <NULL>, <NULL>, …
$ public_metrics <df[,4]> <data.frame[26 x 4]>
$ possibly_sensitive <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ source <chr> "Twitter for iPhone", "Twitter Web App", "Tw…
$ attachments <df[,1]> <data.frame[26 x 1]>
$ in_reply_to_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "15764644", NA,…
Tweets liked by a user
pelosi_likes <- get_liked_tweets(pelosi) #limit
Processing 15764644
Total data points: 20
Total data points: 21
This is the last page for 15764644 : finishing collection.
glimpse(pelosi_likes) # she mostly liked her own tweets
Rows: 21
Columns: 16
$ id <chr> "1554482274430844928", "1554897362299981824", "…
$ edit_history_tweet_ids <list> "1554482274430844928", "1554897362299981824", …
$ created_at <chr> "2022-08-02T15:00:29.000Z", "2022-08-03T18:29:5…
$ public_metrics <df[,4]> <data.frame[21 x 4]>
$ text <chr> "Our delegation’s visit to Taiwan honors Ame…
$ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en",…
$ context_annotations <list> [<data.frame[5 x 2]>], [<data.frame[9 x 2]>], […
$ author_id <chr> "15764644", "15764644", "15764644", "15764644"…
$ conversation_id <chr> "1554482274430844928", "1554897362299981824", "…
$ source <chr> "Twitter for iPhone", "Twitter Media Studio", "…
$ entities <df[,4]> <data.frame[21 x 4]>
$ possibly_sensitive <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ attachments <df[,1]> <data.frame[21 x 1]>
$ in_reply_to_user_id <chr> NA, NA, "15764644", NA, NA, "15764644", "157…
$ referenced_tweets <list> <NULL>, <NULL>, [<data.frame[1 x 2]>], <NULL>, …
$ from_id <chr> "15764644", "15764644", "15764644", "1576464…