Text Analysis in R: Tidy Text and Dictionary Methods

Introduction

In today’s class, we’re going to take our first steps in using R for text analysis. We will divide the class into the following topics.

  • Opening Texts in R.

  • Text as “Tidy” Data.

  • Sentiment Analysis + Dictionary Models.

I strongly recommend that students read the book Text Mining with R: A Tidy Approach. Today’s tutorial is heavily inspired by this beautiful book by Julia Silge and David Robinson.

All data used in this tutorial can be downloaded here. Place the data in the same folder as your .Rmd file, and everything should work smoothly.

Review: String Manipulation in R

Before we start this workshop, it is important to review our classes on manipulating strings in R using the stringr package. Take a look at the slides and the code; they will be helpful here.
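If you need a quick refresher, here is a minimal sketch of the stringr verbs we will lean on throughout this tutorial (the example strings are made up):

library(stringr)

# Detect a pattern, strip punctuation, squeeze whitespace, lowercase.
str_detect("O SR. DEPUTADO", "SR")              # TRUE
str_replace_all("voto, voto!", "[:punct:]", "") # "voto voto"
str_squish("  voto   voto ")                    # "voto voto"
str_to_lower("VOTO")                            # "voto"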

Opening Texts in R.

As a programming language, R is quite flexible about which types of data can be imported into your working environment. When working with digital texts, you will generally access them directly from the internet and save them as an R object. However, there are some other options that we will cover here.

Accessing digital files directly in R.

The easiest way to access text data in R is to import some type of digital text directly into your environment. For example, we’re going to access data from the Twitter API using the rtweet package. Note that this chunk only runs if you have already authenticated with the Twitter API (in recent versions of rtweet, see auth_setup_default()); without credentials, the calls below will throw an authentication error.

library(rtweet)
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0.9000 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()  masks stats::filter()
## x purrr::flatten() masks rtweet::flatten()
## x dplyr::lag()     masks stats::lag()
bolsonaro_tweets <- search_tweets("bolsonaro", n = 50, include_rts = TRUE)

# See which columns the API returns.
colnames(bolsonaro_tweets)

# Select only the identifiers and the text.
bolsonaro_tweets <- bolsonaro_tweets %>%
                        select(id, id_str, full_text) %>%
                        as_tibble()

# See the data.
bolsonaro_tweets

# Save as an R object; load("bolsonaro_tweets.Rdata") restores it later.
save(bolsonaro_tweets, file = "bolsonaro_tweets.Rdata")

This process works for basically any dataset accessed directly through an API.

Accessing Data saved as txt.

In the “data_txt” folder, I saved ten speeches given by deputies on the Chamber floor in .txt format. Let’s learn how to import them into the R environment.

# see the files

list.files("data_txt")
##  [1] "discurso1.txt"  "discurso10.txt" "discurso2.txt"  "discurso3.txt" 
##  [5] "discurso4.txt"  "discurso5.txt"  "discurso6.txt"  "discurso7.txt" 
##  [9] "discurso8.txt"  "discurso9.txt"
# save the file names
nomes <- list.files("data_txt")

# build the paths
path <- paste0("data_txt/", nomes)

# open: map_chr() works here because each file contains a single line
dados <- map_chr(path, read_lines)

# combine names and texts into a tibble
dados <- tibble(file=nomes, texto=dados)
dados
## # A tibble: 10 × 2
##    file           texto                                                         
##    <chr>          <chr>                                                         
##  1 discurso1.txt  "\"1\" \"O SR. GONZAGA PATRIOTA  (PSB-PE. Sem revisão do orad…
##  2 discurso10.txt "\"1\" \"O SR. AUGUSTO CARVALHO  (Bloco/PPS-DF. Sem revisão d…
##  3 discurso2.txt  "\"1\" \"O SR. DR. UBIALI  (PSB-SP. Sem revisão do orador.) -…
##  4 discurso3.txt  "\"1\" \"O SR. DOMINGOS DUTRA  (PT-MA. Sem revisão do orador.…
##  5 discurso4.txt  "\"1\" \"O SR. GONZAGA PATRIOTA  (PSB-PE. Pela ordem. Sem rev…
##  6 discurso5.txt  "\"1\" \"O SR. LEONARDO GADELHA  (PSC-PB. Sem revisão do orad…
##  7 discurso6.txt  "\"1\" \"O SR. DR. UBIALI  (PSB-SP. Pela ordem. Sem revisão d…
##  8 discurso7.txt  "\"1\" \"O SR. DOMINGOS DUTRA  (PT-MA. Pela ordem. Sem revisã…
##  9 discurso8.txt  "\"1\" \"O SR.  LINS  (PSD-AM. Sem revisão do orador.) - Sr. …
## 10 discurso9.txt  "\"1\" \"O SR. IZALCI  (PSDB-DF. Sem revisão do orador.) - Sr…
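One caveat: map_chr() only works above because each .txt file contains a single line of text. If your files had multiple lines, read_lines() would return a vector and map_chr() would fail. A safer sketch uses readr’s read_file(), which always returns one string per file:

# read_file() returns the entire file as a single string,
# so multi-line files do not break map_chr().
dados_alt <- map_chr(path, read_file)
dados_alt <- tibble(file = nomes, texto = dados_alt)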

Accessing via csv.

Sometimes you will find text data saved as .csv. Opening it is as simple as opening any other type of csv file. We will open an example with speeches from the plenary of the Brazilian Chamber of Deputies, and we will continue with this dataset for the rest of the classes.

discursos <- read_csv("speeches.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   nome = col_character(),
##   partido = col_character(),
##   uf = col_character(),
##   speech = col_character()
## )
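If the accented Portuguese characters look garbled on your machine, the csv was probably saved with a different encoding. A hedged fix is to set the locale explicitly; the encoding below is an assumption you may need to adjust:

# Assumption: the file is encoded as Latin-1; change if accents still look wrong.
discursos <- read_csv("speeches.csv",
                      locale = locale(encoding = "Latin1"))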

The other popular way to access text data is via PDFs. We are not going to cover this process in depth here because it is a bit more cumbersome, but a minimal sketch follows, and I am happy to send you more complete code in case you are interested.
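This sketch uses the pdftools package; the file name is hypothetical, and pdf_text() returns one string per page:

library(pdftools)

# One string per page; collapse the pages into a single text.
paginas <- pdf_text("meu_discurso.pdf")  # hypothetical file
texto <- str_c(paginas, collapse = " ")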

Text as “Tidy” Data.

We’ve learned in previous classes about the concept of tidy data. The three most important properties that define a tidy dataset are:

  • Each column is a variable.

  • Each row is an observation.

  • Each cell contains a single value.

As we’ve discussed several times, the tidyverse is a language of its own within R. Therefore, there are extensions of tidy data and the tidyverse packages to a wide range of areas of Computational Social Science, including modeling and text analysis.

IMPORTANT: A tidy text dataset is a table organized with one token per row.

A token is a meaningful unit of text, like a word, that we are interested in using for parsing, and tokenization is the process of breaking text into tokens.

For tidy text datasets, a token is usually a word; however, it can also be an n-gram, a sentence, or even a paragraph.

To convert our text data to the tidy format, let’s use the unnest_tokens function from the tidytext package.

Tidytext: unnest_tokens

library(tidytext)
# Convert to tidy text format
tidy_discursos <- discursos %>%
                  mutate(id_discursos=1:nrow(.)) %>%
                   unnest_tokens(words, speech) #(output, input)
tidy_discursos
## # A tibble: 5,861,190 × 5
##    nome         partido uf    id_discursos words  
##    <chr>        <chr>   <chr>        <int> <chr>  
##  1 SIMÃO SESSIM PP      RJ               1 o      
##  2 SIMÃO SESSIM PP      RJ               1 sr     
##  3 SIMÃO SESSIM PP      RJ               1 sim    
##  4 SIMÃO SESSIM PP      RJ               1 sessim 
##  5 SIMÃO SESSIM PP      RJ               1 pp     
##  6 SIMÃO SESSIM PP      RJ               1 rj     
##  7 SIMÃO SESSIM PP      RJ               1 sem    
##  8 SIMÃO SESSIM PP      RJ               1 revisão
##  9 SIMÃO SESSIM PP      RJ               1 do     
## 10 SIMÃO SESSIM PP      RJ               1 orador 
## # … with 5,861,180 more rows

The two basic arguments for unnest_tokens used here are column names: first the name of the output column that will be created, and then the input column that the text comes from (speech, in this case).

Note: punctuation is removed and texts are converted to lowercase.
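If you need to keep the original casing (for instance, to detect proper names later), unnest_tokens() lets you turn the lowercasing off:

# Keep the original casing; punctuation is still stripped for word tokens.
discursos %>%
  unnest_tokens(words, speech, to_lower = FALSE)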

Other forms of tidy texts.

Sentences

discursos %>%
 unnest_tokens(words, speech, token="sentences") #(output, input)
## # A tibble: 312,821 × 4
##    nome              partido uf    words                                        
##    <chr>             <chr>   <chr> <chr>                                        
##  1 ABEL MESQUITA JR. PDT     RR    o sr.                                        
##  2 ABEL MESQUITA JR. PDT     RR    abel mesquita jr.                            
##  3 ABEL MESQUITA JR. PDT     RR    (pdt-rr.                                     
##  4 ABEL MESQUITA JR. PDT     RR    sem revisão do orador.) - sr.                
##  5 ABEL MESQUITA JR. PDT     RR    presidente, eu queria dar como lido este meu…
##  6 ABEL MESQUITA JR. PDT     RR    peço a v.exa. que receba como lido o meu pro…
##  7 ABEL MESQUITA JR. PDT     RR    muito obrigado.                              
##  8 ABEL MESQUITA JR. PDT     RR    pronunciamento encaminhado pelo orador  sr.  
##  9 ABEL MESQUITA JR. PDT     RR    presidente, sras. e srs.                     
## 10 ABEL MESQUITA JR. PDT     RR    deputados, esta semana nós demos um importan…
## # … with 312,811 more rows

n-gram

discursos %>%
 unnest_tokens(words, speech, token="ngrams", n=2) #(output, input)
## # A tibble: 5,859,418 × 4
##    nome              partido uf    words        
##    <chr>             <chr>   <chr> <chr>        
##  1 ABEL MESQUITA JR. PDT     RR    o sr         
##  2 ABEL MESQUITA JR. PDT     RR    sr abel      
##  3 ABEL MESQUITA JR. PDT     RR    abel mesquita
##  4 ABEL MESQUITA JR. PDT     RR    mesquita jr  
##  5 ABEL MESQUITA JR. PDT     RR    jr pdt       
##  6 ABEL MESQUITA JR. PDT     RR    pdt rr       
##  7 ABEL MESQUITA JR. PDT     RR    rr sem       
##  8 ABEL MESQUITA JR. PDT     RR    sem revisão  
##  9 ABEL MESQUITA JR. PDT     RR    revisão do   
## 10 ABEL MESQUITA JR. PDT     RR    do orador    
## # … with 5,859,408 more rows

Basic Operations with Tidy Texts

The main advantage of having our texts in a tidy format is that it facilitates cleaning and basic analysis. Since each row in our dataset refers to a token, it is possible to perform operations using words as the unit of analysis. For example, to eliminate “stop words”, add information from dictionaries, or attach word sentiments, you just join different datasets that are also in tidy format, and your results will be ready.

Basic Statistics

Let’s calculate some basic statistics based on our prior tidyverse knowledge.

tidy_discursos <- tidy_discursos %>%
                  group_by(id_discursos) %>%
                  mutate(total_palavras=n()) %>%
                  ungroup() 

# Information about the speeches.

partido_st <- discursos %>%
                   group_by(partido) %>%
                   summarise(n_partidos=n()) 

nome_st <- discursos %>%
                   group_by(nome) %>%
                   summarise(n_dep=n())

uf_st <- discursos %>%
          group_by(uf) %>%
            summarise(n_uf=n())

tidy_discursos <- left_join(tidy_discursos, partido_st) %>%
                  left_join(nome_st) %>%
                  left_join(uf_st)
## Joining, by = "partido"
## Joining, by = "nome"
## Joining, by = "uf"

Removing stop words

“Stop words” are words that we commonly drop from our text analyses. The fundamental idea is that these words (articles, prepositions, and other function words) carry little substantive meaning.

library(stopwords)
stop_words <- tibble(words=stopwords("portuguese"))

tidy_discursos <- tidy_discursos %>%
                    anti_join(stop_words)
## Joining, by = "words"

What else can we eliminate? The names of the states.

estados <- tibble(words=unique(str_to_lower(tidy_discursos$uf)))

tidy_discursos <- tidy_discursos %>%
                    anti_join(estados)
## Joining, by = "words"

Functional words

# Note: anti_join() matches exact strings, so entries like "prefeit*" and
# "deputad*" below are treated literally, not as wildcards.
function_names <- tibble(words=c("candidato", "candidata", "brasileira", "brasileiro", 
                                 "câmara", "municipio",
                    "municipal", "eleições", "cidade", "partido",
                    "cidadão", "deputado", "deputada", "caro", "cara", 
                    "plano", "suplementar", 
                    "voto","votar", "eleitor", "querido", 
                    "sim", "não", "dia", "hoje", "amanhã", "amigo", "amiga", 
                    "seção", "emenda", "i", "ii", "iii", "iv", 
                    "colegas", "clausula", "prefeit*", "presidente",
                    "prefeitura", 'proposta','propostas','meta',
                    'metas','plano','governo','municipal','candidato',
                    'diretrizes','programa', "deputados", "federal",
                    'eleição','coligação','município', "senhor", "sr", "dr", 
                    "excelentissimo", "nobre", "deputad*", "srs", "sras", "v.exa",
                    "san", "arial", "sentido", "fim", "minuto", "razão", "v.exa", 
                    "país", "brasil", "tribuna", "congresso", "san", "symbol", "sans", "serif",
                    "ordem", "revisão", "orador", "obrigado", "parte", "líder", "bloco", "esc", 
                    "sra", "oradora", "bloco", "times", "new", "colgano", "pronuncia", "colega", 
                    "presidenta", "pronunciamento", "mesa", "parlamentares", "secretário", "seguinte", 
                    "discurso","mato", "sul", "norte", "nordeste", "sudeste", "centro-oeste", "sul", "grosso",
                    "é", "ser", "casa", "todos", "sobre", "aqui", "nacional"))


tidy_discursos <- tidy_discursos %>%
                    anti_join(function_names)
## Joining, by = "words"

stringr for data cleaning

One of the advantages of keeping your data in Tidy format is the ability to use stringr functions for character manipulation. Let’s look at some examples to clean up the data a little more.

str_remove_all

tidy_discursos <- tidy_discursos %>%
                  mutate(words=str_remove_all(words, "[[:digit:]]"), 
                         words=str_remove_all(words, "[:punct:]")) 

Remove accents and extra whitespace

tidy_discursos <- tidy_discursos %>%
                  mutate(words=str_trim(words), 
                         words=str_squish(words), 
                         words=stringi::stri_trans_general(words, "Latin-ASCII"))%>%
                  filter(words!="")
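To see what the transliteration step does, here is a quick check on a single word:

# "Latin-ASCII" maps accented characters to their plain ASCII versions.
stringi::stri_trans_general("revisão", "Latin-ASCII")
## [1] "revisao"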

Most common words

tidy_discursos %>%
  count(words, sort = TRUE) 
## # A tibble: 76,846 × 2
##    words        n
##    <chr>    <int>
##  1 estado   14790
##  2 anos     10169
##  3 grande    9753
##  4 porque    8823
##  5 ainda     7721
##  6 quero     6958
##  7 povo      6877
##  8 fazer     6772
##  9 projeto   6600
## 10 politica  6537
## # … with 76,836 more rows
# Graph
tidy_discursos %>%
count(words, sort = TRUE) %>%
  slice(1:25) %>%
  mutate(word = reorder(words, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL) +
  theme_minimal()

Comparing the most common words across parties

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Total words per party
total_palavras <- tidy_discursos %>%
                  select(partido, total_palavras) %>%
                  distinct() %>%
                  group_by(partido) %>%
                  summarize(total_words_per_party=sum(total_palavras)) %>%
                  filter(partido%in%c("PT", "PSDB"))

# Count of each word per party
palavras_partido <- tidy_discursos %>%
                          count(partido, words) %>%
                           filter(partido%in%c("PT", "PSDB"))

# Merge
partidos <- left_join(palavras_partido, total_palavras) %>%
             mutate(prop=n/total_words_per_party) %>%
              #untidy
            select(words, partido, prop) %>%
            pivot_wider(names_from=partido,
                        values_from=prop) %>%
            drop_na() %>%
            mutate(more=ifelse(PT>PSDB, "More PT", "More PSDB"))
## Joining, by = "partido"
# Graph  
ggplot(partidos, aes(x = PSDB, y = PT, 
                     alpha = abs(PT - PSDB), 
                     color=more)) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = words), check_overlap = TRUE, vjust = 1.5, alpha=.8) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_manual(values=c("#5BBCD6","#FF0000"), name="") +
  theme(legend.position="none") +
  labs(y = "Proportion of Words (PT)", x = "Proportion of Words (PSDB)") +
  theme_minimal()

Sentiment Analysis.

As you can imagine, with the data in tidy format, dictionary-based sentiment analysis is super intuitive. All you need is a dataset with a sentiment dictionary. There are many options for English dictionaries; in Portuguese and Spanish, you need to dig a little deeper, and probably make small adjustments.
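For English, for example, the tidytext package ships with ready-made sentiment lexicons via get_sentiments(); the "bing" lexicon comes bundled, while others (e.g., "afinn", "nrc") require the textdata package:

# The "bing" lexicon labels English words as positive or negative.
get_sentiments("bing")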

# We will use this dictionary.
#devtools::install_github("sillasgonzaga/lexiconPT")
library(lexiconPT)

# View the dictionary
data("sentiLex_lem_PT02")
sent_pt <- as_tibble(sentiLex_lem_PT02)

# -1 negative +1 positive

tidy_discursos <- left_join(tidy_discursos, sent_pt, by=c("words"="term"))

# Words with no dictionary match get a neutral polarity of 0.
tidy_discursos_sent <- tidy_discursos %>%
                        mutate(polarity=ifelse(is.na(polarity), 0, polarity))
          
tidy_discursos_sent
## # A tibble: 2,865,805 × 13
##    nome   partido uf    id_discursos words total_palavras n_partidos n_dep  n_uf
##    <chr>  <chr>   <chr>        <int> <chr>          <int>      <int> <int> <int>
##  1 SIMÃO… PP      RJ               1 sess…            301        493    43   836
##  2 SIMÃO… PP      RJ               1 pp               301        493    43   836
##  3 SIMÃO… PP      RJ               1 gost…            301        493    43   836
##  4 SIMÃO… PP      RJ               1 regi…            301        493    43   836
##  5 SIMÃO… PP      RJ               1 torc…            301        493    43   836
##  6 SIMÃO… PP      RJ               1 prof…            301        493    43   836
##  7 SIMÃO… PP      RJ               1 educ…            301        493    43   836
##  8 SIMÃO… PP      RJ               1 rio              301        493    43   836
##  9 SIMÃO… PP      RJ               1 jane…            301        493    43   836
## 10 SIMÃO… PP      RJ               1 greve            301        493    43   836
## # … with 2,865,795 more rows, and 4 more variables: grammar_category <chr>,
## #   polarity <dbl>, polarity_target <chr>, polarity_classification <chr>
# sentiment per speech
tidy_dicursos_av <- tidy_discursos_sent %>%
                          group_by(id_discursos) %>%
                          summarize(polarity=mean(polarity)) %>%
                          arrange(polarity)
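To inspect the extremes, we can peek at the most negative and most positive speeches (slice_head()/slice_tail() work because the tibble is sorted by polarity):

# Most negative speeches (lowest mean polarity).
tidy_dicursos_av %>% slice_head(n = 5)

# Most positive speeches (highest mean polarity).
tidy_dicursos_av %>% slice_tail(n = 5)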

We therefore have a sentiment measure for each speech. Let’s generate two graphs with this information:

  • A word cloud of positive and negative words.

  • The distribution of sentiments across parties.

Most Negative and Positive Words

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(wordcloud)
## Loading required package: RColorBrewer
tidy_discursos_sent %>%
  filter(polarity!=0) %>%
  mutate(polarity=ifelse(polarity==1, "Positiva", "Negativa")) %>%
  count(words, polarity, sort = TRUE) %>%
  acast(words ~ polarity, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 200)

Sentiment by Party

part_pol <- discursos %>%
  mutate(id_discursos=1:nrow(.)) %>%
  left_join(tidy_dicursos_av) %>%
  mutate(polarity_binary=ifelse(polarity>0, "Positivo", "Negativo")) %>%
  count(partido, polarity_binary) %>%
  mutate(n=ifelse(polarity_binary=="Negativo", -1*n, n)) %>%
  filter(partido!="\n", 
         n!=0) %>%
  arrange(polarity_binary, n) %>%
  mutate(partido=fct_inorder(partido))
## Joining, by = "id_discursos"
# Graph
ggplot(part_pol,
       aes(x = partido, y = n, fill = polarity_binary)) + 
    geom_col(alpha=.6, color="black") +
    coord_flip() +
    scale_fill_manual(values=c("#5BBCD6","#FF0000"), 
                       name="Polaridade em \n Discursos Legislativos") +
    labs(x="Partidos", y="Numero de Discursos") +
  theme_bw() +
  theme(legend.position = "bottom") 

Other ways to analyze text in R.

There are several other packages for text analysis in R. The most famous, and most useful, is quanteda. quanteda is very comprehensive and allows you to run complex analyses and statistical models on text data in an intuitive way.

Why, then, don’t we learn quanteda? Because quanteda has its own way of organizing data (corpora and document-feature matrices), and since we are taking our first steps with tidy data here, my choice was to keep our learning consistent.

If we have time, I will show you a little bit of topic modeling, and we will use quanteda for this task.
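If you do end up exploring quanteda, note that tidytext already provides a bridge: cast_dfm() turns a tidy count table into a quanteda document-feature matrix. A minimal sketch, using the tidy_discursos object from above (quanteda must be installed):

# Count words per speech, then cast into a quanteda dfm.
dfm_discursos <- tidy_discursos %>%
  count(id_discursos, words) %>%
  cast_dfm(id_discursos, words, n)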