Text Analysis in R: Tidy Text and Dictionary Methods
Introduction
In today's class, we will take our first steps in using R for text analysis. We will divide the class into the following topics:
Opening Texts in R.
Text as "Tidy" Data.
Sentiment Analysis + Dictionary Models.
I strongly recommend that students read the book Text Mining with R: A Tidy Approach. Today's tutorial is heavily inspired by this beautiful book by Julia Silge and David Robinson.
All data used in this tutorial can be downloaded here. Place the data in the same folder as your .Rmd file, and everything should work smoothly.
Review: String Manipulation in R
Before we start this workshop, it is important to review our classes on manipulating strings in R with the stringr package. Take a look at the slides and the code; they will be helpful here.
Opening Texts in R.
As a programming language, R is quite flexible about which types of data can be imported into your working environment. When working with digital texts, you will generally access them directly from the internet and save them as an R object. However, there are some other options that we will cover here.
Accessing digital files directly in R.
The easiest way to get text data into R is to import digital text directly into your environment. For example, we are going to access data from the Twitter API using the rtweet package.
library(rtweet)
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0.9000 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x purrr::flatten() masks rtweet::flatten()
## x dplyr::lag() masks stats::lag()
bolsonaro_tweets <- search_tweets("bolsonaro", n=50, include_rts = TRUE)
# Check the available columns.
colnames(bolsonaro_tweets)
# Select only the id and text columns.
bolsonaro_tweets <- bolsonaro_tweets %>%
  select(id, id_str, full_text) %>%
  as_tibble()
# See the data.
bolsonaro_tweets
# Save as an R object.
save(bolsonaro_tweets, file="bolsonaro_tweets.Rdata")
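Note that search_tweets() will only run after you authenticate with the Twitter API. A minimal sketch, assuming rtweet version 1.0 or later and a Twitter account (older versions used create_token() instead):
library(rtweet)
# One-time interactive setup: opens a browser so you can authorize
# rtweet with your Twitter account, and saves the token to disk.
auth_setup_default()
# In later sessions, reuse the saved token.
auth_as("default")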
This process works for basically any dataset you access directly through an API.
Accessing Data saved as txt.
In the "data_txt" folder, I saved ten speeches delivered by deputies on the chamber floor in .txt format. Let's learn how to import them into the R environment.
# see the files
list.files("data_txt")## [1] "discurso1.txt" "discurso10.txt" "discurso2.txt" "discurso3.txt"
## [5] "discurso4.txt" "discurso5.txt" "discurso6.txt" "discurso7.txt"
## [9] "discurso8.txt" "discurso9.txt"
# save the file names
nomes <- list.files("data_txt")
# build the paths
path <- paste0("data_txt/", nomes)
# read each file into a single string
dados <- map_chr(path, read_lines)
dados <- tibble(file=nomes, texto=dados)
dados
## # A tibble: 10 × 2
## file texto
## <chr> <chr>
## 1 discurso1.txt "\"1\" \"O SR. GONZAGA PATRIOTA (PSB-PE. Sem revisão do orad…
## 2 discurso10.txt "\"1\" \"O SR. AUGUSTO CARVALHO (Bloco/PPS-DF. Sem revisão d…
## 3 discurso2.txt "\"1\" \"O SR. DR. UBIALI (PSB-SP. Sem revisão do orador.) -…
## 4 discurso3.txt "\"1\" \"O SR. DOMINGOS DUTRA (PT-MA. Sem revisão do orador.…
## 5 discurso4.txt "\"1\" \"O SR. GONZAGA PATRIOTA (PSB-PE. Pela ordem. Sem rev…
## 6 discurso5.txt "\"1\" \"O SR. LEONARDO GADELHA (PSC-PB. Sem revisão do orad…
## 7 discurso6.txt "\"1\" \"O SR. DR. UBIALI (PSB-SP. Pela ordem. Sem revisão d…
## 8 discurso7.txt "\"1\" \"O SR. DOMINGOS DUTRA (PT-MA. Pela ordem. Sem revisã…
## 9 discurso8.txt "\"1\" \"O SR. LINS (PSD-AM. Sem revisão do orador.) - Sr. …
## 10 discurso9.txt "\"1\" \"O SR. IZALCI (PSDB-DF. Sem revisão do orador.) - Sr…
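One caveat about the code above: read_lines() returns one element per line, so map_chr() only works because each of these files happens to contain a single line. A slightly more robust sketch (dados_alt is a hypothetical name) reads each file as one string with read_file():
# Read each file as a single string, regardless of line breaks.
dados_alt <- tibble(file = nomes) %>%
  mutate(texto = map_chr(file.path("data_txt", file), read_file))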
Accessing via csv.
Sometimes you will find text data saved as .csv files. Opening them is as simple as opening any other csv file. We will open an example with speeches from the plenary of the Brazilian Chamber of Deputies, and we will keep working with this dataset for the rest of the class.
discursos <- read_csv("speeches.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## nome = col_character(),
## partido = col_character(),
## uf = col_character(),
## speech = col_character()
## )
The other popular way to access text data is via PDFs. We are not going to cover this process here because it is a bit more cumbersome, but I am happy to send you some code in case you are interested.
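For reference, a minimal sketch using the pdftools package (example.pdf is a hypothetical file name):
library(pdftools)
# pdf_text() returns a character vector with one string per page.
pdf_pages <- pdf_text("example.pdf")
pdf_df <- tibble(page = seq_along(pdf_pages), texto = pdf_pages)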
Text as "Tidy" Data.
We have learned in previous classes about the concept of tidy data. The three most important properties that define a tidy dataset are:
Each column is a variable.
Each line is an observation.
Each cell contains a single value.
As we have discussed several times, the tidyverse is a language of its own within R. As a consequence, tidy data principles and tidyverse packages extend to a wide range of areas of Computational Social Science, including modeling and text analysis.
IMPORTANT: A tidy text dataset is a table organized with one token per row.
A token is a meaningful unit of text, like a word, that we are interested in using for parsing, and tokenization is the process of breaking text into tokens.
For tidy text datasets, a token is usually a word, though it can also be an n-gram, a sentence, or even a paragraph.
To convert our text data to tidy format, we use the unnest_tokens function from the tidytext package.
Tidytext: unnest_tokens
library(tidytext)
# Convert to tidy text format
tidy_discursos <- discursos %>%
mutate(id_discursos=1:nrow(.)) %>%
unnest_tokens(words, speech) #(output, input)
tidy_discursos
## # A tibble: 5,861,190 × 5
## nome partido uf id_discursos words
## <chr> <chr> <chr> <int> <chr>
## 1 SIMÃO SESSIM PP RJ 1 o
## 2 SIMÃO SESSIM PP RJ 1 sr
## 3 SIMÃO SESSIM PP RJ 1 sim
## 4 SIMÃO SESSIM PP RJ 1 sessim
## 5 SIMÃO SESSIM PP RJ 1 pp
## 6 SIMÃO SESSIM PP RJ 1 rj
## 7 SIMÃO SESSIM PP RJ 1 sem
## 8 SIMÃO SESSIM PP RJ 1 revisão
## 9 SIMÃO SESSIM PP RJ 1 do
## 10 SIMÃO SESSIM PP RJ 1 orador
## # … with 5,861,180 more rows
The two basic arguments for unnest_tokens used here are column names: first the output column that will be created (words), and then the input column that the text comes from (speech).
Note: punctuation is removed and texts are converted to lowercase.
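If you need to keep the original casing (for instance, to preserve acronyms), unnest_tokens() accepts a to_lower argument:
# Tokenize without converting to lowercase.
discursos %>%
  unnest_tokens(words, speech, to_lower = FALSE)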
Other forms of tidy texts.
Sentences
discursos %>%
  unnest_tokens(words, speech, token="sentences") #(output, input)
## # A tibble: 312,821 × 4
## nome partido uf words
## <chr> <chr> <chr> <chr>
## 1 ABEL MESQUITA JR. PDT RR o sr.
## 2 ABEL MESQUITA JR. PDT RR abel mesquita jr.
## 3 ABEL MESQUITA JR. PDT RR (pdt-rr.
## 4 ABEL MESQUITA JR. PDT RR sem revisão do orador.) - sr.
## 5 ABEL MESQUITA JR. PDT RR presidente, eu queria dar como lido este meu…
## 6 ABEL MESQUITA JR. PDT RR peço a v.exa. que receba como lido o meu pro…
## 7 ABEL MESQUITA JR. PDT RR muito obrigado.
## 8 ABEL MESQUITA JR. PDT RR pronunciamento encaminhado pelo orador sr.
## 9 ABEL MESQUITA JR. PDT RR presidente, sras. e srs.
## 10 ABEL MESQUITA JR. PDT RR deputados, esta semana nós demos um importan…
## # … with 312,811 more rows
n-grams
discursos %>%
  unnest_tokens(words, speech, token="ngrams", n=2) #(output, input)
## # A tibble: 5,859,418 × 4
## nome partido uf words
## <chr> <chr> <chr> <chr>
## 1 ABEL MESQUITA JR. PDT RR o sr
## 2 ABEL MESQUITA JR. PDT RR sr abel
## 3 ABEL MESQUITA JR. PDT RR abel mesquita
## 4 ABEL MESQUITA JR. PDT RR mesquita jr
## 5 ABEL MESQUITA JR. PDT RR jr pdt
## 6 ABEL MESQUITA JR. PDT RR pdt rr
## 7 ABEL MESQUITA JR. PDT RR rr sem
## 8 ABEL MESQUITA JR. PDT RR sem revisão
## 9 ABEL MESQUITA JR. PDT RR revisão do
## 10 ABEL MESQUITA JR. PDT RR do orador
## # … with 5,859,408 more rows
Basic Operations with Tidy Texts
The main advantage of having our texts in tidy format is that it simplifies cleaning and basic analysis. Since each row in our dataset refers to a token, we can perform operations with words as the unit of analysis. For example, to eliminate "stop words", add information from dictionaries, or attach word-level sentiments, we simply join different tidy datasets together, and our results will be ready.
Basic Statistics
Let’s calculate some basic statistics based on our prior tidyverse knowledge.
# Count the total number of tokens per speech.
tidy_discursos <- tidy_discursos %>%
group_by(id_discursos) %>%
mutate(total_palavras=n()) %>%
ungroup()
# Information about the speeches.
partido_st <- discursos %>%
group_by(partido) %>%
summarise(n_partidos=n())
nome_st <- discursos %>%
group_by(nome) %>%
summarise(n_dep=n())
uf_st <- discursos %>%
group_by(uf) %>%
summarise(n_uf=n())
tidy_discursos <- left_join(tidy_discursos, partido_st) %>%
left_join(nome_st) %>%
  left_join(uf_st)
## Joining, by = "partido"
## Joining, by = "nome"
## Joining, by = "uf"
Removing stop words
"Stop words" are words that we commonly drop from our text analyses. The fundamental idea is that these words (articles, prepositions, and other function words) carry little substantive meaning.
library(stopwords)
stop_words <- tibble(words=stopwords("portuguese"))
tidy_discursos <- tidy_discursos %>%
  anti_join(stop_words)
## Joining, by = "words"
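If you are curious about what was just removed, you can inspect the list directly:
# First entries of the Portuguese stopword list.
head(stopwords("portuguese"))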
What else can we eliminate? The names of the states.
estados <- tibble(words=unique(str_to_lower(tidy_discursos$uf)))
tidy_discursos <- tidy_discursos %>%
  anti_join(estados)
## Joining, by = "words"
Functional words
function_names <- tibble(words=c("candidato", "candidata", "brasileira", "brasileiro",
"câmara", "municipio",
"municipal", "eleições", "cidade", "partido",
"cidadão", "deputado", "deputada", "caro", "cara",
"plano", "suplementar",
"voto","votar", "eleitor", "querido",
"sim", "não", "dia", "hoje", "amanhã", "amigo", "amiga",
"seção", "emenda", "i", "ii", "iii", "iv",
"colegas", "clausula", "prefeit*", "presidente",
"prefeitura", 'proposta','propostas','meta',
'metas','plano','governo','municipal','candidato',
'diretrizes','programa', "deputados", "federal",
'eleição','coligação','município', "senhor", "sr", "dr",
"excelentissimo", "nobre", "deputad*", "srs", "sras", "v.exa",
"san", "arial", "sentido", "fim", "minuto", "razão", "v.exa",
"país", "brasil", "tribuna", "congresso", "san", "symbol", "sans", "serif",
"ordem", "revisão", "orador", "obrigado", "parte", "líder", "bloco", "esc",
"sra", "oradora", "bloco", "times", "new", "colgano", "pronuncia", "colega",
"presidenta", "pronunciamento", "mesa", "parlamentares", "secretário", "seguinte",
"discurso","mato", "sul", "norte", "nordeste", "sudeste", "centro-oeste", "sul", "grosso",
"é", "ser", "casa", "todos", "sobre", "aqui", "nacional"))
tidy_discursos <- tidy_discursos %>%
  anti_join(function_names)
## Joining, by = "words"
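One caveat: anti_join() matches strings exactly, so entries like "prefeit*" and "deputad*" are treated as literal text rather than wildcards, and remove nothing. A sketch of pattern-based removal with str_detect() (the regexes here are illustrative):
# Drop any token starting with "prefeit" or "deputad".
tidy_discursos <- tidy_discursos %>%
  filter(!str_detect(words, "^prefeit"),
         !str_detect(words, "^deputad"))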
stringr for data cleaning
One of the advantages of keeping your data in Tidy format is the ability to use stringr functions for character manipulation. Let’s look at some examples to clean up the data a little more.
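To see what the helpers below do, here is a toy example (the sample string is made up):
x <- "  Educação,   política!!  "
str_remove_all(x, "[:punct:]")                 # drops the comma and exclamation marks
str_squish(str_trim(x))                        # trims and collapses whitespace
stringi::stri_trans_general(x, "Latin-ASCII")  # "Educação" becomes "Educacao"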
str_remove_all
tidy_discursos <- tidy_discursos %>%
  mutate(words=str_remove_all(words, "[[:digit:]]"),
         words=str_remove_all(words, "[:punct:]"))
Remove accents and whitespace
tidy_discursos <- tidy_discursos %>%
mutate(words=str_trim(words),
words=str_squish(words),
words=stringi::stri_trans_general(words, "Latin-ASCII"))%>%
filter(words!="")Remove common words
tidy_discursos %>%
  count(words, sort = TRUE)
## # A tibble: 76,846 × 2
## words n
## <chr> <int>
## 1 estado 14790
## 2 anos 10169
## 3 grande 9753
## 4 porque 8823
## 5 ainda 7721
## 6 quero 6958
## 7 povo 6877
## 8 fazer 6772
## 9 projeto 6600
## 10 politica 6537
## # … with 76,836 more rows
# Plot
tidy_discursos %>%
count(words, sort = TRUE) %>%
slice(1:25) %>%
mutate(word = reorder(words, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL) +
  theme_minimal()
Comparing most common words across parties
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Total words per party
total_palavras <- tidy_discursos %>%
select(partido, total_palavras) %>%
distinct() %>%
group_by(partido) %>%
summarize(total_words_per_party=sum(total_palavras)) %>%
filter(partido%in%c("PT", "PSDB"))
# Count of each word per party
palavras_partido <- tidy_discursos %>%
count(partido, words) %>%
filter(partido%in%c("PT", "PSDB"))
# Merge
partidos <- left_join(palavras_partido, total_palavras) %>%
mutate(prop=n/total_words_per_party) %>%
#untidy
select(words, partido, prop) %>%
pivot_wider(names_from=partido,
values_from=prop) %>%
drop_na() %>%
  mutate(more=ifelse(PT>PSDB, "More PT", "More PSDB"))
## Joining, by = "partido"
# Graph
ggplot(partidos, aes(x = PSDB, y = PT,
alpha = abs(PT - PSDB),
color=more)) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = words), check_overlap = TRUE, vjust = 1.5, alpha=.8) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_manual(values=c("#5BBCD6","#FF0000"), name="") +
theme(legend.position="none") +
labs(y = "Proportion of Words (PT)", x = "Proportion of Words (PSDB)") +
  theme_minimal()
Sentiment Analysis.
As you can imagine, with the data in tidy format, dictionary-based sentiment analysis is super intuitive. All you need is a dataset containing a sentiment dictionary. There are many options for English dictionaries; for Portuguese and Spanish, you need to look a little harder, and you will probably have to make small adjustments.
# We will use this dictionary.
#devtools::install_github("sillasgonzaga/lexiconPT")
library(lexiconPT)
# View the dictionary
data("sentiLex_lem_PT02")
sent_pt <- as_tibble(sentiLex_lem_PT02)
# polarity: -1 = negative, +1 = positive
tidy_discursos <- left_join(tidy_discursos, sent_pt, by=c("words"="term"))
# Treat words with no dictionary match as neutral (polarity 0),
# and drop entries with the dictionary's out-of-range polarity value of 7.
tidy_discursos_sent <- tidy_discursos %>%
  mutate(polarity=ifelse(is.na(polarity), 0, polarity)) %>%
  filter(polarity!=7)
tidy_discursos_sent
## # A tibble: 2,865,805 × 13
## nome partido uf id_discursos words total_palavras n_partidos n_dep n_uf
## <chr> <chr> <chr> <int> <chr> <int> <int> <int> <int>
## 1 SIMÃO… PP RJ 1 sess… 301 493 43 836
## 2 SIMÃO… PP RJ 1 pp 301 493 43 836
## 3 SIMÃO… PP RJ 1 gost… 301 493 43 836
## 4 SIMÃO… PP RJ 1 regi… 301 493 43 836
## 5 SIMÃO… PP RJ 1 torc… 301 493 43 836
## 6 SIMÃO… PP RJ 1 prof… 301 493 43 836
## 7 SIMÃO… PP RJ 1 educ… 301 493 43 836
## 8 SIMÃO… PP RJ 1 rio 301 493 43 836
## 9 SIMÃO… PP RJ 1 jane… 301 493 43 836
## 10 SIMÃO… PP RJ 1 greve 301 493 43 836
## # … with 2,865,795 more rows, and 4 more variables: grammar_category <chr>,
## # polarity <dbl>, polarity_target <chr>, polarity_classification <chr>
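Before aggregating, it is worth checking how the polarity scores are distributed across tokens:
# How many tokens fall in each polarity score?
tidy_discursos_sent %>%
  count(polarity)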
# sentiment per speech
tidy_dicursos_av <- tidy_discursos_sent %>%
group_by(id_discursos) %>%
summarize(polarity=mean(polarity)) %>%
  arrange(polarity)
We therefore have a sentiment measure for each speech. Let's generate two graphs with this information:
A word cloud of the most positive and negative words.
The distribution of sentiment across parties.
Most Negative and Positive Words
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(wordcloud)
## Loading required package: RColorBrewer
tidy_discursos_sent %>%
filter(polarity!=0) %>%
mutate(polarity=ifelse(polarity==1, "Positiva", "Negativa")) %>%
count(words, polarity, sort = TRUE) %>%
acast(words ~ polarity, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 200)
Sentiment by Party
part_pol <- discursos %>%
mutate(id_discursos=1:nrow(.)) %>%
left_join(tidy_dicursos_av) %>%
  mutate(polarity_binary=ifelse(polarity>0, "Positivo", "Negativo")) %>%
count(partido, polarity_binary) %>%
mutate(n=ifelse(polarity_binary=="Negativo", -1*n, n)) %>%
filter(partido!="\n",
n!=0) %>%
arrange(polarity_binary, n) %>%
  mutate(partido=fct_inorder(partido))
## Joining, by = "id_discursos"
# Graph
ggplot(part_pol,
aes(x = partido, y = n, fill = polarity_binary)) +
geom_col(alpha=.6, color="black") +
coord_flip() +
scale_fill_manual(values=c("#5BBCD6","#FF0000"),
name="Polaridade em \n Discursos Legislativos") +
labs(x="Partidos", y="Numero de Discursos") +
theme_bw() +
  theme(legend.position = "bottom")
Other ways to analyze text in R.
There are several other packages for text analysis in R. The most famous and most useful of all is quanteda. quanteda is very complete: it allows you to run complex analyses and fit statistical models on text data in a very intuitive way.
Why, then, don't we learn quanteda? Because quanteda has its own way of organizing data (corpora and document-feature matrices), and since we are taking our first steps using tidy data, my choice was to keep our learning consistent.
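If you do want to cross over later, tidytext provides cast_dfm() to convert a tidy table of counts into a quanteda document-feature matrix. A minimal sketch, assuming quanteda is installed:
library(quanteda)
# Count words per speech and cast them into a document-feature matrix.
dfm_discursos <- tidy_discursos %>%
  count(id_discursos, words) %>%
  cast_dfm(id_discursos, words, n)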
If we have time, I will show you a little bit of topic modeling, and we will use quanteda for this task.