In the class today, we will start learning about working with text as data in Python. This notebook will cover:
Descriptive analysis of text
Unsupervised learning: document similarity and topic models
Computer scientists often refer to Natural Language Processing (NLP) as the field that focuses on developing tools to process natural language (text, audio, video) with computers. On the other side, social scientists and applied data scientists often use terms such as text-as-data, or even computational linguistics, for the field that focuses on developing tools/models to incorporate textual data into their data analysis pipelines.
These fields are very closely connected, share similar methodological tools, and develop solutions to similar problems. Some of their applications differ, however, reflecting the nature of each field. For example:
Computer scientists (NLP) are often more interested in tasks such as machine translation, chatbots and virtual assistants, generative AI, and speech recognition, among others.
Social scientists (text-as-data) are often more interested in tasks such as document similarity, topic discovery, content analysis, and text classification.
My approach is to consider these perspectives as an integrated disciplinary field that focuses on different tasks, rather than as two separate perspectives. For this reason, I will often use the terms NLP and text-as-data interchangeably, even though most of the applications we will see in the next two weeks are closer to a social scientist's applied perspective on working with text data.
In the Spring, I will teach a full-semester course on Text-as-Data. You can check the syllabus; it mixes tasks/methods from both fields.
The Natural Language Toolkit (NLTK) is a popular open-source library for natural language processing (NLP) in Python. According to the NLTK textbook, the library works under the following principles:
Simplicity: To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data
Consistency: To provide a uniform framework with consistent interfaces and data structures, and easily-guessable method names
Extensibility: To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task
Modularity: To provide components that can be used independently without needing to understand the rest of the toolkit
NLTK is widely used by researchers, developers, and data scientists worldwide to develop NLP applications and analyze text data. We will use NLTK for the most basic steps of NLP, particularly pre-processing and converting texts to matrices.
# !pip install nltk
import nltk
# download nltk data (opens a downloader window; close it when done)
# nltk.download()
We will see three different ways to import textual data:
Read more here about text data available with nltk: https://www.nltk.org/book/ch02.html
Let's see some examples below
# import the nltk gutenberg books
# see all books available
nltk.corpus.gutenberg.fileids()
# opening Jane Austen's Emma as a sequence of words
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
type(emma)
# converting to a list
emma_list = [w.lower() for w in emma]
# print
emma_list[:10]
To open text files saved locally, we use the file connection tools we learned earlier in the course.
I have in my working directory the first chapter of the Red Rising book I was reading earlier this year. Let's open it.
# open red_rising.txt
# Using 'with' to open the file
with open('red_rising.txt', 'r') as file:
    # Read the content of the file
    content = file.read()
# open as a string
content[0:1000]
# convert to a list using string methods
rr_lines = [c for c in content.split("\n") if c != ""]
rr_lines[0:5]
# By word
rr_words = content.split(" ")
rr_words[0:10]
# a bit more to remove line breaks
rr_words_ = [i for el in rr_words for i in el.split("\n")]
rr_words_
We will use Twitter data. This dataset contains the Twitter timelines of all members of the 117th Congress for 2021. It is a rich dataset, and interesting to play with for some descriptive text analysis.
We will work mostly with the column text.
import pandas as pd
import numpy as np
# Open data
tweets_data = pd.read_csv("tweets_congress.csv")
tweets_data.head()
tweets_data.shape
# reduce the size of the data a bit
import random
authors = tweets_data["author"].unique()[random.sample(range(1, 425), 10)]
tweets_data_ = tweets_data[tweets_data['author'].str.contains("|".join(authors))].copy()
tweets_data_.shape
Almost every data science task using text requires the data to be preprocessed before running any type of analysis. These tasks often consist of reducing noise in the text data - making the data more informative and less complex - and converting the data into formats computers understand.
The most common pre-processing steps are:
tokenization: splitting text into words or tokens.
normalization: converting text to lowercase and removing punctuation.
stop word removal: removing noise, i.e., words that carry little meaning. This usually involves a pre-defined list of stop words plus some domain-knowledge/context-dependent words.
stemming: removing suffixes from words, such as "ing" or "ed", to reduce them to their base form.
lemmatization: reducing words to their dictionary form; it relies on accurately determining the intended part of speech and the meaning of a word based on its context.
Important: pre-processing steps can profoundly change what your text looks like. See this article to understand in more depth some of the trade-offs associated with these choices: https://www.cambridge.org/core/journals/political-analysis/article/abs/text-preprocessing-for-unsupervised-learning-why-it-matters-when-it-misleads-and-what-to-do-about-it/AA7D4DE0AA6AB208502515AE3EC6989E
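Before applying these steps to the tweets, here is a minimal sketch of the whole pipeline on a single made-up sentence (assuming the usual nltk resources - punkt, stopwords, and wordnet - have already been downloaded):
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

sentence = "The Senators were debating the new infrastructure bills."

# tokenization: split the sentence into tokens
tokens = word_tokenize(sentence)

# normalization: lowercase and keep only alphabetic tokens
tokens = [t.lower() for t in tokens if t.isalpha()]

# stop word removal: drop common English words
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]

# stemming: chop suffixes ("debating" -> "debat")
porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]

# lemmatization: map words to dictionary forms ("bills" -> "bill")
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]

print(stems)
print(lemmas)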
The implementation of these steps consists of a mix of string methods and nltk methods. Let's see examples with the politicians' tweets dataset.
# import nltk methods
# stopwords
from nltk.corpus import stopwords
# tokenizer
from nltk.tokenize import word_tokenize
# lemmatizer
from nltk.stem import WordNetLemmatizer
# stemming
from nltk.stem.porter import PorterStemmer
We tokenize the text with word_tokenize() from nltk.
# apply to the text column of the (sub-sampled) dataframe
import time
tweets_data_["tokens"] = tweets_data_["text"].apply(word_tokenize)
# see
tweets_data_["tokens"]
isalpha(): string method used to remove punctuation (keep only alphabetic tokens)
lower(): string method used to convert text to lowercase
# normalization
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [word.lower() for word in x if word.isalpha()])
tweets_data_["tokens"].head()
We remove stop words using stopwords.words('english') from nltk.
# load the English stop words first
stop_words = stopwords.words('english')
print(stop_words)
Want to add some more?
stop_words = stop_words + (["dr", "mr", "miss","congressman","congresswomen", "http", "rt"])
# remove
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [word for word in x if word not in stop_words])
tweets_data_["tokens"]
We stem the tokens using nltk.stem.porter.PorterStemmer.
# instantiate the stemmer
porter = PorterStemmer()
# run
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [porter.stem(word) for word in x if word])
# see
tweets_data_["tokens"].head()
We will lemmatize the tokens using WordNetLemmatizer() from nltk.
# import
from nltk.stem import WordNetLemmatizer
# instantiate
lemmatizer = WordNetLemmatizer()
# run (it doesn't make much sense to lemmatize already-stemmed tokens, but just for your reference)
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x if word])
# see
tweets_data_["tokens"].tail()
As we saw in the lecture, our next step is to represent text numerically. We will do so by using the Bag of Words assumption. This assumption states that we represent text as an unordered set of words in a document.
Remember, the idea here is to represent text data as numbers. We do so by breaking the text into words and counting them. A standard way to do so is with a Document-Feature Matrix (DFM), where each row is a document and each column is a feature (word).
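As a minimal sketch of what a DFM looks like, here is a hand-built version for two made-up, already pre-processed documents:
from collections import Counter
import pandas as pd

# two toy documents, already pre-processed into tokens
docs = {"doc1": ["vote", "infrastructure", "bill", "vote"],
        "doc2": ["bill", "passed", "senate"]}

# count tokens per document and assemble a small document-feature matrix
toy_dfm = pd.DataFrame({name: Counter(tokens) for name, tokens in docs.items()}).T.fillna(0).astype(int)
toy_dfm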
To create a DFM, we will use the CountVectorizer() method from sklearn.
from sklearn.feature_extraction.text import CountVectorizer
# combine the pre-processed data
tweets_data_['tokens_join'] = tweets_data_['tokens'].apply(' '.join)
# instantiate a vectorizer
vectorizer = CountVectorizer()
# transform the data
dfm = vectorizer.fit_transform(tweets_data_['tokens_join'])
# output is a sparse matrix
type(dfm)
# Convert the matrix to an array and display it
feature_matrix = dfm.todense()
# super sparse matrix
feature_matrix
# Get feature names to use as dataframe column headers
feature_names = vectorizer.get_feature_names_out()
# Create a DataFrame with the feature matrix
df = pd.DataFrame(feature_matrix, columns=feature_names)
df
hugely sparse data!!
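We can quantify how sparse: the share of zero entries in a DFM like this is typically well above 99%. A quick sketch, using the nnz attribute of the scipy sparse matrix returned by fit_transform:
# fraction of cells in the DFM that are zero
sparsity = 1 - dfm.nnz / (dfm.shape[0] * dfm.shape[1])
print(f"{sparsity:.2%} of the entries are zero")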
With this representation, we can actually start visualizing some interesting patterns in the data.
For example, we can visualize the most distinctive words tweeted by each politician. In this case, we need to change the unit of analysis from tweets to politicians:
tweets_data.head()
# change unit of analysis
tweets_data_g = tweets_data.groupby(["author","State", "Party"])["text"].apply(lambda x: " ".join(x)).reset_index().copy()
tweets_data_g
# see
authors = ["RepAOC", "Ilhan", "SpeakerPelosi", "marcorubio", "SenatorTimScott",
"SenTedCruz", "Jim_Jordan", "GOPLeader"]
# make a copy
reps = tweets_data_g[tweets_data_g["author"].str.contains("|".join(authors))].copy()
reps
stop_words = stop_words + ["new", "https", "rt"]
# pre-process
# tokenize
reps["tokens"] = reps["text"].apply(word_tokenize)
# normalize
reps["tokens"] = reps["tokens"].apply(lambda x: [word.lower() for word in x if word.isalpha()])
# stem and stopwords
reps["tokens"] = reps["tokens"].apply(lambda x: [porter.stem(word) for word in x if word not in stop_words])
## Create dfm
# combine the pre-processed data
reps['tokens_join'] = reps['tokens'].apply(' '.join)
# instantiate a vectorizer
vectorizer = CountVectorizer()
# transform the data
dfm = vectorizer.fit_transform(reps['tokens_join'])
# convert df
dfm_d = pd.DataFrame(dfm.toarray(),
columns=vectorizer.get_feature_names_out(),
index=reps["author"])
# see the dataset
dfm_d
# overall most important features
index = dfm_d.sum().sort_values(ascending=False).index
index
# see the most important features
dfm_d[index]
# most frequent words by candidate
# clean to capture top 10 terms
dfm_d.index.name = "author_tweet"
# contained
df_list = list()
# get top terms by group
for id, row in dfm_d.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)
# concat
top_terms = pd.concat(df_list, axis=0)
top_terms
# visualize
from plotnine import *
# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
 geom_bar(stat='identity') +
 facet_wrap('~author_tweet', scales='free') +
 coord_flip() +  # To make horizontal bar plots
 theme_minimal() +  # Apply the complete theme first so it does not override the settings below
 theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
       axis_text_y=element_text(size=10),  # Adjust text size for y axis
       figure_size=(15, 10)) +  # Adjust the figure size
 labs(x='', y='Frequency')
)
Counting raw word frequencies is a bit simplistic. Let's look at other ways to count that retrieve more information:
N-grams: count sequences of words that appear together within a window of size N.
TF-IDF: a measure that weights a term's count in a document by how many documents the term appears in.
Inverse Document Frequency (IDF): $$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents } |D|}{\text{Number of documents with term } t \text{ in it}}\right) $$
TF-IDF: $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
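To make the formulas concrete, here is a minimal sketch computing IDF and TF-IDF by hand for one term in a toy corpus (note that sklearn's TfidfVectorizer uses a smoothed IDF and normalizes the rows by default, so its numbers will differ slightly):
import numpy as np

# toy corpus: raw counts of one term across 4 documents
counts = np.array([3, 0, 1, 0])

n_docs = len(counts)                  # |D| = 4
docs_with_term = np.sum(counts > 0)   # documents containing the term = 2

idf = np.log(n_docs / docs_with_term) # log(4 / 2) ~ 0.693
tfidf = counts * idf                  # TF-IDF of the term in each document

print(idf, tfidf)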
# bi-grams
# instantiate a vectorizer
vectorizer = CountVectorizer()
# get bigrams
vectorizer = CountVectorizer(
lowercase=True,
stop_words='english',
ngram_range=(2,2), ## see here is the main difference
# max_features=N # Optionally restricts to top N tokens
)
text_bi = vectorizer.fit_transform(reps['tokens_join'])
# Convert matrix to DataFrame with bigram columns
# convert df
text_bi_pd = pd.DataFrame(text_bi.toarray(),
columns=vectorizer.get_feature_names_out(),
index=reps["author"])
# see
text_bi_pd
# clean to capture top 10 terms
text_bi_pd.index.name = "author_tweet"
# contained
df_list = list()
# get top terms by group
for id, row in text_bi_pd.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)
# concat
top_terms = pd.concat(df_list, axis=0)
# see it
top_terms.head()
# visualize
from plotnine import *
# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
 geom_bar(stat='identity') +
 facet_wrap('~author_tweet', scales='free') +
 coord_flip() +  # To make horizontal bar plots
 theme_minimal() +  # Apply the complete theme first so it does not override the settings below
 theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
       axis_text_y=element_text(size=10),  # Adjust text size for y axis
       figure_size=(15, 10)) +  # Adjust the figure size
 labs(x='', y='Frequency')
)
# Term Frequency - Inverse Document Frequency (TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer
# instantiate a vectorizer
vectorizer = TfidfVectorizer()
# get tfidf
vectorizer = TfidfVectorizer(
lowercase=True,
stop_words='english',
# max_features=N # Optionally restricts to top N tokens
)
# transform
text_tfidf = vectorizer.fit_transform(reps['tokens_join'])
# convert df
text_tfidf_pd = pd.DataFrame(text_tfidf.toarray(),
columns=vectorizer.get_feature_names_out(),
index=reps["author"])
# clean to capture top 10 terms
text_tfidf_pd.index.name = "author_tweet"
# contained
df_list = list()
# get top terms by group
for id, row in text_tfidf_pd.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)
# concat
top_terms = pd.concat(df_list, axis=0)
top_terms
# visualize
from plotnine import *
# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
 geom_bar(stat='identity') +
 facet_wrap('~author_tweet', scales='free') +
 coord_flip() +  # To make horizontal bar plots
 theme_minimal() +  # Apply the complete theme first so it does not override the settings below
 theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
       axis_text_y=element_text(size=10),  # Adjust text size for y axis
       figure_size=(15, 10)) +  # Adjust the figure size
 labs(x='', y='TF-IDF score')
)
Repeat the process described above, but using a different grouping variable. In this case, you can:
either group using other variables in the data (day, party, state)
use other politicians.
Use one of the metrics above (counts, TF-IDF, or bigrams) to understand the most important words for each group.
# your code here
Let's now calculate measures of similarity between the authors of the tweets. Notice that this could be done tweet by tweet or at the politician level (all of a politician's tweets combined). We will focus on the latter just to make things more interesting.
Here is our similarity measure:
$$\text{Sim}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$$
where A and B are the vectors representing two documents, $A \cdot B$ is their dot product, and $\|A\|$ and $\|B\|$ are their norms.
We will use the tf-idf matrix as the input! The function (which is similar to the one you wrote in problem set 2) is implemented in the sklearn library.
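Before using sklearn, here is a minimal sketch of the formula applied to two small made-up count vectors with numpy:
import numpy as np

# toy term-count vectors for two documents
A = np.array([2, 0, 1, 3])
B = np.array([1, 1, 0, 2])

# cosine similarity: dot product divided by the product of the norms
cos_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
cos_sim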
# import
from sklearn.metrics.pairwise import cosine_similarity
# re-estimate tf-idf
vectorizer = TfidfVectorizer()
# transform
text_tfidf = vectorizer.fit_transform(reps['tokens_join'])
# Calculate the cosine similarity between all pairs in the matrix
cosine_sim = cosine_similarity(text_tfidf, text_tfidf )
# Display the cosine similarity matrix
cosine_sim
# convert to a df
author = reps["author"]
similarity_df = pd.DataFrame(cosine_sim, columns=reps["author"], index=reps["author"])
# similarity
similarity_df
# AOC closest to?
similarity_df["RepAOC"].sort_values(ascending=False)
# Jim Jordan closest to?
similarity_df["Jim_Jordan"].sort_values(ascending=False)
# Convert to tidy
df_tidy = similarity_df.reset_index().melt(id_vars='author', var_name='related_author', value_name='correlation')
df_tidy = df_tidy.sort_values(["author", "correlation"], ascending=False).copy()
# get order
order = df_tidy.tail(7).related_author
# Creating the heatmap
(ggplot(df_tidy, aes(x='author', y='related_author', fill='correlation'))
+ geom_tile()
+ scale_fill_gradient(low="white", high="blue",
limits=(.4, 1.01))
+ scale_x_discrete(limits=order)
+ scale_y_discrete(limits=order)
+ theme(axis_text_x=element_text(angle=90, hjust=1))
+ labs(title='Correlation Tile Matrix', x='Author', y='Related Author', fill='Correlation')
)
To estimate topic models, we will use the gensim library. gensim is a Python library for topic modelling, document indexing, and similarity retrieval with large corpora. It is also the main library for retrieving pre-trained word embeddings, or for training word embeddings with the famous word2vec algorithm.
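As an aside, here is a minimal sketch of loading pre-trained word vectors through gensim's downloader API; glove-wiki-gigaword-50 is assumed to be one of the models distributed through that API, and the first call downloads it:
import gensim.downloader as api

# download (first time only) and load a small set of pre-trained GloVe vectors
glove = api.load("glove-wiki-gigaword-50")

# words closest to "congress" in the embedding space
glove.most_similar("congress", topn=5)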
Here is a step-by-step guide to estimating an LDA model with gensim:
Preprocess the text: follow most of the steps we saw before, including tokenization, stop word removal, normalization, etc.
Create a dictionary: gensim requires you to create a dictionary of all stemmed/preprocessed words in the corpus (the collection of documents); the Dictionary class from gensim.corpora will create this data structure for us.
Filter out words from the dictionary that appear in either a very low proportion of documents (lower bound) or a very high proportion of documents (upper bound).
Create a bag-of-words representation of the documents: map each document to a list of (word id, count) pairs using the dictionary.
Estimate the topic model: use the LdaModel class within gensim.
# get a sample
td = tweets_data.iloc[random.sample(range(1, tweets_data.shape[0]), 1000)].copy()
Let's write a function that bundles all our previous pre-processing steps.
# Write a preprocessing function
def preprocess_text(text):
    # increase stop words
    stop_words = stopwords.words('english')
    stop_words = stop_words + ["http"]
    # tokenization
    tokens_ = word_tokenize(text)
    # normalization: lowercase and keep only alphabetic tokens
    tokens_ = [word.lower() for word in tokens_ if word.isalpha()]
    # stemming and stop word removal
    tokens_ = [porter.stem(word) for word in tokens_ if word not in stop_words]
    # Return the preprocessed tokens as a list
    return tokens_
# apply
td["tokens"] = td["text"].apply(preprocess_text)
# import Dictionary from gensim
from gensim.corpora import Dictionary
# convert to a list
tokens = td["tokens"].tolist()
# let's look what this input is.
# should be a list of list for each document split by tokens
tokens[1]
# Create a dictionary representation of the documents
dictionary = Dictionary(tokens)
# see
dictionary.token2id
This is an additional pre-processing task. More meaningful topics emerge when we remove rare and overly common words.
# Filter out words that occur in less than 20 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
# Create a bag-of-words representation of the documents
# notice that here you are just passing every doc to the .doc2bow method
corpus = [dictionary.doc2bow(doc) for doc in tokens]
# see case by case
# a list of (token id, frequency) tuples
dictionary.doc2bow(tokens[0])
from gensim.models.ldamodel import LdaModel
# Make an index to word dictionary
temp = dictionary[0] # This is only to "load" the dictionary
id2word = dictionary.id2token
# Train the LDA model
lda_model = LdaModel(
corpus=corpus,
id2word=dictionary,
num_topics=10,
eval_every=False
)
# Print the keywords in the 10 topics
lda_model.print_topics()
# Extract the topic distribution for each document
td['topic'] = [sorted(lda_model[corpus][text]) for text in range(len(td["text"]))]
# expand the dataframe
df_exploded = td["topic"].explode().reset_index()
# separate information
df_exploded[["topic", "probability"]] = pd.DataFrame(df_exploded['topic'].tolist(), index=df_exploded.index)
# data frame with the distribution for each topic vs document
df_exploded
# merge
df_exploded = pd.merge(df_exploded, td.reset_index(), on="index")
# topic prevalence
tp_prev = df_exploded.groupby("topic_x")["probability"].mean().reset_index()
tp_prev.sort_values("probability", ascending=False)
# Get the most important words for each topic
topic_words = list()
for i in range(lda_model.num_topics):
    # Get the top words for the topic
    words = lda_model.show_topic(i, topn=10)
    topic_words.append(", ".join([word for word, prob in words]))
topic_words
tp_prev["words"] = topic_words
tp_prev
This is a very nice representation of the topics. You can merge it back with the core dataset and examine how the topic distributions differ across candidates, parties, time of day, or any other grouping variable you have.
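For example, here is a minimal sketch comparing average topic prevalence across parties, assuming the Party column is carried into df_exploded by the merge above:
# average topic probability by party
topic_by_party = (df_exploded
                  .groupby(["Party", "topic_x"])["probability"]
                  .mean()
                  .reset_index()
                  .sort_values(["Party", "probability"], ascending=[True, False]))
topic_by_party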
!jupyter nbconvert _week_11_nlp_I.ipynb --to html --template classic