In the class today, we will start learning about working with text as data in Python. This notebook will cover:
Descriptive analysis of text
Unsupervised learning: document similarity and topic models
Computer scientists often refer to Natural Language Processing (NLP) as the field that focuses on developing tools to process natural language (text, audio, video) with computers. On the other side, social scientists and applied data scientists often use terms such as text-as-data, or even computational linguistics, for the field that focuses on developing tools/models to incorporate textual data into their data analysis pipelines.
These fields are very closely connected, share similar methodological tools, and develop solutions to similar problems. Some of their applications differ, however, reflecting the nature of each field. For example:
Computer scientists (NLP) are often more interested in tasks such as machine translation, chatbots and virtual assistants, generative AI, and speech recognition, among others.
Social scientists (text-as-data) are often more interested in tasks such as document similarity, topic discovery, content analysis, and text classification.
My approach is to consider these perspectives as an integrated disciplinary field that focuses on different tasks, rather than as two separate perspectives. For this reason, I will often use the terms NLP and text-as-data interchangeably, even though most of the applications we will see in the next two weeks are closer to a social scientist's applied perspective on working with text data.
In the Spring, I will teach a full-semester course on Text-as-Data. You can check the syllabus; it mixes tasks/methods from both fields.
The Natural Language Toolkit (NLTK) is a popular open-source library for natural language processing (NLP) in Python. According to the NLTK textbook, the library works under the following principles:
Simplicity: To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data
Consistency: To provide a uniform framework with consistent interfaces and data structures, and easily-guessable method names
Extensibility: To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task
Modularity: To provide components that can be used independently without needing to understand the rest of the toolkit
NLTK is widely used by researchers, developers, and data scientists worldwide to develop NLP applications and analyze text data. We will use NLTK for the most basic steps of NLP, particularly pre-processing and converting texts to matrices.
# !pip install nltk
import nltk
# download nltk data (opens a downloader window; close it when done)
# nltk.download()
We will see three different ways to import textual data:
Read more here about text data available with nltk: https://www.nltk.org/book/ch02.html
Let's see some examples below
# import the nltk gutenberg books
# see all books available
nltk.corpus.gutenberg.fileids()
# opening Jane Austen's Emma as a sequence of words
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
type(emma)
# converting to a list
emma_list = [w.lower() for w in emma]
# print
emma_list[:10]
To open text files saved locally, we use the file connection tools we learned earlier in the course.
I have in my working directory the first chapter of the Red Rising book I was reading earlier this year. Let's open it.
# open red_rising.txt
# Using 'with' to open the file
with open('red_rising.txt', 'r') as file:
    # Read the content of the file
    content = file.read()
# open as a string
content[0:1000]
# convert to a list using string methods
rr_lines = [c for c in content.split("\n") if c != ""]
rr_lines[0:5]
# By word
rr_words = content.split(" ")
rr_words[0:10]
# a bit more to remove line breaks
rr_words_ = [i for el in rr_words for i in el.split("\n")]
rr_words_
We will use Twitter data. This dataset contains the Twitter timelines of all members of the 117th Congress for 2021. It is a rich dataset, and interesting to play with for some descriptive text analysis.
We will work mostly with the column text.
import pandas as pd
import numpy as np
# Open data
tweets_data = pd.read_csv("tweets_congress.csv")
tweets_data.head()
tweets_data.shape
# reduce the size of the data a bit
import random
authors = tweets_data["author"].unique()[random.sample(range(1, 425), 10)]
tweets_data_ = tweets_data[tweets_data['author'].str.contains("|".join(authors))].copy()
tweets_data_.shape
Almost every data science task using text requires the data to be preprocessed before running any type of analysis. These tasks often consist of reducing noise in the text data - making the data more informative and less complex - and converting the data into formats computers understand.
The most common pre-processing steps are:
tokenization: splitting text into words or tokens.
normalization: converting text to lowercase and removing punctuation.
stop word removal: removing noise, i.e., words that carry little meaning. This usually involves a pre-defined list of stop words plus some domain-knowledge/context-dependent words.
stemming: removing suffixes from words, such as "ing" or "ed", to reduce them to their base form.
lemmatization: reducing words to their dictionary form; it relies on accurately determining the intended part of speech and the meaning of a word based on its context.
Important: pre-processing steps can profoundly change what your text looks like. See this article to understand in more depth some of the trade-offs associated with these choices: https://www.cambridge.org/core/journals/political-analysis/article/abs/text-preprocessing-for-unsupervised-learning-why-it-matters-when-it-misleads-and-what-to-do-about-it/AA7D4DE0AA6AB208502515AE3EC6989E
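Before applying these steps to the tweets, here is a minimal sketch of the whole pipeline on a single made-up sentence (assuming the usual nltk resources - punkt, stopwords, and wordnet - have already been downloaded):
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

sentence = "The Senators were debating the new infrastructure bills."

# tokenization: split the sentence into tokens
tokens = word_tokenize(sentence)

# normalization: lowercase and keep only alphabetic tokens
tokens = [t.lower() for t in tokens if t.isalpha()]

# stop word removal: drop common English words
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]

# stemming: chop suffixes ("debating" -> "debat")
porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]

# lemmatization: map words to dictionary forms ("bills" -> "bill")
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]

print(stems)
print(lemmas)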
The implementation of these steps consists of a mix of string methods and nltk methods. Let's see examples with the politicians' tweets dataset.
# import nltk methods
# stopwords
from nltk.corpus import stopwords
# tokenizer
from nltk.tokenize import word_tokenize
# lemmatizer
from nltk.stem import WordNetLemmatizer
# stemming
from nltk.stem.porter import PorterStemmer
We tokenize the text with word_tokenize() from nltk.
# apply to the text column of the (sub-sampled) dataframe
import time
tweets_data_["tokens"] = tweets_data_["text"].apply(word_tokenize)
# see
tweets_data_["tokens"]
isalpha(): string method used to remove punctuation (keep only alphabetic tokens)
lower(): string method used to convert text to lowercase
# normalization
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [word.lower() for word in x if word.isalpha()])
tweets_data_["tokens"].head()
We remove stop words using stopwords.words('english') from nltk.
# load the English stop words first
stop_words = stopwords.words('english')
print(stop_words)
Want to add some more?
stop_words = stop_words + (["dr", "mr", "miss","congressman","congresswomen", "http", "rt"])
# remove
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [word for word in x if word not in stop_words])
tweets_data_["tokens"]
We stem the tokens using nltk.stem.porter.PorterStemmer.
# instantiate the stemmer
porter = PorterStemmer()
# run
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [porter.stem(word) for word in x if word])
# see
tweets_data_["tokens"].head()
We will lemmatize the tokens using WordNetLemmatizer() from nltk.
# import
from nltk.stem import WordNetLemmatizer
# instantiate
lemmatizer = WordNetLemmatizer()
# run (it doesn't make much sense to lemmatize already-stemmed tokens, but just for your reference)
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x if word])
# see
tweets_data_["tokens"].tail()
As we saw in the lecture, our next step is to represent text numerically. We will do so by using the Bag of Words assumption. This assumption states that we represent text as an unordered set of words in a document.
Remember, the idea here is to represent text data as numbers. We do so by breaking the text into words and counting them. A standard way to do so is with a Document-Feature Matrix (DFM), where each row is a document and each column is a feature (word).
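As a minimal sketch of what a DFM looks like, here is a hand-built version for two made-up, already pre-processed documents:
from collections import Counter
import pandas as pd

# two toy documents, already pre-processed into tokens
docs = {"doc1": ["vote", "infrastructure", "bill", "vote"],
        "doc2": ["bill", "passed", "senate"]}

# count tokens per document and assemble a small document-feature matrix
toy_dfm = pd.DataFrame({name: Counter(tokens) for name, tokens in docs.items()}).T.fillna(0).astype(int)
toy_dfm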
To create a DFM, we will use the CountVectorizer() method from sklearn.
from sklearn.feature_extraction.text import CountVectorizer
# combine the pre-processed data
tweets_data_['tokens_join'] = tweets_data_['tokens'].apply(' '.join)
# instantiate a vectorizer
vectorizer = CountVectorizer()
# transform the data
dfm = vectorizer.fit_transform(tweets_data_['tokens_join'])
# output is a sparse matrix
type(dfm)
# Convert the matrix to an array and display it
feature_matrix = dfm.todense()
# super sparse matrix
feature_matrix
# Get feature names to use as dataframe column headers
feature_names = vectorizer.get_feature_names_out()
# Create a DataFrame with the feature matrix
df = pd.DataFrame(feature_matrix, columns=feature_names)
df
hugely sparse data!!
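We can quantify how sparse: the share of zero entries in a DFM like this is typically well above 99%. A quick sketch, using the nnz attribute of the scipy sparse matrix returned by fit_transform:
# fraction of cells in the DFM that are zero
sparsity = 1 - dfm.nnz / (dfm.shape[0] * dfm.shape[1])
print(f"{sparsity:.2%} of the entries are zero")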
With this representation, we can actually start visualizing some interesting patterns in the data.
For example, we can visualize the most distinctive words tweeted by each politician. In this case, we need to change the unit of analysis from tweets to politicians:
tweets_data.head()
# change unit of analysis
tweets_data_g = tweets_data.groupby(["author","State", "Party"])["text"].apply(lambda x: " ".join(x)).reset_index().copy()
tweets_data_g
# see
authors = ["RepAOC", "Ilhan", "SpeakerPelosi", "marcorubio", "SenatorTimScott",
"SenTedCruz", "Jim_Jordan", "GOPLeader"]
# make a copy
reps = tweets_data_g[tweets_data_g["author"].str.contains("|".join(authors))].copy()
reps
stop_words = stop_words + ["new", "https", "rt"]
# pre-process
# tokenize
reps["tokens"] = reps["text"].apply(word_tokenize)
# normalize
reps["tokens"] = reps["tokens"].apply(lambda x: [word.lower() for word in x if word.isalpha()])
# stem and stopwords
reps["tokens"] = reps["tokens"].apply(lambda x: [porter.stem(word) for word in x if word not in stop_words])
## Create dfm
# combine the pre-processed data
reps['tokens_join'] = reps['tokens'].apply(' '.join)
# instantiate a vectorizer
vectorizer = CountVectorizer()
# transform the data
dfm = vectorizer.fit_transform(reps['tokens_join'])
# convert df
dfm_d = pd.DataFrame(dfm.toarray(),
columns=vectorizer.get_feature_names_out(),
index=reps["author"])
# see the dataset
dfm_d
# overall most important features
index = dfm_d.sum().sort_values(ascending=False).index
index
# see the most important features
dfm_d[index]
# most frequent words by candidate
# clean to capture top 10 terms
dfm_d.index.name = "author_tweet"
# contained
df_list = list()
# get top terms by group
for id, row in dfm_d.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)
# concat
top_terms = pd.concat(df_list, axis=0)
top_terms
# visualize
from plotnine import *
# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
 geom_bar(stat='identity') +
 facet_wrap('~author_tweet', scales='free') +
 coord_flip() +  # To make horizontal bar plots
 theme_minimal() +  # Apply the complete theme first so it does not override the settings below
 theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
       axis_text_y=element_text(size=10),  # Adjust text size for y axis
       figure_size=(15, 10)) +  # Adjust the figure size
 labs(x='', y='Frequency')
)
Counting raw word frequencies is a bit simplistic. Let's look at other ways to count that retrieve more information:
N-grams: count sequences of words that appear together within a window of size N.
TF-IDF: a measure that weights a term's count in a document by how many documents the term appears in.
Inverse Document Frequency (IDF): $$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents } |D|}{\text{Number of documents with term } t \text{ in it}}\right) $$
TF-IDF: $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
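To make the formulas concrete, here is a minimal sketch computing IDF and TF-IDF by hand for one term in a toy corpus (note that sklearn's TfidfVectorizer uses a smoothed IDF and normalizes the rows by default, so its numbers will differ slightly):
import numpy as np

# toy corpus: raw counts of one term across 4 documents
counts = np.array([3, 0, 1, 0])

n_docs = len(counts)                  # |D| = 4
docs_with_term = np.sum(counts > 0)   # documents containing the term = 2

idf = np.log(n_docs / docs_with_term) # log(4 / 2) ~ 0.693
tfidf = counts * idf                  # TF-IDF of the term in each document

print(idf, tfidf)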
# bi-grams
# instantiate a vectorizer
vectorizer = CountVectorizer()
# get bigrams
vectorizer = CountVectorizer(
lowercase=True,
stop_words='english',
ngram_range=(2,2), ## see here is the main difference
# max_features=N # Optionally restricts to top N tokens
)
text_bi = vectorizer.fit_transform(reps['tokens_join'])
# Convert matrix to DataFrame with bigram columns
# convert df
text_bi_pd = pd.DataFrame(text_bi.toarray(),
columns=vectorizer.get_feature_names_out(),
index=reps["author"])
# see
text_bi_pd
# clean to capture top 10 terms
text_bi_pd.index.name = "author_tweet"
# contained
df_list = list()
# get top terms by group
for id, row in text_bi_pd.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)
# concat
top_terms = pd.concat(df_list, axis=0)
# see it
top_terms.head()
# visualize
from plotnine import *
# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
 geom_bar(stat='identity') +
 facet_wrap('~author_tweet', scales='free') +
 coord_flip() +  # To make horizontal bar plots
 theme_minimal() +  # Apply the complete theme first so it does not override the settings below
 theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
       axis_text_y=element_text(size=10),  # Adjust text size for y axis
       figure_size=(15, 10)) +  # Adjust the figure size
 labs(x='', y='Frequency')
)
# Term Frequency - Inverse Document Frequency (TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer
# instantiate a vectorizer
vectorizer = TfidfVectorizer()
# get tfidf
vectorizer = TfidfVectorizer(
lowercase=True,
stop_words='english',
# max_features=N # Optionally restricts to top N tokens
)
# transform
text_tfidf = vectorizer.fit_transform(reps['tokens_join'])
# convert df
text_tfidf_pd = pd.DataFrame(text_tfidf.toarray(),
columns=vectorizer.get_feature_names_out(),
index=reps["author"])
# clean to capture top 10 terms
text_tfidf_pd.index.name = "author_tweet"
# contained
df_list = list()
# get top terms by group
for id, row in text_tfidf_pd.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)
# concat
top_terms = pd.concat(df_list, axis=0)
top_terms
# visualize
from plotnine import *
# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
 geom_bar(stat='identity') +
 facet_wrap('~author_tweet', scales='free') +
 coord_flip() +  # To make horizontal bar plots
 theme_minimal() +  # Apply the complete theme first so it does not override the settings below
 theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
       axis_text_y=element_text(size=10),  # Adjust text size for y axis
       figure_size=(15, 10)) +  # Adjust the figure size
 labs(x='', y='TF-IDF score')
)
Repeat the process described above, but using a different grouping variable. In this case, you can:
either group using other variables in the data (day, party, state)
use other politicians.
Use one of the metrics above (counts, TF-IDF, or bigrams) to understand the most important words for each group.
# your code here
Let's now calculate measures of similarity between the authors of the tweets. Notice that this could be done tweet by tweet or at the politician level (all of a politician's tweets combined). We will focus on the latter just to make things more interesting.
Here is our similarity measure:
$$\text{Sim}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$$
where A and B are the vectors representing two documents, $A \cdot B$ is their dot product, and $\|A\|$ and $\|B\|$ are their norms.
We will use the tf-idf matrix as the input! The function (which is similar to the one you wrote in problem set 2) is implemented in the sklearn library.
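Before using sklearn, here is a minimal sketch of the formula applied to two small made-up count vectors with numpy:
import numpy as np

# toy term-count vectors for two documents
A = np.array([2, 0, 1, 3])
B = np.array([1, 1, 0, 2])

# cosine similarity: dot product divided by the product of the norms
cos_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
cos_sim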
# import
from sklearn.metrics.pairwise import cosine_similarity
# re-estimate tf-idf
vectorizer = TfidfVectorizer()
# transform
text_tfidf = vectorizer.fit_transform(reps['tokens_join'])
# Calculate the cosine similarity between all pairs in the matrix
cosine_sim = cosine_similarity(text_tfidf, text_tfidf )
# Display the cosine similarity matrix
cosine_sim
# convert to a df
author = reps["author"]
similarity_df = pd.DataFrame(cosine_sim, columns=reps["author"], index=reps["author"])
# similarity
similarity_df
# AOC closest to?
similarity_df["RepAOC"].sort_values(ascending=False)
# Jim Jordan closest to?
similarity_df["Jim_Jordan"].sort_values(ascending=False)
# Convert to tidy
df_tidy = similarity_df.reset_index().melt(id_vars='author', var_name='related_author', value_name='correlation')
df_tidy = df_tidy.sort_values(["author", "correlation"], ascending=False).copy()
# get order
order = df_tidy.tail(7).related_author
# Creating the heatmap
(ggplot(df_tidy, aes(x='author', y='related_author', fill='correlation'))
+ geom_tile()
+ scale_fill_gradient(low="white", high="blue",
limits=(.4, 1.01))
+ scale_x_discrete(limits=order)
+ scale_y_discrete(limits=order)
+ theme(axis_text_x=element_text(angle=90, hjust=1))
+ labs(title='Correlation Tile Matrix', x='Author', y='Related Author', fill='Correlation')
)
To estimate topic models, we will use the gensim library. gensim is a Python library for topic modelling, document indexing, and similarity retrieval with large corpora. It is also the main library for retrieving pre-trained word embeddings, or for training word embeddings with the famous word2vec algorithm.
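As an aside, here is a minimal sketch of loading pre-trained word vectors through gensim's downloader API; glove-wiki-gigaword-50 is assumed to be one of the models distributed through that API, and the first call downloads it:
import gensim.downloader as api

# download (first time only) and load a small set of pre-trained GloVe vectors
glove = api.load("glove-wiki-gigaword-50")

# words closest to "congress" in the embedding space
glove.most_similar("congress", topn=5)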
Here is a step-by-step guide to estimating an LDA model with gensim:
Preprocess the text: follow most of the steps we saw before, including tokenization, stop word removal, normalization, etc.
Create a dictionary: gensim requires you to create a dictionary of all stemmed/preprocessed words in the corpus (the collection of documents); the Dictionary class from gensim.corpora will create this data structure for us.
Filter out words from the dictionary that appear in either a very low proportion of documents (lower bound) or a very high proportion of documents (upper bound).
Create a bag-of-words representation of the documents: map each document to a list of (word id, count) pairs using the dictionary.
Estimate the topic model: use the LdaModel class within gensim.
# get a sample
td = tweets_data.iloc[random.sample(range(1, tweets_data.shape[0]), 1000)].copy()
Let's write a function that bundles all our previous pre-processing steps.
# Write a preprocessing function
def preprocess_text(text):
    # increase stop words
    stop_words = stopwords.words('english')
    stop_words = stop_words + ["http"]
    # tokenization
    tokens_ = word_tokenize(text)
    # normalization: lowercase and keep only alphabetic tokens
    tokens_ = [word.lower() for word in tokens_ if word.isalpha()]
    # stemming and stop word removal
    tokens_ = [porter.stem(word) for word in tokens_ if word not in stop_words]
    # Return the preprocessed tokens as a list
    return tokens_
# apply
td["tokens"] = td["text"].apply(preprocess_text)
# import Dictionary from gensim
from gensim.corpora import Dictionary
# convert to a list
tokens = td["tokens"].tolist()
# let's look what this input is.
# should be a list of list for each document split by tokens
tokens[1]
# Create a dictionary representation of the documents
dictionary = Dictionary(tokens)
# see
dictionary.token2id
This is an additional pre-processing task. More meaningful topics emerge when we remove rare and overly common words.
# Filter out words that occur in less than 20 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
# Create a bag-of-words representation of the documents
# notice that here you are just passing every doc to the .doc2bow method
corpus = [dictionary.doc2bow(doc) for doc in tokens]
# see case by case
# a list of (token id, frequency) tuples
dictionary.doc2bow(tokens[0])
from gensim.models.ldamodel import LdaModel
# Make an index to word dictionary
temp = dictionary[0] # This is only to "load" the dictionary
id2word = dictionary.id2token
# Train the LDA model
lda_model = LdaModel(
corpus=corpus,
id2word=dictionary,
num_topics=10,
eval_every=False
)
# Print the keywords in the 10 topics
lda_model.print_topics()
# Extract the topic distribution for each document
td['topic'] = [sorted(lda_model[corpus][text]) for text in range(len(td["text"]))]
# expand the dataframe
df_exploded = td["topic"].explode().reset_index()
# separate information
df_exploded[["topic", "probability"]] = pd.DataFrame(df_exploded['topic'].tolist(), index=df_exploded.index)
# data frame with the distribution for each topic vs document
df_exploded
# merge
df_exploded = pd.merge(df_exploded, td.reset_index(), on="index")
# topic prevalence
tp_prev = df_exploded.groupby("topic_x")["probability"].mean().reset_index()
tp_prev.sort_values("probability", ascending=False)
# Get the most important words for each topic
topic_words = list()
for i in range(lda_model.num_topics):
    # Get the top words for the topic
    words = lda_model.show_topic(i, topn=10)
    topic_words.append(", ".join([word for word, prob in words]))
topic_words
tp_prev["words"] = topic_words
tp_prev
This is a very nice representation of the topics. You can merge it back with the core dataset and examine how the topic distributions differ across candidates, parties, time of day, or any other grouping variable you have.
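For example, here is a minimal sketch comparing average topic prevalence across parties, assuming the Party column is carried into df_exploded by the merge above:
# average topic probability by party
topic_by_party = (df_exploded
                  .groupby(["Party", "topic_x"])["probability"]
                  .mean()
                  .reset_index()
                  .sort_values(["Party", "probability"], ascending=[True, False]))
topic_by_party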
!jupyter nbconvert _week_11_nlp_I.ipynb --to html --template classic