PPOL 5203 Data Science I: Foundations

Working with Text as Data

Tiago Ventura


Learning Goals

In the class today, we will start learning about working with text as data in Python. This notebook will cover:

  • Opening text data in Python
  • Strategies to pre-process data
  • Converting text to a numerical representation (sparse matrix)
  • Descriptive analysis of text

    • Bigrams and TF-IDF
    • Similarity of documents
  • Unsupervised Learning:

    • topic models

Natural Language Processing (NLP) vs Text-as-Data

Computer scientists often refer to Natural Language Processing (NLP) as the field that focuses on developing tools to process natural language (text, audio, video) using computers. On the other side, social scientists and applied data scientists often use terms such as text-as-data, or even computational linguistics, for the field that focuses on developing tools and models to incorporate textual data into their data analysis pipelines.

These fields are closely connected, share similar methodological tools, and develop solutions to similar problems. Some of their applications differ, however, reflecting the priorities of each field. For example:

  • Computer scientists (NLP) are often more interested in tasks such as machine translation, chatbots and virtual assistants, generative AI, and speech recognition, among others.

  • Social scientists (text-as-data) are often more interested in tasks such as document similarity, topic discovery, content analysis, and text classification.

My approach is to treat these perspectives as an integrated disciplinary field that focuses on different tasks rather than as two separate perspectives. For this reason, I will often use the terms NLP and text-as-data interchangeably, even though most of the applications we will see in the next two weeks are closer to a social scientist's applied perspective on working with text data.

In the Spring, I will teach a full-semester course on Text-as-Data. You can see the syllabus; it mixes tasks and methods from both fields.

NLTK

The Natural Language Toolkit (NLTK) is a popular open-source library for natural language processing (NLP) in Python. According to the NLTK textbook, the library works under the following principles:

  • Simplicity: To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data

  • Consistency: To provide a uniform framework with consistent interfaces and data structures, and easily-guessable method names

  • Extensibility: To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task

  • Modularity: To provide components that can be used independently without needing to understand the rest of the toolkit

NLTK is widely used by researchers, developers, and data scientists worldwide to develop NLP applications and analyze text data. We will use NLTK for the most basic steps of NLP, particularly pre-processing and converting texts to matrices.

NLTK Setup

In [1]:
# !pip install nltk
import nltk

# download nltk and close
# nltk.download()
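
Calling nltk.download() with no arguments opens an interactive downloader window. If you only want the specific resources used in this notebook, here is a minimal sketch (these are standard NLTK resource ids):

In [ ]:
# download only the resources used in this notebook
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer
nltk.download('gutenberg')  # Project Gutenberg sample corpus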

Importing Text Data

We will see three different ways to import textual data:

  • Working with textual data from NLTK
  • Importing as .txt locally
  • Importing text as a column in Pandas

nltk data libraries

Read more here about text data available with nltk: https://www.nltk.org/book/ch02.html

Let's see some examples below

In [2]:
# import nltk gutenberg books
# see all books available
nltk.corpus.gutenberg.fileids()
Out[2]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
In [3]:
# opening jane austen's emma as words
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
In [4]:
type(emma)
Out[4]:
nltk.corpus.reader.util.StreamBackedCorpusView
In [5]:
# converting to a list
emma_list = [w.lower() for w in emma]

# print
emma_list[:10]
Out[5]:
['[', 'emma', 'by', 'jane', 'austen', '1816', ']', 'volume', 'i', 'chapter']
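
Besides words(), the Gutenberg corpus reader also exposes the text at other units of analysis, for example raw characters and tokenized sentences (raw() and sents() are standard nltk corpus reader methods):

In [ ]:
# same book, different units of analysis
emma_raw = nltk.corpus.gutenberg.raw('austen-emma.txt')      # one long string
emma_sents = nltk.corpus.gutenberg.sents('austen-emma.txt')  # list of tokenized sentences

emma_raw[:60], emma_sents[1]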

Open local text files

To open text files saved locally, we use the file connection tools we learned earlier in the course.

I have in my working directory the first chapter of Red Rising, a book I was reading earlier this year. Let's open it.

In [6]:
# open red_rising.txt

# Using 'with' to open the file
with open('red_rising.txt', 'r') as file:
    # Read the content of the file
    content = file.read()
In [7]:
# open as a string
content[0:1000]
Out[7]:
'Helldiver\n\nThe first thing you should know about me is I am my father’s son. And when they came for him, I did as he asked. I did not cry. Not when the Society televised the arrest. Not when the Golds tried him. Not when the Grays hanged him. Mother hit me for that. My brother Kieran was supposed to be the stoic one. He was the elder, I the younger. I was supposed to cry. Instead, Kieran bawled like a girl when Little Eo tucked a haemanthus into Father’s left workboot and ran back to her own father’s side. My sister Leanna murmured a lament beside me. I just watched and thought it a shame that he died dancing but without his dancing shoes.\n\nOn Mars there is not much gravity. So you have to pull the feet to break the neck. They let the loved ones do it.\n\nI smell my own stink inside my frysuit. The suit is some kind of nanoplastic and is hot as its name suggests. It insulates me toe to head. Nothing gets in. Nothing gets out. Especially not the heat. Worst part is you can’t wipe the swea'
In [8]:
# convert to a list using string methods
rr_lines = [c for c in content.split("\n") if c != ""]
rr_lines[0:5]
Out[8]:
['Helldiver',
 'The first thing you should know about me is I am my father’s son. And when they came for him, I did as he asked. I did not cry. Not when the Society televised the arrest. Not when the Golds tried him. Not when the Grays hanged him. Mother hit me for that. My brother Kieran was supposed to be the stoic one. He was the elder, I the younger. I was supposed to cry. Instead, Kieran bawled like a girl when Little Eo tucked a haemanthus into Father’s left workboot and ran back to her own father’s side. My sister Leanna murmured a lament beside me. I just watched and thought it a shame that he died dancing but without his dancing shoes.',
 'On Mars there is not much gravity. So you have to pull the feet to break the neck. They let the loved ones do it.',
 'I smell my own stink inside my frysuit. The suit is some kind of nanoplastic and is hot as its name suggests. It insulates me toe to head. Nothing gets in. Nothing gets out. Especially not the heat. Worst part is you can’t wipe the sweat from your eyes. Bloodydamn stings as it goes through the headband to puddle at the heels. Not to mention the stink when you piss. Which you always do. Gotta take in a load of water through the drinktube. I guess you could be fit with a catheter. We choose the stink.',
 'The drillers of my clan chatter some gossip over the comm in my ear as I ride atop the clawDrill. I’m alone in this deep tunnel on a machine built like a titanic metal hand, one that grasps and gnaws at the ground. I control its rockmelting digits from the holster seat atop the drill, just where the elbow joint would be. There, my fingers fit into control gloves that manipulate the many tentacle-like drills some ninety meters below my perch. To be a Helldiver, they say your fingers must flicker fast as tongues of fire. Mine flicker faster.']
In [9]:
# By word
rr_words = content.split(" ")
rr_words[0:10]
Out[9]:
['Helldiver\n\nThe',
 'first',
 'thing',
 'you',
 'should',
 'know',
 'about',
 'me',
 'is',
 'I']
In [10]:
# a bit more to remove line breaks
rr_words_ = [i for el in rr_words for i in el.split("\n")]
rr_words_
Out[10]:
['Helldiver',
 '',
 'The',
 'first',
 'thing',
 'you',
 'should',
 'know',
 'about',
 'me',
 'is',
 'I',
 'am',
 'my',
 'father’s',
 'son.',
 'And',
 'when',
 'they',
 'came',
 'for',
 'him,',
 'I',
 'did',
 'as',
 'he',
 'asked.',
 'I',
 'did',
 'not',
 'cry.',
 'Not',
 'when',
 'the',
 'Society',
 'televised',
 'the',
 'arrest.',
 'Not',
 'when',
 'the',
 'Golds',
 'tried',
 'him.',
 'Not',
 'when',
 'the',
 'Grays',
 'hanged',
 'him.',
 'Mother',
 'hit',
 'me',
 'for',
 'that.',
 'My',
 'brother',
 'Kieran',
 'was',
 'supposed',
 'to',
 'be',
 'the',
 'stoic',
 'one.',
 'He',
 'was',
 'the',
 'elder,',
 'I',
 'the',
 'younger.',
 'I',
 'was',
 'supposed',
 'to',
 'cry.',
 'Instead,',
 'Kieran',
 'bawled',
 'like',
 'a',
 'girl',
 'when',
 'Little',
 'Eo',
 'tucked',
 'a',
 'haemanthus',
 'into',
 'Father’s',
 'left',
 'workboot',
 'and',
 'ran',
 'back',
 'to',
 'her',
 'own',
 'father’s',
 'side.',
 'My',
 'sister',
 'Leanna',
 'murmured',
 'a',
 'lament',
 'beside',
 'me.',
 'I',
 'just',
 'watched',
 'and',
 'thought',
 'it',
 'a',
 'shame',
 'that',
 'he',
 'died',
 'dancing',
 'but',
 'without',
 'his',
 'dancing',
 'shoes.',
 '',
 'On',
 'Mars',
 'there',
 'is',
 'not',
 'much',
 'gravity.',
 'So',
 'you',
 'have',
 'to',
 'pull',
 'the',
 'feet',
 'to',
 'break',
 'the',
 'neck.',
 'They',
 'let',
 'the',
 'loved',
 'ones',
 'do',
 'it.',
 '',
 'I',
 'smell',
 'my',
 'own',
 'stink',
 'inside',
 'my',
 'frysuit.',
 'The',
 'suit',
 'is',
 'some',
 'kind',
 'of',
 'nanoplastic',
 'and',
 'is',
 'hot',
 'as',
 'its',
 'name',
 'suggests.',
 'It',
 'insulates',
 'me',
 'toe',
 'to',
 'head.',
 'Nothing',
 'gets',
 'in.',
 'Nothing',
 'gets',
 'out.',
 'Especially',
 'not',
 'the',
 'heat.',
 'Worst',
 'part',
 'is',
 'you',
 'can’t',
 'wipe',
 'the',
 'sweat',
 'from',
 'your',
 'eyes.',
 'Bloodydamn',
 'stings',
 'as',
 'it',
 'goes',
 'through',
 'the',
 'headband',
 'to',
 'puddle',
 'at',
 'the',
 'heels.',
 'Not',
 'to',
 'mention',
 'the',
 'stink',
 'when',
 'you',
 'piss.',
 'Which',
 'you',
 'always',
 'do.',
 'Gotta',
 'take',
 'in',
 'a',
 'load',
 'of',
 'water',
 'through',
 'the',
 'drinktube.',
 'I',
 'guess',
 'you',
 'could',
 'be',
 'fit',
 'with',
 'a',
 'catheter.',
 'We',
 'choose',
 'the',
 'stink.',
 '',
 'The',
 'drillers',
 'of',
 'my',
 'clan',
 'chatter',
 'some',
 'gossip',
 'over',
 'the',
 'comm',
 'in',
 'my',
 'ear',
 'as',
 'I',
 'ride',
 'atop',
 'the',
 'clawDrill.',
 'I’m',
 'alone',
 'in',
 'this',
 'deep',
 'tunnel',
 'on',
 'a',
 'machine',
 'built',
 'like',
 'a',
 'titanic',
 'metal',
 'hand,',
 'one',
 'that',
 'grasps',
 'and',
 'gnaws',
 'at',
 'the',
 'ground.',
 'I',
 'control',
 'its',
 'rockmelting',
 'digits',
 'from',
 'the',
 'holster',
 'seat',
 'atop',
 'the',
 'drill,',
 'just',
 'where',
 'the',
 'elbow',
 'joint',
 'would',
 'be.',
 'There,',
 'my',
 'fingers',
 'fit',
 'into',
 'control',
 'gloves',
 'that',
 'manipulate',
 'the',
 'many',
 'tentacle-like',
 'drills',
 'some',
 'ninety',
 'meters',
 'below',
 'my',
 'perch.',
 'To',
 'be',
 'a',
 'Helldiver,',
 'they',
 'say',
 'your',
 'fingers',
 'must',
 'flicker',
 'fast',
 'as',
 'tongues',
 'of',
 'fire.',
 'Mine',
 'flicker',
 'faster.',
 '',
 'Despite',
 'the',
 'voices',
 'in',
 'my',
 'ear,',
 'I',
 'am',
 'alone',
 'in',
 'the',
 'deep',
 'tunnel.',
 'My',
 'existence',
 'is',
 'vibration,',
 'the',
 'echo',
 'of',
 'my',
 'own',
 'breath,',
 'and',
 'heat',
 'so',
 'thick',
 'and',
 'noxious',
 'it',
 'feels',
 'like',
 'I’m',
 'swaddled',
 'in',
 'a',
 'heavy',
 'quilt',
 'of',
 'hot',
 'piss.',
 '',
 'A',
 'new',
 'river',
 'of',
 'sweat',
 'breaks',
 'through',
 'the',
 'scarlet',
 'sweatband',
 'tied',
 'around',
 'my',
 'forehead',
 'and',
 'slips',
 'into',
 'my',
 'eyes,',
 'burning',
 'them',
 'till',
 'they’re',
 'as',
 'red',
 'as',
 'my',
 'rusty',
 'hair.',
 'I',
 'used',
 'to',
 'reach',
 'and',
 'try',
 'to',
 'wipe',
 'the',
 'sweat',
 'away,',
 'only',
 'to',
 'scratch',
 'futilely',
 'at',
 'the',
 'faceplate',
 'of',
 'my',
 'frysuit.',
 'I',
 'still',
 'want',
 'to.',
 'Even',
 'after',
 'three',
 'years,',
 'the',
 'tickle',
 'and',
 'sting',
 'of',
 'the',
 'sweat',
 'is',
 'a',
 'raw',
 'misery.',
 '',
 'The',
 'tunnel',
 'walls',
 'around',
 'my',
 'holster',
 'seat',
 'are',
 'bathed',
 'a',
 'sulfurous',
 'yellow',
 'by',
 'a',
 'corona',
 'of',
 'lights.',
 'The',
 'reach',
 'of',
 'the',
 'light',
 'fades',
 'as',
 'I',
 'look',
 'up',
 'the',
 'thin',
 'vertical',
 'shaft',
 'I’ve',
 'carved',
 'today.',
 'Above,',
 'precious',
 'helium-3',
 'glimmers',
 'like',
 'liquid',
 'silver,',
 'but',
 'I’m',
 'looking',
 'at',
 'the',
 'shadows,',
 'looking',
 'for',
 'the',
 'pitvipers',
 'that',
 'curl',
 'through',
 'the',
 'darkness',
 'seeking',
 'the',
 'warmth',
 'of',
 'my',
 'drill.',
 'They’ll',
 'eat',
 'into',
 'your',
 'suit',
 'too,',
 'bite',
 'through',
 'the',
 'shell',
 'and',
 'then',
 'try',
 'to',
 'burrow',
 'into',
 'the',
 'warmest',
 'place',
 'they',
 'find,',
 'usually',
 'your',
 'belly,',
 'so',
 'they',
 'can',
 'lay',
 'their',
 'eggs.',
 'I’ve',
 'been',
 'bitten',
 'before.',
 'Still',
 'dream',
 'of',
 'the',
 'beast—black,',
 'like',
 'a',
 'thick',
 'tendril',
 'of',
 'oil.',
 'They',
 'can',
 'get',
 'as',
 'wide',
 'as',
 'a',
 'thigh',
 'and',
 'long',
 'as',
 'three',
 'men,',
 'but',
 'it’s',
 'the',
 'babies',
 'we',
 'fear.',
 'They',
 'don’t',
 'know',
 'how',
 'to',
 'ration',
 'their',
 'poison.',
 'Like',
 'me,',
 'their',
 'ancestors',
 'came',
 'from',
 'Earth,',
 'then',
 'Mars',
 'and',
 'the',
 'deep',
 'tunnels',
 'changed',
 'them.',
 '',
 'It',
 'is',
 'eerie',
 'in',
 'the',
 'deep',
 'tunnels.',
 'Lonely.',
 'Beyond',
 'the',
 'roar',
 'of',
 'the',
 'drill,',
 'I',
 'hear',
 'the',
 'voices',
 'of',
 'my',
 'friends,',
 'all',
 'older.',
 'But',
 'I',
 'cannot',
 'see',
 'them',
 'a',
 'half',
 'klick',
 'above',
 'me',
 'in',
 'the',
 'darkness.',
 'They',
 'drill',
 'high',
 'above,',
 'near',
 'the',
 'mouth',
 'of',
 'the',
 'tunnel',
 'that',
 'I’ve',
 'carved,',
 'descending',
 'with',
 'hooks',
 'and',
 'lines',
 'to',
 'dangle',
 'along',
 'the',
 'sides',
 'of',
 'the',
 'tunnel',
 'to',
 'get',
 'at',
 'the',
 'small',
 'veins',
 'of',
 'helium-3.',
 'They',
 'mine',
 'with',
 'meter-long',
 'drills,',
 'gobbling',
 'up',
 'the',
 'chaff.',
 'The',
 'work',
 'still',
 'requires',
 'mad',
 'dexterity',
 'of',
 'foot',
 'and',
 'hand,',
 'but',
 'I’m',
 'the',
 'earner',
 'in',
 'this',
 'crew.',
 'I',
 'am',
 'the',
 'Helldiver.',
 'It',
 'takes',
 'a',
 'certain',
 'kind—and',
 'I’m',
 'the',
 'youngest',
 'anyone',
 'can',
 'remember.',
 '',
 'I’ve',
 'been',
 'in',
 'the',
 'mines',
 'for',
 'three',
 'years.',
 'You',
 'start',
 'at',
 'thirteen.',
 'Old',
 'enough',
 'to',
 'screw,',
 'old',
 'enough',
 'to',
 'crew.',
 'At',
 'least',
 'that’s',
 'what',
 'Uncle',
 'Narol',
 'said.',
 'Except',
 'I',
 'didn’t',
 'get',
 'married',
 'till',
 'six',
 'months',
 'back,',
 'so',
 'I',
 'don’t',
 'know',
 'why',
 'he',
 'said',
 'it.',
 '',
 'Eo',
 'dances',
 'through',
 'my',
 'thoughts',
 'as',
 'I',
 'peer',
 'into',
 'my',
 'control',
 'display',
 'and',
 'slip',
 'the',
 'clawDrill’s',
 'fingers',
 'around',
 'a',
 'fresh',
 'vein.',
 'Eo.',
 'Sometimes',
 'it’s',
 'difficult',
 'to',
 'think',
 'of',
 'her',
 'as',
 'anything',
 'but',
 'what',
 'we',
 'used',
 'to',
 'call',
 'her',
 'as',
 'children.',
 '',
 'Little',
 'Eo—a',
 'tiny',
 'girl',
 'hidden',
 'beneath',
 'a',
 'mane',
 'of',
 'red.',
 'Red',
 'like',
 'the',
 'rock',
 'around',
 'me,',
 'not',
 'true',
 'red,',
 'rust-red.',
 'Red',
 'like',
 'our',
 'home,',
 'like',
 'Mars.',
 'Eo',
 'is',
 'sixteen',
 'too.',
 'And',
 'she',
 'may',
 'be',
 'like',
 'me—from',
 'a',
 'clan',
 'of',
 'Red',
 'earth',
 'diggers,',
 'a',
 'clan',
 'of',
 'song',
 'and',
 'dance',
 'and',
 'soil—but',
 'she',
 'could',
 'be',
 'made',
 'from',
 'air,',
 'from',
 'the',
 'ether',
 'that',
 'binds',
 'the',
 'stars',
 'in',
 'a',
 'patchwork.',
 'Not',
 'that',
 'I’ve',
 'ever',
 'seen',
 'stars.',
 'No',
 'Red',
 'from',
 'the',
 'mining',
 'colonies',
 'sees',
 'the',
 'stars.',
 '',
 'Little',
 'Eo.',
 'They',
 'wanted',
 'to',
 'marry',
 'her',
 'off',
 'when',
 'she',
 'turned',
 'fourteen,',
 'like',
 'all',
 'girls',
 'of',
 'the',
 'clans.',
 'But',
 'she',
 'took',
 'the',
 'short',
 'rations',
 'and',
 'waited',
 'for',
 'me',
 'to',
 'reach',
 'sixteen,',
 'wedAge',
 'for',
 'men,',
 'before',
 'slipping',
 'that',
 'cord',
 'around',
 'her',
 'finger.',
 'She',
 'said',
 'she',
 'knew',
 'we’d',
 'marry',
 'since',
 'we',
 'were',
 'children.',
 'I',
 'didn’t.',
 '',
 '“Hold.',
 'Hold.',
 'Hold!”',
 'Uncle',
 'Narol',
 'snaps',
 'over',
 'the',
 'comm',
 'channel.',
 '“Darrow,',
 'hold,',
 'boy!”',
 'My',
 'fingers',
 'freeze.',
 'He’s',
 'high',
 'above',
 'with',
 'the',
 'rest',
 'of',
 'them,',
 'watching',
 'my',
 'progress',
 'on',
 'his',
 'head',
 'unit.',
 '',
 '“What’s',
 'the',
 'burn?”',
 'I',
 'ask,',
 'annoyed.',
 'I',
 'don’t',
 'like',
 'being',
 'interrupted.',
 '',
 '“What’s',
 'the',
 'burn,',
 'the',
 'little',
 'Helldiver',
 'asks.”',
 'Old',
 'Barlow',
 ...]
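
Splitting on whitespace leaves punctuation glued to the words ('him,', 'asked.'). NLTK's word_tokenize, which we will use below, handles this for us; a quick sketch:

In [ ]:
from nltk.tokenize import word_tokenize

# tokenize the chapter with nltk instead of naive splitting
rr_tokens = word_tokenize(content)
rr_tokens[:15]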

Open from a Pandas Data Frame

We will use Twitter data. This dataset collects the Twitter timelines of all members of the 117th Congress for 2021. It is a rich dataset and an interesting one to play with for descriptive text analysis.

We will work mostly with the column text.

In [11]:
import pandas as pd
import numpy as np

Download the data

Get the data here

In [12]:
# Open data
tweets_data = pd.read_csv("tweets_congress.csv")
tweets_data.head()
Out[12]:
author text date bios retweet_author Name Link State Party congress
0 AustinScottGA08 It is my team’s privilege to help our constitu... Fri Dec 31 18:24:51 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
1 AustinScottGA08 I am proud to have sponsored this amendment wh... Wed Dec 29 20:14:48 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
2 AustinScottGA08 From my family to yours, we wish you peace, jo... Sat Dec 25 16:48:16 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
3 AustinScottGA08 President Biden and Congress have a responsibi... Wed Dec 22 19:14:13 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
4 AustinScottGA08 Happy second birthday to @SpaceForceDoD!\n\nSe... Mon Dec 20 15:37:11 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
In [14]:
tweets_data.shape
Out[14]:
(1266542, 10)
In [15]:
# reduce the size of the data a bit
import random
authors = tweets_data["author"].unique()[random.sample(range(1, 425), 10)]
tweets_data_ = tweets_data[tweets_data['author'].str.contains("|".join(authors))].copy()
In [16]:
tweets_data_.shape
Out[16]:
(32403, 10)

Pre-Processing Steps

Almost every data science task using text requires the data to be preprocessed before running any type of analysis. These steps typically reduce noise in the text data, making the data more informative and less complex, and convert the data into formats computers understand.

The most common pre-processing steps are (a compact sketch applying them to a single sentence follows this list):

  • tokenization: splitting text into words or tokens.

  • normalization: converting text to lowercase and removing punctuation

  • stop word removal: removing noise, i.e. words with little meaning. Usually involves a pre-defined set of words plus some domain-knowledge/context-dependent words

  • stemming: removing the suffixes from words, such as "ing" or "ed," to reduce them to their base form

  • lemmatization: reducing words to their dictionary form (lemma); this relies on accurately determining the intended part-of-speech and meaning of a word from its context.
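
As a compact illustration, here is a minimal sketch applying these steps to a single (made-up) sentence:

In [ ]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

sentence = "The Senators are debating the new border security bills today!"

# tokenization
tokens = word_tokenize(sentence)

# normalization: lowercase, keep alphabetic tokens only
tokens = [t.lower() for t in tokens if t.isalpha()]

# stop word removal
tokens = [t for t in tokens if t not in stopwords.words('english')]

# stemming
porter = PorterStemmer()
[porter.stem(t) for t in tokens]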

Important: pre-processing steps can profoundly change what your text looks like. See this article to understand in more depth some of the trade-offs associated with pre-processing choices: https://www.cambridge.org/core/journals/political-analysis/article/abs/text-preprocessing-for-unsupervised-learning-why-it-matters-when-it-misleads-and-what-to-do-about-it/AA7D4DE0AA6AB208502515AE3EC6989E

The implementation of these steps consists of a mix of string methods and nltk methods. Let's see examples with the Politicians tweets datasets.

In [17]:
# import nltk methods

# stopwords
from nltk.corpus import stopwords

# tokenizer
from nltk.tokenize import word_tokenize

# lemmatizer
from nltk.stem import WordNetLemmatizer

# stemming
from nltk.stem.porter import PorterStemmer

Tokenization

  • word_tokenize() from nltk
In [18]:
# apply word_tokenize to the text column of the reduced dataframe
tweets_data_["tokens"] = tweets_data_["text"].apply(word_tokenize)
In [19]:
# see
tweets_data_["tokens"]
Out[19]:
9727       [Some, sage, words, of, wisdom, from, the, FDR...
9728       [We, made, it, together, so, let, ’, s, celebr...
9729       [RT, @, BillPascrell, :, 🚨, Starting, tomorrow...
9730       [358, days, ago, terrorists, ransacked, the, U...
9731            [RT, @, NJGov, :, NEW, YEAR, ,, NEW, JERSEY]
                                 ...                        
1263289    [Border, security, is, national, security, ., ...
1263290    [Failure, to, address, the, root, cause, will,...
1263291    [Rationally, ,, greater, uncertainty, and, gre...
1263292    [🥺🤕, America, ’, s, central, bank, is, destroy...
1263293    [After, failing, to, drag, America, and, @, re...
Name: tokens, Length: 32403, dtype: object

Normalization

  • isalpha() - string method used to keep only alphabetic tokens (dropping punctuation and numbers)
  • lower() - string method to convert text to lowercase
In [20]:
# normalization
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [word.lower() for word in x if word.isalpha()])
tweets_data_["tokens"].head()
Out[20]:
9727    [some, sage, words, of, wisdom, from, the, fdr...
9728    [we, made, it, together, so, let, s, celebrate...
9729    [rt, billpascrell, starting, tomorrow, large, ...
9730    [days, ago, terrorists, ransacked, the, us, ca...
9731                  [rt, njgov, new, year, new, jersey]
Name: tokens, dtype: object

Remove stop words

  • stopwords.words('english') from nltk
In [21]:
# import stopword first
stop_words = stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Want to add some more?

In [22]:
stop_words = stop_words + (["dr", "mr", "miss","congressman","congresswomen", "http", "rt"])
In [23]:
# remove
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [word for word in x if word not in stop_words])
tweets_data_["tokens"]
Out[23]:
9727       [sage, words, wisdom, fdr, memorial, enter, ne...
9728       [made, together, let, celebrate, bottom, heart...
9729       [billpascrell, starting, tomorrow, large, surp...
9730       [days, ago, terrorists, ransacked, us, capitol...
9731                         [njgov, new, year, new, jersey]
                                 ...                        
1263289    [border, security, national, security, secure,...
1263290    [failure, address, root, cause, correct, probl...
1263291    [rationally, greater, uncertainty, greater, ri...
1263292    [america, central, bank, destroying, value, mo...
1263293    [failing, drag, america, realdonaldtrump, wars...
Name: tokens, Length: 32403, dtype: object

Stemming

We stem the tokens using nltk.stem.porter.PorterStemmer to get the stemmed tokens.

In [24]:
# instantiate the stemmer
porter = PorterStemmer()

# run
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [porter.stem(word) for word in x if word])

# see
tweets_data_["tokens"].head()
Out[24]:
9727    [sage, word, wisdom, fdr, memori, enter, new, ...
9728    [made, togeth, let, celebr, bottom, heart, wis...
9729    [billpascrel, start, tomorrow, larg, surpris, ...
9730    [day, ago, terrorist, ransack, us, capitol, ho...
9731                      [njgov, new, year, new, jersey]
Name: tokens, dtype: object

Lemmatization

We will lemmatize the tokens using WordNetLemmatizer() from nltk
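
Note that WordNetLemmatizer's output depends on the part-of-speech tag you pass (by default it treats every word as a noun). A quick sketch:

In [ ]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# default POS is noun; pos='v' treats the word as a verb, pos='a' as an adjective
(lemmatizer.lemmatize("running"),
 lemmatizer.lemmatize("running", pos="v"),
 lemmatizer.lemmatize("better", pos="a"))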

In [25]:
# import
from nltk.stem import WordNetLemmatizer

# instantiate
lemmatizer = WordNetLemmatizer()

# run (it doesn't make much sense to lemmatize already-stemmed tokens, but just for your reference)
tweets_data_["tokens"] = tweets_data_["tokens"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x if word])

# see
tweets_data_["tokens"].tail()
Out[25]:
1263289    [border, secur, nation, secur, secur, border, ...
1263290    [failur, address, root, caus, correct, problem...
1263291    [ration, greater, uncertainti, greater, risk, ...
1263292    [america, central, bank, destroy, valu, money,...
1263293    [fail, drag, america, realdonaldtrump, war, pl...
Name: tokens, dtype: object

Bag-of-Words: Document-Feature Matrix Representation

As we saw in the lecture, our next step is to represent text numerically. We will do so by using the Bag of Words assumption. This assumption states that we represent text as an unordered set of words in a document.

  • Order is ignored;
  • frequency matters.

Remember, the idea here is to represent text data as numbers. We do so by breaking the text into words and counting them. A standard way to do this is with a Document-Feature Matrix (DFM):

  • Rows: documents of the corpus
  • Columns: feature or tokens or words
  • Cell: number of times a word j occurs in document i

To create a DFM, we will use the CountVectorizer() class from sklearn.
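
Before applying it to the tweets, a minimal sketch on a tiny made-up corpus makes the structure of the DFM easy to see:

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

toy_corpus = ["the border crisis",
              "the border bill passed",
              "the bill passed the house"]

toy_vectorizer = CountVectorizer()
toy_dfm = toy_vectorizer.fit_transform(toy_corpus)

# rows = documents, columns = features (words), cells = counts
pd.DataFrame(toy_dfm.toarray(), columns=toy_vectorizer.get_feature_names_out())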

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

# combine the pre-processed data
tweets_data_['tokens_join'] = tweets_data_['tokens'].apply(' '.join)

# instantiate a vectorizer
vectorizer = CountVectorizer()

# transform the data
dfm = vectorizer.fit_transform(tweets_data_['tokens_join'])
In [27]:
# the output is a sparse matrix
type(dfm)
Out[27]:
scipy.sparse._csr.csr_matrix
In [28]:
# Convert the matrix to an array and display it
feature_matrix = dfm.todense()
In [29]:
# super sparse matrix
feature_matrix
Out[29]:
matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
In [30]:
# Get feature names to use as dataframe column headers
feature_names = vectorizer.get_feature_names_out()
In [31]:
# Create a DataFrame with the feature matrix
df = pd.DataFrame(feature_matrix, columns=feature_names)
df
Out[31]:
aa aaa aaahct aacf aadcwv aaf aahomecar aamd aan aanmemb ... 𝙻𝚘𝚠𝚎𝚜𝚝 𝚁𝙴𝙿𝙾𝚁𝚃 𝚁𝚎𝚌𝚘𝚛𝚍 𝚂𝙴𝙿𝚃𝙴𝙼𝙱𝙴𝚁 𝚏𝚘𝚛 𝚒𝚗 𝚕𝚘𝚠 𝚛𝚊𝚝𝚎 𝚞𝚗𝚎𝚖𝚙𝚕𝚘𝚢𝚖𝚎𝚗𝚝 𝚢𝚎𝚊𝚛𝚜
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32398 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
32399 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
32400 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
32401 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
32402 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

32403 rows × 16919 columns

hugely sparse data!!
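
We can put a number on how sparse: the .nnz attribute of a scipy sparse matrix counts its non-zero entries, so the share of non-zero cells is:

In [ ]:
# share of cells in the document-feature matrix that are non-zero
n_docs, n_features = dfm.shape
dfm.nnz / (n_docs * n_features)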

Visualizing most used words

With this representation, we can actually start visualizing some interesting patterns in the data.

For example, we can visualize the most distinctive words tweeted by each politician. In this case, we need to:

  • Change unit of analysis from tweets to politicians
  • Join all tweets by politician
  • Pre-process the text
  • Build the dfm
  • Estimate some type of distinctiveness measure
In [32]:
tweets_data.head()
Out[32]:
author text date bios retweet_author Name Link State Party congress
0 AustinScottGA08 It is my team’s privilege to help our constitu... Fri Dec 31 18:24:51 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
1 AustinScottGA08 I am proud to have sponsored this amendment wh... Wed Dec 29 20:14:48 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
2 AustinScottGA08 From my family to yours, we wish you peace, jo... Sat Dec 25 16:48:16 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
3 AustinScottGA08 President Biden and Congress have a responsibi... Wed Dec 22 19:14:13 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
4 AustinScottGA08 Happy second birthday to @SpaceForceDoD!\n\nSe... Mon Dec 20 15:37:11 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
In [33]:
# change unit of analysis
tweets_data_g = tweets_data.groupby(["author","State", "Party"])["text"].apply(lambda x: "".join(x)).reset_index().copy()
In [34]:
tweets_data_g
Out[34]:
author State Party text
0 AustinScottGA08 GA R It is my team’s privilege to help our constitu...
1 BennieGThompson MS D RT @DerrickNAACP: If you can afford to pause s...
2 BettyMcCollum04 MN D Happy New Year! May 2022 bring you peace &...
3 BillPascrell NJ D Some sage words of wisdom from the FDR Memoria...
4 BobbyScott VA D RT @EdLaborCmte: At 12am, we will finally say ...
... ... ... ... ...
420 replouiegohmert TX R I recently had the honor of guest hosting the ...
421 repmarkpocan WI D Betty White could get anyone to laugh. An Amer...
422 rosadelauro CT D May you and your family have a joyful, happy, ...
423 senrobportman OH R It is clear that with record levels of unlawfu...
424 virginiafoxx NC R Happy New Year!\n\nWho’s ready for the #RedWav...

425 rows × 4 columns

In [35]:
# select a few well-known members to compare
authors = ["RepAOC", "Ilhan", "SpeakerPelosi", "marcorubio", "SenatorTimScott", 
           "SenTedCruz", "Jim_Jordan", "GOPLeader"]

# make a copy
reps = tweets_data_g[tweets_data_g["author"].str.contains("|".join(authors))].copy()
In [36]:
reps
Out[36]:
author State Party text
23 GOPLeader CA R Happy new year! https://t.co/GQA2zzlmWLRT @Rep...
27 Ilhan MN D Now would be a good time to cancel student deb...
31 Jim_Jordan OH R Happy New Year. God Bless America!RT @GOPLeade...
53 RepAOC NY D Millions of Yemenis are facing famine due to c...
359 SenTedCruz TX R Heidi & I wish you all a happy, healthy, a...
388 SenatorTimScott SC R Happy New Year! \n \nWishing everyone peace, p...
390 SpeakerPelosi CA D May this New Year usher in a time of joy, pros...
In [37]:
stop_words = stop_words + ["new", "https", "rt"]
In [38]:
# pre-process

# tokenize
reps["tokens"] = reps["text"].apply(word_tokenize)

# normalize
reps["tokens"] = reps["tokens"].apply(lambda x: [word.lower() for word in x if word.isalpha()])

# stem and stopwords
reps["tokens"] = reps["tokens"].apply(lambda x: [porter.stem(word) for word in x if word not in stop_words])
In [39]:
## Create dfm
# combine the pre-processed data
reps['tokens_join'] = reps['tokens'].apply(' '.join)

# instantiate a vectorizer
vectorizer = CountVectorizer()

# transform the data
dfm = vectorizer.fit_transform(reps['tokens_join'])
In [41]:
# convert df
dfm_d = pd.DataFrame(dfm.toarray(), 
                     columns=vectorizer.get_feature_names_out(), 
                     index=reps["author"])
In [42]:
# see the dataset
dfm_d
Out[42]:
aaa aadaw aalftwinc aapi aaron aarpca ab abajoladictadura abandon abbevil ... 𝙃𝙤𝙬 𝙗𝙤𝙧𝙙𝙚𝙧 𝙘𝙤𝙢𝙥𝙧𝙚𝙝𝙚𝙣𝙨𝙞𝙫𝙚 𝙘𝙧𝙚𝙖𝙩𝙚 𝙘𝙧𝙞𝙨𝙞𝙨 𝙢𝙤𝙧𝙚 𝙩𝙤 𝙰𝚙𝚛𝚒𝚕 𝙿𝙿𝙿 𝚄𝙿𝙳𝙰𝚃𝙴
author
GOPLeader 1 0 0 0 0 0 1 0 23 0 ... 1 1 1 1 1 3 1 0 0 0
Ilhan 0 1 1 5 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
Jim_Jordan 0 0 0 0 0 0 0 0 5 0 ... 0 0 0 0 0 0 0 0 0 0
RepAOC 0 0 0 2 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SenTedCruz 0 0 0 0 0 0 0 1 17 0 ... 0 0 0 0 0 0 0 0 0 0
SenatorTimScott 0 0 0 0 1 0 0 0 2 4 ... 0 0 0 0 0 0 0 1 1 1
SpeakerPelosi 0 0 0 6 0 1 0 0 4 0 ... 0 0 0 0 0 0 0 0 0 0

7 rows × 11979 columns

In [43]:
# overall most important features
index = dfm_d.sum().sort_values(ascending=False).index
In [44]:
index
Out[44]:
Index(['biden', 'american', 'democrat', 'amp', 'presid', 'border', 'today',
       'hous', 'peopl', 'year',
       ...
       'kermit', 'kerik', 'kenya', 'kentuckymbb', 'kenpaxtontx',
       'kendilaniannbc', 'kelseykoberg', 'kellynashradio', 'kellymakena',
       '𝚄𝙿𝙳𝙰𝚃𝙴'],
      dtype='object', length=11979)
In [45]:
# see the most important features
dfm_d[index]
Out[45]:
biden american democrat amp presid border today hous peopl year ... kermit kerik kenya kentuckymbb kenpaxtontx kendilaniannbc kelseykoberg kellynashradio kellymakena 𝚄𝙿𝙳𝙰𝚃𝙴
author
GOPLeader 766 365 520 201 361 523 141 234 125 146 ... 0 0 0 0 0 1 1 0 0 0
Ilhan 25 190 15 92 108 19 165 188 222 134 ... 0 0 1 0 0 0 0 0 0 0
Jim_Jordan 434 173 334 69 275 168 96 103 66 59 ... 1 1 0 0 1 0 0 0 0 0
RepAOC 11 29 9 101 9 0 77 57 45 42 ... 0 0 0 0 0 0 0 0 1 0
SenTedCruz 758 197 262 267 157 309 114 27 89 91 ... 0 0 0 0 0 0 0 0 0 0
SenatorTimScott 93 202 159 168 87 49 247 58 104 152 ... 0 0 0 1 0 0 0 1 0 1
SpeakerPelosi 61 454 154 510 290 2 176 300 180 185 ... 0 0 0 0 0 0 0 0 0 0

7 rows × 11979 columns

In [46]:
# most frequent words by candidate

# clean to capture top 10 terms
dfm_d.index.name = "author_tweet"

# container for per-author results
df_list = list()


# get top terms by group
for id, row in dfm_d.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)

# concat
top_terms = pd.concat(df_list, axis=0)
In [55]:
top_terms
Out[55]:
author_tweet variable value
0 GOPLeader biden 766
1 GOPLeader border 523
2 GOPLeader democrat 520
3 GOPLeader american 365
4 GOPLeader presid 361
... ... ... ...
5 SpeakerPelosi famili 219
6 SpeakerPelosi capitol 217
7 SpeakerPelosi trump 204
8 SpeakerPelosi act 193
9 SpeakerPelosi year 185

70 rows × 3 columns

In [47]:
# visualize
from plotnine import *

# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
    geom_bar(stat='identity') +
    facet_wrap('~author_tweet', scales='free') +
    coord_flip() +  # To make horizontal bar plots
    theme_minimal() +  # add the complete theme first, so the tweaks below are not reset
    theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
          axis_text_y=element_text(size=10),  # Adjust text size for y axis
          figure_size=(15, 10)) +  # Adjust the figure size
    labs(y='Frequency', x='')  # y holds the counts; coord_flip shows them horizontally
)
Out[47]:
<Figure Size: (640 x 480)>

Other ways of counting beyond simple frequencies

Simple frequency counts are a rather crude measure. Let's look at other ways of counting that retrieve more information:

N-Grams

N-grams: count sequences of N adjacent words (bigrams, for example, count pairs of words that appear next to each other)
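
For example, nltk.bigrams() lists the adjacent pairs in a (made-up) tokenized sentence:

In [ ]:
# bigrams of a toy tokenized sentence
list(nltk.bigrams(["the", "southern", "border", "crisis"]))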

TF-IDF:

TF-IDF weights a term's count in a document by how rare the term is across the other documents in the corpus, so distinctive terms receive more weight than ubiquitous ones.

Term-Frequency

$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in a document } d}{\text{Total number of terms in the document } d}$$

Inverse Document Frequency (IDF): $$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents } |D|}{\text{Number of documents with term } t \text{ in it}}\right) $$

TF-IDF: $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
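
A quick worked example with made-up numbers: if "border" appears 5 times in a 100-word document, TF = 5/100 = 0.05; if it appears in 10 out of 1,000 documents, IDF = log(1000/10) ≈ 4.6 (using the natural log), so TF-IDF ≈ 0.05 × 4.6 ≈ 0.23. A ubiquitous word appearing in 900 of the 1,000 documents gets IDF = log(1000/900) ≈ 0.11, so even a high raw count is heavily down-weighted. (sklearn's TfidfVectorizer, used below, applies a smoothed version of the IDF formula and normalizes each document vector, so its values will differ slightly from this hand calculation.)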

In [48]:
# bi-grams

# instantiate a vectorizer that counts bigrams
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(2,2), ## see here is the main difference
    # max_features=N  # Optionally restricts to top N tokens
)

text_bi = vectorizer.fit_transform(reps['tokens_join'])

# Convert matrix to DataFrame with bigram columns
# convert df
text_bi_pd = pd.DataFrame(text_bi.toarray(), 
                     columns=vectorizer.get_feature_names_out(), 
                     index=reps["author"])

# see
text_bi_pd
Out[48]:
aaa ga aadaw sahanjourn aalftwinc drive aapi commun aapi equal aapi hero aapi mayor aapi week aapi women aaron personifi ... 𝙘𝙤𝙢𝙥𝙧𝙚𝙝𝙚𝙣𝙨𝙞𝙫𝙚 invest 𝙘𝙧𝙚𝙖𝙩𝙚 𝙗𝙤𝙧𝙙𝙚𝙧 𝙘𝙧𝙞𝙨𝙞𝙨 stop 𝙢𝙤𝙧𝙚 govern 𝙢𝙤𝙧𝙚 spend 𝙢𝙤𝙧𝙚 tax 𝙩𝙤 𝙘𝙧𝙚𝙖𝙩𝙚 𝙰𝚙𝚛𝚒𝚕 loan 𝙿𝙿𝙿 𝚄𝙿𝙳𝙰𝚃𝙴 𝚄𝙿𝙳𝙰𝚃𝙴 𝙰𝚙𝚛𝚒𝚕
author
GOPLeader 1 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 0 0 0
Ilhan 0 1 1 3 0 0 0 0 2 0 ... 0 0 0 0 0 0 0 0 0 0
Jim_Jordan 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
RepAOC 0 0 0 1 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
SenTedCruz 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SenatorTimScott 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 1 1 1
SpeakerPelosi 0 0 0 3 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

7 rows × 118919 columns

In [50]:
# clean to capture top 10 terms
text_bi_pd.index.name = "author_tweet"

# container for per-author results
df_list = list()


# get top terms by group
for id, row in text_bi_pd.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)

# concat
top_terms = pd.concat(df_list, axis=0)

# see it
top_terms.head()
Out[50]:
author_tweet variable value
0 GOPLeader presid biden 267
1 GOPLeader southern border 99
2 GOPLeader border crisi 92
3 GOPLeader biden administr 89
4 GOPLeader hous democrat 68
In [51]:
# visualize
from plotnine import *

# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
    geom_bar(stat='identity') +
    facet_wrap('~author_tweet', scales='free') +
    coord_flip() +  # To make horizontal bar plots
    theme_minimal() +  # add the complete theme first, so the tweaks below are not reset
    theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
          axis_text_y=element_text(size=10),  # Adjust text size for y axis
          figure_size=(15, 10)) +  # Adjust the figure size
    labs(y='Frequency', x='')  # y holds the counts; coord_flip shows them horizontally
)
Out[51]:
<Figure Size: (640 x 480)>
In [52]:
# Term Frequency - Inverse Document Frequency (TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer

# instantiate a tf-idf vectorizer
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    # max_features=N  # Optionally restricts to top N tokens
)

# transform
text_tfidf = vectorizer.fit_transform(reps['tokens_join'])
In [53]:
# convert df
text_tfidf_pd = pd.DataFrame(text_tfidf.toarray(), 
                     columns=vectorizer.get_feature_names_out(), 
                     index=reps["author"])


# clean to capture top 10 terms
text_tfidf_pd.index.name = "author_tweet"

# container for per-author results
df_list = list()

# get top terms by group
for id, row in text_tfidf_pd.groupby("author_tweet"):
    idx = row.sum().sort_values(ascending=False).index
    temp = row.loc[:, idx].reset_index().melt(id_vars=["author_tweet"]).iloc[0:10, :]
    df_list.append(temp)

# concat
top_terms = pd.concat(df_list, axis=0)

top_terms
Out[53]:
author_tweet variable value
0 GOPLeader biden 0.442720
1 GOPLeader border 0.342638
2 GOPLeader democrat 0.300541
3 GOPLeader american 0.210957
4 GOPLeader presid 0.208645
... ... ... ...
5 SpeakerPelosi famili 0.151202
6 SpeakerPelosi capitol 0.149821
7 SpeakerPelosi trump 0.140845
8 SpeakerPelosi act 0.133251
9 SpeakerPelosi year 0.127727

70 rows × 3 columns

In [54]:
# visualize
from plotnine import *

# plot
(ggplot(top_terms, aes(x='variable', y='value')) +
    geom_bar(stat='identity') +
    facet_wrap('~author_tweet', scales='free') +
    coord_flip() +  # To make horizontal bar plots
    theme_minimal() +  # add the complete theme first, so the tweaks below are not reset
    theme(subplots_adjust={'wspace': 0.25},  # Adjust the space between plots
          axis_text_y=element_text(size=10),  # Adjust text size for y axis
          figure_size=(15, 10)) +  # Adjust the figure size
    labs(y='TF-IDF', x='')  # y holds the tf-idf scores; coord_flip shows them horizontally
)
Out[54]:
<Figure Size: (640 x 480)>

Practice:

Repeat the process described above, but using a different grouping variable. In this case, you can:

  • either group using other variables in the data (day, party, state)

  • use other politicians.

Use one of the metrics above (counts, TF-IDF, or bigrams) to understand the most important words for each group.

In [ ]:
# your code here

Similarity between documents

Let's now calculate measures of similarity between the authors of the tweets. Notice that this could be done tweet by tweet or at the level of each politician's full timeline. We will focus on the latter just to make things more interesting.

Here is our similarity measure:

$$\text{Sim}(A, B) = \frac{{A \cdot B}}{{\|A\| \|B\|}}$$

Where:

  • The $\cdot$ here is the dot product: $A \cdot B = \sum_j a_j b_j$
  • The vector norm: $\|A\| = \sqrt{\sum_j a_j^2}$

We will use the TF-IDF matrix as the input! The function (which is similar to the one you wrote in problem set 2) is implemented in the sklearn library.
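
To see the formula in action, here is a minimal sketch computing it by hand for two made-up count vectors with numpy:

In [ ]:
import numpy as np

# toy word-count vectors for two short documents
A = np.array([2, 1, 0, 3])
B = np.array([1, 1, 1, 1])

# cosine similarity = dot product divided by the product of the vector norms
A @ B / (np.linalg.norm(A) * np.linalg.norm(B))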

In [55]:
# import
from sklearn.metrics.pairwise import cosine_similarity

# re-estimate tf-idf
vectorizer = TfidfVectorizer()

# transform
text_tfidf = vectorizer.fit_transform(reps['tokens_join'])
In [56]:
# Calculate the cosine similarity between all pairs in the matrix
cosine_sim = cosine_similarity(text_tfidf, text_tfidf )

# Display the cosine similarity matrix
cosine_sim
Out[56]:
array([[1.        , 0.46860774, 0.53976422, 0.37072597, 0.68188297,
        0.46312371, 0.53468698],
       [0.46860774, 1.        , 0.32128238, 0.59587565, 0.39043828,
        0.53491185, 0.67609217],
       [0.53976422, 0.32128238, 1.        , 0.25702161, 0.43712255,
        0.31630066, 0.35650099],
       [0.37072597, 0.59587565, 0.25702161, 1.        , 0.33853562,
        0.46278013, 0.55208889],
       [0.68188297, 0.39043828, 0.43712255, 0.33853562, 1.        ,
        0.39513423, 0.44274346],
       [0.46312371, 0.53491185, 0.31630066, 0.46278013, 0.39513423,
        1.        , 0.56210807],
       [0.53468698, 0.67609217, 0.35650099, 0.55208889, 0.44274346,
        0.56210807, 1.        ]])
In [57]:
# convert to a df
author = reps["author"]
similarity_df = pd.DataFrame(cosine_sim, columns=reps["author"], index=reps["author"])

# similarity
similarity_df
Out[57]:
author GOPLeader Ilhan Jim_Jordan RepAOC SenTedCruz SenatorTimScott SpeakerPelosi
author
GOPLeader 1.000000 0.468608 0.539764 0.370726 0.681883 0.463124 0.534687
Ilhan 0.468608 1.000000 0.321282 0.595876 0.390438 0.534912 0.676092
Jim_Jordan 0.539764 0.321282 1.000000 0.257022 0.437123 0.316301 0.356501
RepAOC 0.370726 0.595876 0.257022 1.000000 0.338536 0.462780 0.552089
SenTedCruz 0.681883 0.390438 0.437123 0.338536 1.000000 0.395134 0.442743
SenatorTimScott 0.463124 0.534912 0.316301 0.462780 0.395134 1.000000 0.562108
SpeakerPelosi 0.534687 0.676092 0.356501 0.552089 0.442743 0.562108 1.000000
In [58]:
# AOC closest to?
similarity_df["RepAOC"].sort_values(ascending=False)
Out[58]:
author
RepAOC             1.000000
Ilhan              0.595876
SpeakerPelosi      0.552089
SenatorTimScott    0.462780
GOPLeader          0.370726
SenTedCruz         0.338536
Jim_Jordan         0.257022
Name: RepAOC, dtype: float64
In [59]:
# Jim Jordan closest to?
similarity_df["Jim_Jordan"].sort_values(ascending=False)
Out[59]:
author
Jim_Jordan         1.000000
GOPLeader          0.539764
SenTedCruz         0.437123
SpeakerPelosi      0.356501
Ilhan              0.321282
SenatorTimScott    0.316301
RepAOC             0.257022
Name: Jim_Jordan, dtype: float64
In [60]:
# Convert to tidy
df_tidy = similarity_df.reset_index().melt(id_vars='author', var_name='related_author', value_name='correlation')
df_tidy = df_tidy.sort_values(["author", "correlation"], ascending=False).copy()
In [61]:
# get order
order = df_tidy.tail(7).related_author
In [62]:
# Creating the heatmap
(ggplot(df_tidy, aes(x='author', y='related_author', fill='correlation'))
 + geom_tile()
 +  scale_fill_gradient(low="white", high="blue", 
                       limits=(.4, 1.01)) 
 +  scale_x_discrete(limits=order) 
 +  scale_y_discrete(limits=order) 
 + theme(axis_text_x=element_text(angle=90, hjust=1))
 + labs(title='Correlation Tile Matrix', x='Author', y='Related Author', fill='Correlation')
)
Out[62]:
<Figure Size: (640 x 480)>

Topic Model: LDA Implementation

To estimate topic models, we will use the gensim library. gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It is also the main library for retrieving pre-trained word embeddings, or for training word embeddings with the famous word2vec algorithm.

Here is a step-by-step guide to estimating LDA with gensim (a compact sketch putting the steps together follows the list):

  • Preprocess the text: follow most of the steps we saw before, including tokenization, removing stopwords, normalization, etc.

  • Create a dictionary: gensim requires a dictionary of all the stemmed/preprocessed words in the corpus (collection of documents); the Dictionary class from gensim will create this data structure for us.

  • Filter out words from the dictionary that appear in either a very low proportion of documents (lower bound) or a very high proportion of documents (upper bound).

  • Create a bag-of-words representation of the documents: maps words from the dictionary representation to each document.

  • Estimate the topic model: use the LdaModel class within gensim
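
Putting these steps together, here is a minimal sketch of the full pipeline. It assumes tokens is a list of token lists (one per document, as we build below); the number of topics and the filtering thresholds are arbitrary choices for illustration:

In [ ]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# step 2: dictionary mapping each token to an integer id
dictionary = Dictionary(tokens)

# step 3: drop tokens in fewer than 5 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=5, no_above=0.5)

# step 4: bag-of-words representation of each document
corpus = [dictionary.doc2bow(doc) for doc in tokens]

# step 5: estimate the LDA model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5, random_state=42)

# inspect the top words in each topic
lda.print_topics(num_words=5)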

In [63]:
# get a sample
td = tweets_data.iloc[random.sample(range(1, tweets_data.shape[0]), 1000)].copy()

Step 1 - Pre-Processing

Write a function with all our previous steps

In [64]:
# Write a preprocessing function
def preprocess_text(text):
    
    # increase stop words
    stop_words = stopwords.words('english')
    stop_words = stop_words + ["http"]
    
    # tokenization 
    tokens_ = word_tokenize(text)
    
    # Generate a list of tokens after preprocessing
 
    # normalize
    tokens_ = [word.lower() for word in tokens_ if word.isalpha()]

    # stem and stopwords
    tokens_ =  [porter.stem(word) for word in tokens_ if word not in stop_words]
    # Return the preprocessed tokens as a list
    return tokens_
In [65]:
# apply
td["tokens"] = td["text"].apply(preprocess_text)

Step 2: Create a Dictionary

In [66]:
# import Dictionary from gensim
from gensim.corpora import Dictionary

# convert to a list
tokens = td["tokens"].tolist()

# let's look at what this input is:
# it should be a list of lists, one list of tokens per document
tokens[1]
Out[66]:
['year',
 'ago',
 'civilian',
 'conserv',
 'corp',
 'put',
 'million',
 'work',
 'build',
 'road',
 'public',
 'land',
 'http']
In [67]:
# Create a dictionary representation of the documents
dictionary = Dictionary(tokens)

# see
dictionary.token2id
Out[67]:
{'american': 0,
 'flag': 1,
 'flagday': 2,
 'http': 3,
 'justic': 4,
 'let': 5,
 'liberti': 6,
 'recogn': 7,
 'repres': 8,
 'truli': 9,
 'us': 10,
 'ago': 11,
 'build': 12,
 'civilian': 13,
 'conserv': 14,
 'corp': 15,
 'land': 16,
 'million': 17,
 'public': 18,
 'put': 19,
 'road': 20,
 'work': 21,
 'year': 22,
 'basebal': 23,
 'cece': 24,
 'cheer': 25,
 'congression': 26,
 'enjoy': 27,
 'game': 28,
 'last': 29,
 'night': 30,
 'penc': 31,
 'republican': 32,
 'vp': 33,
 'home': 34,
 'honor': 35,
 'housegop': 36,
 'join': 37,
 'never': 38,
 'pleas': 39,
 'prison': 40,
 'return': 41,
 'rt': 42,
 'today': 43,
 'war': 44,
 'action': 45,
 'demand': 46,
 'endgunviol': 47,
 'housedemocrat': 48,
 'march': 49,
 'mcconnel': 50,
 'offic': 51,
 'senatemajldr': 52,
 'administr': 53,
 'bathroom': 54,
 'execut': 55,
 'misguid': 56,
 'obama': 57,
 'school': 58,
 'statement': 59,
 'democrat': 60,
 'edlaborgop': 61,
 'educ': 62,
 'higher': 63,
 'leader': 64,
 'legisl': 65,
 'partisan': 66,
 'react': 67,
 'reform': 68,
 'repsmuck': 69,
 'virginiafoxx': 70,
 'back': 71,
 'congress': 72,
 'eventu': 73,
 'get': 74,
 'need': 75,
 'plan': 76,
 'speaker': 77,
 'washington': 78,
 'corbett': 79,
 'epitom': 80,
 'exemplari': 81,
 'petti': 82,
 'rachel': 83,
 'semper': 84,
 'servic': 85,
 'spirit': 86,
 'uscg': 87,
 'blaze': 88,
 'entrepreneur': 89,
 'innov': 90,
 'other': 91,
 'risk': 92,
 'take': 93,
 'thank': 94,
 'trail': 95,
 'alway': 96,
 'christma': 97,
 'great': 98,
 'holiday': 99,
 'jefferson': 100,
 'kick': 101,
 'parad': 102,
 'season': 103,
 'way': 104,
 'west': 105,
 'behind': 106,
 'close': 107,
 'door': 108,
 'even': 109,
 'hous': 110,
 'impeach': 111,
 'oper': 112,
 'polit': 113,
 'process': 114,
 'rule': 115,
 'amp': 116,
 'came': 117,
 'counti': 118,
 'discuss': 119,
 'dr': 120,
 'ed': 121,
 'give': 122,
 'lake': 123,
 'lakeschool': 124,
 'moxley': 125,
 'superintend': 126,
 'susan': 127,
 'updat': 128,
 'compani': 129,
 'constitu': 130,
 'make': 131,
 'medic': 132,
 'pharmaceut': 133,
 'profit': 134,
 'ration': 135,
 'record': 136,
 'sick': 137,
 'tire': 138,
 'andrew': 139,
 'capitol': 140,
 'dc': 141,
 'middl': 142,
 'rain': 143,
 'stop': 144,
 'time': 145,
 'tour': 146,
 'captain': 147,
 'collin': 148,
 'command': 149,
 'hunt': 150,
 'met': 151,
 'new': 152,
 'pnsi': 153,
 'shipyard': 154,
 'welcom': 155,
 'ballot': 156,
 'help': 157,
 'measur': 158,
 'receiv': 159,
 'sole': 160,
 'voter': 161,
 'wa': 162,
 'airbnb': 163,
 'avail': 164,
 'enrol': 165,
 'evacue': 166,
 'learn': 167,
 'lodg': 168,
 'may': 169,
 'sccounti': 170,
 'visit': 171,
 'daughter': 172,
 'equal': 173,
 'law': 174,
 'mother': 175,
 'opportun': 176,
 'realiti': 177,
 'right': 178,
 'committe': 179,
 'intellig': 180,
 'terror': 181,
 'threat': 182,
 'day': 183,
 'goldstandard': 184,
 'happi': 185,
 'mapl': 186,
 'syrup': 187,
 'vermont': 188,
 'countri': 189,
 'ilhan': 190,
 'introduc': 191,
 'meal': 192,
 'richest': 193,
 'sensand': 194,
 'sweep': 195,
 'true': 196,
 'univers': 197,
 'busi': 198,
 'closur': 199,
 'govandybeshear': 200,
 'govern': 201,
 'guidanc': 202,
 'base': 203,
 'case': 204,
 'dismiss': 205,
 'previou': 206,
 'roevwad': 207,
 'scotu': 208,
 'abdic': 209,
 'away': 210,
 'cap': 211,
 'choic': 212,
 'low': 213,
 'number': 214,
 'popul': 215,
 'refuge': 216,
 'trump': 217,
 'turn': 218,
 'vulner': 219,
 'break': 220,
 'campaign': 221,
 'model': 222,
 'period': 223,
 'prohibit': 224,
 'promot': 225,
 'repadamschiff': 226,
 'resourc': 227,
 'taxpay': 228,
 'use': 229,
 'fauci': 230,
 'listen': 231,
 'member': 232,
 'pardonmytak': 233,
 'recommend': 234,
 'staff': 235,
 'think': 236,
 'affect': 237,
 'assist': 238,
 'butt': 239,
 'california': 240,
 'elig': 241,
 'fema': 242,
 'follow': 243,
 'resid': 244,
 'wildfir': 245,
 'counsel': 246,
 'counterintellig': 247,
 'formal': 248,
 'invit': 249,
 'mueller': 250,
 'special': 251,
 'testifi': 252,
 'appli': 253,
 'blue': 254,
 'deadlin': 255,
 'extend': 256,
 'keep': 257,
 'news': 258,
 'oct': 259,
 'roof': 260,
 'come': 261,
 'condemn': 262,
 'judiciari': 263,
 'reject': 264,
 'togeth': 265,
 'watch': 266,
 'democraci': 267,
 'freedom': 268,
 'import': 269,
 'men': 270,
 'protect': 271,
 'sacrif': 272,
 'want': 273,
 'women': 274,
 'aca': 275,
 'access': 276,
 'condit': 277,
 'healthcar': 278,
 'preexist': 279,
 'repeal': 280,
 'repjaredpoli': 281,
 'respond': 282,
 'support': 283,
 'usrepk': 284,
 'end': 285,
 'first': 286,
 'open': 287,
 'sign': 288,
 'bartkowiak': 289,
 'denni': 290,
 'eric': 291,
 'fountain': 292,
 'leiker': 293,
 'mitchellbyar': 294,
 'neven': 295,
 'old': 296,
 'rikki': 297,
 'stanis': 298,
 'stong': 299,
 'suzann': 300,
 'teri': 301,
 'tralona': 302,
 'actonclim': 303,
 'advoc': 304,
 'awesom': 305,
 'cleanairmom': 306,
 'forc': 307,
 'ri': 308,
 'act': 309,
 'alloc': 310,
 'amidst': 311,
 'care': 312,
 'concern': 313,
 'ct': 314,
 'elect': 315,
 'ensur': 316,
 'health': 317,
 'prep': 318,
 'aumentando': 319,
 'ayudar': 320,
 'de': 321,
 'debemo': 322,
 'en': 323,
 'infeccion': 324,
 'la': 325,
 'losangel': 326,
 'nuestra': 327,
 'para': 328,
 'part': 329,
 'poner': 330,
 'siguen': 331,
 'todo': 332,
 'aeromed': 333,
 'airnatlguard': 334,
 'evacu': 335,
 'nasfortworthjrb': 336,
 'select': 337,
 'usairforc': 338,
 'everywher': 339,
 'father': 340,
 'fathersday': 341,
 'live': 342,
 'love': 343,
 'play': 344,
 'role': 345,
 'birthday': 346,
 'centuri': 347,
 'nation': 348,
 'oldest': 349,
 'organ': 350,
 'veteran': 351,
 'vfwhq': 352,
 'amen': 353,
 'repcleav': 354,
 'lost': 355,
 'moment': 356,
 'noon': 357,
 'rememb': 358,
 'silenc': 359,
 'corpor': 360,
 'deficit': 361,
 'explod': 362,
 'goptaxscam': 363,
 'step': 364,
 'tax': 365,
 'wealthiest': 366,
 'buse': 367,
 'color': 368,
 'commun': 369,
 'folk': 370,
 'known': 371,
 'lifelin': 372,
 'long': 373,
 'vital': 374,
 'anyth': 375,
 'deni': 376,
 'repbarbarale': 377,
 'state': 378,
 'thereidout': 379,
 'activist': 380,
 'dedic': 381,
 'meet': 382,
 'partnership': 383,
 'readi': 384,
 'announc': 385,
 'gap': 386,
 'nearli': 387,
 'pandem': 388,
 'proud': 389,
 'significantli': 390,
 'worsen': 391,
 'afghan': 392,
 'beaten': 393,
 'kill': 394,
 'offici': 395,
 'partner': 396,
 'silent': 397,
 'taliban': 398,
 'age': 399,
 'cdc': 400,
 'children': 401,
 'head': 402,
 'vaccin': 403,
 'week': 404,
 'feder': 405,
 'locat': 406,
 'manchinmobilemonday': 407,
 'chr': 408,
 'eastern': 409,
 'employe': 410,
 'facilit': 411,
 'instrument': 412,
 'rehab': 413,
 'serv': 414,
 'town': 415,
 'clyburn': 416,
 'defund': 417,
 'dem': 418,
 'jim': 419,
 'polic': 420,
 'will': 421,
 'author': 422,
 'bush': 423,
 'iraq': 424,
 'presid': 425,
 'send': 426,
 'andoverohpublib': 427,
 'congratul': 428,
 'endow': 429,
 'grant': 430,
 'human': 431,
 'debt': 432,
 'fulli': 433,
 'oblig': 434,
 'owe': 435,
 'repay': 436,
 'servicemen': 437,
 'construct': 438,
 'decad': 439,
 'fund': 440,
 'got': 441,
 'inact': 442,
 'move': 443,
 'project': 444,
 'repmoolenaar': 445,
 'three': 446,
 'direct': 447,
 'inquiri': 448,
 'ohdeptofhealth': 449,
 'question': 450,
 'certainli': 451,
 'one': 452,
 'recoveri': 453,
 'across': 454,
 'buckey': 455,
 'celebr': 456,
 'famili': 457,
 'hope': 458,
 'peac': 459,
 'wish': 460,
 'believ': 461,
 'jahim': 462,
 'lie': 463,
 'major': 464,
 'peopl': 465,
 'still': 466,
 'therecount': 467,
 'arkansa': 468,
 'cool': 469,
 'evolv': 470,
 'field': 471,
 'rice': 472,
 'see': 473,
 'techniqu': 474,
 'autom': 475,
 'fraud': 476,
 'impostor': 477,
 'messag': 478,
 'notic': 479,
 'overpay': 480,
 'sent': 481,
 'victim': 482,
 'august': 483,
 'hero': 484,
 'late': 485,
 'must': 486,
 'pass': 487,
 'provid': 488,
 'rise': 489,
 'senat': 490,
 'america': 491,
 'black': 492,
 'brown': 493,
 'buy': 494,
 'indianapoli': 495,
 'like': 496,
 'look': 497,
 'good': 498,
 'grindstonecaf': 499,
 'kingdom': 500,
 'newport': 501,
 'next': 502,
 'northeast': 503,
 'talk': 504,
 'vter': 505,
 'vtsnek': 506,
 'address': 507,
 'bill': 508,
 'bipartisan': 509,
 'michigan': 510,
 'dstdreamgirl': 511,
 'border': 512,
 'chao': 513,
 'crisi': 514,
 'expand': 515,
 'possibl': 516,
 'southern': 517,
 'worst': 518,
 'would': 519,
 'call': 520,
 'complaint': 521,
 'houseintel': 522,
 'immedi': 523,
 'unless': 524,
 'whistleblow': 525,
 'decemb': 526,
 'enough': 527,
 'payment': 528,
 'budget': 529,
 'est': 530,
 'facebook': 531,
 'govwast': 532,
 'relief': 533,
 'repkevinhern': 534,
 'thomasaschatz': 535,
 'tomorrow': 536,
 'around': 537,
 'denysenko': 538,
 'disast': 539,
 'sad': 540,
 'tresja': 541,
 'usaid': 542,
 'worker': 543,
 'world': 544,
 'biden': 545,
 'democracyendur': 546,
 'harri': 547,
 'vice': 548,
 'charg': 549,
 'conveni': 550,
 'focu': 551,
 'network': 552,
 'pay': 553,
 'reliabl': 554,
 'attain': 555,
 'clear': 556,
 'consequ': 557,
 'lead': 558,
 'mani': 559,
 'militari': 560,
 'mistak': 561,
 'object': 562,
 'strategi': 563,
 'syria': 564,
 'unforeseen': 565,
 'card': 566,
 'collect': 567,
 'credit': 568,
 'davidcicillin': 569,
 'group': 570,
 'interest': 571,
 'led': 572,
 'waiv': 573,
 'crackdown': 574,
 'houseforeign': 575,
 'journalist': 576,
 'protest': 577,
 'russian': 578,
 'strongest': 579,
 'term': 580,
 'cemeteri': 581,
 'deliv': 582,
 'gettysburg': 583,
 'lincoln': 584,
 'monument': 585,
 'farmwork': 586,
 'essenti': 587,
 'internet': 588,
 'parkersburg': 589,
 'sentinel': 590,
 'forest': 591,
 'histor': 592,
 'invest': 593,
 'pois': 594,
 'reduc': 595,
 'watersh': 596,
 'freestyl': 597,
 'madden': 598,
 'mobil': 599,
 'paig': 600,
 'place': 601,
 'swim': 602,
 'abl': 603,
 'formerli': 604,
 'hundr': 605,
 'pipelin': 606,
 'train': 607,
 'underemploy': 608,
 'unemploy': 609,
 'hear': 610,
 'procedur': 611,
 'releas': 612,
 'resolut': 613,
 'vote': 614,
 'bargain': 615,
 'chip': 616,
 'court': 617,
 'nomin': 618,
 'reproduct': 619,
 'suprem': 620,
 'alce': 621,
 'alongsid': 622,
 'colleagu': 623,
 'friend': 624,
 'hast': 625,
 'privileg': 626,
 'ca': 627,
 'dh': 628,
 'given': 629,
 'respons': 630,
 'secur': 631,
 'settl': 632,
 'chair': 633,
 'examin': 634,
 'farm': 635,
 'subcommitte': 636,
 'tighten': 637,
 'cancer': 638,
 'fellow': 639,
 'fowler': 640,
 'longer': 641,
 'share': 642,
 'stori': 643,
 'survivor': 644,
 'openenrol': 645,
 'start': 646,
 'china': 647,
 'everi': 648,
 'privaci': 649,
 'sanction': 650,
 'boot': 651,
 'foxbusi': 652,
 'ground': 653,
 'reprwilliam': 654,
 'coast': 655,
 'cut': 656,
 'forget': 657,
 'guard': 658,
 'simonwdc': 659,
 'tsa': 660,
 'attend': 661,
 'director': 662,
 'gold': 663,
 'manufactur': 664,
 'recept': 665,
 'robert': 666,
 'technolog': 667,
 'umain': 668,
 'cornwatch': 669,
 'acr': 670,
 'calfireczu': 671,
 'contain': 672,
 'czulightningcomplex': 673,
 'begin': 674,
 'faith': 675,
 'list': 676,
 'negoti': 677,
 'white': 678,
 'dreamandpromis': 679,
 'nydiavelazquez': 680,
 'reproybalallard': 681,
 'word': 682,
 'accur': 683,
 'aim': 684,
 'depart': 685,
 'determin': 686,
 'method': 687,
 'pyrrhotit': 688,
 'research': 689,
 'test': 690,
 'underway': 691,
 'applaud': 692,
 'drewbueno': 693,
 'everifi': 694,
 'numbersusa': 695,
 'repmobrook': 696,
 'effort': 697,
 'event': 698,
 'explor': 699,
 'sdchamber': 700,
 'solut': 701,
 'yesterday': 702,
 'convers': 703,
 'icymi': 704,
 'mica': 705,
 'washtim': 706,
 'addit': 707,
 'background': 708,
 'check': 709,
 'cosponsor': 710,
 'fix': 711,
 'improv': 712,
 'nic': 713,
 'qualiti': 714,
 'anoth': 715,
 'continu': 716,
 'fairytal': 717,
 'morn': 718,
 'mythic': 719,
 'repjerrynadl': 720,
 'covid': 721,
 'español': 722,
 'find': 723,
 'getvax': 724,
 'janschakowski': 725,
 'text': 726,
 'vacuna': 727,
 'zipcod': 728,
 'disarmh': 729,
 'gun': 730,
 'particip': 731,
 'repbecerra': 732,
 'roundtabl': 733,
 'violenc': 734,
 'encourag': 735,
 'nevadan': 736,
 'stand': 737,
 'goe': 738,
 'latina': 739,
 'lgbtqhistorymonth': 740,
 'often': 741,
 'repchuygarcia': 742,
 'rivera': 743,
 'sylvia': 744,
 'trailblaz': 745,
 'unrecogn': 746,
 'crucial': 747,
 'impact': 748,
 'implement': 749,
 'throughout': 750,
 'delet': 751,
 'tweet': 752,
 'vox': 753,
 'becom': 754,
 'firefight': 755,
 'firsthand': 756,
 'idaho': 757,
 'youth': 758,
 'broadband': 759,
 'emc': 760,
 'gachamb': 761,
 'govkemp': 762,
 'legislatur': 763,
 'belong': 764,
 'complex': 765,
 'second': 766,
 'could': 767,
 'ferc': 768,
 'read': 769,
 'report': 770,
 'saw': 771,
 'truth': 772,
 'hospit': 773,
 'loan': 774,
 'ppp': 775,
 'sbagov': 776,
 'small': 777,
 'ustreasuri': 778,
 'breakthrough': 779,
 'discov': 780,
 'excit': 781,
 'incred': 782,
 'nanotechnolog': 783,
 'present': 784,
 'scientif': 785,
 'drug': 786,
 'far': 787,
 'fight': 788,
 'restor': 789,
 'academi': 790,
 'graduat': 791,
 'hs': 792,
 'inform': 793,
 'monday': 794,
 'remind': 795,
 'student': 796,
 'aumf': 797,
 'cm': 798,
 'repgregorymeek': 799,
 'emerg': 800,
 'ongo': 801,
 'violat': 802,
 'exactli': 803,
 'lay': 804,
 'appl': 805,
 'entir': 806,
 'epic': 807,
 'exist': 808,
 'four': 809,
 'revenu': 810,
 'trial': 811,
 'chairman': 812,
 'councilman': 813,
 'councilwoman': 814,
 'hopkin': 815,
 'owen': 816,
 'sisseton': 817,
 'tribal': 818,
 'wahpeton': 819,
 'ceremoni': 820,
 'medal': 821,
 'urbandal': 822,
 'account': 823,
 'assault': 824,
 'best': 825,
 'brightest': 826,
 'cost': 827,
 'demwomencaucu': 828,
 'dod': 829,
 'failur': 830,
 'hold': 831,
 'perpetr': 832,
 'sexual': 833,
 'bether': 834,
 'know': 835,
 'mental': 836,
 'vetaffairsdem': 837,
 'alli': 838,
 'alliesact': 839,
 'bring': 840,
 'safeti': 841,
 'spoke': 842,
 'bellair': 843,
 'program': 844,
 'child': 845,
 'establish': 846,
 'leav': 847,
 'paid': 848,
 'coordin': 849,
 'repmarktakano': 850,
 'anniversari': 851,
 'enact': 852,
 'mark': 853,
 'chávez': 854,
 'césar': 855,
 'ignit': 856,
 'life': 857,
 'movement': 858,
 'everyth': 859,
 'ga': 860,
 'hatr': 861,
 'irrat': 862,
 'oil': 863,
 'produc': 864,
 'refer': 865,
 'said': 866,
 'everyon': 867,
 'galvestonferri': 868,
 'safe': 869,
 'thanksgiv': 870,
 'consid': 871,
 'footnot': 872,
 'histori': 873,
 'revolutionari': 874,
 'free': 875,
 'parent': 876,
 'senatorhaywood': 877,
 'summer': 878,
 'virtual': 879,
 'weekli': 880,
 'workshop': 881,
 'lose': 882,
 'pain': 883,
 'terribl': 884,
 'didyouknow': 885,
 'exchang': 886,
 'housebudgetgop': 887,
 'limit': 888,
 'obamacar': 889,
 'sold': 890,
 'advanc': 891,
 'amount': 892,
 'began': 893,
 'claim': 894,
 'half': 895,
 'individu': 896,
 'juli': 897,
 'monthli': 898,
 'total': 899,
 'activ': 900,
 'exploit': 901,
 'lack': 902,
 'system': 903,
 'transpar': 904,
 'box': 905,
 'commfoodshar': 906,
 'distribut': 907,
 'food': 908,
 'fresh': 909,
 'louisvil': 910,
 'pack': 911,
 'repjoenegus': 912,
 'volunt': 913,
 'anacabrera': 914,
 'damn': 915,
 'doctor': 916,
 'liter': 917,
 'shot': 918,
 'among': 919,
 'district': 920,
 'repjohncurti': 921,
 'twwpioneer': 922,
 'utah': 923,
 'contract': 924,
 'reagan': 925,
 'alreadi': 926,
 'contracept': 927,
 'deserv': 928,
 'womenvet': 929,
 'block': 930,
 'critic': 931,
 'cruz': 932,
 'race': 933,
 'theori': 934,
 'ilhanmn': 935,
 'rockstar': 936,
 'wait': 937,
 'j': 938,
 'paus': 939,
 'though': 940,
 'two': 941,
 'abbi': 942,
 'crystal': 943,
 'dahlkemp': 944,
 'dunn': 945,
 'jessica': 946,
 'mcdonald': 947,
 'mewi': 948,
 'sam': 949,
 'thenccourag': 950,
 'fact': 951,
 'waysmeanscmt': 952,
 'moon': 953,
 'nasa': 954,
 'set': 955,
 'stevescalis': 956,
 'big': 957,
 'content': 958,
 'editori': 959,
 'held': 960,
 'media': 961,
 'newsandsentinel': 962,
 'social': 963,
 'tech': 964,
 'billion': 965,
 'class': 966,
 'mail': 967,
 'piec': 968,
 'usp': 969,
 'fresno': 970,
 'fresnohsngceo': 971,
 'recent': 972,
 'excess': 973,
 'includ': 974,
 'intern': 975,
 'unit': 976,
 'decis': 977,
 'repchiproy': 978,
 'agre': 979,
 'dramat': 980,
 'order': 981,
 'top': 982,
 'accept': 983,
 'area': 984,
 'snap': 985,
 'upinngil': 986,
 'cover': 987,
 'insur': 988,
 'loss': 989,
 'provis': 990,
 'requir': 991,
 'urg': 992,
 'amend': 993,
 'disord': 994,
 'encroach': 995,
 'govt': 996,
 'stood': 997,
 'forward': 998,
 'imag': 999,
 ...}

Step 3: Filter out words

This is an additional pre-processing task. More meaningful topics emerge when we remove rare and overly common words.

In [68]:
# Filter out words that occur in less than 20 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
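
A quick sanity check after filtering is to look at how large the vocabulary is now. This is a minimal sketch, not output from the original run; the exact count depends on your corpus and on the thresholds you chose.

In [ ]:
# how many distinct tokens survived the filtering step?
print(len(dictionary))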

Step 4: Create a bag-of-words representation of the documents

In [69]:
# Create a bag-of-words representation of the documents

# notice here you are just passing every doc to the .doc2bow method
corpus = [dictionary.doc2bow(doc) for doc in tokens]

# inspect a single document
# the result is a list of (token id, frequency) tuples
dictionary.doc2bow(tokens[0])
Out[69]:
[(0, 1), (1, 1)]
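
The ids are hard to read on their own. A minimal sketch, using the same dictionary and corpus objects created above, maps them back to their tokens:

In [ ]:
# translate (token_id, frequency) pairs back into (token, frequency)
[(dictionary[token_id], freq) for token_id, freq in corpus[0]]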

Step 5: Fit the model

In [70]:
from gensim.models.ldamodel import LdaModel

# Make an index-to-word mapping
temp = dictionary[0]  # accessing one entry forces gensim to populate dictionary.id2token
id2word = dictionary.id2token

# Train the LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=id2word,  # the Dictionary object itself would also work here
    num_topics=10,
    eval_every=False
)
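
The call above relies on gensim's defaults for most hyperparameters. LdaModel also accepts, among others, the number of passes over the corpus and a random seed for reproducibility. The values below are purely illustrative assumptions, not settings tuned for this corpus:

In [ ]:
# a variant of the model above with a few explicit hyperparameters
# (the passes and random_state values are illustrative, not tuned)
lda_model_alt = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,
    passes=10,          # full passes over the corpus during training
    random_state=42,    # seed, so the topics are reproducible
    eval_every=False
)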

Visualizing results

In [71]:
# Print the keywords in the 10 topics
lda_model.print_topics()
Out[71]:
[(0,
  '0.074*"amp" + 0.074*"live" + 0.072*"proud" + 0.063*"vaccin" + 0.063*"make" + 0.052*"rt" + 0.052*"act" + 0.052*"must" + 0.040*"help" + 0.035*"right"'),
 (1,
  '0.129*"rt" + 0.129*"amp" + 0.076*"great" + 0.061*"today" + 0.040*"time" + 0.036*"feder" + 0.032*"work" + 0.032*"help" + 0.032*"get" + 0.027*"make"'),
 (2,
  '0.154*"rt" + 0.122*"today" + 0.046*"work" + 0.041*"amp" + 0.041*"trump" + 0.036*"honor" + 0.034*"american" + 0.028*"need" + 0.023*"presid" + 0.023*"last"'),
 (3,
  '0.214*"rt" + 0.094*"thank" + 0.071*"week" + 0.054*"hous" + 0.045*"pass" + 0.032*"stop" + 0.029*"help" + 0.027*"amp" + 0.027*"great" + 0.027*"health"'),
 (4,
  '0.088*"american" + 0.069*"bill" + 0.069*"nation" + 0.057*"great" + 0.051*"hous" + 0.047*"rt" + 0.044*"trump" + 0.044*"continu" + 0.038*"vote" + 0.035*"congress"'),
 (5,
  '0.124*"year" + 0.084*"famili" + 0.084*"one" + 0.052*"million" + 0.042*"today" + 0.040*"pandem" + 0.036*"talk" + 0.036*"us" + 0.030*"day" + 0.030*"countri"'),
 (6,
  '0.151*"peopl" + 0.093*"rt" + 0.093*"protect" + 0.053*"act" + 0.047*"support" + 0.041*"american" + 0.030*"first" + 0.030*"trump" + 0.024*"famili" + 0.024*"work"'),
 (7,
  '0.069*"rt" + 0.067*"right" + 0.067*"new" + 0.057*"biden" + 0.055*"thank" + 0.050*"discuss" + 0.040*"american" + 0.040*"job" + 0.040*"happi" + 0.040*"support"'),
 (8,
  '0.086*"amp" + 0.073*"state" + 0.053*"join" + 0.051*"need" + 0.046*"thank" + 0.041*"rt" + 0.041*"help" + 0.036*"live" + 0.036*"pandem" + 0.031*"million"'),
 (9,
  '0.108*"join" + 0.072*"presid" + 0.065*"thank" + 0.051*"biden" + 0.047*"health" + 0.040*"today" + 0.037*"care" + 0.037*"bill" + 0.029*"get" + 0.029*"support"')]

Estimate Topic Prevalence

In [196]:
# Extract the topic distribution of each document
td['topic'] = [sorted(lda_model[corpus][text]) for text in range(len(td["text"]))]

# expand the dataframe: one row per (document, topic) pair
df_exploded = td["topic"].explode().reset_index()

# split the (topic, probability) tuples into two separate columns
df_exploded[["topic", "probability"]] = pd.DataFrame(df_exploded['topic'].tolist(), index=df_exploded.index)
In [197]:
# data frame with the distribution for each topic vs document
df_exploded
Out[197]:
index topic probability
0 1224340 0 0.050002
1 1224340 1 0.050008
2 1224340 2 0.050000
3 1224340 3 0.050004
4 1224340 4 0.050003
... ... ... ...
9995 672779 5 0.050000
9996 672779 6 0.050002
9997 672779 7 0.050004
9998 672779 8 0.050000
9999 672779 9 0.050001

10000 rows × 3 columns
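
One detail worth knowing: lda_model[corpus] drops topics whose probability falls below gensim's default minimum_probability threshold, so very skewed documents may contribute fewer than 10 rows. If you want the full distribution for every document, a sketch of an alternative extraction step uses get_document_topics with the threshold set to zero:

In [ ]:
# variant of the extraction cell above:
# keep every topic for every document, even the ones with tiny probabilities
td['topic'] = [
    lda_model.get_document_topics(doc, minimum_probability=0.0)
    for doc in corpus
]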

In [199]:
# merge the topic distributions back with the original documents
df_exploded = pd.merge(df_exploded, td.reset_index(), on="index")
In [200]:
# topic prevalence
tp_prev = df_exploded.groupby("topic_x")["probability"].mean().reset_index()
tp_prev.sort_values("probability", ascending=False)
Out[200]:
topic_x probability
8 8 0.137033
3 3 0.110278
5 5 0.108009
7 7 0.102183
9 9 0.098605
0 0 0.094215
1 1 0.089792
4 4 0.089726
2 2 0.088123
6 6 0.082036
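
Averaging probabilities is only one way to summarize prevalence. Another common summary, sketched below on the same merged data frame (so the topic column is named topic_x), is the share of documents for which each topic is the single most likely one:

In [ ]:
# for each document, keep the row with the highest topic probability...
dominant = df_exploded.loc[df_exploded.groupby("index")["probability"].idxmax()]

# ...then compute the share of documents dominated by each topic
dominant["topic_x"].value_counts(normalize=True).sort_index()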

Bringing the words back

In [201]:
# Get the most important words for each topic
topic_words = list()
for i in range(lda_model.num_topics):
    # Get the top words for the topic
    words = lda_model.show_topic(i, topn=10)
    topic_words.append(", ".join([word for word, prob in words]))
      
In [202]:
topic_words
Out[202]:
['state, amp, nation, today, live, year, work, us, famili, rt',
 'rt, american, support, must, live, famili, take, great, administr, help',
 'thank, one, must, year, act, congress, bill, presid, amp, hous',
 'work, trump, presid, act, famili, rt, honor, amp, american, million',
 'time, american, help, today, call, job, week, rt, need, year',
 'today, rt, vote, great, congress, member, right, need, thank, health',
 'continu, great, rt, commun, act, year, american, day, today, amp',
 'peopl, hous, pass, student, protect, act, help, american, join, year',
 'rt, amp, tax, senat, act, bill, health, join, today, get',
 'amp, today, make, busi, nation, help, discuss, rt, join, live']
In [203]:
tp_prev["words"] = topic_words
In [204]:
tp_prev
Out[204]:
topic_x probability words
0 0 0.094215 state, amp, nation, today, live, year, work, u...
1 1 0.089792 rt, american, support, must, live, famili, tak...
2 2 0.088123 thank, one, must, year, act, congress, bill, p...
3 3 0.110278 work, trump, presid, act, famili, rt, honor, a...
4 4 0.089726 time, american, help, today, call, job, week, ...
5 5 0.108009 today, rt, vote, great, congress, member, righ...
6 6 0.082036 continu, great, rt, commun, act, year, america...
7 7 0.102183 peopl, hous, pass, student, protect, act, help...
8 8 0.137033 rt, amp, tax, senat, act, bill, health, join, ...
9 9 0.098605 amp, today, make, busi, nation, help, discuss,...

This gives a very readable representation of the topics. You can merge it back with the core data set and examine how topic prevalence varies by candidate, party, time of day, or any other grouping variable you have, as in the sketch below.
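
For example, assuming the merged data contains a grouping column such as party (a hypothetical column name here; it is not created anywhere in this notebook), the group-level prevalence is a short groupby:

In [ ]:
# 'party' is a hypothetical grouping column assumed to come from your core data set
(
    df_exploded
    .groupby(["party", "topic_x"])["probability"]
    .mean()
    .reset_index()
)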

We have only scratched the surface of unsupervised learning and topic modeling here. If you want to see more, take my computational linguistics class next spring!

In [207]:
!jupyter nbconvert _week_11_nlp_I.ipynb --to html --template classic
[NbConvertApp] Converting notebook _week_11_nlp_I.ipynb to html
[NbConvertApp] Writing 1098851 bytes to _week_11_nlp_I.html
In [ ]: