PPOL 5203 Data Science I: Foundations

Working with Text as Data

Tiago Ventura


Learning Goals

This notebook will cover:

  • Unsupervised vs Supervised tasks with Text-as-Data
  • Unsupervised:
    • Topic Models of Congressional Tweets
  • Supervised: Sentiment Analysis
    • Dictionary
    • ML for Text-Classification
    • Working with Pre-trained Models - Transformers
    • Outsourcing to Generative Text-Based Models

Unsupervised vs Supervised Tasks

The field of Natural Language Processing is strongly intertwined with statistical learning. For this reason, the lessons we covered on statistical learning form the foundation for the modeling approach we will take when analyzing text-as-data. For example, the distinction between unsupervised and supervised tasks we saw earlier in this course provides a useful analytical framework for separating the distinct tasks we face when working with text.

  • Supervised learning consists of tasks in which, for every observation $i$, we observe both inputs/predictors/features and the outcomes we wish to predict. To solve these tasks, we apply statistical learning to some sort of labeled data (outcomes), with a focus on mapping inputs to outputs.

  • In unsupervised tasks, we only observe the input data; there is no outcome/label we wish to predict or explain. The goal of unsupervised learning is to recover hidden structure in the data, for example clusters, groups, or topics based on the co-occurrence of words.

When working with text, supervised tasks often involve some sort of classification into known categories. Sentiment analysis, stance detection, spam detection, and classifying social media posts that contain toxic language or misinformation are all examples of supervised learning tasks. Unsupervised tasks, on the other hand, are often used when the goal is to discover patterns in the text without making assumptions about the content of the corpus. The most common unsupervised task is to find topics: words that occur together across a large volume of text. We know there is some hidden structure in the text, but we do not have labels, so this becomes an unsupervised learning task.

In this notebook, we will:

  • Use topic models to analyze a large set of tweets from Members of Congress (unsupervised).
  • Perform a supervised classification task, sentiment analysis, on a labelled dataset of IMDB reviews. We will perform this task using four different methods: dictionaries, training ML models, using pre-trained models, and outsourcing to generative text-based models (the model under the hood of ChatGPT).

This notebook is meant to give you the intuition behind these models in a very applied manner, so that you can use them in your final projects. That is all to say: I will give you code, but I will not explain the implementation of the models in detail.

Topic Models

Download the data

Get the data here

In [13]:
# Open data
import pandas as pd
import numpy as np

tweets_data = pd.read_csv("tweets_congress.csv")
tweets_data.head()
Out[13]:
author text date bios retweet_author Name Link State Party congress
0 AustinScottGA08 It is my team’s privilege to help our constitu... Fri Dec 31 18:24:51 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
1 AustinScottGA08 I am proud to have sponsored this amendment wh... Wed Dec 29 20:14:48 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
2 AustinScottGA08 From my family to yours, we wish you peace, jo... Sat Dec 25 16:48:16 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
3 AustinScottGA08 President Biden and Congress have a responsibi... Wed Dec 22 19:14:13 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
4 AustinScottGA08 Happy second birthday to @SpaceForceDoD!\n\nSe... Mon Dec 20 15:37:11 +0000 2021 I am proud to represent the 8th Congressional ... NaN Scott, Austin https://twitter.com/AustinScottGA08 GA R House
In [14]:
tweets_data.shape
Out[14]:
(1266542, 10)

Pre-Processing Functions

In [15]:
## Pre processing steps
# stopwords
from nltk.corpus import stopwords

# tokenizer
from nltk.tokenize import word_tokenize

# lemmatizer
from nltk.stem import WordNetLemmatizer

# stemming
from nltk.stem.porter import PorterStemmer

Topic Model: LDA Implementation

To estimate topic models, we will use the gensim library. gensim is a Python library for topic modelling, document indexing, and similarity retrieval with large corpora. It is also the main library for retrieving pre-trained word embeddings, or for training word embeddings with the famous word2vec algorithm.

This is a step-by-step guide to estimating LDA with gensim:

  • Preprocess the text: follow most of the steps we saw before, including tokenization, stopword removal, normalization, etc.

  • Create a dictionary: gensim requires you to create a dictionary of all stemmed/preprocessed words in the corpus (collection of documents); the Dictionary method from gensim will create this data structure for us.

  • Filter out words from the dictionary that appear in either a very low proportion of documents (lower bound) or a very high proportion of documents (upper bound).

  • Create a bag-of-words representation of the documents: map the words in each document to their dictionary ids and counts.

  • Estimate the topic model: use the LDA model within gensim (a minimal sketch of the full pipeline follows this list).
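Before applying these steps to the tweets, here is a minimal end-to-end sketch on three toy documents (the tokens are hypothetical; preprocessing and extreme-word filtering are done on the real data below):

# minimal LDA pipeline on toy, already-tokenized documents
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

docs = [["tax", "cut", "economi"], ["health", "care", "bill"], ["tax", "bill", "vote"]]
dictionary = Dictionary(docs)                                     # step 2: dictionary
corpus = [dictionary.doc2bow(doc) for doc in docs]                # step 4: bag-of-words
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)   # step 5: fit the model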

In [16]:
# get a sample
import random
td = tweets_data.iloc[random.sample(range(1, tweets_data.shape[0]), 10000)].copy()

Step 1 - Pre-Processing

Write a function with all our previous steps

In [17]:
# Write a preprocessing function
def preprocess_text(text):
    
    # increase stop words
    stop_words = stopwords.words('english')
    stop_words = stop_words + ["https", "rt", "amp"]
    
    # tokenization 
    tokens_ = word_tokenize(text)
    
    # Generate a list of tokens after preprocessing
 
    # normalize
    tokens_ = [word.lower() for word in tokens_ if word.isalpha()]

    # stem and remove stopwords

    # instantiate the stemmer
    porter = PorterStemmer()

    tokens_ = [porter.stem(word) for word in tokens_ if word not in stop_words]

    # return the preprocessed tokens as a list
    return tokens_
In [18]:
# apply
td["tokens"] = td["text"].apply(preprocess_text)

Step 2: Create a Dictionary

In [19]:
# import dictionary
from gensim.corpora import Dictionary

# convert to a list
tokens = td["tokens"].tolist()

# let's look what this input is. 
# should be a list of list for each document split by tokens
tokens[1]
#td["text"][999540]
Out[19]:
['riteaid',
 'commit',
 'reduc',
 'drug',
 'misus',
 'abus',
 'commun',
 'safe',
 'medic',
 'dispos',
 'repscottperri']
In [20]:
# Create a dictionary representation of the documents
dictionary = Dictionary(tokens)

# see
dictionary.token2id
Out[20]:
{'battl': 0,
 'becom': 1,
 'begin': 2,
 'danger': 3,
 'encourag': 4,
 'estacada': 5,
 'fire': 6,
 'firefight': 7,
 'pull': 8,
 'resid': 9,
 'sandi': 10,
 'theportlandtrib': 11,
 'abus': 12,
 'commit': 13,
 'commun': 14,
 'dispos': 15,
 'drug': 16,
 'medic': 17,
 'misus': 18,
 'reduc': 19,
 'repscottperri': 20,
 ...}

Step 3: Filter out words

This is an additional pre-processing task. More meaningful topics come when we remove rare and overly common words.

In [21]:
# Filter out words that occur in less than 20 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

Step 4: Create a bag-of-words representation of the documents

In [22]:
# Create a bag-of-words representation of the documents

# notice here you are just passing every doc to the .doc2bow method
corpus = [dictionary.doc2bow(doc) for doc in tokens]
In [23]:
corpus[0]
Out[23]:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
In [24]:
# see case by case
# tuple with (id for every token, frequency) 
dictionary.doc2bow(tokens[0])
Out[24]:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
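These tuples are hard to read on their own. To make them human-readable, you can map each token id back through the dictionary (a quick sketch using gensim's id-to-token lookup):

# map each (token_id, count) pair back to its token string
[(dictionary[token_id], count) for token_id, count in corpus[0]]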

Step 5 - Fit the model

In [25]:
from gensim.models.ldamodel import LdaModel
# Train the LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    # this is your only input!!!
    num_topics=10,
    eval_every=False
)
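With the model fitted, you can already inspect how it allocates a single document across topics (a quick sketch using gensim's get_document_topics, which returns (topic_id, probability) pairs):

# topic distribution for the first document
lda_model.get_document_topics(corpus[0])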

Visualizing results

In [26]:
# Print the Keyword in the 10 topics
lda_model.print_topics()
Out[26]:
[(0,
  '0.026*"veteran" + 0.016*"import" + 0.016*"serv" + 0.015*"tax" + 0.015*"one" + 0.011*"time" + 0.010*"support" + 0.010*"take" + 0.010*"discuss" + 0.010*"offic"'),
 (1,
  '0.018*"bill" + 0.015*"work" + 0.015*"american" + 0.013*"hear" + 0.013*"support" + 0.011*"health" + 0.010*"fund" + 0.010*"need" + 0.010*"worker" + 0.010*"lead"'),
 (2,
  '0.034*"trump" + 0.020*"presid" + 0.017*"administr" + 0.016*"biden" + 0.013*"want" + 0.012*"read" + 0.011*"join" + 0.011*"hear" + 0.011*"job" + 0.011*"new"'),
 (3,
  '0.049*"today" + 0.025*"day" + 0.019*"hous" + 0.017*"join" + 0.014*"pass" + 0.012*"famili" + 0.012*"one" + 0.012*"proud" + 0.011*"secur" + 0.011*"year"'),
 (4,
  '0.032*"great" + 0.021*"today" + 0.019*"http" + 0.018*"thank" + 0.014*"host" + 0.013*"us" + 0.012*"vaccin" + 0.012*"meet" + 0.012*"visit" + 0.011*"see"'),
 (5,
  '0.020*"care" + 0.020*"health" + 0.020*"need" + 0.015*"help" + 0.015*"plan" + 0.014*"senat" + 0.013*"get" + 0.013*"legisl" + 0.012*"bill" + 0.011*"bipartisan"'),
 (6,
  '0.027*"american" + 0.020*"happi" + 0.017*"peopl" + 0.016*"day" + 0.014*"let" + 0.013*"alway" + 0.013*"presid" + 0.012*"make" + 0.012*"everi" + 0.011*"democrat"'),
 (7,
  '0.030*"thank" + 0.027*"nation" + 0.019*"law" + 0.018*"act" + 0.017*"women" + 0.014*"today" + 0.011*"health" + 0.010*"sign" + 0.010*"men" + 0.009*"protect"'),
 (8,
  '0.019*"year" + 0.018*"last" + 0.015*"american" + 0.012*"live" + 0.011*"offici" + 0.009*"two" + 0.009*"hall" + 0.009*"gun" + 0.009*"town" + 0.009*"first"'),
 (9,
  '0.032*"vote" + 0.022*"right" + 0.021*"student" + 0.015*"busi" + 0.014*"today" + 0.013*"small" + 0.013*"protect" + 0.013*"la" + 0.012*"act" + 0.012*"high"')]

Estimate Topic Prevalence

In [27]:
# Extract the topic distribution for each document
td['topic'] = [sorted(lda_model[corpus][text]) for text in range(len(td["text"]))]
td.head()
Out[27]:
author text date bios retweet_author Name Link State Party congress tokens topic
297173 RepBonamici RT @ThePortlandTrib: Estacada: firefighters pu... Thu Sep 10 23:05:24 +0000 2020 Representing Oregon's 1st District. Working to... ThePortlandTrib Bonamici, Suzanne https://twitter.com/RepBonamici OR D House [theportlandtrib, estacada, firefight, pull, f... [(0, 0.17090175), (1, 0.014289286), (2, 0.0142...
882637 RepScottPerry RT @riteaid: We’re committed to reducing drug ... Fri Jun 01 19:13:25 +0000 2018 Husband, Father of two young daughters, Small ... riteaid Perry, Scott https://twitter.com/RepScottPerry PA R House [riteaid, commit, reduc, drug, misus, abus, co... [(0, 0.012504354), (1, 0.012501202), (2, 0.012...
529191 RepGusBilirakis Today’s #VeteranOwnedSmallBusinessWeek highlig... Thu Nov 04 15:52:00 +0000 2021 #FL12. #TampaBay Most Effective/FL Most Effect... NaN Bilirakis, Gus M. https://twitter.com/RepGusBilirakis FL R House [today, veteranownedsmallbusinessweek, highlig... [(0, 0.014286572), (1, 0.014288466), (2, 0.014...
523082 RepGregStanton RT @RafaelCarranza: NEW THIS AM: @SenMcSallyAZ... Thu Dec 12 18:54:37 +0000 2019 Proudly serving Arizona's 9th Congressional Di... RafaelCarranza Stanton, Greg https://twitter.com/RepGregStanton AZ D House [rafaelcarranza, new, senmcsallyaz, repgregsta... [(0, 0.18539806), (1, 0.014291376), (2, 0.0142...
323963 RepBryanSteil Ensuring everyone has access to quality, affor... Tue Feb 12 20:57:01 +0000 2019 Official Twitter account for Bryan Steil, prou... NaN Steil, Bryan https://twitter.com/RepBryanSteil WI R House [ensur, everyon, access, qualiti, afford, educ... [(4, 0.21832615), (9, 0.708931)]
In [28]:
# expand the dataframe
df_exploded = td["topic"].explode().reset_index()

# separate information
df_exploded[["topic", "probability"]] = pd.DataFrame(df_exploded['topic'].tolist(), index=df_exploded.index)
In [29]:
# data frame with the distribution for each topic vs document
df_exploded
Out[29]:
index topic probability
0 297173 0 0.170902
1 297173 1 0.014289
2 297173 2 0.014288
3 297173 3 0.014288
4 297173 4 0.576970
... ... ... ...
92852 527420 5 0.010005
92853 527420 6 0.010003
92854 527420 7 0.010001
92855 527420 8 0.010002
92856 527420 9 0.010003

92857 rows × 3 columns

In [30]:
# merge
df_exploded = pd.merge(df_exploded, td.reset_index(), on="index")
In [31]:
# topic prevalence
tp_prev = df_exploded.groupby("topic_x")["probability"].mean().reset_index()
tp_prev.sort_values("probability", ascending=False)
Out[31]:
topic_x probability
4 4 0.120206
5 5 0.115282
3 3 0.113876
6 6 0.109788
1 1 0.109146
2 2 0.102622
7 7 0.101693
8 8 0.100910
9 9 0.098934
0 0 0.097659

Bringing the words back

In [32]:
# Get the most important words for each topic
topic_words = list()
for i in range(lda_model.num_topics):
    # Get the top words for the topic
    words = lda_model.show_topic(i, topn=10)
    topic_words.append(", ".join([word for word, prob in words]))
      
In [33]:
topic_words
Out[33]:
['veteran, import, serv, tax, one, time, support, take, discuss, offic',
 'bill, work, american, hear, support, health, fund, need, worker, lead',
 'trump, presid, administr, biden, want, read, join, hear, job, new',
 'today, day, hous, join, pass, famili, one, proud, secur, year',
 'great, today, http, thank, host, us, vaccin, meet, visit, see',
 'care, health, need, help, plan, senat, get, legisl, bill, bipartisan',
 'american, happi, peopl, day, let, alway, presid, make, everi, democrat',
 'thank, nation, law, act, women, today, health, sign, men, protect',
 'year, last, american, live, offici, two, hall, gun, town, first',
 'vote, right, student, busi, today, small, protect, la, act, high']
In [34]:
tp_prev["words"] = topic_words
In [35]:
tp_prev
Out[35]:
topic_x probability words
0 0 0.097659 veteran, import, serv, tax, one, time, support...
1 1 0.109146 bill, work, american, hear, support, health, f...
2 2 0.102622 trump, presid, administr, biden, want, read, j...
3 3 0.113876 today, day, hous, join, pass, famili, one, pro...
4 4 0.120206 great, today, http, thank, host, us, vaccin, m...
5 5 0.115282 care, health, need, help, plan, senat, get, le...
6 6 0.109788 american, happi, peopl, day, let, alway, presi...
7 7 0.101693 thank, nation, law, act, women, today, health,...
8 8 0.100910 year, last, american, live, offici, two, hall,...
9 9 0.098934 vote, right, student, busi, today, small, prot...

This is a very nice representation of the topics. You can merge it back with the core dataset and compare distributions across candidates, parties, time of day, or any other grouping variable you have.

Visualizing Words

In [36]:
# preparing the dataframe

# convert topics to category
tp_prev['topic_x'] = tp_prev.topic_x.astype('category')

# get a list for ordering
topics = tp_prev["probability"].sort_values().index.tolist()

# create a new re-ordered variable
tp_prev = tp_prev.assign(topics_ord=
              tp_prev['topic_x'].cat.reorder_categories(topics))
In [37]:
# plot
from plotnine import *

(ggplot(tp_prev, aes(x='topics_ord', y='probability', label='words')) 
        + geom_col( fill='lightblue') 
        + coord_flip() 
        + geom_text(aes(y='probability + 0.01'), nudge_y=-.05)  # Adjusting text position
        + theme(axis_text_y=element_text(angle=90))  # Rotating y-axis labels for better visibility
        + theme(figure_size=(12, 6))
)
     
Out[37]:
<Figure Size: (1200 x 600)>

Analyzing Model Coherence

How many topics should I use? As argued here, to decide how many topics to use, one needs both a qualitative assessment of the topics (humans in the loop) and some coherence measures across the topics.

There are many coherence measures for assessing topic models and the quality of their topics. Broadly speaking, these measures all compare words that appear in the same topic and measure, on average, how similar they are relative to words from different topics.

What matters here is for you to understand the procedure. In general, we fit many models with different numbers of topics, look for the point at which the gains in these measures become marginal, and then make a qualitative assessment of what our topics look like.

Let's see an example below using two measures (u_mass and c_v). In both cases, higher values mean better topics.

In [38]:
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    eval_every=False
)
In [39]:
from gensim.models import CoherenceModel

coherence_values = []
model_list = []
for num_topics in range(5, 30, 4):
    
    print(num_topics)
    
    # estimate the model
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    
    # save the model
    model_list.append(model)
    
    # get coherence umass
    coherencemodel_umass = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary, coherence='u_mass')

    # get another coherence measures
    coherencemodel_cv = CoherenceModel(model=model, texts=tokens, dictionary=dictionary, coherence='c_v')

    coherence_values.append((num_topics, coherencemodel_umass.get_coherence(), coherencemodel_cv.get_coherence()))
5
9
13
17
21
25
29
In [40]:
# grab the results
res = pd.DataFrame(coherence_values, columns=["topics", "u_mass", "c_v"])

# tidy
res = res.melt(id_vars="topics")

# plotnine
(ggplot(res, aes(y="value", x="topics"))
 + geom_line()
 + facet_wrap("variable", scales="free") 
 + theme_minimal())
Out[40]:
<Figure Size: (640 x 480)>
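One way to combine the plot with a qualitative check is to shortlist the run with the highest c_v coherence and then read its topics (a minimal sketch, reusing the coherence_values and model_list built above):

# pick the run with the highest c_v coherence as a starting point
best_k, best_umass, best_cv = max(coherence_values, key=lambda x: x[2])
print(best_k, best_cv)

# then inspect its topics qualitatively
best_model = model_list[[k for k, _, _ in coherence_values].index(best_k)]
best_model.print_topics()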

Supervised Learning with Text

To practice supervised learning with text data, we will perform a classic sentiment-analysis classification task. Sentiment analysis is a natural language processing technique that, given a textual input (tweets, movie reviews, comments on a website chatbox, etc.), identifies the polarity of the text.

There are different flavors of sentiment analysis, but one of the most widely used techniques labels data as positive, negative, or neutral. Other options include classifying text according to levels of toxicity, which I did in the paper I asked you to read, or more fine-grained measures of sentiment.

Sentiment analysis is just one of many classification tasks that can be done with text. For any task in which you need to identify whether an input belongs to a certain category, you can use a similar set of tools to the ones we will see for sentiment analysis. For example, these are some classification tasks I have used in my work before:

  • Classify the levels of toxicity in social media live-streaming comments.
  • Analyze the sentiment of tweets.
  • Classify whether a user is a Republican or a Democrat given their Twitter bio.
  • Identify whether a particular social media post contains misinformation.

For all these tasks, you need:

  • some type of labelled data (which you and your research team will produce),
  • a machine learning model, built by you or pre-trained, to make the predictions,
  • an evaluation of the models' performance.

Here, we will work with data that was already labelled for us: we will analyze the sentiment of the IMDB dataset of movie reviews.

IMDB Dataset

For the rest of this notebook, we will use the IMDB dataset provided by Hugging Face. The IMDB dataset contains 25,000 movie reviews labeled by sentiment for training a model and 25,000 movie reviews for testing it.

We will talk more about the Hugging Face project later in this notebook. For now, just install their main transformers library and import the IMDB review dataset.

Accessing the Dataset

In [41]:
#pip install -q transformers
from datasets import load_dataset
imdb = load_dataset("imdb")

get a smaller sample

In [42]:
small_train_dataset = imdb["train"].shuffle(seed=42).select([i for i in list(range(3000))])
small_test_dataset = imdb["test"].shuffle(seed=42).select([i for i in list(range(300))])
In [43]:
# convert to a dataframe
pd_train = pd.DataFrame(small_train_dataset)
pd_test = pd.DataFrame(small_test_dataset)

# see the data
pd_train.head()
Out[43]:
text label
0 There is no relation at all between Fortier an... 1
1 This movie is a great. The plot is very true t... 1
2 George P. Cosmatos' "Rambo: First Blood Part I... 0
3 In the process of trying to establish the audi... 1
4 Yeh, I know -- you're quivering with excitemen... 0

Dictionary Methods

Our first approach for sentiment classification will use dictionary methods.

Common procedure: it consists of using a pre-determined set of words (a dictionary) that identifies the categories you want to classify documents into. With this dictionary, you do a simple search through the documents, count how many times these words appear, and use some type of aggregation function to classify the text. For example:

  • Positive or negative, for sentiment
  • Sad, happy, angry, anxious... for emotions
  • Sexism, homophobia, xenophobia, racism... for hate speech

Dictionaries are the most basic strategy for classifying documents. Their simplicity requires some unrealistic assumptions (for example, ignoring the contextual information in documents). However, the use of dictionaries has one major advantage: it builds a bridge between qualitative and quantitative knowledge. You need human experts to build good dictionaries.
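To make the procedure concrete, here is a toy sketch of the count-and-aggregate logic with a made-up two-list dictionary (the word lists are hypothetical, just for illustration):

# toy dictionary classifier: count matches and aggregate into a label
positive_words = {"love", "great", "excellent"}   # hypothetical mini-dictionary
negative_words = {"hate", "boring", "awful"}

def dictionary_sentiment(text):
    tokens = text.lower().split()
    score = sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
    return 1 if score > 0 else 0  # aggregation: net count -> positive/negative label

dictionary_sentiment("I love this great movie")  # returns 1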

VADER

There are many dictionary options for sentiment classification. We will use a popular open-source option available in NLTK: the VADER dictionary. VADER stands for Valence Aware Dictionary for Sentiment Reasoning. It is a model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion, and it was developed particularly to handle social media content.

Key Components of the VADER Dictionary:

  • Sentiment Lexicon: This is a list of known words and their associated sentiment scores.

  • Sentiment Intensity Scores: Each word in the lexicon is assigned a score that ranges from -4 (extremely negative) to +4 (extremely positive).

  • Handling of Contextual and Qualitative Modifiers: VADER is sensitive to both intensifiers (e.g., "very") and negations (e.g., "not").

You can read the original paper that introduced VADER here

In [44]:
#### Import dictionary
import nltk
# nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
In [45]:
# instantiate the model
sid = SentimentIntensityAnalyzer()
In [46]:
# simple example
review1 = "Oh, I loved the Data Science I course. The best course I have ever done!"

# classify
sid.polarity_scores(review1)
Out[46]:
{'neg': 0.0, 'neu': 0.544, 'pos': 0.456, 'compound': 0.8553}
In [47]:
# simple example
review2 = "DS I was ok. Professor Ventura jokes were not great"

# classify
sid.polarity_scores(review2)
Out[47]:
{'neg': 0.244, 'neu': 0.445, 'pos': 0.311, 'compound': -0.0243}
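To see the contextual and qualitative modifiers from the list above in action, compare an intensified and a negated version of the same sentence (a small illustrative check):

# VADER is sensitive to intensifiers and negations
print(sid.polarity_scores("The course was good"))
print(sid.polarity_scores("The course was very good"))  # intensifier raises the compound score
print(sid.polarity_scores("The course was not good"))   # negation flips the polarity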

Let's now apply the dictionary at scale to our IMDB review dataset

In [48]:
# apply the dictionary to your data frame
pd_test["vader_scores"]=pd_test["text"].apply(sid.polarity_scores)
In [49]:
# let's see
pd_test.head()
Out[49]:
text label vader_scores
0 <br /><br />When I unsuspectedly rented A Thou... 1 {'neg': 0.069, 'neu': 0.788, 'pos': 0.143, 'co...
1 This is the latest entry in the long series of... 1 {'neg': 0.066, 'neu': 0.862, 'pos': 0.073, 'co...
2 This movie was so frustrating. Everything seem... 0 {'neg': 0.24, 'neu': 0.583, 'pos': 0.177, 'com...
3 I was truly and wonderfully surprised at "O' B... 1 {'neg': 0.075, 'neu': 0.752, 'pos': 0.173, 'co...
4 This movie spends most of its time preaching t... 0 {'neg': 0.066, 'neu': 0.707, 'pos': 0.227, 'co...
In [50]:
# grab final sentiment
pd_test["sentiment_vader"]=pd_test["vader_scores"].apply(lambda x: np.where(x["compound"] > 0, 1, 0))

Now that we have performed the classification task, we can compare the labels with our predictions. We will use a simple accuracy measure: the proportion of labels correctly classified.

In [51]:
pd_test['vader_scores']
Out[51]:
0      {'neg': 0.069, 'neu': 0.788, 'pos': 0.143, 'co...
1      {'neg': 0.066, 'neu': 0.862, 'pos': 0.073, 'co...
2      {'neg': 0.24, 'neu': 0.583, 'pos': 0.177, 'com...
3      {'neg': 0.075, 'neu': 0.752, 'pos': 0.173, 'co...
4      {'neg': 0.066, 'neu': 0.707, 'pos': 0.227, 'co...
                             ...                        
295    {'neg': 0.059, 'neu': 0.812, 'pos': 0.129, 'co...
296    {'neg': 0.056, 'neu': 0.87, 'pos': 0.074, 'com...
297    {'neg': 0.056, 'neu': 0.825, 'pos': 0.119, 'co...
298    {'neg': 0.092, 'neu': 0.768, 'pos': 0.14, 'com...
299    {'neg': 0.083, 'neu': 0.785, 'pos': 0.132, 'co...
Name: vader_scores, Length: 300, dtype: object
In [52]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(pd_test['label'], pd_test['sentiment_vader'])

# see
print(accuracy)
0.6966666666666667
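Accuracy alone can hide differences across classes. If you want a fuller picture, a per-class report is a quick addition (a small sketch using scikit-learn's classification_report):

# per-class precision, recall, and F1 for the VADER predictions
from sklearn.metrics import classification_report
print(classification_report(pd_test['label'], pd_test['sentiment_vader']))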

Training a Machine Learning Classifier

The next step to try, if your dictionary is not working well, is to train your own machine learning classifier. To build a simple machine learning classifier, you will need to combine two different processes:

  • Build your input: here you will do all the steps we saw before to convert text to numbers. The goal is to use a document-feature matrix as the input for the ML model.

  • Use sklearn to train your model and assess its accuracy.

As before, let's not dig deep into the differences between models. We will learn how to train a logistic regression with a penalty term. You can use the same sklearn code to try different models.

Building your input

In [53]:
# apply the same pre-process function we used for topic models
# notice we are working with the training data
pd_train["tokens"] = pd_train["text"].apply(preprocess_text)

# let's see
pd_train.head()
Out[53]:
text label tokens
0 There is no relation at all between Fortier an... 1 [relat, fortier, profil, fact, polic, seri, vi...
1 This movie is a great. The plot is very true t... 1 [movi, great, plot, true, book, classic, writt...
2 George P. Cosmatos' "Rambo: First Blood Part I... 0 [georg, cosmato, rambo, first, blood, part, ii...
3 In the process of trying to establish the audi... 1 [process, tri, establish, audienc, empathi, ja...
4 Yeh, I know -- you're quivering with excitemen... 0 [yeh, know, quiver, excit, well, secret, live,...
In [54]:
# Repeat for the test data
pd_test["tokens"] = pd_test["text"].apply(preprocess_text)
In [55]:
# join all
pd_train["tokens"] = pd_train["tokens"].apply(' '.join)
pd_test["tokens"] = pd_test["tokens"].apply(' '.join)
In [56]:
# lets build our document feature matrix using Tfidf
from sklearn.feature_extraction.text import TfidfVectorizer

# instantiate a vectorizer with tf-idf weights
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    min_df=5, max_df=.90
)

# transform
train_tfidf = vectorizer.fit_transform(pd_train["tokens"]) # transform train
test_tfidf = vectorizer.transform(pd_test["tokens"]) # transform test

# check
print(train_tfidf.shape)
print(test_tfidf.shape)
(3000, 5608)
(300, 5608)
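If you want to peek at the features the vectorizer learned, you can list a few vocabulary terms (assuming scikit-learn >= 1.0, where get_feature_names_out is available):

# first few terms in the learned vocabulary
print(vectorizer.get_feature_names_out()[:10])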
In [57]:
# separate the target
y_train = pd_train["label"]
y_test = pd_test["label"]

Train your model

In [58]:
# import the models
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# train the model
model = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)
model.fit(train_tfidf,y_train)
Out[58]:
LogisticRegression(penalty='l1', random_state=42, solver='liblinear')
In [59]:
# assess the model
y_pred = model.predict(test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Accuracy: 0.8366666666666667
Confusion Matrix:
 [[121  29]
 [ 20 130]]

Pretty cool accuracy!
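Because we used an L1 penalty, many coefficients are exactly zero and the surviving ones are interpretable. A quick sanity check is to look at the most extreme coefficients (a minimal sketch, assuming the vectorizer and model fitted above and scikit-learn >= 1.0):

# stemmed words the classifier leans on most, by coefficient size
feature_names = vectorizer.get_feature_names_out()
coefs = model.coef_[0]
print("most positive:", feature_names[np.argsort(coefs)[-10:]])
print("most negative:", feature_names[np.argsort(coefs)[:10]])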

Practice

Modify the code above to try a different model. You can either use different parameters for the logistic regression or try a different model altogether. Did you do better than the model I showed?

In [60]:
# Your code here
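If you want a starting point, here is one possibility, a minimal sketch that swaps in a linear support vector machine on the same tf-idf features (LinearSVC is just one option among many):

# one possible alternative: a linear SVM on the same features
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0, random_state=42)
svm.fit(train_tfidf, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(test_tfidf)))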

Pre-Trained Large Language Models: Hugging Face

In the past few years, the field of natural language processing has undergone a major revolution. As we first saw, the early generation of NLP models was based on the idea of converting text to numbers through a document-feature matrix, relying on the bag-of-words assumption.

In the past ten years, we have seen the emergence of a new paradigm that uses deep learning and neural network models to improve the representation of text as numbers. These new models move away from the idea of a bag of words towards a more refined representation of text that captures the contextual meaning of words and sentences. This is achieved by training models with billions of parameters on text-sequencing tasks, using as inputs a dense representation of words: the famous word embeddings.

The most recent innovation in this revolution has been Transformer models. These models use multiple embeddings (matrices) to represent words, where each matrix can capture a different contextual representation of the words. This dynamic representation allows for higher predictive power on the downstream tasks in which these matrices form the foundation of the entire machine learning architecture. For example, Transformers are at the core of language models like OpenAI's GPTs and Meta's LLaMa.

Transformers use a sophisticated architecture that requires a huge amount of data and computational power to train. However, several of these models are open-sourced and made available on the web through a platform called Hugging Face. These are what we call pre-trained large language models. At this point, there are thousands of pre-trained models based on the Transformers framework available on Hugging Face.

Once you find a model that fits your task, you have two options:

  • Use the model architecture as-is: access the model through the transformers library and use it in your predictive tasks.

  • Fine-tuning: this is the most traditional way. You take the model, give it some of your data, re-train the model slightly so that it learns patterns from your data, and then use it in your predictive task. By fine-tuning a Transformers-based model for our own application, we can improve contextual understanding and therefore task-specific performance.

We will see an example of the first option for sentiment analysis. If you were to build a full classification pipeline, you would probably need to fine-tune the model. To learn more about fine-tuning, see the Hugging Face course chapter linked at the end of this section.

Transformers Library

To use a model available on hugging face, you only need a few lines of code.

In [61]:
# import the pipeline function
from transformers import pipeline

Use the pipeline class to access the model. The pipeline function gives you the default model for the task, which in this case is a BERT-based model; see here: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you

In [62]:
# instantiate your model
sentiment_pipeline = pipeline("sentiment-analysis")
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
In [63]:
# see simple cases
review1 = "Oh, I loved the Data Science I course. The best course I have ever done!"
review2 = "DS I was ok... bit less than ok... actually, only thing worthy was the cute baby we met at the end"

print(review1, review2)
Oh, I loved the Data Science I course. The best course I have ever done! DS I was ok... bit less than ok... actually, only thing worthy was the cute baby we met at the end
In [64]:
#prediction
sentiment_pipeline([review1, review2])
Out[64]:
[{'label': 'POSITIVE', 'score': 0.9998651742935181},
 {'label': 'NEGATIVE', 'score': 0.9798290133476257}]

We can easily use this model to make predictions on our entire dataset

In [66]:
# predict over the entire test set.
# notice I am truncating the input: this transformer can handle at most 512 tokens
pd_test["bert_scores"]=pd_test["text"].apply(sentiment_pipeline, truncation=True, max_length=512)
In [67]:
# let's clean it up
pd_test["bert_class"]=pd_test["bert_scores"].apply(lambda x: np.where(x[0]["label"]=="POSITIVE", 1, 0))
In [68]:
pd_test.head()
Out[68]:
text label vader_scores sentiment_vader tokens bert_scores bert_class
0 <br /><br />When I unsuspectedly rented A Thou... 1 {'neg': 0.069, 'neu': 0.788, 'pos': 0.143, 'co... 1 br br unsuspectedli rent thousand acr thought ... [{'label': 'POSITIVE', 'score': 0.998875796794... 1
1 This is the latest entry in the long series of... 1 {'neg': 0.066, 'neu': 0.862, 'pos': 0.073, 'co... 1 latest entri long seri film french agent frenc... [{'label': 'POSITIVE', 'score': 0.996983110904... 1
2 This movie was so frustrating. Everything seem... 0 {'neg': 0.24, 'neu': 0.583, 'pos': 0.177, 'com... 0 movi frustrat everyth seem energet total prepa... [{'label': 'NEGATIVE', 'score': 0.997244238853... 0
3 I was truly and wonderfully surprised at "O' B... 1 {'neg': 0.075, 'neu': 0.752, 'pos': 0.173, 'co... 1 truli wonder surpris brother art thou video st... [{'label': 'NEGATIVE', 'score': 0.649209439754... 0
4 This movie spends most of its time preaching t... 0 {'neg': 0.066, 'neu': 0.707, 'pos': 0.227, 'co... 1 movi spend time preach script make movi appar ... [{'label': 'NEGATIVE', 'score': 0.998503446578... 0
In [70]:
## accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(pd_test['label'], pd_test['bert_class'])
# see
print(accuracy)
0.87

Without any fine-tuning, we are already doing much, much better than dictionaries!

Use contextual knowledge: Model Trained on IMDB Reviews

Since I do not want to go through the in-depth process of fine-tuning a model, let's see whether there are models on Hugging Face that were actually trained on a similar task: predicting the sentiment of reviews.

Actually, there are many. See here: https://huggingface.co/models?sort=trending&search=sentiment+reviews

In [71]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# accessing the model
model = AutoModelForSequenceClassification.from_pretrained("MICADEE/autonlp-imdb-sentiment-analysis2-7121569")

# accessing the tokenizer
tokenizer = AutoTokenizer.from_pretrained("MICADEE/autonlp-imdb-sentiment-analysis2-7121569")
In [72]:
# build a pipeline with this model and tokenizer
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# run on the dataframe
pd_test["imdb_scores"]=pd_test["text"].apply(sentiment_pipeline, truncation=True, max_length=512)
In [73]:
# clean
pd_test["imdb_class"]=pd_test["imdb_scores"].apply(lambda x: np.where(x[0]["label"]=="positive", 1, 0))
In [74]:
## accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(pd_test['label'], pd_test['imdb_class'])
# see
print(accuracy)
0.97

If you want to learn more about fine-tuning your own transformer models, I suggest you read this chapter: https://huggingface.co/learn/llm-course/chapter3/1

Outsourcing to Generative Text-Based Models (next week!)

The last thing I will show you in class is the possibility of using ChatGPT as a classification tool. As you know, ChatGPT is a large language model (as we just saw) developed by OpenAI, based on the GPT architecture. The model was trained on a word-prediction task, and it has taken the world by storm with its capacity to engage in conversational interactions.

Some recent papers have shown that ChatGPT exhibits strong performance on downstream classification tasks, like sentiment analysis, even though the model was not trained or even fine-tuned for these tasks. Read here: https://osf.io/preprints/psyarxiv/sekf5/

This paper comes with R code showing how to interact with the ChatGPT API. The example I show below pretty much converts their code to Python. You can see a nice video walking through their R code here: https://www.youtube.com/watch?v=Mm3uoK4Fogc

The whole process requires access to the OpenAI API, which allows us to query the GPT models continuously. Notice that this is not free: you pay for every query. That being said, it is quite cheap.

In [207]:
# load the API key
# load the library to read environment files
import os
from dotenv import load_dotenv


# load keys from environment variables
load_dotenv() # .env file in cwd
gpt_key = os.environ.get("gpt") 
In [211]:
import requests 

# define headers
headers = {
        "Authorization": f"Bearer {gpt_key}",
        "Content-Type": "application/json",
    }

# define the question
question = "Please, tell me more about the Data Science and Public Policy Program at Georgetown's McCourt School"

data = {
        "model": "gpt-3.5-turbo-0301",
        "temperature": 0,
        "messages": [{"role": "user", "content": question}]
    }



# send a post request
response = requests.post("https://api.openai.com/v1/chat/completions", 
                             json=data, 
                             headers=headers)
# convert to json
response_json = response.json()
In [212]:
## see the output
response_json
Out[212]:
{'id': 'chatcmpl-8NTy58kBVsCS4d4JRasN2d679sdRd',
 'object': 'chat.completion',
 'created': 1700607433,
 'model': 'gpt-3.5-turbo-0301',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "The Data Science and Public Policy Program at Georgetown's McCourt School is a unique program that combines the fields of data science and public policy. The program is designed to equip students with the skills and knowledge needed to use data science to solve complex policy problems.\n\nThe program is interdisciplinary in nature, drawing on expertise from the fields of statistics, computer science, economics, and political science. Students in the program learn how to collect, analyze, and interpret data to inform policy decisions.\n\nThe curriculum includes courses in data science, statistics, machine learning, and policy analysis. Students also have the opportunity to work on real-world policy projects, collaborating with government agencies, non-profit organizations, and private sector companies.\n\nGraduates of the program are well-equipped to pursue careers in a variety of fields, including government, non-profit organizations, and the private sector. They are able to use data science to inform policy decisions and drive positive change in their communities.\n\nOverall, the Data Science and Public Policy Program at Georgetown's McCourt School is an innovative and exciting program that is helping to shape the future of public policy."},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 27, 'completion_tokens': 219, 'total_tokens': 246}}

Let's now write a function to query the API at scale

In [213]:
# Function to interact with the ChatGPT API
def hey_chatGPT(question_text, api_key):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    data = {
        "model": "gpt-3.5-turbo-0301",
        "temperature": 0,
        "messages": [{"role": "user", "content": question_text}]
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", 
                             json=data, 
                             headers=headers, timeout=5)
    
    response_json = response.json()
    return response_json['choices'][0]['message']['content'].strip()
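Before looping over 300 reviews (and paying for 300 calls), it is worth running a single smoke test (assuming a valid key is stored in gpt_key):

# single-call smoke test before querying at scale
hey_chatGPT("Answer only with a number: 1 if positive, 0 if neutral or negative. Text: I loved this movie!", gpt_key)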
In [122]:
import time
output = []
# Run a loop over the dataset of reviews and prompt ChatGPT
for i in range(len(pd_test)):
    try: 
        print(i)
        question = "Is the sentiment of this text positive, neutral, or negative? \
        Answer only with a number: 1 if positive, 0 if neutral or negative. \
        Here is the text: "
        text = pd_test.loc[i, "text"]
        full_question = question + str(text)
        output.append(hey_chatGPT(full_question, gpt_key))
    except Exception:
        # on a failed call, keep the row aligned with the dataframe
        output.append(np.nan)
0
1
2
...
299
In [124]:
# add as a column
pd_test["gpt_scores"] = pd.to_numeric(output)
In [128]:
pd_test2 = pd_test.dropna().copy()
In [129]:
# check accuracy
accuracy = accuracy_score(pd_test2['label'], pd_test2['gpt_scores'])
# see
print(accuracy)
0.8783269961977186

Pretty good results! Notice, we have done no fine-tuning here, just grabbing predictions from the model!
