
<h1><center> PPOL 6801 Text as Data <br><br> 
<font color='grey'> Supervised Learning with Text and Off-the-Shelf  <br><br>
Tiago Ventura </font> </center> </h1> 

---

In [17]:
# Open data
import pandas as pd
import numpy as np

## Supervised Learning with Text

To practice supervised learning with text data, we will perform a classic sentiment analysis classification task. Sentiment analysis is a natural language processing technique that, given a textual input (tweets, movie reviews, comments on a website chatbox, etc.), identifies the polarity of the text. 

There are different flavors of sentiment analysis, but one of the most widely used techniques labels data as positive, negative, or neutral. Other options are classifying text according to levels of toxicity, which I did in the paper I asked you to read, or using more fine-grained measures of sentiment. 

Sentiment analysis is just one of many types of classification tasks that can be done with text. For any type of task in which you need to identify if the input pertains to a certain category, you can use a set of tools similar to the ones we will see for sentiment analysis. For example, these are some classification tasks I have used in my work before: 

- Classify the levels of toxicity in social media live-streaming comments.
- Analyze the sentiment of tweets.
- Classify if the user is a Republican or Democrat given their Twitter bios. 
- Identify if a particular social media post contains misinformation. 

For all these tasks, you need to: 

- obtain some type of labelled data (which you and your research team will label), 
- build or use a pre-trained machine learning model to make the predictions,
- evaluate the performance of the model.

Here, we will work with data that was already labelled for us. We will analyze the sentiment of the reviews in the IMDB dataset.

### IMDB Dataset

For the rest of this notebook, we will use the IMDB dataset provided by [Hugging Face](https://huggingface.co/datasets/imdb). The IMDB dataset contains 25,000 movie reviews labeled by sentiment for training a model and 25,000 movie reviews for testing it. 

We will talk more about the Hugging Face project later in this notebook. For now, just install their `transformers` and `datasets` libraries, and import the IMDB review dataset.

#### Accessing the Dataset

In [18]:
#!pip install transformers
#!pip install datasets
from datasets import load_dataset
imdb = load_dataset("imdb")

Using the latest cached version of the module from /Users/tb186/.cache/huggingface/modules/datasets_modules/datasets/imdb/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0 (last modified on Sat Nov 18 11:40:32 2023) since it couldn't be found locally at imdb, or remotely on the Hugging Face Hub.


In [3]:
# get a smaller sample
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))

# convert to a dataframe
pd_train = pd.DataFrame(small_train_dataset)
pd_test = pd.DataFrame(small_test_dataset)

# see the data
pd_train.head()

|  | text | label |
|---|---|---|
| 0 | There is no relation at all between Fortier an... | 1 |
| 1 | This movie is a great. The plot is very true t... | 1 |
| 2 | George P. Cosmatos' "Rambo: First Blood Part I... | 0 |
| 3 | In the process of trying to establish the audi... | 1 |
| 4 | Yeh, I know -- you're quivering with excitemen... | 0 |


### Dictionary Methods

Our first approach for sentiment classification will use dictionary methods. 

**Common Procedure:** This approach consists of using a pre-determined set of words (a dictionary) that identifies the categories into which you want to classify documents. With this dictionary, you can do a simple search through the documents, count how many times these words appear, and use some type of aggregation function to classify the text. For example: 

- Positive or negative, for sentiment
- Sad, happy, angry, anxious... for emotions
- Sexism, homophobia, xenophobia, racism... for hate speech

Dictionaries are the most basic strategy to classify documents. Their simplicity requires some unrealistic assumptions (for example, ignoring the contextual information in the documents). However, dictionaries have one major advantage: they provide a bridge between qualitative and quantitative knowledge. You need human experts to build good dictionaries.  
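The counting-and-aggregating procedure described above can be sketched in a few lines. This is a minimal illustration, not a validated dictionary: the word lists `POSITIVE` and `NEGATIVE` and the function `dictionary_classify` are made up for this example.

```python
import re
from collections import Counter

# Hypothetical toy dictionary; a real one would be built by domain experts.
POSITIVE = {"great", "wonderful", "outstanding", "love"}
NEGATIVE = {"bad", "boring", "terrible", "hate"}


def dictionary_classify(text):
    """Count dictionary hits in the text and aggregate them into a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    positive_hits = sum(counts[word] for word in POSITIVE)
    negative_hits = sum(counts[word] for word in NEGATIVE)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


print(dictionary_classify("A wonderful cast and a great script."))
print(dictionary_classify("Boring plot, terrible acting."))
print(dictionary_classify("Not great at all."))
```

The last call comes back as positive even though the sentence is negative, because a plain word count ignores the "not". That is the kind of contextual information a simple dictionary misses.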

#### VADER

There are many options for dictionaries for sentiment classification. We will use one popular open-source option available in NLTK: the VADER dictionary. VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion, and it was developed particularly to handle social media content. 


**Key Components of the VADER Dictionary:**

- Sentiment Lexicon: This is a list of known words and their associated sentiment scores. 

- Sentiment Intensity Scores: Each word in the lexicon is assigned a score that ranges from -4 (extremely negative) to +4 (extremely positive). 

- Handling of Contextual and Qualitative Modifiers: VADER is sensitive to both intensifiers (e.g., "very") and negations (e.g., "not"). 

You can read the original paper that introduced VADER (Hutto and Gilbert, 2014) [here](https://www.google.com/search?q=ADER%3A+A+Parsimonious+Rule-based+Model+for+Sentiment+Analysis+of+Social+Media+Text.+Eighth+International+Conference+on+Weblogs+and+Social+Media&rlz=1C5GCEM_enUS1072US1073&oq=ADER%3A+A+Parsimonious+Rule-based+Model+for+Sentiment+Analysis+of+Social+Media+Text.+Eighth+International+Conference+on+Weblogs+and+Social+Media&gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIGCAEQRRg60gEHMTU2ajBqNKgCALACAA&sourceid=chrome&ie=UTF-8).
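To make the idea of intensifiers and negations concrete, here is a toy scorer. It is only a sketch under simplifying assumptions: the three word lists and their numbers are invented, and VADER's actual rules are more elaborate than multiplying scores and flipping signs.

```python
# Invented valence scores on a -4..+4 scale (NOT the real VADER lexicon).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.5, "awful": -3.0}
# Invented multipliers for intensifiers and an invented set of negations.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
NEGATIONS = {"not", "never", "no"}


def toy_valence(text):
    """Sum word scores, boosting after intensifiers and flipping after negations."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        score = LEXICON[token]
        previous = tokens[i - 1] if i >= 1 else ""
        before_previous = tokens[i - 2] if i >= 2 else ""
        if previous in INTENSIFIERS:
            score *= INTENSIFIERS[previous]
            # "not very good": look one word further back for the negation
            negated = before_previous in NEGATIONS
        else:
            negated = previous in NEGATIONS
        if negated:
            score = -score
        total += score
    return total


for sentence in [
    "the food was good",
    "the food was very good",
    "the food was not good",
    "the food was not very good",
]:
    print(sentence, "->", toy_valence(sentence))
```

The same word, "good", contributes a different amount in each sentence depending on the one or two words in front of it.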

#### Import dictionary

In [19]:
import nltk
# nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [20]:
# instantiate the model
sid = SentimentIntensityAnalyzer()

# simple example
review1 = "I have eaten here dozens of times and have always had an outstanding experience. a meal at Fogo de Chao is always a wonderful experience!"

# classify
sid.polarity_scores(review1)

{'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'compound': 0.8398}

In [21]:
# simple example
review2 = "Long wait on a rainy day. \
I had to order my burger twice. \
I ordered it medium and came way over cooked."

# classify
sid.polarity_scores(review2)

{'neg': 0.067, 'neu': 0.933, 'pos': 0.0, 'compound': -0.0772}

Let's now apply the dictionary at scale to our IMDB review dataset.
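Before scaling up, it may help to see where the `compound` value comes from. As far as I can tell from the NLTK implementation, the summed word valences are squashed into the (-1, 1) range with `x / sqrt(x^2 + alpha)`, where `alpha = 15`. The sketch below reproduces that formula and adds the three-way cut-offs at ±0.05 suggested in the VADER documentation; the function names are mine.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


def three_way_label(compound, threshold=0.05):
    """Map a compound score to positive / neutral / negative."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for valence_sum in [-8.0, -0.3, 0.0, 0.3, 8.0]:
    compound = normalize(valence_sum)
    print(valence_sum, round(compound, 4), three_way_label(compound))
```

A binary rule such as `compound > 0` is a simpler choice that fits the two-class IMDB labels; with it, every score at or below zero counts as negative.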

In [22]:
# apply the dictionary to your data frame
pd_test["vader_scores"] = pd_test["text"].apply(sid.polarity_scores)

# grab final sentiment: 1 (positive) if the compound score is above zero, else 0
pd_test["sentiment_vader"] = pd_test["vader_scores"].apply(lambda x: int(x["compound"] > 0))

# let's see
pd_test.head()

Now that we have performed the classification task, we can compare the labels and our predictions. We will use a simple accuracy measure: the share of labels that were correctly classified.
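Accuracy here is just the fraction of predictions that match the labels. A hand-rolled version, shown only to make the definition explicit (the two short lists are made up), could look like this:

```python
def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


labels = [1, 0, 1, 1, 0]       # made-up gold labels
predictions = [1, 0, 0, 1, 1]  # made-up classifier output

print(accuracy(labels, predictions))  # 3 of 5 match -> 0.6
```

With its default settings, `sklearn.metrics.accuracy_score` returns the same quantity.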

In [23]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(pd_test['label'], pd_test['sentiment_vader'])

# see
print(accuracy)

0.6966666666666667


## Pre-Trained Large Language Models: Hugging Face

In the past few years, the field of natural language processing has undergone a major revolution. As we first saw, the early generation of NLP models was based on the idea of converting text to numbers through a document-feature matrix, relying on the bag-of-words assumption. 
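As a reminder of what the bag-of-words assumption implies, the sketch below builds a tiny document-feature matrix by hand. The three documents are invented; the point is that two texts with the same words in a different order end up with the same row.

```python
from collections import Counter

docs = [
    "the dog bites the man",
    "the man bites the dog",
    "the dog sleeps",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term; cells are counts.
dfm = []
for doc in tokenized:
    counts = Counter(doc)
    dfm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dfm:
    print(row)
```

The first two documents describe opposite events, yet their rows are identical because word order is discarded.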

In the past ten years, we have seen the emergence of a new paradigm that uses deep learning and neural network models to improve the representation of text as numbers. These new models move away from the idea of a bag of words towards a more refined representation of text that captures the contextual meaning of words and sentences. This is achieved by training models with billions of parameters on text-sequencing tasks, using as inputs a dense representation of words. These are the famous word embeddings. 
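A dense representation means each word is a short vector of real numbers, and words used in similar contexts end up close together. The three-dimensional vectors below are invented for illustration (real embeddings are learned from data and have far more dimensions); the sketch only shows how closeness is commonly measured, with cosine similarity.

```python
import math

# Invented vectors, for illustration only.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(round(cosine(EMBEDDINGS["good"], EMBEDDINGS["great"]), 3))
print(round(cosine(EMBEDDINGS["good"], EMBEDDINGS["awful"]), 3))
```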

The most recent innovation in this revolution has been the Transformer models. These models use multiple embeddings (matrices) to represent a word, in which each matrix can capture a different contextual representation of the word. This dynamic representation allows for higher predictive power on downstream tasks, in which these matrices form the foundation of the entire machine learning architecture. For example, Transformers are at the core of language models like OpenAI's GPTs and Meta's LLaMA.
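One way to see what "contextual" means is a stripped-down version of the attention step used inside Transformers. The sketch below is a simplification under stated assumptions: it uses invented two-dimensional word vectors and skips the learned query, key, and value projections, so it is scaled dot-product self-attention in its barest form rather than a real Transformer layer.

```python
import math

# Invented static vectors; real models learn these.
STATIC = {"river": [0.0, 1.0], "money": [1.0, 1.0], "bank": [1.0, 0.0]}


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return one context-aware vector per input vector (no learned projections)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
        )
    return outputs


def contextual(sentence):
    """Map each word of a sentence to its attention-weighted vector."""
    tokens = sentence.split()
    return dict(zip(tokens, self_attention([STATIC[t] for t in tokens])))


print(contextual("river bank")["bank"])
print(contextual("money bank")["bank"])
```

The static vector for `bank` is the same in both sentences, but its output differs because it is averaged with different neighbours.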

The Transformers use a sophisticated architecture that requires a huge amount of data and computational power to be trained. However, several of these models are open-sourced and made available to us on the web through a platform called [Hugging Face](https://huggingface.co/). Those are what we call **pre-trained large language models**. At this point, there are thousands of pre-trained models based on the Transformers framework available on Hugging Face. 

Once you find a model that fits your task, you have two options: 

- **Use the model architecture:** access the model through the transformers library, and use it in your predictive tasks. 

- **Fine-Tuning:** this is the most traditional way. You get the model, give it some data, re-train the model slightly so that it learns patterns from your data, and use it on your predictive task. By fine-tuning a Transformers-based model for our own application, we can improve contextual understanding and therefore task-specific performance.

We will see an example of the first option for sentiment analysis. If you were to build a full pipeline for classification, you would probably need to fine-tune the model. To learn more about fine-tuning, I suggest you read: 

- here on hugging face: https://huggingface.co/blog/sentiment-analysis-python

- and this forthcoming paper for political science applications: https://joantimoneda.netlify.app/files/Timoneda%20Vallejo%20V%20JOP.pdf

### Transformers Library

To use a model available on Hugging Face, you only need a few lines of code. 

#### Use the pipeline class to access the model. 

The pipeline function will give you the default model for this task, which in this case is a DistilBERT-based model; see here: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you
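If you prefer not to rely on the task default, the checkpoint can be named explicitly. The sketch below pins the model id from the link above and the revision reported in the warning that the library prints when no model is supplied. `build_sentiment_pipeline` and `to_binary` are helper names made up for this example, and the import sits inside the function so that the sketch loads even where `transformers` is not installed.

```python
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"  # revision reported in the library's "No model was supplied" warning


def build_sentiment_pipeline(model_id=MODEL_ID, revision=REVISION):
    """Create the sentiment pipeline with an explicit model and revision.

    Calling this requires the transformers library and downloads the model
    on first use.
    """
    from transformers import pipeline  # imported here so the sketch loads without it

    return pipeline("sentiment-analysis", model=model_id, revision=revision)


def to_binary(prediction):
    """Turn one result such as {'label': 'POSITIVE', 'score': 0.99} into 1 or 0."""
    return 1 if prediction["label"].lower() == "positive" else 0


# Example (needs transformers and network access, so it is left commented out):
# sentiment_pipeline = build_sentiment_pipeline()
# print([to_binary(p) for p in sentiment_pipeline(["I loved it", "I hated it"])])
```

Comparing labels case-insensitively in `to_binary` is a deliberate choice: different checkpoints spell their labels differently (for example `POSITIVE` versus `positive`).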

In [24]:
# import the pipeline function
from transformers import pipeline

In [25]:
# instantiate your model
sentiment_pipeline = pipeline("sentiment-analysis")

# see simple cases
print(review1, review2)

#prediction
sentiment_pipeline([review1, review2])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


I have eaten here dozens of times and have always had an outstanding experience. a meal at Fogo de Chao is always a wonderful experience! Long wait on a rainy day. I had to order my burger twice. I ordered it medium and came way over cooked.


[{'label': 'POSITIVE', 'score': 0.9998835325241089},
 {'label': 'NEGATIVE', 'score': 0.9927043914794922}]

In [11]:
# predict on the entire test sample. 
# notice here I am truncating the input text. This model can only deal with 512 tokens max
pd_test["bert_scores"]=pd_test["text"].apply(sentiment_pipeline, truncation=True, max_length=512)

# let's clean it up
pd_test["bert_class"]=pd_test["bert_scores"].apply(lambda x: np.where(x[0]["label"]=="POSITIVE", 1, 0))

pd_test.head()

|  | text | label | vader_scores | sentiment_vader | bert_scores | bert_class |
|---|---|---|---|---|---|---|
| 0 | `<br /><br />When I unsuspectedly rented A Thou...` | 1 | `{'neg': 0.069, 'neu': 0.788, 'pos': 0.143, 'co...` | 1 | `[{'label': 'POSITIVE', 'score': 0.998875796794...` | 1 |
| 1 | This is the latest entry in the long series of... | 1 | `{'neg': 0.066, 'neu': 0.862, 'pos': 0.073, 'co...` | 1 | `[{'label': 'POSITIVE', 'score': 0.996983110904...` | 1 |
| 2 | This movie was so frustrating. Everything seem... | 0 | `{'neg': 0.24, 'neu': 0.583, 'pos': 0.177, 'com...` | 0 | `[{'label': 'NEGATIVE', 'score': 0.997244238853...` | 0 |
| 3 | I was truly and wonderfully surprised at "O' B... | 1 | `{'neg': 0.075, 'neu': 0.752, 'pos': 0.173, 'co...` | 1 | `[{'label': 'NEGATIVE', 'score': 0.649214446544...` | 0 |
| 4 | This movie spends most of its time preaching t... | 0 | `{'neg': 0.066, 'neu': 0.707, 'pos': 0.227, 'co...` | 1 | `[{'label': 'NEGATIVE', 'score': 0.998503446578...` | 0 |


With predictions for our entire test sample in hand, we can easily compute the accuracy of this model.
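The prediction cell above calls the pipeline once per row through `apply`. Since the pipeline also accepts a list of texts (as in the two-review example earlier), an alternative is to feed it chunks. This is a sketch, not a benchmark: `classify_in_chunks` is a hypothetical helper, and the stand-in `fake_pipeline` exists only so the example runs without downloading a model.

```python
def classify_in_chunks(pipe, texts, chunk_size=32, max_length=512):
    """Call `pipe` on successive chunks of `texts` and collect the results in order."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(pipe(chunk, truncation=True, max_length=max_length))
    return results


def fake_pipeline(texts, truncation=True, max_length=512):
    """Stand-in with the same call shape as a text-classification pipeline."""
    return [
        {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}
        for text in texts
    ]


reviews = ["good film", "dull film", "really good", "too long", "good fun"]
predictions = classify_in_chunks(fake_pipeline, reviews, chunk_size=2)
print([p["label"] for p in predictions])
```

With the real pipeline you would pass `pd_test["text"].tolist()` as `texts`. Each element of the result is then a single dictionary rather than a one-element list, so the clean-up step would read `x["label"]` instead of `x[0]["label"]`.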

In [26]:
## accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(pd_test['label'], pd_test['bert_class'])
# see
print(accuracy)

0.87


Without any fine-tuning, we are already doing much, much better than dictionaries!
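Accuracy alone does not say which kind of mistake each classifier makes. As a small illustration, the sketch below tallies a confusion table for the five rows displayed by `pd_test.head()` above (the `label`, `sentiment_vader`, and `bert_class` columns). `confusion_counts` is a helper written for this example, and five rows are far too few to draw conclusions from.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            counts["tp"] += 1
        elif truth == 0 and guess == 1:
            counts["fp"] += 1
        elif truth == 0 and guess == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Values copied from the five rows shown by pd_test.head()
label = [1, 1, 0, 1, 0]
sentiment_vader = [1, 1, 0, 1, 1]
bert_class = [1, 1, 0, 0, 0]

print("VADER:", confusion_counts(label, sentiment_vader))
print("BERT: ", confusion_counts(label, bert_class))
```

In these five rows each classifier makes one mistake, but of a different kind: the dictionary calls a negative review positive, while the transformer model calls a positive review negative.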

### Use contextual knowledge: Model Trained on IMDB Reviews

We will go in-depth into the process of fine-tuning your model in week 12 of the course. Until then, let's see if there are models on Hugging Face that were actually trained on a similar task: predicting the sentiment of reviews. 

Actually, there are many. See here: https://huggingface.co/models?sort=trending&search=sentiment+reviews

In [27]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# accessing the model
model = AutoModelForSequenceClassification.from_pretrained("MICADEE/autonlp-imdb-sentiment-analysis2-7121569")

# accessing the tokenizer
tokenizer = AutoTokenizer.from_pretrained("MICADEE/autonlp-imdb-sentiment-analysis2-7121569")

# build a pipeline with this model and tokenizer
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [28]:
# Run in the dataframe
pd_test["imdb_scores"]=pd_test["text"].apply(sentiment_pipeline, truncation=True, max_length=512)

# clean
pd_test["imdb_class"]=pd_test["imdb_scores"].apply(lambda x: np.where(x[0]["label"]=="positive", 1, 0))

In [15]:
## accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(pd_test['label'], pd_test['imdb_class'])
# see
print(accuracy)

0.97


### Who still needs a dictionary?!?!