Week 5: Supervised Learning:
Training your own classifiers
Today is the deadline for Problem Set 1. Questions?
We started with pre-processing text as data, representing text as numbers, and describing features of the text.
Last week, we started measurement and classification:
How can we systematically classify text into meaningful categories (e.g., positive/negative, hate vs. non-hate speech, spam vs. non-spam)?
Goal: Learn a function f that maps text inputs x to category outputs y
Dictionaries: Mapping function is explicitly defined by the researcher using predefined word lists
Supervised Classification: Mapping function is learned from the data by training an algorithm to predict categories based on labeled documents
Def: a type of machine learning algorithm that requires labeled input data to learn the mapping function from the input to the output
SUPERVISED: The true labels y act as a “supervisor” for the learning process
LEARNING: the process by which a machine/model improves its performance on a specific task
The aim of statistical models is to estimate the relationship between the outcome and some set of variables:
\[ y = f(X) + \epsilon\]
Where:
\(y\) is the outcome/dependent/response variable
\(X\) is a matrix of predictors/features/independent variables
\(f()\) is some fixed but unknown function mapping X to y. The “signal” in the data
\(\epsilon\) is some random error term. The “noise” in the data.
Two reasons we want to estimate \(f(\cdot)\):
Inference
Goal is interpretation
Key limitation: a model built for interpretation may not predict new cases well
Classic Social Science Approach:
\[ y_i = \beta_1 \cdot \text{education}_i + \beta_2 \cdot \text{gender}_i + \epsilon_i \]
Prediction
Goal is to predict values of the outcome, \(\hat{y}\)
\(\hat{f}(X)\) is treated as a black box
Key limitation: \(\hat{f}\) is hard to interpret; we learn little about how individual features relate to the outcome
Example: given a number k of features, and all possible interactions between them, how likely is each voter i to turn out?
This is the machine learning / predictive modeling / AI tradition
\[ y_i = \hat{f}(X_i) + \epsilon_i \]
Consider this task: “Predicting sentiment of news about the economy in the US”.
Inference or Prediction?
| Component | Definition | Example in This Case |
|---|---|---|
| Input Features (x) | Textual data (e.g., a DFM) | DFM of news articles about the economy |
| Label (y) | True category or outcome we want to predict | Hand-coded sentiment on a small set of news |
| Prediction (ŷ) | Model's estimated output for unseen cases | Predicted sentiment for the remaining news |
Step 1: Label some examples of the concept we want to measure
Step 2: Train a statistical model on this set of labeled data, using the document-feature matrix as input, on a training dataset
Step 3: Choose a model (transformation function) that gives higher out-of-sample accuracy
Step 4: Use the classifier - some f(x) - to predict the labels of unseen documents
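A minimal sketch of this workflow in Python with scikit-learn; the toy documents, labels, and variable names are illustrative assumptions, not course data:

```python
# Hypothetical toy example of the four steps above (documents and labels are made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: a small set of hand-labeled documents (1 = positive, 0 = negative)
docs = ["economy is booming", "markets crash badly",
        "growth is strong", "unemployment rises again"]
labels = [1, 0, 1, 0]

# Step 2: build a document-feature matrix (DFM) and train a model on the labeled data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse DFM: documents x features
clf = LogisticRegression().fit(X, labels)

# Step 4: use the classifier f(x) to predict labels for unseen documents
new_docs = ["strong growth expected", "markets in crisis"]
print(clf.predict(vectorizer.transform(new_docs)))
```

Step 3 (choosing among models based on out-of-sample accuracy) is illustrated in the train-test split and cross-validation sketches later in this section.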
External Source of Annotation: Someone else labelled the data for you
Expert annotation:
Crowd-sourced coding: digital labor markets
Crowdsourcing is now understood to mean using the Internet to distribute a large package of small tasks to a large number of anonymous workers, located around the world and offered small financial rewards per task. The method is widely used for data-processing tasks such as image classification, video annotation, data entry, optical character recognition, translation, recommendation, and proofreading
Expert annotation is expensive.
Benoit, Conway, Lauderdale, Laver and Mikhaylov (2016) note that classification jobs could be given to a large number of relatively cheap online workers
Multiple workers ~ similar task ~ same stimuli ~ wisdom of crowds!
Representativeness of a broader population doesn't matter ~ it is not a population quantity, just a measurement task
Their task: Manifestos ~ sentences ~ workers:
Reduce uncertainty by having more workers for each sentence
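A small sketch of one simple way to aggregate several workers' codes for the same sentence (majority vote); Benoit et al. use more elaborate aggregation, and the sentence IDs and codes below are made-up placeholders:

```python
# Hypothetical crowd codes: several workers code each sentence; majority vote aggregates them.
from collections import Counter

codes = [("s1", "economic"), ("s1", "economic"), ("s1", "social"),
         ("s2", "social"), ("s2", "social"), ("s2", "social")]

by_sentence = {}
for sent_id, code in codes:
    by_sentence.setdefault(sent_id, []).append(code)

# More workers per sentence -> a more stable majority label
majority = {sid: Counter(c).most_common(1)[0][0] for sid, c in by_sentence.items()}
print(majority)  # e.g. {'s1': 'economic', 's2': 'social'}
```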
In text as data, your DFM often has more features than documents (Features > Documents)
Bias-Variance Trade-off
Many models can do this.
The simplest, but highly effective, way to build ML models in TaD is to use a known model (OLS) plus regularization to reduce dimensionality:
OLS Loss Function:
\[ RSS = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 \]
\[ \text{RSS} = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} \beta_j^2 \rightarrow \text{ridge regression} \]
Penalizes large coefficients.
Shrinks all coefficients towards zero, but none are exactly zero.
\[ \text{RSS} = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} |\beta_j| \rightarrow \text{lasso regression} \]
Penalizes absolute size of coefficients.
Many coefficients shrink to exactly zero → automatic variable selection.
VERY USEFUL FOR SPARSE MATRICES!
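A small sketch contrasting the two penalties on a sparse feature matrix, using scikit-learn's Ridge and Lasso; the simulated data (only 5 truly relevant features) is an assumption for illustration, and `alpha` plays the role of \(\lambda\):

```python
# Simulated sparse "DFM" with only a few truly predictive features (illustrative only).
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = sparse_random(100, 500, density=0.05, random_state=0, format="csr")  # 100 docs x 500 features
beta = np.zeros(500)
beta[:5] = 2.0                                   # only the first 5 features matter
y = X @ beta + rng.normal(scale=0.1, size=100)   # outcome = sparse signal + noise

ridge = Ridge(alpha=1.0).fit(X, y)               # L2 penalty: lambda * sum(beta_j^2)
lasso = Lasso(alpha=0.01).fit(X, y)              # L1 penalty: lambda * sum(|beta_j|)

# Ridge shrinks coefficients toward zero but (almost) never to exactly zero;
# lasso sets many coefficients to exactly zero -> automatic variable selection.
print("non-zero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
print("non-zero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
```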
The purpose of training the model is to capture signal and ignore noise.
To do so:
Define a performance metric and optimize it, e.g., minimize the error rate. Simplest form: how many errors am I making with the predictions?
Train-Test Split
Split your data between training and test
Training data: used to fit candidate models and tune their parameters
Test data: used to assess the accuracy of your model
Select the model with the best out-of-sample predictive accuracy, i.e., the lowest prediction error on UNSEEN data
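A sketch of the split and the out-of-sample evaluation; `make_classification` is just a synthetic stand-in for a labeled DFM:

```python
# Synthetic stand-in for a labeled document-feature matrix.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=300, n_informative=20, random_state=1)

# Fit on the training portion only; evaluate on the held-out (UNSEEN) test portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("out-of-sample accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```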
Expected prediction error for a data point:
\[ E\big[(\hat{y} - y)^2\big] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]
Bias: Difference between the expected prediction of the model and the true value.
\[\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)\]
Variance: How much do the model’s predictions fluctuate for different training sets?
\[ \text{Var}(\hat{f}(x)) = E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big]\]
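For reference, a short sketch of where the decomposition comes from, assuming \(y = f(x) + \epsilon\) with \(E[\epsilon] = 0\), \(\text{Var}(\epsilon) = \sigma^2\), and \(\epsilon\) independent of \(\hat{f}(x)\):

\[
E\big[(\hat{y} - y)^2\big] = E\Big[\big(\hat{f}(x) - f(x) - \epsilon\big)^2\Big]
= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Error}}
\]

The cross terms vanish because \(E[\epsilon] = 0\) and \(\hat{f}(x) - E[\hat{f}(x)]\) has mean zero.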
Bias and Variance Tradeoff:
We can fit a very complicated model to our training set perfectly: low bias, but high variance (predictions fluctuate across training sets)
We can be more relaxed about performance in the training set with a simpler model: low variance, but higher bias
Training Set - Largest portion of the data - To fit candidate models
Validation Set - Held-out portion - To tune parameters and choose among candidate models
Test Set - Smaller portion (10-20% of data) - To assess final model performance
Cross-validation: resampling method that involves partitioning data into subsets, allowing us to train and test on different combinations
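A small sketch of k-fold cross-validation with scikit-learn; the synthetic X and y again stand in for a DFM and hand-coded labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=100, n_informative=10, random_state=2)

# 5 folds: each fold is held out once for evaluation while the model is
# trained on the remaining folds; the fold scores are then averaged.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("mean CV accuracy:", round(scores.mean(), 3))
```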
| Factor | Recommended Classifiers |
|---|---|
| Task Type | Classification: Logistic Regression, Random Forest, SVM; Regression: Linear Regression, Random Forest Regressor |
| Data Size | Small datasets: Logistic Regression, Naive Bayes; Large datasets: Neural Networks, Gradient Boosting |
| Feature Characteristics | Sparse features (e.g., text data): Naive Bayes, SVM; Non-linear relationships: Random Forest, Gradient Boosting |
| Interpretability | High interpretability: Logistic Regression, Decision Trees; Performance focus: Neural Networks, Gradient Boosting |
| Computational Resources | Low resources: Naive Bayes, Logistic Regression; High resources: Neural Networks, Gradient Boosting |
|            | Predicted J        | Predicted ¬J       | Total |
|------------|--------------------|--------------------|-------|
| Actual J   | a (True Positive)  | b (False Negative) | a+b   |
| Actual ¬J  | c (False Positive) | d (True Negative)  | c+d   |
| Total      | a+c                | b+d                | N     |
Accuracy: number correctly classified / total number of cases = (a+d)/(a+b+c+d)
Precision: number of TP / (number of TP + number of FP) = a/(a+c)
Recall: number of TP / (number of TP + number of FN) = a/(a+b)
F1: 2 * (precision * recall) / (precision + recall)
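A tiny sketch computing these metrics from the confusion-matrix cells; the counts are made up for illustration:

```python
# Hypothetical confusion-matrix counts: a = TP, b = FN, c = FP, d = TN.
a, b, c, d = 40, 10, 20, 930

accuracy = (a + d) / (a + b + c + d)
precision = a / (a + c)
recall = a / (a + b)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}  F1={f1:.3f}")
```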
You are working for the FBI, looking for emails that pertain to terrorist attacks. Fortunately, such emails are very, very rare (0.0001% of all emails).
For such tasks, there’s a trade-off between precision and recall. Explain why.
We may be skeptical of using accuracy as a performance indicator in this case. Explain why.
Task: Tone of New York Times coverage of the economy, discussed as a text-as-data application.