PPOL 6801 - Text as Data - Computational Linguistics


Week 5: Supervised Learning:
Training your own classifiers

Professor: Tiago Ventura

Housekeeping

Today is the deadline for Problem Set 1.

  • Replications next week

    • Presentation (20 min each):
      • Introduction;
      • Methods;
      • Results;
      • Differences;
      • Autopsy of the replication;
      • Extensions
    • Repository (by Friday):
      • Github Repo
        • readme
        • your presentation pdf
        • code
        • 5-page report

Where are we?

We started from pre-processing text as data, representing text as numbers, and describing features of the text.

Last week, we started learning how to measure concepts in text:

Documents pertain to certain classes, and we can use statistical assumptions to measure these classes.

  • Dictionary Methods
    • Discuss some well-known dictionaries
  • Off-the-Shelf Classifiers
    • Perspective API
    • Hugging Face (for now, only as off-the-shelf models; LLMs come later in this course)

Remember…

  • Unsupervised Models: learning (hidden or latent) structure in unlabeled data.

    • Topic Models to cluster documents and words
  • Supervised Models: learning relationship between inputs and a labeled set of outputs.

    • Sentiment analysis, classifying whether a tweet contains misinformation, etc.

In TAD, we mostly use unsupervised techniques for discovery and supervised for measurement of concepts.

Today: cover the pipeline to train your own machine learning models to classify textual data.

Assuming:

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

    • some tweets are positive, some are neutral and some are negative
  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

    • choose a model (transformation function) that gives higher out-of-sample accuracy
  • Step 3: use the classifier - some f(x) - to predict unseen documents.

    • pick the model with the best out-of-sample performance
  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

    • This is where social science happens! (A code sketch of steps 1-3 follows this list.)
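
A minimal sketch of steps 1-3 in Python, assuming scikit-learn; the toy tweets, labels, and the choice of CountVectorizer + logistic regression are illustrative, not prescriptive:

```python
# Minimal supervised TAD pipeline (sketch): label -> train -> predict.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: a (toy) hand-labeled set of documents
labeled_texts = ["great news for the economy", "markets crash again",
                 "terrible jobs report", "strong growth this quarter"]
labels = ["positive", "negative", "negative", "positive"]

# Step 2: turn text into a document-feature matrix and train a classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labeled_texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Step 3: use the fitted f(x) to predict labels for unseen documents
new_texts = ["stocks rally after the jobs report"]
print(clf.predict(vectorizer.transform(new_texts)))

# Step 4 happens outside the model: combine the predictions with metadata
# (dates, parties, exogenous shocks) to answer a substantive question.
```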

Supervised Learning vs Dictionaries

Dictionary methods:

  • Advantage: not corpus-specific, cost to apply to a new corpus is trivial
  • Disadvantage: not corpus-specific, so performance on a new corpus is unknown (domain shift)

Supervised learning:

  • Generalization of dictionary methods
  • Features associated with each category (and their relative weights) are learned from the data
  • By construction, ML will outperform dictionary methods in classification tasks, as long as the training sample is large enough

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

  • Step 3: use the classifier - some f(x) - to predict unseen documents.

  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

Creating a labeled set

How to obtain a labeled dataset?

  • External Source of Annotation: someone else labelled the data for you

    • Federalist papers
    • Metadata from text
    • Manifestos from Parties with well-developed dictionaries
  • Expert annotation: put "experts" in quotation marks

    • mostly undergrads ~ whom you train to be experts
  • Crowd-sourced coding: digital labor markets

    • Wisdom of Crowds: the idea that large groups of non-expert people are collectively smarter than individual experts when it comes to problem-solving

Crowdsourcing as a research tool for ML




Crowdsourcing is now understood to mean using the Internet to distribute a large package of small tasks to a large number of anonymous workers, located around the world and offered small financial rewards per task. The method is widely used for data-processing tasks such as image classification, video annotation, data entry, optical character recognition, translation, recommendation, and proofreading

Benoit et al., 2016: Crowdsourcing Political Texts

  • Expert annotation is expensive.

  • Benoit, Conway, Lauderdale, Laver and Mikhaylov (2016) note that classification jobs could be given to a large number of relatively cheap online workers

  • Multiple workers ~ similar task ~ same stimuli ~ wisdom of crowds!

  • Representativeness of a broader population doesn't matter ~ this is not a population quantity, it is just a measurement task

  • Their task: Manifestos ~ sentences ~ workers:

    • social|economic
      • very left vs. very right
  • Reduce uncertainty by having more workers for each sentence
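
A minimal sketch of how multiple worker codings of the same sentence can be aggregated, assuming a toy long-format table; the column names and the majority-vote rule are illustrative (Benoit et al. use a more elaborate aggregation and scaling model):

```python
# Aggregate multiple crowd codings per sentence by majority vote (sketch).
# Assumes a long-format DataFrame with one row per (sentence, worker) coding;
# the column names `sentence_id` and `label` are illustrative.
import pandas as pd

codings = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 2],
    "label":       ["economic", "economic", "social",
                    "social", "social", "economic"],
})

# modal label per sentence; more workers per sentence -> less noisy aggregate
majority = (codings.groupby("sentence_id")["label"]
                   .agg(lambda s: s.mode().iloc[0]))
print(majority)
```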

Comparing Experts and online workers

How many workers?

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

  • Step 3: use the classifier - some f(x) - to predict unseen documents.

  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

General Thoughts

Once we have our training data, we need to pick a classifier. We face these challenges:

  • in text as data, often your DFM has Features > Documents

    • identification problems for statistical models
    • overfitting the data
  • Bias-Variance Trade-off

    • fit an overly complex (flexible) model ~ leads to higher variance
    • fit an overly simple (rigid) model ~ leads to more bias
  • Many models:

    • Naive Bayes
    • Regularized regression
    • SVM
    • k-nearest neighbors, tree-based methods, etc.
    • Ensemble methods + DL

Bias and Variance Tradeoff

Train-Validation-Test OR Cross Validation
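
A minimal sketch of the cross-validation idea, assuming scikit-learn and a toy corpus: each candidate model is scored on held-out folds rather than on the data it was fit to, and the model with the best held-out performance is kept.

```python
# Sketch: compare candidate classifiers by cross-validated accuracy
# instead of in-sample fit. The toy corpus and the two models are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

texts = ["great economy", "strong growth", "jobs are up", "markets rally", "good news",
         "markets crash", "weak growth", "jobs are down", "bad news", "recession fears"]
y = ["pos"] * 5 + ["neg"] * 5

X = CountVectorizer().fit_transform(texts)

models = {
    "naive_bayes": MultinomialNB(),
    "regularized_logit": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
}

# 5-fold cross-validation: train on 4 folds, score on the held-out fold
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 2))
```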

Many Models

But not so different…

Regularized OLS Regression

The simplest, yet highly effective, way to avoid overfitting and improve out-of-sample accuracy is to add a penalty term to the model's loss function:

OLS Loss Function:

\[ RSS = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 \]

OLS + Penalty:

\[ \text{RSS} = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} \beta_j^2 \rightarrow \text{ridge regression} \]

\[ \text{RSS} = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} |\beta_j| \rightarrow \text{lasso regression} \]
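
A minimal sketch of these two penalties applied to a document-feature matrix, assuming scikit-learn and a toy corpus with a continuous outcome; `alpha` plays the role of \(\lambda\) and would normally be chosen by cross-validation.

```python
# Ridge (L2) vs. Lasso (L1) penalties on a document-feature matrix (sketch).
# The toy corpus and the hypothetical tone scores are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge, Lasso

texts = ["strong growth and more jobs", "markets crash amid recession fears",
         "mixed signals on growth", "jobs report beats expectations"]
tone = np.array([0.8, -0.9, 0.1, 0.6])   # hypothetical labeled tone scores

X = CountVectorizer().fit_transform(texts)

ridge = Ridge(alpha=1.0).fit(X, tone)    # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, tone)    # L1 penalty: sets some coefficients to exactly zero

print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))
```

For classification tasks, the same idea appears as penalized logistic regression (in scikit-learn, LogisticRegression with penalty="l1" or "l2", where C is the inverse of \(\lambda\)).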

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

  • Step 3: use the classifier - some f(x) - to predict unseen documents.

  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

Evaluating the Performance

                     Predicted
              J           ¬J          Total
Actual   J    a (TP)      b (FN)      a + b
         ¬J   c (FP)      d (TN)      c + d
       Total  a + c       b + d       N
  • Accuracy: number correctly classified/total number of cases = (a+d)/(a+b+c+d)

  • Precision: number of TP / (number of TP + number of FP) = a/(a+c)

    • Fraction of the documents predicted to be J that were in fact J.
    • Think of it as a measure of the estimator
  • Recall: number of TP / (number of TP + number of FN) = a/(a+b)

    • Fraction of the documents that were in fact J that the method predicted to be J.
    • Think of it as a measure of the data
  • F1: 2 × (precision × recall) / (precision + recall)

    • Harmonic mean of precision and recall (see the worked example below).
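
A worked toy example of these metrics, assuming scikit-learn and treating J as the positive class; the actual and predicted label vectors are made up for illustration:

```python
# Compute the metrics above from toy actual vs. predicted labels (sketch).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

actual    = ["J", "J", "J", "notJ", "notJ", "notJ", "J", "notJ"]
predicted = ["J", "J", "notJ", "notJ", "J", "notJ", "J", "notJ"]

print(confusion_matrix(actual, predicted, labels=["J", "notJ"]))         # [[a, b], [c, d]]
print("accuracy :", accuracy_score(actual, predicted))                   # (a + d) / N
print("precision:", precision_score(actual, predicted, pos_label="J"))   # a / (a + c)
print("recall   :", recall_score(actual, predicted, pos_label="J"))      # a / (a + b)
print("F1       :", f1_score(actual, predicted, pos_label="J"))          # harmonic mean
```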

Barberá et al., 2020: Guide for Supervised Models with Text


Task: Tone of New York Times coverage of the economy. Discusses:

  • How to build a corpus
  • Unit of analysis
  • Documents or Coders?
  • ML or Dictionaries?

Coding!