PPOL 6801 - Text as Data - Computational Linguistics


Week 5: Supervised Learning:
Training your own classifiers

Professor: Tiago Ventura

Housekeeping

Today is the deadline for Problem Set 1.

  • Replications next week

    • Presentation (20 min each):
      • Introduction;
      • Methods;
      • Results;
      • Differences;
      • Autopsy of the replication;
      • Extensions
    • Repository (by Friday):
      • Github Repo
        • readme
        • your presentation pdf
        • code
        • 5-page report

Where are we?

We started from pre-processing text as data, representing text as numbers, and describing features of the text.

Last week, we started learning how to measure concepts in text:

Documents pertain to certain classes, and we can use statistical assumptions to measure these classes.

  • Dictionary Methods
    • Discuss some well-known dictionaries
  • Off-the-Shelf Classifiers
    • Perspective API
    • Hugging Face (for now, only as off-the-shelf models; LLMs come later in this course)

Remember…

  • Unsupervised Models: learning (hidden or latent) structure in unlabeled data.

    • Topic Models to cluster documents and words
  • Supervised Models: learning relationship between inputs and a labeled set of outputs.

    • Sentiment analysis, classifying whether a tweet contains misinformation, etc.

In TAD, we mostly use unsupervised techniques for discovery and supervised for measurement of concepts.

Today: cover the pipeline to train your own machine learning models to classify textual data.

Assuming:

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

    • some tweets are positive, some are neutral and some are negative
  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

    • choose a model (transformation function) that gives higher out-of-sample accuracy
  • Step 3: use the classifier - some f(x) - to predict unseen documents.

    • pick the model with the best out-of-sample performance
  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

    • This is where social science happens! (A code sketch of steps 1-3 follows this list.)
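
A minimal sketch of steps 1-3 in Python, assuming scikit-learn; the toy tweets, labels, and the choice of CountVectorizer + logistic regression are illustrative, not prescriptive:

```python
# Minimal supervised TAD pipeline (sketch): label -> train -> predict.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: a (toy) hand-labeled set of documents
labeled_texts = ["great news for the economy", "markets crash again",
                 "terrible jobs report", "strong growth this quarter"]
labels = ["positive", "negative", "negative", "positive"]

# Step 2: turn text into a document-feature matrix and train a classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labeled_texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Step 3: use the fitted f(x) to predict labels for unseen documents
new_texts = ["stocks rally after the jobs report"]
print(clf.predict(vectorizer.transform(new_texts)))

# Step 4 happens outside the model: combine the predictions with metadata
# (dates, parties, exogenous shocks) to answer a substantive question.
```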

Supervised Learning vs Dictionaries

Dictionary methods:

  • Advantage: not corpus-specific, cost to apply to a new corpus is trivial
  • Disadvantage: not corpus-specific, so performance on a new corpus is unknown (domain shift)

Supervised learning:

  • Generalization of dictionary methods
  • Features associated with each category (and their relative weights) are learned from the data
  • By construction, ML will outperform dictionary methods in classification tasks, as long as the training sample is large enough

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

  • Step 3: use the classifier - some f(x) - to predict unseen documents.

  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

Creating a labeled set

How to obtain a labeled dataset?

  • External Source of Annotation: someone else labelled the data for you

    • Federalist papers
    • Metadata from text
    • Manifestos from Parties with well-developed dictionaries
  • Expert annotation: put "experts" in quotation marks

    • mostly undergrads ~ whom you train to be experts
  • Crowd-sourced coding: digital labor markets

    • Wisdom of Crowds: the idea that large groups of non-expert people are collectively smarter than individual experts when it comes to problem-solving

Crowdsourcing as a research tool for ML




Crowdsourcing is now understood to mean using the Internet to distribute a large package of small tasks to a large number of anonymous workers, located around the world and offered small financial rewards per task. The method is widely used for data-processing tasks such as image classification, video annotation, data entry, optical character recognition, translation, recommendation, and proofreading

Benoit et al., 2016: Crowdsourcing Political Texts

  • Expert annotation is expensive.

  • Benoit, Conway, Lauderdale, Laver and Mikhaylov (2016) note that classification jobs could be given to a large number of relatively cheap online workers

  • Multiple workers ~ similar task ~ same stimuli ~ wisdom of crowds!

  • Representativeness of a broader population doesn't matter ~ this is not a population quantity, it is just a measurement task

  • Their task: Manifestos ~ sentences ~ workers:

    • social|economic
      • very left vs. very right
  • Reduce uncertainty by having more workers for each sentence
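
A minimal sketch of how multiple worker codings of the same sentence can be aggregated, assuming a toy long-format table; the column names and the majority-vote rule are illustrative (Benoit et al. use a more elaborate aggregation and scaling model):

```python
# Aggregate multiple crowd codings per sentence by majority vote (sketch).
# Assumes a long-format DataFrame with one row per (sentence, worker) coding;
# the column names `sentence_id` and `label` are illustrative.
import pandas as pd

codings = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 2],
    "label":       ["economic", "economic", "social",
                    "social", "social", "economic"],
})

# modal label per sentence; more workers per sentence -> less noisy aggregate
majority = (codings.groupby("sentence_id")["label"]
                   .agg(lambda s: s.mode().iloc[0]))
print(majority)
```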

Comparing Experts and online workers

How many workers?

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

  • Step 3: use the classifier - some f(x) - to predict unseen documents.

  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

General Thoughts

Once we have our training data, we need to pick a classifier. We face these challenges:

  • in text as data, often your DFM has Features > Documents

    • identification problems for statistical models
    • overfitting the data
  • Bias-Variance Trade-off

    • fit an overly complex (flexible) model ~ leads to higher variance
    • fit an overly simple (rigid) model ~ leads to more bias
  • Many models:

    • Naive Bayes
    • Regularized regression
    • SVM
    • k-nearest neighbors, tree-based methods, etc.
    • Ensemble methods + DL

Bias and Variance Tradeoff

Train-Validation-Test OR Cross Validation
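
A minimal sketch of the cross-validation idea, assuming scikit-learn and a toy corpus: each candidate model is scored on held-out folds rather than on the data it was fit to, and the model with the best held-out performance is kept.

```python
# Sketch: compare candidate classifiers by cross-validated accuracy
# instead of in-sample fit. The toy corpus and the two models are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

texts = ["great economy", "strong growth", "jobs are up", "markets rally", "good news",
         "markets crash", "weak growth", "jobs are down", "bad news", "recession fears"]
y = ["pos"] * 5 + ["neg"] * 5

X = CountVectorizer().fit_transform(texts)

models = {
    "naive_bayes": MultinomialNB(),
    "regularized_logit": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
}

# 5-fold cross-validation: train on 4 folds, score on the held-out fold
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 2))
```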

Many Models

But not so different…

Regularized OLS Regression

The simplest, yet highly effective, way to avoid overfitting and improve out-of-sample accuracy is to add a penalty term to the model's loss function:

OLS Loss Function:

\[ RSS = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 \]

OLS + Penalty:

\[ \text{RSS} = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} \beta_j^2 \rightarrow \text{ridge regression} \]

\[ \text{RSS} = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} |\beta_j| \rightarrow \text{lasso regression} \]
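
A minimal sketch of these two penalties applied to a document-feature matrix, assuming scikit-learn and a toy corpus with a continuous outcome; `alpha` plays the role of \(\lambda\) and would normally be chosen by cross-validation.

```python
# Ridge (L2) vs. Lasso (L1) penalties on a document-feature matrix (sketch).
# The toy corpus and the hypothetical tone scores are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge, Lasso

texts = ["strong growth and more jobs", "markets crash amid recession fears",
         "mixed signals on growth", "jobs report beats expectations"]
tone = np.array([0.8, -0.9, 0.1, 0.6])   # hypothetical labeled tone scores

X = CountVectorizer().fit_transform(texts)

ridge = Ridge(alpha=1.0).fit(X, tone)    # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, tone)    # L1 penalty: sets some coefficients to exactly zero

print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))
```

For classification tasks, the same idea appears as penalized logistic regression (in scikit-learn, LogisticRegression with penalty="l1" or "l2", where C is the inverse of \(\lambda\)).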

Supervised Learning Pipeline for TAD

  • Step 1: label some examples of the concept we want to measure

  • Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input

  • Step 3: use the classifier - some f(x) - to predict unseen documents.

  • Step 4: use the measure + metadata|exogenous shocks to learn something new about the world.

Evaluating the Performance

                     Predicted
              J           ¬J          Total
Actual   J    a (TP)      b (FN)      a + b
         ¬J   c (FP)      d (TN)      c + d
       Total  a + c       b + d       N
  • Accuracy: number correctly classified/total number of cases = (a+d)/(a+b+c+d)

  • Precision: number of TP / (number of TP + number of FP) = a/(a+c)

    • Fraction of the documents predicted to be J that were in fact J.
    • Think of it as a measure of the estimator
  • Recall: number of TP / (number of TP + number of FN) = a/(a+b)

    • Fraction of the documents that were in fact J that the method predicted to be J.
    • Think of it as a measure of the data
  • F1: 2 × (precision × recall) / (precision + recall)

    • Harmonic mean of precision and recall (see the worked example below).
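
A worked toy example of these metrics, assuming scikit-learn and treating J as the positive class; the actual and predicted label vectors are made up for illustration:

```python
# Compute the metrics above from toy actual vs. predicted labels (sketch).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

actual    = ["J", "J", "J", "notJ", "notJ", "notJ", "J", "notJ"]
predicted = ["J", "J", "notJ", "notJ", "J", "notJ", "J", "notJ"]

print(confusion_matrix(actual, predicted, labels=["J", "notJ"]))         # [[a, b], [c, d]]
print("accuracy :", accuracy_score(actual, predicted))                   # (a + d) / N
print("precision:", precision_score(actual, predicted, pos_label="J"))   # a / (a + c)
print("recall   :", recall_score(actual, predicted, pos_label="J"))      # a / (a + b)
print("F1       :", f1_score(actual, predicted, pos_label="J"))          # harmonic mean
```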

Barberá et al., 2020: Guide for Supervised Models with Text


Task: Tone of New York Times coverage of the economy. Discusses:

  • How to build a corpus
  • Unit of analysis
  • Documents or Coders?
  • ML or Dictionaries?

Coding!