Week 11: Text-As-Data II: Classification
Def: a type of machine learning algorithm that requires labeled input data to learn the mapping function from the input to the output
SUPERVISED: The true labels y act as a “supervisor” for the learning process
LEARNING: the process by which a machine/model improves its performance on a specific task
When doing supervised learning with text, we have three basic pieces of the puzzle:
Outcome: the class/label you want to predict.
Input: this is your text. Often as a Document Feature Matrix.
Model: A transformation function connecting your words to the outcome you want to predict.
In class, we will see three distinct types of approaches to do supervised learning with text.
Dictionary Models.
Classic Machine Learning with bag-of-words assumption.
Pre-Trained Deep Learning Models.
Plus: Prompting Large Language Models (Full class next week)
Use a set of pre-defined words that allows us to classify documents automatically, quickly, and accurately.
Instead of optimizing a transformation function using statistical assumptions and observed data, with dictionaries we have a pre-specified recipe for the transformation function.
A dictionary contains:
Weights given to each word ~ the same for all words, or some continuous variation.
We have a set of key words with weights,
e.g. for sentiment analysis: horrible is scored as \(-1\) and beautiful as \(+1\)
the relative rate of occurrence of these terms tells us about the overall tone or category that the document should be placed in.
For document \(i\) and words \(m=1,\ldots, M\) in the dictionary,
\[\text{tone of document $i$}= \sum^M_{m=1} \frac{s_m w_{im}}{N_i}\]
Where:
\(s_m\) is the score (weight) of dictionary word \(m\),
\(w_{im}\) is the number of times word \(m\) appears in document \(i\),
\(N_i\) is the total number of words in document \(i\).
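The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a full dictionary method: the word weights and the toy document are invented for the example.

```python
def dictionary_tone(tokens, weights):
    """Sum of dictionary scores s_m over a document, normalized by its length N_i."""
    n = len(tokens)  # N_i: total number of words in the document
    # Words not in the dictionary contribute a score of 0
    return sum(weights.get(tok, 0) for tok in tokens) / n

# s_m for each dictionary word, as in the sentiment example above
weights = {"horrible": -1, "beautiful": +1}

doc = "what a beautiful day not horrible at all".split()
score = dictionary_tone(doc, weights)  # (+1 - 1) / 8 = 0.0
```

Note that negation ("not horrible") is invisible to a plain dictionary count: each keyword contributes its fixed weight regardless of context.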
Pipeline:
Step 1: label some examples of the concept we want to measure (output)
Step 2: convert your data to a document-feature matrix (input)
Step 3: train a statistical model on this set of labeled data using the document-feature matrix.
Step 4: use the classifier ~ some f(x) ~ to predict labels for unseen documents
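The four steps above can be sketched with scikit-learn. The texts and labels below are toy examples invented for illustration; a real application would use many more labeled documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: labeled examples of the concept we want to measure (output)
texts = ["great wonderful movie", "terrible awful film",
         "wonderful acting", "awful plot terrible pacing"]
labels = ["pos", "neg", "pos", "neg"]

# Step 2: convert the texts to a document-feature matrix (input)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Step 3: train a statistical model on the labeled data
clf = MultinomialNB().fit(X, labels)

# Step 4: use the classifier to predict labels for unseen documents
new_doc = vectorizer.transform(["a wonderful film"])
prediction = clf.predict(new_doc)
```

Naive Bayes is just one choice of f(x); any classifier that accepts a document-feature matrix (logistic regression, SVM, random forest, …) slots into Step 3.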
External Source of Annotation: Someone else labelled the data for you
Expert annotation:
Crowd-sourced coding: digital labor markets
| | | Predicted | | |
|---|---|---|---|---|
| | | J | ¬J | Total |
| Actual | J | a (True Positive) | b (False Negative) | a+b |
| | ¬J | c (False Positive) | d (True Negative) | c+d |
| | Total | a+c | b+d | N |
Accuracy: number correctly classified/total number of cases = (a+d)/(a+b+c+d)
Precision: number of TP / (number of TP + number of FP) = a/(a+c)
Recall: number of TP / (number of TP + number of FN) = a/(a+b)
F1: 2 × (precision × recall) / (precision + recall)
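The metrics follow directly from the four cells of the confusion matrix. The counts below are made up for illustration.

```python
# Cells of the confusion matrix (invented counts):
# a = true positives, b = false negatives, c = false positives, d = true negatives
a, b, c, d = 40, 10, 5, 45

accuracy  = (a + d) / (a + b + c + d)   # share of all cases classified correctly
precision = a / (a + c)                 # of the predicted J's, how many are truly J
recall    = a / (a + b)                 # of the actual J's, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
```

Precision and recall typically trade off against each other, which is why the F1 score (their harmonic mean) is often reported as a single summary.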
In recent years, text-as-data/NLP tasks have been dominated by the use of deep learning models. Let’s try to understand what these models are.
Components: These deep learning models have two major components
Definition: Deep Learning Models are designed for general-purpose classification tasks or just simple next-word prediction tasks (language modelling)
Key Features:
In the vector space model, we learned:
A document \(D_i\) is represented as a collection of features \(W\) (words, tokens, n-grams..)
Each feature \(w_i\) can be placed on the real line; a document \(D_i\) is then a point in a \(W\)-dimensional space.
Embedded in this model is the idea that we represent words as one-hot encodings.
One-hot encoding / Sparse Representation:
cat = \(\begin{bmatrix} 0,0, 0, 0, 0, 0, 1, 0, 0 \end{bmatrix}\)
dog = \(\begin{bmatrix} 0,0, 0, 0, 0, 1, 0, 0, 0 \end{bmatrix}\)
Word Embedding / Dense Representation:
cat = \(\begin{bmatrix} 0.25, -0.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45 \end{bmatrix}\)
dog = \(\begin{bmatrix} 0.25, 1.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45 \end{bmatrix}\)
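The difference between the two representations is easy to see with cosine similarity. One-hot vectors for different words are always orthogonal (similarity 0), so they encode no notion of relatedness; dense vectors can place related words close together. The dense numbers below are the illustrative values from above, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Sparse one-hot representations: different words never overlap
cat_onehot = [0, 0, 0, 0, 0, 0, 1, 0, 0]
dog_onehot = [0, 0, 0, 0, 0, 1, 0, 0, 0]
cosine(cat_onehot, dog_onehot)  # 0.0: orthogonal, no shared meaning

# Dense representations: similarity reflects shared dimensions
cat_dense = [0.25, -0.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45]
dog_dense = [0.25, 1.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45]
cosine(cat_dense, dog_dense)  # positive: the vectors agree on most dimensions
```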
Dense representations are behind all recent advances in NLP, including ChatGPT
Source: CS224N
The transformer architecture was introduced only in 2017! This paper (“Attention Is All You Need”) revolutionized many natural language processing tasks, particularly machine translation and language modelling. All of today’s famous LLMs are based on the transformer architecture
If you have enough computing power and experience working with these models, you can train your own LLM. It takes a LOT of data, time, and money. Only big tech companies can actually do this
So, you will very likely:
use pre-trained models available on the web (e.g., Google Perspective)
use pre-trained transformers, fine-tune the model (retrain with new data), and improve its performance
Or outsource tasks to generative models through prompting via APIs.
Hugging Face’s Model Hub: centralized repository for sharing and discovering pre-trained models [https://huggingface.co]
Data science I: Foundations