Week 11: Text-As-Data II: Classification
2023-11-26
Introduction to Text Analysis/Natural Language Processing
When doing supervised learning with text, we have three basic pieces of the puzzle:
Outcome: the labels or categories you want to assign to your documents.
Input: this is your text. Often this will come from converting your text to a numerical representation. As we learned last week, most times this numerical representation will be a document-feature matrix.
Model: a transformation function connecting your words with the outcome you want to predict.
In class, we will see three distinct approaches to supervised learning with text.
Dictionary Models.
Classic Machine Learning with bag-of-words assumption.
Pre-Trained Deep Learning Models.
Use a set of pre-defined words that allow us to classify documents automatically, quickly and accurately.
Instead of optimizing a transformation function using statistical assumptions and observed training data, with dictionaries we start from a pre-assumed recipe for the transformation function.
A dictionary contains:
Weights given to each word ~ either the same for all words or some continuous variation.
We have a set of key words with weights,
e.g. for sentiment analysis: horrible is scored as \(-1\) and beautiful as \(+1\)
the relative rate of occurrence of these terms tells us about the overall tone or category that the document should be placed in.
For document \(i\) and words \(m=1,\ldots, M\) in the dictionary,
\[\text{tone of document $i$}= \sum^M_{m=1} \frac{s_m w_{im}}{N_i}\]
Where: \(s_m\) is the dictionary score (weight) of word \(m\), \(w_{im}\) is the number of times word \(m\) occurs in document \(i\), and \(N_i\) is the total number of words in document \(i\).
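To make the formula concrete, here is a minimal Python sketch of dictionary scoring; the word list, weights, and example sentence are made up for illustration rather than taken from an actual dictionary.

```python
# Minimal sketch of dictionary-based tone scoring.
# The key words and weights below are hypothetical, not from a real dictionary.
sentiment_dict = {"horrible": -1, "awful": -1, "beautiful": +1, "great": +1}

def tone(document: str, dictionary: dict) -> float:
    """Tone of document i = sum over dictionary words of s_m * w_im, divided by N_i."""
    tokens = document.lower().split()                      # naive tokenization
    n_i = len(tokens)                                      # N_i: total words in the document
    score = sum(dictionary.get(tok, 0) for tok in tokens)  # adds s_m once per occurrence (s_m * w_im)
    return score / n_i if n_i > 0 else 0.0

print(tone("The food was great but the service was horrible", sentiment_dict))
# -> 0.0 (one positive and one negative key word out of nine tokens)
```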
Pipeline:
Step 1: label some examples of the concept we want to measure (output)
Step 2: convert your data to a document-feature matrix (input)
Step 3: train a statistical model on this set of labeled data using the document-feature matrix
Step 4: use the classifier - some f(x) - to label unseen documents (illustrated in the sketch below)
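As a rough illustration of these four steps, here is a minimal sketch using scikit-learn's CountVectorizer and a Naive Bayes classifier; the toy documents and labels are hypothetical.

```python
# Minimal sketch of the four-step pipeline with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1: a few labeled examples of the concept we want to measure (toy data)
docs = ["what a beautiful day", "this is horrible", "truly great work", "awful experience"]
labels = ["positive", "negative", "positive", "negative"]

# Steps 2-3: convert the text to a document-feature matrix and train a classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

# Step 4: use the fitted classifier f(x) to label unseen documents
print(clf.predict(["a great and beautiful result"]))
```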
| | Predicted J | Predicted ¬J | Total |
|---|---|---|---|
| Actual J | a (TP) | b (FN) | a+b |
| Actual ¬J | c (FP) | d (TN) | c+d |
| Total | a+c | b+d | N |
Accuracy: number correctly classified / total number of cases = (a+d)/(a+b+c+d)
Precision: number of TP / (number of TP + number of FP) = a/(a+c)
Recall: number of TP / (number of TP + number of FN) = a/(a+b)
F1: 2 × (precision × recall) / (precision + recall)
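A small sketch of these metrics computed directly from confusion-matrix counts; the values of a, b, c, d below are placeholder counts chosen for illustration.

```python
# Compute accuracy, precision, recall, and F1 from confusion-matrix cells.
a, b, c, d = 40, 10, 5, 45   # TP, FN, FP, TN (illustrative counts)

accuracy  = (a + d) / (a + b + c + d)
precision = a / (a + c)
recall    = a / (a + b)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
# -> 0.85, ~0.889, 0.8, ~0.842
```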
In recent years, text-as-data/NLP tasks have been dominated by deep learning models. Let’s try to understand a bit about what these models are.
Components: These text-based deep learning models have two major components
Definition: Deep Learning Models are designed for general-purpose classification tasks or just simple next-word prediction tasks
Key Features:
In the vector space model, we learned:
A document \(D_i\) is represented as a collection of features \(W\) (words, tokens, n-grams..)
Each feature \(w_i\) can be placed on the real line; then a document \(D_i\) is a point in a \(W\)-dimensional space.
Embedded in this model is the idea that we represent words as one-hot encodings.
One-hot encoding / Sparse Representation:
cat = \(\begin{bmatrix} 0,0, 0, 0, 0, 0, 1, 0, 0 \end{bmatrix}\)
dog = \(\begin{bmatrix} 0,0, 0, 0, 0, 1, 0, 0, 0 \end{bmatrix}\)
Word Embedding / Dense Representation:
cat = \(\begin{bmatrix} 0.25, -0.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45 \end{bmatrix}\)
dog = \(\begin{bmatrix} 0.25, 1.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45 \end{bmatrix}\)
Dense representations are behind all recent advancements in NLP, including ChatGPT.
Source: CS224N
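To make the contrast concrete, here is a minimal numpy sketch comparing the two representations with cosine similarity; the dense vectors are the illustrative ones above, not the output of a real embedding model.

```python
# Sparse (one-hot) vs. dense representations compared with cosine similarity.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot: every word is orthogonal to every other word
cat_onehot = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0])
dog_onehot = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0])
print(cosine(cat_onehot, dog_onehot))   # 0.0 -> one-hot vectors carry no notion of similarity

# Dense embeddings: similarity is graded and can capture relatedness
cat_dense = np.array([0.25, -0.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45])
dog_dense = np.array([0.25,  1.75, 0.90, 0.12, -0.50, 0.33, 0.66, -0.88, 0.10, -0.45])
print(cosine(cat_dense, dog_dense))     # positive, graded similarity (unlike the one-hot case)
```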
If you have enough computing power and experience working with these models, you can train your own LLMs. It takes a LOT of data, time, and money.
Most people will:
use pre-trained models (transfer learning) available on the web (illustrated in the sketch below).
use pre-trained word embeddings, fine-tune the model (retrain it with new data), and improve its performance.
outsource tasks to generative models through prompting via APIs.
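As an example of the first option, here is a minimal sketch that loads a pre-trained classifier off the shelf with the Hugging Face transformers library; the input sentence is made up, and in practice you would pick the underlying model explicitly rather than relying on the library default.

```python
# Minimal sketch of using a pre-trained model off the shelf (transfer learning).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default pre-trained model
print(classifier("The lecture on word embeddings was great!"))
# -> [{'label': 'POSITIVE', 'score': ...}]
```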
Data science I: Foundations