Week 11: Text-As-Data I: Description and Topics
2024-11-19
Scalability and Dimensionality Reduction
Humans are great at understanding the meaning of a sentence or a document in depth
Computers are better at detecting patterns and at classifying and describing content across millions of documents
Computational text analysis augments humans – it does not replace us
Statistical models + powerful computers allow us to process data at scale and understand common patterns
We know that “all models are wrong” – this is particularly true with text!
Acquire textual data:
Preprocess the data:
Apply methods appropriate to the research goal:
When working with text, we face two critical challenges:
Reduce the complexity of text (think of every word as a variable – that is a huge matrix!)
Convert unstructured text to numbers that we can feed into a statistical model
Tokenize: break larger chunks of text into words (unigrams) or n-grams
Lowercase: convert everything to lower case
Remove noise: stopwords, numbers, punctuation, function words
Stemming: chop off the ends of words
Lemmatization: do things properly, using a vocabulary and morphological analysis of words
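A minimal sketch of these preprocessing steps in Python (the use of nltk for stopwords and stemming is an illustrative assumption; any equivalent tooling works):

```python
# Illustrative preprocessing pipeline: tokenize, lowercase,
# remove noise (stopwords, numbers, punctuation), and stem.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time resource download

text = "We use a new dataset containing nearly 50 million historical newspaper pages."

tokens = re.findall(r"[a-z']+", text.lower())   # tokenize + lowercase, drop numbers/punctuation
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]  # remove stopwords
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]      # stemming (chop word endings)
print(tokens)
```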
Text
We use a new dataset containing nearly 50 million historical newspaper pages from 2,700 local US newspapers over the years 1877–1977. We define and discuss a measure of power we develop based on observed word frequencies, and we validate it through a series of analyses. Overall, we find that the relative coverage of political actors and of political offices is a strong indicator of political power for the cases we study.
After pre-processing
use new dataset contain near 50 million historical newspaper pag 2700 local u newspaper year 18771977 define discus measure power develop bas observ word frequenc validate ser analys overall find relat coverage political actor political offic strong indicator political power cas study
Represent text as an unordered set of tokens in a document: the bag-of-words assumption
While order is ignored, frequency matters!
To represent documents as numbers, we will use the vector space model representation:
A document \(D_i\) is represented as a collection of features \(W\) (words, tokens, n-grams...)
Each feature \(w_j\) can be placed on the real line; a document \(D_i\) is then a vector with \(W\) dimensions
Imagine the sentence below: “If that is a joke, I love it. If not, can’t wait to unpack that with you later.”
Sorted Vocabulary = (a, can’t, i, if, is, it, joke, later, love, not, that, to, unpack, wait, with, you)
Feature Representation = (1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1)
Features will typically be the n-gram (mostly unigram) frequencies of the tokens in the document, or some function of those frequencies
Each document becomes a numerical vector (vector space model)
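A small standard-library sketch that reproduces the sorted vocabulary and feature representation above:

```python
# Bag-of-words sketch: order is discarded, frequency is kept.
import re
from collections import Counter

sentence = ("If that is a joke, I love it. "
            "If not, can't wait to unpack that with you later.")

tokens = re.findall(r"[a-z']+", sentence.lower())  # tokenize + lowercase
counts = Counter(tokens)
vocabulary = sorted(counts)                        # sorted feature set
vector = [counts[w] for w in vocabulary]           # document-term vector

print(vocabulary)
# ['a', "can't", 'i', 'if', 'is', 'it', 'joke', 'later', 'love', 'not',
#  'that', 'to', 'unpack', 'wait', 'with', 'you']
print(vector)
# [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1]
```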
Documents
Document 1 = “yes yes yes no no no”
Document 2 = “yes yes yes yes yes yes”
Using the vector space, we can use notions of geometry to build well-defined comparison/similarity measures between the documents.
The ordinary, straight-line distance between two points in space, using document vectors \(y_a\) and \(y_b\) with \(J\) dimensions:
Euclidean Distance
\[ ||y_a - y_b|| = \sqrt{\sum_{j=1}^{J}(y_{aj} - y_{bj})^2} \]
\(y_a\) = [0, 2.51, 3.6, 0] and \(y_b\) = [0, 2.3, 3.1, 9.2]
\(\sum_{j=1}^{J} (y_{aj} - y_{bj})^2\) = \((0-0)^2 + (2.51-2.3)^2 + (3.6-3.1)^2 + (0-9.2)^2\) = \(84.9341\)
\(\sqrt{\sum_{j=1}^{J} (y_{aj} - y_{bj})^2}\) = \(\sqrt{84.9341}\) ≈ 9.22
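The same calculation as a short numpy sketch (numpy assumed to be available):

```python
# Euclidean distance between the two example document vectors.
import numpy as np

y_a = np.array([0, 2.51, 3.6, 0])
y_b = np.array([0, 2.3, 3.1, 9.2])

squared_diffs = (y_a - y_b) ** 2         # [0., 0.0441, 0.25, 84.64]
print(squared_diffs.sum())               # 84.9341
print(np.sqrt(squared_diffs.sum()))      # ~9.22
print(np.linalg.norm(y_a - y_b))         # same result with the built-in norm
```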
Documents, W = 2 {yes, no}
Document 1 = “yes yes yes no no no” (3, 3)
Document 2 = “yes yes yes yes yes yes” (6,0)
Document 3 = “yes yes yes no no no yes yes yes no no no yes yes yes no no no yes yes yes no no no” (12, 12)
Euclidean distance rewards magnitude, rather than direction
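A quick numpy check of this point with the three toy documents: Document 1 and Document 3 use the words in exactly the same proportions, yet Document 1 ends up much closer to Document 2:

```python
# Euclidean distance rewards magnitude rather than direction.
import numpy as np

doc1 = np.array([3, 3])     # "yes yes yes no no no"
doc2 = np.array([6, 0])     # "yes yes yes yes yes yes"
doc3 = np.array([12, 12])   # Document 1 repeated four times

print(np.linalg.norm(doc1 - doc2))   # ~4.24
print(np.linalg.norm(doc1 - doc3))   # ~12.73  (farther, despite same proportions)
```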
\[ \text{cosine similarity}(\mathbf{y_a}, \mathbf{y_b}) = \frac{\mathbf{y_a} \cdot \mathbf{y_b}}{\|\mathbf{y_a}\| \|\mathbf{y_b}\|} \]
Unpacking the formula:
\(\mathbf{y_a} \cdot \mathbf{y_b}\) ~ dot product between vectors
\(||\mathbf{y_a}||\) ~ vector magnitude, length ~ \(\sqrt{\sum{y_{aj}^2}}\)
normalizes similarity by document length ~ independent of document length because it deals only with the angle of the vectors
cosine similarity captures some notion of relative direction (e.g. style or topics in the document)
The cosine function has a range between -1 and 1; with non-negative word counts, cosine similarity falls between 0 and 1.
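A sketch of cosine similarity on the same toy documents (numpy assumed); because only the angle matters, Document 1 and Document 3 are now identical:

```python
# Cosine similarity: normalizes away document length, keeps direction.
import numpy as np

def cosine_similarity(y_a, y_b):
    return y_a @ y_b / (np.linalg.norm(y_a) * np.linalg.norm(y_b))

doc1 = np.array([3, 3])
doc2 = np.array([6, 0])
doc3 = np.array([12, 12])

print(cosine_similarity(doc1, doc3))   # 1.0  (same direction)
print(cosine_similarity(doc1, doc2))   # ~0.71
```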
Topic models are unsupervised statistical models whose purpose is to find hidden structure in a corpus, across many different documents.
Capture words that are more likely to occur together across a set of documents.
Assign these words a probability of being part of a cluster (topic).
Assign documents a probability of being associated with these clusters.
Documents: emerge from a probability distribution over topics
Topics: emerge from probability distributions over words
Step 1: For each document, draw a distribution over topics \(\theta_d \sim \text{Dirichlet}(\alpha)\)
Step 2: Then, for every word position in the document, draw a topic assignment \(z_{d,n} \sim \text{Multinomial}(\theta_d)\) and then draw the word \(w_{d,n} \sim \text{Multinomial}(\beta_{z_{d,n}})\)
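A toy simulation of this generative story (numpy assumed; the vocabulary, number of topics, and hyperparameters are invented for illustration):

```python
# Simulate one document from the LDA generative model.
import numpy as np

rng = np.random.default_rng(42)

vocab = ["election", "vote", "court", "judge", "budget", "tax"]
K, N = 2, 10                                   # topics, words per document
beta = rng.dirichlet(np.ones(len(vocab)), K)   # topic-word distributions
alpha = np.full(K, 0.5)                        # Dirichlet prior over topics

theta_d = rng.dirichlet(alpha)                 # Step 1: topic shares for the document
words = []
for _ in range(N):                             # Step 2: for every word position
    z = rng.choice(K, p=theta_d)               #   draw a topic assignment
    w = rng.choice(vocab, p=beta[z])           #   draw a word from that topic
    words.append(w)

print(theta_d.round(2), words)
```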
Using the observed data, the words, we can estimate latent parameters. We start with the joint distribution implied by our language model (Blei, 2012):
\[ p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K}p(\beta_i)\prod_{d=1}^{D}p(\theta_d)\left(\prod_{n=1}^{N}p(z_{d,n}|\theta_d)\,p(w_{d,n}|\beta_{1:K},z_{d,n})\right) \]
To get to the conditional:
\[ p(\beta_{1:K}, \theta_{1:D}, z_{1:D}|w_{1:D})=\frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} \]
The denominator is hard to compute exactly (it requires summing over every possible topic assignment for every word), so in practice we rely on approximate inference (e.g., Gibbs sampling or variational methods).
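A minimal sketch of fitting LDA with approximate (variational) inference, using scikit-learn's LatentDirichletAllocation; the toy corpus, the number of topics, and the choice of scikit-learn are illustrative assumptions:

```python
# Fit a 2-topic LDA and inspect the estimated topic-word (beta)
# and document-topic (theta) distributions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the senate passed the budget and tax bill",
    "voters cast ballots in the presidential election",
    "the court ruled on the election lawsuit",
    "the governor proposed a new tax on fuel",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)                # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(dtm)                      # document-topic shares
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(beta):
    top_words = [terms[i] for i in topic.argsort()[::-1][:4]]
    print(f"Topic {k}: {top_words}")
print(theta.round(2))                               # one row per document
```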
Structural topic model: allows (1) topic prevalence and (2) topic content to vary as a function of document-level covariates (e.g., how do topics vary over time? do documents produced in 1990 talk about something differently than documents produced in 2020?); implemented in stm in R (Roberts, Stewart, Tingley, Benoit)
Correlated topic model: a way to explore between-topic relationships (Blei and Lafferty, 2007); implemented in topicmodels in R; possibly somewhere in Python as well!
Keyword-assisted topic model: seed topic model with keywords to try to increase the face validity of topics to what you’re trying to measure; implemented in keyATM in R (Eshima, Imai, Sasaki, 2019)
BERTopic: a topic modeling technique that leverages transformer embeddings and class-based TF-IDF (c-TF-IDF) to create dense, easily interpretable clusters of documents.