Week 2: From Text to Matrices: Representing Text as Data
From Gentzkow et al. (2017):
A sample of documents, each \(n_L\) words long, drawn from a vocabulary of \(n_V\) words.
The unique representation of each document has dimension \(n_V^{n_L}\).
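To get a sense of this scale, a quick back-of-the-envelope computation with hypothetical sizes:

```python
# Number of distinct documents of length n_L drawn from a vocabulary of n_V words.
# Hypothetical sizes: a 1,000-word vocabulary and 100-word documents.
n_V = 1_000
n_L = 100

n_unique = n_V ** n_L
print(f"10^{len(str(n_unique)) - 1} possible documents")  # 10^300
```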
In most social science applications of text as data, we are trying to make an inference about a latent variable.
Traditional social science: mapping between observed and latent/theoretical concepts is easier.
We observe/measure country macroeconomic variables, collect survey responses, see how politicians vote.
In text, we only observe the words. Much harder to identify the latent concepts.
Cover techniques to reduce the complexity of text data using a set of pre-processing steps ~ Challenge I
How to represent text as numbers using the vector space model ~ Challenge II
Starting next week we will deal more with inference and modeling latent parameters using text ~ Challenge III
A corpus is (typically) a large set of texts or documents which we wish to analyze.
When selecting a corpus, we should consider how the corpus relates to our research question in three respects:
Population of interest: does the corpus allow us to make inferences about this population?
Quantity of interest: can we measure what we plan to?
Sampling Bias: documents are often sampled from a larger population. Are there concerns about sample selection bias?
Most often we use documents simply because they were available to us (convenience data). In these cases, considering the three questions above is even more important.
RQ: Measure quality of comments on streaming chat platforms during political debates
Population of interest?
Quantity of interest?
Source of bias?
After selecting your documents and converting them to a computer-friendly format, we must decide on our unit of analysis.
Three things to consider in making this decision:
Features of your data and model fit
Your research question
Iteration: modeling is an iterative process, so this choice may be revisited
Language is extraordinarily complex, and involves great subtlety and nuanced interpretation.
Tokenization: what constitutes a feature?
Remove `superfluous' material: HTML tags, punctuation, numbers, and stop words; convert text to lower case
Map words to equivalence forms: stemming and lemmatization
Discard features that are less useful for the task at hand: functional words, highly frequent or rare words
Discard word order: Bag-of-Words Assumption
A first step in any text analysis task is to break documents into meaningful units of analysis (tokens)
Tokens are most often words. A simple tokenizer uses white space to split documents into tokens (see the sketch below).
Tokenizers may vary [across tasks]{.red}.
They may also vary across languages: in some languages, white space is not a good marker for splitting text into tokens.
Certain tokens, even in English, make more sense together than apart (“White House”, “United States”). These are collocations.
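A minimal sketch of a whitespace tokenizer in Python (real TAD packages ship far more careful rules; the regex here is illustrative):

```python
import re

def tokenize(text):
    """Lower-case, strip punctuation (keeping apostrophes), split on white space."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()

print(tokenize("The White House issued a statement."))
# ['the', 'white', 'house', 'issued', 'a', 'statement']
# Note how the collocation "White House" is split into two tokens.
```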
There are certain words that serve as linguistic connectors (`function words’) which we can remove.
They add noise to the document: discard them and focus on the signal, the meaningful words.
Most TAD packages have a pre-selected list of stop words. You can add more given your substantive knowledge (more about this later).
Stop words are usually unimportant for unsupervised and most supervised tasks, but they might matter for authorship detection.
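A minimal sketch of stop-word removal; the toy stop-word list below is an assumption for illustration, whereas TAD packages ship curated lists:

```python
# Toy stop-word list (illustrative only; packages like NLTK or quanteda ship curated lists).
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "that", "it", "to"}

tokens = ["the", "president", "is", "in", "the", "white", "house"]
content = [t for t in tokens if t not in STOPWORDS]
print(content)  # ['president', 'white', 'house']
```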
Different forms of a word (family, families, familial), or words that are similar in concept (bureaucratic, bureaucrat, bureaucratization), refer to the same basic token/concept.
We use algorithms to map these variations to an equivalent form (stemming, lemmatization):
All [TAD/NLP packages]{.red} offer easy implementations of these algorithms.
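For example, with NLTK's Porter stemmer (one of the standard rule-based stemming algorithms):

```python
from nltk.stem import PorterStemmer  # pip install nltk

# The Porter stemmer chops word endings by rule, mapping variants to one stem.
stemmer = PorterStemmer()
for word in ["family", "families", "familial"]:
    print(word, "->", stemmer.stem(word))
# All three map to the stem 'famili' (a stem need not be a dictionary word;
# lemmatization would instead map to the dictionary form 'family').
```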
Some other common steps, which are highly dependent on your contextual knowledge, are:
discard functional words: for example, when working with congressional speeches, remove representative, congress, session, etc.
remove highly frequent words: words that appear in all documents carry very little meaning for most supervised and unsupervised tasks ~ no clustering power, no discriminating power.
remove rare words: same logic as above, no signal. A common practice is to drop words that appear in less than 5% of documents (see the sketch below).
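A minimal sketch of frequency-based trimming using scikit-learn's CountVectorizer; the documents and thresholds are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "congress passed the budget bill",
    "congress debated the budget",
    "the mayor vetoed the housing bill",
]

# min_df=2 drops words appearing in fewer than 2 documents (rare words);
# max_df=0.9 drops words appearing in more than 90% of documents.
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['bill' 'budget' 'congress']
```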
Now that we have pre-processed our data, we simplify it even further:
Bag-of-Words Assumption: the order in which words appear does not matter.
Ignore order
But keep multiplicity: we still consider the frequency of words (illustrated in the sketch below)
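A minimal illustration: under the bag-of-words assumption, two documents with the same word counts are identical regardless of order:

```python
from collections import Counter

# Order is discarded, multiplicity is kept.
doc_a = "no yes no yes no yes".split()
doc_b = "yes yes yes no no no".split()

print(Counter(doc_a) == Counter(doc_b))  # True: same bag of words, different order
```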
How could this possibly work?
it might not: you need validation
central tendency in text: some words are enough for topic detection, classification, and measures of similarity and distance, for example
humans in the loop: expert knowledge helps you figure out subtle relationships between words and outcomes
we might retain some word order using n-grams
We can use [n-grams]{.red}, which are (usually contiguous) sequences of two (bigrams) or three (trigrams) tokens.
This makes computations considerably more complex. We can pick some n-grams to keep but not all:
\(\text{PMI}_{a,b} = \log \frac{p_{a,b}}{p_a \cdot p_b}\)
Pointwise mutual information (PMI) scores how much more often tokens \(a\) and \(b\) co-occur than we would expect if they were independent; high-PMI n-grams are worth keeping.
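A minimal sketch of PMI for bigrams over a toy token stream (probabilities are simple maximum-likelihood estimates; the sentence is made up):

```python
import math
from collections import Counter

tokens = ("white house officials said the white house "
          "will respond to the house bill").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = len(tokens), len(tokens) - 1

def pmi(a, b):
    """PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with MLE probabilities."""
    return math.log((bigrams[(a, b)] / n_bi) /
                    ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

print(pmi("white", "house"))  # ~1.55: a collocation worth keeping
print(pmi("the", "house"))    # ~0.85: co-occurs closer to chance
```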
Text
We use a new dataset containing nearly 50 million historical newspaper pages from 2,700 local US newspapers over the years 1877–1977. We define and discuss a measure of power we develop based on observed word frequencies, and we validate it through a series of analyses. Overall, we find that the relative coverage of political actors and of political offices is a strong indicator of political power for the cases we study
After pre-processing
use new dataset contain near 50 million historical newspaper pag 2700 local u newspaper year 18771977 define discus measure power develop bas observ word frequenc validate ser analys overall find relat coverage political actor political offic strong indicator political power cas study
Starting point: there is no rigorous way to compare results across different pre-processing steps; we adapt recommendations from supervised learning tasks.
Unsupervised vs Supervised Learning?
What is their solution? (no math needed!)
To represent documents as numbers, we will use the vector space model representation:
A document \(D_i\) is represented as a collection of features \(W\) (words, tokens, n-grams..)
Each feature \(w_i\) can be placed on a real line; a document \(D_i\) is then a point in a \(W\)-dimensional space
Imagine the sentence below: “If that is a joke, I love it. If not, can’t wait to unpack that with you later.”
Sorted Vocabulary = (a, can’t, i, if, is, it, joke, later, love, not, that, to, unpack, wait, with, you)
Feature Representation = (1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1)
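A minimal sketch reproducing the vector above (the sentence is lower-cased and punctuation is stripped by hand):

```python
from collections import Counter

tokens = ("if that is a joke i love it "
          "if not can't wait to unpack that with you later").split()

vocab = sorted(set(tokens))          # the sorted vocabulary
counts = Counter(tokens)
vector = [counts[w] for w in vocab]  # the document's feature representation

print(vocab)
print(vector)  # [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1]
```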
Features will typically be the n-gram (mostly unigram) frequencies of the tokens in the document, or some function of those frequencies
Each document is now a vector (vector space model)
Documents
Document 1 = “yes yes yes no no no”
Document 2 = “yes yes yes yes yes yes”
In the vector space, we can use geometry to build well-defined comparison measures between the documents (more about this next week)
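As a preview, a minimal sketch of one such geometric measure, cosine similarity, applied to the two documents above:

```python
import math

# Vectors over the vocabulary (no, yes).
d1 = [3, 3]  # "yes yes yes no no no"
d2 = [0, 6]  # "yes yes yes yes yes yes"

def cosine(u, v):
    """Cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(d1, d2))  # ~0.71
```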
Purely descriptive: a simple measure built just by counting words.
Theoretically driven: a measure that captures a theoretically relevant concept.
\[ \small \text{Coverage of Mayor}_{it} = \frac{\text{Mayor}_{it}}{\text{Mayor}_{it} + \text{City Manager}_{it} + \text{City Council}_{it}} \]
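With hypothetical mention counts, the measure is a simple share (the numbers below are made up):

```python
# Hypothetical word counts for newspaper i in year t (illustrative only).
mayor, city_manager, city_council = 120, 40, 40

coverage_of_mayor = mayor / (mayor + city_manager + city_council)
print(coverage_of_mayor)  # 0.6: the mayor receives 60% of office mentions
```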