Week 3: Descriptive Inference - Comparing Documents
Thanks for selecting the articles for the replication exercise! Next steps:
Get access to the data and code ASAP
If the data is too big for your laptop, use a sample of the data.
If the paper includes much more than the text analysis, ignore the rest and focus only on the TAD component.
Any questions?
Three primary challenges dealing with text as data:
Challenge I: Text is high dimensional
Challenge II: Text is an unstructured data source
Challenge III: Outcomes live in the latent space
Last week:
Pre-processing text + bag of words ~> greatly reduces text complexity (dimensions)
Text representation using vectors of numbers ~> document feature matrix (text to numbers)
We will start thinking about latent outcomes. Our first approach will focus on descriptive inference about documents:
Comparing documents
Using similarity to measure text-reuse
Evaluating complexity in text
Weighting (TF-IDF)
To represent documents as numbers, we will use the vector space model representation:
A document \(D_i\) is represented as a collection of features \(W\) (words, tokens, n-grams, etc…)
Each feature \(w_i\) can be placed on a real line
A document \(D_i\) is a point in \(\mathbb{R}^W\)
Documents, W=2
Document 1 = “yes yes yes no no no”
Document 2 = “yes yes yes yes yes yes”
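A minimal sketch in Python (names and structure are illustrative) of how these toy documents become count vectors over the feature set {yes, no}:

```python
from collections import Counter

docs = {
    "doc1": "yes yes yes no no no",
    "doc2": "yes yes yes yes yes yes",
}

features = ["yes", "no"]  # W = 2

# Build the document-feature matrix: one count vector per document
dfm = {name: [Counter(text.split())[w] for w in features]
       for name, text in docs.items()}

print(dfm)  # {'doc1': [3, 3], 'doc2': [6, 0]}
```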
Using the vector space, we can use notions of geometry to build well-defined comparison/similarity measures between the documents.
The ordinary, straight-line distance between two points in space. Using document vectors \(y_a\) and \(y_b\) with \(J\) dimensions:
Euclidean Distance
\[ \|y_a - y_b\| = \sqrt{\sum_{j=1}^{J}(y_{aj} - y_{bj})^2} \]
\(y_a\) = [0, 2.51, 3.6, 0] and \(y_b\) = [0, 2.3, 3.1, 9.2]
\(\sum_{j=1}^{J} (y_{aj} - y_{bj})^2\) = \((0-0)^2 + (2.51-2.3)^2 + (3.6-3.1)^2 + (0-9.2)^2\) = \(84.9341\)
\(\sqrt{\sum_{j=1}^{J} (y_{aj} - y_{bj})^2} = \sqrt{84.9341} \approx 9.22\)
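A quick sketch to verify this arithmetic (Python/numpy; the vectors are the ones above):

```python
import numpy as np

y_a = np.array([0, 2.51, 3.6, 0])
y_b = np.array([0, 2.3, 3.1, 9.2])

# Euclidean distance: square root of the summed squared differences
dist = np.sqrt(np.sum((y_a - y_b) ** 2))
print(round(dist, 2))  # 9.22

# Equivalent shortcut: np.linalg.norm(y_a - y_b)
```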
Documents, W=2: {yes, no}
Document 1 = “yes yes yes no no no” (3, 3)
Document 2 = “yes yes yes yes yes yes” (6,0)
Document 3= “yes yes yes no no no yes yes yes no no no yes yes yes no no no yes yes yes no no no” (12, 12)
Euclidean distance rewards magnitude rather than direction: Document 3 is just Document 1 repeated four times, yet its Euclidean distance from Document 1 (\(\approx 12.73\)) is larger than the distance between Documents 1 and 2 (\(\approx 4.24\)).
\[ \text{cosine similarity}(\mathbf{y_a}, \mathbf{y_b}) = \frac{\mathbf{y_a} \cdot \mathbf{y_b}}{\|\mathbf{y_a}\| \|\mathbf{y_b}\|} \]
Unpacking the formula:
\(\mathbf{y_a} \cdot \mathbf{y_b}\) ~ dot product between vectors
\(||\mathbf{y_a}||\) ~ vector magnitude, length ~ \(\sqrt{\sum_j y_{aj}^2}\)
normalizing by the documents' lengths makes the measure independent of document length, because it depends only on the angle between the vectors
cosine similarity captures some notion of relative direction (e.g. style or topics in the document)
The cosine function ranges from -1 to 1. Since document vectors contain only non-negative counts, cosine similarity between documents is constrained to the interval [0, 1].
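A short sketch (Python/numpy; document vectors taken from the {yes, no} example above) contrasting Euclidean distance with cosine similarity:

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the vector magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([3, 3])    # "yes yes yes no no no"
d2 = np.array([6, 0])    # "yes yes yes yes yes yes"
d3 = np.array([12, 12])  # document 1 repeated four times

print(round(np.linalg.norm(d1 - d3), 2))  # 12.73: Euclidean says d1 and d3 are far apart
print(round(np.linalg.norm(d1 - d2), 2))  # 4.24
print(round(cosine(d1, d3), 2))           # 1.0: same direction, identical relative word use
print(round(cosine(d1, d2), 2))           # 0.71
```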
There are a large number of distance/similarity metrics out there, just to name a few:
Jaccard Similarity: overlap between documents
Manhattan Distance: absolute distance between documents
Canberra Distance: Weighted version of Manhattan Distance
Minkowski Distance: generalized version of the Euclidean distance
No single best measure, depends on your research question.
But some recent research shows that a document-feature matrix (DFM) combined with cosine similarity tracks human perceptions of document similarity well.
Length refers to the size in terms of: characters, words, lines, sentences, paragraphs, pages, sections, chapters, etc.
Tokens are generally words ~ a useful semantic unit for processing
Types are unique tokens.
Typically \(N_{tokens} \gg N_{types}\)
Type-to-Token ratio
\[ TTR = \frac{\text{total types}}{\text{total tokens}} \]
So… authors with limited vocabularies will have low lexical diversity
Guiraud's index: \(\frac{\text{total types}}{\sqrt{\text{total tokens}}}\)
Summer's index (S): \(\frac{\log(\text{total types})}{\log(\text{total tokens})}\)
MATTR: the Moving-Average Type-Token Ratio (Covington and McFall, 2010)
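A minimal sketch of the simple ratios above (plain Python; the whitespace tokenizer and toy sentence are illustrative, so real applications would tokenize and pre-process first):

```python
import math

text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()   # 13 tokens
types = set(tokens)     # 8 unique tokens (types)

ttr = len(types) / len(tokens)                 # type-to-token ratio
guiraud = len(types) / math.sqrt(len(tokens))  # Guiraud's index

print(len(tokens), len(types))           # 13 8
print(round(ttr, 2), round(guiraud, 2))  # 0.62 2.22
```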
Another way to think about textual complexity is to consider readability.
Readability: the ease with which a reader (especially one of a given education level) can comprehend a text
Combines both difficulty (text) and sophistication (reader)
Use a combination of syllables and sentence length to indicate difficulty
Human inputs were used to build the parameters
Flesch-Kincaid readability index
Flesch Reading Ease (FRE)
\[ FRE = 206.835 - 1.015\left(\frac{\mbox{total words}}{\mbox{total sentences}}\right)-84.6\left(\frac{\mbox{total syllables}}{\mbox{total words}}\right) \]
Flesch-Kincaid (Rescaled to US Educational Grade Levels)
\[ FKGL = 0.39\left(\frac{\mbox{total words}}{\mbox{total sentences}}\right) + 11.8\left(\frac{\mbox{total syllables}}{\mbox{total words}}\right) - 15.59 \]
Interpretation (FRE): 0-30: university level; 60-70: understandable by 13-15 year olds; 90-100: easily understood by an 11-year-old student.
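A minimal sketch of both formulas (plain Python; the word, sentence, and syllable counts are assumed to be given, since syllable counting is the hard part in practice):

```python
def flesch_reading_ease(words, sentences, syllables):
    # Higher = easier; roughly a 0-100 scale
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Rescaled to US educational grade levels
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a 100-word passage with 5 sentences and 140 syllables
print(round(flesch_reading_ease(100, 5, 140), 1))   # 68.1 -> readable by 13-15 year olds
print(round(flesch_kincaid_grade(100, 5, 140), 1))  # 8.7 -> roughly 9th grade
```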
Get human judgments of relative textual easiness, specifically for political texts.
Use a logit model to estimate latent “easiness” as equivalent to the “ability” parameter in the Bradley-Terry framework.
Use these as training data for a tree-based model; pick the most important covariates
Re-estimate the models using these covariates (Logit + covariates)
Using these parameters, one can “predict” the easiness parameter for a given new text
Nice plus ~ add uncertainty to the model-based estimates via bootstrapping
So far, our inputs for the vector representation of documents have relied simply on raw word frequencies.
Can we do better?
One option: weighting
Weights:
TF-IDF = Term Frequency - Inverse Document Frequency
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \]
\[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
\[ \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in them}} \right) \]
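A small sketch implementing these formulas directly (plain Python; the toy corpus is illustrative, and library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing and normalization, so their values will not match exactly):

```python
import math

docs = [
    "yes yes yes no no no",
    "yes yes yes yes yes yes",
    "maybe no maybe yes",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))
n_docs = len(tokenized)

def tf(term, doc):
    # Term count normalized by document length
    return doc.count(term) / len(doc)

def idf(term):
    # Log of (total documents / documents containing the term)
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df)

# One TF-IDF-weighted vector per document
tfidf = [{w: tf(w, doc) * idf(w) for w in vocab} for doc in tokenized]
for row in tfidf:
    print({w: round(v, 3) for w, v in row.items()})
# Note how "yes" gets weight 0 everywhere: it appears in every document,
# so it carries no information for distinguishing them.
```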