Week 3: Descriptive Inference - Comparing Documents
You have your pairs for the replication exercise.
Next steps:
Select the paper you wish to replicate (by lecture day next week)
Post on Slack
Talk to me if you have any issues
Last week:
Pre-processing text + bag of words ~> greatly reduces text complexity (dimensionality)
Text representation using vectors of numbers ~> document feature matrix (text to numbers)
Example of counting words to make inference
We will start thinking about latent outcomes from text. Our first approach will focus on descriptive inference based on documents:
Comparing documents
Using similarity to measure text-reuse
Evaluating complexity in text
Weighting (TF-IDF)
To represent documents as numbers, we use the vector space model representation:
A document \(D_i\) is represented as a collection of features \(W\) (words, tokens, n-grams, etc…)
Each feature \(w_i\) can be placed on a real line
A document \(D_i\) is a point in \(\mathbb{R}^W\)
Documents, W=2
Document 1 = “yes yes yes no no no”
Document 2 = “yes yes yes yes yes yes”
Using the vector space, we can use notions of geometry to build well-defined comparisons between the documents.
The ordinary, straight-line distance between two points in space, computed over document vectors \(U\) and \(V\) with \(j\) dimensions.
Euclidean Distance
\[ \|U - V\| = \sqrt{\sum_{i=1}^{j}(u_{i} - v_{i})^2} \]
\(U\) = [0, 2.51, 3.6, 0] and \(V\) = [0, 2.3, 3.1, 9.2]
\(\sum_{i}(u_{i} - v_{i})^2\) = \((0-0)^2 + (2.51-2.3)^2 + (3.6-3.1)^2 + (0-9.2)^2\) = \(84.9341\)
\(\sqrt{\sum_{i}(u_{i} - v_{i})^2}\) = \(\sqrt{84.9341} \approx 9.22\)
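A minimal sketch of this computation in Python (using numpy; the vectors are the ones from the example above):

```python
import numpy as np

# Example vectors from the slide
U = np.array([0, 2.51, 3.6, 0])
V = np.array([0, 2.3, 3.1, 9.2])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((U - V) ** 2))
print(euclidean)  # ~9.22

# Equivalent, using numpy's built-in norm
print(np.linalg.norm(U - V))
```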
Documents, W=2 {yes, no}
Document 1 = “yes yes yes no no no” (3, 3)
Document 2 = “yes yes yes yes yes yes” (6,0)
Document 3 = “yes yes yes no no no yes yes yes no no no yes yes yes no no no yes yes yes no no no” (12, 12)
Euclidean distance rewards magnitude rather than direction: Documents 1 and 3 have identical proportions of “yes” and “no”, yet their Euclidean distance (\(\sqrt{9^2 + 9^2} \approx 12.7\)) is larger than the distance between Documents 1 and 2 (\(\sqrt{3^2 + 3^2} \approx 4.2\)).
\[ \text{cosine similarity}(\mathbf{U}, \mathbf{V}) = \frac{\mathbf{U} \cdot \mathbf{V}}{\|\mathbf{U}\| \|\mathbf{V}\|} \]
Vector: an ordered list of numbers in a space with well-defined addition and scalar multiplication operations
It is defined as an object with both magnitude and direction
Magnitude: how long it is
Direction: where it points
The dimensionality of a vector is the number of elements it has
For example, [1, 2, 3] has dimensionality 3
In other words, it is a vector in 3D space
The vector [3,5] is 2D
Given a vector \(U\) and a scalar constant \(k\), we can multiply the vector by the scalar.
\[ kU = [k u_1, k u_2, k u_3, \ldots, k u_n] \]
scales the length/magnitude
does not change the direction (for \(k > 0\))
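For example, with \(k = 2\) and \(U = [1, 2]\):
\[ 2U = [2, 4], \qquad \|2U\| = \sqrt{4 + 16} = 2\sqrt{5} = 2\,\|U\| \]
The vector is twice as long but points in the same direction.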
The denominator of the cosine similarity formula gives you the magnitude or length of a vector.
Magnitude: use the L2 norm, the Euclidean distance from the origin
It converts a vector to a scalar representing the size of a vector
\[ \|\mathbf{U}\| = \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2} = \sqrt{\sum_{i} u_i^2} \]
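For example:
\[ \|[3, 4]\| = \sqrt{3^2 + 4^2} = \sqrt{25} = 5 \]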
The numerator of the cosine similarity is called the ‘dot product’ between two vectors. This is a multiplication between two vectors that returns a scalar.
\[ \mathbf{U} \cdot \mathbf{V} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n = \sum_{i} u_i v_i \]
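For example, with \(U = [1, 2, 3]\) and \(V = [4, 5, 6]\):
\[ \mathbf{U} \cdot \mathbf{V} = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32 \]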
Intuition:
“the product of two vectors a and b is the projection of a onto b times the length of b (it is symmetric, so equivalently the projection of b onto a times the length of a).” (GMB, pp71)
You can see this by considering the geometric formula for the dot product:
\[ \mathbf{U} \cdot \mathbf{V} = \|\mathbf{U}\| \cdot \|\mathbf{V}\| \cdot \cos(\theta) \]
More interestingly:
\[ \|\text{Proj}_{V}(U)\| = \|U\| \cos(\theta) \]
Yet another formula
\[ \mathbf{U} \cdot \mathbf{V} = \|\text{Proj}_{V}(U)\| \, \|V\| \]
“the product of two vectors a and b is the projection of a onto b times the length of b (it is symmetric, so equivalently the projection of b onto a times the length of a).” (GMB, pp71)
\[ \text{cosine similarity}(\mathbf{U}, \mathbf{V}) = \frac{\mathbf{U} \cdot \mathbf{V}}{\|\mathbf{U}\| \|\mathbf{V}\|} \]
\(U \cdot V\) ~ dot product between vectors
\(||\mathbf{U}||\) ~ vector magnitude, length ~ \(\sqrt{\sum{u_{i}^2}}\)
Cosine similarity captures relative direction, controlling for differences in magnitude.
The cosine function ranges from -1 to 1. When thinking about document vectors, however, cosine similarity is constrained to vary only from 0 to 1, because document feature counts are never negative.
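A minimal sketch in Python, using the three example documents above as (yes, no) count vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector magnitudes
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc1 = np.array([3, 3])    # "yes yes yes no no no"
doc2 = np.array([6, 0])    # "yes yes yes yes yes yes"
doc3 = np.array([12, 12])  # Document 1 repeated four times

print(cosine_similarity(doc1, doc3))  # 1.0  -> same direction, length ignored
print(cosine_similarity(doc1, doc2))  # ~0.71
print(np.linalg.norm(doc1 - doc3))    # ~12.73 -> Euclidean distance penalizes the longer document
```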
Compiled a dataset of 36,603 episodes from 79 prominent political podcasters
Pre-processed text transcripts
Calculated the cosine similarity between podcast transcripts and fact-checked claims (PolitiFact and Snopes)
Set a cosine similarity threshold (e.g., 0.5) to filter significant matches (see the sketch after this list)
Once a claim was detected in a transcript, the analysis moved to determine how the claim was presented (manual review, sentiment analysis)
Quantified the prevalence of unsubstantiated and false claims across the podcast episodes
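A minimal sketch of the matching step, assuming scikit-learn is available; the texts, variable names, and threshold here are purely illustrative, not the authors’ actual data or code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for pre-processed podcast transcripts and fact-checked claims
transcripts = [
    "the host claimed the election was stolen in several states",
    "today we discuss the latest jobs report and the economy",
]
claims = [
    "claim that the election was stolen",
    "claim that the jobs report was fabricated",
]

# Represent transcripts and claims in the same bag-of-words space
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(transcripts + claims)
X_transcripts, X_claims = X[:len(transcripts)], X[len(transcripts):]

# Cosine similarity between every transcript and every claim
sims = cosine_similarity(X_transcripts, X_claims)

# Keep transcript-claim pairs above an illustrative threshold for manual review
threshold = 0.5
matches = [(i, j) for i in range(sims.shape[0])
           for j in range(sims.shape[1]) if sims[i, j] >= threshold]
print(matches)
```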
Magnitude? Euclidean or Manhattan Distance if total size/frequency of terms is important.
Proportions? Cosine Similarity if relative distribution of terms matters more than document length.
Overlap? Jaccard Similarity if interested in shared terms regardless of frequency.
Trends? Pearson Correlation to compare patterns in term distributions across documents.
Sparse, high-dimensional text data? Cosine similarity because it emphasizes proportional relationships and ignores document length.
Dense data with similar distributions? Euclidean distance, because document lengths are likely comparable (see the sketch below).
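A minimal sketch in Python contrasting these options on a pair of toy count vectors (values chosen purely for illustration):

```python
import numpy as np

# Toy count vectors over the same four-term vocabulary
u = np.array([3, 0, 2, 1])
v = np.array([1, 2, 0, 1])

# Euclidean distance: sensitive to total magnitude
euclidean = np.linalg.norm(u - v)

# Cosine similarity: relative direction only
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Jaccard similarity on the sets of terms that appear (frequency ignored)
u_terms, v_terms = set(np.nonzero(u)[0]), set(np.nonzero(v)[0])
jaccard = len(u_terms & v_terms) / len(u_terms | v_terms)

# Pearson correlation of the term distributions
pearson = np.corrcoef(u, v)[0, 1]

print(euclidean, cosine, jaccard, pearson)
```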
We saw how to compare documents… now let’s move on to other descriptive features of documents. Let’s talk about:
Lexical Diversity
Readability
Political Sophistication
Length refers to the size in terms of: characters, words, lines, sentences, paragraphs, pages, sections, chapters, etc.
Tokens are generally words ~ useful semantic unit for processing
Types are unique tokens.
Typically \(N_{tokens}\) >>>> \(N_{types}\)
Type-to-Token ratio
\[ TTR = \frac{\text{total types}}{\text{total tokens}} \]
TTR is very sensitive to overall document length (illustrated in the sketch below).
Length also correlates with topic variation ~ more types being added to the document
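A minimal sketch of the type-token ratio in Python, using a simple whitespace tokenizer (a real analysis would use a proper tokenizer):

```python
def type_token_ratio(text):
    # Tokens: all words; types: unique words
    tokens = text.lower().split()
    types = set(tokens)
    return len(types) / len(tokens)

short_doc = "yes yes yes no no no"
long_doc = short_doc + " " + short_doc  # the same content, repeated

print(type_token_ratio(short_doc))  # ~0.33
print(type_token_ratio(long_doc))   # ~0.17 -> TTR falls as the document gets longer
```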
Readability: the ease with which a reader (especially one of a given education level) can comprehend a text
Combines both difficulty (text) and sophistication (reader)
Average sentence length (based on the number of words) and average word length (based on the number of syllables) to indicate difficulty
Human inputs to build parameters about the readers
There is a hundred years of literature on the measurement of ‘readability’. The general issue was assigning school texts to students of different ages and abilities.
Flesch Reading Ease (FRE)
\[ FRE = 206.835 - 1.015\left(\frac{\mbox{total words}}{\mbox{total sentences}}\right)-84.6\left(\frac{\mbox{total syllables}}{\mbox{total words}}\right) \]
Flesch-Kincaid Grade Level (Rescaled to US Educational Grade Levels)
\[ FKGL = 0.39\left(\frac{\mbox{total words}}{\mbox{total sentences}}\right) + 11.8\left(\frac{\mbox{total syllables}}{\mbox{total words}}\right) - 15.59 \]
Interpretation (FRE scale): 0-30: university level; 60-70: understandable by 13-15 year olds; and 90-100: easily understood by an 11-year-old student. Both formulas are sketched in code below.
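A minimal sketch of both readability formulas in Python, assuming word, sentence, and syllable counts are already available (in practice, syllable counting requires a dictionary or heuristic):

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Higher scores = easier text
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    # Rescaled to an approximate US school grade level
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Illustrative counts for a short passage
print(flesch_reading_ease(100, 8, 140))   # ~75.7
print(flesch_kincaid_grade(100, 8, 140))  # ~5.8
```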
Research Question: How did suffrage extension affect the behavior of members of parliament (MPs) relative to ministers during the Victorian period?
Data: House of Commons (1832-1915) speeches
Method: FRE scores
Get human judgments of relative textual easiness for specifically political texts.
Use a logit model to estimate latent “easiness” as equivalent to the “ability” parameter in the Bradley-Terry framework.
Use these human judgments as training data for a tree-based model; pick the most important predictors
Re-estimate the models using these covariates (Logit + covariates)
Using these parameters, one can “predict” the easiness parameter for a given new text
So far, our inputs for the vector representation of documents have relied simply on raw word frequencies.
Can we do better?
One option: weighting
Weighting function:
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \]
\[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
\[ \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in them}} \right) \]
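A minimal sketch of these formulas in Python, computed by hand over a toy corpus (libraries such as scikit-learn add smoothing, so their values differ slightly):

```python
import math

docs = [
    "yes yes yes no no no",
    "yes yes yes yes yes yes",
    "maybe yes no",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Log of (number of documents / number of documents containing the term)
    n_docs_with_term = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_docs_with_term)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "yes" appears in every document, so its IDF (and TF-IDF) is 0;
# "maybe" is rare, so it gets a higher weight in the third document
for term in vocab:
    print(term, [round(tf_idf(term, doc, tokenized), 3) for doc in tokenized])
```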