Week 10: Scaling Models for Text
Overview of scaling models
Scaling models using text
Supervised models: Wordscores
Unsupervised models:
Wordfish
Doc2Vec
Scaling models with network data
Many substantive questions in policy and politics depend on estimating ideological preferences:
Many more examples ….
In the past 20 years, computational text analysis has been widely used for building scaling models.
Advantages: fast, reliable, handles large volumes of text, and transfers easily to other domains/languages.
Wordscores: a supervised approach that mimics Naive Bayes models; start with reference texts and score virgin texts.
Wordfish: an unsupervised approach that learns word occurrence patterns from the documents using an ideal-point model.
Doc2Vec: an unsupervised approach that maps documents into an embedding vector space, then uses PCA to reduce dimensionality (sketched below).
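A minimal sketch of that Doc2Vec-plus-PCA pipeline, assuming gensim and scikit-learn are available; the toy documents and hyperparameters below are illustrative only, not part of the original example:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.decomposition import PCA

# Toy corpus: four short "manifesto" snippets (illustrative only).
docs = [
    "we will build a wall and secure the border",
    "lower taxes and strong borders keep us safe",
    "we will expand health care and protect workers",
    "invest in schools hospitals and clean energy",
]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# Learn a document embedding for each text (tiny settings for a tiny corpus).
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=200, seed=1)
vectors = [model.dv[i] for i in range(len(docs))]

# Reduce the embeddings to one dimension; the first principal component is
# then read as the latent position of each document.
positions = PCA(n_components=1).fit_transform(vectors).ravel()
print(positions)
```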
Step 1: Begin with a reference set (training set) of texts that have known positions.
Step 2: Generate word scores from these reference texts
Step 3: Score the virgin texts (test set) using those word scores
\[ P_{wr} = \frac{F_{wr}}{\sum_r F_{wr}}\] - \(F_{wr}\): relative frequency of word \(w\) in reference text \(r\); \(P_{wr}\): probability that, given we observe word \(w\), we are reading reference text \(r\)
\[S_{wd} = \sum_r (P_{wr} \cdot A_{rd}) \] - \(S_{wd}\): Score of word \(w\) in dimension \(d\)
\(A_{rd}\): Pre-defined position of reference text \(r\) in dimension \(d\)
\(A_{rd}\) will be -1 for liberal and +1 for conservative documents, for example.
\[ S_{vd} = \sum_w (F_{wv} \cdot S_{wd}) \] - \(S_{vd}\): Score of virgin text \(v\) in dimension \(d\); \(F_{wv}\): relative frequency of word \(w\) in virgin text \(v\)
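Putting the three formulas together, a minimal numpy sketch (the frequency matrices and reference positions are made up for illustration; this covers only the scoring formulas shown above):

```python
import numpy as np

# Rows = reference texts, columns = words; entries are the relative
# frequencies F_wr (made-up numbers for illustration).
F_ref = np.array([[0.025, 0.010],     # reference text 1 (e.g. Republican)
                  [0.005, 0.020]])    # reference text 2 (e.g. Democratic)
A = np.array([+1.0, -1.0])            # A_rd: known positions of the reference texts

# P_wr: normalise each word's frequencies across the reference texts.
P = F_ref / F_ref.sum(axis=0, keepdims=True)

# S_wd: position-weighted average over reference texts, one score per word.
word_scores = P.T @ A

# S_vd: score virgin texts from their own relative frequencies F_wv.
F_virgin = np.array([[0.200, 0.001],  # virgin text 1
                     [0.001, 0.050]]) # virgin text 2
virgin_scores = F_virgin @ word_scores
print(word_scores)     # ~[ 0.67, -0.33]
print(virgin_scores)   # ~[ 0.13, -0.02]
```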
Suppose the Republican manifesto uses `wall' 25 times in 1000 words, while the Democratic manifesto uses it only 5 times. Assume \(A_{rd} = +1\) for the Republican and \(-1\) for the Democratic reference text.
\[P_{w,\mathrm{rep}} = \frac{0.025}{0.025 + 0.005} \approx 0.83\]
\[P_{w,\mathrm{dem}} \approx 0.17 \] \[ S_w = 0.83 \cdot (+1) + 0.17 \cdot (-1) \approx 0.67\]
Virgin text 1 mentions `wall' 200 times in 1000 words: \(0.200 \cdot 0.67 \approx 0.134\)
Virgin text 2 mentions `wall' once in 1000 words: \(0.001 \cdot 0.67 \approx 0.00067\)
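For reference, the arithmetic in this example can be checked in a few lines (the counts are the ones above):

```python
# Relative frequencies of `wall' in the two reference texts (25 and 5 per 1000 words).
f_rep, f_dem = 25 / 1000, 5 / 1000
p_rep = f_rep / (f_rep + f_dem)       # ~0.83
p_dem = f_dem / (f_rep + f_dem)       # ~0.17
s_wall = p_rep * (+1) + p_dem * (-1)  # word score, ~0.67
print(s_wall)
print(0.200 * s_wall)                 # virgin text 1, ~0.134
print(0.001 * s_wall)                 # virgin text 2, ~0.00067
```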