Week 4: Dictionaries and off-the-shelf classifiers
Your first problem set will be assigned today! Some important information:
You will receive and submit your assignment using Github!
Deadline: EOD next Wednesday, February 14th.
Please use an .RMD/.QMD file to submit your assignment. If you prefer to work in a Jupyter notebook, let me know!
Any questions?
After learning how to process and represent text as numbers, we started digging into how to use text in a research pipeline.
Counting words (Ban’s Paper)
Comparing document similarity using the vector space model (text re-use)
Measures of lexical diversity and readability
For the next two weeks, we will talk about Measurement:
Documents pertain to certain classes, and we can use statistical assumptions to measure these classes
In the Machine Learning tradition, we are introduced to two core families of models:
Unsupervised Models: learning (hidden or latent) structure in unlabeled data.
Supervised Models: learning relationship between inputs and a labeled set of outputs.
Step 1: label some examples of the concept we want to measure
Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input
Step 3: use the classifier - some f(x) - to predict the labels of unseen documents
Step 4: combine the measure with metadata or exogenous shocks to learn something new about the world
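The four steps can be sketched in plain Python. This is a toy illustration, not a real classifier: the labeled documents are invented, and the "model" just scores words by how often they appear under each label. In practice you would use a proper statistical model (e.g., via scikit-learn or quanteda).

```python
from collections import Counter

# Step 1: a handful of hand-labeled example documents (invented toy data)
labeled = [
    ("the economy is growing and jobs are up", "positive"),
    ("a wonderful recovery with strong growth", "positive"),
    ("the crisis deepens as unemployment rises", "negative"),
    ("a terrible recession with rising prices", "negative"),
]

# Step 2: "train" a very simple model on the document-feature matrix:
# count how often each word appears under each label
counts = {"positive": Counter(), "negative": Counter()}
for text, label in labeled:
    counts[label].update(text.split())

def classify(text):
    """Step 3: use the fitted scores to predict an unseen document."""
    words = text.split()
    pos = sum(counts["positive"][w] for w in words)
    neg = sum(counts["negative"][w] for w in words)
    return "positive" if pos >= neg else "negative"

# Step 4 would combine these predictions with metadata
# (e.g., publication dates, exogenous shocks) to learn about the world.
print(classify("unemployment and crisis dominate the news"))
```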
Assume you take the DeLorean and travel back twenty years: you want to run a simple sentiment analysis on a corpus of news articles.
Which challenges would you face?
How could you solve it?
Please consider all four steps described before
Use a set of pre-defined words that allow us to classify documents automatically, quickly and accurately.
Instead of optimizing a transformation function using statistical assumptions and labeled data, with dictionaries we assume the transformation function in advance.
A dictionary contains:
A set of key words
Weights given to each word ~ the same for all words, or some continuous variation
e.g. for sentiment analysis: horrible is scored as \(-1\) and beautiful as \(+1\)
the relative rate of occurrence of these terms tells us about the overall tone or category that the document should be placed in.
For document \(i\) and words \(m=1,\ldots, M\) in the dictionary,
\[\text{tone of document $i$}= \sum^M_{m=1} \frac{s_m w_{im}}{N_i}\]
Where:
\(s_m\) is the score of word \(m\) in the dictionary
\(w_{im}\) is the number of times word \(m\) appears in document \(i\)
\(N_i\) is the total number of words in document \(i\)
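The tone formula above can be sketched in a few lines of Python. The mini dictionary here is invented for illustration; a real analysis would use a validated dictionary such as LIWC or VADER.

```python
# Hypothetical mini sentiment dictionary: word -> score s_m
scores = {"horrible": -1, "beautiful": +1, "awful": -1, "great": +1}

def tone(document):
    """Tone of document i: sum of s_m * w_im over dictionary words,
    divided by N_i (the total number of words in the document)."""
    words = document.lower().split()
    n_i = len(words)
    return sum(scores.get(w, 0) for w in words) / n_i

# (+1 for "beautiful", -1 for "horrible", +1 for "great") / 9 words
print(tone("what a beautiful day after a horrible great storm"))
```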
Low cost and computationally efficient ~ if using a dictionary developed and validated by others
A hybrid procedure between qualitative and quantitative classification at the fully automated end of the text analysis spectrum
Dictionary construction involves a lot of contextual interpretation and qualitative judgment
Transparency: no black-box model behind the classification task
LIWC (Linguistic Inquiry and Word Count): created by Pennebaker et al. See http://www.liwc.net
VADER (Valence Aware Dictionary and sEntiment Reasoner):
Tuned for social media text
Captures polarity and intensity
Python and R libraries: https://github.com/cjhutto/vaderSentiment
Article: https://ojs.aaai.org/index.php/ICWSM/article/view/14550/14399
A dictionary created specifically for political communication
Combines:
Each word pertains to a single class
Plus
A hierarchical set of categories to distinguish policy domains and policy positions on party manifestos
Five Domains:
Looks for word occurrences within “word strings with an average length of ten words”
We used the R package quanteda to analyze Twitter and Facebook text. During text preprocessing, we removed punctuation, URLs, and numbers. To classify whether a specific post was referring to a liberal or a conservative, we adapted previously used dictionaries that referred to words associated with liberals or conservatives. Specifically, these dictionaries included 1) a list of the top 100 most famous Democratic and Republican politicians according to YouGov, along with their Twitter handles (or Facebook page names for the Facebook datasets) (e.g., “Trump,” “Pete Buttigieg,” “@realDonaldTrump”); 2) a list of the current Democratic and Republican (but not independent) US Congressional members (532 total) along with their Twitter and Facebook names (e.g., “Amy Klobuchar,” “Tom Cotton”); and 3) a list of about 10 terms associated with Democratic (e.g., “liberal,” “democrat,” or “leftist”) or Republican identity (e.g., “conservative,” “republican,” or “right-wing”).
We then assigned each tweet a count for words that matched our Republican and Democrat dictionaries (for instance, if a tweet mentioned two words in the “Republican” dictionary, it would receive a score of “2” in that category). We also used previously validated dictionaries that counted the number of positive and negative affect words per post and the number of moral-emotional words per post (LIWC).
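The counting step described above can be sketched in Python. The word lists below are abbreviated, hypothetical stand-ins for the dictionaries the authors describe (the originals have hundreds of entries), and the simple substring matching here is only an approximation of quanteda's tokenized lookup.

```python
# Hypothetical, heavily abbreviated versions of the described dictionaries
republican_dict = {"trump", "conservative", "republican", "tom cotton"}
democrat_dict = {"biden", "liberal", "democrat", "leftist"}

def count_matches(post, dictionary):
    """Count occurrences of dictionary terms in a post (case-insensitive).

    Note: naive substring matching can over-count (e.g., "trumpet"
    would match "trump"); quanteda matches on tokens instead.
    """
    text = post.lower()
    return sum(text.count(term) for term in dictionary)

# A post mentioning two Republican-dictionary terms receives a score of 2
print(count_matches("Conservative voters back Trump", republican_dict))
```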
We already discussed some of the advantages:
low cost when working with open-source dictionaries
bridge qualitative and quantitative
easy to validate
transfer well across languages.
Text-as-Data