Week 4: Dictionaries and off-the-shelf classifiers
Your first problem set will be assigned today! Some important information:
You will receive and submit your assignment using GitHub!
GitHub Classroom automatically creates a repo for the assignment. You and I are both owners of the repo.
Come to my office hours if you don’t know how to work with GitHub
Deadline: EOD next Monday, September 29th.
Please use an .RMD/.QMD/.ipynb file to submit your assignment.
Any questions?
After learning how to process and represent text as numbers, we started digging into how to use text in a research pipeline.
Descriptive inference:
Counting words (Ban’s Paper)
Comparing document similarity using vector space model
Measures of lexical diversity and readability
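As a quick refresher on the vector space model, document similarity is typically computed as the cosine of the angle between term-count vectors. A minimal sketch in pure Python (the toy documents in the usage below are invented for illustration):

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents' term-count vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Identical documents score 1, documents sharing no words score 0.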
For the next two weeks, we will talk about measurement and classification:
Today:
In the machine learning tradition, we are introduced to two core families of models:
Unsupervised Models: learning (hidden or latent) structure in unlabeled data.
Supervised Models: learning the relationship between inputs and a labeled set of outputs.
Documents come from certain classes; we will see how to use statistical assumptions to measure these classes
Measures should have clear goals
Source material should always be identified and ideally made public
Coding should be explainable and reproducible
Validate!
Limitations should be explored, documented, and communicated
Plus: develop the measure on a training set → apply the measure to the test set
Step 0: identify a dataset and a concept you would like to measure
Step 1: label some examples of the concept we want to measure
Step 2: train a statistical model on this set of labeled data, using the document-feature matrix as input.
Step 3: use the classifier - some f(x) - to classify unseen documents.
Step 4: use the measure + metadata to learn something new about the world.
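Steps 1-3 can be sketched in a few lines of Python. This toy example uses invented labeled documents and a simple word-overlap (class-centroid) rule in place of a real statistical model, purely to make the pipeline concrete:

```python
from collections import Counter

# Step 1: a few hand-labeled toy examples (invented for illustration)
train = [("great wonderful film", "pos"),
         ("terrible boring film", "neg"),
         ("wonderful acting great plot", "pos"),
         ("boring plot terrible acting", "neg")]

# Step 2: build a word-count "document-feature" profile per class
centroids: dict[str, Counter] = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(text.split())

# Step 3: classify an unseen document by weighted word overlap
def classify(text: str) -> str:
    counts = Counter(text.split())
    def score(label: str) -> int:
        return sum(centroids[label][w] * c for w, c in counts.items())
    return max(centroids, key=score)
```

In practice you would replace the centroid rule with a trained classifier (e.g., regularized logistic regression over the document-feature matrix), but the shape of the pipeline is the same.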
Assume you could travel back twenty years: you want to run a simple sentiment analysis on a corpus of news articles.
Which challenges would you face?
How could you solve them?
Please consider all four steps described before
Use a set of pre-defined words that allow us to classify documents automatically, quickly and accurately.
Instead of optimizing a transformation function using statistical assumptions and observed data, with dictionaries we have a pre-specified recipe for the transformation function.
A dictionary contains:
Weights given to each word
For document \(i\) and words \(m=1,\ldots, M\) in the dictionary,
\[\text{tone of document $i$}= \sum^M_{m=1} \frac{s_m w_{im}}{N_i}\]
Where:
\(s_m\) is the score (weight) assigned to dictionary word \(m\)
\(w_{im}\) is the count of word \(m\) in document \(i\)
\(N_i\) is the total number of words in document \(i\)
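The tone formula translates directly into code. A minimal sketch, with an invented three-word sentiment dictionary (scores \(s_m\) of +1/-1):

```python
def tone(document: str, scores: dict) -> float:
    """Sum of s_m * w_im over dictionary words, divided by N_i."""
    words = document.lower().split()
    # each occurrence of a dictionary word contributes its score s_m,
    # which is equivalent to summing s_m * w_im over the dictionary
    return sum(scores.get(w, 0) for w in words) / len(words)

# invented toy dictionary: +1 for positive words, -1 for negative
sentiment = {"magnificent": 1, "brilliant": 1, "terrible": -1}
```

Dividing by document length \(N_i\) keeps long and short documents comparable.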
Glowing reviews from me. A truly magnificent read with a plot ever so deep, characters that you grow incredibly fond of in books to follow, and writing that just flows carrying you through the page after page. Needless to say, I could not put it down and they only get better from book to book. I am very anxious waiting for the next book I’ve just finished the third in the series and am beginning the forth. I’m absolutely hooked on this brilliant authors work. Long live the Reaper!! Howlers unite!
Low cost and computationally efficient, especially when using a dictionary developed and validated by others
A hybrid procedure between qualitative and quantitative models
Dictionary construction involves a lot of contextual interpretation and qualitative judgment
Transparency: no black-box model behind the classification task
Created by Pennebaker et al. (see http://www.liwc.net)
Valence Aware Dictionary and sEntiment Reasoner:
Tuned for social media text; open source and free.
Captures both polarity and intensity
VADER heuristic rules to signal intensity or polarity shifts:
Punctuation: ! → increased intensity
Punctuation: ALL-CAPS → increased intensity
Degree modifiers (intensifiers, e.g., "extremely good") → increased intensity
Conjunction “but” signals a shift in sentiment polarity
Examining the tri-gram preceding a sentiment-laden lexical feature → catches nearly 90% of cases where negation flips the polarity of the text
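These heuristics can be illustrated with a stripped-down, invented rule set; the real implementation in the vaderSentiment package is far more elaborate (calibrated lexicon scores, normalized compound scores, many more rules):

```python
def heuristic_score(text: str, lexicon: dict) -> float:
    """Toy VADER-style scoring: lexicon score boosted by ALL-CAPS,
    '!', and degree modifiers, flipped by a preceding negation."""
    negations = {"not", "never", "no"}        # invented mini negation list
    boosters = {"extremely", "very", "absolutely"}  # invented degree modifiers
    tokens = text.split()
    total = 0.0
    for i, tok in enumerate(tokens):
        word = tok.strip("!").lower()
        if word not in lexicon:
            continue
        score = lexicon[word]
        if tok.strip("!").isupper():          # ALL-CAPS → more intense
            score *= 1.5
        score *= 1.25 ** tok.count("!")       # each '!' → more intense
        if i and tokens[i - 1].lower() in boosters:
            score *= 1.3                      # degree modifier boost
        if any(t.lower() in negations for t in tokens[max(0, i - 3):i]):
            score *= -1                       # negation in preceding tri-gram
        total += score
    return total
```

So "not good" flips to negative, while "GOOD!" scores higher than "good".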
Python and R libraries: https://github.com/cjhutto/vaderSentiment
Article: https://ojs.aaai.org/index.php/ICWSM/article/view/14550/14399
Creates the Lexicoder Sentiment Dictionary, built specifically for political communication (e.g., sentiment in news coverage, legislative speech, and other political texts)
Combines:
Each word pertains to a single class
Plus
Moral foundations: dimensions of difference that explain human moral reasoning
Measures the proportions of virtue and vice words for each foundation:
Link to dictionary: https://moralfoundations.org/other-materials/
We already discussed some of the advantages:
low cost when working with open-source dictionaries
bridge qualitative and quantitative
easy to validate
transfer well across languages.
RQ:
Theory:
Data:
Method:
In which year, and where, was this paper published?
We used the R package quanteda to analyze Twitter and Facebook text. During text preprocessing, we removed punctuation, URLs, and numbers. To classify whether a specific post was referring to a liberal or a conservative, we adapted previously used dictionaries that referred to words associated with liberals or conservatives. Specifically, these dictionaries included 1) a list of the top 100 most famous Democratic and Republican politicians according to YouGov, along with their Twitter handles (or Facebook page names for the Facebook datasets) (e.g., “Trump,” “Pete Buttigieg,” “@realDonaldTrump”); 2) a list of the current Democratic and Republican (but not independent) US Congressional members (532 total) along with their Twitter and Facebook names (e.g., “Amy Klobuchar,” “Tom Cotton”); and 3) a list of about 10 terms associated with Democratic (e.g., “liberal,” “democrat,” or “leftist”) or Republican identity (e.g., “conservative,” “republican,” or “right-wing”).
We then assigned each tweet a count for words that matched our Republican and Democrat dictionaries (for instance, if a tweet mentioned two words in the “Republican” dictionary, it would receive a score of “2” in that category). We also used previously validated dictionaries that counted the number of positive and negative affect words per post and the number of moral-emotional words per post (LIWC).
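The counting step the authors describe is straightforward to sketch. The mini-dictionaries below are illustrative fragments, not the authors' full lists:

```python
# illustrative fragments of the partisan dictionaries, not the full lists
republican = {"trump", "republican", "conservative", "right-wing"}
democrat = {"biden", "democrat", "liberal", "leftist"}

def partisan_counts(post: str) -> dict:
    """Count dictionary matches per post, one score per category."""
    tokens = [t.strip(".,!?").lower() for t in post.split()]
    return {"republican": sum(t in republican for t in tokens),
            "democrat": sum(t in democrat for t in tokens)}
```

A post mentioning two "Republican" dictionary terms thus scores 2 in that category, exactly as described in the quoted methods section.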
Dictionary Methods:
Supervised Learning:
Def: Machine learning tool developed by Jigsaw (Google) to analyze text for harmful or toxic language:
Why?
Limitations:
RQ: Characterize streaming chats during political debates leading up to the 2020 U.S. presidential election
Theory: From passive consumer to active co-creator; changing the (online) channel changes the viewing experience
Data/method: Facebook livestream chats on NBC News, ABC News, and Fox News during the 2020 US presidential debates
Outcomes: Length, speed, toxicity of comments
Hugging Face’s Model Hub: centralized repository for sharing and discovering pre-trained models [https://huggingface.co]
GenAI Models (with zero-shot prompting) can also be used to replace dictionaries
Text-as-Data