PPOL 5203 - Data Science I: Foundations

Week 11: Text-As-Data I: Description and Topics

Professor: Tiago Ventura

2024-11-19

A significant part of how we produce and store data is textual

Official Documents: Congressional Speeches, Bills, Press Releases, Transcripts, from all over the world!!

Social Media

Online Media: News, Comments, Blogs, etc…

Modeling Text as Data

Why do we need models to analyze text data?

Scalability and Dimensionality Reduction

  • Humans are great at understanding the meaning of a sentence or a document in-depth

  • Computers are better at detecting patterns, and at classifying and describing content across millions of documents

  • Computational Text analysis augments humans – does not replace us

  • Statistical models + Powerful Computers allow us to process data at scale and understand common patterns

  • We know “all models are wrong” – this is particularly true with text!

Workflow

  • Acquire textual data:

    • Existing corpora; scraped data; digitized text
  • Preprocess the data:

    • Bag-of-words
    • Reduce noise, capture signal

Workflow

  • Apply method appropriate to research goal:

    • Descriptive Analysis:
      • Count Words, Readability; similarity;
    • Classify documents into unknown categories
      • Document clustering
      • Topic models
    • Classify documents into known categories
      • Dictionary methods
      • Supervised machine learning
      • Transfer-Learning - use models trained in text for other purposes

Already did it!

  • Acquire textual data:
    • Existing corpora; scraped data; digitized text

Today:

  • Preprocess the data:

    • Convert text to numbers.
    • Use the Bag-of-words assumption.
    • Reduce noise, capture signal
  • Apply methods appropriate to research goal:

    • Descriptive Analysis:
      • Count Words, Readability; similarity;
    • Classify documents into unknown categories
      • Topic models

Next Week:

  • Classify documents into known categories
    • Dictionary methods
    • Supervised machine learning
    • Transfer-Learning - use models trained in text for other purposes

Very quick introduction! Want to learn more? Take the Text-as-Data class in the future.

From Text-to-Numbers: Representing Texts as Data.

When working with text, we have two critical challenges:

  • Reduce Complexity of Text (think of every word as a variable ~ this is a huge matrix!)

    • Pre-Processing steps
  • Convert unstructured text to numbers ~ that we can feed into a statistical model

    • Document Feature Matrix/Bag-of-Words

Reducing Complexity: Pre-Processing Steps

  • Tokenize: break larger chunks of text into words (unigrams) or n-grams

  • Lowercase: convert all to lower case

  • Remove Noise: stopwords, numbers, punctuation, function words

  • Stemming: chops off the ends of words

  • Lemmatization: doing things properly with the use of a vocabulary and morphological analysis of words
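The steps above can be sketched in plain Python. The stopword list and suffix rules below are tiny illustrative stand-ins (real projects would use NLTK, spaCy, or quanteda-style tooling), but the pipeline order — lowercase, remove noise, tokenize, stem — is the one listed above.

```python
import re

# Toy stopword list and suffix rules, for illustration only.
STOPWORDS = {"a", "we", "and", "the", "of", "is", "it", "to", "that", "from", "over"}
SUFFIXES = ("ing", "es", "ed", "ly", "s")

def crude_stem(token):
    """Chop off a common suffix -- a rough stand-in for a Porter-style stemmer."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def preprocess(text):
    text = text.lower()                                  # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # drop numbers/punctuation
    tokens = text.split()                                # tokenize into unigrams
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    return [crude_stem(t) for t in tokens]               # stem

print(preprocess("We define and discuss a measure of power."))
# ['define', 'discus', 'measure', 'power']
```

Note how even this crude stemmer produces truncated forms like “discus”, just as in the pre-processed example below.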

Example

Text

We use a new dataset containing nearly 50 million historical newspaper pages from 2,700 local US newspapers over the years 1877–1977. We define and discuss a measure of power we develop based on observed word frequencies, and we validate it through a series of analyses. Overall, we find that the relative coverage of political actors and of political offices is a strong indicator of political power for the cases we study

After pre-processing

use new dataset contain near 50 million historical newspaper pag 2700 local u newspaper year 18771977 define discus measure power develop bas observ word frequenc validate ser analys overall find relat coverage political actor political offic strong indicator political power cas study

Text Representation

  • Represent text as an unordered set of tokens in a document ~ the Bag-of-Words Assumption

  • While order is ignored, frequency matters!

  • To represent documents as numbers, we will use the vector space model representation:

    • A document \(D_i\) is represented as a collection of features \(W\) (words, tokens, n-grams..)

    • Each feature \(w_i\) can be placed in a real line, then a document \(D_i\) is a vector with \(W\) dimensions

Imagine the sentence below: “If that is a joke, I love it. If not, can’t wait to unpack that with you later.”

  • Sorted Vocabulary = (a, can’t, i, if, is, it, joke, later, love, not, that, to, unpack, wait, with, you)

  • Feature Representation = (1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1)

  • Features will typically be the n-gram (mostly unigram) frequencies of the tokens in the document, or some function of those frequencies

Each document becomes a numerical vector (vector space model)

  • stacking these vectors will give you our workhorse representation for text: the Document Feature Matrix
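A minimal sketch of this vectorization, reproducing the slide’s sentence with naive whitespace tokenization (lowercased, punctuation other than the apostrophe stripped):

```python
from collections import Counter

# The example sentence, already minimally cleaned.
sentence = ("if that is a joke i love it "
            "if not can't wait to unpack that with you later")
tokens = sentence.split()

vocab = sorted(set(tokens))           # sorted vocabulary
counts = Counter(tokens)
vector = [counts[w] for w in vocab]   # feature representation

print(vocab)
print(vector)  # [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1]
```

Stacking one such vector per document (with a shared vocabulary) yields the Document Feature Matrix; libraries like scikit-learn’s `CountVectorizer` do exactly this at scale.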

Visualizing Vector Space Model

Documents

Document 1 = “yes yes yes no no no”

Document 2 = “yes yes yes yes yes yes”

From Text to Numbers: Bag-of-Words & Document-Feature Matrix

Text-as-Data methods

Comparing/Describing Documents

Topic Models: Finding hidden structures in a corpus

Comparing Documents: How ‘far’ is document a from document b?

Using the vector space, we can use notions of geometry to build well-defined comparison/similarity measures between the documents.

  • in multiple dimensions!!

Euclidean Distance

The ordinary, straight line distance between two points in space. Using document vectors \(y_a\) and \(y_b\) with \(j\) dimensions

Euclidean Distance

\[ ||y_a - y_b|| = \sqrt{\sum_{j=1}^{J}(y_{aj} - y_{bj})^2} \]

Can be performed for any number of features J ~ has nice mathematical properties

Euclidean Distance, example

Euclidean Distance

\[ ||y_a - y_b|| = \sqrt{\sum_{j=1}^{J}(y_{aj} - y_{bj})^2} \]

  • \(y_a\) = [0, 2.51, 3.6, 0] and \(y_b\) = [0, 2.3, 3.1, 9.2]

  • \(\sum_{j=1}^{J} (y_{aj} - y_{bj})^2\) = \((0-0)^2 + (2.51-2.3)^2 + (3.6-3.1)^2 + (0-9.2)^2\) = \(84.9341\)

  • \(\sqrt{\sum_{j=1}^{J} (y_{aj} - y_{bj})^2}\) = 9.22
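The same computation in a few lines of Python (a worked check of the formula, not library code):

```python
import math

y_a = [0, 2.51, 3.6, 0]
y_b = [0, 2.3, 3.1, 9.2]

# Straight-line distance across all J feature dimensions.
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_a, y_b)))
print(round(dist, 2))  # 9.22
```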

Exercise

Documents, W=2 features {yes, no}

Document 1 = “yes yes yes no no no” (3, 3)

Document 2 = “yes yes yes yes yes yes” (6,0)

Document 3= “yes yes yes no no no yes yes yes no no no yes yes yes no no no yes yes yes no no no” (12, 12)



  • Which documents will the euclidean distance place closer together?
  • Does it look like a good measure for similarity?
    • Doc 3 = Doc 1 * 4

Cosine Similarity

Euclidean distance rewards magnitude, rather than direction

\[ \text{cosine similarity}(\mathbf{y_a}, \mathbf{y_b}) = \frac{\mathbf{y_a} \cdot \mathbf{y_b}}{\|\mathbf{y_a}\| \|\mathbf{y_b}\|} \]

Unpacking the formula:

  • \(\mathbf{y_a} \cdot \mathbf{y_b}\) ~ dot product between vectors

    • projecting common magnitudes
    • measure of similarity (see textbook)
    • \(\sum_j{y_{aj}*y_{bj}}\)
  • \(||\mathbf{y_a}||\) ~ vector magnitude, length ~ \(\sqrt{\sum{y_{aj}^2}}\)

  • normalizes similarity by the documents’ length ~ independent of document length because it deals only with the angle of the vectors

  • cosine similarity captures some notion of relative direction (e.g. style or topics in the document)

Cosine Similarity

Cosine function has a range between -1 and 1.

  • Consider: cos(0°) = 1, cos(90°) = 0, cos(180°) = -1
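Applying both measures to the exercise documents above makes the contrast concrete (a from-scratch sketch; in practice you would use scikit-learn’s `cosine_similarity`):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc1 = (3, 3)    # "yes yes yes no no no"
doc2 = (6, 0)    # "yes yes yes yes yes yes"
doc3 = (12, 12)  # doc1 repeated four times

# Euclidean distance separates doc1 from doc3, even though they use
# the two words in identical proportions...
print(euclidean(doc1, doc3))          # ~12.73
# ...while cosine similarity sees them as pointing in the same direction.
print(cosine_similarity(doc1, doc3))  # ~1.0
print(cosine_similarity(doc1, doc2))  # ~0.707
```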

Topic Models: LDA Approach

Intuition

Topic models are unsupervised statistical models: their purpose is to find hidden structure in a corpus, across many different documents

  • Capture words that are more likely to occur together across a set of documents.

  • Assign these words a probability of being part of a cluster (topic).

  • Assign documents a probability of being associated with these clusters.

    • Documents: emerge from probability distribution of topics

      • a speech can be 40% about trade, 30% about sports, 10% about health, and 20% spread across topics you don’t think make much sense
    • Topics: emerge from probability distributions over words

      • the topic health will have words like hospital, clinic, dr., sick, cancer

(Blei, 2012)

Language Model + Generative Probabilistic Modeling

  • Step 1: For each document:

    • Randomly choose a distribution over topics. That is, choose one of many multinomial distributions, each which mixes the topics in different proportions.
  • Step 2: Then, for every word position in the document

    • Randomly choose a topic from the distribution over topics from step 1.
    • Randomly choose a word from the distribution over the vocabulary that the topic implies.
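The two-step generative process above can be simulated directly. The vocabulary and topic-word distributions below are hand-made for illustration (not estimated from data):

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = np.array(["hospital", "doctor", "cancer", "tariff", "export", "trade"])

# Two made-up topics: each is a probability distribution over the vocabulary.
beta = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],  # a "health"-like topic
    [0.01, 0.02, 0.02, 0.30, 0.30, 0.35],  # a "trade"-like topic
])

# Step 1: draw the document's topic proportions from a Dirichlet prior.
theta = rng.dirichlet(alpha=[1.0, 1.0])

# Step 2: for each word position, draw a topic, then a word from that topic.
doc = []
for _ in range(10):
    z = rng.choice(2, p=theta)        # topic assignment for this position
    w = rng.choice(vocab, p=beta[z])  # word drawn from that topic's distribution
    doc.append(str(w))

print(theta, doc)
```

Inference runs this story in reverse: given only the words, recover \(\theta\) and \(\beta\).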

Inference: How to estimate these models?

Using the observed data, the words, we can estimate latent parameters. We start with the joint distribution implied by our language model (Blei, 2012):

\[ p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K}p(\beta_i)\prod_{d=1}^{D}p(\theta_d)\left(\prod_{n=1}^{N}p(z_{d,n}|\theta_d)p(w_{d,n}|\beta_{1:K},z_{d,n})\right) \]

To get to the conditional:

\[ p(\beta_{1:K}, \theta_{1:D}, z_{1:D}|w_{1:D})=\frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} \]

The denominator is intractable to compute (it requires integrating over every topic assignment for every word):

  • Simulate with Gibbs Sampling or Variational Inference.
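A minimal fitting example, assuming scikit-learn (whose `LatentDirichletAllocation` is estimated with variational inference); the toy corpus is made up, and real applications need far more documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus: two "health" documents, two "trade" documents.
corpus = [
    "hospital doctor cancer sick clinic",
    "clinic doctor hospital health sick",
    "trade tariff export market goods",
    "market trade export tariff goods",
]

# Bag-of-words document-feature matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Fit LDA with K=2 topics; rows of the output are per-document
# topic proportions (each row sums to 1).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(doc_topics.round(2))
```

`lda.components_` holds the (unnormalized) topic-word weights, from which the top words per topic are read off.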

Show me results!

Many different types of topic models

  • Structural topic model: allow (1) topic prevalence, (2) topic content to vary as a function of document-level covariates (e.g., how do topics vary over time or documents produced in 1990 talk about something differently than documents produced in 2020?); implemented in stm in R (Roberts, Stewart, Tingley, Benoit)

  • Correlated topic model: a way to explore between-topic relationships (Blei and Lafferty, 2007); implemented in topicmodels in R; possibly somewhere in Python as well!

  • Keyword-assisted topic model: seed topic model with keywords to try to increase the face validity of topics to what you’re trying to measure; implemented in keyATM in R (Eshima, Imai, Sasaki, 2019)

  • BERTopic: a topic modeling technique that leverages transformer embeddings and a class-based TF-IDF to create dense clusters of words.

Notebooks