PPOL 5203 - Data Science I: Foundations

Week 12: Generative AI: LLMs

Professor: Tiago Ventura

Motivation

  • We are going through a unique technological revolution driven by the use of artificial intelligence.

  • Most of the recent changes come from the so-called Large Language Models (LLMs)

    • They are everywhere — ChatGPT, Claude, Gemini, Copilot, LLaMA.
    • Used for writing, coding, summarizing, policy briefs, and internet search.
    • They’re reshaping education, research, and communication.

Most importantly: research tasks that until very recently required sophisticated ML models, heavy dictionaries, or a lot of human input can now be performed with comparable performance by pre-trained LLMs

  • Classification Tasks (sentiment, political stance)

  • Unsupervised tasks (topic modeling and clustering)

  • Data generation (summarization, generation of synthetic text and images)

  • Ideological scaling

  • Image analysis, object identification, OCR…

Plans for Today

  • Early Language Models: N-Grams

  • Neural Language Models (LMs)

  • LARGE Language Models (Transformers)

  • Outsourcing Content Analysis Tasks with GenAI LLMs

  • Some coding applications

    • Text Classification

    • Synthetic Survey Responses

    • If we have time, I can show you one more application out of these two options:

      • Analyzing open-ended survey questions

      • Image processing and detecting elements on images with LLMs

What is a language model?

Language model

Language Modeling is the task of predicting what word comes next after a sequence of words

  • the students opened their ______

    • books

    • laptop

    • exams

    • minds

  • More formally, given a sequence of words \(x_1\), \(x_2\), …, \(x_t\), compute the probability distribution of the next word \(x_{t+1}\)

  • Where \(x_{t+1}\) is any word in the vocabulary

  • A system that does this is called a Language Model

Early Language Models: N-grams

  • Definition: An n-gram is a chunk of n consecutive words.

    • unigrams: “the”, “students”, “opened”, “their”
    • bigrams: “the students”, “students opened”, “opened their”
    • trigrams: “the students opened”, “students opened their”
    • four-grams: “the students opened their”
  • Idea: Collect statistics about how frequent different n-grams are and use these to predict next word.

A little bit of math…

Starting from a simple conditional probability

\[ P(A|B)= \frac{P(A, B)}{P(B)} \]

Suppose we are learning a 4-gram Language Model

  • students opened their _______

Assume A={“book”}, B={“the students opened their”},

\[ P(book|\text{the students opened their}) \sim \frac{ \text{counts}( \text{the students opened their book})}{\text{counts} (\text{the students opened their})} \]

For example, suppose that in the corpus:

  • “students opened their” occurred 1000 times

  • “students opened their books” occurred 400 times

    • P(books | students opened their) = 0.4
  • “students opened their exams” occurred 100 times

    • P(exams | students opened their) = 0.1
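
A minimal sketch of this estimate in Python, using the toy counts above:

    # Toy 4-gram estimates from the illustrative counts above
    count_context = 1000   # counts("students opened their")
    count_books = 400      # counts("students opened their books")
    count_exams = 100      # counts("students opened their exams")

    p_books = count_books / count_context   # 0.4
    p_exams = count_exams / count_context   # 0.1
    print(p_books, p_exams)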

N-gram language models in practice

today the ____

  • Get a probability distribution over words

    • company 0.153
    • bank 0.153
    • price 0.077 ….
  • Sample a word

today the company

  • Get a probability distribution over words

    • of 0.30
    • the 0.08
    • for 0.077 ….
  • Sample a word… and repeat
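
A minimal sketch of this count-and-sample loop, assuming a tiny made-up corpus and a bigram model for simplicity:

    import random
    from collections import defaultdict, Counter

    # Tiny illustrative corpus (invented for the example)
    corpus = ("today the company reported record profits and "
              "the company raised its forecast for the year").split()

    # Count bigram continuations: previous word -> Counter of next words
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def sample_next(word):
        """Sample the next word from the counts-based distribution."""
        counter = bigrams[word]
        if not counter:                  # unseen context: stop (a real model would back off)
            return None
        words = list(counter.keys())
        return random.choices(words, weights=list(counter.values()), k=1)[0]

    # Generate text: start from a word, sample, append, and repeat
    text = ["today"]
    for _ in range(8):
        nxt = sample_next(text[-1])
        if nxt is None:
            break
        text.append(nxt)
    print(" ".join(text))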

Large Language Models

  • As you can imagine, n-grams are not great: they cannot capture long-distance dependencies, the number of n-grams explodes with larger texts and vocabularies, and they know nothing about meaning.

  • In the early 2000s, computer scientists started improving language models with large models (neural networks) and large data (the internet)

Fixed Window Neural Language Model

output distribution

\[\hat{y} = \mathrm{softmax}(U h + b_2) \in \mathbb{R}^{|V|}\]

hidden layer

\[h = f(W e + b_1)\]

concatenated word embeddings

\[e = [e^{(1)}; e^{(2)}; e^{(3)}; e^{(4)}]\]

words / one-hot vectors

\[x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}\]
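
A minimal PyTorch sketch of this fixed-window model; the layer sizes, vocabulary size, and class name are illustrative, not from the slides:

    import torch
    import torch.nn as nn

    class FixedWindowLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, window=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)     # E: word id -> embedding e
            self.W = nn.Linear(window * emb_dim, hidden_dim)   # h = f(W e + b1)
            self.U = nn.Linear(hidden_dim, vocab_size)         # y_hat = softmax(U h + b2)

        def forward(self, window_ids):                         # window_ids: (batch, window)
            e = self.embed(window_ids).flatten(start_dim=1)    # concatenate the window embeddings
            h = torch.tanh(self.W(e))                          # hidden layer
            return torch.softmax(self.U(h), dim=-1)            # distribution over |V|

    model = FixedWindowLM(vocab_size=10_000)
    x = torch.randint(0, 10_000, (1, 4))    # ids for, e.g., "the students opened their"
    print(model(x).shape)                   # torch.Size([1, 10000])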

Recurrent Neural Network (RNN)

Def: a neural network architecture that processes sentences sequentially, with hidden states that carry information over time.

output distribution

\[\hat{y}^{(t)} = \mathrm{softmax}\left(U h^{(t)} + b_2\right) \in \mathbb{R}^{|V|}\]

hidden states

\[h^{(t)} = \sigma\left(W_h h^{(t-1)} + W_e e^{(t)} + b_1\right)\]

\[h^{(0)} \text{ is the initial hidden state}\]

word embeddings

\[ e^{(t)} = E x^{(t)} \]

words / one-hot vectors

\[x^{(t)} \in \mathbb{R}^{|V|}\]
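
The same recurrence written as a minimal PyTorch sketch (dimensions and names are illustrative):

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)              # e_t = E x_t
            self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)    # h_t from h_{t-1} and e_t
            self.U = nn.Linear(hidden_dim, vocab_size)                  # y_hat_t = softmax(U h_t + b2)

        def forward(self, token_ids):                  # token_ids: (batch, seq_len)
            e = self.embed(token_ids)
            h, _ = self.rnn(e)                         # hidden state carried left to right
            return torch.softmax(self.U(h), dim=-1)    # one distribution per position

    model = RNNLM(vocab_size=10_000)
    x = torch.randint(0, 10_000, (1, 7))    # a 7-token sequence
    print(model(x).shape)                   # torch.Size([1, 7, 10000])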

Challenges with pre-transformer models

  • Bottleneck problem: RNNs are unrolled from left to right, which makes it hard to capture long-term dependencies

  • Recency bias: nearby words end up mattering more because their information was incorporated most recently

  • Gradients are unstable (vanishing or exploding) because backpropagation multiplies long chains of derivatives across time steps

  • Non-parallelizable: RNNs handle text sequentially, so you cannot really speed things up with GPUs

Transformers

Definition: a Neural Network with a specific structure that includes a mechanism called self-attention.

Publication: first introduced in the paper Attention is All You Need by a group of researchers from Google Brain.

Usage: the core architecture behind most recent developments in Natural Language Processing, including recent LLMs.

Notable aspects of Transformers (LLMs)

  • Encoder x Decoder: separates learning representations of the input (encoding) from generating text (decoding)

  • Attention + Contextual Knowledge: the REAL DEAL. Allows each word to focus on the most relevant words in a particular sequence. Different vectors for “tower”, “Eiffel Tower”, or “beer tower”.

  • Parallelization: All tokens are processed simultaneously rather than sequentially. This means we can process words in parallel via GPUs!!

  • Training via Masked Attention: masked inputs allow training bi-directional models that look at all words!
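
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention; real transformers stack many such layers with multiple heads, and the dimensions here are illustrative:

    import math
    import torch

    def self_attention(X, Wq, Wk, Wv):
        """X: (seq_len, d_model). Every token attends to every other token."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / math.sqrt(K.shape[-1])   # how relevant each word is to each other word
        weights = torch.softmax(scores, dim=-1)     # attention weights; each row sums to 1
        return weights @ V                          # context-aware representation of each token

    d_model = 16
    X = torch.randn(5, d_model)   # e.g., embeddings for "the students opened their books"
    Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)      # torch.Size([5, 16])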

Neural Networks - Transformers (2017)

Large Language Models

  • Same intuition as n-gram language models: next-word prediction, using only the decoder part of the transformer!

    • But using:
      • Billions or trillions of parameters.
      • Trained on petabytes of text (the internet, books, code, websites, pretty much everything you can imagine)
      • Trained with self-attention and masking
  • After training, these models learn word representations and language generation.

    • Then they are fine-tuned to follow instructions from humans (either via Reinforcement Learning from Human Feedback or Direct Preference Optimization)
  • There are two ways to use LLMs

    • Out of the box: using the model without updating any of its parameters
    • Fine-tune: further training the model with gradient-based updates

Out of the Box LLMs with Prompts

Prompts: instructions we give to a language model in natural language

  • Prompting is based on contextual generation
  • Given the prompt as context, the LLM generates the next token according to its token probabilities, conditioned on the prompt

Structure of a Prompt

  • Instruction: a specific task or instruction you want the model to perform

  • Context: additional information or context that can steer the model to better responses

  • Input data: the input or question that we are interested in finding a response to

  • Output indicator: the type or format of the output

  • Of course, you don’t need all four elements, but prompts that include all four are often more effective
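
A minimal sketch of assembling a prompt with all four elements (the wording is invented for illustration):

    instruction = "Classify the sentiment of the review below."
    context = "The review was posted on a university course evaluation site."
    input_data = 'Review: "The lectures were engaging and well paced."'
    output_indicator = "Answer with a single word: positive, neutral, or negative."

    prompt = "\n".join([instruction, context, input_data, output_indicator])
    print(prompt)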

Prompt Engineering Examples

  • Zero Shot: Classify the sentiment of the following review:

  • Few Shot: Given these examples, classify the sentiment of the following review:

  • Role: Acting as a crowdworker, classify the sentiment of the following review:

  • Chain-of-thought: guiding the language model through a series of connected logical steps or thoughts before the final answer

Example:
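
For instance, a chain-of-thought prompt (invented here for illustration) might look like:

    Prompt: "Consider the statement: 'Taxes on the wealthy should increase.' A
    respondent strongly agrees with it. Think step by step: first describe the
    policy position the statement expresses, then what agreeing with it implies,
    and only then classify the respondent as left-leaning, right-leaning, or
    neither. Explain your reasoning before giving the final label."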

Usage

All these applications are hard to do at scale through the web-browser versions of LLMs (such as ChatGPT).

Instead, we will access the models via their APIs (programmatic calls from R or Python) and ask them to perform certain tasks:

  • OpenAI API: https://openai.com/api/

  • Anthropic Claude API: https://www.claude.com/platform/api

  • Google Gemini API: https://ai.google.dev/
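
A minimal sketch of an API call with the OpenAI Python client; the model name is a placeholder, and the call assumes an OPENAI_API_KEY environment variable is set:

    from openai import OpenAI

    client = OpenAI()   # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o-mini",    # placeholder model name
        messages=[
            {"role": "system", "content": "You are a careful text-analysis assistant."},
            {"role": "user", "content": "Is the sentiment of this text positive, neutral, "
                                        "or negative? Text: I love this class."},
        ],
        temperature=0,
    )
    print(response.choices[0].message.content)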

Examples of LLMs application

LLMs: Classification

Rathje et al., “GPT is an effective tool for multilingual psychological text analysis”

  • Use LLMs with zero-shot prompting to measure psychological concepts (sentiment classification).

“Is the sentiment of this text positive, neutral, or negative? Answer only with a number: 1 if positive, 2 if neutral, and 3 if negative. Here is the text: [tweet, news headline or Reddit comment text]”

  • To see this as next-word prediction, think about this example:

    • P(w=positive|I love this class) > P(w=neutral|I love this class)
      • Is positive the most likely next word?
        • No… but it is more likely than the word negative
        • Or than other negative words that are close to it in the embedding space.
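
A sketch of applying the zero-shot prompt above to many texts; classify_sentiment is a hypothetical helper, and client is the OpenAI client created in the earlier snippet:

    def classify_sentiment(text, client, model="gpt-4o-mini"):
        """Zero-shot sentiment classification with a Rathje et al.-style prompt."""
        prompt = (
            "Is the sentiment of this text positive, neutral, or negative? "
            "Answer only with a number: 1 if positive, 2 if neutral, and 3 if negative. "
            f"Here is the text: {text}"
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    texts = ["I love this class", "The lecture starts at 3pm", "This homework is terrible"]
    labels = [classify_sentiment(t, client) for t in texts]   # e.g., answers like "1", "2", "3"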

LLMs: Synthetic Survey Data

Bisbee et al., “Synthetic Replacements for Human Survey Data? The Perils of Large Language Models”

  • Can LLMs be leveraged for public opinion research? Can LLMs be used to generate synthetic survey responses?

  • Prompt LLMs to:

    • First adopt various personas defined by demographic and political characteristics
    • Answer a battery of questions about feelings towards social and political groups
    • Features are taken from the American National Election Study

Prompts

“It is [YEAR]. You are a [AGE] year-old, [MARST], [RACETH] [GENDER] with [EDUCATION] making [INCOME] per year, living in the United States. You are [IDEO], [REGIS] [PID] who [INTEREST] pays attention to what’s going on in government and politics.”
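
A sketch of filling in the persona template programmatically; the profile values below are invented, not an actual ANES record:

    persona_template = (
        "It is {YEAR}. You are a {AGE} year-old, {MARST}, {RACETH} {GENDER} with "
        "{EDUCATION} making {INCOME} per year, living in the United States. "
        "You are {IDEO}, {REGIS} {PID} who {INTEREST} pays attention to what's "
        "going on in government and politics."
    )

    profile = {   # invented example values
        "YEAR": 2020, "AGE": 34, "MARST": "married", "RACETH": "Hispanic",
        "GENDER": "woman", "EDUCATION": "a college degree", "INCOME": "$60,000",
        "IDEO": "moderate", "REGIS": "a registered", "PID": "independent",
        "INTEREST": "sometimes",
    }

    print(persona_template.format(**profile))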

LLMs: Synthetic Browsing Data

  • My own early-stage work: using LLMs to augment browsing data

    System Prompt: This is text completion in English

    User Prompt: The url in triple quotations is a url that links to a news item. [URL] Without accessing the url in triple quotations, please provide a possible summary of the content of the url. Without going on the internet, produce the online news article based on this URL. Limit to three paragraphs, and return only the article. Please consider the domain, publication date, and keywords in the URL, when available.

    URL Example: https://www.bloomberg.com/news/articles/2024-09-21/chipmaker-qualcomm-is-said-to-explore-friendly-takeover-of-intel
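
A sketch of sending this system/user prompt pair through the same API client used earlier (the model name remains a placeholder):

    url = ("https://www.bloomberg.com/news/articles/2024-09-21/"
           "chipmaker-qualcomm-is-said-to-explore-friendly-takeover-of-intel")

    user_prompt = (
        "The url in triple quotations is a url that links to a news item. "
        f"'''{url}''' Without accessing the url in triple quotations, please provide "
        "a possible summary of the content of the url. Without going on the internet, "
        "produce the online news article based on this URL. Limit to three paragraphs, "
        "and return only the article. Please consider the domain, publication date, "
        "and keywords in the URL, when available."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[
            {"role": "system", "content": "This is text completion in English"},
            {"role": "user", "content": user_prompt},
        ],
    )
    print(response.choices[0].message.content)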

Coding!