Week 1: Introduction and Logistics
Introduction (me)
Motivation for Computational Linguistics
Digital information age
Principles of Computational Linguistics.
What this course is not.
Examples of models and applications for this course
Introductions (you)
Class Logistics ( + 10 min for you to read through the syllabus)
Q&A
Acquiring text in the web (Jupyter notebooks for scrapping)
Professor Tiago Ventura (he/him)
Some Projects I am involved
Outside of work, I enjoy watching soccer, reading sci-fi and running
For many years, social scientists use text in their analysis
Mostly through in-depth reading of documents.
Close Reading. Humans are great at this!
Digital Revolution:
This class covers methods (and many applications) of using text as data to answer social science problems and test social science theories
Computational Linguistics ~ Distant Reading. Computers are better at understanding patterns, classify and describe content across millions of documents.
Principle 1: Social science theories and substantive knowledge are essential for research design
Principle 2: Text analysis does not replace humans—it augments them.
Principle 3: Building, refining, and testing social science theories requires iteration and cumulation.
Principle 4: Text analysis methods distill generalizations from language. (all models are wrong!)
Principle 5: The best method depends on the task. (Qualitative knowledge)
Principle 6: Validations are essential and depend on the theory and the task
From Gentzkow et al 2017:
sample of documents, each \(n_L\) words long, drawn from vocabulary of \(n_V\) words.
The unique representation of each document has dimension \(n_{V}^{n_L}\) .
Most of what you learned in statistics so far does not equip you to deal with this curse of dimensionality.
Acquire textual data:
Map Documents to a numerical representation M
Map M to predicted values \(V^{*}\) of unknown outcomes V
Use \(V^{*}\) in subsequent analysis with other data sources
Descriptive inference: how to convert text to matrices, vector space model, bag-of-words, dissimilarity measures, diversity, complexity, style.
Supervised techniques: dictionaries, classication, scaling, machine learning approaches.
Unsupervised techniques: clustering, topic models, embeddings.
Special topics: Word embeddings and Large Language Models.
Measure text-reuse across thousands of bills from U.S. state legislatures
Estimate levels of toxicity of comments from stremming chats platforms during political debates
Measure how out-group negative makes things go viral on social media
Estimate ideological positions using who a user follows on Twitter, what a user share on social media, political manifestos, or just asking ChatGPT to pair-wise compare politicians
Data acquisition: no scrapping in class. Assume you have learned already.
Regular expressions and basic text manipulation.
CS Stuff: machine translation, OCR, POS, entity recognition.
Name & pronouns
Why are you taking this course?
Your experience (if any) working with text
The most interesting thing you learned in the DSPP so far
10:00
Assume you all have a intro course in statistics and probability (which I know you do)
Math: Basic knowledge of calculus, probability, densities, distributions, statistical tests, hypothesis testing, the linear model, maximum likelihood and generalized linear models is assumed.
Programming: Functional knowledge of R - main programming language of the course. Some Python at the end.
R is excellent for text analysis, and for some social science applications, better than Python
Free, and massive online community writing packages and extending modeling capabilities.
We will divide our learning between using tidytext
and quanteda
for text analysis.
Download RStudio IDE!
I designed this course as PhD style seminar:
So far, you learned a lot of DS techniques (DS I, DS II, DS III)
You haven’t dig deep enough in a particular field. That’s what electives are for!
Heavy on readings - Lot’s of applied and technical readings.
Do the readings before class
Substantive readings are especially important, because they’ll help you understand what an interesting question looks like – in social science/public policy.
Plan ahead – particularly for the replication exercise
If you have a corpus you want work with, please bring it to class!
This is a one meeting per week class. You should expect:
Between 1h-1.5h of lecture based on this week topics + readings
Your participation in the lecture is expected I will ask your insights about the readings.
Break (10min)
Coding.
Communication: via slack. Join the channel!
All materials: hosted on the class website: https://tiagoventura.github.io/PPOL_6801_2024//
Syllabus: also on the website.
My Office Hours: Every Tuesday from 4 to 6pm. Just stop by!
Canvas: Only for communication! Materials will be hosted in the website!
Assignment | Percentage of Grade |
---|---|
Participation/Attendance | 10% |
Problem Sets | 20% |
Replication Exercises | 30% |
Final Project | 40% |
Active involvement during class sessions, fostering a dynamic learning environment.
Contributions made to your group’s ultimate project.
Assisting classmates with slack questions, sharing interesting materials on slack, asking question, and anything that provides healthy contributions to the course.
Assignment | Date Assigned | Date Due |
---|---|---|
No. 1 | Week 4 | Before EOD of Week 5’s class |
No. 2 | Week 7 | Before EOD of Week 8’s class |
No. 3 | Week 9 | Before EOD of Week 10’s class |
You will have a week to complete your assignments
individual assignment
distributed through github
Opportunity to learn how science is made!
Work in randomly assigned pairs I will post on Slack.
Step 1: finding a paper to replicate (from the syllabus)
By the end of the week 2 and week 7, you should select an article from the syllabus to be replicated by your team.
Inform the class on slack
“first come, first served”
Step 2: Acquiring the Data
Step 3: Presentation (weeks 6 and 11)
Step 4: Replication Repository on Github
The project is composed of three parts:
Requirement | Due | Length | Percentage |
---|---|---|---|
Project Proposal | EOD Friday Week 9 | 2 pages | 5% |
Presentation | Week 14 | 10-15 minutes | 10% |
Project Report | Wednesday Week 15 | 10 pages | 25% |
You are allowed to use ChatGPT as you would use google in this class. This means:
Do not copy the responses from chatgpt – a lot of them are wrong or will just not run on your computer
Use chatgpt as a auxiliary source.
If your entire homework comes straight from chatgpt, I will consider it plagiarism.
If you use chatgpt, I ask you to mention on your code how chatgpt worked for you.
As a review, here are some notebooks I developed for Data Science I introducing a full toolkit for acquiring data in the web:
Text-as-Data
Social Media