PPOL 5203 - Data Science I: Foundations
Week 1: Introductions, Installations, IDEs, Command line
Welcome to Data Science I
Plans for Today
Motivating Data Science for Public Policy, or Computational Social Science.
Goals of the course
Introductions
Course Logistics
IDEs
- Jupyter
- Quarto
Introduction to the command line
Why are we here?
Rise of the digital information age
https://www.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.html
Powerful and cheap computing power
As a consequence:
Abundance of data we can use for research and governments can use to make better decisions
Novel research questions
New ways to answer old, long-standing research questions
New technologies also have social implications and can generate important policy questions.
Privacy issues
Use of technology by bad actors.
Use of technology by governments to censor/monitor citizens.
etc…
Policy scholars (and pretty much any researcher) need to be equipped to deal with these challenges properly.
Data Science for Public Policy
Data Science for Public Policy focuses on computational approaches to solving and understanding policy problems.
Part of a larger, newer field called computational social science
But with a stronger policy focus.
What is social science? It refers to a domain of study - social phenomena:
- Encompasses many scales: human psychology, language, economic behavior, political systems, policy problems
- Involves many approaches: qualitative interviews, statistical analysis, simulations
What is Data Science?
- Uses data (often large-scale) + algorithms to answer questions
An example: Data Donation WhatsApp Groups
All the steps + tools so far...
Step 1: Recruiting participants online
- Online Panels + Facebook Ads
Step 2: Running online surveys
- Qualtrics + R
Step 3: Development of data donation pipeline (with MDI)
- JavaScript (JS) + Python
Step 4: Analyze the data
- SQL + Python + R
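As a rough illustration of what "SQL + Python" can look like in Step 4, here is a minimal sketch using Python's built-in `sqlite3` module. The `messages` table, its columns, and the toy rows are hypothetical and invented for this example; they are not the actual schema or data of the WhatsApp data donation project.

```python
import sqlite3

# Hypothetical toy data: (group_id, sender_id, n_words) per donated message.
# These rows are invented for illustration only.
rows = [
    ("g1", "u1", 12),
    ("g1", "u2", 5),
    ("g2", "u3", 7),
    ("g1", "u1", 3),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (group_id TEXT, sender_id TEXT, n_words INTEGER)"
)
conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)

# SQL does the aggregation; Python receives the result as a list of tuples.
query = """
    SELECT group_id, COUNT(*) AS n_messages, SUM(n_words) AS total_words
    FROM messages
    GROUP BY group_id
    ORDER BY group_id
"""
summary = conn.execute(query).fetchall()
conn.close()

for group_id, n_messages, total_words in summary:
    print(group_id, n_messages, total_words)
```

The division of labor is the point of the sketch: the database handles filtering and aggregation, and Python (or R) picks up the much smaller summary for modeling and plotting.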
Readings for this week
Bit by Bit: Social Research in the Digital Age, by Matthew Salganik
Training Computational Social Science PhD Students for Academic and Non-Academic Careers - Written by me and some colleagues in academia, industry and non-profits
Goals of the course
The goal of this course is to teach you:
Computational thinking: how to approach problems and devise solutions from a computational perspective.
Get you started on Python and a bit of SQL for applied data science; lay the foundations for the remainder of the core sequence
Workflows and tools: Git/GitHub + command line.
Course Schedule
Introductions
About me
Professor Tiago Ventura (he/him)
- Assistant Professor at McCourt School.
- Political Science Ph.D.
- Postdoc at Center for Social Media and Politics at NYU.
- Researcher at Twitter.
Research Interests:
- Social media and politics
- Computational methods, NLP and LLMs
- Focus on Global South
Outside of work, I enjoy watching soccer and reading sci-fi.
Sometimes I enjoy soccer while working!
And I am from Brazil!
Quiz!
Which programming language did I use the most at each stage?
PhD
Postdoc
Twitter
As a Faculty
A comment from the pre-course survey (from last year)
Hi professor Ventura! I noticed that we gonna learn multiple data analysis tool this semester and I am definitely a novice of data science. I am little worried about how can I master all of them without being confused, because some commands might be very similar.
Your turn!
Name
(Briefly) what you were up to prior to the DSPP
If you could have any data source at your disposal, what would it be?
Logistics
Communication: via Slack. Join the workspace!
All materials: hosted on the class website: https://tiagoventura.github.io/ppol5203/
Syllabus: also on the website.
My Office Hours: Every Tuesday from 4 to 6pm. Just stop by!
Canvas: Only for official communication! Materials will be hosted on the website!
DataCamp: Additional exercises! I will assign modules for you! Access our free account here
TAs
- Aastha Jha (DSPP Second-Year Student)
- Email: aj935@georgetown.edu
- Office Hours:
- Every Wednesday, from 1pm to 2pm.
- Shirui Zhou (DSPP Alumni)
- Email: sz614@georgetown.edu
- Office Hours:
- Every Monday, from 1pm to 2pm
Evaluation
Assignment | Percentage of Grade |
---|---|
Participation/Attendance | 5% |
Coding Discussion | 5% |
Problem sets | 50% |
Final Project | 40% |
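To make the weights concrete, here is a small sketch of how a final grade could be computed from the table above. The function name, the 0-100 score scale, and the example scores are assumptions made for illustration; this is not the official grading script.

```python
# Weights taken from the evaluation table, as fractions of the final grade.
WEIGHTS = {
    "participation": 0.05,
    "coding_discussion": 0.05,
    "problem_sets": 0.50,
    "final_project": 0.40,
}


def final_grade(scores):
    """Return the weighted average of component scores (each assumed 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "participation": 100,
    "coding_discussion": 90,
    "problem_sets": 85,
    "final_project": 80,
}
print(round(final_grade(example), 2))
```

Since problem sets and the final project together carry 90% of the weight, those two components largely determine the result.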
Problem Sets
Individual submission through GitHub.
Assignment | Date Assigned | Date Due |
---|---|---|
No. 1 | Week 2 | Before EOD of Friday of Week 3 |
No. 2 | Week 4 | Before EOD of Friday of Week 5 |
No. 3 | Week 6 | Before EOD of Friday of Week 7 |
No. 4 | Week 8 | Before EOD of Friday of Week 9 |
No. 5 | November 10 | Before EOD of Friday of Week 11 |
EOD = 11:59pm!
Final Project
You will work in randomly assigned groups!
The project is composed of three parts:
- A 2-page project proposal (to be discussed with and approved by me),
- An in-class presentation,
- A 10-page project report.
Due dates and Points:
Requirement | Due | Length | Percentage |
---|---|---|---|
Project Proposal | October 31 | 2 pages | 5% |
Presentation | December 10 | 10-15 minutes | 10% |
Project Report | December 17 | 10 pages | 25% |
ChatGPT
You are allowed to use ChatGPT as you would use Google in this class. This means:
Do not copy the responses from ChatGPT: a lot of them are wrong or will just not run on your computer.
Use ChatGPT as an auxiliary source.
If your entire homework comes straight from ChatGPT, I will consider it plagiarism.
If you use ChatGPT, I ask you to mention in your code how ChatGPT worked for you.
Be mature and make smart decisions. You will not be able to cheat on a coding interview. Remember, you are a master's student now!
Let’s take a break!
Survey Results
Summary of the survey
72% of you have some experience with Python.
Only three of you were using primarily Python in your work before!
- Most others are using R and Excel!
You all have Python on your laptops, but some of you still do not have a GitHub account. If you are still having issues after today, talk to your TAs.
Main Policy Areas:
- Social Media/Tech (Talk to me!)
- Election (aha! great timing for it!)
- Education (Talk with Professor Johnson)
Open Ended
- Most of you are worried or slightly anxious that Python is hard and that you will not be able to keep up.
Python is definitely harder than Excel and Stata.
But you will be fine!
Our approach: We start slow, cover the basics, and move fast!
Transition: Coding!
Set up your course infrastructure
Jupyter:
See Jupyter Notebook on the class website
Note on my approach to notebooks: I will go through the notebooks quite quickly. You should run them by yourselves at a later point!
Command Line
See Command Line Tutorial on the class website
Quarto
See Quarto Notebook on the class website