PPOL 5203 - Data Science I: Foundations


Week 1: Introductions, Installations, IDEs, Command line

Professor: Tiago Ventura

Welcome to Data Science I

Plans for Today

  • Motivating Data Science for Public Policy.

  • Goals of the course

  • Introductions

  • Course Logistics

  • IDEs

    • Jupyter
  • Introduction to the command line

Motivation: Digital Information Age

Abundance of data

Powerful and Cheap Computing

World of Generative AI Models

As a consequence, we live in a world:

  • With an abundance of data we can use for research and to recommend policy decisions:

    • Internet data, social media, geo-tracking tools, etc…
  • With easily available technologies to analyze this data at scale

    • Your laptops, cloud computing, ChatBots…
  • But with new technologies that also have novel social implications:

    • Privacy issues

    • Use of technology by bad actors.

    • Use of technology by governments to censor/monitor citizens.

  • Policy scholars (but pretty much any researcher) need to be equipped to work properly with this data and to understand the effects of these new technologies.

Data Science for Public Policy

Data Science for Public Policy focuses on computational approaches to solving and understanding policy problems.

  • Social science

    • Understands human behavior: human psychology, language, economic behavior, political systems, policy problems
    • Involves many approaches: qualitative interviews, statistical analysis, simulations
  • Data Science:

    • Tools to work with large-scale data + learning models + novel data sources

Computational Social Science in Practice

Misinformation on WhatsApp

Data Donation WhatsApp Groups

All the steps + Tools .... so far ...

  • Step 1: Recruiting participants online (SS)

    • Online Panels + Facebook Ads
  • Step 2: Running online surveys (SS)

    • Qualtrics + R
  • Step 3: Development of data donation pipeline (with MDI) (DS/CS)

    • JavaScript (JS) + Python
  • Step 4: Analyze the data (SS/DS)

    • SQL + Python + R

Readings for this week

Goals of the course


The goal of this course is to teach you:

  • Computational thinking: how to approach problems and devise solutions from a computational perspective.

  • Get you started on Python, introduce some useful tools (scraping, text analysis, and API applications of GenAI models) and a bit of SQL for applied data science, and lay the foundations for the remainder of the core DS sequence.

  • Workflows and tools: Git/GitHub + command line.

Course Schedule

Week | Topic | Date
Week 01 | Introduction, Installations, IDEs, Command line | September 09, 2025
Week 02 | Version Control, Workflow and Reproducibility: Or a bit of Git & GitHub | September 16, 2025
Week 03 | Intro to Python - OOP, Data Types, Control Statements and Functions | September 23, 2025
Week 04 | Intro to Python II: Scaling up your code - Iteration, Comprehension and Functions | September 30, 2025
Week 05 | From Nested lists to Dataframes: Numpy and Intro to Pandas | October 07, 2025
Week 06 | Pandas II: Data Wrangling | October 14, 2025
Week 07 | Joining, Tidying and Visualizing Data | October 21, 2025
Week 08 | Scraping + APIs | October 28, 2025
Week 09 | Statistical Learning | November 04, 2025
Week 10 | Text as Data I: Discovery and Topics | November 11, 2025
Week 11 | Text as Data II: Supervised Learning | November 18, 2025
Week 12 | Generative AI: Classification, Surveys and Prompting (Invited Speaker - Dr. Patrick Wu, American University) | November 25, 2025
Week 13 | SQL | December 02, 2025
Week 14 | Presentations of Final Projects | December 09, 2025

Between Text-as-Data I and II… I might have some updates…

Introductions

Professor Tiago Ventura (he/him)

  • Assistant Professor at McCourt School.
  • Political Science PhD
  • Postdoc at Center for Social Media and Politics - NYU.
  • Researcher at Twitter.

My Research: Effects of technology on politics + applications of computational models to social science:

  • Global Social Media Deactivation.
  • Developing a data donation pipeline for WhatsApp data.
  • Measuring Humanness vs AI-Generated Content on Social Media.
  • Using LLMs to augment web-browsing data with synthetic data

Outside of work, I enjoy watching soccer, reading sci-fi and running

Quiz!

Which programming language did I use the most at each of these stages?

  • PhD

  • Postdoc

  • Twitter

  • As faculty

A comment from the pre-course survey (from last year)

Hi professor Ventura! I noticed that we gonna learn multiple data analysis tool this semester and I am definitely a novice of data science. I am little worried about how can I master all of them without being confused, because some commands might be very similar.

Your turn!

  • Name

  • (Briefly) what you were up to prior to the DSPP

  • If you could have any data source at your disposal, what would it be?

Logistics

  • Communication: via Slack. Join the workspace!

  • All materials: hosted on the class website: https://tiagoventura.github.io/ppol5203/

  • Syllabus: also on the website.

  • My Office Hours: Every Tuesday from 4 to 5pm. Just stop by!

  • Canvas: Only for official communication! Materials will be hosted on the website!

  • DataCamp: Additional exercises! Access our free account here

  • Task: go on Slack and send me a message about the data source you chose in your introduction and, if you feel comfortable, add a picture to your profile.

TA

  • Rebecca Wagner (DSPP Second-Year Student)
    • Email: rlw137@georgetown.edu
    • Office Hours:
      • in person: Every Monday 2pm, McCourt, Room 602
      • virtual: Every Tuesday 7pm. Zoom link

Evaluation

Assignment | Percentage of Grade
Participation/Attendance | 5%
Coding Discussion | 5%
Problem sets | 50%
Final Project | 40%
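
As a rough illustration of how these weights combine, here is a minimal sketch of the arithmetic. The weights come from the table above; the function name `final_grade` and the example scores are hypothetical and only there to show the calculation, not how grades are officially computed.

```python
# Grade weights taken from the Evaluation table (percent of the final grade).
WEIGHTS = {
    "Participation/Attendance": 5,
    "Coding Discussion": 5,
    "Problem sets": 50,
    "Final Project": 40,
}


def final_grade(scores):
    """Return the weighted average of component scores (each on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted_sum / sum(WEIGHTS.values())


# Hypothetical scores, for illustration only.
example_scores = {
    "Participation/Attendance": 100,
    "Coding Discussion": 90,
    "Problem sets": 85,
    "Final Project": 80,
}

print(final_grade(example_scores))  # prints 84.0
```

Because problem sets and the final project together carry 90% of the weight, they dominate the result in this sketch.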

Problem Sets

Individual submission through GitHub.

Assignment | Date Assigned | Date Due
No. 1 | Week 2 | Before EOD of Friday of Week 3
No. 2 | Week 4 | Before EOD of Friday of Week 5
No. 3 | Week 6 | Before EOD of Friday of Week 7
No. 4 | Week 8 | Before EOD of Friday of Week 9
No. 5 | November 10 | Before EOD of Friday of Week 11

EOD = 11:59pm!

Final Project

  • You will work in groups!

  • The project is composed of three parts:

    • A 2-page project proposal (which should be discussed with and approved by me),
    • An in-class presentation,
    • A 10-page project report.

Due dates and Points:

Requirement | Due | Length | Percentage
Project Proposal | October 31 | 2 pages | 5%
Presentation | December 09 | 10-15 minutes | 10%
Project Report | December 16 | 10 pages | 25%

GenAI

You are allowed to use GenAI in this class as you would use Google. This means:

  • Do not copy the responses from GenAI Chatbots – a lot of them are wrong or will just not run on your computer

  • Use GenAI Chatbots as an auxiliary source.

  • If your entire homework comes straight from GenAI Chatbots, I will consider it plagiarism.

  • If you use GenAI Chatbots, I ask you to mention in your code how ChatGPT worked for you.

Be mature and make smart decisions. You will not be able to cheat in a coding interview. Remember, you are a master's student now!

Let’s take a break!

Survey Results

Summary of the survey

  • Most of you have some experience with Python.

  • Very few of you were using primarily Python in your work before!

    • Most others are using R and Excel!
  • Most of you have Python on your laptops, but some still do not have a GitHub account. If you are having issues after today, talk to your TAs.

  • Main Policy Areas:

    • Social Media/Tech (Talk to me!)
    • Elections (Talk to Professor Warshaw, Bailey or Ladd)
    • Education (Talk to Professor Johnson)

Open Ended

  • This is a big group of international students, and some of you are concerned about not having English as your first language!

    • THIS IS FINE! I hope my non-native and strongly accented English will encourage you to participate in class and speak up!

“That’s just what translation is, I think. That’s all speaking is. Listening to the other and trying to see past your own biases to glimpse what they’re trying to say. Showing yourself to the world, and hoping someone else understands.” - R.F. Kuang, Babel

Open Ended

Some of you are slightly anxious about Python and the pace of the class

  • Python is definitely harder than Excel and Stata.

  • But you will be fine!

  • Our approach: We start slow, cover the basics, and move fast!

Transition: Coding!

Set up your course infrastructure

See Course Website

  • Install Python

  • Install Jupyter

  • Set up your Git/GitHub account

    • Homework: Try to make one successful Git push before class next week.
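
Once everything is installed, a quick sanity check can confirm that the tools are reachable from your command line. The sketch below is only an illustration: the command names in `TOOLS` are assumptions about a typical install (on Windows, for instance, Python may be exposed as `python` or `py` rather than `python3`), and `check_tools` is a hypothetical helper, not part of the course materials.

```python
import shutil
import sys

# Assumed command names for a typical install; adjust them if yours differ.
TOOLS = ["python3", "jupyter", "git"]


def check_tools(tools=TOOLS):
    """Map each command name to its location on PATH, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Report the version of the interpreter running this script.
    print(f"Running Python {sys.version.split()[0]}")
    for tool, path in check_tools().items():
        status = path if path else "NOT FOUND (see the course website)"
        print(f"{tool}: {status}")
```

Save this as a `.py` file and run it from the terminal, or paste it into a Jupyter cell. Any tool reported as not found is worth sorting out before next week.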

Jupyter:

Jupyter Notebook Tutorial on the Class Website

Note on my approach to notebooks: I will go through the notebooks quite quickly. You should run them on your own at a later point!

Command Line

Command Line Tutorial on the Class Website

DataCamp Course

Additional Materials: Quarto

See Quarto Notebook on the Class Website

See you next week!