PPOL 5203 - Data Science I: Foundations

Week 10: Introduction to Models and Statistical Learning

Professor: Tiago Ventura

2024-11-12

Where we are…

  • We started with the data science workflow (git) and best practices

  • We moved over to the primitives of Python as your main DS tool

  • Learned how to work with tabular data using NumPy, pandas, and plotnine

  • Then we moved to unstructured data sources:

    • Collect/parse digital data (websites, APIs, and dynamic websites)
  • Today: we will learn how to build statistical models in Python.

    • Intro to DS II.

    • Foundational DS Knowledge.

    • Basis for the last two lectures on working with textual data.

Plans for Today

  • Introduction to Machine Learning

    • Statistical Learning

    • Inferential Models vs Predictive Models

    • Machine Learning:

      • Supervised vs Unsupervised
      • Training models
      • Bias and Variance Trade-off
    • Coding

Logistics

  • Your problem set 3 has been graded.

    • Comments on GitHub

    • Grades on Canvas

  • Problem set 5 will be posted today and is due next Friday, Nov 22.

  • Your project proposal is due this Friday, Nov 15.

    • If you have not done so yet, remember that we should meet to discuss your draft before submission.

    • Ideally at office hours today!

Introduction to Machine Learning

What is a Model?

A model is “a simplified representation of the reality created to serve a purpose.” (Provost & Fawcett 2013)

  • As social scientists, we are often interested in answering complex questions about human behavior. To answer them, we build models (theoretical and statistical).

  • By definition:

    • Models are always a simplification of reality.

      • We simplify so that we can understand and generalize a complex social process.
    • Models’ components come from assumptions we make about the world.

    • Models are all wrong, but some are useful.

A Statistical Model

The aim of statistical models is to estimate the relationship between the outcome and some set of variables:

\[ y = f(X) + \epsilon\]

Where:

  • \(y\) is the outcome/dependent/response variable

  • \(X\) is a matrix of predictors/features/independent variables

  • \(f()\) is some fixed but unknown function mapping X to y. The “signal” in the data

  • \(\epsilon\) is some random error term. The “noise” in the data.
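To make the notation concrete, here is a minimal simulation sketch of \(y = f(X) + \epsilon\). The function `f`, the sample size, and the noise level are all assumptions chosen for illustration; with real data \(f\) is unknown and we only observe `X` and `y`.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility

def f(x):
    """Hypothetical 'true' signal; in real data this function is unknown."""
    return 2.0 + 3.0 * np.sin(x)

n = 1_000
X = rng.uniform(0, 6, size=n)          # a single predictor
epsilon = rng.normal(0, 1.0, size=n)   # random noise with mean 0 and sd 1
y = f(X) + epsilon                     # observed outcome = signal + noise

# The gap between y and the signal is exactly the noise term
residual = y - f(X)
print("mean of noise:    ", round(float(residual.mean()), 3))
print("variance of noise:", round(float(residual.var()), 3))
```

Because the simulation controls `f`, we can separate signal from noise directly. In applied work only the sum is visible, which is why \(f\) has to be estimated.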

Statistical Learning

Statistical learning refers to a set of methods/approaches for estimating \(f(.)\)

\[ \hat{y} = \hat{f}(X)\]

  • Where \(\hat{f}(X)\) is an approximation of the “true” functional form, \(f(X)\)

  • \(\hat{y}\) is the predicted value of the true outcome \(y\).

Reducible vs. Irreducible Error

  • When we build models, the aim is to find a \(\hat{f}(X)\) that minimizes the error in the model.

\[E(y - \hat{y})^2 = E[f(X) + \epsilon - \hat{f}(X) ]^2\]

\[= \underbrace{E[f(X) -\hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{var(\epsilon)}_{\text{Irreducible}}\]

  • The “reducible” error is the gap between our estimate \(\hat{f}(X)\) and the true signal \(f(X)\). We can reduce this error by using different functional forms, better data, or both.

  • The “irreducible” error is associated with the random noise around \(y\).

  • Statistical learning is concerned with minimizing the reducible error.

  • Our predictions will never be perfect given the irreducible error.
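The decomposition can be checked numerically. The sketch below assumes a hypothetical linear `f` and a noise standard deviation of 0.5 (both made up for illustration). It compares an "oracle" that predicts with the true `f` against a deliberately poor \(\hat{f}\) that predicts a constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true signal (unknown in practice)."""
    return 1.0 + 2.0 * x

n = 100_000
noise_sd = 0.5
X = rng.uniform(-1, 1, size=n)
y = f(X) + rng.normal(0, noise_sd, size=n)

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

# Oracle: predict with the true f, so only the irreducible error remains
mse_oracle = mse(y, f(X))

# A deliberately poor f-hat that ignores X and always predicts 1.0
f_hat_const = np.full(n, 1.0)
mse_const = mse(y, f_hat_const)
reducible_const = float(np.mean((f(X) - f_hat_const) ** 2))

print(f"var(epsilon)             = {noise_sd ** 2:.3f}")
print(f"MSE using the true f     = {mse_oracle:.3f}")
print(f"MSE using the constant   = {mse_const:.3f}")
print(f"reducible + var(epsilon) = {reducible_const + noise_sd ** 2:.3f}")
```

Even the oracle cannot push the error below roughly \(var(\epsilon)\), while the constant predictor pays that floor plus its reducible error.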

Inference (Social Science) vs Prediction (Machine Learning)

Two reasons we want to estimate \(f(\cdot)\):

  • Inference

    • Goal is interpretation

      • Which predictors are associated with the response?
      • What is the relationship (parameters) between the response and the predictors?
      • Is the relationship causal?
    • Key limitation:

      • Functional forms that are easy to interpret (e.g. lines) might be far away from the true functional form of \(f(X)\).
    • Classic Social Science Approach:

      • Example: Which sociodemographic features correlate with voter turnout? Education? Gender? Age?

\[ y_i = \beta_1 \cdot education_i + \beta_2 \cdot gender_i + \epsilon_i \]
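As an illustration of the inferential approach, the sketch below fits the turnout regression by ordinary least squares on simulated data. Everything about the data is an assumption: the values are generated, not real, the "true" coefficients are made up, the outcome is treated as a continuous turnout score, and an intercept is added that the equation above omits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated (not real) predictors: years of education and a 0/1 gender indicator
education = rng.integers(8, 21, size=n).astype(float)
gender = rng.integers(0, 2, size=n).astype(float)

# Hypothetical coefficients used only to generate the outcome
true_beta = np.array([0.10, 0.03, 0.05])   # intercept, education, gender
X = np.column_stack([np.ones(n), education, gender])
y = X @ true_beta + rng.normal(0, 0.2, size=n)

# Ordinary least squares: the coefficients that minimize squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, b in zip(["intercept", "education", "gender"], beta_hat):
    print(f"{name:>10}: {b: .3f}")
```

The object of interest here is `beta_hat` itself: each coefficient is read as the association between a predictor and the outcome, holding the other predictor fixed.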

Inference (Social Science) vs Prediction (Machine Learning)

  • Prediction

    • Goal is to predict values of the outcome, \(\hat{y}\)

    • \(\hat{f}(X)\) is treated as a black box

      • model doesn’t need to be interpretable as long as it provides an accurate prediction of \(y\).
    • Key limitation:

      • Interpretation: it is difficult to know which variables are doing the heavy lifting and the exact influence of \(x\) on \(y\).
    • Example: given k features and all possible interactions between them, how likely is each voter i to turn out?

    • This is the machine learning tradition/predictive modeling/AI

\[ \hat{y}_i = \hat{f}(X_i) \]

Machine Learning

Supervised and Unsupervised Learning

  • Unsupervised Learning (DS III)

    • we observe a vector of measurements \(x_i\) but no associated response \(y_i\).

    • “unsupervised” because we lack a response variable that can supervise our analysis.

Supervised and Unsupervised Learning

  • Supervised Learning (DS II)

    • for each observation of the predictor measurement \(x_i\) there is an associated response measurement \(y_i\). In essence, there is an outcome we are aiming to accurately predict or understand.

    • Quantitative outcomes: Regression models ~ linear regression, penalized regression, generalized additive models

    • Qualitative Outcomes: Classification methods ~ logistic regression, naive Bayes, support vector machines, neural networks

Types of Models

Machine Learning: what does learning mean?

For inferential work (in your stats class), you get some data, and you estimate a model using the entire data. This is not what we do in machine learning.

  • Main objective: predict responses (from new data) accurately by learning from previously seen data

  • Goal:

    • Capture the signal \(f(X)\) as well as possible

    • Ignore noise (irreducible error)

  • Where does learning happen?

    • Model type

    • Features (variables)

    • Hyper-Parameters

A Machine Learning Pipeline

The purpose of training the model is to capture signal and ignore noise.

To do so:

  • Define an accuracy metric:

    • In the regression setting, the most common accuracy metric is mean squared error (MSE).
    • Mean Squared Error = \(\frac{\sum^N_{i=1} (y_i - \hat{f}(X_i))^2}{N}\)
  • Train-Test Split

    • Split your data between training and test

    • Training data: find the best model and tune the parameters of these models

    • Test data: assess the accuracy of your model.

  • Select the model with the smallest out-of-sample prediction error, measured on UNSEEN data.
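The three steps above can be sketched with NumPy alone. The data are simulated from a hypothetical sine-shaped signal, and the candidate models (polynomials of degree 1, 3, and 5) and the 75/25 split are assumptions made for illustration, not choices prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data with a hypothetical nonlinear signal
n = 400
X = rng.uniform(-3, 3, size=n)
y = np.sin(X) + rng.normal(0, 0.3, size=n)

def mse(y_true, y_pred):
    """Step 1: the accuracy metric (mean squared error)."""
    return float(np.mean((y_true - y_pred) ** 2))

# Step 2: train-test split. Shuffle row indices and hold out 25% as test data
idx = rng.permutation(n)
n_test = n // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Step 3: fit each candidate on the training data, score it on unseen data
results = {}
for degree in [1, 3, 5]:
    coefs = np.polyfit(X_train, y_train, deg=degree)
    results[degree] = {
        "train": mse(y_train, np.polyval(coefs, X_train)),
        "test": mse(y_test, np.polyval(coefs, X_test)),
    }
    print(f"degree {degree}: train MSE = {results[degree]['train']:.3f}, "
          f"test MSE = {results[degree]['test']:.3f}")

best_degree = min(results, key=lambda d: results[d]["test"])
print("selected degree:", best_degree)
```

The model is chosen by its test MSE, not its training MSE, because the training error only shows how well the model fits data it has already seen.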

Workflow: Inference vs ML

Challenge: avoid overfitting the data

Bias and Variance Trade-off

  • Bias: Systematic difference between \(Y\) and \(\hat{Y}\)

    • High Bias ~ Underfitting, rigid models

    • Low Bias ~ Overfitting, flexible models

  • Variance: Sensitivity to fluctuations in the training data

    • High Variance ~ huge error when new data is seen (usually associated with low bias)

    • Low Variance ~ small error when new data is seen (usually associated with high bias)

  • Trade-off: as we reduce bias, we risk overfitting. Training properly is key to finding a middle-of-the-road solution.
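One way to see the trade-off is to refit the same model on many fresh training sets and look at how its predictions behave. The sketch below does this for a rigid model (a line) and a flexible one (a degree-7 polynomial). The true signal, noise level, sample sizes, and polynomial degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    """Hypothetical true signal (unknown in practice)."""
    return np.sin(x)

x_grid = np.linspace(-2.5, 2.5, 50)   # points where predictions are compared
n_train, n_reps, noise_sd = 30, 500, 0.3

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit by simulation."""
    preds = np.empty((n_reps, x_grid.size))
    for r in range(n_reps):
        # Draw a fresh training set from the same data-generating process
        X = rng.uniform(-3, 3, size=n_train)
        y = f(X) + rng.normal(0, noise_sd, size=n_train)
        preds[r] = np.polyval(np.polyfit(X, y, deg=degree), x_grid)
    avg_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((avg_pred - f(x_grid)) ** 2))   # systematic miss
    variance = float(np.mean(preds.var(axis=0)))            # sensitivity to the sample
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in [1, 7]}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the line misses the curve in the same way on every training set (high bias, low variance), while the degree-7 polynomial tracks the curve on average but changes more from sample to sample (low bias, higher variance).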

MSE Training Data

Model Accuracy: Out-of-Sample Prediction

Cross-Validation

  • If you go back and forth between training and test to decide which model to use, you might also be over-fitting. This is a concept known as data leakage.

    • We can use re-sampling techniques to generate estimates for the test error.

    • “Re-sampling” involves repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

K-Fold Cross Validation

  • K-Fold Cross Validation involves randomly dividing the data into \(k\) groups (or folds). The model is trained on \(k-1\) folds, then tested on the remaining fold.

  • The process is repeated \(k\) times, each time holding out a different fold.

  • This yields \(k\) estimates of the test error, which we average to obtain the cross-validated error.
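The procedure can be written out by hand in a few lines of NumPy. This is a minimal sketch on simulated data: the sine-shaped signal, the choice of \(k = 5\), and the polynomial candidates are assumptions for illustration, and `k_fold_mse` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated data with a hypothetical nonlinear signal
n = 200
X = rng.uniform(-3, 3, size=n)
y = np.sin(X) + rng.normal(0, 0.3, size=n)

# Randomly divide the row indices into k folds of roughly equal size
k = 5
folds = np.array_split(rng.permutation(n), k)

def k_fold_mse(X, y, degree, folds):
    """Return the held-out MSE from each fold for a polynomial of given degree."""
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on the fold that was held out
        train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
        coefs = np.polyfit(X[train_idx], y[train_idx], deg=degree)
        pred = np.polyval(coefs, X[test_idx])
        fold_errors.append(float(np.mean((y[test_idx] - pred) ** 2)))
    return fold_errors

cv_scores = {}
for degree in [1, 3, 5]:
    errors = k_fold_mse(X, y, degree, folds)
    cv_scores[degree] = float(np.mean(errors))   # average the k estimates
    print(f"degree {degree}: CV MSE = {cv_scores[degree]:.3f}")
```

Every observation is used for testing exactly once, and the same folds are reused across candidate models so that their scores are comparable.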

Important Concepts

  • Inference vs Prediction

  • Prediction: black-box function, highly flexible to reduce error.

  • Learning: train the model on seen data. Test and calculate metrics on unseen data.

  • Avoid overfitting and data leakage by using resampling methods.

Coding!

https://tiagoventura.github.io/ppol5203/weeks/week-10.html