Week 10: Introduction to Models and Statistical Learning
2024-11-12
We started with the data science workflow (Git) and best practices.
We then moved to the primitives of Python as your main DS tool.
We learned how to work with tabular data using NumPy, pandas, and plotnine.
Then we moved to unstructured data sources.
Today: we will learn how to build statistical models in Python.
Intro to DS II.
Foundational DS Knowledge.
Basis for the last two lectures on working with textual data.
Introduction to Machine Learning
Statistical Learning
Inferential Models vs Predictive Models
Machine Learning: Coding
Your problem set 3 has been graded.
Comments on GitHub
Grades on Canvas
Problem set 5 will be posted today; it is due next Friday, Nov 22.
Your project proposal is due this Friday, Nov 15.
If you have not done so yet, remember that we should meet to discuss your draft before submission.
Ideally at office hours today!
A model is “a simplified representation of the reality created to serve a purpose.” (Provost & Fawcett 2013)
As social scientists, we are often interested in answering complex questions about human behavior. To answer them, we build models (theoretical and statistical).
By definition:
Models are always a simplification of reality.
Models’ components come from assumptions we make about the world.
Models are all wrong, but some are useful (George Box).
The aim of statistical models is to estimate the relationship between the outcome and some set of variables:
\[ y = f(X) + \epsilon\]
Where:
\(y\) is the outcome/dependent/response variable
\(X\) is a matrix of predictors/features/independent variables
\(f(\cdot)\) is some fixed but unknown function mapping \(X\) to \(y\). The “signal” in the data
\(\epsilon\) is some random error term. The “noise” in the data.
Statistical learning refers to a set of methods/approaches for estimating \(f(\cdot)\)
\[ \hat{y} = \hat{f}(X)\]
Where \(\hat{f}(X)\) is an approximation of the “true” functional form, \(f(X)\)
\(\hat{y}\) is the predicted value of the true value \(y\).
\[ E(y - \hat{y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 \]
\[ = \underbrace{E[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{\text{var}(\epsilon)}_{\text{Irreducible}} \]
The “reducible” error is the systematic signal. We can reduce this error by using different functional forms, better data, or a combination of the two.
The “irreducible” error is associated with the random noise around \(y\).
Statistical learning is concerned with minimizing the reducible error.
Our predictions will never be perfect given the irreducible error.
Two reasons we want to estimate \(f(\cdot)\):
Inference
Goal is interpretation
Key limitation: interpretable models are usually less flexible, so their predictions tend to be less accurate.
Classic Social Science Approach:
\[ y_i = \beta_1 \cdot \text{education}_i + \beta_2 \cdot \text{gender}_i + \epsilon_i \]
Prediction
Goal is to predict values of the outcome, \(\hat{y}\)
\(\hat{f}(X)\) is treated as a black box
Key limitation: the fitted model is hard to interpret; we learn little about how each predictor relates to the outcome.
Example: given \(k\) features, and all possible interactions between them, how likely is each voter \(i\) to turn out?
This is the machine learning / predictive modeling / AI tradition.
\[ y_i = \hat{f}(X_i) + \epsilon_i \]
Unsupervised Learning (DS III)
we observe a vector of measurements \(x_i\) but no associated response \(y_i\).
“unsupervised” because we lack a response variable that can supervise our analysis.
Supervised Learning (DS II)
for each observation of the predictor measurement \(x_i\) there is an associated response measurement \(y_i\). In essence, there is an outcome we are aiming to accurately predict or understand.
Quantitative outcomes: regression models ~ linear regression, penalized regression, generalized additive models
Qualitative outcomes: classification methods ~ logistic regression, naive Bayes, support vector machines, neural networks
For inferential work (in your stats class), you get some data and estimate a model using the entire dataset. This is not what we do in machine learning.
Main objective: predict responses (from new data) accurately by learning from previously seen data
Goal:
Capture the signal, \(f(X)\), as well as possible
Ignore the noise (the irreducible error)
Where does the learning happen?
Model type
Features (variables)
Hyper-Parameters
The purpose of training the model is to capture signal and ignore noise.
To do so:
Define an accuracy metric (e.g., mean squared error for regression, accuracy for classification)
Train-Test Split
Split your data between training and test
Training data: used to fit candidate models and tune their parameters
Test data: used to assess the accuracy of your model
Select the model with the smallest out-of-sample predictive error, computed on UNSEEN data (see the sketch below)
Bias: the difference between \(y\) and \(\hat{y}\)
High Bias ~ Underfitting, rigid models
Low Bias ~ Overfitting, flexible models
Variance: Sensitivity to fluctuations in the training data
High Variance ~ huge error when new data is seen (usually associated with low bias)
Low Variance ~ small error when new data is seen (usually associated with high bias)
Trade-off: as we reduce bias, we risk overfitting. Training properly is key to finding a middle-of-the-road solution (see the sketch below).
If you go back and forth between training and test data to decide which model to use, you might also be overfitting. This concept is known as data leakage.
We can use re-sampling techniques to generate estimates for the test error.
“Re-sampling” involves repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
K-Fold Cross-Validation involves randomly dividing the data into \(k\) groups (or folds). The model is trained on \(k-1\) folds, then tested on the remaining fold.
The process is repeated \(k\) times, each time holding out a different fold.
This yields \(k\) estimates of the test error, which we average to compute the cross-validation error (see the sketch below).
Inference vs Prediction
Prediction: black-box function, highly flexible to reduce error.
Learning: train the model on seen data; test and calculate metrics on unseen data.
Avoid overfitting and data leakage by using resampling methods.
https://tiagoventura.github.io/ppol5203/weeks/week-10.html