In the class today, we will learn about model in Python. We will cover:
Inferential models using statsmodels
Statistical learning with sklearn
In Data Science II, you will learn all there is to be learned about predictive modeling and machine learning. You will spend every week going through a different type of model and understanding the mathematics behind it.
The purpose of this class is to provide you with a agnostic overview of inferential and predictive workflow of building models in Python. These are my broader goals on having this class in DS I:
For inferential models, my goal is to show you how to use Python to estimate the models you are learning at Statistics I.
For the predictive modeling, I want to go give you a foundational introduction to Machine Learning. In case you get an interview for an internship before you start Data Science II, this class should make you feel confortable describing the components of a machine learning pipeline, even though you have not been properly introduced properly to each different model.
The predictive modeling components will also help you understand the next two lecture of this course focusing on modeling text data in Python.
Let's go!
We will work with the Diabetes dataset provided by the sklearn
library. Here you can find a full description of the dataset.
Some basic information:
Number of cases: 442
Number of variables: First 10 columns are numeric predictive values
Outcome (Target in ML): Column 11 is a quantitative measure of disease progression one year after baseline
Attribute Information:
age:
age in years
sex:
bmi:
body mass index
bp:
average blood pressure
s1 tc:
, total serum cholesterol
s2 ldl:
, low-density lipoproteins
s3 hdl:
, high-density lipoproteins
s4 tch:
, total cholesterol / HDL
s5 ltg:
, possibly log of serum triglycerides level
s6 glu:
, blood sugar level
# load dataset
import numpy as np
import pandas as pd
from sklearn import datasets
# load the dataset as a pandas dataframe
diabetes = datasets.load_diabetes(as_frame=True)["frame"]
# check the data
diabetes.head()
As we discussed in class, the goal of inferential models is interpretation. These are the models we often build when doing statistics in social science. We often ask:
To work with inference, we will use the library statsmodels
which will allow us to estimate and easily retrive parameters for a wide set of models.
# import library for models
import statsmodels.api as sm
import statsmodels.formula.api as smf
# library for plotting
from plotnine import *
# estimate the model using R style formula
model_ols_r = smf.ols(formula='target ~ age + sex + bmi + bp', data=diabetes).fit()
# or using a more pythonic way
X = sm.add_constant(diabetes[["age", "sex", "bmi", "bp"]])
y = diabetes[["target"]]
model_ols = sm.OLS(y, X).fit()
print(model_ols.summary())
model_ols_int= smf.ols(formula='target ~ age + sex + bmi + bp + age*bmi', data=diabetes).fit()
print(model_ols_int.summary())
The API for stats.model
is the same for different types of models. So the learning costs of estimating different models are quite low. Let's estimate a logit model now.
# estimate the model using R style formula
diabetes["target_bin"] = np.where(diabetes["target"]> np.mean(diabetes["target"]), 1, 0)
# model
model_logit = smf.logit(formula='target_bin ~ age + sex + bmi + bp', data=diabetes).fit()
# see output
print(model_logit.summary())
If we are interested in understanding relationships, a huge part of fitting inferential models consists on presenting the results as understanble quantities of interests. statsmodels
has a set of functions that allow us to easily analyze the model.
# get parameters
params = model_ols.params
pd.DataFrame(params).rename(columns={0:"coef"})
# confidence intervals
model_ols.conf_int().rename(columns={0:"lower", 1:"upper"})
# p-values
pd.DataFrame(model_ols.pvalues).rename(columns={0:"p-values"})
# function to get a tidy data frame with results
def tidy_ols(model):
"""
input: ols stats model
output: tidy pandas dataframe with models parameters
"""
# parameters
params = pd.DataFrame(model.params).rename(columns={0:"coef"})
# confidence intervals
coinf = model.conf_int().rename(columns={0:"lower", 1:"upper"})
# p-values
pvalues = pd.DataFrame(model.pvalues).rename(columns={0:"p-values"})
return pd.concat([params, coinf, pvalues], axis=1).reset_index()
# run the function
ols_tidy = tidy_ols(model_ols)
# see
ols_tidy.head()
# Plot
plot = (ggplot(ols_tidy, aes(y='coef', x='index'))
+ geom_point(aes(y='coef'), size=3, fill="red", color="")
+ geom_errorbar(aes(ymin="lower", ymax="upper"), width=0.01, size=2, alpha=.3)
+ theme_light()
)
plot
Time for you to practice.
Tips:
model.fittedvalues
geom_abline(intercept = 0, slope = 1, size = 0.5)
Read more here for other diagnostics: https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html
## add your code here.
As we discussed in the lecture, in the machine learning tradition, we are interest in predictive modeling, instead of understanding relationship. These are the key features on machine learning models:
The goal i is to predict values of the outcome, $\hat{y}$
Models are treated as a black box
the model doesn't need to be interpretable as long as it provides an accurate prediction of $y$.
The figure below from Jorge Cimentada's book Machine Learning for Social Science provides a nice summary of the traditional machine learning workiflow.
In words:
All these steps will be performed using the sklearn
library. The documentation for sklearn
is super rich. So I strongly encourage you to check their website and their tutorials: https://scikit-learn.org/stable/tutorial/index.html
# load dataset
import numpy as np
import pandas as pd
from sklearn import datasets
# load the dataset as a pandas dataframe
diabetes = datasets.load_diabetes(as_frame=True)["frame"]
# features
X = diabetes.drop(columns="target")
# target
y = diabetes["target"]
train_test_split
method:
# import function
from sklearn.model_selection import train_test_split
# split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.15,
random_state = 417)
# any missing?
X.isna().sum()
# stardandization?
X.describe().loc[["mean", "std"],:]
# any categorical?
X.dtypes
This dataset is pretty much clean for us, we can skip pre-processing steps.
If you want see more about pre-processing steps with skelearn
, check here: https://scikit-learn.org/stable/modules/preprocessing.html
IMPORTANT: The modeling API follows the same framework (despite the model). You can fit many different models with very similar code. Let's start with a simple OLS.
# import model
from sklearn.linear_model import LinearRegression
# Instantiate the modeling object
model = LinearRegression()
# Fit the model
model.fit(X_train,y_train)
# Fit statistic (metric for how wrong we are) -- R^2
model.score(X_test,y_test)
## if you want to calculate the mean squared error
from sklearn.metrics import mean_squared_error
# predicted values
yhat = model.predict(X_test)
print(yhat)
## estimate out of sample predictions
mean_squared_error(y_test, yhat)
Our performance with the simple linear regression was not great. Let's see if we get better using more complex models. We will try three different families of models. Again, you will learn about these models in DS II
# setup for models
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
# setup for metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# setup for dataset
from sklearn import datasets
# load the dataset as a pandas dataframe
diabetes = datasets.load_diabetes(as_frame=True)["frame"]
# features
X = diabetes.drop(columns="target")
# target
y = diabetes["target"]
# split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.15,
random_state = 417)
# linear model
# Instantiate the modeling object
model_ols = LinearRegression()
# Fit the model
model_ols.fit(X_train,y_train)
# Predict and calculate MSE in the test
y_pred_ols = model_ols.predict(X_test)
mse_ols = mean_squared_error(y_test, y_pred_ols)
# Initialize and train the Elastic Net model
# alpha and l1_ratio are hyper parameter (learning is here!)
elastic_net = ElasticNet(alpha=0.01, l1_ratio=1)
elastic_net.fit(X_train, y_train)
# Predict and calculate MSE in the test
y_pred_en = elastic_net.predict(X_test)
mse_en = mean_squared_error(y_test, y_pred_en)
# Initialize and train the SVM model
svm = SVR(kernel='linear') # Using linear kernel for demonstration
svm.fit(X_train, y_train)
# Predict and calculate MSE in the test
y_pred_svm = svm.predict(X_test)
mse_svm = mean_squared_error(y_test, y_pred_svm)
# Initialize and train the Decision Tree model
tree = DecisionTreeRegressor(random_state=0, max_depth=50)
tree.fit(X_train, y_train)
# Predict and calculate MSE in the test
y_pred_tree = tree.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)
This is really important: we always evaluate the models using out of sample, unseen data.
pd.DataFrame({"values":[mse_ols, mse_en,mse_svm, mse_tree],
"models":["ols", "elastic-net", "svm", "tree"]}).sort_values("values")
# linear model
# Instantiate the modeling object
model_ols = LinearRegression()
# Fit the model
model_ols.fit(X_train,y_train)
# predic and calculate MSE
y_pred_ols = model_ols.predict(X_train)
mse_ols = mean_squared_error(y_train, y_pred_ols)
# Initialize and train the Elastic Net model
# alpha is a hyper parameter (learning is here!)
elastic_net = ElasticNet(alpha=0.01, l1_ratio=1)
elastic_net.fit(X_train, y_train)
# Predict and calculate MSE
y_pred_en = elastic_net.predict(X_train)
mse_en = mean_squared_error(y_train, y_pred_en)
# Initialize and train the SVM model
svm = SVR(kernel='linear') # Using linear kernel for demonstration
svm.fit(X_train, y_train)
# Predict and calculate MSE
y_pred_svm = svm.predict(X_train)
mse_svm = mean_squared_error(y_train, y_pred_svm)
# Initialize and train the Decision Tree model
tree = DecisionTreeRegressor(random_state=0, max_depth=20)
tree.fit(X_train, y_train)
# Predict and calculate MSE
y_pred_tree = tree.predict(X_train)
mse_tree = mean_squared_error(y_train, y_pred_tree)
pd.DataFrame({"values":[mse_ols, mse_en,mse_svm, mse_tree],
"models":["ols", "elastic-net", "svm", "tree"]}).sort_values("values")
Why do tree-based models do so much better in the training data compared to the test data?
As you saw above, more advanced machine learning models have hyper-parameters.
Hyperparameters are parameters that are not learned directly from the data but are set in advance before the training process begins, bur their best-values can be learned with training. Examples include the learning rate in gradient descent in deep-learning, the depth of a decision tree, or the number of neighbors in a k-nearest neighbors algorithm.
The way we choose the values for hyper-parameters is by looking at the performance of the models.
If we were to find the best-values for hyperparameters looking always to the test-set, we would be pretty much training the model on the test set. Remember, we should always avoid this. The test set should be unseen data, and we should keep its integrity. This process is commonly described as data leakage
For this reason, when training machine learning models, we often rely on resampling methods, such as cross-validation. These methods allow us to fine tune our models by taking multiple samples of the training data, and training the model multiple times.
Read here a nice and shor introduction to cross-validation: https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right
# lets see the mechanics first
from sklearn.model_selection import train_test_split # Train-test split
from sklearn.model_selection import LeaveOneOut # Leave One Out Cross Validation
from sklearn.model_selection import KFold # K-fold Cross validation
# Intialize the K-Folds (splits)
kf = KFold(n_splits=5)
# Split the data
k_splits = kf.split(X_train)
# Let's look at the splits
n_models =0
for train, test in k_splits:
print(f"N Obs. to train on = {train.shape[0]}, N obs to test on = {test.shape[0]}") #
n_models += 1 # Count the number of models
print(f"Total numbers of models run = {n_models}")
# what the splits are? Index of the data
kf = KFold(n_splits=5)
# Split the data
k_splits = kf.split(X_train)
# containers to store the data
train_data = list()
val_data = list()
# iterate
for fold_number, (train_indices, val_indices) in enumerate(k_splits):
print(f"Fold {fold_number + 1}:")
# Split your data into training and validation sets using the indices
train_data.append([X_train.iloc[i] for i in train_indices])
val_data.append([X_train.iloc[i] for i in val_indices])
# see data
pd.DataFrame(train_data[2])
we could iterate over these with a loop....
Let's implement using sklearn
. We actually don't need a loop!
# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create my splits
# what the splits are? Index of the data
kf = KFold(n_splits=5)
# Split the data
k_splits = kf.split(X_train)
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
# Define hyperparameters to tune
params = {
'alpha': [0.1, 0.5, 1, 2, 5],
'l1_ratio': [0, 0.25, 0.5, 0.75, 1]
}
# elastic net
elastic_net = ElasticNet()
# Grid search with 5-fold cross-validation
grid_en = GridSearchCV(elastic_net, params, cv=kf, scoring='r2')
grid_en.fit(X_train, y_train)
# Best parameters and corresponding MSE
print("Best Elastic Net parameters:", grid_en.best_params_)
print("Best MSE:", grid_en.best_score_)
from sklearn.tree import DecisionTreeRegressor
params = {
'max_depth': [None, 2, 4, 6, 8, 10],
'min_samples_split': [2, 5, 10]
}
tree = DecisionTreeRegressor()
grid_tree = GridSearchCV(tree, params, cv=kf, scoring='r2')
grid_tree.fit(X_train, y_train)
print("Best Decision Tree parameters:", grid_tree.best_params_)
print("Best MSE:", grid_tree.best_score_)
from sklearn.neighbors import KNeighborsRegressor
params = {
'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
knn = KNeighborsRegressor()
grid_knn = GridSearchCV(knn, params, cv=kf, scoring='r2')
grid_knn.fit(X_train, y_train)
print("Best KNN parameters:", grid_knn.best_params_)
print("Best MSE:", grid_knn.best_score_)
# Select the best model based on cross-validation results
best_model = min([grid_en, grid_tree, grid_knn], key=lambda x: -x.best_score_)
# which model?
best_model.best_estimator_
# what else
best_model.best_params_
# error
best_model.best_score_
# Predict on the test set
y_pred = best_model.predict(X_test)
# Calculate the test set MSE
mse_test = r2_score(y_test, y_pred)
print(f"Test set MSE for the best model: {mse_test:.2f}")
All of this only touches the top of the iceberg on Machine Learning. If you want to go ahead and learn more before DS II, I would suggest you to:
Work through the tutorials on sklearn webpage
!jupyter nbconvert _week-9-models.ipynb --to html --template classic