PPOL 5203 Data Science I: Foundations

Inferential Models and Machine Learning

Tiago Ventura


Learning Goals

In class today, we will learn about models in Python. We will cover:

  • Inferential models using statsmodels

    • Ordinary Least Squares
    • Logistic Regression
    • retrieving parameters of interest
  • Statistical learning with sklearn

    • a workflow to rule them all
    • iterating through many models
    • model selection
    • cross-validation

Introductory notes

In Data Science II, you will learn all there is to be learned about predictive modeling and machine learning. You will spend every week going through a different type of model and understanding the mathematics behind it.

The purpose of this class is to provide you with an agnostic overview of the inferential and predictive workflows for building models in Python. These are my broader goals for having this class in DS I:

  • For inferential models, my goal is to show you how to use Python to estimate the models you are learning in Statistics I.

  • For predictive modeling, I want to give you a foundational introduction to Machine Learning. In case you get an interview for an internship before you start Data Science II, this class should make you feel comfortable describing the components of a machine learning pipeline, even though you have not yet been properly introduced to each different model.

  • The predictive modeling components will also help you understand the next two lectures of this course, which focus on modeling text data in Python.

Let's go!

Datasets

We will work with the Diabetes dataset provided by the sklearn library. Here you can find a full description of the dataset.

Some basic information:

  • Number of cases: 442

  • Number of variables: First 10 columns are numeric predictive values

  • Outcome (Target in ML): Column 11 is a quantitative measure of disease progression one year after baseline

  • Attribute Information:

    • age: age in years

    • sex:

    • bmi: body mass index

    • bp: average blood pressure

    • s1 tc: total serum cholesterol

    • s2 ldl: low-density lipoproteins

    • s3 hdl: high-density lipoproteins

    • s4 tch: total cholesterol / HDL

    • s5 ltg: possibly log of serum triglycerides level

    • s6 glu: blood sugar level

In [2]:
# load dataset
import numpy as np
import pandas as pd
from sklearn import datasets

# load the dataset as a pandas dataframe
diabetes = datasets.load_diabetes(as_frame=True)["frame"]

Notice:

  • ten features (independent variables)
  • one outcome (target variable)
In [3]:
# check the data
diabetes.head()
Out[3]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0

Inferential Models

As we discussed in class, the goal of inferential models is interpretation. These are the models we often build when doing statistics in social science. We often ask:

  • Which predictors are associated with the response?
  • What is the relationship (parameters) between the response and the predictors?
  • Is the relationship causal?

To work with inference, we will use the library statsmodels, which will allow us to estimate and easily retrieve parameters for a wide set of models.

In [4]:
# import library for models
import statsmodels.api as sm
import statsmodels.formula.api as smf

# library for plotting
from plotnine import *

Fitting an OLS model

In [5]:
# estimate the model using R style formula
model_ols_r = smf.ols(formula='target ~ age + sex + bmi + bp', data=diabetes).fit()
In [6]:
# or using a more pythonic way
X = sm.add_constant(diabetes[["age", "sex", "bmi", "bp"]])
y = diabetes[["target"]]
model_ols = sm.OLS(y, X).fit()

See outputs

In [7]:
print(model_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 target   R-squared:                       0.400
Model:                            OLS   Adj. R-squared:                  0.395
Method:                 Least Squares   F-statistic:                     72.91
Date:                Sun, 10 Nov 2024   Prob (F-statistic):           2.70e-47
Time:                        13:15:05   Log-Likelihood:                -2434.2
No. Observations:                 442   AIC:                             4878.
Df Residuals:                     437   BIC:                             4899.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.853     53.329      0.000     146.527     157.740
age           37.2406     64.117      0.581      0.562     -88.776     163.258
sex         -106.5762     62.125     -1.716      0.087    -228.677      15.525
bmi          787.1817     65.424     12.032      0.000     658.597     915.766
bp           416.6725     69.495      5.996      0.000     280.087     553.258
==============================================================================
Omnibus:                        9.858   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.007   Jarque-Bera (JB):                6.464
Skew:                           0.146   Prob(JB):                       0.0395
Kurtosis:                       2.485   Cond. No.                         28.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Fitting OLS models with interactions

In [8]:
model_ols_int= smf.ols(formula='target ~ age + sex + bmi + bp + age*bmi', data=diabetes).fit()
print(model_ols_int.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 target   R-squared:                       0.407
Model:                            OLS   Adj. R-squared:                  0.400
Method:                 Least Squares   F-statistic:                     59.77
Date:                Sun, 10 Nov 2024   Prob (F-statistic):           2.40e-47
Time:                        13:15:08   Log-Likelihood:                -2431.8
No. Observations:                 442   AIC:                             4876.
Df Residuals:                     436   BIC:                             4900.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    150.9570      2.892     52.203      0.000     145.274     156.640
age           52.4516     64.228      0.817      0.415     -73.783     178.686
sex         -102.7047     61.887     -1.660      0.098    -224.339      18.930
bmi          813.1607     66.233     12.277      0.000     682.985     943.336
bp           413.1195     69.219      5.968      0.000     277.075     549.164
age:bmi     2809.4786   1291.944      2.175      0.030     270.265    5348.692
==============================================================================
Omnibus:                        8.252   Durbin-Watson:                   1.934
Prob(Omnibus):                  0.016   Jarque-Bera (JB):                6.273
Skew:                           0.184   Prob(JB):                       0.0434
Kurtosis:                       2.547   Cond. No.                         455.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Logit models

The API for statsmodels is the same across different types of models, so the learning cost of estimating a new model is quite low. Let's estimate a logit model now.

In [9]:
# estimate the model using R style formula
diabetes["target_bin"] = np.where(diabetes["target"]> np.mean(diabetes["target"]), 1, 0)

# model
model_logit = smf.logit(formula='target_bin ~ age + sex + bmi + bp', data=diabetes).fit()

# see output
print(model_logit.summary())
Optimization terminated successfully.
         Current function value: 0.536502
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:             target_bin   No. Observations:                  442
Model:                          Logit   Df Residuals:                      437
Method:                           MLE   Df Model:                            4
Date:                Sun, 10 Nov 2024   Pseudo R-squ.:                  0.2182
Time:                        13:15:18   Log-Likelihood:                -237.13
converged:                       True   LL-Null:                       -303.31
Covariance Type:            nonrobust   LLR p-value:                 1.229e-27
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.2809      0.113     -2.495      0.013      -0.502      -0.060
age            1.1068      2.569      0.431      0.667      -3.929       6.142
sex           -3.7533      2.459     -1.526      0.127      -8.573       1.066
bmi           19.5943      2.852      6.871      0.000      14.005      25.183
bp            14.6430      2.875      5.092      0.000       9.007      20.279
==============================================================================
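Logit coefficients are on the log-odds scale, which is hard to interpret directly. A minimal sketch (assuming the model_logit object fitted above) exponentiates the coefficients and confidence bounds to get odds ratios:

import numpy as np
import pandas as pd

# exponentiate coefficients and confidence bounds to move from log-odds to odds ratios
odds = pd.DataFrame({
    "odds_ratio": np.exp(model_logit.params),
    "lower": np.exp(model_logit.conf_int()[0]),
    "upper": np.exp(model_logit.conf_int()[1]),
})
print(odds)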

Understand the Model Quantities

If we are interested in understanding relationships, a huge part of fitting inferential models consists of presenting the results as understandable quantities of interest. statsmodels has a set of functions that allow us to easily analyze the model.

Retrieving model parameters

In [10]:
# get parameters
params = model_ols.params
pd.DataFrame(params).rename(columns={0:"coef"})
Out[10]:
coef
const 152.133484
age 37.240607
sex -106.576199
bmi 787.181650
bp 416.672511
In [11]:
# confidence intervals
model_ols.conf_int().rename(columns={0:"lower", 1:"upper"})
Out[11]:
lower upper
const 146.526670 157.740298
age -88.776333 163.257546
sex -228.677180 15.524781
bmi 658.596846 915.766454
bp 280.087493 553.257528
In [12]:
# p-values
pd.DataFrame(model_ols.pvalues).rename(columns={0:"p-values"})
Out[12]:
p-values
const 2.048854e-193
age 5.616622e-01
sex 8.696030e-02
bmi 5.342370e-29
bp 4.245775e-09
In [13]:
# function to get a tidy data frame with results

def tidy_ols(model):
    """
    input: ols stats model
    output: tidy pandas dataframe with models parameters
    
    """
    # parameters
    params = pd.DataFrame(model.params).rename(columns={0:"coef"})
    
    # confidence intervals
    coinf = model.conf_int().rename(columns={0:"lower", 1:"upper"})
    
    # p-values
    pvalues = pd.DataFrame(model.pvalues).rename(columns={0:"p-values"})
    
    return pd.concat([params, coinf, pvalues], axis=1).reset_index()
In [14]:
# run the function
ols_tidy = tidy_ols(model_ols)

# see
ols_tidy.head()
Out[14]:
index coef lower upper p-values
0 const 152.133484 146.526670 157.740298 2.048854e-193
1 age 37.240607 -88.776333 163.257546 5.616622e-01
2 sex -106.576199 -228.677180 15.524781 8.696030e-02
3 bmi 787.181650 658.596846 915.766454 5.342370e-29
4 bp 416.672511 280.087493 553.257528 4.245775e-09

Visualizing results

In [15]:
# Plot
plot = (ggplot(ols_tidy, aes(y='coef', x='index'))
        + geom_point(aes(y='coef'), size=3, fill="red", color="")
        + geom_errorbar(aes(ymin="lower", ymax="upper"), width=0.01, size=2, alpha=.3)
        + theme_light()
       )
plot
Out[15]:
<Figure Size: (640 x 480)>

Practice

Time for you to practice.

  • Estimate a different OLS model using the statsmodels API. Add any variables whose effects you want to examine.
  • Build a plot with:
    • Points with the predicted outcomes for all observations based on your model (y-axis)
    • Points with observed values (x-axis)
    • Line with the perfect fit between y and x

Tips:

  • You can get the fitted values with model.fittedvalues
  • The perfect fit comes from a 45-degree line: geom_abline(intercept = 0, slope = 1, size = 0.5)

Read more here for other diagnostics: https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html

In [23]:
## add your code here. 
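After you give it a try, here is one possible solution sketch. It assumes you stick with the model_ols object fitted earlier; swap in your own model and its fittedvalues.

import pandas as pd

# observed values (x) vs. fitted values (y), with a 45-degree line for perfect fit
pred_df = pd.DataFrame({
    "observed": diabetes["target"],
    "fitted": model_ols.fittedvalues,
})

(ggplot(pred_df, aes(x="observed", y="fitted"))
 + geom_point(alpha=0.5)
 + geom_abline(intercept=0, slope=1, size=0.5)
 + theme_light())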

Statistical learning with sklearn

As we discussed in the lecture, in the machine learning tradition we are interested in predictive modeling instead of understanding relationships. These are the key features of machine learning models:

  • The goal is to predict values of the outcome, $\hat{y}$

  • Models are treated as a black box

  • The model doesn't need to be interpretable as long as it provides an accurate prediction of $y$.

Machine Learning Workflow

The figure below from Jorge Cimentada's book Machine Learning for Social Science provides a nice summary of the traditional machine learning workflow.

In words:

  • Start with a dataset
  • Split between training and test
  • Do some pre-processing
  • Train the model (with or without cross-validation)
  • Select best parameters (fine-tuning the model)
  • Evaluate the model in the test set.

All these steps will be performed using the sklearn library. The documentation for sklearn is super rich. So I strongly encourage you to check their website and their tutorials: https://scikit-learn.org/stable/tutorial/index.html

Simple Example: OLS just to learn the mechanics of the sklearn API

Open the dataset

In [24]:
# load dataset
import numpy as np
import pandas as pd
from sklearn import datasets

# load the dataset as a pandas dataframe
diabetes = datasets.load_diabetes(as_frame=True)["frame"]

# features
X = diabetes.drop(columns="target")

# target
y = diabetes["target"]

Split between training and test

  • train_test_split method:

    • returns: train and test for X and y
In [25]:
# import function
from sklearn.model_selection import train_test_split

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.15, 
                                                    random_state = 417)

Pre-processing

In [26]:
# any missing? 
X.isna().sum()
Out[26]:
age    0
sex    0
bmi    0
bp     0
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64
In [27]:
# standardization?
X.describe().loc[["mean", "std"],:]
Out[27]:
age sex bmi bp s1 s2 s3 s4 s5 s6
mean -2.511817e-19 1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17 3.918434e-17 -5.777179e-18 -9.042540e-18 9.293722e-17 1.130318e-17
std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02
In [28]:
# any categorical?
X.dtypes
Out[28]:
age    float64
sex    float64
bmi    float64
bp     float64
s1     float64
s2     float64
s3     float64
s4     float64
s5     float64
s6     float64
dtype: object

This dataset is pretty clean for us, so we can skip the pre-processing steps.

If you want to see more about pre-processing steps with sklearn, check here: https://scikit-learn.org/stable/modules/preprocessing.html
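If your data did need pre-processing (say, standardizing features), a common pattern is to bundle the steps with the model in a Pipeline, so the scaler is fit only on the training data. A minimal sketch, not needed for this already-standardized dataset:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# pre-processing and model chained together; .fit() runs both steps in order
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ols", LinearRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))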

Train the model

IMPORTANT: The modeling API follows the same framework regardless of the model. You can fit many different models with very similar code. Let's start with a simple OLS.

In [29]:
# import model 
from sklearn.linear_model import LinearRegression

# Instantiate the modeling object
model = LinearRegression()

# Fit the model
model.fit(X_train,y_train)
Out[29]:
LinearRegression()

Evaluate the model - use the test set!!

In [30]:
# Fit statistic -- R^2 (share of variance in the outcome explained by the model)
model.score(X_test,y_test)
Out[30]:
0.4799898648936971
In [32]:
## if you want to calculate the mean squared error
from sklearn.metrics import mean_squared_error

# predicted values
yhat = model.predict(X_test)
print(yhat)
[115.49491094 210.09290977 189.5242481  175.43515176 201.46238906
  81.15639425 152.8271523  253.37232049 104.76783223 229.40236393
  88.53943327 137.94591634 111.7907669   75.00586818 102.69608919
  88.82851727 144.16536153 166.08420474 190.44728589  95.72804239
 236.09610547 142.32618368 208.17277143  81.42304444 136.95821211
 114.61153969 110.98084478 157.09470122 161.16279034 148.62899149
 121.12546219 232.37582345 166.18001126 183.23857245 228.6220361
 120.43129965  76.13668714  85.79445075 213.28227144 258.72884379
 174.40941255 118.59833036 181.34022491 163.09109404 184.74115281
 206.51010643 124.40167753 132.45563373  86.41756416 154.62456976
 115.99654319 174.18705985 125.04652635 138.79780373 145.15929403
  62.46630334 153.85176433 200.92345233 210.50146136 134.74179524
 253.20969321  95.26282909  56.77716228 260.1568294  187.06904303
 249.2484531  275.68356911]
In [31]:
## mean squared error of the out-of-sample predictions
mean_squared_error(y_test, yhat)
Out[31]:
3325.658847637309
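The MSE is in squared units of the target. A quick sketch (assuming yhat and the imports from the cells above) puts the error back on the original scale of disease progression by taking the root mean squared error:

import numpy as np

# RMSE = square root of the MSE, on the same scale as the target
rmse = np.sqrt(mean_squared_error(y_test, yhat))
print(f"RMSE: {rmse:.1f}")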

Many, many models, and an API to rule them all!

Our performance with the simple linear regression was not great. Let's see if we do better using more complex models. We will try three different families of models. Again, you will learn about these models in DS II.

  • Elastic-net: linear model with regularization parameters.
  • Support Vector Machine: a supervised model that allows for more non-linear relationships between the features and the outcome.
  • Decision Tree: a non-parametric model that uses a set of boolean rules to split your data into multiple groups for supervised tasks.
In [33]:
# setup for models
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# setup for metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score


# setup for dataset
from sklearn import datasets
In [34]:
# load the dataset as a pandas dataframe
diabetes = datasets.load_diabetes(as_frame=True)["frame"]

# features
X = diabetes.drop(columns="target")

# target
y = diabetes["target"]

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.15, 
                                                    random_state = 417)

Linear model

In [35]:
# linear model

# Instantiate the modeling object
model_ols = LinearRegression()

# Fit the model
model_ols.fit(X_train,y_train)

# Predict and calculate MSE in the test
y_pred_ols = model_ols.predict(X_test)
mse_ols = mean_squared_error(y_test, y_pred_ols)

Elastic Net

In [40]:
# Initialize and train the Elastic Net model
# alpha and l1_ratio are hyperparameters (learning is here!)
elastic_net = ElasticNet(alpha=0.01, l1_ratio=1)  
elastic_net.fit(X_train, y_train)

# Predict and calculate MSE in the test
y_pred_en = elastic_net.predict(X_test)
mse_en = mean_squared_error(y_test, y_pred_en)

Support Vector Machine

In [41]:
# Initialize and train the SVM model
svm = SVR(kernel='linear')  # Using linear kernel for demonstration
svm.fit(X_train, y_train)

# Predict and calculate MSE in the test
y_pred_svm = svm.predict(X_test)
mse_svm = mean_squared_error(y_test, y_pred_svm)

Decision Tree

In [42]:
# Initialize and train the Decision Tree model
tree = DecisionTreeRegressor(random_state=0,  max_depth=50)
tree.fit(X_train, y_train)

# Predict and calculate MSE in the test
y_pred_tree = tree.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)

Out-of-sample predictions

This is really important: we always evaluate the models using out of sample, unseen data.

In [43]:
pd.DataFrame({"values":[mse_ols, mse_en,mse_svm, mse_tree], 
               "models":["ols", "elastic-net", "svm", "tree"]}).sort_values("values")
Out[43]:
values models
1 3324.324519 elastic-net
0 3325.658848 ols
2 7071.125649 svm
3 8250.611940 tree

Let's see the fit in the training data

In [44]:
# linear model

# Instantiate the modeling object
model_ols = LinearRegression()

# Fit the model
model_ols.fit(X_train,y_train)

# predict and calculate MSE
y_pred_ols = model_ols.predict(X_train)
mse_ols = mean_squared_error(y_train, y_pred_ols)
In [45]:
# Initialize and train the Elastic Net model
# alpha is a hyperparameter (learning is here!)
elastic_net = ElasticNet(alpha=0.01, l1_ratio=1)  
elastic_net.fit(X_train, y_train)

# Predict and calculate MSE
y_pred_en = elastic_net.predict(X_train)
mse_en = mean_squared_error(y_train, y_pred_en)
In [46]:
# Initialize and train the SVM model
svm = SVR(kernel='linear')  # Using linear kernel for demonstration
svm.fit(X_train, y_train)

# Predict and calculate MSE
y_pred_svm = svm.predict(X_train)
mse_svm = mean_squared_error(y_train, y_pred_svm)
In [47]:
# Initialize and train the Decision Tree model
tree = DecisionTreeRegressor(random_state=0,  max_depth=20)
tree.fit(X_train, y_train)

# Predict and calculate MSE
y_pred_tree = tree.predict(X_train)
mse_tree = mean_squared_error(y_train, y_pred_tree)
In [48]:
pd.DataFrame({"values":[mse_ols, mse_en,mse_svm, mse_tree], 
               "models":["ols", "elastic-net", "svm", "tree"]}).sort_values("values")
Out[48]:
values models
3 0.006667 tree
0 2793.124358 ols
1 2802.447120 elastic-net
2 5765.841372 svm

Quiz: Discuss with your colleague.

  • Why do tree-based models do so much better on the training data compared to the test data?

    • Take five minutes to read a bit about what tree-based models are, and discuss with your colleague (a quick demonstration of the pattern follows below).
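As a quick illustration (a sketch, not the full quiz answer, reusing the split and imports from above): as we let the tree grow deeper, the training MSE collapses while the test MSE does not improve.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# train vs. test MSE for trees of increasing depth
for depth in [2, 5, 10, 20, 50]:
    tree_model = DecisionTreeRegressor(random_state=0, max_depth=depth)
    tree_model.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, tree_model.predict(X_train))
    mse_test = mean_squared_error(y_test, tree_model.predict(X_test))
    print(f"max_depth={depth}: train MSE={mse_train:.1f}, test MSE={mse_test:.1f}")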

Hyper-Parameter Tuning and Cross-Validation

As you saw above, more advanced machine learning models have hyper-parameters.

Hyperparameters are parameters that are not learned directly from the data but are set in advance, before the training process begins; their best values, however, can be chosen by comparing performance across training runs. Examples include the learning rate for gradient descent in deep learning, the depth of a decision tree, or the number of neighbors in a k-nearest neighbors algorithm.

The way we choose the values for hyper-parameters is by looking at the performance of the models.

Cross-validation

If we were to find the best values for hyperparameters by always looking at the test set, we would pretty much be training the model on the test set. Remember, we should always avoid this: the test set should be unseen data, and we should keep its integrity. Using test-set information during training is commonly described as data leakage.

For this reason, when training machine learning models, we often rely on resampling methods, such as cross-validation. These methods allow us to fine-tune our models by taking multiple samples of the training data and training the model multiple times.

Read a nice and short introduction to cross-validation here: https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right

In [49]:
# let's see the mechanics first
from sklearn.model_selection import train_test_split # Train-test split
from sklearn.model_selection import LeaveOneOut # Leave One Out Cross Validation
from sklearn.model_selection import KFold # K-fold Cross validation

# Initialize the K-Folds (splits)
kf = KFold(n_splits=5)

# Split the data
k_splits = kf.split(X_train)

# Let's look at the splits
n_models =0
for train, test in k_splits:
    print(f"N Obs. to train on = {train.shape[0]}, N obs to test on = {test.shape[0]}") #
    n_models += 1 # Count the number of models

print(f"Total numbers of models run = {n_models}")
N Obs. to train on = 300, N obs to test on = 75
N Obs. to train on = 300, N obs to test on = 75
N Obs. to train on = 300, N obs to test on = 75
N Obs. to train on = 300, N obs to test on = 75
N Obs. to train on = 300, N obs to test on = 75
Total numbers of models run = 5
In [50]:
# what are the splits? indices of the data
kf = KFold(n_splits=5)

# Split the data
k_splits = kf.split(X_train)

# containers to store the data
train_data = list()
val_data = list()

# iterate
for fold_number, (train_indices, val_indices) in enumerate(k_splits):
    print(f"Fold {fold_number + 1}:")
    
    # Split your data into training and validation sets using the indices
    train_data.append([X_train.iloc[i] for i in train_indices])
    val_data.append([X_train.iloc[i] for i in val_indices])
Fold 1:
Fold 2:
Fold 3:
Fold 4:
Fold 5:
In [51]:
# see data
pd.DataFrame(train_data[2])
Out[51]:
age sex bmi bp s1 s2 s3 s4 s5 s6
31 -0.023677 -0.044642 -0.065486 -0.081413 -0.038720 -0.053610 0.059685 -0.076395 -0.037129 -0.042499
243 0.016281 0.050680 -0.046085 0.011544 -0.033216 -0.016032 -0.010266 -0.002592 -0.043984 -0.042499
290 0.059871 0.050680 0.076786 0.025315 0.001183 0.016849 -0.054446 0.034309 0.029935 0.044485
236 0.027178 -0.044642 0.006728 0.035644 0.079612 0.070710 0.015505 0.034309 0.040673 0.011349
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854 0.108111 0.015568 -0.046641
... ... ... ... ... ... ... ... ... ... ...
353 -0.052738 -0.044642 -0.055785 -0.036656 0.089244 -0.003193 0.008142 0.034309 0.132376 0.003064
242 -0.103593 0.050680 -0.023451 -0.022885 -0.086878 -0.067701 -0.017629 -0.039493 -0.078140 -0.071494
57 -0.027310 -0.044642 -0.063330 -0.050427 -0.089630 -0.104340 0.052322 -0.076395 -0.056153 -0.067351
331 0.081666 0.050680 -0.025607 -0.036656 -0.070367 -0.046407 -0.039719 -0.002592 -0.041176 -0.005220
115 -0.030942 0.050680 0.001339 -0.005670 0.064477 0.049416 -0.047082 0.108111 0.083799 0.003064

300 rows × 10 columns

We could iterate over these folds with a loop. For example:
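A minimal sketch of that loop, assuming the kf, X_train, and y_train objects defined above: fit a model on each training fold and score it on the held-out fold.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

fold_mse = []
for train_idx, val_idx in kf.split(X_train):
    # fit on the training fold
    fold_model = LinearRegression()
    fold_model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    # evaluate on the held-out fold
    preds = fold_model.predict(X_train.iloc[val_idx])
    fold_mse.append(mean_squared_error(y_train.iloc[val_idx], preds))

print(np.mean(fold_mse))  # average validation MSE across the five folds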

Easy implementation with sklearn

Let's implement this using sklearn. We actually don't need to write the loop ourselves!

In [52]:
# split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create the K-fold splits on the training data
kf = KFold(n_splits=5)

# Split the data
k_splits = kf.split(X_train)
In [53]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to tune
params = {
    'alpha': [0.1, 0.5, 1, 2, 5],
    'l1_ratio': [0, 0.25, 0.5, 0.75, 1]
}

# elastic net
elastic_net = ElasticNet()

# Grid search with 5-fold cross-validation
grid_en = GridSearchCV(elastic_net, params, cv=kf, scoring='r2')
grid_en.fit(X_train, y_train)

# Best parameters and corresponding cross-validated score (scoring='r2')
print("Best Elastic Net parameters:", grid_en.best_params_)
print("Best R-squared:", grid_en.best_score_)
[ConvergenceWarning, repeated for several CV folds: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Linear regression models with null weight for the l1 regularization term are more efficiently fitted using sklearn.linear_model.Ridge/RidgeCV instead.]
Best Elastic Net parameters: {'alpha': 0.1, 'l1_ratio': 1}
Best R-squared: 0.45551049830889384
In [54]:
from sklearn.tree import DecisionTreeRegressor

params = {
    'max_depth': [None, 2, 4, 6, 8, 10],
    'min_samples_split': [2, 5, 10]
}

tree = DecisionTreeRegressor()

grid_tree = GridSearchCV(tree, params, cv=kf, scoring='r2')
grid_tree.fit(X_train, y_train)

print("Best Decision Tree parameters:", grid_tree.best_params_)
print("Best MSE:", grid_tree.best_score_)
Best Decision Tree parameters: {'max_depth': 2, 'min_samples_split': 2}
Best MSE: 0.35160918571106714
In [55]:
from sklearn.neighbors import KNeighborsRegressor

params = {
    'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}

knn = KNeighborsRegressor()

grid_knn = GridSearchCV(knn, params, cv=kf, scoring='r2')
grid_knn.fit(X_train, y_train)

print("Best KNN parameters:", grid_knn.best_params_)
print("Best MSE:", grid_knn.best_score_)
Best KNN parameters: {'n_neighbors': 9}
Best MSE: 0.37833624073149236
In [56]:
# Select the best model based on cross-validation results (highest R-squared)
best_model = max([grid_en, grid_tree, grid_knn], key=lambda x: x.best_score_)
In [57]:
# which model?
best_model.best_estimator_
Out[57]:
ElasticNet(alpha=0.1, l1_ratio=1)
In [58]:
# and its best hyperparameters
best_model.best_params_
Out[58]:
{'alpha': 0.1, 'l1_ratio': 1}
In [60]:
# best cross-validated R-squared
best_model.best_score_
Out[60]:
0.45551049830889384
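Beyond the single best score, each fitted GridSearchCV object stores the cross-validated scores for every hyperparameter combination in cv_results_, which is handy for seeing how sensitive the model is to its hyperparameters. A short sketch using the grid_en object from above:

import pandas as pd

# one row per hyperparameter combination tried in the grid search
cv_results = pd.DataFrame(grid_en.cv_results_)
cv_results[["param_alpha", "param_l1_ratio", "mean_test_score", "rank_test_score"]].sort_values("rank_test_score").head()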
In [61]:
# Predict on the test set
y_pred = best_model.predict(X_test)

# Calculate the test set R-squared
r2_test = r2_score(y_test, y_pred)
print(f"Test set R-squared for the best model: {r2_test:.2f}")
Test set R-squared for the best model: 0.47

All of this only scratches the tip of the iceberg of Machine Learning. If you want to go ahead and learn more before DS II, a good place to start is the sklearn documentation and tutorials linked above.
