Regularized Models in Action

Scikit-learn offers a wide variety of machine learning models that provide different mechanisms to control model complexity and implement different strategies to find the optimal model coefficients.

In this section we will apply scikit-learn to reduce model complexity by:

  • manually dropping variables

  • transforming the features with PCA and PLS regression

  • regularizing the model with Ridge and LASSO

In this section, we will not optimize the hyperparameters; we will only pick a single hyperparameter value for each model. In the next section we will demonstrate how tuning the hyperparameters can result in double-dipping, a.k.a. leakage. In the next chapter, we will see how leakage during hyperparameter tuning can be prevented by so-called nested cross-validation.

Manual feature selection

Here we do something we have already done before: we reduce model complexity by decreasing the number of predictors.

Below, we simply select the 5 best features (out of all 68) on an independent dataset and then use only these five features of the (previously used) training set to construct cross-validated estimates of predictive performance.

#imports
!pip install scikit-learn > /dev/null 2>&1
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
#load data
df_full = pd.read_csv("https://raw.githubusercontent.com/pni-lab/predmod_lecture/master/ex_data/IXI/ixi.csv")

# select a subset for cross-validation
df_cv = df_full.loc[:100, :]
# select an independent subset for searching for the k best features
df_ext = df_full.loc[100:200, :]

# names of the columns to be used
target = 'Age'
features = df_cv.columns[2:]

# select the 5 best features based on an independent sample
k_best_selector = SelectKBest(k=5).fit(y=df_ext[target], X=df_ext[features])
# then we choose these 5 features from the training set (without looking at the target in the training set)
k_best_features = k_best_selector.transform(df_cv[features])
# cross validation
cv = KFold(5) # 5 folds, 20 participants in each fold

cv_predictions = cross_val_predict(estimator=LinearRegression(), y=df_cv[target], X=k_best_features, cv=cv)
sns.regplot(x=df_cv[target], y=cv_predictions)
print('MAE = ', mean_absolute_error(y_true=df_cv[target], y_pred=cv_predictions), 'years')
MAE =  8.854781824723458 years
[Figure: regplot of true age vs. cross-validated predictions, 5 selected features]

Apparently, this results in a large improvement in cross-validated predictive performance, compared to the example in the previous chapter that included all features.

Note

Take a look at how scikit-learn’s SelectKBest feature selector is used. Calling its fit function selects the best features based on the external, independent dataset; calling the fitted selector’s transform function then applies what it has “learned” from the external dataset, i.e. it picks out those “best” features. The fit, transform, and fit_transform (doing both operations on the same data) functions are central to scikit-learn; most of its functionality is accessed through them. Separating the learning process (fit) from the application of the learned knowledge (transform) allows a clean separation of training and test datasets. We will see numerous examples of this scheme in the following sections.
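
In short, the pattern looks like this (a schematic sketch; X_external, y_external and X_train are placeholder names, not variables defined in this notebook):

selector = SelectKBest(k=5)
selector.fit(X=X_external, y=y_external)        # fit: learn which 5 features score best
X_train_selected = selector.transform(X_train)  # transform: apply that selection to other data
# selector.fit_transform(X, y) would do both steps on the same dataset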

Despite the improvement, you might intuitively feel that we have thrown some information out of the window, as we may have discarded variables that, even though weaker, still significantly predict the target.

This issue can be tackled by converting the original features into new ‘latent’ variables, which contain the same information, but in a potentially more useful form. That’s what PCA and PLS do!

Dimensionality Reduction: PCA and PLS

As discussed in the theory section, with PCA we can transform our features into ‘latent’ variables that are potentially more useful than the original ones. With PCA, the “essence” of a group of highly correlated features is “condensed” into a single variable. Taking the principal components as predictive features therefore often allows reducing model complexity further, as a few PCA variables can provide the same explanatory power as a larger number of original features.

Let’s see it in practice! Below, we take the first five principal components calculated from all features.

from sklearn.decomposition import PCA

features = df_cv.columns[2:]

pca = PCA(n_components=5)
pca_features = pca.fit_transform(df_cv[features])

# same as in the previous example except:                                          VVV here VVV
cv_predictions = cross_val_predict(estimator=LinearRegression(), y=df_cv[target], X=pca_features, cv=KFold(5))
sns.regplot(x=df_cv[target], y=cv_predictions)
print('MAE = ', mean_absolute_error(y_true=df_cv[target], y_pred=cv_predictions), 'years')
MAE =  8.658108451519261 years
[Figure: regplot of true age vs. cross-validated predictions, 5 principal components]

Again a small improvement. And this time we didn’t have to use the independent external dataset: unlike SelectKBest, PCA does not incorporate any information about the target variable; it only looks at the features.
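
If you are curious how much of the total feature variance the five components retain, the fitted pca object exposes this directly (a quick check; the exact numbers depend on the data):

# proportion of the total feature variance captured by each of the 5 components
print(pca.explained_variance_ratio_)
print('total explained variance:', pca.explained_variance_ratio_.sum())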

Let’s look at PLS!

from sklearn.cross_decomposition import PLSRegression

# same as in the previous example except:            VVV here VVV
cv_predictions = cross_val_predict(estimator=PLSRegression(n_components=5), y=df_cv[target], X=df_cv[features], cv=KFold(5))
sns.regplot(x=df_cv[target], y=cv_predictions)
print('MAE = ', mean_absolute_error(y_true=df_cv[target], y_pred=cv_predictions), 'years')
MAE =  8.190707243962333 years
[Figure: regplot of true age vs. cross-validated predictions, 5-component PLS regression]

Performance improved slightly again. Note that in this case the dimensionality reduction and the fitting of the regression model happen in one step, inside PLSRegression. Indeed, the power of PLS lies in tailoring the dimensionality reduction to the characteristics of the target variable.
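
Under the hood, PLSRegression first projects the features onto latent components that maximize their covariance with the target and then fits a linear model on these scores. If you want to peek at the latent variables yourself, a minimal sketch (fitted on the whole cross-validation sample, for illustration only) could look like this:

pls = PLSRegression(n_components=5).fit(df_cv[features], df_cv[target])
latent_scores = pls.transform(df_cv[features])  # the 5 latent variables, one column each
print(latent_scores.shape)                      # (number of participants, 5)
print(pls.coef_.shape)                          # coefficients mapped back to the original features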

It looks like PCA and PLS give us a performance gain compared to using the original features. But do we also lose something? Yes: interpretability. Checking out the next exercise may shed some light on this. More details will be given in chapter 7, “Model Explanation”.

Regularized models: Ridge and LASSO

Exercise 4.4 illustrates that transforming our original features into latent variables may have some disadvantages in terms of interpretability. Let’s look at how regularized regression models can reduce model complexity (and thereby fight overfitting) while, at the same time, retaining all the original variables.
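
As a quick reminder from the theory section (scikit-learn’s internal scaling conventions differ slightly, but the idea is the same), both methods add a penalty on the size of the coefficient vector \(\boldsymbol{b}\) to the ordinary least squares objective:

\[
\text{Ridge:}\quad \min_{\boldsymbol{b}}\; \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{b}\|_2^2 + \alpha\,\|\boldsymbol{b}\|_2^2
\qquad\qquad
\text{LASSO:}\quad \min_{\boldsymbol{b}}\; \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{b}\|_2^2 + \alpha\,\|\boldsymbol{b}\|_1
\]

The hyperparameter \(\alpha\) sets the strength of the penalty: larger values shrink the coefficients more strongly, and the LASSO penalty additionally drives some coefficients exactly to zero. Below we simply pick a single, fairly large \(\alpha\) for each model, without any tuning.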

from sklearn.linear_model import Ridge

# we take all features
features = df_cv.columns[2:]

# Changed:                                       VVV here VVV
cv_predictions = cross_val_predict(estimator=Ridge(alpha=10000000), y=df_cv[target], X=df_cv[features], cv=KFold(5))
sns.regplot(x=df_cv[target], y=cv_predictions)
print('MAE = ', mean_absolute_error(y_true=df_cv[target], y_pred=cv_predictions), 'years')
MAE =  8.00562203685471 years
[Figure: regplot of true age vs. cross-validated predictions, Ridge regression]
from sklearn.linear_model import Lasso

features = df_cv.columns[2:]

# Changed:                                     VVV here VVV
cv_predictions = cross_val_predict(estimator=Lasso(alpha=1000), y=df_cv[target], X=df_cv[features], cv=KFold(5))
sns.regplot(x=df_cv[target], y=cv_predictions)
print('MAE = ', mean_absolute_error(y_true=df_cv[target], y_pred=cv_predictions), 'years')
MAE =  8.028189984232535 years
[Figure: regplot of true age vs. cross-validated predictions, LASSO regression]

Both Ridge and LASSO seem to provide predictive performance that is comparable to (or even slightly better than) the dimensionality reduction-based approaches.

Curious which regions are driving the prediction? Let’s take a sneak preview of model interpretation, which will be discussed in detail in chapter 7.

Remember that Ridge and LASSO are just linear models with some extra steps. So we can simply obtain the model coefficients (\(\boldsymbol{b}\)). As the model’s prediction is just a weighted average of the feature values, where the weights are the model’s coefficients, the coefficients themselves can be considered as a kind of ‘feature importance’: the degree to which that feature influences the predictions.

But there’s a problem: with cross-validation we actually have multiple models, one fitted in each fold.
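
You can see this directly by asking scikit-learn for the per-fold estimators (a quick illustration with cross_validate and return_estimator=True; we use the same alpha as for the finalized model below):

from sklearn.model_selection import cross_validate

# refit a Ridge model in each of the 5 folds and keep the fitted estimators
fold_results = cross_validate(Ridge(alpha=1000), X=df_cv[features], y=df_cv[target],
                              cv=KFold(5), return_estimator=True)
# the coefficient of the first feature differs slightly from fold to fold
print([est.coef_[0] for est in fold_results['estimator']])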

Therefore, we must finalize the model first. Here we finalize our Ridge model by training it on the whole sample (without cv):

model = Ridge(alpha=1000)
model.fit(y=df_cv[target], X=df_cv[features])
Ridge(alpha=1000)

And then, we can simply print out the coefficients, together with the names of the features.

pd.DataFrame([model.coef_], columns=features).T
                                           0
lh_bankssts_volume                  0.000092
lh_caudalanteriorcingulate_volume  -0.001514
lh_caudalmiddlefrontal_volume      -0.001669
lh_cuneus_volume                   -0.003873
lh_entorhinal_volume               -0.000299
...                                      ...
rh_supramarginal_volume             0.002263
rh_frontalpole_volume              -0.007186
rh_temporalpole_volume              0.006122
rh_transversetemporal_volume       -0.006611
rh_insula_volume                    0.003572

68 rows × 1 columns
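
As a quick sanity check of the “weighted average” intuition above, we can reproduce the finalized model’s predictions by hand from its coefficients and intercept (a small sketch):

# prediction = intercept + weighted sum of the feature values
manual_predictions = model.intercept_ + df_cv[features].values @ model.coef_
print(np.allclose(manual_predictions, model.predict(df_cv[features])))  # expected: True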

With a few lines of code, we can visualize feature importance in 3D, to show the most important predictors of brain age:

!pip install ggseg > /dev/null 2>&1
import ggseg
features_plot = features.values.copy()
# convert FreeSurfer-style labels (e.g. 'lh_bankssts_volume') to 'region_hemisphere' names (e.g. 'bankssts_left') for ggseg:
for i, f in enumerate(features):
    parts = f.split('_')
    if parts[0] == 'rh':
        parts[0] = 'right'
    else:
        parts[0] = 'left'
    features_plot[i] = parts[1] + '_' + parts[0]

ggseg.plot_dk(dict(zip(features_plot, model.coef_)), cmap='bwr', figsize=(10,10),
              background='w', edgecolor='gray', bordercolor='gray',
              ylabel='Predictive Importance', title='Morphological predictors of age')
[Figure: ggseg surface plot of the Ridge coefficients, “Morphological predictors of age”]

To recap, being able to control model complexity can drastically reduce overfitting and thereby significantly improve predictive performance on unseen data. The complexity of machine learning models can usually be set via one or more hyperparameters.

But how should we find the optimal values for the hyperparameters? Should we simply do it in a trial-and-error fashion? Or should we systematically search for the value with which the cross-validated predictions look best?

Definitely not! To understand the reason why, we must get to know the second villain of machine learning: leakage!