The first Villain: overfitting

Overfitting is not limited to linear models. It has many faces and is always ready to ruin your attempt to train predictive models.

Here we will look at some examples of overfitting, so that you can easily spot it in the future.

Increasing model complexity with polynomials

Let’s import the necessary packages and load our dataset again. This time we will randomly sample 50 participants from our example dataset and plot age versus the volume of another region: the right middle temporal gyrus.

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
df = pd.read_csv("https://raw.githubusercontent.com/pni-lab/predmod_lecture/master/ex_data/IXI/ixi.csv").sample(n=50, random_state=1)
df.plot.scatter(y='rh_middletemporal_volume', x='Age')
<AxesSubplot:xlabel='Age', ylabel='rh_middletemporal_volume'>
../_images/overfitting_ex_2_1.png

Now we use the linear modelling framework to fit so-called nth-order polynomial models, i.e. models that include the powers of the predictor up to degree n.

# 1st-order (simple linear) model
poly_model_1 = smf.ols('rh_middletemporal_volume ~ Age', df).fit()
# 2nd-order polynomial: np.vander(Age, 3, increasing=True) gives the columns Age^0, Age^1, Age^2
poly_model_2 = smf.ols('rh_middletemporal_volume ~ -1 + np.vander(Age, 3, increasing=True)', df).fit()
# 3rd-order polynomial
poly_model_3 = smf.ols('rh_middletemporal_volume ~ -1 + np.vander(Age, 4, increasing=True)', df).fit()
# 10th-order polynomial
poly_model_10 = smf.ols('rh_middletemporal_volume ~ -1 + np.vander(Age, 11, increasing=True)', df).fit()
df.plot.scatter(x='Age', y='rh_middletemporal_volume')
sns.lineplot(x=df['Age'], y=poly_model_1.predict(df), color='red')
sns.lineplot(x=df['Age'], y=poly_model_2.predict(df), color='green')
sns.lineplot(x=df['Age'], y=poly_model_3.predict(df), color='blue')
sns.lineplot(x=df['Age'], y=poly_model_10.predict(df), color='purple')
<AxesSubplot:xlabel='Age', ylabel='rh_middletemporal_volume'>
../_images/overfitting_ex_5_1.png

See also

Note that this time we used another package for plotting: seaborn. With seaborn, it is very simple to create nice visualizations of pandas dataframes. Like many other Python packages, seaborn has excellent documentation and a very active community.

Using polynomials of the original variable can be useful. Adding, for instance, the square of a predictor enables linear models to capture non-linear (quadratic) relationships, such as U-shaped associations. But the higher the polynomial order, the more complex the model and the more capacity it has to overfit your data. At a small sample size, even a 2nd-order polynomial regression might already overfit. This is a very common type of model overfitting.
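As a quick sanity check (a minimal sketch reusing the models fitted above; the exact numbers depend on the 50 participants sampled), we can compare the in-sample R² of the four models. It increases, or at least never decreases, with the polynomial order, even when the extra flexibility mostly captures noise:

for name, model in [('order 1', poly_model_1),
                    ('order 2', poly_model_2),
                    ('order 3', poly_model_3),
                    ('order 10', poly_model_10)]:
    # .rsquared is the in-sample (training) R^2 of the fitted OLS model
    print(name, round(model.rsquared, 3))

A high in-sample R² on its own therefore tells us little about how well the model will predict new participants.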

Ptolemaic model of the solar system

The Ptolemaic astronomical model postulates that the Earth is at the center of the Universe. In this system, the apparent movement of the planets across the sky can only be explained by sophisticated nested circular orbits. Astronomers aiming to provide more accurate predictions of the planets’ apparent movement had to keep adding more and more nested circles to adjust the model to observations, until the system became so convoluted that people started to doubt it.

And then Copernicus came up with a more realistic model, the heliocentric model, in which the Sun is at the center of the solar system. All of a sudden, there was no more need for such convoluted constructions: elliptical orbits provided highly accurate predictions.

The take-home message here is that overfitting is not necessarily a direct consequence of an overly complex model. In fact, an ideal model would never overfit. You can only overfit a misspecified model.

But unfortunately, basically all of our models in science are misspecified.
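To make this concrete, here is a minimal simulated sketch (not part of the lecture dataset; the variables are purely illustrative). When the fitted model matches the true data-generating process, whatever it cannot explain is genuine noise; an overly flexible, misspecified model, in contrast, starts chasing that noise.

# A simulated example (illustrative only): the true relationship is linear.
rng = np.random.default_rng(42)
sim = pd.DataFrame({'x': np.linspace(20, 80, 30)})
sim['y'] = 2 * sim['x'] + rng.normal(scale=20, size=len(sim))

# Correctly specified model: matches the data-generating process.
correct_model = smf.ols('y ~ x', sim).fit()
# Misspecified, overly flexible model: the extra polynomial terms
# have nothing real to capture and start fitting the noise.
flexible_model = smf.ols('y ~ -1 + np.vander(x, 10, increasing=True)', sim).fit()

print(round(correct_model.rsquared, 3), round(flexible_model.rsquared, 3))

The flexible model reports a higher in-sample R², but everything beyond what the correctly specified model explains is noise; on new data drawn from the same process, that apparent advantage would disappear.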

Relation to Occam’s Razor

Occam’s Razor, a.k.a. the principle of parsimony, is a problem-solving heuristic that is central to science. It states that, when presented with competing models that provide the same predictions, one should select the one with the fewest assumptions and the lowest complexity.

From the predictive modelling perspective, overfitting is strongly related to Occam’s Razor. Applying Occam’s Razor can help to avoid overfitting by preferring less complex models. In chapter 4, we will see how regularization and related techniques make Occam’s Razor cut away the unnecessary part of model complexity.
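As a small preview (a minimal sketch, assuming scikit-learn is installed; it is not otherwise used in this chapter), ridge regression implements this idea by penalizing large coefficients, so high-order polynomial terms are shrunk towards zero unless the data strongly support them:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = df[['Age']].values
y = df['rh_middletemporal_volume'].values

# The same 10th-order polynomial expansion, with and without a ridge penalty.
plain_poly = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                           StandardScaler(), LinearRegression()).fit(X, y)
ridge_poly = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                           StandardScaler(), Ridge(alpha=10.0)).fit(X, y)

# The penalty shrinks the (standardized) polynomial coefficients considerably.
print(np.abs(plain_poly[-1].coef_).max(), np.abs(ridge_poly[-1].coef_).max())

The penalty strength (the alpha parameter here) controls how aggressively complexity is cut back; regularization is discussed in detail in chapter 4.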

In the next chapter, we will see how to obtain unbiased estimates of prediction error through testing.