The first Villain: overfitting
Overfitting does not only happen with linear models. Overfitting has many faces and is always ready to ruin your attempt to train predictive models.
Here we will look at some examples of overfitting, so that you can easily spot it in the future.
Increasing model complexity with polynomials
Let’s import the necessary packages and load our dataset again. This time we will randomly sample 50 participants from our example dataset and plot age versus the volume of another region: the right middle temporal gyrus.
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
# load the IXI example dataset and randomly sample 50 participants
df = pd.read_csv("https://raw.githubusercontent.com/pni-lab/predmod_lecture/master/ex_data/IXI/ixi.csv").sample(n=50, random_state=1)
df.plot.scatter(y='rh_middletemporal_volume', x='Age')
<AxesSubplot:xlabel='Age', ylabel='rh_middletemporal_volume'>
Now we use the linear modelling framework to fit so-called nth-order polynomial models (i.e. models containing the powers of the predictor up to degree n).
# fit polynomial models of increasing order; np.vander(Age, n + 1, increasing=True)
# builds the columns 1, Age, Age^2, ..., Age^n (the -1 drops the redundant intercept)
poly_model_1 = smf.ols('rh_middletemporal_volume ~ Age', df).fit()
poly_model_2 = smf.ols('rh_middletemporal_volume ~ -1 + np.vander(Age, 3, increasing=True)', df).fit()
poly_model_3 = smf.ols('rh_middletemporal_volume ~ -1 + np.vander(Age, 4, increasing=True)', df).fit()
poly_model_10 = smf.ols('rh_middletemporal_volume ~ -1 + np.vander(Age, 11, increasing=True)', df).fit()
# scatter plot of the raw data, with the fitted polynomial curves overlaid
df.plot.scatter(x='Age', y='rh_middletemporal_volume')
sns.lineplot(x=df['Age'], y=poly_model_1.predict(df), color='red')
sns.lineplot(x=df['Age'], y=poly_model_2.predict(df), color='green')
sns.lineplot(x=df['Age'], y=poly_model_3.predict(df), color='blue')
sns.lineplot(x=df['Age'], y=poly_model_10.predict(df), color='purple')
<AxesSubplot:xlabel='Age', ylabel='rh_middletemporal_volume'>
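To see the extra flexibility in numbers (a quick sketch, not part of the original notebook, reusing the models fitted above), we can print the in-sample R² of each fit. For these nested models it can only increase with the polynomial order, which is exactly why the fit on the training data alone is a poor guide for choosing model complexity.
# in-sample R-squared of each polynomial fit; for these nested models it never decreases as the order grows
for name, model in [('order 1', poly_model_1), ('order 2', poly_model_2),
                    ('order 3', poly_model_3), ('order 10', poly_model_10)]:
    print(name, round(model.rsquared, 3))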
See also
Note that this time we used another package for plotting: seaborn. With seaborn it is very simple to create nice visualizations of pandas dataframes.
Like many other Python packages, seaborn has excellent documentation and a very active community.
Exercise 3.1
Which polynomial order seems reasonable to you? Maybe it becomes clearer if you increase the sample size (when reading in the data)? Try it out in Colab!
Using polynomials of the original variable can be useful. Adding e.g. the square of a predictor enables linear models to capture non-linear (quadratic) relationships, e.g. U-shaped associations. But the higher the polynomial order, the more complex the model is and the more capacity it has to overfit your data. At a small sample size, even a 2nd-order polynomial regression might already overfit. This is a very common type of model overfitting.
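For instance, such a quadratic term can be added directly in the model formula (a minimal sketch using the variables from above; it is essentially equivalent to poly_model_2):
# a 2nd-order polynomial via the formula interface: still linear in the coefficients,
# but able to capture a U-shaped association between Age and regional volume
quadratic_model = smf.ols('rh_middletemporal_volume ~ Age + I(Age**2)', df).fit()
print(quadratic_model.params)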
Google Flu Trends
Google searches have become part of our lives (and Google will most probably be your friend while solving the exercises in this book). Google of course collects all these data and, in 2008, tried to exploit this rich database to predict the severity of the upcoming influenza season. They published the idea and called it Google Flu Trends (GFT). But then GFT failed - and failed spectacularly [7] - missing the peak of the 2013 U.S. flu season by 140 percent.
It turned out that GFT was quite prone to overfitting to seasonal search terms unrelated to the flu, like “high school basketball”.
Of course, related approaches have since been revised and perform significantly better. But the big failure of GFT remains a very illustrative example of how detrimental overfitting can be to predictive models.
See also
Read more here.
Ptolemaic model of the solar system
The Ptolemaic astronomical model postulates that the Earth is at the center of the Universe. In this system, the apparent movement of the planets across the sky can only be explained by sophisticated nested circular orbits. Astronomers aiming to provide more accurate predictions of the planets’ apparent movement had to keep adding more and more nested circles to adjust the model to the observations, until the system got so convoluted that people started to doubt it.
And then Copernicus came up with a more realistic, heliocentric model, in which the Sun is at the center of the solar system. All of a sudden, there was no more need for such convoluted constructions; elliptical orbits (introduced later by Kepler) provided highly accurate predictions.
The take-home message here is that overfitting is not necessarily a direct consequence of an overly complex model. In fact, an ideal, correctly specified model would never overfit. You can only overfit a misspecified model.
But unfortunately, basically all of our models in science are misspecified.
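To make this concrete, here is a toy simulation (not from the original text; the numbers and variable names are made up for illustration). Data are generated from a truly linear model; a correctly specified linear fit is then compared with an unnecessarily flexible 10th-order polynomial on fresh test data. The simple model has no excess capacity to waste on noise, while the flexible model typically fits the noise in the small training sample and generalizes worse.
# toy simulation: the true relationship is linear (y = 2*x + noise)
rng = np.random.default_rng(0)
train = pd.DataFrame({'x': rng.uniform(20, 80, 30)})
train['y'] = 2 * train['x'] + rng.normal(0, 10, 30)
test = pd.DataFrame({'x': rng.uniform(20, 80, 1000)})
test['y'] = 2 * test['x'] + rng.normal(0, 10, 1000)

# the ideal (correctly specified) model vs. an unnecessarily flexible one
ideal_model = smf.ols('y ~ x', train).fit()
flexible_model = smf.ols('y ~ -1 + np.vander(x, 11, increasing=True)', train).fit()

# prediction error on unseen data: the flexible model typically does worse,
# because part of its capacity was spent on fitting noise in the training sample
print('ideal model test MSE:   ', np.mean((test['y'] - ideal_model.predict(test))**2))
print('flexible model test MSE:', np.mean((test['y'] - flexible_model.predict(test))**2))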
Relation to Occam’s Razor
Occam’s Razor, a.k.a. the principle of parsimony, is a problem-solving heuristic that is central to science. It states that, when presented with competing models that provide the same predictions, one should select the one with the fewest assumptions and the lowest complexity.
From the predictive modelling perspective, overfitting is strongly related to Occam’s Razor: applying it can help to avoid overfitting by preferring less complex models. In chapter 4, we will see how regularization and related techniques make Occam’s Razor cut away the unnecessary part of model complexity.
In the next chapter we will see how we can obtain unbiased estimates of prediction error by testing our models on data that was not used for training.