What is overfitting? How to detect and avoid overfitting?

Difficulty: Easy · Last updated on May 7, 2022, 1:19 a.m.

Overfitting happens when a model fits the training data so closely that it starts learning noise and bias. If we train a machine learning model on too little data, or on data skewed toward one class or type, the model can learn spurious features and fail to generalize to real-world data.

For example, suppose we train a model to classify whether an image contains a cat, and our training data consists only of ginger-colored show cats. The model will learn ginger fur as a defining feature of cats and will fail on real-world cases of detecting a cat in the wild.

Fig 1: A Simplistic Regression based example showcasing scenarios of Underfitting, Good Fit, and Overfitting.

How to detect Overfitting?

The most reliable way to recognize overfitting is to monitor performance on a validation set: if the training error is significantly lower than the validation error, that is a strong indicator of overfitting. Early in training, the training loss and validation loss usually follow a similar trend. But if, after a certain point, the validation metrics (loss, accuracy, R-squared, etc.) plateau or worsen while the training metrics keep improving, the model is chasing a best fit on the training data and will fail to generalize to real-world data.

Fig 2: Sample accuracy curve by epoch to detect overfitting
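The detection rule above can be sketched in code. The snippet below is a minimal illustration, not a production early-stopping routine: `detect_overfitting` is a hypothetical helper, and the loss curves are made-up numbers shaped like Fig 2, where validation loss bottoms out while training loss keeps falling.

```python
def detect_overfitting(train_losses, val_losses, patience=2):
    """Return the epoch where validation loss stopped improving while
    training loss kept decreasing, or None if no such point is seen."""
    best_val, best_epoch = float("inf"), 0
    for epoch, val in enumerate(val_losses):
        if val < best_val:
            best_val, best_epoch = val, epoch
        elif (train_losses[epoch] < train_losses[best_epoch]
              and epoch - best_epoch >= patience):
            # training error still improving, validation error is not
            return best_epoch
    return None

# Illustrative curves: validation loss bottoms out at epoch 3, then rises
train = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
val   = [1.05, 0.80, 0.62, 0.55, 0.58, 0.63, 0.70]
print(detect_overfitting(train, val))  # -> 3
```

This is the same idea behind early stopping: once the gap between the two curves starts widening, further training only memorizes the training set.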

How to avoid Overfitting?

There are multiple ways to avoid overfitting; some of the most common methods are listed below:

  1. Regularization: A method of penalizing large weight values so the model cannot overfit. In general, an L1 or L2 norm of the weights is added to the objective/loss function to regularize them. In advanced models like neural networks, dropout and batch normalization can also be used.

  2. Cross-Validation: A method of resampling the training data into complementary partitions and using different partitions to train and validate the model. It helps the model generalize, consequently reducing overfitting.

  3. Ensembles: Ensembles can also reduce overfitting by subsampling the data, training multiple models, and combining their predictions. Using ensembles, we can reduce both the bias and the variance of the model. To learn more, check out the blog post What is Ensemble Learning? How many types of ensemble methods are there?
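Point 1 above (L2 regularization) can be sketched with plain gradient descent. This is a minimal stdlib-only illustration, not a recommended implementation: `fit_linear` is a hypothetical helper, the data is made up, and the L2 penalty strength of 5.0 is deliberately large so the shrinkage effect is visible.

```python
def fit_linear(xs, ys, l2=0.0, lr=0.01, epochs=1000):
    """Fit y = w*x + b by gradient descent, with an optional
    L2 penalty (l2 * w) added to the gradient of the weight."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n + l2 * w
        grad_b = sum(w * x + b - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [0.1, 1.1, 1.9, 3.2, 3.9]      # roughly y = x, with noise
w_plain, _ = fit_linear(xs, ys)      # no penalty: w close to 1
w_reg, _ = fit_linear(xs, ys, l2=5.0)  # strong L2 penalty shrinks w toward 0
print(w_plain, w_reg)
```

The penalized fit trades a little training error for smaller weights, which is exactly the bias-variance trade regularization makes.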
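Point 2 above (cross-validation) boils down to generating complementary train/validation splits. Below is a minimal sketch of k-fold index generation; `k_fold_indices` is a hypothetical helper, and real projects would normally shuffle first and use a library routine such as scikit-learn's `KFold`.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs splitting range(n) into k folds:
    each fold serves as the validation set exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

for train_idx, val_idx in k_fold_indices(10, 5):
    print("validate on:", val_idx)
```

Averaging the validation score over all k folds gives a far less noisy estimate of generalization error than a single train/validation split.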
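Point 3 above (ensembles via subsampling) can be illustrated with bagging. The sketch below is deliberately toy-sized: `bagging_predict` is a hypothetical helper, and the "model" is just a least-squares slope through the origin; in practice the base learner would be a decision tree or similar.

```python
import random

def bagging_predict(xs, ys, x_new, n_models=20, seed=0):
    """Average the predictions of simple models, each fit
    on a bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n, preds = len(xs), []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        sx = [xs[i] for i in idx]
        sy = [ys[i] for i in idx]
        # toy base model: slope through the origin by least squares
        w = sum(x * y for x, y in zip(sx, sy)) / sum(x * x for x in sx)
        preds.append(w * x_new)
    return sum(preds) / len(preds)

xs = [1, 2, 3, 4, 5]
ys = [1.1, 2.0, 2.9, 4.2, 5.0]   # roughly y = x, with noise
print(bagging_predict(xs, ys, x_new=6))
```

Because each base model sees a different subsample, their individual overfitting errors partially cancel when averaged, which is how bagging reduces variance.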