Overfitting and Underfitting Models
Dylan | Aug 08, 2019
When designing and evaluating a model, we must make sure our model accurately identifies existing trends and generalizations in our data. Without capturing the essence of our data, there’s no way our model can make useful predictions in the future.
In very broad terms, there are two primary reasons our models fail to identify critical underlying trends in our data. The first reason results from forcing an overly complex model on our data while the second reason occurs when we choose a model that is simply not complex enough to accurately capture trends.
When considering how well a model captures trends in our data, two very important terms to understand are overfitting and underfitting.
Overfitting occurs when our model fits the training dataset so closely, noise included, that it predicts the training target values with perfect or extremely high accuracy but fails to generalize. While high training accuracy may seem desirable on the surface, overfitting will lead to problems later in deployment because the model hasn’t adequately learned the general relationship between the features and the target.
For example, let’s consider the graph of a simple single-feature regression model below. The red points represent our training data and the red line maps our model’s predictions of y (target value) given new x (feature) input data.
If we feed our x training data back into our model, we’ll see that it accurately predicts every single target value. Our model has essentially "memorized" the training data.
However, if we introduce new data to our model (represented below by green points) our model’s prediction performance is underwhelming.
Using the mean absolute error (a risk metric corresponding to the expected value of the absolute error loss, or L1-norm loss) to gauge the expected error of our model, we see that it has an expected error of 2.1259. In other words, our model’s predictions on the newly introduced data were off by 2.1259 on average.
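To make the metric concrete, here is a minimal sketch of how the mean absolute error can be computed by hand. The function name and the sample values are made up for illustration; they are not the data behind the graphs in this post.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one sample")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical target values and predictions (illustration only).
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.75, i.e. (0.5 + 0.0 + 1.5 + 1.0) / 4
```

Because every error is taken as an absolute value, overshooting and undershooting count equally and cannot cancel each other out.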
Now let’s try reducing the complexity of our original model to better capture the general underlying trend of the training data instead of just memorizing it.
Again, the red points represent the original training data, the blue line maps our model’s prediction path, and the green points represent the newly introduced data. It’s immediately obvious that this model isn’t as perfectly "fit" to the points in our training data as our first model. In fact, it’s often quite far off the actual training values. Let’s see how it performs predicting the target values of new data.
Calculating the mean absolute error for this model, we find an expected error of 0.7233! Compared to the first model’s 2.1259, this second model cut its prediction error on new data by roughly a factor of three.
As a general rule of thumb, the more complex a model, the higher the risk that the model will overfit the training data. Now let’s explore what happens when we try to apply models that are not complex enough.
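The same comparison can be sketched in code. This is only an illustration under stated assumptions: the data is synthetic (a made-up linear trend plus random noise), the "complex" model is a degree-9 polynomial, and the "simple" model is a straight line. None of this is the data or the models behind the graphs above, so the numbers will differ from 2.1259 and 0.7233.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_trend(x):
    # Assumed underlying relationship, chosen only for this illustration.
    return 2.0 * x + 1.0


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


# Ten noisy training points and 200 unseen points from the same process.
x_train = np.linspace(0.0, 5.0, 10)
y_train = true_trend(x_train) + rng.normal(0.0, 1.0, size=x_train.size)
x_new = np.linspace(0.0, 5.0, 200)
y_new = true_trend(x_new) + rng.normal(0.0, 1.0, size=x_new.size)

# A degree-9 polynomial has enough freedom to pass through all 10 training
# points, which is the "memorizing" behaviour described above.
complex_model = Polynomial.fit(x_train, y_train, deg=9)
# A straight line can only follow the overall trend.
simple_model = Polynomial.fit(x_train, y_train, deg=1)

results = {}
for name, model in [("complex", complex_model), ("simple", simple_model)]:
    results[name] = {
        "train": mae(y_train, model(x_train)),
        "new": mae(y_new, model(x_new)),
    }
    print(f"{name}: train MAE = {results[name]['train']:.4f}, "
          f"new-data MAE = {results[name]['new']:.4f}")
```

The pattern to look for is a near-zero training error paired with a larger error on new data for the complex model, while the simple model’s two errors stay close to each other.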
Imagine a scenario where we have the following data represented by red points.
Now imagine applying a simple linear regression model to make predictions on this dataset. The blue line represents the regression model’s prediction path.
Unfortunately, this model leaves a lot to be desired. Simple linear regression isn’t flexible enough to capture the essence of this particular dataset. Perhaps we would be better off using a polynomial regression model to properly capture the curve in the data.
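A rough sketch of this situation, assuming synthetic data that follows a quadratic curve (the real dataset from the graph is not available here), compares a straight-line fit with a degree-2 polynomial fit. Note that both models are scored on the training data itself: an underfit model performs poorly even on the data it was trained on.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Assumed curved data: y = x^2 plus a little noise (illustration only).
x = np.linspace(-3.0, 3.0, 30)
y = x**2 + rng.normal(0.0, 0.3, size=x.size)


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


linear_fit = Polynomial.fit(x, y, deg=1)     # simple linear regression
quadratic_fit = Polynomial.fit(x, y, deg=2)  # polynomial regression

linear_mae = mae(y, linear_fit(x))
quadratic_mae = mae(y, quadratic_fit(x))

print(f"linear training MAE:    {linear_mae:.4f}")
print(f"quadratic training MAE: {quadratic_mae:.4f}")
```

The straight line has no way to bend with the data, so its error stays large no matter how much training data it sees.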
Avoiding Over & Underfitting
Hopefully, the importance of finding a "fitting sweet spot" between overfitting and underfitting has been clearly illustrated. One of the most popular techniques for gauging a model’s "fitness" is cross-validation on random batches of holdout data.
Holdout data is simply data that was removed from the training dataset and kept hidden during our model’s training phase. After completing the training, we present the holdout data to the model and ask for predictions. By performing this crucial step, we’re able to evaluate our model’s performance by comparing its predicted values with the true values of our holdout data.
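One common way to do this is k-fold cross-validation, where each batch of data takes a turn as the holdout set. The sketch below is a hand-rolled version under stated assumptions: the helper `cross_val_mae` is a hypothetical name, the data is synthetic, and polynomial fits of different degrees stand in for models of different complexity.

```python
import numpy as np
from numpy.polynomial import Polynomial


def cross_val_mae(x, y, degree, k=5, seed=0):
    """Average holdout MAE of a polynomial of the given degree over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))   # shuffle before splitting
    folds = np.array_split(indices, k)  # k roughly equal holdout batches
    fold_errors = []
    for i, holdout_idx in enumerate(folds):
        # Train on every fold except the one currently held out.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
        predictions = model(x[holdout_idx])
        fold_errors.append(np.mean(np.abs(y[holdout_idx] - predictions)))
    return float(np.mean(fold_errors))


# Assumed synthetic data with a curved trend (illustration only).
rng = np.random.default_rng(42)
x = np.linspace(-3.0, 3.0, 60)
y = x**2 + rng.normal(0.0, 0.5, size=x.size)

scores = {degree: cross_val_mae(x, y, degree) for degree in (1, 2, 8)}
for degree, score in scores.items():
    print(f"degree {degree}: cross-validated MAE = {score:.4f}")
```

Since every score is computed on data the model did not see during fitting, the degree with the lowest score is a reasonable candidate for the sweet spot between underfitting and overfitting.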
There are many techniques for evaluating a model’s performance, but they break down into two groups: techniques that can be applied to regression models and techniques for evaluating classification models.
Read more about the most common evaluation metrics for regression models, such as R-Squared, Adjusted R-Squared, and Root Mean Squared Error.
Read more about the most common evaluation metrics for classification models, such as Accuracy, F1 Score, Precision, and Recall.
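As a quick illustration of both groups, here is a minimal plain-Python sketch of a few of the metrics named above. The function names and sample values are hypothetical, and the classification helper assumes binary labels where 1 marks the positive class.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def classification_scores(labels, preds, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for l, p in pairs if l == positive and p == positive)
    fp = sum(1 for l, p in pairs if l != positive and p == positive)
    fn = sum(1 for l, p in pairs if l == positive and p != positive)
    correct = sum(1 for l, p in pairs if l == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical regression targets and predictions.
reg_rmse = rmse([1, 2, 3, 4], [1.5, 2, 2.5, 4])
reg_r2 = r_squared([1, 2, 3, 4], [1.5, 2, 2.5, 4])
print(reg_rmse, reg_r2)

# Hypothetical binary labels and predictions.
clf = classification_scores([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(clf)
```

In all cases the metrics should be computed on holdout data rather than on the training data, otherwise an overfit model will look better than it really is.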
When looking for the fitness sweet spot of our model, it’s critical to properly understand the concept of the bias-variance tradeoff. If you’re interested in exploring overfitting and underfitting in greater depth, my post on the bias-variance tradeoff is a fantastic place to start!
Many thanks for reading, and as always, make sure to comment below with any questions. I’m looking forward to further discussions about overfitting, underfitting, and how to prevent them! Until next time, happy coding!