What's a good way to make sure your model generalizes well to unseen data? Try cross-validation.

Intro to Cross-Validation: Holdout and k-Fold

Dylan | Aug 19, 2019

Model validation is a critical step in the machine learning process. Before pushing our model into production, we must be sure it has accurately captured the underlying trends in our data. In other words, our model must not overfit or underfit our training data.

Once we understand the dangers of overfitting and underfitting, cross-validation becomes our go-to tool for tuning our models toward the best fit. Although there are several cross-validation techniques, we'll focus on only two in this post: the Holdout method and k-Fold Cross-Validation.

The Holdout Method

This is the quickest and simplest form of cross-validation. Generally, it's the first form of cross-validation encountered in data science and machine learning; however, its simplicity results in some significant drawbacks. We'll dive into those a bit later.

The Holdout Method works by splitting our entire dataset available for training our model into two groups. The first group will actually be used to train our model, while the second group will be used as our holdout data.

The training group will use both the independent variables (x) and the dependent variable (y) to train our model, so that it learns the relationship between x and y.

By hiding a portion of the data while training our model, we're now able to present that hidden data's x values to our model as new data and ask it to predict their corresponding y values. Once our model has generated predictions, we can compare its predictions for the holdout data's x values to the true y values, which gives us a simple assessment of our model's prediction performance. In other words, we see how well our model learned the relationship between x and y.
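To make the holdout idea concrete, here is a minimal sketch in plain Python. The helper names (holdout_split, fit_line, mean_squared_error), the synthetic data, and the 80/20 split are all assumptions chosen for illustration; a simple straight-line fit stands in for whatever model we would actually train.

```python
import random

def holdout_split(x, y, test_fraction=0.2, seed=0):
    """Shuffle the indices, then hide the first `test_fraction` of them as the holdout set."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(x) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (an assumption for this sketch): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
x = [i / 10 for i in range(100)]
y = [3 * xi + 2 + rng.gauss(0, 1) for xi in x]

# Train only on the visible 80%, then score on the hidden 20%.
x_train, y_train, x_test, y_test = holdout_split(x, y)
slope, intercept = fit_line(x_train, y_train)
predictions = [slope * xi + intercept for xi in x_test]
holdout_mse = mean_squared_error(y_test, predictions)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, holdout MSE={holdout_mse:.2f}")
```

The model never sees x_test or y_test while fitting, so the holdout MSE is an estimate of how it would behave on genuinely new data.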

Unfortunately, by hiding some instances from our training dataset, we risk removing crucial data that our model may need to properly learn the underlying relationship between x and y. Additionally, evaluating our model's true performance through this method is unreliable because we can receive noticeably different performance results depending on how the data happened to be split into training and testing groups.
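We can see this instability for ourselves with a small experiment. The sketch below is only an illustration: the dataset is synthetic, the function name holdout_mse is made up for this example, and the "model" is deliberately trivial (it always predicts the mean of the training targets). The only thing that changes between runs is the random seed that decides the split.

```python
import random

def holdout_mse(y, seed, test_fraction=0.25):
    """Split y with the given seed, 'train' a mean predictor, and score it on the holdout."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(y) * test_fraction))
    test = [y[i] for i in indices[:n_test]]
    train = [y[i] for i in indices[n_test:]]
    prediction = sum(train) / len(train)  # the model: always predict the training mean
    return sum((t - prediction) ** 2 for t in test) / len(test)

# A small synthetic dataset (an assumption for this sketch).
rng = random.Random(7)
y = [rng.gauss(10, 3) for _ in range(20)]

# Same data, same model, five different splits.
scores = [holdout_mse(y, seed) for seed in range(5)]
for seed, score in enumerate(scores):
    print(f"seed={seed}: holdout MSE={score:.2f}")
print(f"spread: min={min(scores):.2f}, max={max(scores):.2f}")
```

On a dataset this small, the score can move quite a bit from one split to the next even though nothing about the data or the model has changed.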

So, how can we validate our models without hiding potentially crucial data in a way that produces consistent and (mostly) replicable performance results? Let’s see what k-Fold Cross-Validation has to offer.

k-Fold Method

The k-Fold method requires the hyperparameter k which determines how many groups our dataset will be split into. If we assign k=5, we could then refer to this method as 5-Fold Cross-Validation.

The intuition behind the k-Fold method is easy to understand after grasping the Holdout method. Essentially, we take our entire dataset and split it into k groups. So, 5-fold would split the dataset into 5 groups, each containing 20% of our data. Next, we apply the holdout method, using the first group as the holdout (test) set and all the remaining groups as our training set. After training and recording our model's performance, we discard that model and train a new one, this time using the second group as our holdout (test) set and all the other groups for training. After evaluating this model, we save the scores and move on to using the third, fourth, and fifth groups, all the way up to the kth group, as test data.
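The rotation described above can be sketched as a small index generator. The function name k_fold_indices is hypothetical, and this version keeps the rows in their original order for readability; in practice we would usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds, without shuffling."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 10 samples, 5 folds: each fold holds out 20% of the data.
folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: test={test_idx} train={train_idx}")
```

Each sample lands in the test set exactly once and in the training set for the other k-1 folds.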

Once we've trained and evaluated k machine learning models, our last step is to calculate the mean of each evaluation metric we recorded across the folds. By performing the holdout method several times and averaging the scores, we get a more reliable overall reflection of our model's true performance. Every instance also gets used for both training and testing at some point, so no data stays permanently hidden from our model.
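Putting the pieces together, here is a rough sketch of a full k-Fold run that records one score per fold and then averages them. As before, the helper names, the synthetic data, and the straight-line model are assumptions made for the example, not a prescribed implementation.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def k_fold_mse(x, y, k=5, seed=0):
    """Return the list of per-fold mean squared errors."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    scores = []
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th shuffled index forms one fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Train a fresh model on everything except the current fold.
        slope, intercept = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [(y[i] - (slope * x[i] + intercept)) ** 2 for i in test_idx]
        scores.append(mean(errors))
    return scores

# Synthetic data (an assumption for this sketch): y = 2x - 1 plus Gaussian noise.
rng = random.Random(1)
x = [i / 5 for i in range(50)]
y = [2 * xi - 1 + rng.gauss(0, 0.5) for xi in x]

fold_scores = k_fold_mse(x, y, k=5)
cv_score = mean(fold_scores)
print("per-fold MSE:", [round(s, 3) for s in fold_scores])
print(f"mean MSE across folds: {cv_score:.3f}")
```

The individual fold scores will wobble, which is exactly the holdout instability we saw earlier; the mean is the number we report.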

Cross-Validation to Compare Different Models and Parameters

When trying to determine which machine learning model to apply to our specific problem, cross-validation provides a consistent performance measurement we can use to compare the effectiveness of different models.

As a simple example, when trying to decide between linear and polynomial regression, we can evaluate both a linear model and a polynomial model using the k-Fold method. After comparing the average scores of each model, we can choose one model over the other with more confidence.
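Here is one way that comparison might look in code. This is a sketch under stated assumptions: the data is synthetic and generated from a quadratic curve, the function names are made up for the example, and the polynomial fit solves the normal equations directly, which is fine for low degrees but not what a production library would do.

```python
import random
from statistics import mean

def fit_polynomial(x, y, degree):
    """Least-squares polynomial coefficients (lowest power first) via the normal equations."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y], where A[i][j] = x_i ** j.
    rows = [
        [sum(xi ** (r + c) for xi in x) for c in range(n)]
        + [sum(yi * xi ** r for xi, yi in zip(x, y))]
        for r in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def predict(coefficients, xi):
    return sum(c * xi ** power for power, c in enumerate(coefficients))

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Mean of the per-fold mean squared errors for a polynomial of the given degree."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    scores = []
    for fold in range(k):
        test_idx = indices[fold::k]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        coefficients = fit_polynomial(
            [x[i] for i in train_idx], [y[i] for i in train_idx], degree
        )
        scores.append(mean((y[i] - predict(coefficients, x[i])) ** 2 for i in test_idx))
    return mean(scores)

# Synthetic data (an assumption for this sketch): y = x^2 plus Gaussian noise.
rng = random.Random(3)
x = [i / 10 - 3 for i in range(60)]
y = [xi ** 2 + rng.gauss(0, 0.5) for xi in x]

linear_mse = k_fold_mse(x, y, degree=1)
quadratic_mse = k_fold_mse(x, y, degree=2)
print(f"linear    (degree 1) mean MSE: {linear_mse:.3f}")
print(f"quadratic (degree 2) mean MSE: {quadratic_mse:.3f}")
```

Both models are scored on the same folds because they share the same seed, which keeps the comparison fair: any difference in the averages comes from the models, not from a lucky split.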

Furthermore, some machine learning models have hyperparameters whose best values aren't obvious up front. For example, in a decision tree model, how many branches should we allow to avoid overfitting our training data? We could use the k-Fold method to evaluate the performance of models with differing maximum numbers of branches and keep the setting that scores best.
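As a rough illustration of tuning a hyperparameter this way, the sketch below builds a tiny one-feature regression tree from scratch and uses k-Fold scores to pick its maximum depth. Note the assumptions: maximum depth stands in here for the "number of branches" limit, the tree implementation is a toy written for this example, and the data is a synthetic noisy sine wave.

```python
import math
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(points, max_depth):
    """Fit a 1-D regression tree on (x, y) pairs sorted by x; leaves store the mean y."""
    ys = [yv for _, yv in points]
    if max_depth == 0 or len(points) < 2:
        return sum(ys) / len(ys)
    best = None
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # cannot split between identical x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return sum(ys) / len(ys)
    i = best[1]
    threshold = (points[i - 1][0] + points[i][0]) / 2
    left = build_tree(points[:i], max_depth - 1)
    right = build_tree(points[i:], max_depth - 1)
    return (threshold, left, right)

def predict(tree, xi):
    """Walk down the tree until we reach a leaf (a plain number)."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if xi < threshold else right
    return tree

def k_fold_mse(x, y, max_depth, k=5, seed=0):
    """Mean of the per-fold mean squared errors for a tree with the given depth limit."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    fold_scores = []
    for fold in range(k):
        test_idx = indices[fold::k]
        held_out = set(test_idx)
        train = sorted((x[i], y[i]) for i in indices if i not in held_out)
        tree = build_tree(train, max_depth)
        errors = [(y[i] - predict(tree, x[i])) ** 2 for i in test_idx]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k

# Synthetic data (an assumption for this sketch): a noisy sine wave.
rng = random.Random(5)
x = [i / 10 for i in range(60)]
y = [math.sin(xi) + rng.gauss(0, 0.3) for xi in x]

# Score each candidate depth on the same folds and keep the best one.
cv_scores = {depth: k_fold_mse(x, y, depth) for depth in range(1, 7)}
best_depth = min(cv_scores, key=cv_scores.get)
for depth, score in cv_scores.items():
    print(f"max_depth={depth}: mean MSE={score:.3f}")
print("best max_depth:", best_depth)
```

A tree that is too shallow tends to underfit and one that is too deep tends to memorize the noise, so we would typically expect the lowest cross-validated error somewhere in between.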

Thanks for reading and I hope you learned a thing or two! I'll be happy to answer any further questions in the comments below. As always, the best way to truly grasp difficult topics is by rolling up your sleeves and trying to apply them in the real world! Good luck with your projects and check back next week for another post!