Improve machine learning models with the right evaluation metric.

Evaluation Metrics: Accuracy, Precision, Recall, and F1 Score

Dylan | Jul 29, 2019


Evaluating the performance of a machine learning model on holdout data is a fundamental step in the data science process. Fortunately, there are several metrics available to determine the performance of a model. Depending on the task at hand, some metrics will prove more important than others. In this post, we'll explore the most popular metrics: accuracy, precision, recall, and F1 scores.

Before diving into the details, you'll need a basic grasp of confusion matrices to follow the examples in this post. If you're unfamiliar with them or could use a brush-up, read my brief introduction to confusion matrices; otherwise, we're ready to dive in!


Accuracy

Often, "accuracy" is the first measurement we intuitively reach for. Although important, as we'll see, accuracy isn't always the best reflection of performance for a given data science problem. Accuracy is calculated using the following formula:
Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
That is, the sum of correctly classified instances divided by the total instances in the holdout data.

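Imagine the following confusion matrix contains the results of a model's predictions of whether an email is spam or not:
[Figure: Spam Email Predictions Confusion Matrix]
With 10 True Positives, 86 True Negatives, 4 False Positives, and 0 False Negatives, the accuracy of this model would be calculated as
Accuracy = (10 + 86) / (10 + 86 + 4 + 0) = 0.96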
96% accuracy, awesome! Well, not so fast… Although our model correctly labeled all 10 of the spam emails we threw at it, it mislabeled 4 of the 90 non-spam emails as spam. Despite its 96% accuracy, our model threatens to hide 4 potentially important emails from the user out of every 90 legitimate ones received.
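To make the arithmetic concrete, here is a minimal Python sketch of the accuracy formula. The function name `accuracy` is hypothetical, and the count of 86 true negatives is an assumption: the confusion matrix figure isn't reproduced here, so the counts are inferred from the numbers quoted in the text (10 spam emails all caught, 4 of 90 non-spam emails flagged).

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total

# Assumed counts, inferred from the text rather than read off the figure:
# 10 spam emails all caught, 4 of the 90 non-spam emails wrongly flagged.
spam_counts = {"tp": 10, "tn": 86, "fp": 4, "fn": 0}

spam_accuracy = accuracy(**spam_counts)
print(f"accuracy = {spam_accuracy:.2f}")  # accuracy = 0.96
```

Note that the 4 false positives barely dent the score, because the 86 true negatives dominate the numerator.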

Since we don't want to hide potentially important emails from our users, let's see if another evaluation metric would help us understand how our model is performing in this regard. Perhaps precision could be a more suitable metric for this problem.


Precision

Precision is calculated by dividing the number of True Positive predictions by the total number of instances our model labeled Positive.
Precision = True Positive / (True Positive + False Positive)
Returning to our spam email classifier, which flagged 14 emails as spam but was right about only 10 of them, its precision would be
Precision = 10 / (10 + 4) = 0.71
Ouch, 71% isn't nearly as nice as 96%.
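In practice you usually start from lists of labels rather than a ready-made confusion matrix. The sketch below counts True Positives and False Positives directly from two label lists. The lists `y_true` and `y_pred` and the helper `precision` are hypothetical, rebuilt from the counts quoted in the text (10 spam emails, 90 non-spam, 4 of which were wrongly flagged).

```python
# Hypothetical label lists rebuilt from the counts in the text:
# 1 = spam, 0 = not spam.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 10 + [1] * 4 + [0] * 86

def precision(y_true, y_pred, positive=1):
    """TP / (TP + FP): of everything flagged positive, how much really was."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # assumed convention when nothing is flagged positive
    return tp / (tp + fp)

spam_precision = precision(y_true, y_pred)
print(f"precision = {spam_precision:.2f}")  # precision = 0.71
```

Returning 0.0 when the model flags nothing is one possible convention, not the only one; some libraries warn or return an undefined value instead.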

In cases where False Positives (Type I Errors) are costly, such as hiding important emails from users, precision becomes an important metric to consider!

What if we have a scenario where False Negatives (Type II Errors) are very costly? In this case, Recall to the rescue!


Recall

Recall is calculated by dividing the number of True Positive predictions by the total number of actual Positive instances in the holdout dataset.
Recall = True Positive / (True Positive + False Negative)
Imagine a scenario with 1000 flight passengers, 2 of whom are terrorists. We feed this data to our model, whose job is to correctly identify terrorists. After labeling the passengers, our model returns the following confusion matrix:
[Figure: Terrorists Prediction Confusion Matrix]
By the accuracy and precision metrics, our model scores 99.9% and 100% respectively; however, with 2 terrorists on board, failing to identify even one of them can end in tragedy. Our model caught only 1 of the 2, so its recall is
Recall = 1 / (1 + 1) = 0.5
Only 50%? Because the potential consequences of mislabeling a terrorist are very severe, we'll want to tune and adjust our model to make sure we improve recall.
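The three scores for this scenario can be checked side by side with a short Python sketch. The function name `recall` and the individual counts are assumptions: the figure isn't reproduced here, so the counts are inferred from the scores quoted in the text (1 terrorist caught, 1 missed, no passenger falsely accused, 998 non-terrorists correctly cleared).

```python
def recall(tp, fn):
    """TP / (TP + FN): of all actual positives, how many the model found."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumed convention when there are no actual positives
    return tp / actual_positive

# Assumed counts, inferred from the 99.9% / 100% / 50% scores in the text.
tp, fn, fp, tn = 1, 1, 0, 998

flight_accuracy = (tp + tn) / (tp + tn + fp + fn)
flight_precision = tp / (tp + fp)
flight_recall = recall(tp, fn)

print(f"accuracy  = {flight_accuracy:.1%}")   # accuracy  = 99.9%
print(f"precision = {flight_precision:.1%}")  # precision = 100.0%
print(f"recall    = {flight_recall:.1%}")     # recall    = 50.0%
```

Only recall exposes the missed terrorist; the other two scores look nearly perfect.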

Finally, we come to perhaps the most popular metric: what exactly does the F1 score reveal about our model's performance?

F1 Score

The F1 score seeks to strike a balance between precision and recall. Remember that recall is concerned with False Negatives, while precision is concerned with False Positives. Because the F1 score balances the two, it is concerned with all False predictions. It is the harmonic mean of precision and recall, given by the formula:
F1 = 2*((precision*recall)/(precision+recall))
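Substituting the precision and recall formulas into the F1 formula shows which cells of the confusion matrix the score depends on. The symbols P and R (precision and recall) and the abbreviations TP, FP, and FN are shorthand introduced here, and the simplification assumes at least one True Positive.

```latex
\begin{align*}
F_1 &= 2 \cdot \frac{P \cdot R}{P + R},
\qquad P = \frac{TP}{TP + FP},
\quad R = \frac{TP}{TP + FN} \\[4pt]
    &= \frac{2\,TP^2}{TP\,(TP + FN) + TP\,(TP + FP)} \\[4pt]
    &= \frac{2\,TP}{2\,TP + FP + FN}
\end{align*}
```

True Negatives do not appear anywhere in the final expression.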

Why is F1 often more informative than plain accuracy?

As we saw in our example with terrorists, when one class is much larger (998 non-terrorists) than the other (2 terrorists), this imbalance inflates the model's accuracy, because the many correctly labeled members of the large class dominate the score. In this particular example, the model was 99.9% accurate even though it missed half of the terrorists.

Because F1 never considers True Negatives, it is far less affected by a large majority of negative instances. Using this example problem, where precision is 1.0 and recall is 0.5, the model's F1 score would be
F1 Score = 2*((1.0*0.5)/(1.0+0.5)) = 0.67
An F1 score of only 66.7% better reflects the model's need for improvement compared to what the 99.9% accuracy rating leads us to believe.
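A quick way to see the difference is to hold the errors fixed and vary only the number of correctly cleared passengers. The Python sketch below does this; the function names `f1_score` and `accuracy` and the three True Negative counts are hypothetical, chosen only for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when both scores are zero
    return 2 * (precision * recall) / (precision + recall)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# One terrorist caught, one missed, nobody falsely accused.
tp, fp, fn = 1, 0, 1

# Vary only the number of True Negatives (hypothetical passenger counts).
results = [(tn, accuracy(tp, tn, fp, fn), f1_score(tp, fp, fn))
           for tn in (8, 98, 998)]

for tn, acc, f1 in results:
    print(f"TN={tn:>3}  accuracy={acc:.3f}  F1={f1:.3f}")
```

Accuracy climbs from 0.900 to 0.999 as the crowd of non-terrorists grows, while F1 stays at 0.667 because True Negatives never enter its calculation.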


All four metrics are important for accurately evaluating a model, and using them together lets us quickly understand the strengths and weaknesses of a given model.

As always, if you have any remaining questions or additional contributions on the topic, please share them with us in the comments below! I hope this post was informative, and I look forward to further interaction in the comments. Thanks for reading and happy coding!