[5 min read] Metrics to measure the performance of your classification ML models
‘If you cannot measure it, you cannot manage it.’
Whilst there are many metrics available to evaluate a classification ML model, in this post I am going to focus on the ones I have seen used most frequently.
Confusion Matrix
A confusion matrix, also referred to as an error matrix, is typically used to describe the performance of a classification model against a test dataset for which the true values are known.
It is best visualized as a 2x2 table, with the actual classes along one axis and the predicted classes along the other.
For further clarity, let’s quickly elaborate on a few terms:
True Positive (TP): When the actual class of the data point was True and the predicted value is also True.
True Negative (TN): When the actual class of the data point was False and the predicted value is also False.
False Positive (FP): When the actual class of the data point was False, but the predicted value is True. So the model falsely thinks it’s positive.
False Negative (FN): When the actual class of the data point was True, but the predicted value is False. So the model falsely thinks it’s negative.
The confusion matrix gives us the Accuracy of the model as below:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
So if a model's error matrix gave 121 correct predictions (TP + TN) out of 216 data points in total, then the accuracy of the model would be 121/216, or roughly 0.56. The confusion matrix forms the basis for a number of other metrics.
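To make this concrete, here is a minimal sketch of how the confusion matrix values and accuracy could be computed in Python with scikit-learn; the labels below are made up purely for illustration and are not the values from the example above.

```python
# Minimal sketch: confusion matrix and accuracy with scikit-learn.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (hypothetical)

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print("Accuracy:", accuracy_score(y_true, y_pred))
```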
F1 Score
Before we get to the formula for the F1 Score, let's start off by defining a few terms:
a) Precision: the ratio of the positives the model got right to the total number of positives predicted by the model, i.e. Precision = TP / (TP + FP).
b) Recall or Sensitivity: the ratio of the positives the model got right to the total number of actual positives in the dataset, i.e. Recall = TP / (TP + FN).
c) Specificity or True Negative Rate: the counterpart of Recall for the negative class, i.e. the ratio of the negatives the model got right to the total number of actual negatives in the dataset: Specificity = TN / (TN + FP).
Depending on your business need, the focus can be adjusted across these metrics. For instance, if you are running a marketing campaign, you probably care more about reaching every possible candidate than about how precise the predictions are, so the model would be tuned for a higher recall. Alternatively, if you use a model to flag whether a patient has cancer and every positive flag triggers an invasive, expensive follow-up, you want to be very confident in each positive prediction, so the model is tuned for precision (although in screening settings, where missing an actual case is the bigger risk, recall would again take priority).
Now that we have a sense of what each of these terms means, let's define the overall F1 Score metric:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Simply put, it's the harmonic mean of Precision and Recall. Why the harmonic mean, you ask? Since the F1 Score is a compound metric, using the harmonic mean ensures that if either Precision or Recall is small, it gets flagged: the F1 Score sits closer to the smaller value rather than the larger one. That would not be the case with an arithmetic mean.
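As a rough sketch of how these metrics could be computed (assuming scikit-learn and reusing the same hypothetical labels as above):

```python
# Sketch: Precision, Recall, Specificity and F1 Score with scikit-learn.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predictions

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two

# Specificity has no dedicated helper, so derive it from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                  # TN / (TN + FP)

print("Precision:", precision, "Recall:", recall,
      "Specificity:", specificity, "F1:", f1)

# The harmonic mean sits closer to the smaller of Precision and Recall,
# which is exactly what f1_score returns.
print("F1 by hand:", 2 * precision * recall / (precision + recall))
```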
Area Under the Curve
Here we need to define another term:
The False Positive Rate is the ratio of negative data points that the model has incorrectly classified as positive to all the negative data points in the dataset, i.e. FPR = FP / (FP + TN). It forms the horizontal axis of the ROC plot.
The ROC curve of a model plots the True Positive Rate against the False Positive Rate across classification thresholds, and the AUC (Area Under the Curve) is the area under that curve. It signifies the probability that the classification model will rank a randomly chosen positive data point higher than a randomly chosen negative one.
The AUC has a range between 0 and 1, and of course the higher the value, the better the performance of the model.
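A minimal sketch of how the ROC curve and AUC might be obtained with scikit-learn; the predicted probabilities below are made up for illustration.

```python
# Sketch: ROC curve and AUC with scikit-learn.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # hypothetical labels
y_scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]    # hypothetical probabilities

# fpr and tpr trace the ROC curve as the decision threshold is swept.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))
```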
Other noteworthy metrics
Other metrics worth mentioning are:
a) Logarithmic loss (log loss): a measure of accuracy that incorporates the idea of probabilistic confidence. For the binary case it is given by LogLoss = -(1/N) Σ [ yᵢ·log(pᵢ) + (1 - yᵢ)·log(1 - pᵢ) ], where yᵢ is the actual label and pᵢ is the predicted probability of the positive class.
b) Mean absolute error: the average of the absolute difference between the actual value and the predicted value, expressed as MAE = (1/N) Σ |yᵢ - ŷᵢ|.
c) Mean squared error: the average of the square of the difference between the actual value and the predicted value, represented as MSE = (1/N) Σ (yᵢ - ŷᵢ)². It offers better visibility on the gradient than the Mean absolute error metric. A quick computational sketch of all three follows below.
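A quick sketch of these three metrics with scikit-learn, using hypothetical labels and predicted probabilities:

```python
# Sketch: log loss, mean absolute error and mean squared error.
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error

y_true = [1, 0, 1, 1, 0]             # hypothetical actual labels
y_prob = [0.9, 0.2, 0.6, 0.8, 0.3]   # hypothetical predicted probabilities

# Log loss penalises confident wrong predictions heavily.
print("Log loss:", log_loss(y_true, y_prob))

# MAE and MSE compare the predicted values against the actuals directly.
print("MAE:", mean_absolute_error(y_true, y_prob))
print("MSE:", mean_squared_error(y_true, y_prob))
```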
Whilst the above list is in no way an exhaustive one, it hopefully has given you a sense of some of the metrics likely to be used for Classification models.
For the sake of completeness, I am also listing some of the metrics used for other types of ML models:
a) Regression models: MSPE, MSAE, R Square, Adjusted R Square, etc.
b) Unsupervised models: Rand Index, Mutual Information, etc.
My intention is to keep this post to an under-5-minute read, so I will close it here. Please do write in with your comments or queries. I hope you found the post useful. May your predicted positives always come true :).