[5 min read] Metrics to measure the performance of your classification ML models
‘If you cannot measure it, you cannot manage it.’
Whilst there are many metrics available to evaluate a classification ML model, in this post I am going to focus on the ones I have seen used most frequently.
Confusion Matrix
A confusion matrix, also referred to as an error matrix, is typically used to describe the performance of a classification model against a test dataset for which the true values are known.
It is best visualized as a 2x2 table, with the actual classes along one axis and the predicted classes along the other.
For further clarity, let’s quickly elaborate on a few terms:
True Positive (TP): When the actual class of the data point was True and the predicted value is also True.
True Negative (TN): When the actual class of the data point was False and the predicted value is also False.
False Positive (FP): When the actual class of the data point was False, but the predicted value is True. So the model falsely thinks it’s positive.
False Negative (FN): When the actual class of the data point was True, but the predicted value is False. So the model falsely thinks it’s negative.
The confusion matrix gives us the Accuracy of the model as below:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
So if a model's error matrix gave 121 correct predictions (TP + TN) out of 216 data points in total, then the accuracy of the model would be 121/216, or roughly 0.56. The confusion matrix forms the basis for a number of other metrics.
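To make this concrete, here is a minimal sketch of how the confusion matrix values and accuracy could be computed in Python with scikit-learn; the labels below are made up purely for illustration and are not the values from the example above.

```python
# Minimal sketch: confusion matrix and accuracy with scikit-learn.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (hypothetical)

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print("Accuracy:", accuracy_score(y_true, y_pred))
```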
F1 Score
Before we get to the formula for the F1 Score, let's start off by defining a few terms:
a) Precision: the ratio of the positives the model got right to the total number of positives predicted by the model, i.e. Precision = TP / (TP + FP).
b) Recall or Sensitivity: the ratio of the positives the model got right to the total number of actual positives in the dataset, i.e. Recall = TP / (TP + FN).
c) Specificity or True Negative Rate: the counterpart of Recall for the negative class, i.e. the ratio of the negatives the model got right to the total number of actual negatives in the dataset: Specificity = TN / (TN + FP).
Depending on your business need, the focus can be adjusted across these metrics. For instance, if you are running a marketing campaign, you probably care more about reaching every possible candidate than about how precise the predictions are, so the model would be tuned for a higher recall. Alternatively, if you use a model to flag whether a patient has cancer and every positive flag triggers an invasive, expensive follow-up, you want to be very confident in each positive prediction, so the model is tuned for precision (although in screening settings, where missing an actual case is the bigger risk, recall would again take priority).
Now that we have a sense of what each of these terms means, let's define the overall F1 Score metric:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Simply put, it's the harmonic mean of Precision and Recall. Why the harmonic mean, you ask? Since the F1 Score is a compound metric, using the harmonic mean ensures that if either Precision or Recall is small, it gets flagged: the F1 Score sits closer to the smaller value rather than the larger one. That would not be the case with an arithmetic mean.
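As a rough sketch of how these metrics could be computed (assuming scikit-learn and reusing the same hypothetical labels as above):

```python
# Sketch: Precision, Recall, Specificity and F1 Score with scikit-learn.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predictions

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two

# Specificity has no dedicated helper, so derive it from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                  # TN / (TN + FP)

print("Precision:", precision, "Recall:", recall,
      "Specificity:", specificity, "F1:", f1)

# The harmonic mean sits closer to the smaller of Precision and Recall,
# which is exactly what f1_score returns.
print("F1 by hand:", 2 * precision * recall / (precision + recall))
```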
Area Under the Curve
Here we need to define another term:
The False Positive Rate is the ratio of negative data points that the model has incorrectly classified as positive to all the negative data points in the dataset, i.e. FPR = FP / (FP + TN). It forms the horizontal axis of the ROC plot.
The ROC curve of a model plots the True Positive Rate against the False Positive Rate across classification thresholds, and the AUC (Area Under the Curve) is the area under that curve. It signifies the probability that the classification model will rank a randomly chosen positive data point higher than a randomly chosen negative one.
The AUC has a range between 0 and 1, and of course the higher the value, the better the performance of the model.
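A minimal sketch of how the ROC curve and AUC might be obtained with scikit-learn; the predicted probabilities below are made up for illustration.

```python
# Sketch: ROC curve and AUC with scikit-learn.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # hypothetical labels
y_scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]    # hypothetical probabilities

# fpr and tpr trace the ROC curve as the decision threshold is swept.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))
```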
Other noteworthy metrics
Other metrics worth mentioning are:
a) Logarithmic loss (log loss): a measure of accuracy that incorporates the idea of probabilistic confidence. For the binary case it is given by LogLoss = -(1/N) Σ [ yᵢ·log(pᵢ) + (1 - yᵢ)·log(1 - pᵢ) ], where yᵢ is the actual label and pᵢ is the predicted probability of the positive class.
b) Mean absolute error: the average of the absolute difference between the actual value and the predicted value, expressed as MAE = (1/N) Σ |yᵢ - ŷᵢ|.
c) Mean squared error: the average of the square of the difference between the actual value and the predicted value, represented as MSE = (1/N) Σ (yᵢ - ŷᵢ)². It offers better visibility on the gradient than the Mean absolute error metric. A quick computational sketch of all three follows below.
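A quick sketch of these three metrics with scikit-learn, using hypothetical labels and predicted probabilities:

```python
# Sketch: log loss, mean absolute error and mean squared error.
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error

y_true = [1, 0, 1, 1, 0]             # hypothetical actual labels
y_prob = [0.9, 0.2, 0.6, 0.8, 0.3]   # hypothetical predicted probabilities

# Log loss penalises confident wrong predictions heavily.
print("Log loss:", log_loss(y_true, y_prob))

# MAE and MSE compare the predicted values against the actuals directly.
print("MAE:", mean_absolute_error(y_true, y_prob))
print("MSE:", mean_squared_error(y_true, y_prob))
```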
Whilst the above list is in no way an exhaustive one, it hopefully has given you a sense of some of the metrics likely to be used for Classification models.
For the sake of completeness, I am also listing some of the metrics used for other types of ML models:
a) Regression models: MSPE, MSAE, R Square, Adjusted R Square, etc.
b) Unsupervised models: Rand Index, Mutual Information, etc.
My intention is to keep this post to an under-5-minute read, so I will close it here. Please do write in with your comments or queries. I hope you found the post useful. May your predicted positives always come true :).