F1 Score is Overrated. Instead, use this!
When it comes to evaluating the performance of machine learning models, the F1 score has long been hailed as the go-to metric. However, I dare to challenge the status quo and argue that the F1 score is overrated. In this article, we will explore an alternative metric that deserves more attention and recognition: the Matthews Correlation Coefficient (MCC).
Brace yourself as we dissect the F1 score, uncover its limitations, and unveil the hidden gem that is MCC.
The F1 Score: A Flawed Hero
Ah, the F1 score — the metric that seems to be on every machine learning practitioner’s lips. Calculated as the harmonic mean of precision and recall, the F1 score is intended to strike a balance between the two. But does it always tell the whole story? Let’s find out.
Precision, Recall, and a Harmonious Blend
To understand the F1 score, we need to take a step back and look at its components: precision and recall. Precision measures how many of the predicted positive instances are actually positive. Recall, on the other hand, quantifies how many of the actual positive instances were correctly identified. Both metrics have their merits, but blending them together might not always be the best idea.
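To make these definitions concrete, here is a minimal sketch of precision, recall, and the F1 score written directly in terms of confusion-matrix counts (the function names are my own, not taken from any library):

def precision(tp: int, fp: int) -> float:
    """Of all instances predicted positive, what fraction really is positive?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all instances that are actually positive, what fraction did we catch?"""
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; note that true negatives never appear."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)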
Example: When the F1 Score Falls Short
Consider a hypothetical scenario where you’re building a model to detect a rare disease. Suppose we have a dataset with the following confusion matrix for a binary classification problem:
True Positives (TP) = 25, False Negatives (FN) = 5
False Positives (FP) = 10, True Negatives (TN) = 9,000
In this scenario, the dataset represents a medical test for a rare disease where only 30 positive cases exist against 9,010 negative cases. The model handles the huge negative class almost perfectly, but it still misses 5 of the 30 actual positives and raises 10 false alarms. Here are the calculations for precision, recall, and F1 score:
Precision = TP / (TP + FP) = 25 / 35 ≈ 0.714
Recall = TP / (TP + FN) = 25 / 30 ≈ 0.833
F1 = 2 * Precision * Recall / (Precision + Recall) ≈ 0.769
In this case, the F1 score is around 0.769, which might seem like a reasonable performance. However, look at what the calculation never touches: the 9,000 true negatives. The F1 score is built entirely from TP, FP, and FN, so it says nothing about how the model treats the dominant negative class, and in medical diagnostics even the 5 missed positive cases can have significant real-world consequences.
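If you want to double-check these numbers, one way (assuming scikit-learn is installed; the array-expansion trick below is only for illustration) is to unroll the counts into explicit label arrays and call the standard metric functions:

from sklearn.metrics import precision_score, recall_score, f1_score

# Unroll the confusion-matrix counts into label arrays: 25 TP, 5 FN, 10 FP, 9000 TN.
y_true = [1] * 25 + [1] * 5 + [0] * 10 + [0] * 9000
y_pred = [1] * 25 + [0] * 5 + [1] * 10 + [0] * 9000

print(precision_score(y_true, y_pred))  # ~0.714
print(recall_score(y_true, y_pred))     # ~0.833
print(f1_score(y_true, y_pred))         # ~0.769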
F1 score, are you really the superhero we thought you were?
Introducing the Matthews Correlation Coefficient (MCC)
Amidst the F1 score’s limitations, a silent hero awaits its time to shine: the Matthews Correlation Coefficient. Named after the biochemist Brian W. Matthews, who introduced it in 1975, this coefficient ranges from -1 to +1, where +1 means perfect prediction, 0 means the predictions are no better than random guessing, and -1 means total disagreement between predictions and reality. It takes into account true positives, true negatives, false positives, and false negatives, all wrapped up in a single number. Isn’t that impressive?
Formulating the MCC: here is the formula, written out in terms of the four confusion-matrix counts:
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
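Translated into code, the formula is a one-liner plus an edge case: if any of the four sums in the denominator is zero, the expression is undefined, and a common convention is to report 0 in that degenerate case. Here is a minimal sketch (the function name is my own):

import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient computed straight from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any row or column of the confusion matrix is empty, the denominator vanishes;
    # report 0 in that degenerate case (a common convention).
    return 0.0 if denominator == 0 else numerator / denominator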
Let’s take the same example as above and see what the MCC captures that the F1 score misses.
Using the values from the confusion matrix: TP = 25, TN = 9,000, FP = 10, FN = 5.
Substituting these values into the MCC formula:
MCC = (25 * 9000 - 10 * 5) / sqrt((25 + 10) * (25 + 5) * (9000 + 10) * (9000 + 5))
MCC = 224,950 / sqrt(35 * 30 * 9010 * 9005)
MCC ≈ 224,950 / 291,876
MCC ≈ 0.771
The MCC value is approximately 0.771. At first glance that sits close to the F1 score of 0.769, but the interesting part is what went into the number: all four cells of the confusion matrix, including the 9,000 true negatives that the F1 score never even looks at.
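The same count-expansion trick used earlier lets us verify the hand calculation against scikit-learn’s built-in implementation (again assuming scikit-learn is available):

from sklearn.metrics import matthews_corrcoef

# Same unrolled labels as before: 25 TP, 5 FN, 10 FP, 9000 TN.
y_true = [1] * 25 + [1] * 5 + [0] * 10 + [0] * 9000
y_pred = [1] * 25 + [0] * 5 + [1] * 10 + [0] * 9000

print(matthews_corrcoef(y_true, y_pred))  # ~0.771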
The Matthews Correlation Coefficient takes into account all four elements of the confusion matrix and only rewards a model that does well on both the positive and the negative class. Unlike the F1 score, the MCC considers the relative sizes of the classes, which makes it a far more trustworthy single number for imbalanced datasets. To see the difference in our example: if there were only 19 actual negatives and the model identified just 9 of them correctly (TN = 9, FP = 10), the F1 score would still read 0.769, because true negatives never enter its formula, while the MCC would collapse to roughly 0.33. The short sketch below makes this concrete.
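A rough sketch of that comparison (the expand helper is hypothetical, written only for this illustration):

from sklearn.metrics import f1_score, matthews_corrcoef

def expand(tp: int, fn: int, fp: int, tn: int):
    """Turn confusion-matrix counts into explicit label arrays."""
    y_true = [1] * (tp + fn) + [0] * (fp + tn)
    y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn
    return y_true, y_pred

# Our example above versus a drastically shrunken negative class.
for tn in (9000, 9):
    y_true, y_pred = expand(tp=25, fn=5, fp=10, tn=tn)
    print(f"TN={tn}: F1={f1_score(y_true, y_pred):.3f}, MCC={matthews_corrcoef(y_true, y_pred):.3f}")

# TN=9000: F1=0.769, MCC=0.771
# TN=9:    F1=0.769, MCC=0.331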
As we bid farewell to the F1 score, let’s acknowledge its merits as a starting point for evaluation. However, in the ever-evolving landscape of machine learning, we must embrace alternatives that can better capture the nuances of our models’ performance. The Matthews Correlation Coefficient (MCC) shines as a comprehensive metric, accommodating imbalanced datasets and considering all possible outcomes. So, let’s set aside our F1 score goggles and give MCC the recognition it truly deserves.