Let's Talk About Performance Evaluation Metrics for Machine Learning

Performance Evaluation Metrics

Hello, all data scientists and researchers! Today, I will talk about one of the most important topics in machine learning: performance evaluation metrics. We can never know whether a model is effective unless we understand which metric to use to evaluate it. I will discuss metrics in two categories: 1) metrics for classifiers, and 2) metrics for regressors.

Metrics for Classifiers

Introduction

This type of metric is used for models predicting categorical outcomes, which could be binary (e.g., predicting if patients have or do not have cancer) or multicategorical (e.g., predicting whether customers would buy a Porsche, Lexus, or Mercedes as their first car). Typically, classifiers categorize data into specific groups/labels. However, certain algorithms, such as logistic regression, produce a continuous output that represents the probability of a case belonging to a class, typically ranging from 0 to 1. This probability is often converted into a categorical outcome for easier interpretation and decision-making by applying a threshold. For instance, if the probability of a patient having cancer is 0.78, and our cutoff is 0.5, we would classify this case as '1' (having cancer). Now that you have an idea of what a classifier is, let's discuss the metrics available to evaluate this type of model.

1. Confusion Matrix

  • A Confusion Matrix is a standard evaluation metric used for classifiers, including those targeting binary and multicategorical outcomes. The matrix is typically presented as a table indicating true positives, true negatives, false positives, and false negatives.
  • For example, imagine you build a model for cancer detection; your confusion matrix may look like this:

| | Predicted: No Cancer | Predicted: Cancer |
|---|---|---|
| Actual: No Cancer | 8,000 (True Negative) | 2,000 (False Positive) |
| Actual: Cancer | 27 (False Negative) | 123 (True Positive) |

  • The table indicates that there are 10,000 observations that do not have cancer. The classifier accurately predicts '0' for 8,000 of them (true negatives: TN) and inaccurately predicts '1' for 2,000 of them (false positives: FP). Similarly, there are 150 observations that have cancer. The classifier accurately predicts '1' for 123 of them (true positives: TP) and inaccurately predicts '0' for 27 of them (false negatives: FN).
  • However, the matrix alone may not fully illustrate a model’s performance. You might then wonder: What can we do with this confusion matrix? The answer is that you can use the values in the table to calculate key metrics such as accuracy (i.e., the percentage of correctly classified instances), sensitivity or recall (i.e., the proportion of truly positive instances that are correctly identified), specificity (i.e., the proportion of truly negative instances that are correctly identified), and precision (i.e., the proportion of instances classified as positive that are truly positive).
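
To make this concrete, here is a minimal sketch of building a confusion matrix with scikit-learn; the y_true and y_pred arrays are small made-up examples rather than the cancer data discussed above.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground-truth labels and model predictions (1 = cancer, 0 = no cancer)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Unpack the four cells for use in the metrics discussed below
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```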

2. Accuracy

  • As mentioned earlier, accuracy assesses the percentage of correctly classified samples. The metric can be represented by the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • This metric is particularly useful when the outcome classes are balanced, such as a binary outcome where the ratio of 0s to 1s is close to 50:50. This is because accuracy considers all instances in the denominator.
  • However, the metric can be biased if you have an imbalanced outcome. In the cancer example mentioned previously, 10,000 instances do not have cancer whereas only 150 instances have cancer.

| | Predicted: No Cancer | Predicted: Cancer |
|---|---|---|
| Actual: No Cancer | 8,000 (True Negative) | 2,000 (False Positive) |
| Actual: Cancer | 27 (False Negative) | 123 (True Positive) |

  • If you use the formula, the accuracy of the model is (8,000 + 123) / (8,000 + 2,000 + 27 + 123) ≈ 80%, which suggests good performance, even though accuracy alone tells you nothing about how well the model handles the minority class (the cases with cancer).
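
As a quick sanity check, the calculation above takes only a few lines of Python; the four counts come straight from the confusion matrix in this example.

```python
# Counts from the cancer-detection confusion matrix above
tp, tn, fp, fn = 123, 8000, 2000, 27

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.1%}")  # ~80%, despite the heavy class imbalance
```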

3. Sensitivity (AKA Recall)

  • You may wonder: if accuracy does not work well with class imbalance, which metric should you use instead? One answer is sensitivity.
  • Sensitivity measures how often an algorithm correctly identifies positive cases out of all the actual positive samples. The metric can be represented by the following formula:

Sensitivity (Recall) = TP / (TP + FN)

  • The values range from 0 to 1, with higher scores indicating that a larger share of the actual positive cases is detected.
  • The metric works well for problems with imbalanced classes. Consider the cancer detection example: the sensitivity score is 123 / (27 + 123) ≈ 0.82, which indicates that the model identifies actual cancer cases quite well, even though they are the minority class.
  • Sensitivity is especially desirable when the cost of a false negative is high. In other words, it should be the go-to metric when you do not want to overlook false negatives, i.e., cases where people actually have cancer but the model predicts they do not.
  • Note that sensitivity is often reported along with precision or specificity to provide a more comprehensive evaluation.
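
Here is the same sensitivity calculation in Python, using the counts from the cancer example; with raw label arrays, scikit-learn's recall_score gives the same quantity.

```python
# Sensitivity (recall) from the cancer-detection counts above
tp, fn = 123, 27

sensitivity = tp / (tp + fn)
print(f"Sensitivity (recall): {sensitivity:.2f}")  # ~0.82

# Equivalent call when you have the raw labels:
# from sklearn.metrics import recall_score
# recall_score(y_true, y_pred)
```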

4. Specificity

  • This metric assesses the proportion of truly negative samples that are correctly identified as negative, as the formula below illustrates:

Specificity = TN / (TN + FP)

  • Similar to sensitivity, specificity is a good evaluation metric if there is an imbalance between the number of real positive and negative instances.
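
scikit-learn has no dedicated specificity function, but it is straightforward to compute from the confusion matrix counts; a short sketch using the same cancer example:

```python
# Specificity from the cancer-detection counts above
tn, fp = 8000, 2000

specificity = tn / (tn + fp)
print(f"Specificity: {specificity:.2f}")  # 0.80

# Equivalently, specificity is the recall of the negative class:
# from sklearn.metrics import recall_score
# recall_score(y_true, y_pred, pos_label=0)
```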

5. Precision

  • Precision assesses the proportion of correct positive predictions among all the cases the model predicts as positive. The formula is as follows:
  • Precision = TP / (TP + FP)
  • The precision scores range from 0 to 1, with higher scores indicating greater accuracy in predicting positive instances.
  • Like sensitivity, precision is desirable, especially when the cost of a false positive is high. For example, imagine you work in real estate and would like to offer discounts only to previous customers who are likely to be serious about buying their next home, as it would be impractical and costly for your company to offer discounts to everyone. In this case, you may want to consider precision, which incorporates false positives into its formula.
  • Despite the advantage of considering false positives, the drawback of precision is that it does not account for false negatives, i.e., customers who are wrongly classified by the model as unlikely to buy but actually would buy. If you use this metric, you may miss targeting these potential buyers.
  • Choosing between precision and recall depends on which type of error you consider more costly. Recall is preferable when missing a positive case has serious consequences (i.e., when a false negative is costly), such as in cancer detection or public health interventions, where the stakes involve significant financial costs or life-and-death outcomes. Precision is preferable when incorrectly assigning the positive class is expensive (i.e., when a false positive is costly), as is often the case in banking, finance, marketing, or real estate.
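
For completeness, here is precision computed from the same cancer counts; note how low it is even though accuracy and recall looked respectable, because the model produces many false positives.

```python
# Precision from the cancer-detection counts above
tp, fp = 123, 2000

precision = tp / (tp + fp)
print(f"Precision: {precision:.3f}")  # ~0.058: most predicted positives are false alarms

# Equivalent call when you have the raw labels:
# from sklearn.metrics import precision_score
# precision_score(y_true, y_pred)
```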

6. F1 Score

  • If you care about both precision and recall, you may consider the F1 score as your evaluation metric.
  • The score is the harmonic mean of precision and recall and ranges from 0 to 1. An F1 score of 1 means perfect precision and recall; an F1 score of 0 means that either precision or recall is 0.
  • The formula is: F1 = 2 × (precision × recall) / (precision + recall)
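
Continuing the same worked example, here is F1 computed from precision and recall; the low precision drags the harmonic mean down even though recall is high.

```python
# F1 as the harmonic mean of precision and recall (cancer example above)
precision = 123 / (123 + 2000)  # ~0.058
recall = 123 / (123 + 27)       # ~0.82

f1 = 2 * precision * recall / (precision + recall)
print(f"F1 score: {f1:.3f}")    # ~0.108

# Equivalent call when you have the raw labels:
# from sklearn.metrics import f1_score
# f1_score(y_true, y_pred)
```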

7. Area under the curve (AUC) (also known as the area under the Receiver Operating Characteristic (ROC) curve)

  • The ROC curve is a plot that represents the trade-off between sensitivity (the true positive rate) and 1 − specificity (the false positive rate) across possible thresholds, and the AUC summarizes that curve as a single number. When you have a binary outcome for which you need to identify a threshold to categorize the outcome probabilities into 0 or 1, the ROC curve can help in identifying the best cutoff value.
  • As I mentioned earlier, the most common cutoff point is 0.5. However, this also depends on the nature of the project you are working on. For example, if you are working on a project where a lower threshold should be set, such as estimating whether a person should be isolated during a new pandemic, the threshold of 0.3 might be used. You wouldn’t want to use a higher threshold and then let people walk out, potentially spreading the virus!
  • Imagine that you are working with a cancer detection dataset and would like to see how changing the cutoff point affects sensitivity and specificity, starting from the standard cutoff of 0.5. You can plot a ROC curve that would look something like the following:


  • The thresholds themselves are not labeled on the plot, but each point on the curve corresponds to one threshold. If you set a very low threshold (e.g., 0.1), most cases would be classified as positive. As a result, your true positive rate would be high (closer to 1), and your false positive rate (i.e., the proportion of actual negatives classified as positive) would also be high, or closer to 1, because more negatives are incorrectly labeled as positives.
  • Therefore, this liberal 0.1 threshold would likely be positioned closer to the upper right corner of the ROC curve. Selecting this threshold indicates that you prioritize sensitivity (or recall) more than specificity (also known as the true negative rate). This threshold is suitable in scenarios where missing true positives (e.g., failing to detect a serious disease like cancer) is more critical than dealing with false positives, such as cases where people might be misdiagnosed with cancer. It is better to be safe than sorry.
  • Conversely, if you set a more conservative threshold like 0.8, most cases would be classified as negative. Consequently, both your true positive and false positive rates would be lower, as most cases would be classified as negative. Thus, your more conservative 0.8 threshold would appear towards the lower left part of the ROC curve. Selecting this threshold means that you are prioritizing specificity over sensitivity. This approach could be appropriate in scenarios where false positives lead to high costs or consequences, such as avoiding unnecessary medical treatments based on false diagnoses when insurance may not cover the treatment costs.
  • You can use roc_curve from scikit-learn in Python to see the array of thresholds that corresponds to each point on the ROC curve, giving you a better idea of which threshold produces which true positive and false positive rates (see the sketch after this list).
  • Note that the closer the AUC is to 1, the more predictive power the model has. Conversely, the closer the AUC is to 0.5, the closer the predictive power is to random guessing (i.e., the model does not effectively categorize positive and negative cases from each other).
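
Here is a minimal sketch of inspecting thresholds with scikit-learn's roc_curve and summarizing the curve with roc_auc_score; the labels and probabilities are randomly generated stand-ins for a real model's predicted scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Stand-in data: 0/1 labels and predicted probabilities from a hypothetical classifier
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.25, size=200), 0, 1)

# Each threshold produces one (false positive rate, true positive rate) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in list(zip(fpr, tpr, thresholds))[:5]:
    print(f"threshold={thr:.2f}  TPR={t:.2f}  FPR={f:.2f}")

# The closer the AUC is to 1, the better the model separates the classes
print("AUC:", round(roc_auc_score(y_true, y_score), 3))
```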

Metrics for Classifiers (Multicategorical Outcomes)

Beyond models predicting binary outcomes, a confusion matrix can be used to evaluate models predicting multicategorical outcomes. The confusion matrix for multicategorical outcomes could look like this (here, for three cancer types):

| | Predicted: Lung Cancer | Predicted: Breast Cancer | Predicted: Skin Cancer |
|---|---|---|---|
| Actual: Lung Cancer | 50 | 8 | 5 |
| Actual: Breast Cancer | 7 | 75 | 3 |
| Actual: Skin Cancer | 4 | 2 | 80 |

Similar to the matrix for binary outcomes, the matrix for multicategorical outcomes can be used to calculate accuracy, sensitivity, specificity, and precision. However, the calculations need to be adapted for multiple classes, using one of two primary averaging methods:

Macro-Averaging

This method calculates each metric separately for each class and then takes the unweighted average, giving equal importance to each class regardless of its size. When your data has class imbalance, macro-averaging is useful because poor performance on a minority class is not hidden behind strong performance on the majority class. The flip side is that, because each class carries equal weight, a small class can disproportionately pull the macro-averaged score up or down. Note that when classes are balanced, micro- and macro-averaging yield similar results for most metrics.

Micro-Averaging

This method pools the contributions of all instances across classes (for example, summing the correct predictions over all classes) before computing the metric. With class imbalance, the result is dominated by the majority class, so micro-averaging is not a good way to evaluate the performance of each class separately. It can, however, be appropriate for evaluating the overall effectiveness of the model, especially if the majority class is the most important, as is typically the case in banking or consumer applications.

In summary, if you care more about every class being detected well, macro-averaging is the better option. If you care more about overall model performance, micro-averaging is the better option.

Let's take a look at an example using the confusion matrix above:

Accuracy Calculation for Each Class:

  • Lung Cancer: (50 / (50 + 8 + 5)) ≈ 79%
  • Breast Cancer: (75 / (7 + 75 + 3)) ≈ 88%
  • Skin Cancer: (80 / (4 + 2 + 80)) ≈ 93%

Macro-Averaging: Overall Accuracy: (79% + 88% + 93%) / 3 ≈ 86.67%

Micro-Averaging:

  • Total Correct Predictions: 50 (Lung) + 75 (Breast) + 80 (Skin) = 205
  • Total Predictions Made: 63 (Lung) + 85 (Breast) + 86 (Skin) = 234
  • Micro-Averaged Accuracy: 205 / 234 ≈ 87.61%

Now you have an understanding of how to calculate accuracy using macro- and micro-averaging. This concept can be applied to other metrics such as precision, recall, and specificity.
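
The macro- and micro-averaged figures above can be reproduced directly from the 3×3 confusion matrix with numpy (rows are actual classes, columns are predicted classes, in the order lung, breast, skin):

```python
import numpy as np

# Confusion matrix from the worked example (rows = actual, columns = predicted)
cm = np.array([
    [50,  8,  5],   # actual lung cancer
    [ 7, 75,  3],   # actual breast cancer
    [ 4,  2, 80],   # actual skin cancer
])

correct_per_class = np.diag(cm)     # [50, 75, 80]
total_per_class = cm.sum(axis=1)    # [63, 85, 86]

per_class = correct_per_class / total_per_class    # ~[0.79, 0.88, 0.93]
macro = per_class.mean()                           # ~0.869
micro = correct_per_class.sum() / cm.sum()         # 205 / 234 ~ 0.876

print("Per-class:", np.round(per_class, 3))
print(f"Macro-averaged: {macro:.2%}")
print(f"Micro-averaged: {micro:.2%}")
```

The macro figure here differs slightly from the 86.67% above only because the worked example rounds each per-class value to a whole percentage before averaging.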

Evaluation Metrics for Continuous Outcomes

For continuous outcomes, the model aims to predict numeric values instead of categorical ones. The goal of all regressors (i.e., models predicting continuous outcomes) is to minimize the distance between observed and predicted values, as shown in the equation:

ei = yi − ŷi, where yi is the observed value and ŷi is the predicted value for instance i

To better understand errors and how they are produced, consider one of the simplest ML algorithms: linear regression. This can be represented by the equation:

y = β0 + β1X1 + β2X2 + ε

Initially, the algorithm does not know the best values for β1 and β2. It begins by plugging in random numbers for these parameters and iteratively adjusts them to minimize ε (i.e., the distance between observed and predicted values). When this process is performed iteratively, you get something like the plot below:




From the plot, you can see that the best pair of values for β1 and β2 that minimizes ε lies near the bottom-right corner, where I circled. When you substitute these values back into the equation y = β0 + β1X1 + β2X2 + ε and replace X1 and X2 with the observed values across all instances, you get the error for each instance. These errors can be used to evaluate the model using the following methods:

Mean Absolute Error (MAE)

  • This is calculated by averaging the absolute values of the errors across all samples: MAE = (1/n) Σ |yi − ŷi|.
  • Some people prefer Mean Squared Error (MSE) over MAE because MSE penalizes larger errors more heavily and, unlike the absolute value in MAE, is smoothly differentiable, which makes optimization easier.

Mean Squared Error (MSE)

  • MSE is the average of the squared errors across all samples: MSE = (1/n) Σ (yi − ŷi)². Squaring means that larger errors are penalized more heavily than smaller ones.

Root Mean Squared Error (RMSE)

  • RMSE is the square root of MSE: RMSE = √MSE. Taking the square root brings the error back to the same units as the outcome variable, which makes it easier to interpret.
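
A minimal sketch of all three error metrics with scikit-learn, using made-up observed and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up observed outcomes and model predictions
y_true = np.array([3.0, 5.5, 7.2, 10.0, 12.3])
y_pred = np.array([2.5, 6.0, 7.0, 11.5, 11.8])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # back in the units of the outcome

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
```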

K-Fold Cross-Validation

Now that you have learned all of these important evaluation metrics for machine learning, it is worth noting that it is never sufficient to evaluate a model just once and then assume it performs well or poorly.

A standard practice in machine learning is to use multiple samples, ideally independent from each other, and average the evaluation metrics to assess overall model performance.

  • The process is known as K-fold cross-validation, where the data is split into k folds. In each iteration, k − 1 folds are used as the training set and the remaining fold as the testing set, and the process is repeated k times so that every fold serves as the test set once. Some may suggest having an additional validation set to fine-tune the model before final testing and deployment. You can read more about cross-validation techniques in my previous post, which mainly discusses ensemble techniques but also covers cross-validation in detail.
  • In addition to averaging evaluation metrics across folds for regressors (i.e., models predicting continuous outcomes) and classifiers (i.e., models predicting categorical outcomes), we may also use statistical tests to determine whether one model performs better than another. A common method is the paired t-test, which checks whether the average scores of two models are statistically different. A paired test is appropriate because the K-fold cross-validation scores are not independent: they are computed on folds resampled from the same data.
  • However, some researchers argue that the paired t-test may not be ideal due to its sensitivity to outliers. Alternatives such as Wilcoxon's signed-rank test or DeLong's test are recommended, depending on the metric being compared. I encourage you to read the article by Rainio, Teuho, and Klén, which explains which statistical methods to use for comparing evaluation metrics across models. Their Figure 3 is particularly informative. You can find the article here.
  • In summary, the choice of statistical test depends on your data (e.g., whether the metric values satisfy parametric assumptions), the number of models you want to compare (two or more), and whether you are comparing the means or the variances of the evaluation outcomes.
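
Here is a minimal sketch of 5-fold cross-validation with scikit-learn, using a logistic regression on synthetic data; averaging the per-fold scores gives the kind of summary discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data as a stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat so every fold is tested once
scores = cross_val_score(model, X, y, cv=5, scoring="recall")

print("Recall per fold:", scores.round(3))
print("Mean recall:", round(scores.mean(), 3))
```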

There you have it! I hope this post helps you learn the basics of evaluation metrics for machine learning. Let me know in the comments below if you have any questions!
