Handbook for metric selection and model evaluation

Evaluating machine learning (ML) models is a crucial step in the model development process. It allows us to determine how well a model is performing and to make the adjustments needed to improve it. Many metrics are available, and the right choice depends on (1) the type of machine learning problem and (2) the specific use case or business problem, so it is important to choose the one that fits your situation.

In this article, we will discuss the different metrics for evaluating ML models, why it is important to choose the right metric, and how to choose which metric to use for your model.

I. Classification Models

The most common metrics used for classification models include accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC) curve.

  • Accuracy is the proportion of correctly classified instances out of the total number of instances. It is a simple and easy-to-understand metric, but it can be misleading when the class distribution is imbalanced.
  • Precision is the proportion of true positive instances out of the total number of positive instances predicted by the model. It is a measure of how many of the positive instances predicted by the model are actually positive.
  • Recall is the proportion of true positive instances out of the total number of actual positive instances. It is a measure of how many of the actual positive instances were correctly identified by the model.
  • F1-score is the harmonic mean of precision and recall. It is a balance between precision and recall, and it is particularly useful when the class distribution is imbalanced.
  • Area under the ROC curve (AUC-ROC) summarizes the ROC curve, which plots the true positive rate against the false positive rate at various classification thresholds. It measures the model's ability to distinguish between positive and negative instances.
  • One good way to see how your classification model is doing is to look at the confusion matrix. It summarizes the model's performance by counting correct and incorrect predictions, broken down by class.

A confusion matrix is typically represented as a table with the following four main elements (a short code sketch follows the list):

  • True Positives (TP): These are the number of instances that were correctly classified as positive.
  • False Positives (FP): These are the number of instances that were incorrectly classified as positive.
  • True Negatives (TN): These are the number of instances that were correctly classified as negative.
  • False Negatives (FN): These are the number of instances that were incorrectly classified as negative.
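To make these definitions concrete, here is a minimal sketch of computing the classification metrics above with scikit-learn (assumed to be available); the labels and probabilities are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                   # actual classes (hypothetical)
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]                   # predicted classes
y_prob = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4, 0.3, 0.7]   # predicted probability of class 1

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))     # needs scores/probabilities, not hard labels

# Confusion matrix: rows are actual classes, columns are predicted classes
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```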


II. Clustering Models

Clustering models are used to group similar instances together into clusters. The goal of a clustering model is to find the underlying structure of the data and divide it into meaningful clusters. Several evaluation metrics can be used to assess the performance of clustering models (a short code sketch follows the list). Some of the most commonly used metrics include:

  • Adjusted Rand Index (ARI): This metric compares the similarity of the predicted clusters with the true clusters. It ranges from -1 to 1, with a higher value indicating a better clustering.
  • Normalized Mutual Information (NMI): This metric compares the mutual information of the predicted clusters and the true clusters. It ranges from 0 to 1, with a higher value indicating a better clustering.
  • Fowlkes-Mallows index (FMI): This metric is defined as the geometric mean of the precision and recall between two clusterings. It ranges from 0 to 1, with a higher value indicating a better clustering.
  • Silhouette Score: This metric measures the similarity of each instance to its own cluster compared to other clusters. It ranges from -1 to 1, with a higher value indicating a better clustering.
  • Davies-Bouldin index (DBI): This metric measures the average similarity between each cluster and its most similar cluster. It ranges from 0 to infinity, with a lower value indicating a better clustering.
  • Calinski-Harabasz index (CHI): This metric calculates the ratio of the between-cluster variance to the within-cluster variance. It ranges from 0 to infinity, with a higher value indicating a better clustering.
  • Cophenetic Correlation Coefficient (CCC): Used for hierarchical clustering, this metric measures how faithfully the dendrogram preserves the original pairwise distances between points. It ranges from -1 to 1, with a higher value indicating a better clustering.
  • Cluster Purity: This metric measures the proportion of points in a cluster that belong to the majority class in that cluster; a higher value indicates a better clustering.
  • Euclidean distance, also known as L2 distance, measures the straight-line distance between two points in a multi-dimensional space. In the context of clustering, it can be used to measure the distance between the points in a cluster and the centroid of that cluster. The smaller the Euclidean distance, the more compact and well-defined the cluster is considered to be.
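Here is a minimal sketch, assuming scikit-learn and NumPy are available, of several of the clustering metrics above on a small synthetic dataset. The internal metrics need only the data and the predicted clusters, while the external ones (ARI, NMI, FMI) also require ground-truth labels, which are typically available only in benchmark settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             normalized_mutual_info_score, fowlkes_mallows_score)

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two synthetic blobs
true_labels = np.array([0] * 50 + [1] * 50)

pred_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: use only the data and the predicted clusters
print("Silhouette       :", silhouette_score(X, pred_labels))          # higher is better
print("Davies-Bouldin   :", davies_bouldin_score(X, pred_labels))      # lower is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, pred_labels))   # higher is better

# External metrics: compare predicted clusters with ground-truth labels
print("ARI:", adjusted_rand_score(true_labels, pred_labels))
print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
print("FMI:", fowlkes_mallows_score(true_labels, pred_labels))
```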

III. Time Series Forecasting / Regression Models

When evaluating machine learning models for time series forecasting or regression, several metrics can be used to assess performance (a short code sketch follows the list). Some of the most commonly used metrics include:

  • Mean Absolute Error (MAE): This metric measures the average absolute difference between the predicted values and the actual values. It is a commonly used metric because it is easy to understand and interpret.
  • Mean Squared Error (MSE): This metric measures the average squared difference between the predicted values and the actual values. It is a commonly used metric because it places more weight on larger errors, making it more sensitive to outliers.
  • Root Mean Squared Error (RMSE): This metric is the square root of the MSE and is used to measure the average magnitude of the error. It is widely used to evaluate the performance of time series forecasting models.
  • Mean Absolute Percentage Error (MAPE): This metric measures the average absolute percentage difference between the predicted values and the actual values. It is useful for comparing the performance of models across different scales.
  • Symmetric Mean Absolute Percentage Error (SMAPE): This metric is similar to the MAPE but is symmetric, meaning it treats the forecast and actual values symmetrically.
  • Theil's U: This statistic compares the forecast errors of the model with those of a naive (no-change) forecast; a value below 1 indicates the model outperforms the naive baseline.
  • Directional accuracy metrics: These are metrics that measure the direction of the forecast, such as the percentage of forecasts that were correctly directionally accurate, or the percentage of forecasts that were within a certain range of the actual values.
  • Time-based metrics: These are metrics that measure the performance of the model over time, such as the overall accuracy of the model over a certain period, or the accuracy of the model during specific time periods.
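As a quick illustration, here is a minimal sketch of the main error metrics above for a hypothetical forecast; MAE, MSE, and MAPE come from scikit-learn, while RMSE and SMAPE are computed directly.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

y_true = np.array([100.0, 110.0, 120.0, 130.0, 125.0])   # actual values (hypothetical)
y_pred = np.array([ 98.0, 112.0, 118.0, 135.0, 120.0])   # forecasted values

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_true, y_pred)     # returned as a fraction, not %

# SMAPE: symmetric variant that treats forecast and actual values the same way
smape = np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2%}  SMAPE={smape:.2%}")
```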

In addition to these metrics, it is also important to evaluate the model's performance on different subsets of the data and to compare it against other models.

IV. Topic Modeling

Topic modeling is a technique used to identify patterns in a corpus of text by grouping similar words together into topics. Several evaluation metrics can be used to assess the performance of topic models (a short code sketch follows the list). Some of the most commonly used metrics include:

  • Perplexity: This metric measures how well a topic model can predict the likelihood of a given text. It is a measure of how well the model fits the data and is calculated as the exponential of the negative log-likelihood of the data. Lower perplexity scores indicate a better model.
  • Coherence: This metric measures the semantic similarity of the words within a topic. It is calculated as the average pairwise similarity of the words within a topic, and is a measure of how coherent the topic is. Higher coherence scores indicate a better model.
  • Topic coherence: This metric is the average coherence score across all the topics in the model; it can be used to judge the quality of the topics as a whole.
  • NPMI (Normalized Pointwise Mutual Information): This metric measures how strongly the top words of a topic co-occur in the corpus. It ranges from -1 to 1, with a higher value indicating a better topic.
  • Silhouette score: This is a measure of how similar an object is to its own cluster compared to other clusters. It can be used in the context of topic modeling to evaluate how well the documents are assigned to the topics.
  • Jaccard Similarity: This metric measures the similarity between two sets. It can be used in the context of topic modeling to evaluate the similarity between the words of two different topics.
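As a small illustration, here is a minimal sketch of fitting an LDA topic model with scikit-learn and computing its perplexity on held-out documents; the tiny corpus is hypothetical. Coherence-style metrics such as NPMI are not built into scikit-learn, so they are only pointed to in the comments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["the cat sat on the mat", "dogs and cats are pets",
              "stock markets fell sharply", "investors sold shares today"]
test_docs  = ["the dog chased the cat", "markets and shares moved today"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test  = vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)

# Lower perplexity on held-out text indicates a better-fitting model
print("Held-out perplexity:", lda.perplexity(X_test))

# Coherence-style metrics (e.g. NPMI) are not part of scikit-learn;
# libraries such as gensim provide a CoherenceModel for that purpose.
```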

What happens if you choose the wrong metric?

Let's consider a classification model. Accuracy is a good metric when the class distribution is balanced, but precision and recall are more appropriate when it is imbalanced. If you don't choose the right metric, you may end up with a model that performs well on one metric but poorly on another. Imagine a dataset with 99% negative class and 1% positive class. If we optimize our model for accuracy, it will learn to predict everything as the negative class in order to reach 99% accuracy. This can be catastrophic if we are using the model for a critical use case like predicting a disease for a patient, because every actual positive comes back as a false negative. In such cases, looking at the confusion matrix is the best way to go, and a metric like the F1-score gives a better evaluation of the model's effectiveness (see the sketch below).
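Here is a minimal sketch of that failure mode, assuming scikit-learn: on a 99%/1% dataset, a model that always predicts the negative class reaches 99% accuracy while its recall and F1 for the positive class are zero.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix

y_true = [1] * 10 + [0] * 990      # 1% positive class (hypothetical)
y_pred = [0] * 1000                # model always predicts the negative class

print("Accuracy:", accuracy_score(y_true, y_pred))                 # 0.99
print("Recall  :", recall_score(y_true, y_pred, zero_division=0))  # 0.0
print("F1      :", f1_score(y_true, y_pred, zero_division=0))      # 0.0
print(confusion_matrix(y_true, y_pred))  # all 10 positives end up as false negatives
```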

Now let's say we are building a text clustering model and decide to use Euclidean distance. Here the data points are sentences or paragraphs, and Euclidean distance is not appropriate because it does not take the meaning of the text into account; metrics such as cosine similarity, Jaccard similarity, or Jensen-Shannon divergence are more appropriate for this kind of data (a short illustration follows).
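As a quick illustration of why the distance choice matters, this sketch (assuming scikit-learn) compares cosine similarity and Euclidean distance on raw word-count vectors for a short and a long document about the same topic; the sentences are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["evaluate the model with the right metric",
        "evaluate the model with the right metric, evaluate the model with the "
        "right metric, and always evaluate the model with the right metric"]

# Raw word counts: document length strongly affects the vector magnitudes
X = CountVectorizer().fit_transform(docs)

print("Cosine similarity :", cosine_similarity(X[0], X[1])[0, 0])    # high: same direction
print("Euclidean distance:", euclidean_distances(X[0], X[1])[0, 0])  # large: lengths differ
```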


Then, how should I choose the right metric?

Here is a step-by-step approach for choosing the right model metric for your machine learning model:

  1. Understand the problem and task: Before selecting a metric, it is important to understand the problem you are trying to solve and the task that your model is designed for (e.g. classification, regression, clustering).
  2. Define the objectives: Identify the business objectives of your model and what you hope to achieve with it. This will help you to determine which metrics are most important to track.
  3. Consider the trade-offs: Different metrics have different trade-offs and may be more or less suitable depending on the problem and objectives. For example, accuracy is a simple and widely used metric, but it may not be appropriate if the dataset is imbalanced.
  4. Evaluate multiple metrics: Don't rely on a single metric. Instead, evaluate multiple metrics and compare the results to get a more comprehensive understanding of your model's performance.
  5. Compare against a baseline: Compare the performance of your model against a simple baseline or a traditional method to get a sense of how well your model is doing.
  6. Validate your choice: Finally, validate your choice of metric by testing your model on a hold-out test set or using cross-validation (see the sketch after this list).
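Here is a minimal sketch of steps 4 through 6, assuming scikit-learn: several metrics are evaluated with cross-validation and compared against a trivial baseline; the data and models are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Imbalanced synthetic dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]        # several metrics, not just one

for name, model in [("baseline", DummyClassifier(strategy="most_frequent")),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    scores = cross_validate(model, X, y, cv=5, scoring=scoring)
    summary = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, summary)   # the baseline looks fine on accuracy but collapses on F1
```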

In general, it is good practice to use multiple metrics to evaluate a model, as a single metric may not be sufficient to fully capture its performance. The choice of metric will also depend on the specific problem at hand and the characteristics of the data.

It's important to keep in mind that the best metric is the one that aligns with the business objectives and requirements, and that should be the primary driver when choosing a model metric.

If you want to learn more about model evaluation, read this report, which I found very good: https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/




Note: I used ChatGPT to produce part of this article :)
