Handbook for metric selection and model evaluation

Evaluating machine learning (ML) models is a crucial step in the model development process. It allows us to determine how well a model is performing and to make the adjustments needed to improve it. Many metrics are available, and the right choice depends on (1) the type of machine learning problem and (2) the specific use case or business problem, so it is important to choose the one that fits your situation.

In this article, we will discuss the different metrics for evaluating ML models, why it is important to choose the right metric, and how to choose which metric to use for your model.

I. Classification Models

The most common metrics used for classification models include accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC) curve.

  • Accuracy is the proportion of correctly classified instances out of the total number of instances. It is a simple and easy-to-understand metric, but it can be misleading when the class distribution is imbalanced.
  • Precision is the proportion of true positive instances out of the total number of positive instances predicted by the model. It is a measure of how many of the positive instances predicted by the model are actually positive.
  • Recall is the proportion of true positive instances out of the total number of actual positive instances. It is a measure of how many of the actual positive instances were correctly identified by the model.
  • F1-score is the harmonic mean of precision and recall. It is a balance between precision and recall, and it is particularly useful when the class distribution is imbalanced.
  • Area under the ROC curve (AUC-ROC) summarizes the ROC curve, which plots the true positive rate against the false positive rate at various classification thresholds. It measures the model's ability to distinguish between positive and negative instances.
  • One good way to see how your classification model is doing is to look at the confusion matrix. It summarizes the model's performance by counting correct and incorrect predictions, broken down by class.

A confusion matrix is typically represented as a table with the following four main elements (a short code sketch follows the list):

  • True Positives (TP): These are the number of instances that were correctly classified as positive.
  • False Positives (FP): These are the number of instances that were incorrectly classified as positive.
  • True Negatives (TN): These are the number of instances that were correctly classified as negative.
  • False Negatives (FN): These are the number of instances that were incorrectly classified as negative.
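To make these definitions concrete, here is a minimal sketch of computing the classification metrics above with scikit-learn (assumed to be available); the labels and probabilities are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                   # actual classes (hypothetical)
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]                   # predicted classes
y_prob = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4, 0.3, 0.7]   # predicted probability of class 1

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))     # needs scores/probabilities, not hard labels

# Confusion matrix: rows are actual classes, columns are predicted classes
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```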


II. Clustering Models

Clustering models are used to group similar instances together into clusters. The goal of a clustering model is to find the underlying structure of the data and divide it into meaningful clusters. Several evaluation metrics can be used to assess the performance of clustering models (a short code sketch follows the list). Some of the most commonly used metrics include:

  • Adjusted Rand Index (ARI): This metric compares the similarity of the predicted clusters with the true clusters. It ranges from -1 to 1, with a higher value indicating a better clustering.
  • Normalized Mutual Information (NMI): This metric compares the mutual information of the predicted clusters and the true clusters. It ranges from 0 to 1, with a higher value indicating a better clustering.
  • Fowlkes-Mallows index (FMI): This metric is defined as the geometric mean of the precision and recall between two clusterings. It ranges from 0 to 1, with a higher value indicating a better clustering.
  • Silhouette Score: This metric measures the similarity of each instance to its own cluster compared to other clusters. It ranges from -1 to 1, with a higher value indicating a better clustering.
  • Davies-Bouldin index (DBI): This metric measures the average similarity between each cluster and its most similar cluster. It ranges from 0 to infinity, with a lower value indicating a better clustering.
  • Calinski-Harabasz index (CHI): This metric calculates the ratio of the between-cluster variance to the within-cluster variance. It ranges from 0 to infinity, with a higher value indicating a better clustering.
  • Cophenetic Correlation Coefficient (CCC): Used for hierarchical clustering, this metric measures how faithfully the dendrogram preserves the original pairwise distances between points. It ranges from -1 to 1, with a higher value indicating a better clustering.
  • Cluster Purity: This metric measures the proportion of points in a cluster that belong to the majority class in that cluster; a higher value indicates a better clustering.
  • Euclidean distance, also known as L2 distance, measures the straight-line distance between two points in a multi-dimensional space. In the context of clustering, it can be used to measure the distance between the points in a cluster and the centroid of that cluster. The smaller the Euclidean distance, the more compact and well-defined the cluster is considered to be.
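Here is a minimal sketch, assuming scikit-learn and NumPy are available, of several of the clustering metrics above on a small synthetic dataset. The internal metrics need only the data and the predicted clusters, while the external ones (ARI, NMI, FMI) also require ground-truth labels, which are typically available only in benchmark settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             normalized_mutual_info_score, fowlkes_mallows_score)

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two synthetic blobs
true_labels = np.array([0] * 50 + [1] * 50)

pred_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: use only the data and the predicted clusters
print("Silhouette       :", silhouette_score(X, pred_labels))          # higher is better
print("Davies-Bouldin   :", davies_bouldin_score(X, pred_labels))      # lower is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, pred_labels))   # higher is better

# External metrics: compare predicted clusters with ground-truth labels
print("ARI:", adjusted_rand_score(true_labels, pred_labels))
print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
print("FMI:", fowlkes_mallows_score(true_labels, pred_labels))
```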

III. Time Series Forecasting / Regression Models

When evaluating machine learning models for time series forecasting or regression, several metrics can be used to assess performance (a short code sketch follows the list). Some of the most commonly used metrics include:

  • Mean Absolute Error (MAE): This metric measures the average absolute difference between the predicted values and the actual values. It is a commonly used metric because it is easy to understand and interpret.
  • Mean Squared Error (MSE): This metric measures the average squared difference between the predicted values and the actual values. It is a commonly used metric because it places more weight on larger errors, making it more sensitive to outliers.
  • Root Mean Squared Error (RMSE): This metric is the square root of the MSE and is used to measure the average magnitude of the error. It is widely used to evaluate the performance of time series forecasting models.
  • Mean Absolute Percentage Error (MAPE): This metric measures the average absolute percentage difference between the predicted values and the actual values. It is useful for comparing the performance of models across different scales.
  • Symmetric Mean Absolute Percentage Error (SMAPE): This metric is similar to the MAPE but is symmetric, meaning it treats the forecast and actual values symmetrically.
  • Theil's U: This statistic compares the forecast errors of the model with those of a naive (no-change) forecast; a value below 1 indicates the model outperforms the naive baseline.
  • Directional accuracy metrics: These are metrics that measure the direction of the forecast, such as the percentage of forecasts that were correctly directionally accurate, or the percentage of forecasts that were within a certain range of the actual values.
  • Time-based metrics: These are metrics that measure the performance of the model over time, such as the overall accuracy of the model over a certain period, or the accuracy of the model during specific time periods.
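As a quick illustration, here is a minimal sketch of the main error metrics above for a hypothetical forecast; MAE, MSE, and MAPE come from scikit-learn, while RMSE and SMAPE are computed directly.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

y_true = np.array([100.0, 110.0, 120.0, 130.0, 125.0])   # actual values (hypothetical)
y_pred = np.array([ 98.0, 112.0, 118.0, 135.0, 120.0])   # forecasted values

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_true, y_pred)     # returned as a fraction, not %

# SMAPE: symmetric variant that treats forecast and actual values the same way
smape = np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2%}  SMAPE={smape:.2%}")
```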

In addition to these metrics, it is also important to evaluate the model's performance on different subsets of the data and to compare it against other models.

IV. Topic Modeling

Topic modeling is a technique used to identify patterns in a corpus of text by grouping similar words together into topics. Several evaluation metrics can be used to assess the performance of topic models (a short code sketch follows the list). Some of the most commonly used metrics include:

  • Perplexity: This metric measures how well a topic model can predict the likelihood of a given text. It is a measure of how well the model fits the data and is calculated as the exponential of the negative log-likelihood of the data. Lower perplexity scores indicate a better model.
  • Coherence: This metric measures the semantic similarity of the words within a topic. It is calculated as the average pairwise similarity of the words within a topic, and is a measure of how coherent the topic is. Higher coherence scores indicate a better model.
  • Topic coherence: This metric is the average coherence score across all the topics in the model; it can be used to judge the quality of the topics as a whole.
  • NPMI (Normalized Pointwise Mutual Information): This metric measures how strongly the top words of a topic co-occur in the corpus. It ranges from -1 to 1, with a higher value indicating a better topic.
  • Silhouette score: This is a measure of how similar an object is to its own cluster compared to other clusters. It can be used in the context of topic modeling to evaluate how well the documents are assigned to the topics.
  • Jaccard Similarity: This metric measures the similarity between two sets. It can be used in the context of topic modeling to evaluate the similarity between the words of two different topics.
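As a small illustration, here is a minimal sketch of fitting an LDA topic model with scikit-learn and computing its perplexity on held-out documents; the tiny corpus is hypothetical. Coherence-style metrics such as NPMI are not built into scikit-learn, so they are only pointed to in the comments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["the cat sat on the mat", "dogs and cats are pets",
              "stock markets fell sharply", "investors sold shares today"]
test_docs  = ["the dog chased the cat", "markets and shares moved today"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test  = vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)

# Lower perplexity on held-out text indicates a better-fitting model
print("Held-out perplexity:", lda.perplexity(X_test))

# Coherence-style metrics (e.g. NPMI) are not part of scikit-learn;
# libraries such as gensim provide a CoherenceModel for that purpose.
```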

What happens if you choose the wrong metric?

Let's consider a classification model. Accuracy is a good metric when the class distribution is balanced, but precision and recall are more appropriate when it is imbalanced. If you don't choose the right metric, you may end up with a model that performs well on one metric but poorly on another. Imagine a dataset with 99% negative class and 1% positive class. If we optimize our model for accuracy, it will learn to predict everything as the negative class in order to reach 99% accuracy. This can be catastrophic if we are using the model for a critical use case like predicting a disease for a patient, because every actual positive comes back as a false negative. In such cases, looking at the confusion matrix is the best way to go, and a metric like the F1-score gives a better evaluation of the model's effectiveness (see the sketch below).
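Here is a minimal sketch of that failure mode, assuming scikit-learn: on a 99%/1% dataset, a model that always predicts the negative class reaches 99% accuracy while its recall and F1 for the positive class are zero.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix

y_true = [1] * 10 + [0] * 990      # 1% positive class (hypothetical)
y_pred = [0] * 1000                # model always predicts the negative class

print("Accuracy:", accuracy_score(y_true, y_pred))                 # 0.99
print("Recall  :", recall_score(y_true, y_pred, zero_division=0))  # 0.0
print("F1      :", f1_score(y_true, y_pred, zero_division=0))      # 0.0
print(confusion_matrix(y_true, y_pred))  # all 10 positives end up as false negatives
```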

Now let's say we are building a text clustering model and decide to use Euclidean distance. Here the data points are sentences or paragraphs, and Euclidean distance is not appropriate because it does not take the meaning of the text into account; metrics such as cosine similarity, Jaccard similarity, or Jensen-Shannon divergence are more appropriate for this kind of data (a short illustration follows).
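As a quick illustration of why the distance choice matters, this sketch (assuming scikit-learn) compares cosine similarity and Euclidean distance on raw word-count vectors for a short and a long document about the same topic; the sentences are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["evaluate the model with the right metric",
        "evaluate the model with the right metric, evaluate the model with the "
        "right metric, and always evaluate the model with the right metric"]

# Raw word counts: document length strongly affects the vector magnitudes
X = CountVectorizer().fit_transform(docs)

print("Cosine similarity :", cosine_similarity(X[0], X[1])[0, 0])    # high: same direction
print("Euclidean distance:", euclidean_distances(X[0], X[1])[0, 0])  # large: lengths differ
```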


Then, how should I choose the right metric?

Here is a step-by-step approach for choosing the right model metric for your machine learning model:

  1. Understand the problem and task: Before selecting a metric, it is important to understand the problem you are trying to solve and the task that your model is designed for (e.g. classification, regression, clustering).
  2. Define the objectives: Identify the business objectives of your model and what you hope to achieve with it. This will help you to determine which metrics are most important to track.
  3. Consider the trade-offs: Different metrics have different trade-offs and may be more or less suitable depending on the problem and objectives. For example, accuracy is a simple and widely used metric, but it may not be appropriate if the dataset is imbalanced.
  4. Evaluate multiple metrics: Don't rely on a single metric. Instead, evaluate multiple metrics and compare the results to get a more comprehensive understanding of your model's performance.
  5. Compare against a baseline: Compare the performance of your model against a simple baseline or a traditional method to get a sense of how well your model is doing.
  6. Validate your choice: Finally, validate your choice of metric by testing your model on a hold-out test set or using cross-validation (see the sketch after this list).
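Here is a minimal sketch of steps 4 through 6, assuming scikit-learn: several metrics are evaluated with cross-validation and compared against a trivial baseline; the data and models are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Imbalanced synthetic dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]        # several metrics, not just one

for name, model in [("baseline", DummyClassifier(strategy="most_frequent")),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    scores = cross_validate(model, X, y, cv=5, scoring=scoring)
    summary = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, summary)   # the baseline looks fine on accuracy but collapses on F1
```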

In general, it is good practice to use multiple metrics to evaluate a model, as a single metric may not be sufficient to fully capture its performance. The choice of metric will also depend on the specific problem at hand and the characteristics of the data.

It's important to keep in mind that the best metric is the one that aligns with the business objectives and requirements, and that should be the primary driver when choosing a model metric.

If you want to learn more about model evaluation, read this report, which I found very good: https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/




Note: I used ChatGPT to produce part of this article :)
