On Machine Learning and Deep Learning - Online courses and text books - A point of view – Evaluation Metrics (Part 2)

1.     Introduction:

This article is the continuation of my series of notes on some of the topics involving Machine / Deep Learning. Most of the notes summarized in these blogs are a consolidation of the following online courses / textbooks:

·       Deep Learning Specialization course at deeplearning.ai

·       Approaching (Almost) Any Machine Learning Problem – Abhishek Thakur

·       Machine Learning Yearning – Technical Strategy for AI Engineers in the Era of Deep Learning – Andrew Ng

·       Sources from the Web in general

 As I have stated earlier, I’m compiling a series of such notes in order to form a future reference as I apply these concepts in my work. However, I’m more than happy if these notes are useful to my LinkedIn community comprising some Machine and Deep Learning enthusiasts!

This article lays emphasis on the topic of “Evaluation Metrics”. Choosing an evaluation metric is a significant step in building a Machine Learning model: these metrics provide a measure of the accuracy of the model and thus indicate how robust it is. There are different evaluation metrics that may be considered whilst solving a regression or a classification problem, and these metrics form the focus of this article.

2.     Evaluation Metrics for Regression and Classification Problems:

Evaluation metrics have to be dealt with differently for regression and classification problems. For regression problems the evaluation metrics are straightforward to select and explain, but for classification problems several evaluation metrics come into play depending upon the distribution of the dataset: equally distributed or skewed. This article goes into the details of the evaluation metrics generally used for regression and classification problems; for finer details of these metrics, the course lectures and literature listed above provide further explanation.

Evaluation Metrics for Regression Problems:

·       Error

·       Absolute Error

·       Mean Absolute Error

·       Root Mean Squared Error

·       R2 (R squared) – also known as coefficient of determination

Two of the simplest quantities used in regression are the error and the absolute error: the error is the difference between the true value and the predicted value, and the absolute error is simply the absolute value of that difference.

Mean absolute error (MAE)

The MAE measures the average magnitude of the errors in a set of forecasts/predictions, without considering their direction. It measures accuracy for continuous variables.
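As a quick illustration (a minimal sketch with made-up numbers, using NumPy), the error, absolute error and MAE can be computed as follows;

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed values (toy data)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions (toy data)

error = y_true - y_pred        # signed error per sample
abs_error = np.abs(error)      # absolute error per sample
mae = abs_error.mean()         # mean absolute error
print(mae)                     # 0.5 for these numbers
```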

Root Mean Squared Error

In this case, the differences between the predicted and the corresponding observed values are each squared and then averaged over the sample. Finally, the square root of that average is taken.

It should be underscored here that since the errors are squared before they are averaged, the RMSE gives relatively high weight to large errors. RMSE is therefore most useful when large errors are particularly undesirable, which helps the AI practitioner decide where the model needs further refinement.
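Continuing with the same toy numbers, a minimal RMSE sketch squares the errors, averages them and then takes the square root;

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # square, average, then take the root
print(rmse)                                      # ~0.61 for these numbers
```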

R2 (R squared)

It is also known as the coefficient of determination. This metric gives an indication of how well a model fits a given dataset. It indicates how close the regression line (i.e. the plotted predicted values) is to the actual data values. The R-squared value typically lies between 0 and 1, where 0 indicates that the model does not fit the given data and 1 indicates that the model fits the dataset perfectly.

A more intuitive understanding of R-squared can be gained from a graphical representation showing the spread (and the numerical measure) between the measured and predicted values, as highlighted below;


Figure: Low and High R-squared values

A high R-squared value (~87.5%) indicates that the regression model / regression line is closer to the data points than a model with a low R-squared value (~38%).
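R-squared can likewise be computed from its definition as 1 minus the ratio of the residual sum of squares to the total sum of squares; the sketch below uses the same made-up numbers as above;

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(r2)                                        # ~0.95: the line fits these points closely
```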

3.     Evaluation Metrics for Classification Problems:

Evaluation metrics for classification problems have to be thoughtfully selected. Some of the commonly used evaluation metrics for classification problems include;

·       Accuracy

·       Precision

·       Recall

·       F1- Score

·       Area under the ROC (Receiver Operating Characteristic) Curve or simply AUC

·       Logarithmic Loss

·       Precision @ K (P @ K)

·       Average Precision @ K (AP@K)

·       Mean Average Precision @ K (MAP@K)

These metrics are briefly discussed below;

i.         Accuracy: “Accuracy” is the simplest and a well-suited evaluation metric for a binary classification problem wherein we have an equal distribution of positive and negative samples in the training and the validation sets.

In order to explain the above, let us say we are solving a binary classification problem wherein we’re detecting the presence or absence of tuberculosis in patients by reading chest X-ray images.

Let us say we have a training and a validation set comprising 100 positive and 100 negative samples each. If the machine learning model predicts 90% of the X-ray images correctly in the training set and 65% of the X-ray images correctly in the validation set, then the accuracy in the training and the validation set is 90% (0.90) and 65% (0.65) respectively.

o  What if the positive and negative samples are not distributed equally (skewed dataset)? 

Let us say that, in the same problem, we have 80 non-tuberculosis X-ray images and 20 tuberculosis X-ray images in each of the training and validation sets. Now, if the model classifies every image in the validation set as non-tuberculosis, then, as per the definition of accuracy described above, the accuracy is 80%. This metric is misleading in this case: the model may be completely useless at detecting tuberculosis, yet its accuracy will always be 80% as long as it classifies all images as non-tuberculosis.

Hence, in the case of skewed datasets such as this example, different metrics have to be used to evaluate the model.
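A minimal sketch of this pitfall (with hypothetical labels, 1 = tuberculosis and 0 = non-tuberculosis) using scikit-learn;

```python
from sklearn.metrics import accuracy_score

# 80 negative and 20 positive samples, mirroring the skewed example above
y_true = [0] * 80 + [1] * 20
# a degenerate "model" that predicts non-tuberculosis for every image
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.8 -- looks decent, yet not a single TB case is found
```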

ii.         Precision:

Precision tries to answer the following question: What proportion of the positive identifications is actually correct?

Thus,

Precision = TP / (TP + FP)

Where;

TP = True Positive

FP = False Positive

It may be intuitively understood here that “True Positive” denotes the case where the model predicts an X-ray image as positive for tuberculosis and the actual medical result is also positive for tuberculosis; the image has been correctly identified as belonging to a tuberculosis patient, and hence this scenario is termed a “True Positive”.

Similarly, a “False Positive” is when the model predicts an X-ray image as positive for tuberculosis but the actual medical result is negative.
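A minimal sketch of precision on hypothetical labels, computed both by hand and via scikit-learn;

```python
from sklearn.metrics import precision_score

# Hypothetical labels: 1 = tuberculosis, 0 = non-tuberculosis
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Precision = TP / (TP + FP)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
print(tp / (tp + fp))                    # 0.75
print(precision_score(y_true, y_pred))   # 0.75, same value from scikit-learn
```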

iii.         Recall:

 Recall tries to answer the question: What proportion of the actual positives was identified correctly?

Thus,

Recall = TP / (TP + FN)

where FN = False Negative, i.e. the model predicts an image as negative for tuberculosis but the actual medical result is positive.

To fully evaluate a model, one must examine both precision and recall and for an effective model both precision and recall must be high.
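The corresponding sketch for recall, on the same hypothetical labels;

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # same hypothetical labels as above
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Recall = TP / (TP + FN)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
print(tp / (tp + fn))                 # 0.75
print(recall_score(y_true, y_pred))   # 0.75, same value from scikit-learn
```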

Variation of Precision and Recall with the threshold values:

It might be interesting to see how the Precision and Recall values vary with the “threshold”: classification models normally predict probabilities, and the threshold for classifying a sample as positive is often taken as 0.5.

The book by Abhishek Thakur shows that with higher threshold values the True Positives reduce whereas the False Negatives increase; the book gives a detailed discussion and source code on this effect.
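A minimal sketch of the same effect on made-up probabilities (not the book's dataset or code): as the threshold is raised, precision tends to rise while recall falls;

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.55])  # predicted probabilities

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)   # threshold the probabilities
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```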

iv.         F1- Score

 The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

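A minimal sketch computing the F1 score on the same hypothetical labels as above;

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)   # 0.75
r = recall_score(y_true, y_pred)      # 0.75
print(2 * p * r / (p + r))            # harmonic mean computed by hand: 0.75
print(f1_score(y_true, y_pred))       # 0.75, same value from scikit-learn
```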

v.         Area under the ROC (Receiver Operating Characteristic) Curve or simply AUC

 An ROC (Receiver Operating Characteristic) Curve is a graph showing the performance of a classification model at all “thresholds”. The term “thresholds” has been briefly introduced above. The ROC curve plots the following 2 parameters;

o  True Positive Rate: The True Positive Rate is the same as “Recall” as described above; that is;

TPR = TP / (TP + FN)

o  False Positive Rate: False Positive Rate is defined as

FPR = FP / (FP + TN)

where TN = True Negative.

An ROC curve plots the TPR against the FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both the False Positives and the True Positives.

A typical ROC curve is shown below;

[Figure: A typical ROC curve, plotting the True Positive Rate against the False Positive Rate]

Area under the ROC curve: AUC stands for the Area Under the ROC Curve, or simply area under the curve. That is, AUC measures the entire two-dimensional area under the ROC curve, with the area normalized between 0 and 1. A model whose predictions are 100% wrong has an AUC of 0.0 and a model whose predictions are 100% right has an AUC of 1.0.

One way of interpreting AUC is that it measures the quality of the model predictions irrespective of what classification threshold is chosen.


Figure: AUC – Area under the ROC Curve

AUC is a widely used metric in the industry for classification problems and is thus a metric that should be well known to all!
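A minimal sketch computing the ROC curve points and the AUC with scikit-learn, again on made-up probabilities;

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.55]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # (FPR, TPR) points at each threshold
auc = roc_auc_score(y_true, y_prob)                # area under that curve
print(auc)                                         # ~0.92 for this toy data
```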

vi.         Logarithmic loss: For a binary classification problem, the per-sample logarithmic loss is -(y log(p) + (1 - y) log(1 - p)), where y is the true label and p is the predicted probability of the positive class; the logarithmic loss over all samples is simply the average of these per-sample losses. The logarithmic loss penalizes quite heavily for an “incorrect” or “far off” prediction – i.e. the loss penalizes quite heavily for being very sure and very wrong!
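A minimal sketch of the logarithmic loss on made-up probabilities, computed both from the formula above and via scikit-learn;

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.1, 0.8, 0.3]   # predicted probability of the positive class

# Per-sample loss: -(y*log(p) + (1-y)*log(1-p)); log loss is the average over samples
per_sample = [-(y * np.log(p) + (1 - y) * np.log(1 - p)) for y, p in zip(y_true, y_prob)]
print(np.mean(per_sample))
print(log_loss(y_true, y_prob))               # same value from scikit-learn

# A confident but wrong prediction is punished very heavily:
print(log_loss([1], [0.01], labels=[0, 1]))   # ~4.6 for a single very-sure, very-wrong sample
```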

 

4.     Evaluation Metrics for Multi-Class Classification Problems:

 What are Multi-Class Classification Problems? A classification problem with more than two classes, such that each input sample can be classified into one and only one of the classes, is called a multi-class classification problem.

 E.g. classifying a set of dog images into different breeds of dogs, e.g. German Shepherd, Bulldog, Golden Retriever, etc.

 Evaluation metrics: Having discussed the evaluation metrics for binary classification problems in the above paragraphs, the same metrics can be extended to check the robustness of a multi-class classification model. In particular, the concepts of Precision, Recall and F1-Score can be extended to deal with multi-class classification problems. The following definitions may be highlighted in this regard;

o  Macro Averaged Precision

o  Micro Averaged Precision

o  Weighted Averaged Precision

Macro Averaged Precision: In this case one evaluates the precision for each class individually and then averages them.

Micro Averaged Precision: In this case one accumulates the TP and FP counts across all classes and then uses these totals to calculate a single overall Precision.

Weighted Averaged Precision: This is the same as macro-averaged precision, except that the per-class precisions are weighted by the number of items in each class.

The source code for each of the above metrics is provided in the book – Approaching (Almost) Any Machine Learning Problem – Abhishek Thakur.

Similar to these Precision-based metrics for multi-class classification problems, we can define the corresponding metrics for Recall and F1-Score.
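A minimal sketch of the three precision averages with scikit-learn (using, for convenience, the same toy labels as the dog-breed example further below);

```python
from sklearn.metrics import precision_score

# Hypothetical 3-class labels (classes 0, 1 and 2)
y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2, 0]
y_pred = [0, 2, 1, 0, 2, 1, 0, 0, 2, 0]

print(precision_score(y_true, y_pred, average="macro"))     # per-class precisions, averaged equally
print(precision_score(y_true, y_pred, average="micro"))     # pooled TP / (TP + FP) across all classes
print(precision_score(y_true, y_pred, average="weighted"))  # per-class precisions weighted by class size
```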

 Confusion Matrix

Confusion Matrix for binary classification problems

We now come to describe a very important metric for classification problems: Confusion Matrix. Confusion Matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

E.g. let us say we’re working on a binary classification problem of reading chest X-ray images and thus predicting whether the patient has tuberculosis (YES) or not (NO). Let us say we have n = 165 samples; then we can construct a simple table like the one below;

[Table: 2 × 2 confusion matrix of actual vs. predicted labels for the n = 165 X-ray samples]

Thus, from the above table, the TP, TN, FP and FN counts can be read off as below;

                    Predicted: NO    Predicted: YES
Actual: NO               TN               FP
Actual: YES              FN               TP

From the above table, one can derive the Precision, Recall and F1-Score
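A minimal sketch with hypothetical binary labels, reading TN, FP, FN and TP off the confusion matrix and deriving Precision, Recall and F1-Score from them;

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = tuberculosis (YES), 0 = non-tuberculosis (NO)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # 2x2 matrix flattened as TN, FP, FN, TP
print(tn, fp, fn, tp)                                       # 3 1 1 3 for these labels

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)                                # 0.75 0.75 0.75
```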

Confusion Matrix for Multi-class classification problems: Similar to the above, one can develop the Confusion Matrix for a multi-class classification problem. This can be best shown through an example.

E.g. Let us say we’re solving a multi-class classification problem to identify the breed of dogs from a dataset comprising dog images. Let us say we have 3 classes so that;

Class 0: Dog Breed German Shepherd

Class 1: Dog Breed Golden Retriever

Class 2: Dog Breed Labrador Retriever

Now let us say we have the following actual and predicted data from a set of 10 samples;

Actual data for 10 samples: [ 0, 1, 2, 0, 1, 2, 0, 2, 2, 0]

Predicted data for 10 samples: [ 0, 2, 1, 0, 2, 1, 0, 0, 2, 0]

Confusion matrix for the above data may be constructed as below;

                Predicted: 0    Predicted: 1    Predicted: 2
Actual: 0             4               0               0
Actual: 1             0               0               2
Actual: 2             1               2               1
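The same matrix can be obtained with scikit-learn as a quick check;

```python
from sklearn.metrics import confusion_matrix

# Actual and predicted classes from the dog-breed example above
y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2, 0]
y_pred = [0, 2, 1, 0, 2, 1, 0, 0, 2, 0]

print(confusion_matrix(y_true, y_pred))
# [[4 0 0]
#  [0 0 2]
#  [1 2 1]]   -- rows are actual classes, columns are predicted classes
```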

5.     Evaluation Metrics for Multi-Label Classification Problems:

 What are Multi-Label Classification Problems? Typically, a classification problem involves predicting a single label, or a likelihood across two or more class labels; in either case, the classification task assumes that each input belongs to one class only.

On the other hand, some classification problems involve predicting more than one label for a given sample. Such problems are referred to as multi-label classification problems. E.g. an image sample might contain several objects, and the aim of the model might be to predict the list of objects in the given image sample – this is a multi-label classification problem. We have the following metrics available for multi-label classification problems:

o  Precision @ k (P@k)

o  Average Precision @ k (AP@k)

o  Mean Average Precision at k (MAP@k)

Following are some details on these metrics;

 Precision@k (P@k): Here we have a list of original classes for a given sample (i.e. the true classes/set) and a list of predicted classes for the same. P@k is then defined as the number of correctly predicted classes among the top k predictions, divided by k.

 Average Precision@k (AP@k) : Avearge Precision @ k (AP@k) calculates P@k for every k e.g. if we need AP@3, we calculate P@1, P@2, P@3, and divide the sum by 3.

 Mean Average Precision@k (MAP@k): The above metrics P@k and AP@k calculate the accuracy per sample, but in machine learning problems we’re concerned with evaluating the accuracy over all samples; hence we have the Mean Average Precision @ k, which is;

MAP@k = (1 / N) × Σ AP@k, summed over all N samples
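A minimal sketch of these three metrics, following the definitions above (an illustrative implementation, not the book's exact code);

```python
def precision_at_k(y_true, y_pred, k):
    """Fraction of the top-k predicted labels that appear in the true label set."""
    if k == 0:
        return 0.0
    return len(set(y_pred[:k]) & set(y_true)) / k

def average_precision_at_k(y_true, y_pred, k):
    """Average of P@1 ... P@k for a single sample."""
    return sum(precision_at_k(y_true, y_pred, i) for i in range(1, k + 1)) / k

def mean_average_precision_at_k(y_true_list, y_pred_list, k):
    """Mean of AP@k over all samples."""
    ap = [average_precision_at_k(t, p, k) for t, p in zip(y_true_list, y_pred_list)]
    return sum(ap) / len(ap)

# Toy multi-label data: true object lists and predicted object lists per image
y_true_list = [[1, 2, 3], [0, 2], [1]]
y_pred_list = [[0, 1, 2], [1, 0, 2], [1, 2, 3]]
print(mean_average_precision_at_k(y_true_list, y_pred_list, k=3))   # ~0.46
```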


 


