On Machine Learning and Deep Learning - Online courses and text books - A point of view – Evaluation Metrics (Part 2)

1.     Introduction:

This article is the continuation of my series of notes on some of the topics involving Machine / Deep Learning. Most of the notes summarized in these blogs are a consolidation of the following online courses / textbooks:

·       Deep Learning Specialization course at deeplearning.ai

·       Approaching (Almost) Any Machine Learning Problem – Abhishek Thakur

·       Machine Learning Yearning – Technical Strategy for AI Engineers in the Era of Deep Learning – Andrew Ng

·       Sources from the Web in general

 As I have stated earlier, I’m compiling a series of such notes in order to form a future reference as I apply these concepts in my work. However, I’m more than happy if these notes are useful to my LinkedIn community comprising some Machine and Deep Learning enthusiasts!

This article lays emphasis on the topic of “Evaluation Metrics”. Choosing an evaluation metric is a significant step in building a Machine Learning model: these metrics provide a measure of the accuracy of the model and thus indicate how robust it is. There are different evaluation metrics that may be considered whilst solving a regression or a classification problem, and these metrics form the focus of this article.

2.     Evaluation Metrics for Regression and Classification Problems:

Evaluation metrics have to be dealt with differently for regression and classification problems. For regression problems the evaluation metrics are straightforward to select and explain, but for classification problems several evaluation metrics come into play depending upon the distribution of the dataset: equally distributed or skewed. This article goes into the details of the evaluation metrics generally used for regression and classification problems; for finer details of these metrics, the course lectures and literature listed above provide further explanation.

Evaluation Metrics for Regression Problems:

·       Error

·       Absolute Error

·       Mean Absolute Error

·       Root Mean Squared Error

·       R2 (R squared) – also known as coefficient of determination

Two of the simplest quantities used in regression are the error and the absolute error: the error is the difference between the true value and the predicted value, and the absolute error is simply the absolute value of that difference.

Mean absolute error (MAE)

The MAE measures the average magnitude of the errors in a set of forecasts/predictions, without considering their direction. It measures accuracy for continuous variables.
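As a quick illustration (a minimal sketch with made-up numbers, using NumPy), the error, absolute error and MAE can be computed as follows;

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed values (toy data)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions (toy data)

error = y_true - y_pred        # signed error per sample
abs_error = np.abs(error)      # absolute error per sample
mae = abs_error.mean()         # mean absolute error
print(mae)                     # 0.5 for these numbers
```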

Root Mean Squared Error

In this case, the differences between the predicted and the corresponding observed values are each squared and then averaged over the sample. Finally, the square root of that average is taken.

It should be underscored here that since the errors are squared before they are averaged, the RMSE gives relatively high weight to large errors. RMSE is therefore most useful when large errors are particularly undesirable, which helps the AI practitioner decide where the model needs further refinement.
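Continuing with the same toy numbers, a minimal RMSE sketch squares the errors, averages them and then takes the square root;

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # square, average, then take the root
print(rmse)                                      # ~0.61 for these numbers
```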

R2 (R squared)

It is also known as the coefficient of determination. This metric gives an indication of how well a model fits a given dataset. It indicates how close the regression line (i.e. the plotted predicted values) is to the actual data values. The R-squared value typically lies between 0 and 1, where 0 indicates that the model does not fit the given data and 1 indicates that the model fits the dataset perfectly.

A more intuitive understanding of R-squared can be gained from a graphical representation showing the spread (and the numerical measure) between the measured and predicted values, as highlighted below;


Figure: Low and High R-squared values

A high R-squared value (~87.5%) indicates that the regression model / regression line is closer to the data points than a model with a low R-squared value (~38%).
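R-squared can likewise be computed from its definition as 1 minus the ratio of the residual sum of squares to the total sum of squares; the sketch below uses the same made-up numbers as above;

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(r2)                                        # ~0.95: the line fits these points closely
```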

3.     Evaluation Metrics for Classification Problems:

Evaluation metrics for classification problems have to be thoughtfully selected. Some of the commonly used evaluation metrics for classification problems include;

·       Accuracy

·       Precision

·       Recall

·       F1- Score

·       Area under the ROC (Receiver Operating Characteristic) Curve or simply AUC

·       Logarithmic Loss

·       Precision @ K (P @ K)

·       Average Precision @ K (AP@K)

·       Mean Average Precision @ K (MAP@K)

These metrics are briefly discussed below;

i.         Accuracy: “Accuracy” is the simplest and a well-suited evaluation metric for a binary classification problem wherein we have an equal distribution of positive and negative samples in the training and the validation sets.

In order to explain the above, let us say we are solving a binary classification problem wherein we’re detecting the presence or absence of tuberculosis in patients by reading chest X-ray images.

Let us say we have a training and a validation set comprising 100 positive and 100 negative samples each. If the machine learning model predicts 90% of the X-ray images correctly in the training set and 65% of the X-ray images correctly in the validation set, then the accuracy in the training and the validation set is 90% (0.90) and 65% (0.65) respectively.

o  What if the positive and negative samples are not distributed equally (skewed dataset)? 

Let us say that, in the same problem, we have 80 non-tuberculosis X-ray images and 20 tuberculosis X-ray images in each of the training and validation sets. Now, if the model classifies every image in the validation set as non-tuberculosis, then, as per the definition of accuracy described above, the accuracy is 80%. This metric is misleading in this case: the model may be completely useless at detecting tuberculosis, yet its accuracy will always be 80% as long as it classifies all images as non-tuberculosis.

Hence, in the case of skewed datasets such as this example, different metrics have to be used to evaluate the model.
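A minimal sketch of this pitfall (with hypothetical labels, 1 = tuberculosis and 0 = non-tuberculosis) using scikit-learn;

```python
from sklearn.metrics import accuracy_score

# 80 negative and 20 positive samples, mirroring the skewed example above
y_true = [0] * 80 + [1] * 20
# a degenerate "model" that predicts non-tuberculosis for every image
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.8 -- looks decent, yet not a single TB case is found
```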

ii.         Precision:

Precision tries to answer the following question: What proportion of the positive identifications is actually correct?

Thus,

Precision = TP / (TP + FP)

Where;

TP = True Positive

FP = False Positive

It may be intuitively understood here that “True Positive” denotes the case where the model predicts an X-ray image as positive for tuberculosis and the actual medical result is also positive for tuberculosis; the image has been correctly identified as belonging to a tuberculosis patient, and hence this scenario is termed a “True Positive”.

Similarly, a “False Positive” is when the model predicts an X-ray image as positive for tuberculosis but the actual medical result is negative.
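A minimal sketch of precision on hypothetical labels, computed both by hand and via scikit-learn;

```python
from sklearn.metrics import precision_score

# Hypothetical labels: 1 = tuberculosis, 0 = non-tuberculosis
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Precision = TP / (TP + FP)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
print(tp / (tp + fp))                    # 0.75
print(precision_score(y_true, y_pred))   # 0.75, same value from scikit-learn
```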

iii.         Recall:

 Recall tries to answer the question: What proportion of the actual positives was identified correctly?

Thus,

Recall = TP / (TP + FN)

where FN = False Negative, i.e. the model predicts an image as negative for tuberculosis but the actual medical result is positive.

To fully evaluate a model, one must examine both precision and recall and for an effective model both precision and recall must be high.
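The corresponding sketch for recall, on the same hypothetical labels;

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # same hypothetical labels as above
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Recall = TP / (TP + FN)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
print(tp / (tp + fn))                 # 0.75
print(recall_score(y_true, y_pred))   # 0.75, same value from scikit-learn
```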

Variation of Precision and Recall with the threshold values:

It might be interesting to see how the Precision and Recall values vary with the “threshold”: classification models normally predict probabilities, and the threshold for classifying a sample as positive is often taken as 0.5.

The book by Abhishek Thakur shows that with higher threshold values the True Positives reduce whereas the False Negatives increase; the book gives a detailed discussion and source code on this effect.
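A minimal sketch of the same effect on made-up probabilities (not the book's dataset or code): as the threshold is raised, precision tends to rise while recall falls;

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.55])  # predicted probabilities

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)   # threshold the probabilities
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```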

iv.         F1- Score

 The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

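A minimal sketch computing the F1 score on the same hypothetical labels as above;

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)   # 0.75
r = recall_score(y_true, y_pred)      # 0.75
print(2 * p * r / (p + r))            # harmonic mean computed by hand: 0.75
print(f1_score(y_true, y_pred))       # 0.75, same value from scikit-learn
```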

v.         Area under the ROC (Receiver Operating Characteristic) Curve or simply AUC

 An ROC (Receiver Operating Characteristic) Curve is a graph showing the performance of a classification model at all “thresholds”. The term “thresholds” has been briefly introduced above. The ROC curve plots the following 2 parameters;

o  True Positive Rate: The True Positive Rate is the same as “Recall” as described above; that is;

TPR = TP / (TP + FN)

o  False Positive Rate: False Positive Rate is defined as

FPR = FP / (FP + TN)

where TN = True Negative.

An ROC curve plots the TPR against the FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both the False Positives and the True Positives.

A typical ROC curve is shown below;

[Figure: A typical ROC curve, plotting the True Positive Rate against the False Positive Rate]

Area under the ROC curve: AUC stands for the Area Under the ROC Curve, or simply area under the curve. That is, AUC measures the entire two-dimensional area under the ROC curve, with the area normalized between 0 and 1. A model whose predictions are 100% wrong has an AUC of 0.0 and a model whose predictions are 100% right has an AUC of 1.0.

One way of interpreting AUC is that it measures the quality of the model predictions irrespective of what classification threshold is chosen.


Figure: AUC – Area under the ROC Curve

AUC is a widely used metric in the industry for classification problems and is thus a metric that should be well known to all!
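A minimal sketch computing the ROC curve points and the AUC with scikit-learn, again on made-up probabilities;

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.55]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # (FPR, TPR) points at each threshold
auc = roc_auc_score(y_true, y_prob)                # area under that curve
print(auc)                                         # ~0.92 for this toy data
```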

vi.         Logarithmic loss: For a binary classification problem, the per-sample logarithmic loss is -(y log(p) + (1 - y) log(1 - p)), where y is the true label and p is the predicted probability of the positive class; the logarithmic loss over all samples is simply the average of these per-sample losses. The logarithmic loss penalizes quite heavily for an “incorrect” or “far off” prediction – i.e. the loss penalizes quite heavily for being very sure and very wrong!
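A minimal sketch of the logarithmic loss on made-up probabilities, computed both from the formula above and via scikit-learn;

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.1, 0.8, 0.3]   # predicted probability of the positive class

# Per-sample loss: -(y*log(p) + (1-y)*log(1-p)); log loss is the average over samples
per_sample = [-(y * np.log(p) + (1 - y) * np.log(1 - p)) for y, p in zip(y_true, y_prob)]
print(np.mean(per_sample))
print(log_loss(y_true, y_prob))               # same value from scikit-learn

# A confident but wrong prediction is punished very heavily:
print(log_loss([1], [0.01], labels=[0, 1]))   # ~4.6 for a single very-sure, very-wrong sample
```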

 

4.     Evaluation Metrics for Multi-Class Classification Problems:

 What are Multi-Class Classification Problems? A classification problem with more than two classes, such that each input sample can be classified into one and only one of the classes, is called a multi-class classification problem.

 E.g. classifying a set of dog images into different breeds of dogs, e.g. German Shepherd, Bulldog, Golden Retriever, etc.

 Evaluation metrics: Having discussed the evaluation metrics for binary classification problems in the above paragraphs, the same metrics can be extended to check the robustness of a multi-class classification model. In particular, the concepts of Precision, Recall and F1-Score can be extended to deal with multi-class classification problems. The following definitions may be highlighted in this regard;

o  Macro Averaged Precision

o  Micro Averaged Precision

o  Weighted Averaged Precision

Macro Averaged Precision: In this case one evaluates the precision for each class individually and then averages them.

Micro Averaged Precision: In this case one accumulates the TP and FP counts across all classes and then uses these totals to calculate a single overall Precision.

Weighted Averaged Precision: This is the same as macro-averaged precision, except that the per-class precisions are weighted by the number of items in each class.

The source code for each of the above metrics is provided in the book – Approaching (Almost) Any Machine Learning Problem – Abhishek Thakur.

Similar to these Precision-based metrics for multi-class classification problems, we can define the corresponding metrics for Recall and F1-Score.
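A minimal sketch of the three precision averages with scikit-learn (using, for convenience, the same toy labels as the dog-breed example further below);

```python
from sklearn.metrics import precision_score

# Hypothetical 3-class labels (classes 0, 1 and 2)
y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2, 0]
y_pred = [0, 2, 1, 0, 2, 1, 0, 0, 2, 0]

print(precision_score(y_true, y_pred, average="macro"))     # per-class precisions, averaged equally
print(precision_score(y_true, y_pred, average="micro"))     # pooled TP / (TP + FP) across all classes
print(precision_score(y_true, y_pred, average="weighted"))  # per-class precisions weighted by class size
```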

 Confusion Matrix

Confusion Matrix for binary classification problems

We now come to describe a very important metric for classification problems: Confusion Matrix. Confusion Matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

E.g. let us say we’re working on a binary classification problem of reading chest X-ray images and thus predicting whether the patient has tuberculosis (YES) or not (NO). Let us say we have n = 165 samples; then we can construct a simple table like the one below;

[Table: 2 × 2 confusion matrix of actual vs. predicted labels for the n = 165 X-ray samples]

Thus, from the above table, the TP, TN, FP and FN counts can be read off as below;

                    Predicted: NO    Predicted: YES
Actual: NO               TN               FP
Actual: YES              FN               TP

From the above table, one can derive the Precision, Recall and F1-Score
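A minimal sketch with hypothetical binary labels, reading TN, FP, FN and TP off the confusion matrix and deriving Precision, Recall and F1-Score from them;

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = tuberculosis (YES), 0 = non-tuberculosis (NO)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # 2x2 matrix flattened as TN, FP, FN, TP
print(tn, fp, fn, tp)                                       # 3 1 1 3 for these labels

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)                                # 0.75 0.75 0.75
```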

Confusion Matrix for Multi-class classification problems: Similar to the above, one can develop the Confusion Matrix for a multi-class classification problem. This can be best shown through an example.

E.g. Let us say we’re solving a multi-class classification problem to identify the breed of dogs from a dataset comprising dog images. Let us say we have 3 classes so that;

Class 0: Dog Breed German Shepherd

Class 1: Dog Breed Golden Retriever

Class 2: Dog Breed Labrador Retriever

Now let us say we have the following actual and predicted data from a set of 10 samples;

Actual data for 10 samples: [ 0, 1, 2, 0, 1, 2, 0, 2, 2, 0]

Predicted data for 10 samples: [ 0, 2, 1, 0, 2, 1, 0, 0, 2, 0]

Confusion matrix for the above data may be constructed as below;

                Predicted: 0    Predicted: 1    Predicted: 2
Actual: 0             4               0               0
Actual: 1             0               0               2
Actual: 2             1               2               1
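The same matrix can be obtained with scikit-learn as a quick check;

```python
from sklearn.metrics import confusion_matrix

# Actual and predicted classes from the dog-breed example above
y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2, 0]
y_pred = [0, 2, 1, 0, 2, 1, 0, 0, 2, 0]

print(confusion_matrix(y_true, y_pred))
# [[4 0 0]
#  [0 0 2]
#  [1 2 1]]   -- rows are actual classes, columns are predicted classes
```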

5.     Evaluation Metrics for Multi-Label Classification Problems:

 What are Multi-Label Classification Problems? Typically, a classification problem involves predicting a single label, or a likelihood across two or more class labels; in either case, the classification task assumes that each input belongs to one class only.

On the other hand, some classification problems involve predicting more than one label for a given sample. Such problems are referred to as multi-label classification problems. E.g. an image sample might contain several objects, and the aim of the model might be to predict the list of objects in the given image sample – this is a multi-label classification problem. We have the following metrics available for multi-label classification problems:

o  Precision @ k (P@k)

o  Average Precision @ k (AP@k)

o  Mean Average Precision at k (MAP@k)

Following are some details on these metrics;

 Precision@k (P@k): Here we have a list of original classes for a given sample (i.e. the true classes/set) and a list of predicted classes for the same. P@k is then defined as the number of correctly predicted classes among the top k predictions, divided by k.

 Average Precision@k (AP@k) : Avearge Precision @ k (AP@k) calculates P@k for every k e.g. if we need AP@3, we calculate P@1, P@2, P@3, and divide the sum by 3.

 Mean Average Precision@k (MAP@k): The above metrics P@k and AP@k calculate the accuracy per sample, but in machine learning problems we’re concerned with evaluating the accuracy over all samples; hence we have the Mean Average Precision @ k, which is;

MAP@k = (1 / N) × Σ AP@k, summed over all N samples
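A minimal sketch of these three metrics, following the definitions above (an illustrative implementation, not the book's exact code);

```python
def precision_at_k(y_true, y_pred, k):
    """Fraction of the top-k predicted labels that appear in the true label set."""
    if k == 0:
        return 0.0
    return len(set(y_pred[:k]) & set(y_true)) / k

def average_precision_at_k(y_true, y_pred, k):
    """Average of P@1 ... P@k for a single sample."""
    return sum(precision_at_k(y_true, y_pred, i) for i in range(1, k + 1)) / k

def mean_average_precision_at_k(y_true_list, y_pred_list, k):
    """Mean of AP@k over all samples."""
    ap = [average_precision_at_k(t, p, k) for t, p in zip(y_true_list, y_pred_list)]
    return sum(ap) / len(ap)

# Toy multi-label data: true object lists and predicted object lists per image
y_true_list = [[1, 2, 3], [0, 2], [1]]
y_pred_list = [[0, 1, 2], [1, 0, 2], [1, 2, 3]]
print(mean_average_precision_at_k(y_true_list, y_pred_list, k=3))   # ~0.46
```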


 


