Understanding the ROC & AUC


Introduction

In any machine learning project, we need to evaluate the performance of the model. The ROC (Receiver Operating Characteristic) curve, together with the AUC (Area Under the Curve) computed from it, is a widely used way to visualize and summarize the performance of a classification model.

Usually AUC/ROC is used for two-class problems; however, it can also be extended to multi-class problems. When making predictions for a two-class classification problem, each prediction made by a classifier falls into one of the following four categories:

  1. False Positive (FP): predict an event when there was no event.
  2. False Negative (FN): predict no event when in fact there was an event.
  3. True Positive (TP): predict an event when there was an event.
  4. True Negative (TN): predict no event when in fact there was no event.

These four outcomes are usually represented in a matrix called the confusion matrix.

A confusion matrix is a table with two rows and two columns that cross-tabulates the model’s predicted values against the actual values.
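One common layout, with predicted values in the rows and actual values in the columns, is:

                        Actual: Positive       Actual: Negative
Predicted: Positive     True Positive (TP)     False Positive (FP)
Predicted: Negative     False Negative (FN)    True Negative (TN)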


True Positive (TP):

Interpretation: You predicted positive and it’s true.

You predicted that it will rain and it actually rained.

True Negative (TN):

Interpretation: You predicted negative and it’s true.

You predicted that it won’t rain and it didn’t.

False Positive (FP): (Type 1 Error)

Interpretation: You predicted positive and it’s false.

You predicted that it will rain; however, it didn’t.

False Negative (FN): (Type 2 Error)

Interpretation: You predicted negative and it’s false.

You predicted that it won’t rain, but it actually did rain.
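As an illustration, these four counts can be read off directly with scikit-learn; the rain labels below are made up for the example (1 = rain, 0 = no rain):

from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # what really happened
predicted = [1, 0, 0, 1, 1, 0, 1, 0]   # what the model forecast

# With the labels ordered [0, 1], ravel() returns the counts as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tp, tn, fp, fn)                   # -> 3 3 1 1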


Accuracy of the model:

Accuracy is simply the number of correct predictions the model makes divided by the total number of predictions made. For instance, if the classifier is 82% accurate, it means that out of 100 predictions, 82 are correct.
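In terms of the confusion matrix counts, accuracy is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)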



Precision & Recall:

Precision and Recall are two metrics calculated for each of the classes that we are dealing with. Precision is the fraction of positive predictions that are actually positive (True Positives out of all predicted positives), and Recall is the fraction of actual positives that the model correctly identifies (True Positives out of all actual positives).
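Expressed with the confusion matrix counts:

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)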


Precision & Recall metric also has an associated term called F1-score when it comes to measure the accuracy of the classifier:



Sensitivity & Specificity:

Sensitivity & Specificity are similar to Precision and Recall with minute difference. Sensitivity is the percentage of negative predictions which are actually negative. And Specificity is the percentage of positive predictions which are actually positive.


Sensitivity and Specificity trade off against each other: lowering the classification threshold to increase Sensitivity typically decreases Specificity, and vice versa.

 

Receiver Operating Characteristics (ROC) curve:

The confusion matrix discussed above gives us all the accuracy metrics, namely Precision-Recall and Sensitivity-Specificity, provided the model outputs hard predictions, i.e. positive or negative class labels.

However, there are scenarios where the model does not output class labels directly; instead, it outputs a probability of occurrence. In such cases, we need to define a cut-off threshold to turn the probabilities into class labels. The ROC curve plots the false positive rate on the x-axis against the true positive rate on the y-axis, with each point corresponding to a different threshold.
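As a sketch of how such a curve is produced, scikit-learn’s roc_curve sweeps the threshold over the predicted probabilities; the labels and scores below are invented for the example:

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]                    # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55]  # predicted probabilities of the positive class

# One (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

plt.plot(fpr, tpr, label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")  # diagonal reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()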


Area Under Curve (AUC):

Since the ROC curve is a two-dimensional graph representing the accuracy of a given classifier model, it is convenient to reduce it to a single value when comparing the accuracies of different classifiers. The AUC is one way of doing that: it is the area that the ROC curve captures under it. Because the ROC curve lies within the unit square, the area under it cannot be greater than 1.

So the AUC value ranges between 0 and 1. A random-guess classifier produces the diagonal line from (0, 0) to (1, 1) and has an AUC of 0.5, so any useful classifier model should have an AUC greater than 0.5.

 

ROC-AUC score:

The ROC-AUC score of any classifier model can be calculated with the sklearn.metrics library in Python. Below is a simple example of doing the same:

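A minimal sketch, where the synthetic dataset and the logistic regression model are only placeholders for whichever binary classifier is being evaluated:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Illustrative data and model; any classifier with predict_proba works the same way
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Use the predicted probability of the positive class, not the hard labels
y_scores = clf.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_scores))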


Using AUC-ROC curve for multi class problems:

Usually, AUC-ROC curves are plotted for two-class problems. However, when dealing with a multi-class problem, we plot one AUC-ROC curve for each of the N classes using the One-vs-All (also called One-vs-Rest) methodology.

Let’s consider a scenario of having three different classes A, B & C.

We will have three ROC curves in this case (see the sketch after this list):

ROC for A classified against B & C

ROC for B classified against A & C

ROC for C classified against A & B.
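As a sketch, scikit-learn can compute these One-vs-Rest AUCs directly from the per-class probabilities; the synthetic three-class data and the model below are placeholders for the example:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Three-class toy problem standing in for classes A, B and C
X, y = make_classification(n_samples=600, n_classes=3, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)          # one probability column per class

# One-vs-Rest: averages the AUC of each class classified against the other two
auc_ovr = roc_auc_score(y_test, probs, multi_class="ovr")
print("Macro-averaged One-vs-Rest AUC:", auc_ovr)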

