I can barely 'RECALL' with enough 'PRECISION' and little 'SPECIFICITY' what is 'SENSITIVITY'!
Anurag Halder
Director - Analytics | Content Planning & Strategy | Media | Entertainment | OTT | Big Data Analytics | Data Science
I find it very difficult, and a little unfair, to have to remember jargon; anything that forces me to memorize generally fails me. To avoid ambiguity and overcome the challenge, I tend, as much as possible, to arrive at concepts logically. Most of the time this jargon stands for very simple ideas that have simply been christened for reference. One such case, I feel, is the set of terms Recall, Precision, Specificity and Sensitivity.
My aim in this article is to explain, from my own way of looking at the problem, how to solve for:
- Binary Classification Model Performance
- Arriving at Optimal Threshold values for Classification
- Thereby understanding the ROC (Receiver Operating Characteristic) curve and arriving at the AUC (Area Under the Curve)
- It will not cover concepts such as unbalanced classes, but once those are corrected for, these steps are all equally applicable
At the end of the article, I will share the link to a notebook where you can play around with different parameters and see for yourself how the whole thing falls into place.
First Gear: Run the model and get predicted probabilities
Our starting point for this exercise is the stage where you have run your first classification model on your training set and generated predicted probabilities for your test data. Consider the adjacent image.
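To make this concrete, here is a minimal sketch (not the exact notebook linked at the end) of getting to this stage with scikit-learn; the synthetic data, model choice and variable names are my own assumptions for illustration:

```python
# Minimal sketch: train a simple classifier and obtain predicted probabilities
# for the test set. The dataset and names here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
# predict_proba returns probabilities for both classes; keep the positive-class column
pred_proba = model.predict_proba(X_test)[:, 1]
```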
Given that we have the probabilities, let us try to get a first idea of how well our model classified the classes. One of the popular methods for this is the confusion matrix. It works 'given a chosen threshold', i.e. if my predicted value is > 0.5 the example is classified as one, else as zero: a very clean concept to understand. The outcome of a confusion matrix is, given that chosen threshold, the count of actual ones predicted as ones, the count of actual zeros predicted as zeros (the higher the proportion of counts in these two buckets, the better the model is at classifying), the count of actual ones predicted as zeros and the count of actual zeros predicted as ones (i.e. the misclassifications). But the question is: how do we arrive at the optimal threshold?
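A quick sketch of building such a confusion matrix at a chosen threshold of 0.5, continuing with the y_test and pred_proba names assumed above:

```python
# Sketch: confusion matrix for a chosen threshold (0.5 here),
# using the y_test / pred_proba variables assumed earlier.
from sklearn.metrics import confusion_matrix

threshold = 0.5
pred_class = (pred_proba > threshold).astype(int)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, pred_class)
tn, fp, fn, tp = cm.ravel()
print(cm)
```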
Second Gear: Plot some histograms
Let's plot a simple histogram as shown in the adjoining image, where the blue curve plots the histogram of predicted probabilities for the actual negative class (tagged as 0), while the red curve shows the density for the actual positive class (tagged as 1). The first inference: the curves look well separated with only marginal overlap, i.e. the predicted probabilities of the two classes do not overlap much, so choosing the right threshold can give a good confusion matrix. (We will see shortly how these curves look for poorer models.)
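If you want to reproduce such a plot, a rough matplotlib sketch (again using the assumed y_test and pred_proba) could look like this:

```python
# Sketch: overlay histograms of predicted probabilities, split by actual class,
# to see how well the two classes separate.
import matplotlib.pyplot as plt

plt.hist(pred_proba[y_test == 0], bins=30, alpha=0.5, color="blue", label="Actual negative (0)")
plt.hist(pred_proba[y_test == 1], bins=30, alpha=0.5, color="red", label="Actual positive (1)")
plt.xlabel("Predicted probability of positive class")
plt.ylabel("Count")
plt.legend()
plt.show()
```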
Third Gear: Moving around to choose an optimal threshold
In the last section we talked about choosing the right threshold so that we get a good confusion matrix; one meaning of 'good' could be the threshold with the lowest misclassification rate, though from domain to domain you may choose to trade off false positives against false negatives. Note: 'positive' and 'negative' are just nomenclature; you can choose either of your classes as positive and the other as negative. For example, in a classification between two species of iris, say setosa and versicolor, you may assign setosa as the positive class and versicolor as negative, or vice versa.
The confusion matrix shown above helps us understand the concepts of True Positive, False Positive, False Negative and True Negative. Let's define a few ratios on top of them: the True Positive Rate, i.e. how many actual positive examples were correctly predicted as positive, or True Positive / (Actual Positives = True Positive + False Negative). Similarly, the False Positive Rate, i.e. how many of the actual negative examples got classified as positive, or False Positive / (Actual Negatives = False Positive + True Negative). Finally, the Misclassification Rate is the share of all observations that were incorrectly classified, i.e. (False Positive + False Negative) / (Number of examples). I mention only these specific metrics because they will help us chart two more graphs, which we will touch on shortly.
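In code, these three rates are one-liners once you have the confusion-matrix counts; a small sketch using the tn, fp, fn, tp values assumed earlier:

```python
# Sketch: the three rates defined above, computed from the confusion-matrix
# counts (tn, fp, fn, tp) obtained in the earlier snippet.
tpr = tp / (tp + fn)                       # True Positive Rate (sensitivity / recall)
fpr = fp / (fp + tn)                       # False Positive Rate
misclassification_rate = (fp + fn) / (tn + fp + fn + tp)
print(f"TPR={tpr:.3f}, FPR={fpr:.3f}, misclassification={misclassification_rate:.3f}")
```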
Let us revisit the histogram overlap graph. We were talking about obtaining the optimal threshold value, so this is our starting point to arrive at one. Each red vertical line is the threshold selected at that instance; three such instances are shown above. Any example to the right of the threshold is classified as the positive class, else as the negative class.
So we start with a threshold of 0 and evaluate our confusion matrix: the red line is at the extreme left, so the entire blue curve (negative class) lies to the right of the threshold and all actual negative examples are classified as positive, i.e. a 100% False Positive Rate. Similarly, the whole red curve (positive class) lies to the right, so all actual positives are classified as positive, i.e. a 100% True Positive Rate. This continues until the threshold line (red line) enters the blue curve. For simplicity, let's break the problem into the different zones the threshold line can fall in; a small code sketch that sweeps the threshold follows the zones below.
Zone 1: Until the threshold touches the blue curve - the False Positive Rate and True Positive Rate are both 100%, as explained above.
Zone 2: While the threshold is inside the blue curve but before the red curve - some actual negative examples now fall to the left of the threshold, so misclassification of the negative class is reducing and the False Positive Rate starts to drop from 100%; because the red curve (positive class) is still entirely to the right, the True Positive Rate stays at 100%.
Zone 3: When the threshold is in the overlapping zone - some of the positive class now lies on the left-hand side, causing misclassification of the positive class. From the blue curve's perspective we are classifying more and more negatives as negatives, so the False Positive Rate drops steeply, while more of the red curve falling to the left pulls the True Positive Rate down from 100%.
Zone 4: While the threshold is inside the red curve - all negative examples have now been classified as negative, so the False Positive Rate has touched 0%, while as the threshold moves towards the right tail of the red curve it misclassifies more and more of the positive class as negative, causing a steep drop of the True Positive Rate towards 0%.
Zone 5: When the threshold is beyond the red curve - all negative examples (blue curve) are classified as negative, so the False Positive Rate is 0%, while all positive examples (red curve) are misclassified as negative, so the True Positive Rate is also 0%.
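Here is the small threshold-sweep sketch promised above; it walks the threshold across all the zones and records the rates at each step (again using the assumed y_test and pred_proba):

```python
# Sketch: sweep the threshold from 0 to 1 and record TPR, FPR and the
# misclassification rate at each step, tracing out the zones described above.
import numpy as np

thresholds = np.linspace(0, 1, 101)
tprs, fprs, miss = [], [], []
for t in thresholds:
    pred = (pred_proba > t).astype(int)
    tp = np.sum((pred == 1) & (y_test == 1))
    fp = np.sum((pred == 1) & (y_test == 0))
    fn = np.sum((pred == 0) & (y_test == 1))
    tn = np.sum((pred == 0) & (y_test == 0))
    tprs.append(tp / (tp + fn))
    fprs.append(fp / (fp + tn))
    miss.append((fp + fn) / len(y_test))
```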
The scenario above has been explained with respect to the example displayed in the image; how the results behave in other scenarios is shown below, which should give you complete clarity on how it plays out. When the zones explained above are put together in a graph, they form the ROC, or Receiver Operating Characteristic, curve. By looking at the ROC curve and calculating the area below it, you can establish how well your model separates the classes. We will explain how to arrive at that area soon, but first let's look at some example ROC and misclassification curves for various thresholds.
For the given example, the True Positive Rate and False Positive Rate have been plotted for various thresholds. Because of the small overlap between the blue and red curves, the ROC curve does not exactly touch the coordinate (0, 1) but brushes just past it. Similarly, the misclassification is at a minimum when the threshold is at ~0.57. These curves behave very differently as the area of overlap varies, which we will see below.
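A rough sketch of producing such an ROC curve and picking the lowest-misclassification threshold, either from the manual sweep above or via scikit-learn's roc_curve (which returns the same FPR/TPR pairs):

```python
# Sketch: plot the ROC curve and pick the threshold with the lowest
# misclassification rate, reusing the sweep results (thresholds, miss) above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr_sk, tpr_sk, thr_sk = roc_curve(y_test, pred_proba)
plt.plot(fpr_sk, tpr_sk)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.show()

best_t = thresholds[np.argmin(miss)]
print(f"Threshold with lowest misclassification rate: {best_t:.2f}")
```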
Fourth Gear: Measuring AUC
As a final exercise before we link it all together and demonstrate different scenarios, I will briefly touch on finding the area under the ROC curve and how it measures model performance.
To arrive at the area under the ROC curve, break the graph into a set of rectangles, as shown by the green shaded zone. The first rectangle spans x from 0.0 to 0.1 with a height of 0.4, i.e. a width of 0.1 unit and a height of 0.4 units, so its area is 0.04. Similarly, for the next rectangle, 0.1 times 0.6 is 0.06, and you keep adding areas until 0.1 * 1.0 = 0.1; for the blue zone, simply take half the area of the blue rectangle, i.e. 0.5 times 0.1 times 0.1 = 0.005. For the adjoining graph, the AUC, or area under the curve, is therefore:
0.1*0.4 + 0.1*0.6 + 0.1*0.6 + 0.5*0.1*0.1 + 0.1*0.8 + 0.1*0.9 + 0.1*0.9 + 0.1*1.0 + 0.1*1.0 + 0.1*1.0 + 0.1*1.0 = 0.825
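The same rectangle-summing idea can be done numerically; a small sketch that applies a trapezoidal sum to the (FPR, TPR) points from roc_curve above and compares it with scikit-learn's roc_auc_score:

```python
# Sketch: numeric version of the rectangle/triangle summation, using the
# fpr_sk / tpr_sk arrays from the roc_curve snippet above.
import numpy as np
from sklearn.metrics import roc_auc_score

# Trapezoidal sum: width of each strip times the average of its two heights
auc_manual = np.sum(np.diff(fpr_sk) * (tpr_sk[:-1] + tpr_sk[1:]) / 2)
auc_sklearn = roc_auc_score(y_test, pred_proba)
print(f"Manual trapezoidal AUC: {auc_manual:.3f}, sklearn AUC: {auc_sklearn:.3f}")
```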
The more the curve shrinks towards the diagonal, the lower the classification power of the model, because any gain in True Positive Rate then comes with an equal increase in False Positive Rate; a model with random classification power will therefore have an AUC of ~0.5.
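As a quick sanity check of that ~0.5 figure, scores that carry no information about the class should give an AUC close to 0.5; a tiny sketch with random scores (purely an illustrative assumption):

```python
# Sketch: random scores carry no information about the class,
# so the resulting AUC should land near 0.5.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
random_scores = rng.random(len(y_test))
print(f"AUC with random scores: {roc_auc_score(y_test, random_scores):.3f}")
```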
With this I conclude my blabbering and show you how all of the above plays out as you alter the threshold values for various scenarios. Thank you for bearing with me this long; enjoy the simulations. If you found this useful please do like and share, and for any constructive suggestions please feel free to comment and help me improve.
As promised: https://github.com/AnuragHalder/Classification_Performance_Threshold
Top Gear: Scenarios
Scenario 1: Marginal Overlap; AUC ~ 0.9998; Threshold - 0.58
Scenario 2: No Overlap; AUC ~ 1; Threshold between 0.48 and 0.55
Scenario 3: High Overlap; AUC ~ 0.75; Threshold - 0.52
Scenario 4: Complete Overlap; AUC ~ 0.5; Threshold - 0.56