Machine Learning | Accuracy Paradox

Imagine a little boy sitting all day in his room. Let's call him Daniel. The moon is out and it's way past his bedtime. Daniel has been working on his logistic regression model to predict an employee turnover problem. For hours now, Daniel has been trying to improve his model's accuracy. After countless trials and errors, Daniel ran his code one last time. The accuracy of the model turned out to be 87%!

"Mommy, I did it! My model's prediction is 87%! Holy Moly! This is a higher than my last algebra test! Let's put this on the fridge!" — Little Boy Daniel
So how well did Daniel's model actually do? Is accuracy a good measure? Think about it.

Class Imbalance Problem

You take a look at his dataset and find out that he was working on an employee turnover problem (a binary classification problem). The training set had 100 employee observations: 95 employees classified as "no_turnover" and 5 employees classified as "turnover", meaning only 5 employees left the company. This dataset is an example of a class imbalance problem because of the skewed distribution of employees who did and did not leave. The more skewed the class distribution is, the more accuracy breaks down as an evaluation metric.
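
To make the skew concrete, here is a minimal sketch (using NumPy; the variable names and exact values are my own, not from Daniel's actual dataset) of a label vector with 95 zeros and 5 ones:

```python
import numpy as np

# Hypothetical stand-in for Daniel's training labels:
# 0 = "no_turnover", 1 = "turnover"
y_true = np.array([0] * 95 + [1] * 5)

# Show how skewed the class distribution is
values, counts = np.unique(y_true, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {0: 95, 1: 5}
```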

The "Dumb Model"

Knowing what Daniel did wrong (he did not take into account the class imbalance of his problem), you decide to act smart and design a new algorithm called the "Dumb Model". This algorithm literally just predicts every instance as a 0, which means it predicts every employee as no_turnover. That's it.

You go up to him and show him your model. Since 95 of his 100 observations are classified as 0 (no_turnover), your dumb model makes its predictions with 95% accuracy! That's more than Daniel's logistic regression model! You built a model that is completely useless, with exactly zero predictive power, and yet you got an increase in accuracy.
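
Here is a rough sketch of the "Dumb Model" idea using scikit-learn's accuracy_score (a sketch under my own assumptions, not Daniel's actual code):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Same hypothetical labels as above: 95 "no_turnover" (0), 5 "turnover" (1)
y_true = np.array([0] * 95 + [1] * 5)

# The "Dumb Model": predict 0 (no_turnover) for every single employee
y_dumb = np.zeros_like(y_true)

print(accuracy_score(y_true, y_dumb))  # 0.95 -- higher than Daniel's 0.87
```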

If you were to learn one thing from this article it is this:

Accuracy can be misleading.

In this case, evaluating Daniel's logistic regression model by accuracy alone is the wrong thing to measure. We would have to know which errors, and which correct decisions, we actually care about. Accuracy alone does not capture an important concept that needs to be taken into consideration in this type of evaluation: False Positive and False Negative errors. Let's look at the four types of predictions a classification algorithm can make and some basic terminology (a small confusion-matrix sketch follows the list):

Four Types of Classification Predictions

  • True Positives (TP): we correctly predict that the employee does leave (turnover)
  • True Negatives (TN): we correctly predict that the employee does not leave (no_turnover)
  • False Positives (FP, Type I error): we predict that the employee will leave, but they do not leave
  • False Negatives (FN, Type II error): we predict that the employee will not leave, but they do leave
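
Here is the small confusion-matrix sketch promised above. The toy labels and predictions are invented purely for illustration; scikit-learn's confusion_matrix gives you all four counts at once:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy example: 1 = turnover, 0 = no_turnover (made-up values)
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0])

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=2, TN=4, FP=1, FN=1
```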

So Why Not Accuracy?

Accuracy is not useful when trying to predict things that are not common. Accuracy is simply the proportion of correctly classified instances. It is usually the first metric you look at when evaluating a model. However, when the data is imbalanced (where most of the instances belong to one of the classes), or you are more interested in the performance on either one of the classes, accuracy doesn’t really capture the effectiveness of a classifier.

In classification problems, we're typically more concerned about the errors we make, because the target class is usually the area of interest we're trying to focus on. Reporting a high accuracy while the model fails on exactly that class of interest is called the accuracy paradox.

Looking Good or Is Good?

Evaluating how your model performs can be tricky. There is a difference between a model "looking" good and a model that "is" good. Let's look at another example for clarification. You have a Cat&Dog dataset where 80% of the target variable is classified as "cat". If you were to build a model that predicts every instance as "cat", your model would literally be 80% accurate. Does this model really tell you anything about your predictions? In this case, your model "looks" good. But in the context of the problem it is not, because it never classifies a single "dog" instance. It "is" good only when it can correctly identify as many cats and dogs as possible.
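
A minimal sketch of the "looks good vs. is good" gap, using scikit-learn's classification_report on a made-up Cat&Dog split (the exact numbers are mine, purely illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Made-up dataset: 80% "cat", 20% "dog"
y_true = np.array(["cat"] * 80 + ["dog"] * 20)

# A model that "looks" good: predict "cat" for everything
y_pred = np.array(["cat"] * 100)

print(accuracy_score(y_true, y_pred))  # 0.8
print(classification_report(y_true, y_pred, zero_division=0))
# Recall for "dog" is 0.0 -- the model never identifies a single dog
```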

I know all these terms can be overwhelming... Hang in there! If you like spam filters and sick patients, you're in for a real treat if you keep on reading! This brings me to another important topic in model evaluation: Precision and Recall.

Precision

Precision measures what fraction of your predictions for the positive class are valid. It is computed as True Positives / (True Positives + False Positives).

Example: for precision, let's say you have 1,000 spam emails in your inbox and the rest are emails sent from your Grandma. Your goal isn't necessarily to find every single spam email; your goal is to make sure that whatever you flag as spam really is spam. You do not want your Grandma's emails to be classified as spam! You want every positive prediction you make to be correct. Optimizing for precision minimizes false positives.
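
As a minimal sketch (toy spam labels invented here, not real data), precision can be computed with scikit-learn's precision_score:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy data: 1 = spam, 0 = an email from Grandma
y_true = np.array([1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])

# precision = TP / (TP + FP):
# of everything we flagged as spam, how much really was spam?
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
```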

Recall

Recall tells you how much of the positive class your predictions actually capture. It is computed as True Positives / (True Positives + False Negatives).

Example: if we are developing a system that detects sick and healthy patients, it is desirable that we have a very high recall. In this context, predicting that someone is healthy when they are actually sick (a False Negative) is more harmful than predicting that someone is sick when they are actually healthy (a False Positive). We want most of the sick patients to be identified, probably at the cost of some precision, because it is very important that all sick patients are caught.
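
And a matching sketch for recall (toy sick/healthy labels, invented for illustration), using scikit-learn's recall_score:

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy data: 1 = sick, 0 = healthy
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0])

# recall = TP / (TP + FN):
# of all the truly sick patients, how many did we catch?
print(recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```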

Conclusion

These are just a few of the many evaluation metrics you can use for classification problems. There is no single best evaluation metric. You should always keep the context of the problem in mind and weigh your decisions on whether you want higher recall or higher precision. Especially in imbalanced classification problems, accuracy is most likely NOT the right metric to use!

If there is anything that you guys would like to add to this article, feel free to leave a message and don’t hesitate! Any sort of feedback is truly appreciated. Don’t be afraid to share this! Thanks!
