Machine Learning Classification Confidence – How Confident Should You Be?
If your kid got a 99% on an exam, he or she should feel pretty good about that result. But should you feel good if your machine-learning algorithm reports 99% confidence for a specific determination or conclusion? Well, yes and no. Note that 99% is an arbitrary number used throughout this post for illustration, not a statistic measured from real machine learning algorithms.
What is machine learning at a high level?
There has been a lot of hype around machine learning, so let’s go over how it works at a high level. In my current domain I am exposed mostly to computer vision applications, so much of my machine learning experience is with convolutional neural networks (CNNs), and I will talk about CNNs in this post. However, my comments generalize to other machine learning algorithms.
A good way to think of a machine learning algorithm is that it is very good at automatically finding signatures or patterns across many examples of source data (where “many” means the scale of big data), so that it can make a determination or conclusion when given a new example. For instance, a CNN takes an image as input and makes a determination, such as “is this an image of a cat?” It does this by evaluating many images of cats and automatically finding the small signatures a cat should have, and where. When the CNN is given a new image, a 99% confidence roughly means the new image exhibits, in the right places, 99% of the signatures the CNN determined should be there “to be a cat”.
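To make “confidence” a little more concrete, here is a minimal sketch of how a classifier turns an image into a confidence score. It uses PyTorch with a deliberately tiny, untrained stand-in network and made-up class labels, so the model, sizes, and names are illustrative assumptions rather than a real cat detector.

```python
import torch
import torch.nn as nn

# A deliberately tiny, untrained stand-in for a real image classifier (a trained
# CNN such as a ResNet would be far deeper and trained on real labeled photos).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # responds to small local "signatures"
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                    # pool the signature responses
    nn.Flatten(),
    nn.Linear(8, 2),                            # raw scores for "cat" vs "not cat"
)
model.eval()

classes = ["cat", "not cat"]                    # hypothetical labels
image = torch.rand(1, 3, 64, 64)                # stand-in for one preprocessed photo

with torch.no_grad():
    logits = model(image)                       # raw scores
    probs = torch.softmax(logits, dim=1)        # turn scores into a probability distribution

conf, idx = probs.max(dim=1)
# An untrained model will print a near-50% guess; a trained one might report 99%.
print(f"{classes[idx.item()]} with {conf.item():.0%} confidence")
```

The key point is the last step: the network produces raw scores, and the softmax turns them into the probability-like number that gets reported as “confidence”.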
How does it find these signatures? The CNN starts by assigning an arbitrary probability (weight) to each small section of an image. With each example image it processes, it adjusts those probabilities, gradually converging on whether each section is a signature. So if you have heard machine learning described as a probabilistic approach, this is exactly why. And if you hear hype about machine learning nearing the intelligence of Skynet (from the movie Terminator), this is also why you should not believe the hype.
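As a rough illustration of that adjustment process, the sketch below (again PyTorch, with a tiny stand-in model and random tensors standing in for labeled cat / not-cat photos) shows the loop that nudges the initially arbitrary weights a little after every batch of examples.

```python
import torch
import torch.nn as nn

# Minimal sketch of the learning loop described above. A real training run would
# sweep a large labeled dataset many times rather than loop over random noise.
model = nn.Sequential(                       # weights start out essentially arbitrary
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 2),                         # scores for "cat" vs "not cat"
)
criterion = nn.CrossEntropyLoss()            # how wrong was the current guess?
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):                      # each step processes a small batch of examples
    images = torch.rand(8, 3, 64, 64)        # stand-in for 8 labeled photos
    labels = torch.randint(0, 2, (8,))       # stand-in for "cat" / "not cat" labels

    logits = model(images)
    loss = criterion(logits, labels)         # compare the guess with the known answer

    optimizer.zero_grad()
    loss.backward()                          # work out which weights were most to blame
    optimizer.step()                         # nudge those weights toward a better answer
```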
Why you should feel good about 99%
What makes a CNN significantly better than a deterministic approach is that a deterministic approach requires a human to teach the machine what and where all those small signatures must be, and often we humans don’t know ourselves. If a human cannot consciously recognize all the small signatures, we cannot possibly teach the deterministic approach to be very good. Even if a human could recognize them all, the deterministic approach would require far too much complexity in the code. The CNN finds those signatures automatically by learning from the many examples, and in practice these algorithms have proven much more reliable and efficient than their deterministic predecessors.
After working through many examples, the probabilities will have converged, yielding many signatures. With many distinct signatures found, a 99% match against them feels pretty good.
Why you should not be overconfident in 99%
But who says the probabilities have to converge? Some may not converge, or may converge suboptimally, resulting in few signatures. With far fewer signatures, a 99% confidence does not sound so good. In that case, though, an honest software developer would have told you that the machine learning approach has not worked well for that particular application.
While the above situation is easy to detect or verify, there are two scenarios that can be more difficult to deal with. How often each scenario occurs varies widely across applications and CNN configurations.
- False positives: the CNN concludes the image is a cat (99% confidence), but it is not. Even if the CNN requires many signatures to declare an image a cat, that does not mean nothing else in this world exhibits the same signatures. The Google Photos incident of classifying a person as a gorilla is an example of a false positive. Unfortunately, a CNN is a global algorithm, meaning it cannot easily handle exception cases: you cannot tell the CNN to take a slightly different, localized path to fix a specific problem such as one false positive.
- False negatives: the CNN concludes the image is not a cat (low confidence), but it is. This usually happens when noise in the image distorts a signature enough that it is no longer recognized as one. This scenario can be bad when security applications are fooled into missing the signatures. If you are a Star Wars fan, a non-machine-learning example of a false negative is the missed opportunity to capture the droids (see image).
A more appropriate number to know is the classification accuracy. Classification accuracy is basically the percentage of cases we know for sure the CNN got right when given a large number of images to process, established through lots of time-consuming human verification. Getting this accuracy to 99% would be much more comforting, but still not fool-proof.
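Here is a minimal sketch of how false positives, false negatives, and classification accuracy are tallied during that human verification; the labels and predictions below are made up purely for illustration.

```python
# y_true is what human reviewers confirmed, y_pred is what a hypothetical CNN
# concluded; 1 = "cat", 0 = "not cat". The ten values are invented examples.
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correctly called "cat"
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correctly called "not cat"
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)   # fraction the CNN got right under human verification
print(f"FP={fp}, FN={fn}, accuracy={accuracy:.0%}")   # here: FP=1, FN=1, accuracy=80%
```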
So where do we stand with machine learning?
In many applications, machine learning algorithms have significantly outperformed their deterministic predecessors, as the classification accuracy shows. However, they are not “human intelligent” – they are just converging on probabilities. And they cannot be assumed to be perfect (even with 99% classification accuracy), though it is noteworthy that many of the errors (false positives and false negatives) would not fool a human. A good application therefore understands these limitations and applies human intervention to avoid catastrophic outcomes from the errors. For example, if the application tracks specific felons, any hit should first be confirmed by a human to be the felon being sought, rather than triggering an automatic arrest command, so that a false positive does not lead to the catastrophic outcome of arresting the wrong person.
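As a sketch of that kind of human-in-the-loop guard (the threshold, function name, and review queue here are hypothetical, not taken from any real system):

```python
# Never act automatically on a match; route confident hits to a human reviewer.
REVIEW_THRESHOLD = 0.99   # illustrative cut-off, not a recommended value

def handle_detection(confidence: float, match_id: str, review_queue: list) -> None:
    if confidence >= REVIEW_THRESHOLD:
        # Even a 99% hit only queues the case for confirmation; a person decides
        # whether it is really the right individual before any action is taken.
        review_queue.append((match_id, confidence))
    else:
        # Low-confidence hits are logged but not escalated.
        print(f"Ignoring low-confidence match {match_id} ({confidence:.0%})")

queue: list = []
handle_detection(0.995, "case-001", queue)   # goes to a human reviewer
handle_detection(0.42, "case-002", queue)    # ignored
```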
So does this mean it’s the end of the deterministic approach in AI? No – if only a small number of identifiable and unique signatures are needed, the deterministic approach is likely more reliable and easier to manage.