Deep learning: human-level performance
In the last few years, there has been a lot of talk about comparing machine learning systems to human-level performance, for two reasons. First, because of advances in deep learning, machine learning algorithms are suddenly working much better, so in a lot of application areas it has become feasible for them to be competitive with human-level performance. Second, it turns out that the workflow of designing and building a machine learning system is much more efficient when you're trying to do something that humans can also do. In those settings it becomes natural to talk about comparing to, or trying to mimic, human-level performance.
When you're working on a problem, progress tends to be relatively rapid as you approach human-level performance. But after the algorithm surpasses human-level performance, progress in accuracy slows down: it may keep getting better, but the slope of how rapidly accuracy improves often flattens out. Over time, as you keep training your algorithm, maybe with bigger and bigger models and more and more data, performance approaches but never surpasses some theoretical limit, which is called the Bayes optimal error.
Why does progress slow down after surpassing human-level performance? One reason is that for many tasks human-level performance is not that far from Bayes optimal error, because people are very good at looking at images and telling whether there's a cat, or listening to audio and transcribing it. So by the time you surpass human-level performance there may not be much headroom left to improve. The second reason is that so long as your performance is worse than human-level performance, there are certain tools you can use to improve it, and those tools are harder to use once your performance surpasses human-level performance.
For tasks that humans are quite good at, so long as your machine learning algorithm is still worse than humans, you can get labeled data from humans, so you have more data to fit your learning algorithm. You can also use manual error analysis: ask people to look at the examples your algorithm gets wrong and try to gain insight into why a person gets them right but the algorithm gets them wrong. And you can get a better analysis of bias and variance.
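As a rough illustration of the manual error analysis idea, here is a minimal sketch; the error categories and the `mislabeled_dev_examples` records are made up for illustration, not part of the course material.

```python
from collections import Counter

# Hypothetical records of dev-set examples the model got wrong.
# A human reviewer has tagged each one with the apparent cause of the error.
mislabeled_dev_examples = [
    {"id": 17, "categories": ["blurry image"]},
    {"id": 42, "categories": ["dog mistaken for cat"]},
    {"id": 58, "categories": ["blurry image", "dog mistaken for cat"]},
    {"id": 90, "categories": ["mislabeled ground truth"]},
]

# Count how often each category appears, to see where the algorithm
# most often fails while a human would get the example right.
counts = Counter(cat for ex in mislabeled_dev_examples for cat in ex["categories"])
total = len(mislabeled_dev_examples)

for category, count in counts.most_common():
    print(f"{category}: {count}/{total} = {100 * count / total:.0f}% of errors")
```

Tallies like this suggest which category of errors is worth attacking first, which is exactly the kind of guidance that becomes harder to get once the algorithm is better than the humans doing the reviewing.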
We said that you want your learning algorithm to do well on your training set, but sometimes you don't want it to do too well; knowing human-level performance can tell you exactly how well, but not too well, you want your algorithm to do on the training set.
Here we still use cat classification as the example. Given a picture, let's say humans have near-perfect accuracy, so suppose human-level error is 1%. In that case, if your learning algorithm achieves 8% training error and 10% dev error, then maybe you want to do better on the training set: the huge gap between how well your algorithm does on the training data and how well humans do shows that your algorithm isn't fitting the training set well. In terms of the tools of bias and variance, in this case you would focus on reducing bias, so you would do things like train a bigger neural network. But now imagine human-level error is not 1% but 7.5%, maybe because the images in your dataset are so blurry that even humans can't tell whether there is a cat. In that case you are actually doing just fine on the training set, since it is only a little bit worse than human-level performance, and maybe you want to focus instead on reducing the variance of your learning algorithm, so you might try regularization to bring your dev error closer to your training error.
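To make the arithmetic concrete, here is a small sketch of the two scenarios above; the `diagnose` helper is my own naming, not something from the course, and the errors are in percentage points.

```python
def diagnose(human_error, train_error, dev_error):
    """Compare the bias gap (train vs. human) with the variance gap (dev vs. train)."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    if avoidable_bias > variance:
        focus = "reduce bias (e.g. train a bigger network)"
    else:
        focus = "reduce variance (e.g. regularization, more data)"
    return avoidable_bias, variance, focus

# Scenario 1: humans are near-perfect (1% error).
print(diagnose(human_error=1.0, train_error=8.0, dev_error=10.0))
# bias gap 7.0 vs. variance gap 2.0 -> focus on reducing bias.

# Scenario 2: blurry images, human-level error is 7.5%.
print(diagnose(human_error=7.5, train_error=8.0, dev_error=10.0))
# bias gap 0.5 vs. variance gap 2.0 -> focus on reducing variance.
```

The same training and dev errors lead to opposite conclusions depending on the human-level number, which is the whole point of using it as a reference.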
In earlier courses' discussions of bias and variance, we were mainly assuming tasks where Bayes error is nearly zero. To explain what happens here for our cat classification example, think of human-level error as a proxy, or an estimate, for Bayes optimal error. For computer vision tasks this is a pretty reasonable proxy, because humans are actually very good at computer vision, so whatever humans can do may not be too far from Bayes error. By definition, human-level error cannot be better than Bayes error, since nothing can be better than Bayes error, but human-level error may not be too far from it.
Let's see how to define 'human-level' a bit more precisely, and in particular use the definition that is most useful for helping you drive progress in your machine learning project. Say you want to look at a radiology image and make a diagnosis classification decision. Suppose a typical, untrained human achieves 3% error on this task, a typical doctor, maybe a radiologist, achieves 1% error, an experienced doctor does even better at 0.7%, and a team of experienced doctors, reaching a consensus opinion, achieves 0.5% error. If you want a proxy or an estimate of Bayes error, then given that a team of experienced doctors achieves 0.5%, we know Bayes error can be no larger than 0.5%. We don't know how much better it is; maybe a larger team of even more experienced doctors could do better, so it might be a little below 0.5%, but the optimal error cannot be higher than 0.5%. So what I would do in this setting is use 0.5% as the estimate for Bayes error, and define human-level performance as 0.5%.
The gap between human-level error and training error is taken as the avoidable bias, and the gap between training error and dev error is taken as the variance. As in the previous slide, when the avoidable bias is bigger than the variance we focus on reducing bias, for example by training a bigger neural network, whereas if the variance is much bigger we focus on variance-reduction techniques such as regularization or getting a bigger training set. Where it really matters is when your training error approaches 0.7% and your dev error is 0.8%: unless you are very careful about estimating Bayes error, you might not know how far away from it you are, and therefore how much you should be trying to reduce the avoidable bias. In fact, if all you knew was that a single experienced doctor achieves 0.7%, rather than that a team of experienced doctors achieves the even better 0.5%, it might be very difficult to know whether you should try to fit your training set even better. This problem arises only when you are already doing very well on your problem; 0.7% and 0.5% are really close to human-level performance.
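Continuing with the radiology numbers, a small sketch below shows how the choice of proxy changes the picture when training error is 0.7% and dev error is 0.8%; the dictionary and loop are only illustrative scaffolding around the section's figures.

```python
# Human-level error rates reported for the radiology task (in %).
human_levels = {
    "typical untrained human": 3.0,
    "typical doctor": 1.0,
    "experienced doctor": 0.7,
    "team of experienced doctors": 0.5,
}

# The best observed human performance is an upper bound on Bayes error,
# so it is the natural proxy.
bayes_proxy = min(human_levels.values())  # 0.5

train_error, dev_error = 0.7, 0.8  # in %

for proxy_name, proxy in [("team of doctors (0.5%)", 0.5), ("single doctor (0.7%)", 0.7)]:
    avoidable_bias = train_error - proxy
    variance = dev_error - train_error
    print(f"proxy = {proxy_name}: avoidable bias {avoidable_bias:.1f}%, variance {variance:.1f}%")
# With the 0.5% proxy there is still 0.2% of avoidable bias worth chasing;
# with the 0.7% proxy it looks like there is none, which could mislead you.
```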
Lots of teams find it exciting to surpass human-level performance on a specific recognition or classification task. Consider the following example: a team of humans discussing and debating achieves 0.5% error, and a single human achieves 1% error. If your training error is 0.6%, it is easy to say what the avoidable bias is, because you take 0.5% as the estimate of Bayes error (you would not use 1% as the reference), giving an avoidable bias of 0.1%; and in this case there may be more to gain from reducing your variance than your avoidable bias. Moreover, if your training error is already better, say 0.3%, than even a team of humans looking at the examples and discussing and debating them, then it is also harder to rely on human intuition to tell you in what ways your algorithm could still improve. So in this example, once you surpass the 0.5% threshold, your options for making progress on the machine learning problem are just less clear. It doesn't mean you cannot make progress; you might still make significant progress, but some of the tools you have for pointing you in a clear direction just don't work as well.
There are many problems where machine learning significantly surpasses human-level performance. Notice that these problems involve learning from structured data and are not natural perception problems; they are not computer vision, speech recognition, or natural language processing tasks, because humans tend to be very good at perception tasks. Finally, for all of these problems, there are teams with access to huge amounts of data: the best systems for these applications have probably looked at far more data of that application than any human could possibly look at, so they can find statistical patterns better than even a human might.
Chen Yang