Deep learning: Error analysis
You've heard about orthogonalization, how to set up your dev and test sets, using human-level performance as a proxy for Bayes error, and how to estimate your avoidable bias and variance. Let's pull it all together into a set of guidelines for how to improve the performance of your learning algorithm.
Getting a supervised learning algorithm to work well fundamentally means hoping or assuming you can do two things. The first is that you can fit your training set pretty well, which roughly means you can achieve low avoidable bias. The second is that doing well on the training set generalizes pretty well to the dev set or the test set; this is a way of saying that variance is not too bad. And in the spirit of orthogonalization, there is one set of knobs you can use to fix the avoidable bias issue, such as training a bigger network or training longer, and a separate set of knobs you can use to address the variance problems, such as regularization or getting more training data.
In summary, if you want to improve the performance of your machine learning system, I would recommend looking at the difference between your training error and your proxy for Bayes error to get a sense of the avoidable bias, in other words how much better you think you could do on your training set. Then look at the difference between your dev error and training error as an estimate of how much of a variance problem you have, in other words how much harder you should be working to make your performance generalize from the training set to the dev set that it wasn't trained on explicitly.
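To make that diagnosis concrete, here is a minimal sketch in Python; the error values are made up for illustration, and the decision rule is just the simple comparison described above.

```python
# Minimal sketch of the avoidable bias / variance diagnosis described above.
# The error values below are illustrative, not from a real model.

human_error = 0.01      # proxy for Bayes error
training_error = 0.08
dev_error = 0.10

avoidable_bias = training_error - human_error   # how much better you could fit the training set
variance = dev_error - training_error           # how well training performance generalizes

if avoidable_bias >= variance:
    print(f"Avoidable bias dominates ({avoidable_bias:.1%}): try a bigger network or training longer.")
else:
    print(f"Variance dominates ({variance:.1%}): try regularization or more training data.")
```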
If you're trying to get your learning algorithm to do a task that humans can do, and your learning algorithm is not yet at the performance of a human, then manually examining the mistakes your algorithm is making can give you insight into what to do next. This process is called error analysis.
Take the cat classifier as an example: you've achieved 90% accuracy, or equivalently 10% error, on your dev set, and let's say this is much worse than what you're hoping to do. Maybe your team looks at some examples that the algorithm misclassifies and notices that it miscategorizes some dogs as cats. So maybe your team comes to you with a proposal for how to make the algorithm do better specifically on dogs. You can imagine building a focused effort, maybe collecting more dog pictures or designing features specific to dogs, in order to make your classifier do better on dogs so it stops misrecognizing them as cats. So the question is: should you go ahead and start a project focused on the dog problem? It could take several months of work to make your algorithm make fewer mistakes on dog pictures, so is that worth your effort? Well, rather than spending a few months on this only to risk finding out at the end that it wasn't that helpful, here is an error analysis procedure that can let you quickly tell whether or not it could be worth your effort.
Take about 100 misclassified examples from your dev set, examine them manually, and count how many of them are actually pictures of dogs. Now suppose it turns out that 5% of your hundred misclassified dev examples are pictures of dogs. Then even if you spend a lot of time on the dog problem and fix it completely, your error might only go down from 10% to 9.5%, a 5% relative decrease in error. So you might reasonably decide this is not the best use of your time, or maybe it is, but at least this gives you a ceiling, an upper bound, on how much you could improve your performance by working on the dog problem.
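The ceiling argument is just arithmetic; here is the same calculation written out in Python with the numbers from the example above.

```python
# Ceiling on the improvement from fixing the dog problem, using the numbers above.

dev_error = 0.10          # current overall dev set error
fraction_dogs = 5 / 100   # 5 of the 100 examined misclassified examples were dogs

# Even if every dog mistake were eliminated, the best you could hope for is:
best_case_error = dev_error * (1 - fraction_dogs)
print(f"Error after fixing all dog mistakes: at best {best_case_error:.1%}")  # 9.5%
```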
Sometimes you can also evaluate multiple ideas in parallel during error analysis. For example, let's say you have several ideas for improving your cat detector: fixing the dog problem, fixing the great cat misrecognition problem, or improving performance on blurry images. What I would do is create a table where the left column goes through the set of images you plan to look at manually, the other columns correspond to the ideas you're evaluating, and there is usually also space in the spreadsheet for comments. Remember, during error analysis you're just looking at dev set examples that your algorithm misrecognized, so if you find that the first misrecognized image is a picture of a dog, put a check mark in the cell at the first row and the dog column. Then, having gone through some set of images, count up what percentage of these errors were attributed to the dog, great cat, or blurry categories; this just means going down each column and counting what percentage of images have a check mark in that column. The conclusion of this process gives you an estimate of how worthwhile it might be to work on each of these categories of errors. This doesn't give you a rigid mathematical formula that tells you what to do, but it gives you a sense of the best options to pursue.
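If you'd rather keep the spreadsheet in code, here is a small sketch of the same tally; the rows, categories, and comments are hypothetical.

```python
# Sketch of the error analysis spreadsheet: one row per misclassified dev
# example, one boolean per error category, plus a free-text comment.

rows = [
    {"dog": True,  "great_cat": False, "blurry": False, "comment": "pitbull"},
    {"dog": False, "great_cat": True,  "blurry": True,  "comment": "lion, rainy day at zoo"},
    {"dog": False, "great_cat": False, "blurry": True,  "comment": "out-of-focus cat"},
    # ... one row for each misclassified example you examine manually
]

for category in ["dog", "great_cat", "blurry"]:
    share = sum(row[category] for row in rows) / len(rows)
    print(f"{category}: {share:.0%} of examined errors")
```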
The data for your supervised learning problem comprises inputs X and output labels Y. What if you go through the data and find that some of the labels Y are incorrect? Is it worthwhile to go in and fix up some of these labels? Let's look at the cat detector example. If you find the data has some incorrectly labeled examples, first consider the training data. It turns out deep learning algorithms are quite robust to random errors in the training data, so long as the errors in your incorrectly labeled examples are not too far from random: maybe the labeler just wasn't paying attention, or they accidentally hit the wrong key on the keyboard. If the errors are reasonably random, it's probably okay to just leave them as they are and not spend too much time fixing them, so long as the total data set is big enough and the actual percentage of errors is not too high. However, deep learning is less robust to systematic errors. For example, if your labelers consistently label white dogs as cats, that is a problem, because your classifier will learn to classify all white dogs as cats.
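To make the distinction concrete, here is a hypothetical sketch of the two kinds of label corruption; the binary labels (1 = cat) and the `is_white_dog` attribute are invented purely for illustration.

```python
import random

def corrupt_randomly(labels, error_rate=0.05):
    """Flip a small random fraction of binary labels: the kind of noise
    deep learning tends to tolerate if the data set is big enough."""
    return [1 - y if random.random() < error_rate else y for y in labels]

def corrupt_systematically(examples, labels):
    """Label every white dog as a cat: a systematic error the model will
    happily learn and reproduce."""
    return [1 if example["is_white_dog"] else y
            for example, y in zip(examples, labels)]
```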
How about incorrectly labeled examples in the dev set and test set? What I would do during error analysis is add an extra column, so you can also count up the number of examples where the label Y was incorrect. For example, you might count up the impact on 100 misclassified dev set examples: you find 100 examples where your classifier's output disagrees with the label in your dev set, and for a few of those examples your classifier disagrees with the label because the label was wrong rather than because your classifier was wrong.
The added column counts the percentage of errors due to incorrect labels, that is, cases where the Y value in the dev set was wrong and that is why your learning algorithm's prediction differed from what the label says. So the question is: is it worthwhile to go in and try to fix up this 6% of incorrectly labeled examples? The advice is: if it makes a significant difference to your ability to evaluate algorithms on your dev set, then go ahead and spend the time to fix the incorrect labels. There are three numbers I recommend looking at to decide if it's worth going in and reducing the number of mislabeled examples: the overall dev set error, in this case 10% (since accuracy is 90%); the errors due to incorrect labels, which is 10% × 6% = 0.6%; and the errors due to other causes, which is 10% − 0.6% = 9.4%. So in this case I would say there is 9.4% worth of error that you could focus on fixing, whereas the errors due to incorrect labels are a relatively small fraction of the overall error, so fix them if you want to, but it's not the most important thing to do right now. And remember that the main purpose of the dev set is to help you select between two classifiers A and B. Suppose that later A has 2.1% error and B has 1.9% error on your dev set: the 0.2% difference between them is now smaller than the 0.6% of error due to incorrect labels, which is a large share of the overall error, so you can no longer trust your dev set to correctly tell you whether A is actually better than B. In that case it becomes much more worthwhile to fix the incorrect labels.
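Working through those three numbers, and the classifier comparison, in Python (the figures are the ones quoted above):

```python
# The three numbers discussed above, plus the A-vs-B comparison.

dev_error = 0.10                 # overall dev set error (90% accuracy)
frac_errors_mislabeled = 0.06    # 6% of the examined errors had an incorrect label

error_from_incorrect_labels = dev_error * frac_errors_mislabeled   # 0.6%
error_from_other_causes = dev_error - error_from_incorrect_labels  # 9.4%
print(f"Errors due to incorrect labels: {error_from_incorrect_labels:.1%}")
print(f"Errors due to other causes:     {error_from_other_causes:.1%}")

# Why this matters once overall error is low: if A has 2.1% error and B has
# 1.9%, the 0.2% gap is smaller than the 0.6% of error caused by bad labels,
# so the dev set can no longer reliably rank the two classifiers.
error_A, error_B = 0.021, 0.019
print(f"Gap between A and B: {abs(error_A - error_B):.1%} "
      f"vs. label-noise error: {error_from_incorrect_labels:.1%}")
```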
Now, if you decide to go into your dev set and manually re-examine and fix up some of the labels, here are a few additional guidelines. First, I would encourage you to apply whatever process you use to both your dev and test sets at the same time. The dev set tells you where to aim the target, and when you hit it you want that to generalize to the test set, so your team will work most efficiently if the dev and test sets come from the same distribution. So if you go in to fix up your dev set, apply the same process to the test set to make sure they continue to come from the same distribution. Second, I would urge you to consider examining examples your algorithm got right as well as ones it got wrong. It's easy to look only at the examples your algorithm got wrong and check which of those need to be fixed, but it's possible that there are a few examples your algorithm got right that should also be fixed, and if you only fix the ones your algorithm got wrong, you'll end up with a more biased estimate of the error of your algorithm. Finally, if you're going into your dev/test data to correct some of the labels there, you may or may not decide to apply the same process to your training set. Remember we said earlier that it's less important to correct labels in the training set, and it's quite possible you'll decide to correct labels only in your dev/test sets, which are also often much smaller than your training set. It's super important that your dev and test sets come from the same distribution, but if your training set comes from a slightly different distribution, often that is a pretty reasonable thing to accept.
Chen Yang