Deep learning: Error analysis
You've heard about orthogonalization, how to set up your dev and test sets, using human-level performance as a proxy for Bayes error, and how to estimate your avoidable bias and variance. Let's pull it all together into a set of guidelines for how to improve the performance of your learning algorithm.
Getting a supervised learning algorithm to work well fundamentally means hoping or assuming you can do two things. The first is that you can fit your training set pretty well, which roughly means you can achieve low avoidable bias. The second is that doing well on the training set generalizes pretty well to the dev set or the test set; this is a way of saying that variance is not too bad. And in the spirit of orthogonalization, there is one set of knobs you can use to fix the avoidable bias issue, such as training a bigger network or training longer, and a separate set of knobs you can use to address the variance problems, such as regularization or getting more training data.
In summary, if you want to improve the performance of your machine learning system, I would recommend looking at the difference between your training error and your proxy for Bayes error to get a sense of the avoidable bias, in other words how much better you think you could do on your training set. Then look at the difference between your dev error and training error as an estimate of how much of a variance problem you have, in other words how much harder you should be working to make your performance generalize from the training set to the dev set that it wasn't trained on explicitly.
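To make that diagnosis concrete, here is a minimal sketch in Python; the error values are made up for illustration, and the decision rule is just the simple comparison described above.

```python
# Minimal sketch of the avoidable bias / variance diagnosis described above.
# The error values below are illustrative, not from a real model.

human_error = 0.01      # proxy for Bayes error
training_error = 0.08
dev_error = 0.10

avoidable_bias = training_error - human_error   # how much better you could fit the training set
variance = dev_error - training_error           # how well training performance generalizes

if avoidable_bias >= variance:
    print(f"Avoidable bias dominates ({avoidable_bias:.1%}): try a bigger network or training longer.")
else:
    print(f"Variance dominates ({variance:.1%}): try regularization or more training data.")
```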
If you're trying to get your learning algorithm to do a task that humans can do, and your learning algorithm is not yet at the performance of a human, then manually examining the mistakes your algorithm is making can give you insight into what to do next. This process is called error analysis.
Take the cat classifier as an example: you've achieved 90% accuracy, or equivalently 10% error, on your dev set, and let's say this is much worse than what you're hoping to do. Maybe your team looks at some examples that the algorithm misclassifies and notices that it miscategorizes some dogs as cats. So maybe your team comes to you with a proposal for how to make the algorithm do better specifically on dogs. You can imagine building a focused effort, maybe collecting more dog pictures or designing features specific to dogs, in order to make your classifier do better on dogs so it stops misrecognizing them as cats. So the question is: should you go ahead and start a project focused on the dog problem? It could take several months of work to make your algorithm make fewer mistakes on dog pictures, so is that worth your effort? Well, rather than spending a few months on this only to risk finding out at the end that it wasn't that helpful, here is an error analysis procedure that can let you quickly tell whether or not it could be worth your effort.
Take about 100 misclassified examples from your dev set, examine them manually, and count how many of them are actually pictures of dogs. Now suppose it turns out that 5% of your hundred misclassified dev examples are pictures of dogs. Then even if you spend a lot of time on the dog problem and fix it completely, your error might only go down from 10% to 9.5%, a 5% relative decrease in error. So you might reasonably decide this is not the best use of your time, or maybe it is, but at least this gives you a ceiling, an upper bound, on how much you could improve your performance by working on the dog problem.
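The ceiling argument is just arithmetic; here is the same calculation written out in Python with the numbers from the example above.

```python
# Ceiling on the improvement from fixing the dog problem, using the numbers above.

dev_error = 0.10          # current overall dev set error
fraction_dogs = 5 / 100   # 5 of the 100 examined misclassified examples were dogs

# Even if every dog mistake were eliminated, the best you could hope for is:
best_case_error = dev_error * (1 - fraction_dogs)
print(f"Error after fixing all dog mistakes: at best {best_case_error:.1%}")  # 9.5%
```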
Sometimes you can also evaluate multiple ideas in parallel during error analysis. For example, let's say you have several ideas for improving your cat detector: fixing the dog problem, fixing the great cat misrecognition problem, or improving performance on blurry images. What I would do is create a table where the left column goes through the set of images you plan to look at manually, the other columns correspond to the ideas you're evaluating, and there is usually also space in the spreadsheet for comments. Remember, during error analysis you're just looking at dev set examples that your algorithm misrecognized, so if you find that the first misrecognized image is a picture of a dog, put a check mark in the cell at the first row and the dog column. Then, having gone through some set of images, count up what percentage of these errors were attributed to the dog, great cat, or blurry categories; this just means going down each column and counting what percentage of images have a check mark in that column. The conclusion of this process gives you an estimate of how worthwhile it might be to work on each of these categories of errors. This doesn't give you a rigid mathematical formula that tells you what to do, but it gives you a sense of the best options to pursue.
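If you'd rather keep the spreadsheet in code, here is a small sketch of the same tally; the rows, categories, and comments are hypothetical.

```python
# Sketch of the error analysis spreadsheet: one row per misclassified dev
# example, one boolean per error category, plus a free-text comment.

rows = [
    {"dog": True,  "great_cat": False, "blurry": False, "comment": "pitbull"},
    {"dog": False, "great_cat": True,  "blurry": True,  "comment": "lion, rainy day at zoo"},
    {"dog": False, "great_cat": False, "blurry": True,  "comment": "out-of-focus cat"},
    # ... one row for each misclassified example you examine manually
]

for category in ["dog", "great_cat", "blurry"]:
    share = sum(row[category] for row in rows) / len(rows)
    print(f"{category}: {share:.0%} of examined errors")
```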
The data for your supervised learning problem comprises inputs X and output labels Y. What if you go through the data and find that some of the labels Y are incorrect? Is it worthwhile to go in and fix up some of these labels? Let's look at the cat detector example. If you find the data has some incorrectly labeled examples, first consider the training data. It turns out deep learning algorithms are quite robust to random errors in the training data, so long as the errors in your incorrectly labeled examples are not too far from random: maybe the labeler just wasn't paying attention, or they accidentally hit the wrong key on the keyboard. If the errors are reasonably random, it's probably okay to just leave them as they are and not spend too much time fixing them, so long as the total data set is big enough and the actual percentage of errors is not too high. However, deep learning is less robust to systematic errors. For example, if your labelers consistently label white dogs as cats, that is a problem, because your classifier will learn to classify all white dogs as cats.
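To make the distinction concrete, here is a hypothetical sketch of the two kinds of label corruption; the binary labels (1 = cat) and the `is_white_dog` attribute are invented purely for illustration.

```python
import random

def corrupt_randomly(labels, error_rate=0.05):
    """Flip a small random fraction of binary labels: the kind of noise
    deep learning tends to tolerate if the data set is big enough."""
    return [1 - y if random.random() < error_rate else y for y in labels]

def corrupt_systematically(examples, labels):
    """Label every white dog as a cat: a systematic error the model will
    happily learn and reproduce."""
    return [1 if example["is_white_dog"] else y
            for example, y in zip(examples, labels)]
```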
How about incorrectly labeled examples in the dev set and test set? What I would do during error analysis is add an extra column, so you can also count up the number of examples where the label Y was incorrect. For example, you might count up the impact on 100 misclassified dev set examples: you find 100 examples where your classifier's output disagrees with the label in your dev set, and for a few of those examples your classifier disagrees with the label because the label was wrong rather than because your classifier was wrong.
The added column counts the percentage of errors due to incorrect labels, that is, cases where the Y value in the dev set was wrong and that is why your learning algorithm's prediction differed from what the label says. So the question is: is it worthwhile to go in and try to fix up this 6% of incorrectly labeled examples? The advice is: if it makes a significant difference to your ability to evaluate algorithms on your dev set, then go ahead and spend the time to fix the incorrect labels. There are three numbers I recommend looking at to decide if it's worth going in and reducing the number of mislabeled examples: the overall dev set error, in this case 10% (since accuracy is 90%); the errors due to incorrect labels, which is 10% × 6% = 0.6%; and the errors due to other causes, which is 10% − 0.6% = 9.4%. So in this case I would say there is 9.4% worth of error that you could focus on fixing, whereas the errors due to incorrect labels are a relatively small fraction of the overall error, so fix them if you want to, but it's not the most important thing to do right now. And remember that the main purpose of the dev set is to help you select between two classifiers A and B. Suppose that later A has 2.1% error and B has 1.9% error on your dev set: the 0.2% difference between them is now smaller than the 0.6% of error due to incorrect labels, which is a large share of the overall error, so you can no longer trust your dev set to correctly tell you whether A is actually better than B. In that case it becomes much more worthwhile to fix the incorrect labels.
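Working through those three numbers, and the classifier comparison, in Python (the figures are the ones quoted above):

```python
# The three numbers discussed above, plus the A-vs-B comparison.

dev_error = 0.10                 # overall dev set error (90% accuracy)
frac_errors_mislabeled = 0.06    # 6% of the examined errors had an incorrect label

error_from_incorrect_labels = dev_error * frac_errors_mislabeled   # 0.6%
error_from_other_causes = dev_error - error_from_incorrect_labels  # 9.4%
print(f"Errors due to incorrect labels: {error_from_incorrect_labels:.1%}")
print(f"Errors due to other causes:     {error_from_other_causes:.1%}")

# Why this matters once overall error is low: if A has 2.1% error and B has
# 1.9%, the 0.2% gap is smaller than the 0.6% of error caused by bad labels,
# so the dev set can no longer reliably rank the two classifiers.
error_A, error_B = 0.021, 0.019
print(f"Gap between A and B: {abs(error_A - error_B):.1%} "
      f"vs. label-noise error: {error_from_incorrect_labels:.1%}")
```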
Now, if you decide to go into your dev set and manually re-examine and fix up some of the labels, here are a few additional guidelines. First, I would encourage you to apply whatever process you use to both your dev and test sets at the same time. The dev set tells you where to aim the target, and when you hit it you want that to generalize to the test set, so your team will work most efficiently if the dev and test sets come from the same distribution. So if you go in to fix up your dev set, apply the same process to the test set to make sure they continue to come from the same distribution. Second, I would urge you to consider examining examples your algorithm got right as well as ones it got wrong. It's easy to look only at the examples your algorithm got wrong and check which of those need to be fixed, but it's possible that there are a few examples your algorithm got right that should also be fixed, and if you only fix the ones your algorithm got wrong, you'll end up with a more biased estimate of the error of your algorithm. Finally, if you're going into your dev/test data to correct some of the labels there, you may or may not decide to apply the same process to your training set. Remember we said earlier that it's less important to correct labels in the training set, and it's quite possible you'll decide to correct labels only in your dev/test sets, which are also often much smaller than your training set. It's super important that your dev and test sets come from the same distribution, but if your training set comes from a slightly different distribution, often that is a pretty reasonable thing to accept.
Chen Yang