Deep learning: Training and testing on different distributions

If you're working on a brand new machine learning application, one piece of advice is to build your first system quickly and then iterate. First, quickly set up a dev set, a test set, and a metric; this is really deciding where to place your target, and if you get it wrong you can always move it later, but just set the target up somewhere. Then build your initial machine learning system quickly, find a training set, train it, and start to see and understand how well you're doing against your dev/test set on your evaluation metric. Once you've built your initial system, you'll be able to use bias/variance analysis as well as error analysis to prioritize the next steps. In particular, if error analysis causes you to realize that a lot of errors come from the speaker being very far from the microphone, which poses special challenges for speech recognition, then that gives you a good reason to focus on techniques for far-field speech recognition, which basically means handling the case where the speaker is far from the microphone.

A note on the value of building this initial system: it can be a quick and dirty implementation, don't overthink it. The value of the initial system is that having some trained system lets you look at bias/variance to prioritize what to do next, and lets you do error analysis, looking at some of the mistakes to figure out, among all the different directions you could go in, which ones are actually the most worthwhile.

Deep learning algorithms have a huge hunger for training data; they just work best when you can find enough labeled training data to put into the training set. This has resulted in many teams shoving whatever data they can find into the training set, just to get more training data, even if some of it, or even a lot of it, doesn't come from the same distribution as the dev and test data. There are some subtleties and best practices for dealing with the case where your training and test distributions differ from each other.

Take the cat detector example again and look at the following two sources of data. One is from crawling the web and downloading a lot of professionally framed, high-resolution images of cats. The other is the data you really care about, from the mobile app, which tends to be less professionally shot, less well framed, and maybe even blurrier because it is shot by amateur users. Let's say you don't have a lot of mobile users yet, so you've only got 10,000 pictures uploaded from the app, whereas by crawling the web you can download huge numbers of cat pictures, say 200,000. What you really care about is that your final system works well on the mobile app distribution of images. One option is to put both of these datasets together and randomly shuffle them into a training, dev, and test set. The advantage is that your training, dev, and test sets now all come from the same distribution, but the disadvantage, which is actually a huge disadvantage, is that if you look at your dev set (say 2,500 examples), most of it will come from the web page distribution of images rather than what you actually care about, which is the mobile app distribution, and you don't want to spend most of your time optimizing for the web page distribution of images. So I don't recommend this option, because it sets up your dev set to tell your team to optimize for a different distribution of data than the one you actually care about.

Instead, I recommend another option: put all 200,000 web images into the training set, and, if you want, add 5,000 images from the mobile app as well; then make your dev and test sets all mobile app images. The advantage of splitting up your data this way is that you're aiming the target where you want it to be. The disadvantage, of course, is that your training distribution is now different from your dev and test distributions, but it turns out that this split gives better performance over the long term, and we will discuss later some specific techniques for dealing with a training set that comes from a different distribution than your dev and test sets.
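As a rough illustration, here is a minimal sketch of this recommended split in Python; the placeholder lists and file names are hypothetical stand-ins for your actual labeled examples, and the counts simply mirror the numbers above.

```python
import random

# Hypothetical placeholders for the two data sources.
web_images = [f"web_{i}.jpg" for i in range(200_000)]       # crawled, professionally framed
mobile_images = [f"mobile_{i}.jpg" for i in range(10_000)]  # amateur shots from the app

random.seed(0)
random.shuffle(mobile_images)

# Training set: all web images plus half of the mobile images.
train_set = web_images + mobile_images[:5_000]
random.shuffle(train_set)

# Dev and test sets: mobile app images only, since that is the
# distribution the final system must do well on.
dev_set = mobile_images[5_000:7_500]
test_set = mobile_images[7_500:]
```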

Estimating the bias and variance of your learning algorithm really helps you prioritize what to work on next, but the way you analyze bias and variance changes when your training set comes from a different distribution than your dev and test sets. Let's keep using the cat classifier example, and let's say humans get near-perfect performance on it, so Bayes error is nearly 0% for this problem. To carry out the analysis you usually look at the training error and the dev error; suppose they are 1% and 10% respectively. If your dev data came from the same distribution as your training set, you would conclude that you have a large variance problem: your algorithm has not generalized well from the training set, which it is doing well on, to the dev set, which it is suddenly doing much worse on. But in the setting where your training data and dev data come from different distributions, you can no longer safely draw this conclusion. In particular, maybe the algorithm is doing just fine on the dev distribution and the training set was simply easier, because it consisted of high-resolution, very clear images, while the dev set is much harder; maybe there isn't a variance problem at all, and the gap just reflects that the dev set contains images that are more difficult to classify accurately. The problem with this analysis is that when you go from the training error to the dev error, two things change at the same time: first, the algorithm saw the data in the training set but not the data in the dev set; second, the distribution of the dev set is different. So it's difficult to know how much of this 9% increase in error is because the algorithm didn't see the dev data, which is the variance part of the problem, and how much is because the dev set data is simply distributed differently.

In order to tease out these two effects, it is useful to define a new piece of data which we'll call the training-dev set: it has the same distribution as the training set, but you don't explicitly train the neural network on it. To obtain the training-dev set, randomly shuffle the training set and then carve out just a piece of it to be the training-dev set. So, just as the dev and test sets have the same distribution, the training set and the training-dev set also have the same distribution. The difference is that you train your neural network only on the training set; you don't run backpropagation on the training-dev portion of the data. To carry out error analysis, you now look at the error of the classifier on the training set, on the training-dev set, and on the dev set. If your training error is 1% and your training-dev error goes up to 9%, you have a variance problem, because the only difference between the training data and the training-dev data is that your neural network got to see, and was explicitly trained on, the former but not the latter. Now consider another case: your training error is 1%, your training-dev error is 1.5%, and your dev error goes up to 10%. Here you actually have a pretty low variance problem, because when you went from training data that the network has seen to training-dev data that it has not seen, the error increased only a little, but it jumps when you go to the dev set. This is a data mismatch problem.
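As a minimal sketch, assuming `train_set` is a list holding your training examples (the roughly 205,000 images from the split above, say), carving out a training-dev set could look like this:

```python
import random

# Hypothetical placeholder for the full training data.
train_set = [f"train_example_{i}" for i in range(205_000)]

random.seed(1)
random.shuffle(train_set)

# Hold out a slice that has the SAME distribution as the training data.
# Backpropagation runs only on train_set, never on train_dev_set.
train_dev_set = train_set[:2_500]
train_set = train_set[2_500:]
```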

Let's look at a few more examples. If your training error is 10%, your training-dev error is 11%, and your dev error is 12%, and remember that the human-level proxy for Bayes error is roughly 0%, then you really have an avoidable bias problem, because you're doing much worse than human level. And one last example: if your training error is 10%, your training-dev error is 11%, and your dev error is 20%, then it looks like there are actually two issues. The avoidable bias is quite high, because you're not even doing well on the training set while human-level error is near 0%; the variance from the training set to the training-dev set seems quite small; but the data mismatch from the training-dev set to the dev set is quite large.

Let's write out the general principles. The key quantities to look at are the human-level error, the training error, the training-dev error, and the dev error. Depending on the differences between these errors, you can get a sense of how big the avoidable bias, the variance, and the data mismatch are. You could also add the test error: the gap between the dev and test errors tells you the degree of overfitting to the dev set, so if that gap is large, maybe you need a bigger dev set.
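As a minimal sketch, here is the bookkeeping for those gaps, using the numbers from the last example above (the test error is a hypothetical value added only to illustrate the dev/test gap):

```python
human_level_error = 0.0     # proxy for Bayes error
training_error = 10.0
training_dev_error = 11.0
dev_error = 20.0
test_error = 21.0           # hypothetical

avoidable_bias = training_error - human_level_error  # 10.0 -> large avoidable bias
variance = training_dev_error - training_error       # 1.0  -> small variance
data_mismatch = dev_error - training_dev_error       # 9.0  -> large data mismatch
dev_overfitting = test_error - dev_error             # 1.0  -> little overfitting to the dev set
```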

Sometimes, if your dev/test set distribution happens to be much easier for whatever application you're working on, e.g., speech recognition, the dev and test errors can actually go down below the errors on the training distribution.

Let me motivate this somewhat surprising possibility using the speech-activated rearview mirror example. It turns out the key quantities we've been writing down can be placed into a table. On the horizontal axis I'll place the different datasets; for example, you might have data from a general speech recognition task and also rearview-mirror-specific speech data recorded inside the car. On the vertical axis I'll label different ways of examining the data: first, human-level performance, i.e., how accurate humans are on each of these datasets; then the error on examples that your neural network has trained on; and finally the error on examples that your neural network has not trained on. As stated previously, the gap between human-level error and training error measures avoidable bias, the gap between training error and training-dev error measures variance, and the gap between training-dev error and dev/test error measures data mismatch. But it can also be useful to fill in the remaining two entries in this table, (human-level error, rearview mirror speech data) and (training error, rearview mirror speech data), because comparing human-level performance on general speech recognition data and on rearview mirror speech data tells us that for humans the rearview mirror speech data is actually harder than general speech recognition: humans get 6% error rather than 4% error. For a lot of problems, examining the subset of entries used for the avoidable bias, variance, and data mismatch comparisons is enough to point you in a pretty promising direction, but sometimes filling out the whole table can give you additional insights.
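As a minimal sketch, the table could be kept as a nested dictionary like the one below; only the two human-level numbers (4% and 6%) come from the example above, while the remaining entries are placeholders (None) that you would fill in from your own measurements:

```python
errors = {
    "general speech data": {
        "human-level error": 4.0,
        "error on examples trained on": None,      # training error
        "error on examples not trained on": None,  # training-dev error
    },
    "rearview mirror speech data": {
        "human-level error": 6.0,
        "error on examples trained on": None,      # rarely measured
        "error on examples not trained on": None,  # dev/test error
    },
}
```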

Now, suppose your training set comes from a different distribution than your dev/test set, and error analysis shows that you have a data mismatch problem. What we usually do then is carry out manual error analysis and try to understand the differences between the training and dev/test sets. For example, if you're building a speech-activated rearview mirror application, you might listen to examples in your dev set to figure out how it differs from your training set. You may find that a lot of examples in your dev set are very noisy, with lots of car noise; this is one way your dev set differs from your training set. You might also find other categories of errors; for example, in the speech-activated rearview mirror application, you might find that the system often misrecognizes street numbers, because there are a lot of navigation queries that contain street addresses, so getting street numbers right is really important. When you have insight into the nature of the dev set errors, you can try to find ways to make the training data more similar to the dev data, or alternatively try to collect more data similar to your dev/test sets. For example, if you find that car noise in the background is a major source of error, one thing you could do is simulate noisy in-car data. Similarly, if you find that you're having a hard time recognizing street numbers, maybe you can deliberately go and get more data of people speaking out numbers and add that to your training set.

So, if you want to make the training data more similar to your dev set, what are some things you can do? One technique is artificial data synthesis. To build a speech recognition system, maybe you don't have a lot of audio that was actually recorded inside a car, with car background noise, highway noise, and so on, but it turns out there is a way to synthesize it. Say you have recorded a large amount of clean audio without any car background noise, and you can also get a recording of car noise; if you take those two clips and add them together, you can synthesize audio in which the original clean speech now sounds as if it were recorded in a noisy car. Through artificial data synthesis, you might be able to quickly create data that sounds like it was recorded inside a car, without needing to go out and collect tons of data in a car that's actually driving along.
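As a minimal sketch, assuming the clean speech and the car noise are available as WAV files at the same sample rate (the file names, the soundfile dependency, and the 0.3 noise gain are illustrative choices), the mixing step could look like this:

```python
import numpy as np
import soundfile as sf  # assumed available for reading/writing WAV files

clean, sr = sf.read("clean_speech.wav")  # hypothetical clean recording
noise, _ = sf.read("car_noise.wav")      # hypothetical car-noise recording

# Tile or trim the noise so it matches the length of the clean clip.
if len(noise) < len(clean):
    noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
noise = noise[:len(clean)]

# Mix: clean speech plus attenuated car noise.
synthesized = np.clip(clean + 0.3 * noise, -1.0, 1.0)

sf.write("synthesized_in_car.wav", synthesized, sr)
```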

Now there's one note of caution. Let's say you have 10,000 hours of data that was recorded against a quiet background, and you have just one hour of car noise. One thing you could try is to take this one hour of noise and repeat it 10,000 times in order to add it to the 10,000 hours of clean audio. If you do that, the synthesized audio will sound perfectly fine to the human ear, because one hour of car noise sounds just like any other hour of car noise to a person, but there's a risk that your learning algorithm will overfit to that one hour of car noise. It's possible, though not guaranteed, that using 10,000 hours of unique car noise rather than just one hour would result in better performance for your learning algorithm.
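As a minimal sketch of the safer variant, instead of tiling one clip you can draw a different random segment from a pool of noise recordings for every clean utterance (the function name, the noise pool, and the 0.3 gain are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_with_random_noise(clean, noise_pool, gain=0.3):
    """Add a randomly chosen noise segment of the same length as `clean`."""
    noise_clip = noise_pool[rng.integers(len(noise_pool))]
    start = rng.integers(0, max(1, len(noise_clip) - len(clean)))
    segment = noise_clip[start:start + len(clean)]
    if len(segment) < len(clean):  # pad if the chosen clip is too short
        segment = np.pad(segment, (0, len(clean) - len(segment)))
    return np.clip(clean + gain * segment, -1.0, 1.0)
```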

The challenge with artificial data synthesis is that to the human ear the 10,000 hours of repeated car noise sound just the same as the one hour of noise, so you might end up creating a very impoverished synthesized dataset drawn from a much smaller subset of the space (the set of all possible car noise backgrounds) without realizing it.

Here is another artificial data synthesis example. Let's say you're building a self-driving car, so you want to detect vehicles and put a bounding box around each one. One idea is to use computer graphics to simulate tons of images of cars. Graphics-rendered cars can look pretty realistic, and you can imagine that by synthesizing pictures like this you could train a pretty good computer vision system for detecting cars. Unfortunately, if you only synthesize a very small subset of the set of all possible cars, then to the human eye the synthesized images may look fine, but your system will overfit to the small subset you're synthesizing.
