Deep learning: Training and testing on different distributions

If you're working on a brand new machine learning application, one piece of advice is to build your first system quickly and then iterate. First, quickly set up a dev set, a test set, and a metric; this is really deciding where to place your target, and if you get it wrong you can always move it later, but just set the target up somewhere. Then build your initial machine learning system quickly, find a training set, train it, and start to see and understand how well you're doing against your dev/test set on your evaluation metric. Once you've built your initial system, you'll be able to use bias/variance analysis as well as error analysis to prioritize the next steps. In particular, if error analysis causes you to realize that a lot of errors come from the speaker being very far from the microphone, which poses special challenges for speech recognition, then that gives you a good reason to focus on techniques for far-field speech recognition, which basically means handling the case where the speaker is far from the microphone.

A note on the value of building this initial system: it can be a quick and dirty implementation, don't overthink it. The value of the initial system is that having some trained system lets you look at bias/variance to prioritize what to do next, and lets you do error analysis, looking at some of the mistakes to figure out, among all the different directions you could go in, which ones are actually the most worthwhile.

Deep learning algorithms have a huge hunger for training data; they just work best when you can find enough labeled training data to put into the training set. This has resulted in many teams shoving whatever data they can find into the training set, just to get more training data, even if some of it, or even a lot of it, doesn't come from the same distribution as the dev and test data. There are some subtleties and best practices for dealing with the case where your training and test distributions differ from each other.

Take the cat detector example again and look at the following two sources of data. One is from crawling the web and downloading a lot of professionally framed, high-resolution images of cats. The other is the data you really care about, from the mobile app, which tends to be less professionally shot, less well framed, and maybe even blurrier because it is shot by amateur users. Let's say you don't have a lot of mobile users yet, so you've only got 10,000 pictures uploaded from the app, whereas by crawling the web you can download huge numbers of cat pictures, say 200,000. What you really care about is that your final system works well on the mobile app distribution of images. One option is to put both of these datasets together and randomly shuffle them into a training, dev, and test set. The advantage is that your training, dev, and test sets now all come from the same distribution, but the disadvantage, which is actually a huge disadvantage, is that if you look at your dev set (say 2,500 examples), most of it will come from the web page distribution of images rather than what you actually care about, which is the mobile app distribution, and you don't want to spend most of your time optimizing for the web page distribution of images. So I don't recommend this option, because it sets up your dev set to tell your team to optimize for a different distribution of data than the one you actually care about.

Instead, I recommend another option: put all 200,000 web images into the training set, and, if you want, add 5,000 images from the mobile app as well; then make your dev and test sets all mobile app images. The advantage of splitting up your data this way is that you're aiming the target where you want it to be. The disadvantage, of course, is that your training distribution is now different from your dev and test distributions, but it turns out that this split gives better performance over the long term, and we will discuss later some specific techniques for dealing with a training set that comes from a different distribution than your dev and test sets.
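As a rough illustration, here is a minimal sketch of this recommended split in Python; the placeholder lists and file names are hypothetical stand-ins for your actual labeled examples, and the counts simply mirror the numbers above.

```python
import random

# Hypothetical placeholders for the two data sources.
web_images = [f"web_{i}.jpg" for i in range(200_000)]       # crawled, professionally framed
mobile_images = [f"mobile_{i}.jpg" for i in range(10_000)]  # amateur shots from the app

random.seed(0)
random.shuffle(mobile_images)

# Training set: all web images plus half of the mobile images.
train_set = web_images + mobile_images[:5_000]
random.shuffle(train_set)

# Dev and test sets: mobile app images only, since that is the
# distribution the final system must do well on.
dev_set = mobile_images[5_000:7_500]
test_set = mobile_images[7_500:]
```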

Estimating the bias and variance of your learning algorithm really helps you prioritize what to work on next, but the way you analyze bias and variance changes when your training set comes from a different distribution than your dev and test sets. Let's keep using the cat classifier example, and let's say humans get near-perfect performance on it, so Bayes error is nearly 0% for this problem. To carry out the analysis you usually look at the training error and the dev error; suppose they are 1% and 10% respectively. If your dev data came from the same distribution as your training set, you would conclude that you have a large variance problem: your algorithm has not generalized well from the training set, which it is doing well on, to the dev set, which it is suddenly doing much worse on. But in the setting where your training data and dev data come from different distributions, you can no longer safely draw this conclusion. In particular, maybe the algorithm is doing just fine on the dev distribution and the training set was simply easier, because it consisted of high-resolution, very clear images, while the dev set is much harder; maybe there isn't a variance problem at all, and the gap just reflects that the dev set contains images that are more difficult to classify accurately. The problem with this analysis is that when you go from the training error to the dev error, two things change at the same time: first, the algorithm saw the data in the training set but not the data in the dev set; second, the distribution of the dev set is different. So it's difficult to know how much of this 9% increase in error is because the algorithm didn't see the dev data, which is the variance part of the problem, and how much is because the dev set data is simply distributed differently.

In order to tease out these two effects, it is useful to define a new piece of data which we'll call the training-dev set: it has the same distribution as the training set, but you don't explicitly train the neural network on it. To obtain the training-dev set, randomly shuffle the training set and then carve out just a piece of it to be the training-dev set. So, just as the dev and test sets have the same distribution, the training set and the training-dev set also have the same distribution. The difference is that you train your neural network only on the training set; you don't run backpropagation on the training-dev portion of the data. To carry out error analysis, you now look at the error of the classifier on the training set, on the training-dev set, and on the dev set. If your training error is 1% and your training-dev error goes up to 9%, you have a variance problem, because the only difference between the training data and the training-dev data is that your neural network got to see, and was explicitly trained on, the former but not the latter. Now consider another case: your training error is 1%, your training-dev error is 1.5%, and your dev error goes up to 10%. Here you actually have a pretty low variance problem, because when you went from training data that the network has seen to training-dev data that it has not seen, the error increased only a little, but it jumps when you go to the dev set. This is a data mismatch problem.
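As a minimal sketch, assuming `train_set` is a list holding your training examples (the roughly 205,000 images from the split above, say), carving out a training-dev set could look like this:

```python
import random

# Hypothetical placeholder for the full training data.
train_set = [f"train_example_{i}" for i in range(205_000)]

random.seed(1)
random.shuffle(train_set)

# Hold out a slice that has the SAME distribution as the training data.
# Backpropagation runs only on train_set, never on train_dev_set.
train_dev_set = train_set[:2_500]
train_set = train_set[2_500:]
```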

Let's look at a few more examples. If your training error is 10%, your training-dev error is 11%, and your dev error is 12%, and remember that the human-level proxy for Bayes error is roughly 0%, then you really have an avoidable bias problem, because you're doing much worse than human level. And one last example: if your training error is 10%, your training-dev error is 11%, and your dev error is 20%, then it looks like there are actually two issues. The avoidable bias is quite high, because you're not even doing well on the training set while human-level error is near 0%; the variance from the training set to the training-dev set seems quite small; but the data mismatch from the training-dev set to the dev set is quite large.

Let's write out the general principles. The key quantities to look at are the human-level error, the training error, the training-dev error, and the dev error. Depending on the differences between these errors, you can get a sense of how big the avoidable bias, the variance, and the data mismatch are. You could also add the test error: the gap between the dev and test errors tells you the degree of overfitting to the dev set, so if that gap is large, maybe you need a bigger dev set.
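As a minimal sketch, here is the bookkeeping for those gaps, using the numbers from the last example above (the test error is a hypothetical value added only to illustrate the dev/test gap):

```python
human_level_error = 0.0     # proxy for Bayes error
training_error = 10.0
training_dev_error = 11.0
dev_error = 20.0
test_error = 21.0           # hypothetical

avoidable_bias = training_error - human_level_error  # 10.0 -> large avoidable bias
variance = training_dev_error - training_error       # 1.0  -> small variance
data_mismatch = dev_error - training_dev_error       # 9.0  -> large data mismatch
dev_overfitting = test_error - dev_error             # 1.0  -> little overfitting to the dev set
```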

Sometimes, if your dev/test set distribution happens to be much easier for whatever application you're working on, e.g., speech recognition, the dev and test errors can actually go down below the errors on the training distribution.

Let me motivate this somewhat surprising possibility using the speech-activated rearview mirror example. It turns out the key quantities we've been writing down can be placed into a table. On the horizontal axis I'll place the different datasets; for example, you might have data from a general speech recognition task and also rearview-mirror-specific speech data recorded inside the car. On the vertical axis I'll label different ways of examining the data: first, human-level performance, i.e., how accurate humans are on each of these datasets; then the error on examples that your neural network has trained on; and finally the error on examples that your neural network has not trained on. As stated previously, the gap between human-level error and training error measures avoidable bias, the gap between training error and training-dev error measures variance, and the gap between training-dev error and dev/test error measures data mismatch. But it can also be useful to fill in the remaining two entries in this table, (human-level error, rearview mirror speech data) and (training error, rearview mirror speech data), because comparing human-level performance on general speech recognition data and on rearview mirror speech data tells us that for humans the rearview mirror speech data is actually harder than general speech recognition: humans get 6% error rather than 4% error. For a lot of problems, examining the subset of entries used for the avoidable bias, variance, and data mismatch comparisons is enough to point you in a pretty promising direction, but sometimes filling out the whole table can give you additional insights.
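As a minimal sketch, the table could be kept as a nested dictionary like the one below; only the two human-level numbers (4% and 6%) come from the example above, while the remaining entries are placeholders (None) that you would fill in from your own measurements:

```python
errors = {
    "general speech data": {
        "human-level error": 4.0,
        "error on examples trained on": None,      # training error
        "error on examples not trained on": None,  # training-dev error
    },
    "rearview mirror speech data": {
        "human-level error": 6.0,
        "error on examples trained on": None,      # rarely measured
        "error on examples not trained on": None,  # dev/test error
    },
}
```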

Now, suppose your training set comes from a different distribution than your dev/test set, and error analysis shows that you have a data mismatch problem. What we usually do then is carry out manual error analysis and try to understand the differences between the training and dev/test sets. For example, if you're building a speech-activated rearview mirror application, you might listen to examples in your dev set to figure out how it differs from your training set. You may find that a lot of examples in your dev set are very noisy, with lots of car noise; this is one way your dev set differs from your training set. You might also find other categories of errors; for example, in the speech-activated rearview mirror application, you might find that the system often misrecognizes street numbers, because there are a lot of navigation queries that contain street addresses, so getting street numbers right is really important. When you have insight into the nature of the dev set errors, you can try to find ways to make the training data more similar to the dev data, or alternatively try to collect more data similar to your dev/test sets. For example, if you find that car noise in the background is a major source of error, one thing you could do is simulate noisy in-car data. Similarly, if you find that you're having a hard time recognizing street numbers, maybe you can deliberately go and get more data of people speaking out numbers and add that to your training set.

So, if you want to make the training data more similar to your dev set, what are some things you can do? One technique is artificial data synthesis. To build a speech recognition system, maybe you don't have a lot of audio that was actually recorded inside a car, with car background noise, highway noise, and so on, but it turns out there is a way to synthesize it. Say you have recorded a large amount of clean audio without any car background noise, and you can also get a recording of car noise; if you take those two clips and add them together, you can synthesize audio in which the original clean speech now sounds as if it were recorded in a noisy car. Through artificial data synthesis, you might be able to quickly create data that sounds like it was recorded inside a car, without needing to go out and collect tons of data in a car that's actually driving along.
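As a minimal sketch, assuming the clean speech and the car noise are available as WAV files at the same sample rate (the file names, the soundfile dependency, and the 0.3 noise gain are illustrative choices), the mixing step could look like this:

```python
import numpy as np
import soundfile as sf  # assumed available for reading/writing WAV files

clean, sr = sf.read("clean_speech.wav")  # hypothetical clean recording
noise, _ = sf.read("car_noise.wav")      # hypothetical car-noise recording

# Tile or trim the noise so it matches the length of the clean clip.
if len(noise) < len(clean):
    noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
noise = noise[:len(clean)]

# Mix: clean speech plus attenuated car noise.
synthesized = np.clip(clean + 0.3 * noise, -1.0, 1.0)

sf.write("synthesized_in_car.wav", synthesized, sr)
```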

Now there's one note of caution. Let's say you have 10,000 hours of data that was recorded against a quiet background, and you have just one hour of car noise. One thing you could try is to take this one hour of noise and repeat it 10,000 times in order to add it to the 10,000 hours of clean audio. If you do that, the synthesized audio will sound perfectly fine to the human ear, because one hour of car noise sounds just like any other hour of car noise to a person, but there's a risk that your learning algorithm will overfit to that one hour of car noise. It's possible, though not guaranteed, that using 10,000 hours of unique car noise rather than just one hour would result in better performance for your learning algorithm.
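As a minimal sketch of the safer variant, instead of tiling one clip you can draw a different random segment from a pool of noise recordings for every clean utterance (the function name, the noise pool, and the 0.3 gain are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_with_random_noise(clean, noise_pool, gain=0.3):
    """Add a randomly chosen noise segment of the same length as `clean`."""
    noise_clip = noise_pool[rng.integers(len(noise_pool))]
    start = rng.integers(0, max(1, len(noise_clip) - len(clean)))
    segment = noise_clip[start:start + len(clean)]
    if len(segment) < len(clean):  # pad if the chosen clip is too short
        segment = np.pad(segment, (0, len(clean) - len(segment)))
    return np.clip(clean + gain * segment, -1.0, 1.0)
```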

The challenge with artificial data synthesis is that to the human ear the 10,000 hours of repeated car noise sound just the same as the one hour of noise, so you might end up creating a very impoverished synthesized dataset drawn from a much smaller subset of the space (the set of all possible car noise backgrounds) without realizing it.

Here is another artificial data synthesis example. Let's say you're building a self-driving car, so you want to detect vehicles and put a bounding box around each one. One idea is to use computer graphics to simulate tons of images of cars. Graphics-rendered cars can look pretty realistic, and you can imagine that by synthesizing pictures like this you could train a pretty good computer vision system for detecting cars. Unfortunately, if you only synthesize a very small subset of the set of all possible cars, then to the human eye the synthesized images may look fine, but your system will overfit to the small subset you're synthesizing.
