Deep learning: orthogonalization, evaluation metrics, train/dev/test set
In the earlier TV-set example, orthogonalization refers to the fact that the TV designers built the knobs so that each knob does only one thing, which makes it much easier to tune the TV so that, say, the picture is centered. In the car example, if you think of one dimension of what you want to do as controlling the steering angle, and another dimension as controlling your speed, then you want one knob that affects the steering angle as much as possible and nothing else, and another knob, really acceleration and braking, that controls only your speed. If instead you had a control that mixed the two, changing both angle and speed at the same time, it would be much harder to set the car to the speed and angle you want. Orthogonal means at 90 degrees to each other; with orthogonal controls that are ideally aligned with the things you actually want to control, tuning becomes much easier.
So, how does this relate to machine learning? First, you have to make sure you're at least doing well on the training set, so performance on the training set needs to pass some acceptability assessment. After that, you hope the system also does well on the dev set, and then on the test set, and finally you hope that doing well on the test set under your cost function results in the system performing well in the real world.
If your algorithm is not fitting the training set well on the cost function, you want one knob, or one specific set of knobs, you can tune to make it fit the training set well; the knobs here are things like training a bigger network or switching to a better optimization algorithm. In contrast, if your algorithm is not fitting the dev set well, that is, it does well on the training set but not on the dev set, then there is a separate set of knobs around regularization that you can use to try to satisfy this second criterion, and getting a bigger training set is another knob that helps your learning algorithm generalize better to the dev set. If you are doing well on the training set and dev set but not on the test set, the knob to tune is probably getting a bigger dev set, because doing well on the dev set but not the test set usually means you have overfit the dev set. Finally, if your algorithm does well on the test set but not in the real world, you probably need to go back and change either the test set or the cost function, because either your test set distribution isn't set correctly or your cost function isn't measuring the right thing.
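To make the "which knob to turn" reasoning above concrete, here is a small, purely illustrative Python sketch (not part of the original notes); the error thresholds, gap sizes, and suggestions are hypothetical choices, not a prescribed recipe.

```python
# Hypothetical sketch of the knob-selection logic described above.
# `target` and `gap` are made-up thresholds for this illustration.

def suggest_knob(train_err, dev_err, test_err, real_world_ok,
                 target=0.05, gap=0.02):
    if train_err > target:
        return "Fit the training set better: bigger network, better optimizer."
    if dev_err - train_err > gap:
        return "Generalize to the dev set: regularization or a bigger training set."
    if test_err - dev_err > gap:
        return "Doing well on dev but not test: get a bigger dev set."
    if not real_world_ok:
        return "Change the dev/test set distribution or the cost function/metric."
    return "All four criteria look fine."

print(suggest_knob(0.02, 0.025, 0.10, True))  # -> suggests a bigger dev set
```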
If you set a single real-number evaluation metric for your problem, you'll find your progress is much faster. Applied machine learning is a very empirical process: you have an idea, code it up, run an experiment to see how it works, and then use the outcome of the experiment to refine your idea; you keep going around this loop to keep improving your algorithm.
Take the example of recognizing cats: you have two classifiers to compare in terms of the metrics precision and recall. Precision means: of the examples your classifier labels as cats, what percentage actually are cats. Recall means: of all the images that really are cats, what percentage were correctly recognized by your classifier. It turns out there is often a trade-off between precision and recall, and you care about both: you want that when the classifier says something is a cat, there's a high chance it really is a cat, but of all the cat images you also want it to pull out a large fraction as cats. So, it might seem reasonable to evaluate your classifiers in terms of precision and recall. The problem is that if classifier A does better on recall whereas classifier B does better on precision, you're not sure which classifier is better. So, what is recommended here is that rather than using the two numbers precision and recall, you define a new evaluation metric that combines them. In the machine learning literature, that metric is called the F1 score; you can informally think of it as the average of precision and recall, though technically it is their harmonic mean, and it does a good job of trading off precision against recall. So, having a well-defined dev set, on which you measure precision and recall, plus a single-number evaluation metric allows you to quickly tell whether classifier A or classifier B is better.
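As a minimal sketch of these definitions (the label lists below are made up for illustration), precision, recall, and the F1 score for a binary cat classifier could be computed like this:

```python
# y_true and y_pred are hypothetical lists of 0/1 labels (1 = cat).

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)

    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted cats, how many are cats
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of real cats, how many were found
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Compare two classifiers by their single F1 number instead of two numbers.
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```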
It's not always easy to combine all the things you care about into a single-number evaluation metric. In such cases, it's sometimes useful to set up satisficing as well as optimizing metrics. Take the example of caring about both a classifier's accuracy and its running time: combining the two with a linear weighted sum seems a little bit artificial. Instead, you might choose the classifier that maximizes accuracy subject to the running time being less than or equal to 100ms. In this case, we say accuracy is the optimizing metric, because you want to maximize it, while running time is what we call a satisficing metric, meaning it just has to be good enough: it needs to be less than or equal to 100ms, and beyond that you don't really care, or don't care that much.
So, more generally, if you have n metrics that you care about, it's sometimes reasonable to pick one of them as the optimizing metric, which you want to do as well as possible on, and treat the other n-1 as satisficing metrics, meaning that as long as they reach some threshold you don't care how much better they are beyond that threshold.
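A minimal sketch of this selection rule (the classifier names, accuracies, and running times below are made-up numbers): filter by the satisficing metric, then maximize the optimizing metric.

```python
# Hypothetical candidates, each described by accuracy and running time.
classifiers = {
    "A": {"accuracy": 0.90, "runtime_ms": 80},
    "B": {"accuracy": 0.92, "runtime_ms": 95},
    "C": {"accuracy": 0.95, "runtime_ms": 1500},  # highest accuracy, but too slow
}

# Satisficing metric: running time <= 100 ms. Optimizing metric: accuracy.
feasible = {name: m for name, m in classifiers.items() if m["runtime_ms"] <= 100}
best = max(feasible, key=lambda name: feasible[name]["accuracy"])
print(best)  # -> "B"
```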
The way you set up your training/dev/test sets can have a huge impact on how rapidly you can make progress on building a machine learning application. You try a lot of ideas, train different models on the training set, use the dev set to evaluate the different ideas and pick one, and keep iterating to improve dev set performance until, finally, you have one classifier you're happy with, which you then evaluate on your test set.
Take the example of building a cat classifier that will operate in eight different regions. One way you could set up the dev/test sets is to pick four of these regions as the dev set and the other four as the test set. This is a very bad idea, because your dev and test sets would come from different distributions. Setting up a dev set plus a single real-number evaluation metric is like placing a target, telling your team where you think the bullseye is that they should aim at. The team can then iterate quickly, trying different ideas, running experiments, and very quickly using the dev set and the metric to evaluate classifiers and pick the best one. The problem is that your team may spend months iterating to do well on the dev set, only to realize, when you finally go to the test set, that all the months of work spent optimizing for the dev set are not giving you good performance on the test set. So, having dev and test sets from different distributions is like setting a target, having your team spend months getting close to the bullseye, and then suddenly moving the bullseye to a different location.
So, to avoid this, what I recommend is that you take all this data, randomly shuffle it into the dev and test sets, and have data from all eight regions in both, so that the dev set and test set really come from the same distribution, namely the distribution of all your data mixed together.
Now, you know the dev and test sets should come from the same distribution, but how large should they be? The old rule of thumb was to take all the data you have and use a 70%/30% split into training and test sets, or maybe 60% for training, 20% for the dev set and 20% for the test set. In the early era of machine learning this was pretty reasonable, especially when dataset sizes were smaller, but in modern machine learning we are used to working with much larger datasets. Say you have a million examples; it might be quite reasonable to set up the data as 98% training set, 1% dev set and 1% test set, because with a million examples, 1% is ten thousand examples, which may be plenty for your dev set or test set.
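Here is a minimal sketch of the two recommendations above, shuffling data from all eight regions into one pool and using a 98%/1%/1% split; `examples_by_region` is a hypothetical dict mapping a region name to its list of examples.

```python
import random

def make_splits(examples_by_region, seed=0):
    # Pool the data from all regions so dev and test share one distribution.
    data = [ex for region in examples_by_region.values() for ex in region]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_dev = n_test = n // 100                  # 1% each, ~10,000 examples for n = 1e6
    train = data[: n - n_dev - n_test]         # remaining 98% for training
    dev = data[n - n_dev - n_test : n - n_test]
    test = data[n - n_test :]
    return train, dev, test
```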
So, how about the test set? Remember, the purpose of your test set is that after you finish developing a system, the test set helps you evaluate how good your final system is. The guideline is to make your test set big enough to give high confidence in the overall performance of your system. For some applications you may not need high confidence in the overall performance of the final system; maybe all you need is a training set and a dev set, and not having a test set might be okay. This is a little unusual, and I'm not recommending not having a test set: when I'm building a system, I do find it reassuring to have a separate test set you can use to get an unbiased estimate of how the system is doing before you ship it.
Setting up a dev set and an evaluation metric is like placing a target somewhere for your team to aim at, but sometimes, partway through a project, you might realize you put the target in the wrong place; in that case, you should move your target. Let's say you build a cat classifier and the metric is classification error. Algorithm A, with 3% classification error, seems better than algorithm B, with 5%. But suppose that when you try out these algorithms, algorithm A, for some reason, is letting through a lot of pornographic images. If you ship algorithm A, users will see more cat images because of its lower 3% classification error, but it will also show them some pornographic images, which is totally unacceptable. In contrast, algorithm B, with its 5% classification error, misclassifies a few more images, but it doesn't let through pornographic images. So, from your company's point of view, as well as from the users' point of view, algorithm B is actually the better algorithm.
So, in this case, the evaluation metric plus the dev set prefer algorithm A, because it has lower error, but you and your users prefer algorithm B. When this happens, when your evaluation metric no longer correctly rank-orders your preferences between algorithms, that is a sign you should change your evaluation metric, or perhaps your dev set or test set. The problem with this evaluation metric, the plain misclassification error Error = (1/m_dev) * sum over i of I{y_pred^(i) ≠ y^(i)}, is that it treats pornographic and non-pornographic images equally, but you really want your classifier not to mislabel pornographic images as cat images. One way to change the metric is to add a weight term w^(i), equal to 1 if x^(i) is non-porn and 10 (or an even larger number) if x^(i) is porn, and the normalization constant becomes the sum of w^(i) over i, giving Error = (1/sum_i w^(i)) * sum over i of w^(i) * I{y_pred^(i) ≠ y^(i)}.
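A minimal sketch of this weighted error, assuming hypothetical lists `y_true` and `y_pred` of 0/1 cat labels and an `is_porn` list flagging which images are pornographic:

```python
def weighted_error(y_true, y_pred, is_porn, porn_weight=10.0):
    # w^(i) = 1 for non-porn examples, 10 (or larger) for porn examples
    weights = [porn_weight if p else 1.0 for p in is_porn]
    # Each mistake is counted with its weight, so mislabeling a porn image
    # as a cat costs far more than an ordinary misclassification.
    mistakes = [w * (yp != yt) for w, yt, yp in zip(weights, y_true, y_pred)]
    # Normalize by the sum of the weights instead of by m_dev
    return sum(mistakes) / sum(weights)
```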
So far we've talked about how to define an evaluation metric that evaluates classifiers and helps us better rank them when they perform at varying levels in terms of screening out pornographic images. This is actually another example of orthogonalization, where you take a machine learning problem and break it into distinct steps: one knob, or step, is to figure out how to define a metric that captures what you want to do, and a separate knob is to worry about how to actually do well on that metric.
Another example: algorithm A does better than algorithm B on your metric, but when you deploy the algorithm as a product you find that algorithm B is actually performing better. It turns out you have been training and evaluating on very nicely framed, high-quality images, but when you deploy the product in a mobile app, users upload all sorts of pictures that are less well framed and much blurrier, and on those images algorithm B actually does better. This is another example of your metric and dev/test set failing to reflect what you actually care about.
Chen Yang