Deep learning: orthogonalization, evaluation metrics, train/dev/test set
In the earlier TV-set example, orthogonalization refers to the fact that the TV designers built the knobs so that each knob does only one thing, which makes it much easier to tune the TV so that, say, the picture is centered. In the car example, if you think of one dimension of what you want to do as controlling the steering angle, and another dimension as controlling your speed, then you want one knob that affects the steering angle as much as possible and nothing else, and another knob, really acceleration and braking, that controls only your speed. If instead you had a control that mixed the two, changing both angle and speed at the same time, it would be much harder to set the car to the speed and angle you want. Orthogonal means at 90 degrees to each other; with orthogonal controls that are ideally aligned with the things you actually want to control, tuning becomes much easier.
So, how does this relate to machine learning? First, you have to make sure you're at least doing well on the training set, so performance on the training set needs to pass some acceptability assessment. After that, you hope the system also does well on the dev set, and then on the test set, and finally you hope that doing well on the test set under your cost function results in the system performing well in the real world.
If your algorithm is not fitting the training set well on the cost function, you want one knob, or one specific set of knobs, you can tune to make it fit the training set well; the knobs here are things like training a bigger network or switching to a better optimization algorithm. In contrast, if your algorithm is not fitting the dev set well, that is, it does well on the training set but not on the dev set, then there is a separate set of knobs around regularization that you can use to try to satisfy this second criterion, and getting a bigger training set is another knob that helps your learning algorithm generalize better to the dev set. If you are doing well on the training set and dev set but not on the test set, the knob to tune is probably getting a bigger dev set, because doing well on the dev set but not the test set usually means you have overfit the dev set. Finally, if your algorithm does well on the test set but not in the real world, you probably need to go back and change either the test set or the cost function, because either your test set distribution isn't set correctly or your cost function isn't measuring the right thing.
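To make the "which knob to turn" reasoning above concrete, here is a small, purely illustrative Python sketch (not part of the original notes); the error thresholds, gap sizes, and suggestions are hypothetical choices, not a prescribed recipe.

```python
# Hypothetical sketch of the knob-selection logic described above.
# `target` and `gap` are made-up thresholds for this illustration.

def suggest_knob(train_err, dev_err, test_err, real_world_ok,
                 target=0.05, gap=0.02):
    if train_err > target:
        return "Fit the training set better: bigger network, better optimizer."
    if dev_err - train_err > gap:
        return "Generalize to the dev set: regularization or a bigger training set."
    if test_err - dev_err > gap:
        return "Doing well on dev but not test: get a bigger dev set."
    if not real_world_ok:
        return "Change the dev/test set distribution or the cost function/metric."
    return "All four criteria look fine."

print(suggest_knob(0.02, 0.025, 0.10, True))  # -> suggests a bigger dev set
```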
If you set a single real-number evaluation metric for your problem, you'll find your progress is much faster. Applied machine learning is a very empirical process: you have an idea, code it up, run an experiment to see how it works, and then use the outcome of the experiment to refine your idea; you keep going around this loop to keep improving your algorithm.
Take the example of recognizing cats: you have two classifiers to compare in terms of the metrics precision and recall. Precision means: of the examples your classifier labels as cats, what percentage actually are cats. Recall means: of all the images that really are cats, what percentage were correctly recognized by your classifier. It turns out there is often a trade-off between precision and recall, and you care about both: you want that when the classifier says something is a cat, there's a high chance it really is a cat, but of all the cat images you also want it to pull out a large fraction as cats. So, it might seem reasonable to evaluate your classifiers in terms of precision and recall. The problem is that if classifier A does better on recall whereas classifier B does better on precision, you're not sure which classifier is better. So, what is recommended here is that rather than using the two numbers precision and recall, you define a new evaluation metric that combines them. In the machine learning literature, that metric is called the F1 score; you can informally think of it as the average of precision and recall, though technically it is their harmonic mean, and it does a good job of trading off precision against recall. So, having a well-defined dev set, on which you measure precision and recall, plus a single-number evaluation metric allows you to quickly tell whether classifier A or classifier B is better.
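As a minimal sketch of these definitions (the label lists below are made up for illustration), precision, recall, and the F1 score for a binary cat classifier could be computed like this:

```python
# y_true and y_pred are hypothetical lists of 0/1 labels (1 = cat).

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)

    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted cats, how many are cats
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of real cats, how many were found
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Compare two classifiers by their single F1 number instead of two numbers.
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```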
It's not always easy to combine all the things you care about into a single-number evaluation metric. In such cases, it's sometimes useful to set up satisficing as well as optimizing metrics. Take the example of caring about both a classifier's accuracy and its running time: combining the two with a linear weighted sum seems a little bit artificial. Instead, you might choose the classifier that maximizes accuracy subject to the running time being less than or equal to 100ms. In this case, we say accuracy is the optimizing metric, because you want to maximize it, while running time is what we call a satisficing metric, meaning it just has to be good enough: it needs to be less than or equal to 100ms, and beyond that you don't really care, or don't care that much.
So, more generally, if you have n metrics that you care about, it's sometimes reasonable to pick one of them as the optimizing metric, which you want to do as well as possible on, and treat the other n-1 as satisficing metrics, meaning that as long as they reach some threshold you don't care how much better they are beyond that threshold.
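A minimal sketch of this selection rule (the classifier names, accuracies, and running times below are made-up numbers): filter by the satisficing metric, then maximize the optimizing metric.

```python
# Hypothetical candidates, each described by accuracy and running time.
classifiers = {
    "A": {"accuracy": 0.90, "runtime_ms": 80},
    "B": {"accuracy": 0.92, "runtime_ms": 95},
    "C": {"accuracy": 0.95, "runtime_ms": 1500},  # highest accuracy, but too slow
}

# Satisficing metric: running time <= 100 ms. Optimizing metric: accuracy.
feasible = {name: m for name, m in classifiers.items() if m["runtime_ms"] <= 100}
best = max(feasible, key=lambda name: feasible[name]["accuracy"])
print(best)  # -> "B"
```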
The way you set up your training/dev/test sets can have a huge impact on how rapidly you can make progress on building a machine learning application. You try a lot of ideas, train different models on the training set, use the dev set to evaluate the different ideas and pick one, and keep iterating to improve dev set performance until, finally, you have one classifier you're happy with, which you then evaluate on your test set.
Take the example of building a cat classifier that will operate in eight different regions. One way you could set up the dev/test sets is to pick four of these regions as the dev set and the other four as the test set. This is a very bad idea, because your dev and test sets would come from different distributions. Setting up a dev set plus a single real-number evaluation metric is like placing a target, telling your team where you think the bullseye is that they should aim at. The team can then iterate quickly, trying different ideas, running experiments, and very quickly using the dev set and the metric to evaluate classifiers and pick the best one. The problem is that your team may spend months iterating to do well on the dev set, only to realize, when you finally go to the test set, that all the months of work spent optimizing for the dev set are not giving you good performance on the test set. So, having dev and test sets from different distributions is like setting a target, having your team spend months getting close to the bullseye, and then suddenly moving the bullseye to a different location.
So, to avoid this, what I recommend is that you take all this data, randomly shuffle it into the dev and test sets, and have data from all eight regions in both, so that the dev set and test set really come from the same distribution, namely the distribution of all your data mixed together.
Now, you know the dev and test sets should come from the same distribution, but how large should they be? The old rule of thumb was to take all the data you have and use a 70%/30% split into training and test sets, or maybe 60% for training, 20% for the dev set and 20% for the test set. In the early era of machine learning this was pretty reasonable, especially when dataset sizes were smaller, but in modern machine learning we are used to working with much larger datasets. Say you have a million examples; it might be quite reasonable to set up the data as 98% training set, 1% dev set and 1% test set, because with a million examples, 1% is ten thousand examples, which may be plenty for your dev set or test set.
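Here is a minimal sketch of the two recommendations above, shuffling data from all eight regions into one pool and using a 98%/1%/1% split; `examples_by_region` is a hypothetical dict mapping a region name to its list of examples.

```python
import random

def make_splits(examples_by_region, seed=0):
    # Pool the data from all regions so dev and test share one distribution.
    data = [ex for region in examples_by_region.values() for ex in region]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_dev = n_test = n // 100                  # 1% each, ~10,000 examples for n = 1e6
    train = data[: n - n_dev - n_test]         # remaining 98% for training
    dev = data[n - n_dev - n_test : n - n_test]
    test = data[n - n_test :]
    return train, dev, test
```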
So, how about the test set? Remember, the purpose of your test set is that after you finish developing a system, the test set helps you evaluate how good your final system is. The guideline is to make your test set big enough to give high confidence in the overall performance of your system. For some applications you may not need high confidence in the overall performance of the final system; maybe all you need is a training set and a dev set, and not having a test set might be okay. This is a little unusual, and I'm not recommending not having a test set: when I'm building a system, I do find it reassuring to have a separate test set you can use to get an unbiased estimate of how the system is doing before you ship it.
Setting up a dev set and an evaluation metric is like placing a target somewhere for your team to aim at, but sometimes, partway through a project, you might realize you put the target in the wrong place; in that case, you should move your target. Let's say you build a cat classifier and the metric is classification error. Algorithm A, with 3% classification error, seems better than algorithm B, with 5%. But suppose that when you try out these algorithms, algorithm A, for some reason, is letting through a lot of pornographic images. If you ship algorithm A, users will see more cat images because of its lower 3% classification error, but it will also show them some pornographic images, which is totally unacceptable. In contrast, algorithm B, with its 5% classification error, misclassifies a few more images, but it doesn't let through pornographic images. So, from your company's point of view, as well as from the users' point of view, algorithm B is actually the better algorithm.
So, in this case, the evaluation metric plus the dev set prefer algorithm A, because it has lower error, but you and your users prefer algorithm B. When this happens, when your evaluation metric no longer correctly rank-orders your preferences between algorithms, that is a sign you should change your evaluation metric, or perhaps your dev set or test set. The problem with this evaluation metric, the plain misclassification error Error = (1/m_dev) * sum over i of I{y_pred^(i) ≠ y^(i)}, is that it treats pornographic and non-pornographic images equally, but you really want your classifier not to mislabel pornographic images as cat images. One way to change the metric is to add a weight term w^(i), equal to 1 if x^(i) is non-porn and 10 (or an even larger number) if x^(i) is porn, and the normalization constant becomes the sum of w^(i) over i, giving Error = (1/sum_i w^(i)) * sum over i of w^(i) * I{y_pred^(i) ≠ y^(i)}.
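A minimal sketch of this weighted error, assuming hypothetical lists `y_true` and `y_pred` of 0/1 cat labels and an `is_porn` list flagging which images are pornographic:

```python
def weighted_error(y_true, y_pred, is_porn, porn_weight=10.0):
    # w^(i) = 1 for non-porn examples, 10 (or larger) for porn examples
    weights = [porn_weight if p else 1.0 for p in is_porn]
    # Each mistake is counted with its weight, so mislabeling a porn image
    # as a cat costs far more than an ordinary misclassification.
    mistakes = [w * (yp != yt) for w, yt, yp in zip(weights, y_true, y_pred)]
    # Normalize by the sum of the weights instead of by m_dev
    return sum(mistakes) / sum(weights)
```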
So far we've talked about how to define an evaluation metric that evaluates classifiers and helps us better rank them when they perform at varying levels in terms of screening out pornographic images. This is actually another example of orthogonalization, where you take a machine learning problem and break it into distinct steps: one knob, or step, is to figure out how to define a metric that captures what you want to do, and a separate knob is to worry about how to actually do well on that metric.
Another example: algorithm A does better than algorithm B on your metric, but when you deploy the algorithm as a product you find that algorithm B is actually performing better. It turns out you have been training and evaluating on very nicely framed, high-quality images, but when you deploy the product in a mobile app, users upload all sorts of pictures that are less well framed and much blurrier, and on those images algorithm B actually does better. This is another example of your metric and dev/test set failing to reflect what you actually care about.
Chen Yang