Deep learning--CNN: localization in object detection (1/2)
Chen Yang

Deep learning has been successfully applied to computer vision, speech recognition, online advertising, logistics, and many other problems. There are a few things that are unique about applying deep learning to computer vision and about the current state of the field. You can think of most machine learning problems as falling somewhere on a spectrum between having relatively little data and having lots of data. For example, today we have a decent amount of data for speech recognition, at least relative to the complexity of the problem. Image recognition or image classification, on the other hand, is a complicated problem: you have to look at all those pixels and figure out what the image contains, so even though the available datasets are quite big, it feels like we still wish we had more data. And there are problems like object detection where we have even less data. In object detection, the algorithm looks at a picture and also puts bounding boxes around the objects, telling you where in the picture they are. Because labeling bounding boxes is more expensive than labeling whole images, we tend to have less data for object detection than for image recognition.

If you look across the broad spectrum of machine learning problems, you see that on average, when you have a lot of data, people tend to get away with simpler algorithms and less hand-engineering: there is less need to carefully design features for the problem, and instead you can let a large neural network, even with a simple architecture, learn whatever it needs to learn. In contrast, if you don't have much data, then on average you see people doing more hand-engineering, or, to be ungenerous, more hacks, because that is often the best way to get good performance. Computer vision is trying to learn really complex functions, and it often feels like we don't have enough data: even though datasets keep getting bigger, we often don't have as much data as we need. This is why the state of computer vision, historically and even today, has relied more on hand-engineering.

Object detection is one of the areas of computer vision that is exploding, and it works much better than it did just a couple of years ago. In order to build up to object detection, you first learn about object localization. You're already familiar with the image classification task, where an algorithm looks at a picture and might be responsible for saying this is a car. In the problem of classification with localization, not only does the algorithm have to label the image as, say, a car, it is also responsible for putting a bounding box, i.e., drawing a red rectangle, around the position of the car in the image. The term localization refers to figuring out where in the picture the detected car is.

Recall the image classification problem, in which you input an image into a ConvNet with multiple layers that produces a vector of features fed to, say, a softmax unit that outputs the predicted class. If you're building a self-driving car, your object categories might be pedestrian, car, motorcycle, and background (meaning none of the above), so the softmax has four possible outputs; this is a standard classification pipeline. What if you also want to localize the car in the image? To do that, you can change your neural network to have a few more output units that output a bounding box. I'm going to use the notational convention that the upper-left corner of the image is coordinate (0,0) and the lower-right corner is (1,1), so specifying the red rectangular bounding box requires specifying its midpoint (b_x, b_y) as well as its height b_h and width b_w.
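
A minimal sketch of the label layout this implies. The 8-element ordering and the particular numbers below are my own illustrative assumptions, not taken from the article: [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3], with the three class indicators for pedestrian, car, and motorcycle, and background encoded by p_c = 0.

```python
import numpy as np

# A car detected, centered slightly right of the image middle (assumed example values).
y_car = np.array([1.0,    # p_c: an object is present
                  0.55,   # b_x: midpoint x (0 = left edge, 1 = right edge)
                  0.70,   # b_y: midpoint y (0 = top, 1 = bottom)
                  0.30,   # b_h: box height as a fraction of the image height
                  0.40,   # b_w: box width as a fraction of the image width
                  0.0, 1.0, 0.0])   # one-hot class: [pedestrian, car, motorcycle]

# No object: only p_c matters, the remaining components are "don't care" (zeros here by convention).
y_background = np.zeros(8)
```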

The output ŷ is a vector; suppose there is at most one object in the image, where p_c = 1 means an object was detected and p_c = 0 means no object was found. The loss function L(ŷ, y) can be, say, a squared error. Notice that ŷ has eight components (p_c, the four bounding-box numbers b_x, b_y, b_h, b_w, and the three class indicators), so (ŷ - y)^2 is the sum of squared differences of the elements. That is, if y_1 = 1, the case where there is an object since y_1 equals p_c, the loss is the sum of squares over all the different elements. The other case is y_1 = 0, in which the loss is just (ŷ_1 - y_1)^2, because all the remaining components are "don't care": all you care about is how accurately the neural network outputs p_c in that case.
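
A minimal sketch of that masked squared-error loss, under the assumed 8-element layout above (this is an illustration of the idea, not the exact loss used in practice, where different components often get different loss types):

```python
import numpy as np

def localization_loss(y_hat, y):
    if y[0] == 1:                        # an object is present: penalize every component
        return np.sum((y_hat - y) ** 2)
    else:                                # background: other components are "don't care"
        return (y_hat[0] - y[0]) ** 2

# Example: the network is fairly sure there is a car but the box is slightly off.
y     = np.array([1.0, 0.55, 0.70, 0.30, 0.40, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.50, 0.72, 0.35, 0.38, 0.1, 0.8, 0.1])
print(localization_loss(y_hat, y))
```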

In more general cases, you can have a neural network output the x and y coordinates of important points in an image, sometimes called landmarks. Let's say you're building a face recognition application, and for some reason you want your algorithm to tell you where the corner of someone's eye is. That point has coordinate (l_x, l_y), and you can have the network's final layer output two more numbers, l_x and l_y. What if you want it to tell you key points along the eye and along the mouth, so you can extract the mouth shape and tell whether the person is smiling or frowning, and maybe a few key points along the edges of the nose and along the face that define the jawline? For the sake of argument, let's say you define 64 points or landmarks on the face. By selecting a number of landmarks and generating a labeled training set that contains all of them, you can then have the network tell you where all these key positions or landmarks on the face are. This is a basic building block for recognizing emotions from faces, and if you have played with Snapchat and similar entertainment apps, you know the AR (augmented reality) filters: a Snapchat filter can draw a crown on the face and add other special effects, and being able to detect these landmarks on the face is a key building block for such computer graphics effects.
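
A hedged sketch of what the landmark output might look like. The layout (one face/no-face flag followed by 64 coordinate pairs, 129 numbers in total) is my own assumption for illustration:

```python
import numpy as np

num_landmarks = 64
raw_output = np.random.rand(1 + 2 * num_landmarks)   # stand-in for the network's final-layer output

is_face   = raw_output[0]                             # assumed face / no-face flag
landmarks = raw_output[1:].reshape(num_landmarks, 2)  # rows of (l_x, l_y), each in [0, 1]
print(is_face, landmarks.shape)                       # -> scalar, (64, 2)
```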

Now let's move on to building an object detection algorithm; here we use a ConvNet to perform object detection with something called the sliding windows detection algorithm. Let's say you want to build a car detection algorithm. You first create a labeled training set of what I'm going to call closely cropped images, meaning that x is pretty much only the car: you take a picture and crop out anything that is not part of the car, so you end up with the car centered and filling pretty much the entire image. Given this labeled training set, you can then train a ConvNet that takes one of these closely cropped images as input, and the job of the ConvNet is to output y = 1 or 0, car or not a car. Once you have trained this ConvNet, you can use it for sliding windows detection.
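
A minimal sketch of such a car / not-car classifier in Keras. The 64x64 input size, the layer widths, and the training setup are all assumptions for illustration, not the architecture used in the article:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (5, 5), activation='relu', input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (5, 5), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(400, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),   # 1 = car, 0 = not a car
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(x_cropped, y_labels, epochs=...)   # x_cropped: closely cropped car / non-car images
```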

The way you do that is: given a test image, you start by picking a certain window size, and then you input a small rectangular region into the ConvNet you just trained. So take just a little red square, put it into the ConvNet, and have the ConvNet make a prediction; presumably for that little region the ConvNet will say no, that little red square does not contain a car.

In the sliding windows detection algorithm, you then process a second input, the red square shifted a little bit over, feed it into the ConvNet, and run the ConvNet again. Then you do that with a third region, and so on, and you keep going until you have slid the window across every position in the image. I'm using a pretty large stride in this example just to keep the illustration simple, but the idea is that you go through every region of this size, pass lots of little cropped images into the ConvNet, and have it classify each position as 0 or 1 at some stride.

Now, having done this once, sliding the window through the image, you repeat it with a larger window. You take a slightly larger region, resize it to whatever input size the ConvNet expects, feed it to the ConvNet, and have it output 0 or 1; then slide the window over again using some stride, and so on, running throughout your entire image until you reach the end. You might do this a third time with even larger windows, and so on. The hope is that if you do this, then as long as there is a car somewhere in the image, there will be some window position and size for which the ConvNet outputs 1.
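
A minimal sketch of this non-convolutional sliding windows loop. The helper arguments are assumptions: `classifier` stands for any trained car / not-car ConvNet that scores a fixed-size crop, `resize` for your image-resizing routine of choice, and the window sizes, stride, and threshold are illustrative defaults:

```python
import numpy as np

def sliding_windows(image, classifier, resize, input_size=64,
                    window_sizes=(64, 96, 128), stride=16, threshold=0.5):
    """Return (left, top, size) for every window the classifier scores as a car."""
    h, w = image.shape[:2]
    detections = []
    for win in window_sizes:                                    # repeat with larger and larger windows
        for top in range(0, h - win + 1, stride):
            for left in range(0, w - win + 1, stride):
                crop = image[top:top + win, left:left + win]
                crop = resize(crop, (input_size, input_size))   # match the ConvNet's input size
                score = classifier(crop[np.newaxis])            # one ConvNet pass per crop
                if float(score) > threshold:
                    detections.append((left, top, win))
    return detections
```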

There is a huge disadvantage to sliding windows detection, which is the computational cost: you're cropping out many different square regions in the image and running each of them independently through a ConvNet. If you use a very coarse stride, a big step size, that reduces the number of windows you need to pass through the ConvNet, but the coarser granularity may hurt performance; whereas if you use a very fine granularity, a very small stride, the huge number of little regions you pass through the ConvNet means a very high computational cost.

The sliding windows detector can, however, be implemented convolutionally, which is much more efficient. Let's say your object detection algorithm inputs a 14x14x3 image (this is quite small, just for illustrative purposes), and let's say it uses 5x5 filters, 16 of them, to map from 14x14x3 to 10x10x16; then does 2x2 max pooling to reduce that to 5x5x16; then has a fully connected layer with 400 units, then another fully connected layer, and finally outputs y using a softmax unit.

To make the change, I'm going to adjust this picture a little bit: instead, I'm going to view y as four numbers corresponding to the class probabilities of the four classes the softmax unit is classifying amongst; the four classes could be pedestrian, car, motorcycle, and background. So how can the fully connected layers be turned into convolutional layers? The network is the same as before for the first few layers. Now, one way of implementing the first fully connected layer is as 400 filters, each of size 5x5x16: if you take the 5x5x16 volume and convolve it with a 5x5x16 filter, the output is 1x1, and if you have 400 of these 5x5x16 filters, the output dimension is 1x1x400. So rather than viewing these 400 units as just a set of nodes, we view them as a 1x1x400 volume. Mathematically this is the same as a fully connected layer, because each of the 400 nodes has a filter of dimension 5x5x16, and so each of the 400 values is some arbitrary linear function of the 5x5x16 activations from the previous layer. To implement the next layer convolutionally, we use a 1x1 convolution: if you have 400 filters of size 1x1x400, the next layer is again 1x1x400, which corresponds to the second fully connected layer. Finally, we have another 1x1 filter followed by a softmax activation, so as to give a 1x1x4 volume to take the place of the four numbers the original ConvNet was outputting.

This shows how you can take these fully connected layers and implement them using convolutional layers, so that these sets of units are now represented as 1x1x400 and 1x1x4 volumes.
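
A small numpy check of that equivalence, under the shapes assumed above: a fully connected layer over a 5x5x16 volume is the same computation as 400 filters of size 5x5x16, because each filter covers the whole volume, so its "convolution" reduces to a single dot product.

```python
import numpy as np

activations = np.random.rand(5, 5, 16)      # output of the max-pool layer
filters = np.random.rand(400, 5, 5, 16)     # 400 filters, each 5x5x16

# Fully connected view: flatten and multiply by a 400 x (5*5*16) weight matrix.
fc_weights = filters.reshape(400, -1)
fc_output = fc_weights @ activations.reshape(-1)               # shape (400,)

# Convolutional view: each 5x5x16 filter produces one 1x1 output value.
conv_output = np.einsum('hwc,fhwc->f', activations, filters)   # shape (400,), i.e. 1x1x400

print(np.allclose(fc_output, conv_output))   # True: the two layers compute the same thing
```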

Armed with this conversion, let's see how you can build a convolutional implementation of sliding windows object detection. Let's say your sliding windows ConvNet takes 14x14x3 images as input and, as before, eventually outputs a 1x1x4 volume, the output of the softmax unit. To simplify the drawing, I only draw the front face of these volumes and drop the 3D component. Now let's say your test image is 16x16x3, so add a yellow strip to the border of the 14x14x3 image. With the original sliding windows algorithm, you would input the blue region into the ConvNet and run it once to generate a classification 0 or 1, then slide the window right by 2 pixels, input the green rectangle, run the whole ConvNet again, and get another label 0 or 1; then you would input the orange region and run it one more time to get another label, and finally do a fourth pass with the lower-right purple square.

So to run sliding windows on this 16x16x3 image, you run the ConvNet from above four times to get four labels, but it turns out that a lot of the computation done by these four passes is highly duplicated. What the convolutional implementation of sliding windows does is let these four passes share a lot of computation. Specifically, you take the ConvNet and run it with the same parameters, the same 16 filters of size 5x5x3, and you now get a 12x12x16 output volume; then do the max pool same as before, giving 6x6x16; run that through the same 400 filters of size 5x5x16 to get a 2x2x400 volume instead of 1x1x400; run it through 400 filters of size 1x1x400, which gives another 2x2x400 volume; and do that one more time, so you're left with a 2x2x4 output volume instead of 1x1x4. It turns out that the blue 1x1x4 subset of this output gives you the result of running the ConvNet on the upper-left 14x14x3 crop of the image.
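
A back-of-the-envelope sketch of where those shapes come from, using the architecture assumed above: run the same layers on a bigger image and every extra stride of input becomes an extra sliding-window position in the output grid, with all the intermediate work shared. The 28x28 row is an extra assumed example to show the pattern.

```python
def conv_out(size, filter_size, stride=1):
    return (size - filter_size) // stride + 1

for input_size in (14, 16, 28):
    s = conv_out(input_size, 5)       # 5x5 conv, 16 filters
    s = conv_out(s, 2, stride=2)      # 2x2 max pool
    s = conv_out(s, 5)                # first "FC" layer as 400 filters of 5x5
    s = conv_out(s, 1)                # 1x1 conv (second FC layer)
    s = conv_out(s, 1)                # 1x1 conv + softmax
    print(f"{input_size}x{input_size}x3 image -> {s}x{s}x4 output "
          f"({s * s} window positions)")
```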

The convolutional implementation of sliding windows is more computationally efficient, but it still has a problem: it doesn't output the most accurate bounding boxes. A good way to get more accurate bounding boxes is the YOLO algorithm; YOLO stands for "you only look once". Here's what you do. Let's say you have a 100x100 input image; you place a grid on the image, and for the purpose of illustration I'm using a 3x3 grid, although an actual implementation would use a finer one. The basic idea is to take the image classification and localization algorithm and apply it to each of the nine grid cells of the image. More concretely, for each grid cell you specify a label y, an 8-dimensional vector, the same as you saw previously. Let's start with the upper-left cell: there is no object there, so the label vector y for that cell has p_c = 0 and "don't care" for the rest of the elements. Then check the other cells of the first row and find there is no interesting object in them either, and then move to the next row. The YOLO algorithm takes the midpoint of each of the two objects and assigns each object to the grid cell containing its midpoint, so the midpoint of the left car is assigned to the left-most cell of the second row and the midpoint of the right car is assigned to the right-most cell of the second row. Even though the central grid cell contains some part of both cars, it is treated as having no interesting object. So for each of the nine grid cells you end up with an eight-dimensional vector, and because there are 3x3 grid cells, the total volume of the output is 3x3x8.
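
A hedged sketch of how such a 3x3x8 training target could be assembled: each labeled object, given here as a midpoint and size in whole-image coordinates plus a class index, is assigned to the single grid cell that contains its midpoint. The helper name, the input format, and the example numbers are my own; for simplicity the box numbers are stored in whole-image coordinates here, while the cell-relative convention the article uses is described, and sketched, further below.

```python
import numpy as np

def make_yolo_target(objects, grid=3, num_classes=3):
    """objects: list of dicts with keys 'bx', 'by', 'bh', 'bw' (all in [0,1]) and 'cls'."""
    y = np.zeros((grid, grid, 5 + num_classes))      # 3x3x8 for three classes
    for obj in objects:
        col = min(int(obj['bx'] * grid), grid - 1)   # which cell contains the midpoint
        row = min(int(obj['by'] * grid), grid - 1)
        y[row, col, 0] = 1.0                         # p_c = 1: this cell owns an object
        y[row, col, 1:5] = [obj['bx'], obj['by'], obj['bh'], obj['bw']]
        y[row, col, 5 + obj['cls']] = 1.0            # one-hot class (1 = car, assumed ordering)
    return y

# Two cars in the middle row: one left of center, one right of center.
target = make_yolo_target([
    {'bx': 0.20, 'by': 0.55, 'bh': 0.25, 'bw': 0.30, 'cls': 1},
    {'bx': 0.80, 'by': 0.50, 'bh': 0.30, 'bw': 0.25, 'cls': 1},
])
print(target.shape)   # (3, 3, 8)
```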

So now, to train your neural network, the input is a 100x100x3 image, and you have a ConvNet with convolutional layers, max pool layers, and so on, which eventually maps to the 3x3x8 output volume. The advantage of this algorithm is that the neural network outputs precise bounding boxes. At test time you feed in an input image x, run forward propagation until you get the output y, and then for each of the nine outputs you can read off whether there is an object associated with that position (one or zero), and if there is, the bounding box for the object in that grid cell. As long as you don't have more than one object in each grid cell, this algorithm should work well; and in practice, if you use a much finer grid, you reduce the chance that multiple objects are assigned to the same grid cell.
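
A minimal sketch of that test-time read-off: scan the 3x3x8 output volume, keep cells whose p_c passes a threshold, and read the box and class from that cell. The threshold, class ordering, and the hand-made example prediction are assumptions for illustration.

```python
import numpy as np

def read_detections(y_pred, threshold=0.5,
                    class_names=('pedestrian', 'car', 'motorcycle')):
    detections = []
    grid_h, grid_w, _ = y_pred.shape
    for row in range(grid_h):
        for col in range(grid_w):
            cell = y_pred[row, col]
            if cell[0] > threshold:                      # p_c: is there an object in this cell?
                bx, by, bh, bw = cell[1:5]
                cls = class_names[int(np.argmax(cell[5:]))]
                detections.append((row, col, cls, (bx, by, bh, bw)))
    return detections

y_pred = np.zeros((3, 3, 8))
y_pred[1, 2] = [0.9, 0.4, 0.3, 0.9, 0.5, 0.05, 0.9, 0.05]   # a car in the middle-right cell
print(read_detections(y_pred))
```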

Notice two things. First, this is a lot like the image classification and localization algorithm in that it outputs bounding boxes explicitly, which allows the neural network to output bounding boxes of any aspect ratio as well as precise coordinates that aren't dictated by the stride size of a sliding windows classifier. Second, this is a convolutional implementation: you're not running the algorithm nine times, once per cell of the 3x3 grid; instead it is one single convolutional pass, one ConvNet with a lot of shared computation across all 3x3 grid cells. In fact, one nice thing about the YOLO algorithm, which accounts for its popularity, is that because it is a convolutional implementation it runs very fast, so it works even for real-time object detection.

There is one more detail, which is how you encode the bounding boxes. Given the two cars, remember we have the 3x3 grid; let's take the example of the car on the right, in the right-most grid cell of the second row. There is an object (a car) there, so the target label y has p_c = 1. In the YOLO algorithm, coordinates are given relative to that grid cell: I will take the convention that the upper-left point of the cell is (0,0) and the lower-right point is (1,1). So to specify the position of the midpoint, i.e., the orange dot, (b_x, b_y) would be, say, (0.4, 0.3). The width of the bounding box is specified as a fraction of the overall width of the cell, so the width of the red box might be 50% of the width of the cell, and its height 90% of the height of the cell. In other words, b_x, b_y, b_h, and b_w are specified relative to the grid cell. b_x and b_y have to be between 0 and 1, because by definition the midpoint lies within the bounds of the grid cell it was assigned to; but b_h and b_w can be greater than 1, in particular if a car's bounding box is bigger than a cell and spans several cells, in which case its height and width can exceed 1.
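
A sketch of that cell-relative encoding, under the conventions from the paragraph above: (b_x, b_y) is the midpoint inside the owning cell (upper-left = (0,0), lower-right = (1,1)), and b_h, b_w are fractions of the cell's height and width, so they can exceed 1 for boxes larger than the cell. The function name and the starting whole-image numbers are assumptions, chosen so the output roughly matches the example values in the text.

```python
def encode_box(bx, by, bh, bw, grid=3):
    """Convert a whole-image box (all values in [0,1]) to (row, col) plus cell-relative coords."""
    col = min(int(bx * grid), grid - 1)
    row = min(int(by * grid), grid - 1)
    cell_bx = bx * grid - col      # midpoint position inside the cell, in [0, 1)
    cell_by = by * grid - row
    cell_bh = bh * grid            # sizes rescaled to cell units, may be > 1
    cell_bw = bw * grid
    return row, col, (cell_bx, cell_by, cell_bh, cell_bw)

# The right-hand car: midpoint lands in the right-most cell of the second row,
# giving roughly (b_x, b_y) = (0.4, 0.3), width about 0.5 and height about 0.9 of the cell.
print(encode_box(bx=0.80, by=0.43, bh=0.30, bw=0.17))
```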
