Localization and Object Detection with Deep Learning and YOLO (Single shot detectors)
Shashank V Raghavan
Artificial Intelligence | Autonomous Systems | Resident Robot Geek | Quantum Computing | Product and Program Management
Localization and object detection are two of the core tasks in computer vision, as they are applied in many real-world applications such as autonomous vehicles and robotics.
The goal is to classify each object and localize it. The process splits into two cases: in the first, we know the number of objects (we will refer to this problem as classification + localization), and in the second we don't (object detection).
Classification + Localization
If we have only one object or we know the number of objects, it is actually trivial. We can use one convolutional neural network and train it not only to classify the image but also to output 4 coordinates for the bounding box. In that way we treat the localization as a simple regression problem.
For example, we can borrow a well-studied model such as ResNet or AlexNet, which consists of a bunch of convolutional, pooling and other layers, and repurpose the fully connected layer to produce the bounding box in addition to the category. It is so simple that it makes us question whether it will actually give good results. And it works pretty well in practice. Of course, you can get fancy with it and modify the architecture to serve specific problems or enhance its accuracy, but the main idea remains.
Be sure to note that in order to use this model, we need a training set with images annotated for both the class and the bounding box. And producing such annotations is not the most fun task.
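To make the idea concrete, here is a minimal PyTorch sketch (the class name, the choice of ResNet-18 and the head sizes are illustrative assumptions, not the exact architecture of any particular paper): a ResNet backbone whose fully connected layer is replaced by two heads, one for the class scores and one for the 4 box coordinates.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ClassifierWithBox(nn.Module):
    """ResNet backbone with two heads: class scores and 4 bounding-box coordinates."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet18(weights=None)
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()                 # keep only the pooled features
        self.backbone = backbone
        self.class_head = nn.Linear(in_features, num_classes)  # classification head
        self.box_head = nn.Linear(in_features, 4)               # (x, y, w, h) regression head

    def forward(self, x):
        features = self.backbone(x)
        return self.class_head(features), self.box_head(features)

model = ClassifierWithBox(num_classes=20)
scores, boxes = model(torch.randn(1, 3, 224, 224))
```

Training would then simply combine a classification loss (e.g. cross-entropy) on the scores with a regression loss (e.g. smooth L1) on the box coordinates.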
But what if we do not know the number of objects a priori? Then we need to get into the rabbit hole and talk about some hardcore stuff. Are you ready? Do you want to take a break first? Sure, I understand, but I warn you not to leave. This is where the fun begins.
Object Detection
There are some clever ideas to make the system agnostic to the number of objects and to reduce its computational cost. So, we do not know the exact number of objects in our image, and we want to classify all of them and draw a bounding box around each one. That means the number of coordinates the model should output is not constant. If the image has 2 objects, we need 8 coordinates. If it has 4 objects, we want 16. So how do we build such a model?
One key idea borrowed from traditional computer vision is region proposals. We generate a set of windows that are likely to contain an object using classic CV algorithms, like edge and shape detection, and we feed only these windows (or regions of interest) to the CNN.
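As a concrete example of this proposal step, OpenCV's contrib module exposes the selective search algorithm (the same method R-CNN uses); a rough sketch, assuming `opencv-contrib-python` is installed and using a placeholder image path:

```python
import cv2

image = cv2.imread("example.jpg")  # placeholder path, any BGR image works
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # the faster, slightly less exhaustive mode
rects = ss.process()               # (x, y, w, h) windows likely to contain objects
print(len(rects), "region proposals")
```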
R-CNN
Given an image with multiple objects, we generate some regions of interest using a proposal method (in R-CNN's case, selective search) and warp the regions to a fixed size. We forward each region through a convolutional neural network (such as AlexNet), whose features are fed to an SVM that makes a classification decision for each region and to a regressor that predicts a correction for each bounding box. This correction adjusts the proposed region, which may be roughly in the right position but not at the exact size and orientation.
Although the model produces good results, it suffers from one major issue: it is quite slow and computationally expensive. Imagine that in an average case we produce 2000 regions, which we need to store on disk, and we forward each one of them through the CNN for multiple passes until it is trained. To fix some of these problems, an improved model comes into play, called Fast R-CNN.
Fast RCNN
The idea is straightforward. Instead of passing all regions through the convolutional layers one by one, we pass the entire image once and produce a feature map. Then we take the region proposals as before (using some external method) and, in a sense, project them onto the feature map. Now we have the regions on the feature map instead of the original image, and we can forward them through some fully connected layers to output the classification decision and the bounding box correction.
Note that the projection of the region proposals is implemented using a special layer (the ROI pooling layer), which is essentially a type of max-pooling with a pool size that depends on the input, so that the output always has the same size.
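A quick way to see this fixed-size behaviour is torchvision's ready-made ROI pooling op; the feature-map size, the stride of 16 and the box coordinates below are made-up numbers for illustration:

```python
import torch
from torchvision.ops import roi_pool

# feature map produced by the backbone for one image: (batch, channels, H, W)
feature_map = torch.randn(1, 256, 32, 32)

# region proposals in (x1, y1, x2, y2) image coordinates, prefixed by the batch index
rois = torch.tensor([[0, 16.0, 16.0, 128.0, 96.0],
                     [0, 64.0, 32.0, 200.0, 180.0]])

# spatial_scale maps image coordinates onto the downsampled feature map;
# here we assume the backbone reduced the image by a factor of 16
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- same size regardless of region size
```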
Faster RCNN
We can take this a step further. Using the feature maps produced by the convolutional layers, we infer region proposals with a Region Proposal Network (RPN) rather than relying on an external system. Once we have those proposals, the remaining procedure is the same as in Fast R-CNN (forward to the ROI layer, classify with a softmax head and predict the bounding box correction). The tricky part is how to train the whole model, as the region proposal network and the detection head are separate tasks that need to be learned jointly; a sketch of the proposal head follows below.
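Here is a minimal sketch of what such a Region Proposal Network head can look like (the channel count and the number of anchors are assumptions for illustration): a small convolution over the shared feature map followed by two 1x1 convolutions, one scoring "object vs. background" per anchor and one regressing 4 box offsets per anchor.

```python
import torch
import torch.nn as nn

class RegionProposalHead(nn.Module):
    """Minimal RPN head sketch: per-anchor objectness score and box offsets."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)      # object vs background
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # per-anchor offsets

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.box_deltas(x)

head = RegionProposalHead()
scores, deltas = head(torch.randn(1, 256, 32, 32))
print(scores.shape, deltas.shape)  # (1, 9, 32, 32) and (1, 36, 32, 32)
```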
As the name suggests, Faster R-CNN turns out to be much faster than the previous models and is the variant preferred in many real-world applications.
Localization and object detection is a super active and interesting area of research, due to the pressing demand from real-world applications that require excellent performance in computer vision tasks (self-driving cars, robotics). Companies and universities come up with new ideas on how to improve the accuracy on a regular basis.
There is another class of models for localization and object detection, called single shot detectors, which have become very popular in recent years because they are even faster and generally require less computation. Sure, they are less accurate, but they are ideal for embedded systems and similar power-constrained applications.
YOLO - You only look once (Single shot detectors)
YOLO!!! So do we only live once? I sure do not know. What I know is that we only have to LOOK once. Wait what?
That’s right. If you want to detect and localize objects in an image, there is no need to go through the whole process of proposing regions of interest, classifying them and correcting their bounding boxes, which is exactly what models like R-CNN and Faster R-CNN do, as we saw.
Do we really need all that complexity and computation? Well, if we want top-notch accuracy we certainly do. Luckily there is another, simpler way to perform such a task: processing the image only once and outputting the predictions immediately. These types of models are called single shot detectors.
Single shot detectors
Instead of having a dedicated system to propose regions of interest, we use a set of predefined boxes in which to look for objects, and the image is forwarded through a stack of convolutional layers that predict class scores and bounding-box offsets. For each predefined box we predict a number of bounding boxes, each with its own confidence score, we detect one object centered in that box, and we output a set of probabilities over the possible classes. Once we have all that, we simply (and maybe naively) keep only the boxes with a high confidence score. And it works, with very impressive results actually. To illustrate the overall flow even better, let's use one of the most popular single shot detectors, called YOLO.
You only look once (YOLO)
There have been 3 versions of the model so far, with each new one improving on the previous in terms of both speed and accuracy. The number of predefined cells and the number of predicted bounding boxes per cell are defined based on the input size and the number of classes. In our case, we are going to use the actual numbers used to evaluate the model on the PASCAL VOC dataset.
First, we divide the image into a grid of 13x13, resulting in 169 cells in total.
For every one of the cells, the model predicts 5 bounding boxes (x, y, w, h), each with a confidence score; it detects one object regardless of the number of boxes; and it outputs 20 probabilities, one for each of the 20 classes.
In total, we have 169 * 5 = 845 bounding boxes, and the shape of the output tensor of the model is going to be (13, 13, 5*5 + 20) = (13, 13, 45). The whole essence of the YOLO models is to build this (13, 13, 45) tensor. To accomplish that, it uses a CNN backbone and 2 fully connected layers to perform the actual regression.
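A small sketch of how that tensor is laid out (the raw output below is random noise standing in for the network's real prediction):

```python
import torch

S, B, C = 13, 5, 20          # grid size, boxes per cell, classes (PASCAL VOC)
depth = B * 5 + C            # 5 values (x, y, w, h, confidence) per box + class probabilities

raw = torch.randn(1, S * S * depth)                # stand-in for the network's flat output
grid = raw.view(1, S, S, depth)                    # reshape into the (13, 13, 45) tensor
boxes = grid[..., :B * 5].reshape(1, S, S, B, 5)   # per-box (x, y, w, h, confidence)
class_probs = grid[..., B * 5:]                    # per-cell probabilities for the 20 classes
print(grid.shape, boxes.shape, class_probs.shape)  # (1,13,13,45) (1,13,13,5,5) (1,13,13,20)
```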
The final prediction is extracted after keeping only the bounding boxes with a high confidence score (higher than a threshold such as 0.3).
Because the model may output duplicate detections for the same object, we use a technique called non-maximum suppression to remove duplicates. In a simple implementation, we sort the predictions by confidence score and, iterating from the highest, keep a box only if it does not overlap an already-kept box of the same class beyond some IoU threshold.
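In practice one usually relies on a standard IoU-based implementation such as torchvision's; the boxes and scores below are made-up detections for a single class:

```python
import torch
from torchvision.ops import nms

# hypothetical detections for one class: boxes in (x1, y1, x2, y2) format
boxes = torch.tensor([[100., 100., 210., 210.],
                      [105., 105., 215., 215.],   # heavy overlap with the first box
                      [300., 300., 400., 400.]])
scores = torch.tensor([0.9, 0.75, 0.8])

# keep the highest-scoring box and drop any box overlapping it with IoU > 0.5
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the duplicate detection is suppressed
```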
As far as the actual model is concerned, the architecture is quite simple, as it consists of only convolutional and pooling layers, without any fancy tricks. We train the model using a multi-part loss function, which includes a classification loss, a localization loss and a confidence loss.
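Here is a hedged sketch of such a multi-part loss, simplified to one predicted box per grid cell (the weighting factors follow the commonly cited λ_coord = 5 and λ_noobj = 0.5 convention, but the shapes and everything else are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, pred_conf, pred_cls,
                   true_boxes, true_conf, true_cls,
                   lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of a YOLO-style multi-part loss, assuming one box per grid cell.
    pred_boxes/true_boxes: (N, S, S, 4), pred_conf/true_conf: (N, S, S),
    pred_cls/true_cls: (N, S, S, C)."""
    obj = true_conf > 0  # cells that actually contain an object

    # localization loss: only cells responsible for an object contribute
    loc = F.mse_loss(pred_boxes[obj], true_boxes[obj], reduction="sum")

    # confidence loss: object cells at full weight, empty cells down-weighted
    conf = (F.mse_loss(pred_conf[obj], true_conf[obj], reduction="sum")
            + lambda_noobj * F.mse_loss(pred_conf[~obj], true_conf[~obj], reduction="sum"))

    # classification loss on object cells only
    cls = F.mse_loss(pred_cls[obj], true_cls[obj], reduction="sum")

    return lambda_coord * loc + conf + cls

S, C = 13, 20
loss = detection_loss(torch.rand(2, S, S, 4), torch.rand(2, S, S), torch.rand(2, S, S, C),
                      torch.rand(2, S, S, 4), (torch.rand(2, S, S) > 0.9).float(),
                      torch.rand(2, S, S, C))
```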
The most recent versions of YOLO have introduced some special tricks to improve accuracy and reduce training and inference time. Some examples are batch normalization, anchor boxes, dimension clusters and others. The power of YOLO is not its spectacular accuracy or the very clever ideas behind it; it is its superb speed, which makes it ideal for embedded systems and low-power applications. That's why self-driving cars and surveillance cameras are its most common real-world use cases.
As deep learning continues to play along with computer vision (and it surely will), we can expect many more models to be tailored for low-power systems, even if they sometimes sacrifice accuracy. And don't forget the whole Internet of Things. This is where these models really shine.