Enhancement of Object Detection with Depth Estimation
Sensor Data Collected by Autonomous Vehicle, Source: https://waymo.com/open/data/

Today, the growing challenges of the 21st century are increasingly being met with faster, more efficient and, above all, automated processes. An important task in this context is object detection with deep neural networks, which is already used in many fields such as medical technology, document analysis and automated production. With steadily increasing computing power over the last few years, object detection has also found its way into the automotive industry, where it supports, for example, the detection of vehicles and other objects such as pedestrians. It is of utmost importance that object detection works consistently and precisely even under difficult conditions: different weather conditions, fast driving manoeuvres or optical illusions caused by reflections must not impair detection.

In this article I would like to present a method for improving object detection with Multi-Task Learning under specific difficult conditions. What does this mean in concrete terms? Multi-Task Learning generally describes training several tasks in a single training process. In our case, in addition to object detection, a depth estimation based on the camera data is trained, supervised with LiDAR data during training. This means that LiDAR data is only needed during training, not for prediction.
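
The training/inference asymmetry can be made concrete with a minimal sketch. Everything here is illustrative (the real model is the extended Faster R-CNN described below, not this toy backbone): the point is only that the training step consumes a LiDAR depth map while prediction takes the camera image alone.

```python
import numpy as np

def backbone_features(image):
    """Stand-in for a shared CNN backbone (purely illustrative)."""
    return image.mean(axis=-1, keepdims=True)  # (H, W, 3) -> (H, W, 1)

def train_step(image, det_target, lidar_depth_map):
    """Training consumes BOTH supervision signals and returns a joint loss."""
    feats = backbone_features(image)
    det_loss = float(np.abs(feats.mean() - det_target))        # placeholder detection loss
    depth_loss = float(np.abs(feats - lidar_depth_map).mean()) # placeholder depth loss
    return det_loss + depth_loss  # single joint objective for backpropagation

def predict(image):
    """Inference needs only the camera image -- no LiDAR sensor required."""
    return backbone_features(image)
```

Once trained, `predict` never touches LiDAR data, which is what makes the approach attractive for camera-only deployment.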

[Image: no alt text provided]

A bit of theory: the basis for the camera-based object detection is the classic Faster R-CNN architecture, which consists of three basic modules. In the first module, the backbone network, so-called features are extracted from the image data. In the second module, the Region Proposal Network, the search space for finding objects is narrowed down: potential object boxes are generated around anchor points. These candidate boxes are then classified in the last module, the Box Head. Finally, the losses are summed and backpropagation is initiated.
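
The anchor mechanism of the Region Proposal Network can be sketched as follows. This is a generic reconstruction of the classic Faster R-CNN anchor scheme, not code from this project; the scales, ratios and stride are the commonly used defaults, not values confirmed by the article.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate (x1, y1, x2, y2) anchor boxes for every feature-map cell.

    Each anchor point on the feature map spawns len(scales) * len(ratios)
    candidate boxes in image coordinates (9 per cell with the defaults).
    The ratio is interpreted as height/width at constant box area.
    """
    anchors = []
    for cy in range(feat_h):
        for cx in range(feat_w):
            # centre of this cell, projected back into image coordinates
            x, y = (cx + 0.5) * stride, (cy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s / np.sqrt(r), s * np.sqrt(r)  # w * h == s**2
                    anchors.append([x - w / 2, y - h / 2,
                                    x + w / 2, y + h / 2])
    return np.array(anchors)  # shape: (feat_h * feat_w * 9, 4)
```

The RPN then scores each of these candidates and keeps only the most promising ones for the Box Head.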

[Image: no alt text provided]

To carry out an additional depth estimation on the basis of this architecture, Faster R-CNN is extended by the green-colored areas shown above. The first important step is to create a dense depth map from the sparse LiDAR point cloud. In general, interpolation methods are well suited for this; however, they cause isolated problems in the peripheral areas of the image, which is why a specially developed interpolation algorithm is used here.
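
A minimal sketch of the sparse-to-dense step, assuming the LiDAR points have already been projected into pixel coordinates: simple nearest-neighbour interpolation stands in for the article's custom algorithm, and a binary validity mask records which pixels lie close enough to a real measurement. The function name and the `max_radius` parameter are hypothetical.

```python
import numpy as np

def lidar_to_depth_map(points_uv, depths, height, width, max_radius=2.0):
    """Rasterise projected LiDAR points into a dense depth map + validity mask.

    points_uv: (N, 2) pixel (x, y) coordinates of the projected point cloud
    depths:    (N,)   metric depth per point
    Pixels farther than `max_radius` from every point are marked invalid.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (H*W, 2)
    # squared distance from every pixel to every LiDAR point
    # (brute force -- fine for a small example, not for full frames)
    d2 = ((grid[:, None, :] - points_uv[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    depth_map = depths[nearest].reshape(height, width)
    mask = (np.sqrt(d2.min(axis=1)) <= max_radius).reshape(height, width)
    return depth_map, mask.astype(np.float32)
```

The mask is exactly what the LiDAR module below needs: it marks the pixels whose interpolated depth is trustworthy, so image borders without LiDAR coverage are excluded from the loss.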

With the help of this depth map, the LiDAR module can now be built. It takes the features created by the backbone and reduces their channels to one by means of convolutions. The resulting depth prediction is then multiplied element-wise by the binary interpolation mask, so that only pixels with valid interpolated values contribute to the loss. Finally, the loss for the depth estimation is added to the loss for the object detection. This is the crucial step in which the single-task optimisation problem becomes a multi-task optimisation problem.
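
The masked loss and the joint objective can be written down compactly. This is a sketch of the principle, assuming an L1 depth loss and an unweighted sum; the actual loss function and any task weighting used in the project are not specified in the article, so `w_depth` is a hypothetical knob.

```python
import numpy as np

def masked_depth_loss(pred_depth, target_depth, valid_mask):
    """L1 depth loss evaluated only where the interpolation mask is valid."""
    diff = np.abs(pred_depth - target_depth) * valid_mask
    return diff.sum() / max(valid_mask.sum(), 1.0)  # average over valid pixels

def multi_task_loss(det_loss, pred_depth, target_depth, valid_mask, w_depth=1.0):
    """Joint objective: detection loss plus the weighted, masked depth loss.

    Backpropagating this single scalar turns the single-task problem
    into a multi-task one, since gradients from both tasks reach the
    shared backbone.
    """
    return det_loss + w_depth * masked_depth_loss(pred_depth, target_depth, valid_mask)
```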

So what effect does Multi-Task Learning have on object detection? The answer, as so often, is: it depends. LiDAR has a rather limited range of approximately 75 m, so objects further than 75 m away cannot benefit from Multi-Task Learning. In addition, objects that are very close are usually easy to classify anyway. Nevertheless, there are situations in which Multi-Task Learning gives object detection an advantage.

[Image: no alt text provided]

These are sequences in which visibility is restricted and all objects are within LiDAR range. The picture above shows this clearly: while the plain Faster R-CNN architecture identifies very few objects, the Multi-Task Learning architecture benefits significantly from the depth estimation and is able to identify even very small objects, such as the person behind the vehicle at the right-hand edge of the image.

So what is the takeaway? Multi-Task Learning is by no means a guarantee that the individual tasks will improve. What is decisive is a close examination of the problem at hand. In this case, Multi-Task Learning does not provide a general improvement, but it does improve corner cases such as dark sequences with close objects.

More articles by Yannick Klose