Distance-based Keypoint Identification

In this article, we introduce a deep learning-based approach for automatically selecting the nearest or farthest point in a given image. This is not a difficult task for humans: given a photo, we can easily estimate the relative distance of objects using stereo vision. Furthermore, we are adept at analyzing visual context. Intuition and common sense also enable us to understand complex imagery with ease. For example, when we look at the image below, we immediately know that the black lamppost is close to the camera, whereas the buildings are farther away.

Figure 1: image from COCO dataset (https://cocodataset.org/#explore?id=246625)

These skills are much more challenging for a computer to learn. Therefore, to simplify the task of finding the nearest or farthest point, we break down the problem into smaller, more feasible steps. We first utilize a depth estimation model to obtain the local depth information of a given image. Next, we apply a semantic segmentation model to focus on the relevant image region. Finally, we select the nearest or the farthest point based on the depth information inside the region of interest.

How would a computer estimate the distance of objects? In the next section, we first describe how this can be achieved using traditional computer vision. Next, we present our deep-learning-based approach to solving this problem.

Depth Estimation: The traditional approach

Traditional object depth estimation requires two cameras. It uses a trigonometric method to estimate the distance between the target object and the cameras. Let's say we have two cameras placed a known distance apart from each other. If we also know the angles between the cameras and the target object, we can calculate the object distance.

We use the diagram below to better illustrate the case:

Figure 2: triangle formed by the two cameras, C1 and C2, and the target object O

C1 and C2 represent the two cameras, and x is the distance between them. Let y be the distance between the target object and the camera C1, and z be the distance between the target object and the camera C2. By using the Law of Sines, we can calculate y using the following equation:

y = x · sin(A2) / sin(A1 + A2)

Similarly, z can be calculated as follows:

z = x · sin(A1) / sin(A1 + A2)

As we already know the distance x, once we find out the angles A1 and A2, we can then calculate the distances y and z. To get the angle A1, we first have to study the camera parameters. In our case, the camera has a 3840 x 2160 resolution, and it covers a 30-degree horizontal view angle. If the object is at the center of the image captured by C1, it is exactly in front of camera C1, and therefore the angle A3 is 0°. If the object is at the rightmost boundary of C1’s captured image, the angle A3 is 15° (30°/2). We can use the following equation to calculate A3, i.e. the angle between the camera and the object.

A3 = (θC1 / 2) × l / (p / 2) = θC1 × l / p

where l is the horizontal pixel distance between the target object and the image center point, θC1 is the maximum horizontal view angle of the camera (in our case, it is 30°), and p is the image width in pixels (in our case, it is 3840). Lastly, we can calculate A1, which equals 90° − A3.

Similarly, we can use this method to calculate A2. Finally, we can use the formulas mentioned above to find y and z.
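
To make the geometry concrete, here is a minimal Python sketch of the calculation above. The helper names and the example numbers are our own; the per-camera angle uses the simple linear pixel-to-angle mapping described above.

```python
import math

def pixel_to_angle(pixel_offset, image_width_px=3840, view_angle_deg=30.0):
    """Off-axis angle A3 of the object in one camera's image, using the
    linear pixel-to-angle mapping described above."""
    return view_angle_deg * abs(pixel_offset) / image_width_px

def triangulate(baseline, a1_deg, a2_deg):
    """Distances y (object to C1) and z (object to C2) via the Law of Sines.
    a1_deg and a2_deg are the interior triangle angles at C1 and C2."""
    a1, a2 = math.radians(a1_deg), math.radians(a2_deg)
    y = baseline * math.sin(a2) / math.sin(a1 + a2)
    z = baseline * math.sin(a1) / math.sin(a1 + a2)
    return y, z

# Example: cameras 0.5 m apart; the object appears 256 px right of center in
# C1's image and 256 px left of center in C2's image.
a1 = 90.0 - pixel_to_angle(256)   # interior angle at C1
a2 = 90.0 - pixel_to_angle(-256)  # interior angle at C2
y, z = triangulate(0.5, a1, a2)
print(f"y = {y:.2f} m, z = {z:.2f} m")
```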

Despite the effectiveness of this approach, it has a few key requirements that can be challenging to meet:

1. We need two cameras.

2. We need to know the exact distance between the two cameras.

3. We need to keep the target object stationary. Alternatively, for a moving target, we need to capture images from two cameras at the same time.

4. We need to locate the target object in both captured images.

If these requirements cannot be met, this method is either not viable or the estimated distance may have a large margin of error.


An alternative is to use deep neural networks, which would allow us to circumvent the restrictions described above. Single-camera depth estimation is a well-known topic in deep learning-based computer vision research. Instead of finding the absolute distance, deep neural networks can estimate the relative depth information given an image. Because the distance is relative, we do not need to rely on information such as camera position or focal length. Since we are trying to find the nearest or farthest point within a single image, we do not need to obtain the absolute distance. Therefore, this approach is perfect for our application.

Depth estimation using deep learning

There are many existing deep-learning models for single-camera depth estimation. When working with limited time and data, pre-trained models like these can be highly effective as a starting point. After some experimentation, we found that the AdaBins model performs well in both indoor and outdoor environments. Therefore, we decided to use AdaBins in the first stage of our pipeline. The model outputs a depth map, which estimates objects’ distance relative to the camera.
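
As a rough illustration of this first stage, the sketch below assumes the InferenceHelper utility shipped with the authors’ public AdaBins repository; the exact interface and output shape may differ between versions, so treat it as a template rather than definitive reference code.

```python
from PIL import Image
from infer import InferenceHelper  # utility from the public AdaBins repository

# "nyu" weights target indoor scenes, "kitti" outdoor; pick to match your data
infer_helper = InferenceHelper(dataset="nyu")

image = Image.open("example.jpg").convert("RGB")
bin_centers, predicted_depth = infer_helper.predict_pil(image)

# Collapse the prediction to an (H, W) map of relative depths
depth_map = predicted_depth.squeeze()
```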

More about AdaBins

Machine learning tasks usually fall into two categories: regression or classification. Regression refers to predicting numerical values (e.g. predicting next month’s ticket sales), and classification refers to predicting categorical labels (e.g. predicting if a photo contains a cat or a dog). Depth estimation is commonly achieved via regression. However, it is also possible to frame this as a classification problem.

The motivation is to reduce the difficulty of the depth estimation task. To illustrate this idea, think of regression as an infinite-class classification problem. Intuitively, the larger the number of target classes, the harder the classification problem. Consider the simplest classification task: binary classification. During training, the binary classifier tries to find a hyperplane that separates the two classes into two subspaces while minimizing the loss function; the best hyperplane is the optimal boundary between the two classes. A 3-class problem can be formulated as two binary sub-problems. For example, with classes “A”, “B”, and “C”, one sub-classifier distinguishes “A” from “not A”, and another separates “not A” into “B” and “C”. Each sub-classifier has its own hyperplane to optimize. When there are many classes, the model must learn many complex decision boundaries to accommodate all of them. Replacing the regression task (effectively an infinite-class classification) with a fixed n-class classification therefore reduces the complexity of the decision boundaries the model has to learn.

The authors of AdaBins propose a dynamic method to compute a depth map based on adaptive bins. Instead of predicting depth values via regression, they classify each pixel into depth bins. Binning can cause discretization of depth values, leading to lower quality estimates. To address this problem, they also propose using linear combinations of the bin centers to calculate depth. During training, the AdaBins model learns two components: 1. the width of each bin and 2. the range attention map. The bin width tells us where each bin begins and ends, while the range attention map provides the weighting of each bin when calculating the depth.
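
To make the bin-based formulation concrete, here is a small NumPy sketch (our own illustration, not the authors’ code) of how a per-pixel depth value can be recovered as a probability-weighted combination of bin centers:

```python
import numpy as np

def depth_from_bins(bin_edges: np.ndarray, bin_probs: np.ndarray) -> np.ndarray:
    """Combine adaptive bins into a depth map.

    bin_edges: (n_bins + 1,) increasing bin boundaries (e.g. in meters).
    bin_probs: (H, W, n_bins) per-pixel probabilities over the bins,
               e.g. a softmax over the model's range-attention output.
    Returns an (H, W) depth map: a probability-weighted sum of bin centers.
    """
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])      # (n_bins,)
    return np.einsum("hwk,k->hw", bin_probs, centers)     # (H, W)

# Toy example: 4 adaptive bins between 0.5 m and 10 m on a 2 x 2 "image"
edges = np.array([0.5, 1.0, 2.5, 6.0, 10.0])
probs = np.random.dirichlet(np.ones(4), size=(2, 2))      # each pixel sums to 1
print(depth_from_bins(edges, probs))
```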

Figure 3: different binning strategies. AdaBins dynamically adjusts the bin widths for different input images; vertical lines indicate the boundaries between bins.

For more details, we refer readers to the original paper "AdaBins: Depth Estimation using Adaptive Bins"[1].

The AdaBins model generates a depth map for the input image. Figure 4(b) shows the output of the model. Objects farther away will appear brighter, while nearer objects will appear darker.

Figure 4: (a) the original image
Figure 4: (b) the depth map generated by the AdaBins model

As shown above, the generated depth map provides relative distance estimates that match our intuition to a certain degree. However, the result is not perfect. For example, the upper part of the tallest building is darker, which means the model incorrectly predicted that it is closer to the camera. Despite these imperfections, the model generally provides reasonable relative depth estimates, and they are suitable for our downstream task. By checking pixel intensities, we can easily select the nearest or farthest points, as estimated by the depth model.
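
Given the depth map as a 2-D array, this selection takes only a few lines of NumPy. The sketch below assumes larger values mean farther away (as in our depth maps) and uses a random array as a stand-in for a real model output:

```python
import numpy as np

# Stand-in for the (H, W) relative depth map produced by the depth model
depth_map = np.random.rand(480, 640)

nearest = np.unravel_index(np.argmin(depth_map), depth_map.shape)   # (row, col)
farthest = np.unravel_index(np.argmax(depth_map), depth_map.shape)  # (row, col)
print(f"nearest pixel at {nearest}, farthest pixel at {farthest}")
```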

Further refinement using regions of interest

In our experiments, we found that the farthest point selected by the depth model is usually in the sky. This is as expected since objects in the foreground are closer to the camera compared to the sky background. However, in some cases, we are not interested in the sky region. To address this, we added a filter to prevent the sky background from being selected.

Semantic segmentation[2] is very suitable for this use case. Segmentation groups together pixels that belong to the same object class. Thanks to the rapid development of autonomous driving, there are a lot of existing models for more fine-grained semantic segmentation tasks, such as segmenting humans, vehicles, trees, etc. In our case, we only define two object classes: foreground and background. The background is simply the sky region, while the foreground is the rest of the image. We use the pre-trained Cascade Segmentation Module proposed in the paper “Scene Parsing through ADE20K Dataset”[3]. As shown in Figure 5, the output of our segmentation model is a binary map, which indicates whether each pixel is foreground or background.

Figure 5: the binary map generated by the semantic segmentation model.
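
In code, collapsing a full scene-parsing output into this binary foreground map can be as simple as the sketch below. It assumes the segmentation model emits per-pixel ADE20K class indices; the sky index shown is an assumption, so verify it against the label list of your checkpoint.

```python
import numpy as np

# Assumed 0-based index of "sky" in the ADE20K-150 label list; verify against
# the label file that ships with your segmentation checkpoint.
SKY_CLASS_ID = 2

def foreground_mask(class_map: np.ndarray) -> np.ndarray:
    """Binary map: True for foreground pixels, False for sky background."""
    return class_map != SKY_CLASS_ID

# Toy example: random (H, W) class predictions in place of real model output
class_map = np.random.randint(0, 150, size=(480, 640))
mask = foreground_mask(class_map)
```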

By combining this segmentation map with the depth map obtained previously, we can then select the nearest or farthest point using the following equations, where i and j correspond to the horizontal and vertical positions of any pixel:

nearest point = argmin of D(i, j) over all pixels (i, j) with S(i, j) = 1
farthest point = argmax of D(i, j) over all pixels (i, j) with S(i, j) = 1

where D is the depth map and S is the binary foreground map (S(i, j) = 1 for foreground pixels).
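
A sketch of this selection step, reusing the hypothetical depth_map and foreground mask arrays from the earlier snippets (larger depth values again mean farther away):

```python
import numpy as np

def select_keypoints(depth_map: np.ndarray, mask: np.ndarray):
    """Return the (row, col) of the nearest and farthest foreground pixels."""
    masked_near = np.where(mask, depth_map, np.inf)   # ignore sky when minimizing
    masked_far = np.where(mask, depth_map, -np.inf)   # ignore sky when maximizing
    nearest = np.unravel_index(np.argmin(masked_near), depth_map.shape)
    farthest = np.unravel_index(np.argmax(masked_far), depth_map.shape)
    return nearest, farthest

# depth_map and mask would come from the two models in the real pipeline
depth_map = np.random.rand(480, 640)
mask = np.random.rand(480, 640) > 0.2
nearest, farthest = select_keypoints(depth_map, mask)
```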

The image below shows the experimental result from our pipeline. The red dot is the nearest point selected by the algorithm, while the blue dot is the farthest point.

Figure 6: the blue dot and the red dot show the farthest and nearest point selected by our pipeline, respectively.

Recap

To conclude, we described our single-camera approach for finding the nearest or farthest point in an image. We use two deep neural networks to achieve this goal. The first model predicts the relative depth of objects in an image. The second model performs semantic segmentation and identifies the region of interest. By combining these two models, we can select the nearest or the farthest point inside our target region.

References

[1] Bhat, Shariq Farooq, Ibraheem Alhashim, and Peter Wonka. "Adabins: Depth estimation using adaptive bins." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[2] Thoma, Martin. "A survey of semantic segmentation." arXiv preprint arXiv:1602.06541 (2016).

[3] Zhou, Bolei, et al. "Scene parsing through ADE20k dataset." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.


Author: Sam Kwun-Ping Lai , Machine Learning Engineer at Bluvec Technologies Inc.
