Learning Multi-Modal Alignment for 3D and Image Inputs in Real-Time with 4D-Net

While it may not be immediately apparent, we all see the world in four dimensions (4D). When walking or driving down the street, for example, we see a stream of visual inputs, which are snapshots of the 3D world that, when combined in time, generate a 4D visual input. Various onboard sensing technologies, such as LiDAR and cameras, enable today's autonomous cars and robots to acquire much of this data.

LiDAR is a ubiquitous sensor that uses light pulses to accurately estimate the 3D coordinates of objects in a scene; however, its output is sparse and range-limited: the farther an object is from the sensor, the fewer points are returned. Distant objects may therefore receive only a handful of points, if any, and may not be detectable by LiDAR alone. At the same time, images from the onboard camera are a dense input that is extremely valuable for semantic understanding, such as object detection and segmentation. High-resolution cameras can detect objects at long range, but they are much less precise at estimating distance.
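
To make the range limitation concrete, here is a minimal, self-contained Python sketch (not taken from the 4D-Net work) that bins a synthetic LiDAR sweep into range rings and counts the returns per ring; the exact numbers are made up, but the fall-off pattern is the point.

```python
# Illustrative sketch on synthetic data (not from the 4D-Net work): count LiDAR
# returns per range ring to show how sparsity grows with distance.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic LiDAR sweep: (N, 3) points (x, y, z) in meters, in the sensor frame.
# Real sweeps are much denser near the sensor; a skewed radius mimics that.
num_points = 120_000
radius = rng.exponential(scale=20.0, size=num_points)      # most returns are close
azimuth = rng.uniform(0.0, 2.0 * np.pi, size=num_points)
points = np.stack(
    [radius * np.cos(azimuth),
     radius * np.sin(azimuth),
     rng.uniform(-2.0, 1.0, size=num_points)],
    axis=1,
)

# Bin points into 10 m range rings and count returns per ring.
distances = np.linalg.norm(points[:, :2], axis=1)
bin_edges = np.arange(0.0, 90.0, 10.0)                     # 0-80 m in 10 m steps
counts, _ = np.histogram(distances, bins=bin_edges)

for lo, hi, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{lo:4.0f}-{hi:3.0f} m: {int(n):6d} returns")
# Distant objects fall into the sparsely populated outer rings, which is why
# they may receive only a handful of LiDAR points, or none at all.
```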

Autonomous vehicles collect data from both LiDAR and onboard camera sensors. Each sensor reading is captured at regular time intervals, yielding an accurate depiction of the 4D world. However, very few research algorithms use the two in combination, especially when taken "in time," that is, as a temporally ordered sequence of data, because of two key challenges: 1) it is difficult to maintain computational efficiency when both sensing modalities are used at the same time, and 2) pairing the information from one sensor with the other adds complexity, because there is not always a direct correspondence between LiDAR points and onboard camera RGB image inputs.
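
The correspondence problem can be illustrated with standard pinhole-camera projection. The sketch below is a generic, assumed example rather than code from 4D-Net; the function name, parameters, and calibration matrices are all illustrative.

```python
# Minimal sketch using generic pinhole-camera math (assumed, not code from
# 4D-Net): project 3D LiDAR points into a 2D camera image to pair the modalities.
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K, image_hw):
    """Return pixel coordinates and a validity mask for LiDAR points in one image.

    points_lidar:      (N, 3) points in the LiDAR frame.
    T_cam_from_lidar:  (4, 4) extrinsic transform from the LiDAR to the camera frame.
    K:                 (3, 3) camera intrinsic matrix.
    image_hw:          (height, width) of the RGB image.
    """
    n = points_lidar.shape[0]
    homogeneous = np.concatenate([points_lidar, np.ones((n, 1))], axis=1)  # (N, 4)
    points_cam = (T_cam_from_lidar @ homogeneous.T).T[:, :3]               # (N, 3)

    in_front = points_cam[:, 2] > 0.1            # keep points in front of the camera
    uvw = (K @ points_cam.T).T                   # perspective projection
    depth = np.clip(uvw[:, 2:3], 1e-6, None)     # avoid dividing by ~0 for invalid points
    uv = uvw[:, :2] / depth                      # pixel coordinates (u, v)

    h, w = image_hw
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & in_image

# Even with perfect calibration, only a subset of points lands inside the image,
# most pixels receive no LiDAR point at all, and occlusions are not resolved;
# pairing the two sensors is therefore not a simple one-to-one lookup.
```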

4D-Net

In our scenario, we use 4D inputs to solve a very common visual understanding task: 3D object box detection. We investigate how to combine two sensing modalities that come from different domains and whose features do not naturally align; for example, sparse LiDAR inputs span 3D space, while dense camera images only provide 2D projections of a scene. Because the exact correspondence between their respective features is unknown, we attempt to learn the relationships between these two sensor inputs and their feature representations. As sketched below, we consider neural network representations in which each feature layer from one modality can be combined with candidate feature layers derived from the other sensor input.
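
The following PyTorch sketch shows one way such a learned combination could look: camera features are gathered at the projected point locations from several candidate image feature layers, and a learned, input-conditioned weighting decides how much each layer contributes before the result is fused with the point features. This is a rough analogue for illustration only; the module name, tensor shapes, and weighting scheme are assumptions, not the authors' implementation (which also reasons over both modalities in time).

```python
# Illustrative sketch (assumed design, not the 4D-Net authors' code): fuse
# per-point LiDAR features with camera features sampled from several candidate
# image feature layers, using a learned, input-conditioned weighting over layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnedLayerFusion(nn.Module):
    def __init__(self, point_dim, image_dim, num_layers):
        super().__init__()
        # One scalar score per candidate image feature layer, predicted from the
        # point feature itself, so the weighting can depend on the input.
        self.layer_scores = nn.Linear(point_dim, num_layers)
        self.out = nn.Linear(point_dim + image_dim, point_dim)

    def forward(self, point_feats, image_feats_per_layer, uv_norm):
        """point_feats:           (B, N, Dp) features for N LiDAR points.
        image_feats_per_layer:    list of (B, Di, H_l, W_l) camera feature maps.
        uv_norm:                  (B, N, 2) projected point locations in [-1, 1] coords.
        """
        # Sample each camera feature layer at the projected point locations.
        grid = uv_norm.unsqueeze(2)                                     # (B, N, 1, 2)
        sampled = []
        for feat_map in image_feats_per_layer:
            s = F.grid_sample(feat_map, grid, align_corners=False)      # (B, Di, N, 1)
            sampled.append(s.squeeze(-1).permute(0, 2, 1))              # (B, N, Di)
        sampled = torch.stack(sampled, dim=2)                           # (B, N, L, Di)

        # Learned soft selection over the candidate layers.
        weights = F.softmax(self.layer_scores(point_feats), dim=-1)     # (B, N, L)
        fused_image = (weights.unsqueeze(-1) * sampled).sum(dim=2)      # (B, N, Di)

        # Combine the image evidence with the original point features.
        return self.out(torch.cat([point_feats, fused_image], dim=-1))
```

In this sketch, image_feats_per_layer could hold feature maps from several stages of an image or video backbone, each with the same channel width so they can be weighted and summed per point.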

Visit Us: corp.infogen-labs.com

Business Enquiry: [email protected] / [email protected]


