Structure from Motion
Do you know how humans perceive depth information from their environment? If your answer is by comparing the images from the left and right eye, try touching things with one of your eyes closed. Surprised that you can still perceive three-dimensional space? Read on if you want to know how a driverless car navigates an urban landscape.
I started my cloud journey with Intelligent Product Design, a collaboration application for both internal and external (suppliers, design partners, customers, etc.) organizations. The application also supported collaboration on 3D designs. Later on, I moved to another application that allowed multi-physics simulation on these 3D designs, completing the loop from design to operate. Both applications required 3D content (typically CAD / STEP / BIM) built with highly specialized tools.
Creating 3D models is an involved task requiring AutoCAD / Maya / Blender skills. I wanted to capture this spatial information as easily as clicking a photograph. I was primarily motivated by three use cases unrelated to each other. But hey, isn't creativity the ability to relate previously unrelated things? My favorite application areas were -
Result
As stated before, we want to build a bare minimum 3D scanning application. The application should run on widely used smartphones.
There is heavy-duty processing involved, using Python libraries. Although it is possible to package all the libraries and the Python code using Kivy, we build a PWA to ease application distribution. Why a PWA? That's for another coffee corner discussion.
3D scene reconstruction happens on the server side. You can deploy the server side on a local machine, in any container, or even in a Jupyter notebook. The server process connects to the client using WebRTC, a peer-to-peer communication protocol for data and media streams. With this approach, one can deploy the client application even on edge devices.
The application has two modes: a calibration mode to calculate camera intrinsic parameters and distortion coefficients, and a scanning mode for 3D scene reconstruction.
Architecture
The client application is built as a PWA (web distribution) used on mobile devices, and as an Arduino sketch to be deployed on Uno, Nano, etc. (it can be adapted for other development boards as well). The camera module used with Arduino is the OV7670.
The client connects to the server process using the WebRTC protocol. WebRTC allows the captured video to be streamed to the server in real time. The data channel is used to send commands and exchange other context data such as camera intrinsic parameters, geo coordinates, etc. Firebase is used as a presence database for synchronizing ICE candidates.
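As an illustration of the server-side end of this connection, here is a minimal sketch assuming the aiortc Python library implements the WebRTC peer and that the offer/answer and ICE exchange happens through Firebase (the signaling code itself is omitted):

import asyncio
from aiortc import RTCPeerConnection, RTCSessionDescription

captured_images = []

async def handle_offer(offer_sdp):
    pc = RTCPeerConnection()

    @pc.on("track")
    def on_track(track):
        # pull frames from the client's video stream and keep them as numpy arrays
        async def consume():
            while True:
                frame = await track.recv()
                captured_images.append(frame.to_ndarray(format="bgr24"))
        asyncio.ensure_future(consume())

    @pc.on("datachannel")
    def on_datachannel(channel):
        # commands and context data (camera intrinsics, geo coordinates) arrive here
        channel.on("message", lambda message: print("received:", message))

    # offer/answer exchange; SDP and ICE candidates are synchronized through Firebase
    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type="offer"))
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return pc.localDescription.sdp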
The Structure from Motion pipeline, running on Python, can be deployed on a local machine, Docker, Kubernetes, or a Jupyter runtime. Specific aspects of the pipeline are described in the sections below. The Structure from Motion problem is described as the process of establishing the spatial orientation of target objects from the movement of one or more observers. Structure refers to the coordinates, shape and relative positions of the target objects, whereas Motion refers to the relative translation and rotation of the observer's camera frustum.
Camera calibration
At its core, SFM works by observing key points across a set of images captured with calibrated cameras, i.e. the focal length and the distortion coefficients along both axes are known in advance. When this is not the case, the camera(s) must be calibrated.
We must capture images of a well-defined pattern (e.g. a chessboard) and find specific points whose relative positions we already know (e.g. the square corners of the chessboard). Since we know the coordinates of these points in real-world space and their coordinates in the image, we can solve for the distortion coefficients. For accuracy, we should have at least 10 images.
import cv2 as cv
import numpy as np
...
# prepare the real-world coordinates of the chessboard corners;
# chessboard_size is the pattern's inner corner count, e.g. (9, 6)
objp = np.zeros((chessboard_size[0] * chessboard_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:chessboard_size[0], 0:chessboard_size[1]].T.reshape(-1, 2)

objpoints = []  # points in real world space
imgpoints = []  # points in image plane

# captured_images from WebRTC track
for image in captured_images:
    # convert to gray scale
    gray = cv.cvtColor(image, cv.COLOR_RGB2GRAY)
    # find chessboard corners
    found, corners = cv.findChessboardCornersSB(gray, chessboard_size)
    if found:
        # define criteria for subpixel accuracy
        criteria = (cv.TERM_CRITERIA_EPS +
                    cv.TERM_CRITERIA_MAX_ITER, 30, 0.001)
        # refine corner locations (to subpixel accuracy) based on criteria
        corners = cv.cornerSubPix(gray, corners, (5, 5), (-1, -1), criteria)
        # collect objpoints (points in real world space) and imgpoints
        objpoints.append(objp)
        imgpoints.append(corners)

# calibrate camera once corners are collected from all views
ret, K, dist, rvecs, tvecs = cv.calibrateCamera(objpoints, imgpoints,
                                                gray.shape[::-1], None, None)
To undistort an image, we can use the getOptimalNewCameraMatrix and undistort methods from OpenCV.
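A minimal sketch of this undistortion step, assuming image is one of the captured frames and K, dist come from the calibration above:

import cv2 as cv
...
# compute a refined camera matrix and the valid region of interest
h, w = image.shape[:2]
new_K, roi = cv.getOptimalNewCameraMatrix(K, dist, (w, h), 1, (w, h))
# remove lens distortion and crop to the valid region
undistorted = cv.undistort(image, K, dist, None, new_K)
x, y, rw, rh = roi
undistorted = undistorted[y:y + rh, x:x + rw]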
Scene reconstruction
As described earlier, SFM is the process of estimating the 3D structure of a scene from a set of 2D images. The SFM problem can be solved in many different ways. How you approach the problem depends on whether you use a single camera or multiple cameras, and whether the images are ordered. In our setup we are using a single camera that moves in a stationary scene. That means the images have the same distortion and are ordered. Our solution approach would have been computationally less demanding if we were in a stereo setup (two cameras with known translation and no rotation between them).
The process works on the principle of triangulation. Triangulation is the process of determining the location of a point by forming triangles to the point from known points. Easy as it sounds, we need to observe the same points (point correspondences) across two images. We can find corresponding points either by matching features or by tracking points from image 1 to image 2. We use these points to recover the relative pose of the camera capturing the second image with respect to the camera position for the first image.
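A minimal sketch of this pose recovery step, assuming pts1 and pts2 hold the matched key point coordinates (computed as in the code block further below) and K is the intrinsic matrix from calibration:

import cv2 as cv
...
# estimate the essential matrix from the point correspondences and the intrinsics
E, inlier_mask = cv.findEssentialMat(pts1, pts2, K, method=cv.RANSAC,
                                     prob=0.999, threshold=1.0)
# recover rotation R and translation t of the second camera w.r.t. the first
retval, R, t, inlier_mask = cv.recoverPose(E, pts1, pts2, K, mask=inlier_mask)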
Theoretically, we can now apply triangulation to these key points and obtain a sparse point cloud. However, in reality this sparse point cloud wouldn't allow us to estimate the geometry of the scene with accuracy. So the approach is to apply triangulation to every pixel of the captured images. Comparing every pixel of the first image with every pixel of the second image is computationally a very intensive operation. If only we had started with a stereo setup, we would only have needed to compare points horizontally along epipolar lines. Let's convert our monocular vision problem into a stereo vision problem.
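Before moving to the dense approach, here is a sketch of what the sparse triangulation mentioned above would look like, continuing from the R and t recovered in the previous sketch:

# projection matrices for the first (reference) and second camera positions
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
# triangulate the matched points (passed as 2xN arrays) into homogeneous coordinates
pts4d = cv.triangulatePoints(P1, P2, pts1.T, pts2.T)
# divide out the homogeneous coordinate to obtain the sparse 3D point cloud
sparse_cloud = (pts4d[:3] / pts4d[3]).T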
The code below describes the key steps in the process -
import cv2 as cv
import numpy as np
...
# assuming K is the camera intrinsic matrix obtained from the previous section
...
# compute image correspondences using feature matching
sift = cv.SIFT_create()
bf = cv.BFMatcher()
# img1 and img2 captured from the WebRTC track (grayscale)
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
# collect matched key points using the ratio test
matches = bf.knnMatch(des1, des2, k=2)
good = []
pts1 = []
pts2 = []
for m, n in matches:
    if m.distance < 0.7 * n.distance:
        good.append([m])
        pts1.append(kp1[m.queryIdx].pt)
        pts2.append(kp2[m.trainIdx].pt)
pts1 = np.array(pts1)
pts2 = np.array(pts2)
# compute fundamental matrix and keep only the RANSAC inliers
F, mask = cv.findFundamentalMat(pts1, pts2, cv.FM_RANSAC)
pts1 = pts1[mask.ravel() == 1]
pts2 = pts2[mask.ravel() == 1]
# convert monocular to stereo by rectifying both images
h1, w1 = img1.shape
h2, w2 = img2.shape
ret, H1, H2 = cv.stereoRectifyUncalibrated(pts1, pts2, F, imgSize=(w1, h1))
img1_rectified = cv.warpPerspective(img1, H1, (w1, h1))
img2_rectified = cv.warpPerspective(img2, H2, (w2, h2))
# compute disparity map with semi-global block matching
min_disp = 0          # minimum possible disparity value
num_disp = 160        # disparity range, must be divisible by 16
window_size = 3       # size of the matching window
stereo = cv.StereoSGBM_create(minDisparity=min_disp,
    numDisparities=num_disp,
    blockSize=16,
    P1=8 * 3 * window_size ** 2,
    P2=32 * 3 * window_size ** 2,
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32
)
disparity_SGBM = stereo.compute(img1_rectified, img2_rectified)
Output disparity map below (brighter pixels are nearer to the camera) -
Obtained from the below pair of images -
Finally, each valid pixel from the disparity map can be added to the dense point cloud using the reprojectImageTo3D method from OpenCV.
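A minimal sketch of that last step, continuing from the code above. In a calibrated stereo setup the 4x4 reprojection matrix Q would come from cv.stereoRectify; here it is approximated from the image size, the focal length in K, and an assumed baseline (camera separation), all of which are illustrative assumptions:

# build an approximate 4x4 reprojection matrix Q
h, w = img1_rectified.shape[:2]
f = K[0, 0]                 # focal length in pixels
baseline = 0.1              # assumed camera separation in scene units
Q = np.float32([[1, 0, 0, -w / 2],
                [0, 1, 0, -h / 2],
                [0, 0, 0,  f],
                [0, 0, -1 / baseline, 0]])
# SGBM produces fixed-point disparities scaled by 16
disparity = disparity_SGBM.astype(np.float32) / 16.0
points_3d = cv.reprojectImageTo3D(disparity, Q)
# keep only pixels with a valid disparity value
valid = disparity > disparity.min()
dense_cloud = points_3d[valid]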
Summary
The Structure from Motion technique represents a non-invasive, highly flexible and low-cost methodology for reconstructing 3D structures where access to other ranging methods is restricted. Its applications range from geosciences, building information management, engineering and construction, logistics, historical preservation, and gaming to manufacturing and medical diagnosis.