Exploring multi-camera multi-object tracking: Techniques, challenges, and real-world applications - Part 1
Introduction
Object tracking is a fundamental task in computer vision that involves identifying and following the movement of objects within a video stream or sequence of images. It has a wide range of applications, including surveillance, robotics, autonomous vehicles, and augmented reality. Object tracking algorithms use various techniques, such as feature extraction, motion estimation, and machine learning, to identify and track objects over time. In this technical blog post, we will explore the basics of object tracking, popular algorithms used in the field, and some of the challenges and considerations when implementing object tracking systems.
Multi-camera multi-object tracking (MCMOT) is an advanced object tracking technique that leverages multiple cameras to track multiple objects simultaneously in complex environments. Unlike single-camera object tracking, MCMOT requires the coordination of data from multiple sources to accurately track objects as they move across camera views. This technique is particularly useful in applications such as crowd monitoring, traffic management, and security surveillance. MCMOT systems typically use computer vision algorithms and machine learning techniques to process data from multiple cameras and generate a cohesive tracking output. We will dive into the details of MCMOT, including the underlying technologies, implementation considerations, and real-world use cases. By the end of this post, you'll understand how an MCMOT algorithm works and how to get started on your own by building an object tracking pipeline that works across multiple cameras.
What is Object Tracking?
Object tracking is a computer vision technique that involves locating and following an object in a video or image sequence. It is an essential task in many applications where we need to not only detect an object but also understand its motion over time. An example might be understanding how shoppers move inside a store, or understanding the driving patterns of a particular vehicle. Going even further, good tracking allows us to forecast where a given object will be in the future, making it possible, for instance, for self-driving cars to avoid collisions.
One famous algorithm is Simple Online and Realtime Tracking, also known as SORT. To track the bounding boxes, SORT uses a Kalman Filter for each bounding box. The SORT algorithm performs the data association between two bounding boxes based on the Intersection over Union (IoU) metric. In Figure 1, we see a simple diagram of a pipeline that uses SORT as the bounding box tracking algorithm.
In the above figure, we illustrate the pipeline of the SORT algorithm. Once we receive new detections from our object detection model, we check the IoU (Intersection over Union) metric between the current tracks and the new bounding boxes.
As an example, in the image below we have the overlay between two sequential images.
You can observe that, when performing detection on sequential frames, the detections of the same object overlap across time. When we compute the IoU metric, initially we don't know which detected bounding boxes belong to which current trackers. The idea, then, is to compute the IoU between every detected bounding box and every current track. This gives us a sparse matrix with values that go from 0 to 1, and this sparsity information is useful for the next step of the algorithm.
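To make this step concrete, here is a minimal sketch of how such an IoU matrix could be built with NumPy. The iou and iou_matrix helpers are written for this post and are not taken from the SORT repository:

import numpy as np

def iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; returns the intersection-over-union in [0, 1].
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def iou_matrix(detections, tracks):
    # detections: (N, 4) array, tracks: (M, 4) array -> (N, M) matrix of IoU values.
    matrix = np.zeros((len(detections), len(tracks)), dtype=np.float32)
    for i, det in enumerate(detections):
        for j, trk in enumerate(tracks):
            matrix[i, j] = iou(det, trk)
    return matrix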
With the IoU matrix between all the detected boxes and the current tracks, we use the Hungarian Algorithm to assign the detected bounding boxes to the current tracks. The Hungarian Algorithm is a combinatorial optimization algorithm that solves the assignment problem. It is also known as Linear Sum Assignment.
After the assignment, if we have a new bounding box that doesn't overlap any current track, we start a new track for that object. If a bounding box is matched to a current track, we perform the update step of that track's Kalman Filter. Finally, a track that received no assignment is not updated. On every new iteration of the algorithm, we check how many consecutive times each current track has gone without an update. This limit, in the SORT implementation that we're using, is the max_age parameter. If we set, for instance, max_age = 30, the track will be deleted if it receives no assignments for 30 iterations.
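As a rough illustration of the association and track-lifecycle logic described above, the sketch below uses scipy.optimize.linear_sum_assignment (the Linear Sum Assignment solver mentioned earlier) together with the hypothetical iou_matrix helper from the previous sketch; it mirrors the idea rather than SORT's actual implementation:

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(detections, tracks, iou_threshold=0.3):
    # Negate the IoU matrix because the solver minimizes total cost.
    matrix = iou_matrix(detections, tracks)
    det_idx, trk_idx = linear_sum_assignment(-matrix)

    matched = []
    unmatched_dets = set(range(len(detections)))
    unmatched_trks = set(range(len(tracks)))
    for d, t in zip(det_idx, trk_idx):
        if matrix[d, t] >= iou_threshold:  # keep only confident matches
            matched.append((d, t))
            unmatched_dets.discard(d)
            unmatched_trks.discard(t)
    return matched, unmatched_dets, unmatched_trks

# Matched tracks get a Kalman update, unmatched detections start new tracks,
# and unmatched tracks age until they exceed max_age and are removed.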
Just a side note: the SORT algorithm doesn't care if our object detection model gets an object's class label wrong. The data association in the SORT algorithm uses only the IoU metric between the two sets of bounding boxes to match the objects.
Approaches for Multi-Camera Multi-Object Tracking
Now that we know how a single-camera multi-object tracking algorithm works, it is time to understand how we will perform data association across multiple cameras. To track objects on multiple cameras, the main component that we must work on is the Data Association step.
Overall, a tracking algorithm works as follows:
The Data Association is the core of any tracking algorithm. In the SORT example, how do we verify if two objects’ bounding boxes are the same? By computing the IoU metric between the current detections and the current tracks. However, we can’t apply this metric directly to the Multi-Camera approach, given that our bounding boxes will not overlap.
So, to associate bounding boxes from different camera perspectives, we have two ways to perform the data association.
Geometric Approach
If our cameras share an overlap, we, as humans, can determine which points from one camera correspond to the same points in the second camera. This opens the door to a couple of algorithms that we can employ. One of them is the homography matrix.
A homography matrix is a 3x3 transformation matrix between two planes. It also encodes a scale factor in its components. To estimate a homography matrix between two sets of 2D points (pixel coordinates), we need at least 4 corresponding point pairs.
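In math terms, and writing the pixel positions in homogeneous coordinates, the mapping can be written as follows (a standard formulation added here for clarity, not taken from the original code):

s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}

Here (x, y) is a point in the source image, (x', y') is the corresponding point in the destination image, and s is the scale factor that we divide out after the multiplication. This is exactly why the helper functions later in this post append a 1 to each point and divide by the last component of the result.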
Let's do an example together to understand an application of the homography matrix. Let's assume we have a pair of images that have some overlap.
Due to the overlap, we, as humans, are able to point out exactly which point of the first image is represented in the second image. When we hand this task to a machine, it must understand how to interpret an image. Each image has its own distinguishable set of features that are unique to that image. For machines, we use a set of algorithms called feature detectors. These feature detectors employ filters to extract, in general, lines, edges and other distinguishable patterns from an image. Some common feature detectors are SIFT and ORB.
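For reference, both detectors are available in OpenCV. This is a minimal sketch of how either one can be created and run on a grayscale image; the image path is just a placeholder:

import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create(1000)  # keep at most 1000 keypoints
orb = cv2.ORB_create(1000)    # a faster alternative with binary descriptors

kpts_sift, desc_sift = sift.detectAndCompute(img, None)
kpts_orb, desc_orb = orb.detectAndCompute(img, None)

# Note: ORB's binary descriptors should be matched with cv2.NORM_HAMMING,
# while SIFT's float descriptors use the default L2 norm.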
Now, what we’ll do is to use a feature detector to detect common points between the images above. Then, these matches will be used to compute a projective transformation, the homography matrix, that will map points from one image to the other.
In our GitHub repository you’ll see a folder called barcelona. It contains two images from a Structure-from-Motion dataset.
We will employ a common matcher from OpenCV, the Brute-Force matcher. It only requires the descriptors computed for the detected keypoints.
def run_match(desc1, desc2):
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(desc1, desc2, k=2)
    good_matches = []
    for m, n in matches:
        if m.distance < 0.75 * n.distance:
            good_matches.append([m.queryIdx, m.trainIdx])
    return good_matches
After reading the images and detecting and computing the keypoints and descriptors, we run our matching function from above:
sift = cv2.SIFT_create(1000)

curr_img = cv2.imread("./barcelona/DSCN8230.JPG", 0)
prev_img = cv2.imread("./barcelona/DSCN8231.JPG", 0)

t1 = time.perf_counter()
kpts_prev, desc_prev = sift.detectAndCompute(prev_img, None)
kpts_curr, desc_curr = sift.detectAndCompute(curr_img, None)
t2 = time.perf_counter()

spent = (t2 - t1) * 1000.0
print(f"[INFO] Time to compute feature descriptors: {spent:.2f} [ms]")

matches = np.int0(run_match(desc_prev, desc_curr))
Thanks to OpenCV, we have a function to estimate the homography matrix given two sets of points from different sources:
kpts_prev = np.int0([kpt.pt for kpt in kpts_prev]).reshape(-1, 2)
kpts_curr = np.int0([kpt.pt for kpt in kpts_curr]).reshape(-1, 2)

kpts_prev = kpts_prev[matches[:, 0]]
kpts_curr = kpts_curr[matches[:, 1]]

curr_H_prev, _ = cv2.findHomography(kpts_prev, kpts_curr, cv2.RANSAC, 3.0)
Now, pay attention to the homography matrix variable name. A good standard to follow when working with geometric transformations is to name your variable as destination_TRANSFORMATION_source. This way, it is simpler to read and understand which actors are involved in the transformation.
OK. At this point, we have computed a homography matrix that maps (x, y) points from the source image to the destination image. But what is its practical meaning?
To answer this question, notice that there are a lot of cars parked in both images. To see what the homography matrix can do, we'll detect the cars using YOLOv5. With their bounding boxes, we'll project the boxes from the left image to the right image and see how well they overlap.
To make our lives easier, let's create a function that applies the homography matrix to a set of (x, y) points:
def apply_homography(H, points):
    pts_ = []
    for x, y in points:
        pt_ = H @ np.reshape([x, y, 1.0], newshape=(3, 1))
        x_ = pt_[0] / pt_[-1]
        y_ = pt_[1] / pt_[-1]
        pts_.append([x_, y_])
    return np.int0(pts_)
Then, let’s move forward with our goal:
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")
detector.conf = 0.30
detector.classes = [2]  # class 2 = "car" in the COCO label set

preds_prev = detector(prev_img).xyxy[0].cpu().numpy()[:, :4]
preds_curr = detector(curr_img).xyxy[0].cpu().numpy()[:, :4]

preds_curr_x1y1_ = apply_homography(curr_H_prev, preds_prev[:, :2]).reshape(-1, 2)
preds_curr_x2y2_ = apply_homography(curr_H_prev, preds_prev[:, 2:]).reshape(-1, 2)
preds_curr_ = np.concatenate([preds_curr_x1y1_, preds_curr_x2y2_], axis=1).reshape(
    -1, 4
)

curr_img = cv2.cvtColor(curr_img, cv2.COLOR_GRAY2BGR)
prev_img = cv2.cvtColor(prev_img, cv2.COLOR_GRAY2BGR)

cv2.namedWindow("Vis", cv2.WINDOW_NORMAL)
cv2.imshow("Vis", np.hstack([prev_img, curr_img]))
cv2.waitKey(0)
At the end, you should see something like this:
On the second image, the blue boxes are the detections made directly on that image, and the red boxes are the detections from the first image reprojected onto the second image using our homography matrix.
It's possible to see that, in some cases, the bounding boxes from the left image drift out of view, because those cars are no longer visible in the second image. However, for the cars that we can clearly observe in both the first and the second image, we have a really good overlap, showing that our homography matrix is well estimated and working!
The same procedure will be used on the next section for tracking on multiple cameras.
Multi-Camera Tracking using Geometrical Approach
Now we'll show how to use the Geometrical Approach to create a multi-camera multi-object tracker. For this task, we'll use the WILDTRACK Seven-Camera HD Dataset from the École Polytechnique Fédérale de Lausanne. To keep things simple, we'll use the videos from camera 1 and camera 4.
The starting point is to calibrate both cameras and determine the homography matrix that transforms points from one camera to the other. In the previous section we did this manually. However, now we’ll do it automatically using a simple feature matching algorithm.
First of all, we’ll read the video sources:
video1 = cv2.VideoCapture("./data/cam1.mp4")
video2 = cv2.VideoCapture("./data/cam4.mp4")
Now, to compute the homography, we must identify the same points on both camera views. For this task, we’re going to use the SIFT descriptor to detect and compute features, then we will use the Brute Force Matcher to compute the match.
_, frame1 = video1.read()
_, frame2 = video2.read()

feat_detector = cv2.SIFT_create()  # SIFT detector/descriptor, as discussed above
kpts1, des1 = feat_detector.detectAndCompute(frame1, None)
kpts2, des2 = feat_detector.detectAndCompute(frame2, None)

bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)

good = []
for m, n in matches:
    if m.distance < 0.75 * n.distance:
        good.append(m)
The good variable stores the matches that are more confident according to our distance threshold. But we still don't have the keypoints needed to compute the homography matrix; for this, we can use the cv2.DMatch objects to retrieve the corresponding points. The value of 0.75 was chosen somewhat arbitrarily. In summary, this is Lowe's ratio test: a match from frame1 to frame2 is kept only if the distance to its best neighbor is at most 75% of the distance to its second-best neighbor.
src_pts = np.float32([kpts1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kpts2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
At the end, we use the cv2.findHomography function to estimate the homography matrix between both cameras.
cam4_H_cam1, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
It is very important here to observe the nomenclature employed. Formally, when dealing with transformations, like rigid body transformations from SE(3) or SO(3), affine transformations, and so on, we use the following nomenclature pattern:
destination_A_source
where A is the type of transformation. In our case, cam4_H_cam1 reads as the homography (H) that maps points from camera 1 to camera 4.
If we draw the corresponding features, we’ll have the following:
But does that mean we achieved a good calibration? Not necessarily. The ideal is to double-check by plotting a bounding box from one camera in the perspective of the other. So now we're going to visually check the results; after all, we're Computer Vision Engineers.
To apply a homography matrix to an (x, y) image point, we can use the following function:
def apply_homography(uv, H):
    uv_ = np.zeros_like(uv)
    for idx, (u, v) in enumerate(uv):
        uvs = H @ np.array([u, v, 1]).reshape(3, 1)
        u_, v_, s_ = uvs.reshape(-1)
        u_ = u_ / s_
        v_ = v_ / s_
        uv_[idx] = [u_, v_]
    return uv_
Note that we have to append a 1 to transform the points into homogeneous coordinates. Doing this for the two bounding box corners, the top-left and the bottom-right, we get the following:
The green box is the bounding box detected on camera 4, and the red box is the bounding box detected on camera 1 and projected onto camera 4 using the homography matrix. This result is impressive because the bounding boxes are well aligned, which is only possible thanks to the well-calibrated homography matrix that we computed.
Now, how will we create the logic to track the objects on multiple cameras? For this task, we're going to use one Sort tracker per camera; this way, we smooth the bounding box movement. Then, we'll use what we call a global tracker, whose purpose is to assign the same objects across the cameras using the detections and the homography matrix.
First, let's create all the background for our application:
video1 = cv2.VideoCapture(
? ? "./data/cam1.mp4"
)
assert video1.isOpened(), "Could not open video1"
video2 = cv2.VideoCapture(
? ? "./data/cam4.mp4"
)
assert video2.isOpened(), "Could not open video2"
cam4_H_cam1 = np.load("cam4_H_cam1.npy")
cam1_H_cam4 = np.linalg.inv(cam4_H_cam1)
homographies = list()
homographies.append(np.eye(3))
homographies.append(cam1_H_cam4)
# Loading yolov5 model
detector = torch.hub.load("ultralytics/yolov5", "yolov5m")
detector.agnostic = True  # class-agnostic NMS
detector.classes = [0]    # class 0 = "person" in the COCO label set
detector.conf = 0.5       # confidence threshold
trackers = [sort.Sort(max_age=30, min_hits=3, iou_threshold=0.3) for _ in range(2)]
global_tracker = homography_tracker.MultiCameraTracker(homographies, iou_thres=0.20)
num_frames1 = video1.get(cv2.CAP_PROP_FRAME_COUNT)
num_frames2 = video2.get(cv2.CAP_PROP_FRAME_COUNT)
num_frames = min(num_frames2, num_frames1)
num_frames = int(num_frames)
# NOTE: Second video is 17 frames behind the first video
video2.set(cv2.CAP_PROP_POS_FRAMES, 17)
It is important to say that when we deal with sensor fusion (in our case, fusing two cameras), synchronization is the main thing we must pay attention to. The reason is that we can't fuse two measurements that were taken at different timestamps; if we do, our algorithm will not provide accurate results. That's why we included the last line, to make sure that the videos are aligned correctly.
Now, the main loop:
for idx in range(num_frames):
    # Get frames
    frame1 = video1.read()[1]
    frame2 = video2.read()[1]

    # NOTE: YoloV5 expects the images to be RGB instead of BGR
    frames = [frame1[:, :, ::-1], frame2[:, :, ::-1]]

    # Run object detection
    anno = detector(frames)

    dets, tracks = [], []
    for i in range(len(anno)):
        det = anno.xyxy[i].cpu().numpy()
        det[:, :4] = np.int0(det[:, :4])
        dets.append(det)

        # Updating each per-camera tracker with its detections
        tracker = trackers[i].update(det[:, :4], det[:, -1])
        tracks.append(tracker)

    # Updating the global tracker
    global_ids = global_tracker.update(tracks)

    for i in range(2):
        frames[i] = utilities.draw_tracks(
            frames[i][:, :, ::-1], tracks[i], global_ids[i], i, classes=detector.names
        )

    vis = np.hstack(frames)
    cv2.namedWindow("Vis", cv2.WINDOW_NORMAL)
    cv2.imshow("Vis", vis)

    key = cv2.waitKey(0)
    if key == ord("q"):
        break

video1.release()
video2.release()
cv2.destroyAllWindows()
The MultiCameraTracker class contains the logic to create a global ID across the cameras. To do so, we use the same procedure as the Sort algorithm: we create an IoU matrix between the detections of the two cameras, but in this case we project the detections from camera 4 to camera 1. That's why we compute the inverse of the cam4_H_cam1 matrix at the very beginning, so that all the bounding boxes take camera 1 as the reference.
Once we assign the corresponding boxes between the cameras using the Hungarian Algorithm, we create a new set of IDs that are different from the IDs computed by the Sort algorithm. The logic is the same as in SORT's associate_detections_to_trackers function.
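For intuition, here is a minimal sketch of what that global association step could look like. This is an illustration written for this post (reusing the hypothetical iou_matrix helper from the SORT sketch above), not the actual MultiCameraTracker code from the repository:

import numpy as np
from scipy.optimize import linear_sum_assignment

def project_box(H, box):
    # Project the two corners of [x1, y1, x2, y2] with the homography H.
    corners = []
    for x, y in [(box[0], box[1]), (box[2], box[3])]:
        u, v, s = (H @ np.array([x, y, 1.0])).ravel()
        corners.extend([u / s, v / s])
    return corners

def associate_across_cameras(tracks_cam1, tracks_cam4, cam1_H_cam4, iou_thres=0.20):
    # Bring the camera-4 boxes into the camera-1 image so IoU becomes meaningful.
    projected = np.array([project_box(cam1_H_cam4, b) for b in tracks_cam4[:, :4]])
    matrix = iou_matrix(tracks_cam1[:, :4], projected)  # helper from the SORT sketch
    rows, cols = linear_sum_assignment(-matrix)          # Hungarian assignment
    # Keep only confident matches; each surviving pair shares one global ID.
    return [(r, c) for r, c in zip(rows, cols) if matrix[r, c] >= iou_thres]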
In the end, we have:
Multi-Camera Tracking using Deep Feature Association
As we saw, if we have cameras that share an overlap, we can associate two bounding boxes from different sources by creating a calibration between both cameras. We also shared how to calibrate two cameras using a feature matching algorithm.
However, what if we don't have an overlap between the cameras? Or what if our homography matrix is not well estimated? In this case, we can try to use Deep Feature Association. Given that we can detect the bounding box around the object of interest, we can crop that piece of the image from the original image. Then, to check which objects are the same on different cameras, we can run a similarity search and perform what we call re-identification (ReID).
If we draw a diagram, we will have a system similar to this one:
But what is a feature vector? A feature vector, or description vector, is an N-dimensional vector that stores information about the image. This vector is generally learned through Metric Learning. A famous paper that popularized this concept of tracking with feature association was DeepSORT. From its name, we can infer that it is a variation of the SORT algorithm based on deep feature association.
To illustrate this better, let's use Torchreid to load a pre-trained model that estimates a description vector given the crop of a bounding box.
extractor = torchreid.utils.FeatureExtractor(
    model_name="osnet_x1_0", model_path="./weights/osnet_x1_0.pth.tar", device="cuda"
)
Let's load the same videos as before:
video1 = cv2.VideoCapture(
    "/home/mbenencase/projects/daedalus/datasets/multi-camera-tracking/epfl/cam1.mp4"
)
assert video1.isOpened(), "Could not open video1"

video2 = cv2.VideoCapture(
    "/home/mbenencase/projects/daedalus/datasets/multi-camera-tracking/epfl/cam4.mp4"
)
assert video2.isOpened(), "Could not open video2"
This time, we don't have to calibrate the cameras, so let's load the detector and move on to the main loop:
# Loading the yolov5 model
detector = torch.hub.load("ultralytics/yolov5", "yolov5m")
detector.agnostic = True
detector.classes = [0]
detector.conf = 0.5
num_frames = int(video1.get(cv2.CAP_PROP_FRAME_COUNT))
COS_THRES: float = 0.80
COLORS = np.random.randint(0, 255, size=(100, 3), dtype="uint8")
Now, the main loop:
for idx in range(num_frames):
    # Get frames
    frame1 = video1.read()[1]
    frame2 = video2.read()[1]

    # Run object detection
    anno = detector([frame1, frame2])
    preds1 = anno.xyxy[0].cpu().numpy()
    preds2 = anno.xyxy[1].cpu().numpy()
We're already used to the above code. Now, we're going to extract the feature vector for each bounding box detected.
    cam1_features = []
    cam2_features = []
    for pred in preds1:
        x1, y1, x2, y2, _, _ = np.int0(pred)
        crop = frame1[y1:y2, x1:x2, :]
        feat = extractor(crop)[0].cpu().numpy()
        feat = feat / np.linalg.norm(feat)
        cam1_features.append(feat)
It is important to note that we normalize the feature vector so we can use the Cosine Similarity metric. Then, we do the same for the second camera.
    for pred in preds2:
        x1, y1, x2, y2, _, _ = np.int0(pred)
        crop = frame2[y1:y2, x1:x2, :]
        feat = extractor(crop)[0].cpu().numpy()
        feat = feat / np.linalg.norm(feat)
        cam2_features.append(feat)

    cam1_features = np.array(cam1_features)
    cam2_features = np.array(cam2_features)
What does this description vector look like? If we print its shape, we have the following:
(Pdb) feat.shape
(512,)
So, our description vector is a 512-dimensional array that maps the HxWx3 image crop into a lower-dimensional space.
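Because every feature vector has been L2-normalized, the dot product between two of them is exactly their cosine similarity, which is why a single matrix product is enough in the next step. A tiny sketch of the idea, using made-up vectors:

import numpy as np

a = np.random.randn(512)
b = np.random.randn(512)
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

# For unit vectors, the dot product equals cos(theta) between them.
cos_sim = float(a @ b)

# Stacking all normalized vectors per camera, one matrix product gives every
# pairwise cosine similarity at once: (N1, 512) @ (512, N2) -> (N1, N2).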
Now, we'll compute a similarity matrix, whose role is analogous to the IoU matrix in the SORT algorithm.
    sim_matrix = cam1_features @ cam2_features.T
    matched_indices = linear_assignment(-sim_matrix)
Just as a reminder, we use -sim_matrix because the linear_assignment function minimizes the total cost. Plotting the results with the following:
    for idx, match in enumerate(matched_indices):
        if sim_matrix[match[0], match[1]] < COS_THRES:
            continue
        else:
            # Draw bounding boxes
            x1, y1, x2, y2, _, _ = np.int0(preds1[match[0]])
            cv2.rectangle(frame1, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(
                frame1,
                f"{idx}",
                (x1, y1),
                cv2.FONT_HERSHEY_SIMPLEX,
                1,
                (0, 255, 0),
                2,
            )
            x1, y1, x2, y2, _, _ = np.int0(preds2[match[1]])
            cv2.rectangle(frame2, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(
                frame2, f"{idx}", (x1, y1), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2
            )
We have something like this:
The results don't seem very good, and this happens because we're using a pre-trained model that was trained on a different image domain, so it is expected not to work well. The second thing we observe is that a lot of people appear not to be detected. Why? Well, that is not entirely true: the people are detected, and if we plot only the detections we see that almost all of them are found. The problem arises when we perform the cosine similarity. The model is not calibrated to our image domain, and the threshold we set, stored in the COS_THRES variable, is too high. If we use a lower threshold, we run into another problem: false assignments.
The best way to overcome this problem is to fine-tune the feature extraction model, and this will be the theme of our next article, so don't forget to subscribe to the dtLabs page to be notified about it.
Conclusion
In this post we covered some advanced computer vision concepts in the area of object tracking. Multi-Camera Multi-Object Tracking is a powerful technique that can be applied to a wide variety of applications, such as Smart Cities, Security, Autonomous Vehicles and so on. It is important to note that we covered a simple way for Machine Learning and Computer Vision Engineers to start addressing this problem.
If you're looking for an engineered solution, get in touch with us. We have vast expertise implementing Multi-Camera and Multi-Sensor tracking algorithms for companies worldwide and would be happy to provide a free consultation on your problem.
See the full code in our repository: https://github.com/dtlabs-rd/mc-mot
About us
At dtLabs, we believe that great research comes from a combination of technical excellence and a commitment to solving real-world problems. Our private research lab is staffed by world-class experts who are dedicated to pushing the boundaries of science and technology. We are passionate about what we do, and we bring that passion to every project we work on. Our goal is to create solutions that not only meet our clients' needs but exceed their expectations. When you work with dtLabs, you can trust that you're partnering with a team that is committed to excellence in every aspect of our work.
Check out our website for more information and use cases related to this one.