DeepSORT Algorithm For Object Tracking

DeepSORT (Deep Simple Online and Realtime Tracking) is an advanced object tracking algorithm that builds upon the original SORT (Simple Online and Realtime Tracking) by incorporating deep learning for more robust performance, especially in complex environments with occlusions and similar-looking objects.

Background: SORT Recap

SORT uses:

  • Kalman Filter: For predicting the next position of objects based on motion (position and velocity).
  • Hungarian Algorithm: For solving the assignment problem, matching detected objects between consecutive frames based on the Intersection over Union (IoU) of their bounding boxes (a minimal IoU sketch follows below).
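
For reference, the IoU computation itself is only a few lines (a minimal sketch; boxes are assumed to be in [x1, y1, x2, y2] format):

def iou(box_a, box_b):
    # Intersection rectangle corners
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Overlap divided by union; the small epsilon avoids division by zero
    return inter / (area_a + area_b - inter + 1e-9)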

Limitations of SORT:

  • Fails in scenarios with occlusion (when objects are temporarily hidden).
  • Struggles with appearance similarity (e.g., two people wearing similar clothes).

Key Components of DeepSORT:

Detection: Requires external object detectors like YOLO, Faster R-CNN, or SSD to provide bounding boxes for objects in each frame.

Motion Model (Kalman Filter): Predicts the object's position in the next frame based on its current motion (velocity, position, etc.).

Data Association (Hungarian Algorithm): Matches current detections with existing tracked objects using a cost matrix.
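
A minimal sketch of this assignment step using SciPy's Hungarian solver (the cost values here are illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = existing tracks, columns = current detections; lower cost = better match
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.6]])
track_idx, det_idx = linear_sum_assignment(cost)
matches = [(int(t), int(d)) for t, d in zip(track_idx, det_idx) if cost[t, d] < 0.5]
print(matches)  # [(0, 0), (1, 1)] -- detection 2 stays unmatched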

Appearance Descriptor (Deep Learning):

  • Uses a Convolutional Neural Network (CNN) to extract feature embeddings from detected objects (a stand-in extractor is sketched after this list).
  • This helps maintain object identity over time, even during partial occlusions or abrupt motion changes.
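
DeepSORT's original descriptor is a small re-identification CNN trained on person crops that outputs 128-dimensional embeddings. As a rough stand-in to illustrate the mechanics (not DeepSORT's actual network, and yielding 512-d vectors), a truncated torchvision ResNet-18 can serve as an embedding extractor, assuming a recent torchvision:

import torch
import torch.nn as nn
from torchvision import models, transforms

# Stand-in descriptor: ResNet-18 trunk with the classifier removed (512-d output)
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 64)),  # typical person-crop aspect ratio
    transforms.ToTensor(),
])

def embed(crop_rgb):
    # crop_rgb: HxWx3 uint8 image of one detection
    x = preprocess(crop_rgb).unsqueeze(0)
    with torch.no_grad():
        f = backbone(x).squeeze(0)
    return f / f.norm()  # L2-normalize so cosine distance is well-behaved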

Combined Cost Metric:

  • Motion Cost (IoU): Measures overlap between predicted and detected bounding boxes.
  • Appearance Cost (Cosine Distance): Compares feature embeddings of detections and tracked objects.
  • A weighted combination of both metrics improves tracking robustness (see the sketch after this list).
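
In code, this blend can be as simple as the function below (lam is a tunable weight; note that the original DeepSORT paper gates matches with a Mahalanobis motion distance rather than IoU, so treat this IoU-based version as a simplification):

def combined_cost(iou_score, cosine_dist, lam=0.5):
    # IoU is a similarity (higher = better), so convert it to a cost first
    motion_cost = 1.0 - iou_score
    return lam * motion_cost + (1.0 - lam) * cosine_dist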

How DeepSORT Works

Detection (Input): An external detector such as YOLO, Faster R-CNN, or SSD supplies bounding boxes for each frame; DeepSORT itself performs no detection.

Motion Model (Prediction): Kalman Filter predicts the next position of each tracked object based on its previous state (position, velocity, etc.).
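
A minimal constant-velocity illustration with filterpy (the same library used in the full implementation below), tracking only the box center for clarity:

import numpy as np
from filterpy.kalman import KalmanFilter

# State = [cx, cy, vx, vy]; we only measure the center position
kf = KalmanFilter(dim_x=4, dim_z=2)
kf.F = np.array([[1., 0., 1., 0.],   # cx' = cx + vx
                 [0., 1., 0., 1.],   # cy' = cy + vy
                 [0., 0., 1., 0.],
                 [0., 0., 0., 1.]])
kf.H = np.array([[1., 0., 0., 0.],
                 [0., 1., 0., 0.]])
kf.x[:2] = np.array([[100.], [200.]])  # initial center

kf.predict()                        # estimate where the object moved
kf.update(np.array([103., 204.]))   # correct with the matched detection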

Appearance Descriptor (Deep Learning):

  • DeepSORT introduces a Convolutional Neural Network (CNN) to extract a feature vector (appearance embedding) for each detected object. This helps distinguish between visually similar objects.
  • These embeddings are typically 128-dimensional vectors that capture unique visual characteristics (a minimal cosine-distance comparison is sketched below).
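
Comparing two embeddings is then a one-liner (a minimal sketch):

import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: small values mean the crops look alike
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)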

Data Association (Matching):

  • Combines IoU (spatial information) and cosine distance (appearance similarity) to associate current detections with existing tracks.
  • Hungarian Algorithm is then used to optimally match detections to tracks based on this combined metric.

Track Management:

  • Confirmed Tracks: Objects that have been consistently detected over multiple frames.
  • Tentative Tracks: Newly detected objects awaiting confirmation.
  • Deleted Tracks: Tracks removed after missing detections for a threshold number of frames (a bare-bones version of this lifecycle follows below).
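
A bare-bones sketch of this lifecycle (the n_init and max_age thresholds are illustrative; the reference DeepSORT implementation defaults to 3 and 30):

class TrackLifecycle:
    TENTATIVE, CONFIRMED, DELETED = range(3)

    def __init__(self, n_init=3, max_age=30):
        self.state = self.TENTATIVE
        self.hits = 1           # consecutive successful matches
        self.misses = 0         # frames since the last match
        self.n_init = n_init    # hits required to confirm a track
        self.max_age = max_age  # misses tolerated before deletion

    def mark_hit(self):
        self.hits += 1
        self.misses = 0
        if self.state == self.TENTATIVE and self.hits >= self.n_init:
            self.state = self.CONFIRMED

    def mark_miss(self):
        self.misses += 1
        # Tentative tracks die on their first miss; confirmed ones get max_age
        if self.state == self.TENTATIVE or self.misses > self.max_age:
            self.state = self.DELETED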

Why DeepSORT is Effective

  • Robust to Occlusions: Appearance embeddings help maintain identity even when objects are temporarily hidden.
  • Reduced Identity Switches: Combines spatial and visual data, making it less likely to confuse similar-looking objects.
  • Scalable: Can handle multiple objects in real-time applications like pedestrian tracking, vehicle tracking, etc.

Below is a simplified implementation of DeepSORT-style object tracking using YOLOv5 as the detector. It detects and tracks multiple objects in a video stream; note that the appearance descriptor here is a color-histogram placeholder rather than DeepSORT's CNN re-ID network.

pip install torch torchvision torchaudio
pip install opencv-python
pip install numpy
pip install filterpy
pip install scipy
pip install yolov5

import cv2
import torch
import numpy as np
from filterpy.kalman import KalmanFilter
from scipy.optimize import linear_sum_assignment
from scipy.spatial import distance
from yolov5 import YOLOv5

# Load YOLOv5 model
model = YOLOv5("yolov5s.pt")  # Use 'yolov5s' for speed, 'yolov5m' or 'yolov5l' for better accuracy

# Kalman Filter Tracker class
class Tracker:
    def __init__(self, bbox, feature, tracker_id):
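        # 7-d state: the bbox corners [x1, y1, x2, y2] plus three velocity terms;
        # the measurement is the 4-d bbox itself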
        self.kalman = KalmanFilter(dim_x=7, dim_z=4)
        self.kalman.F = np.array([
            [1, 0, 0, 0, 1, 0, 0],
            [0, 1, 0, 0, 0, 1, 0],
            [0, 0, 1, 0, 0, 0, 1],
            [0, 0, 0, 1, 0, 0, 0],
            [0, 0, 0, 0, 1, 0, 0],
            [0, 0, 0, 0, 0, 1, 0],
            [0, 0, 0, 0, 0, 0, 1]
        ])
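        # Measurement function: we observe the first four state entries (the bbox)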
        self.kalman.H = np.array([
            [1, 0, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0]
        ])
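        # Covariance/noise tuning (values mirror the original SORT implementation)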
        self.kalman.R[2:, 2:] *= 10.
        self.kalman.P[4:, 4:] *= 1000.
        self.kalman.P *= 10.
        self.kalman.Q[-1, -1] *= 0.01
        self.kalman.Q[4:, 4:] *= 0.01
        
        self.kalman.x[:4] = np.asarray(bbox, dtype=float).reshape((4, 1))
        self.feature = feature
        self.tracker_id = tracker_id
        self.hits = 1
        self.no_losses = 0

    def predict(self):
        self.kalman.predict()
        self.no_losses += 1  # age the track; reset to 0 when a detection matches it

    def update(self, bbox, feature):
        self.kalman.update(bbox)
        self.feature = feature
        self.hits += 1
        self.no_losses = 0

# Feature extraction using a simple color histogram (placeholder for a CNN-based descriptor)
def get_features(image, bbox):
    x1, y1, x2, y2 = map(int, bbox)
    # Clamp to frame bounds so partially off-screen boxes don't yield an empty crop
    h, w = image.shape[:2]
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, x2), min(h, y2)
    crop = image[y1:y2, x1:x2]
    if crop.size == 0:
        return np.zeros(8 * 8 * 8, dtype=np.float32)  # matches the histogram size below
    hist = cv2.calcHist([crop], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

# Data association: cosine-distance cost matrix solved with the Hungarian algorithm
def associate_detections(tracks, detections, features, max_cosine_distance=0.5):
    if len(tracks) == 0:
        # No existing tracks: every detection is unmatched and will seed a new track
        return [], [], list(range(len(detections)))

    # Rows = tracks, columns = detections; lower cost = more similar appearance
    cost_matrix = np.zeros((len(tracks), len(detections)), dtype=np.float32)
    for i, track in enumerate(tracks):
        for j, feature in enumerate(features):
            d = distance.cosine(track.feature, feature)
            cost_matrix[i, j] = d if np.isfinite(d) else 1.0  # guard degenerate features

    # Optimal one-to-one assignment, then reject matches above the distance threshold
    row_ind, col_ind = linear_sum_assignment(cost_matrix)
    matched_indices = [(i, j) for i, j in zip(row_ind, col_ind)
                       if cost_matrix[i, j] < max_cosine_distance]

    unmatched_tracks = list(set(range(len(tracks))) - {i for i, _ in matched_indices})
    unmatched_detections = list(set(range(len(detections))) - {j for _, j in matched_indices})

    return matched_indices, unmatched_tracks, unmatched_detections

# Video capture
cap = cv2.VideoCapture("input_video.mp4")
trackers = []
tracker_id = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # YOLOv5 detection; columns are x1, y1, x2, y2, confidence, class
    results = model.predict(frame)
    detections = results.xyxy[0].cpu().numpy()
    detections = detections[detections[:, 4] > 0.4]  # drop low-confidence boxes

    # Extract features for detections
    features = [get_features(frame, det[:4]) for det in detections]

    # Predict new locations for all tracks
    for tracker in trackers:
        tracker.predict()

    # Associate detections to existing tracks
    matches, unmatched_tracks, unmatched_detections = associate_detections(trackers, detections, features)

    # Update matched trackers
    for track_idx, det_idx in matches:
        bbox = detections[det_idx][:4]
        feature = features[det_idx]
        trackers[track_idx].update(bbox, feature)

    # Create new trackers for unmatched detections
    for det_idx in unmatched_detections:
        bbox = detections[det_idx][:4]
        feature = features[det_idx]
        trackers.append(Tracker(bbox, feature, tracker_id))
        tracker_id += 1

    # Remove lost trackers
    trackers = [t for t in trackers if t.no_losses < 5]

    # Draw bounding boxes
    for tracker in trackers:
        x1, y1, x2, y2 = map(int, tracker.kalman.x[:4].flatten())
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f'ID: {tracker.tracker_id}', (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    cv2.imshow("DeepSORT Tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()        

How This Works:

  1. Object Detection: Uses YOLOv5 to detect objects in each frame.
  2. Feature Extraction: Extracts simple color histograms (this can be replaced with a CNN for better accuracy).
  3. Kalman Filter: Predicts the next position of tracked objects.
  4. Data Association: Matches detections to tracks with the Hungarian algorithm on a cosine-distance cost matrix.
  5. Track Management: Updates matched tracks, creates new ones, and removes lost tracks.
