Using Free Resources to Prepare Data for Training Object Detection Models with NVIDIA TAO Toolkit

Object detection is a crucial component of computer vision. NVIDIA’s Train Adapt Optimize (TAO) Toolkit, a Python-based AI toolkit, allows users to train, fine-tune, prune, and export AI models customized with their own data. NVIDIA supports its users by providing helpful sample Python notebooks, such as detectnet_v2.ipynb (REF). DetectNet_v2 is an object detection model within TAO, and it requires training data in KITTI format.

This article provides a guide to preparing KITTI training data using free resources, empowering users to leverage their own datasets effectively with the NVIDIA TAO Toolkit.

KITTI Format

Using the KITTI format requires data to be organized in this structure (REF):

  • The images directory contains the images to train on.
  • The labels directory contains the labels to the corresponding images.

It’s essential to ensure that each image and its corresponding label share the same file ID (the name before the file extension). This is critical because the image-to-label correspondence is maintained through this file name. (REF)
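The pairing requirement above can be checked programmatically. The sketch below (the `images` and `labels` paths are placeholders for your own dataset layout) compares the file IDs in the two directories and reports any mismatches.

```python
# Sketch: verify that every image has a matching KITTI label, and vice versa.
# The directory paths are placeholders -- adjust to your own dataset layout.
from pathlib import Path

def find_unpaired(images_dir, labels_dir):
    """Return (image IDs without labels, label IDs without images) as sets."""
    image_ids = {p.stem for p in Path(images_dir).glob('*.png')}
    label_ids = {p.stem for p in Path(labels_dir).glob('*.txt')}
    return image_ids - label_ids, label_ids - image_ids

if __name__ == '__main__':
    missing_labels, missing_images = find_unpaired('images', 'labels')
    print(f'{len(missing_labels)} images lack labels, '
          f'{len(missing_images)} labels lack images')
```

An empty result in both directions means the dataset satisfies the naming convention; otherwise the returned IDs tell you exactly which files to fix.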

We will first discuss using CVAT (Computer Vision Annotation Tool) to annotate videos and generate the label files in KITTI format. Following that, we will introduce a Python script for generating image files that correspond to these labels. Finally, we’ll discuss additional steps required for processing both the label and image files, making them compatible with NVIDIA’s TAO Toolkit.

Generate Label Files using CVAT

To prepare your training data, start by annotating your videos. For this purpose, we use CVAT (Computer Vision Annotation Tool), an open-source, web-based tool designed for annotating images and videos for computer vision algorithms. CVAT allows you to annotate your own images or videos and download the annotations in various formats, including the KITTI format.

After completing the annotation process on the CVAT platform, you can export these annotations by navigating to the “Actions” menu in your task and clicking on “Export task dataset”. From there, select “KITTI 1.0” as your export format from the dropdown list.

CVAT offers different levels of access and pricing plans for different needs. The Free plan is a good starting point. However, it’s important to note that under the free plan, only the annotation labels can be downloaded, not the image files. Therefore, when using CVAT to annotate a video, you’ll need to independently convert your video into a series of image files to complete the KITTI dataset.

Generate Image Files

To create a complete dataset in the KITTI format, we developed the following Python script to convert your video file into a series of image files.

import cv2
import os

# Path to your MP4 file
video_path = '/home/user/test1.mp4'

# Directory to save the frames
frames_dir = '/home/user/my_project/images'
os.makedirs(frames_dir, exist_ok=True)

# Create a VideoCapture object
cap = cv2.VideoCapture(video_path)

frame_number = 0
while True:
    # Read frame
    ret, frame = cap.read()

    # Break the loop if there are no more frames
    if not ret:
        break

    # Save each frame as an image
    frame_number_str = str(frame_number).zfill(6)
    frame_path = os.path.join(frames_dir, f'frame_{frame_number_str}.png')
    
    cv2.imwrite(frame_path, frame)

    frame_number += 1

# Release the VideoCapture object
cap.release()
print(f'Frames extracted to {frames_dir} directory.')        

As mentioned earlier, it is essential that each image and its corresponding label share the same file ID (name) before the file extension. Label files typically use the format “frame_00XXXX.txt”. To maintain numerical order in file names (000001, 000002, etc.), the file name string must be left-padded with zeros to a fixed width, six characters in this example. This is achieved with the Python code: frame_number_str = str(frame_number).zfill(6)

To ensure that the generated image files correspond accurately to the label files, we have created a Python script. This script overlays the bounding boxes defined in the label files onto their respective image files, allowing for a visual verification of the match.

# add one bbox to an image

import cv2

# Path to your image and label file
image_path = "/home/user/my_project/images/frame_007581.png"
label_file_path = "/home/detectnet_v2/data/my_project/labels/frame_007581.txt"

# Read the first line from the label file
with open(label_file_path, 'r') as file:
    label_str = file.readline().strip()

# Parse the label string to extract bounding box coordinates
parts = label_str.split()
left, top, right, bottom = map(float, parts[4:8])

# Convert float coordinates to integers
bbox = (int(left), int(top), int(right), int(bottom))

# Read the image
image = cv2.imread(image_path)

# Draw the bounding box
cv2.rectangle(image, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (0, 255, 0), 2)

# Display the image
from matplotlib import pyplot as plt
# Convert the image from BGR to RGB color space (OpenCV uses BGR by default)
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Display the image using Matplotlib
plt.imshow(image_rgb)
plt.title('Displayed Image')
plt.axis('off')  # Turn off axis numbers and labels
plt.show()

# Optionally, save the image with bbox
#cv2.imwrite("/home/annotation/test1/box_007581.png", image)        

KITTI Data Format Compatible with NVIDIA’s TAO Toolkit

To ensure compatibility with NVIDIA TAO’s detectnet_v2, two additional processing steps are necessary. At this point, the image and label files are prepared and stored at:

image_path = "/home/user/my_project/images"
label_file_path = "/home/detectnet_v2/data/my_project/labels"        

Firstly, as not every video frame contains identified objects, there are likely fewer label files than image files. This discrepancy means that some images may not have corresponding labels. To address this, we need to remove image files that lack matching label files.

# Prepare KITTI training data:
# keep only the images that have a corresponding label file

import os
import shutil

labels_dir = "/home/detectnet_v2/data/my_project/labels"
label_files = [f for f in os.listdir(labels_dir)
               if os.path.isfile(os.path.join(labels_dir, f))]
selected_images = {f.replace('.txt', '.png') for f in label_files}

source_folder = '/home/user/my_project/images'
destination_folder = '/home/detectnet_v2/data/my_project/images'
os.makedirs(destination_folder, exist_ok=True)

count = 0
for f in os.listdir(source_folder):
    if f in selected_images:
        shutil.move(os.path.join(source_folder, f),
                    os.path.join(destination_folder, f))
        count += 1
print(count)        

Secondly, the label file format produced by CVAT includes an extra data field compared to the format required by NVIDIA TAO. Therefore, we need to modify the label files to align with TAO’s requirements.

A KITTI format label file is a text file containing one line per object, with space-separated fields (REF):

  • type — the object class name (e.g. car, pedestrian)
  • truncated — a float from 0 to 1 indicating how much the object extends beyond the image boundaries
  • occluded — an integer (0–3) indicating the occlusion state
  • alpha — the observation angle of the object
  • bbox — four values: left, top, right, and bottom pixel coordinates of the 2D bounding box
  • dimensions — three values: the 3D object height, width, and length (in meters)
  • location — three values: the 3D object x, y, z position in camera coordinates (in meters)
  • rotation_y — the rotation around the Y-axis in camera coordinates
  • score — a confidence score (the extra 16th field exported by some tools)

The KITTI label file format required by NVIDIA TAO comprises only 15 data fields, excluding the “score” field (REF). To accommodate this, we have developed a Python script tailored to remove the 16th column, the extra data field, from the label files. This crucial step ensures that the label files conform to NVIDIA TAO’s specific format requirements, facilitating seamless integration with the toolkit.

# remove the 16th data field in the label file
import os

def remove_last_field(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    with open(file_path, 'w') as file:
        for line in lines:
            fields = line.strip().split(' ')
            # Remove the score field only when it is present (16 fields),
            # so the script is safe to run more than once
            if len(fields) == 16:
                modified_line = ' '.join(fields[:-1]) + '\n'
            else:
                modified_line = line
            file.write(modified_line)

def main():
    directory = '/home/detectnet_v2/data/my_project/labels'
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)
            remove_last_field(file_path)

if __name__ == "__main__":
    main()        
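After running the cleanup, a quick sanity check confirms that every label line now carries exactly the 15 fields detectnet_v2 expects. The sketch below (the directory path is a placeholder) counts nonconforming lines:

```python
# Sketch: count label lines whose field count differs from the 15 required
# by NVIDIA TAO's detectnet_v2. The directory path is a placeholder.
import os

def count_bad_lines(directory, expected_fields=15):
    """Return the number of non-empty label lines not matching expected_fields."""
    bad = 0
    for filename in os.listdir(directory):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(directory, filename)) as f:
            for line in f:
                if line.strip() and len(line.split()) != expected_fields:
                    bad += 1
    return bad

if __name__ == '__main__':
    print(count_bad_lines('/home/detectnet_v2/data/my_project/labels'))
```

A result of 0 means the labels are ready for TAO; any other value points at lines that still need cleaning.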

References

The KITTI Vision Benchmark Suite

The dataset formats for computer-vision apps supported by TAO Toolkit

Object Detection using TAO DetectNet_v2

Computer Vision Annotation Tool (CVAT)

