Localization and Object Detection with Deep Learning and YOLO (Single shot detectors)
Shashank V Raghavan
Artificial Intelligence | Autonomous Systems | Resident Robot Geek | Quantum Computing | Product and Program Management
Localization and object detection are two of the core tasks in computer vision, as they are applied in many real-world applications such as autonomous vehicles and robotics.
The goal is to classify each object and localize it. The process splits into two cases: in the first, we know the number of objects (we will refer to this problem as classification + localization), and in the second we don't (object detection).
Classification + Localization
If we have only one object or we know the number of objects, it is actually trivial. We can use one convolutional neural network and train it not only to classify the image but also to output 4 coordinates for the bounding box. In that way we treat the localization as a simple regression problem.
For example, we can borrow a well-studied model such as ResNet or AlexNet, which consists of a bunch of convolutional, pooling and other layers, and repurpose the fully connected layer to produce the bounding box in addition to the category. It is so simple that it makes us question whether it will actually give good results. And it works pretty well in practice. Of course, you can get fancy with it and modify the architecture to serve specific problems or enhance its accuracy, but the main idea remains.
Be sure to note that in order to use this model, we need a training set with images annotated for both the class and the bounding box. And producing such annotations is not the most fun task.
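To make the idea concrete, here is a minimal PyTorch sketch (the class name, the choice of ResNet-18 and the head sizes are illustrative assumptions, not the exact architecture of any particular paper): a ResNet backbone whose fully connected layer is replaced by two heads, one for the class scores and one for the 4 box coordinates.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ClassifierWithBox(nn.Module):
    """ResNet backbone with two heads: class scores and 4 bounding-box coordinates."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet18(weights=None)
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()                 # keep only the pooled features
        self.backbone = backbone
        self.class_head = nn.Linear(in_features, num_classes)  # classification head
        self.box_head = nn.Linear(in_features, 4)               # (x, y, w, h) regression head

    def forward(self, x):
        features = self.backbone(x)
        return self.class_head(features), self.box_head(features)

model = ClassifierWithBox(num_classes=20)
scores, boxes = model(torch.randn(1, 3, 224, 224))
```

Training would then simply combine a classification loss (e.g. cross-entropy) on the scores with a regression loss (e.g. smooth L1) on the box coordinates.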
But what if we do not know the number of objects a priori? Then we need to get into the rabbit hole and talk about some hardcore stuff. Are you ready? Do you want to take a break first? Sure, I understand, but I warn you not to leave. This is where the fun begins.
Object Detection
There are some clever ideas to make the system agnostic to the number of objects and to reduce its computational cost. So, we do not know the exact number of objects in our image, and we want to classify all of them and draw a bounding box around each one. That means the number of coordinates the model should output is not constant. If the image has 2 objects, we need 8 coordinates. If it has 4 objects, we want 16. So how do we build such a model?
One key idea borrowed from traditional computer vision is region proposals. We generate a set of windows that are likely to contain an object using classic CV algorithms, like edge and shape detection, and we feed only these windows (or regions of interest) to the CNN.
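As a concrete example of this proposal step, OpenCV's contrib module exposes the selective search algorithm (the same method R-CNN uses); a rough sketch, assuming `opencv-contrib-python` is installed and using a placeholder image path:

```python
import cv2

image = cv2.imread("example.jpg")  # placeholder path, any BGR image works
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # the faster, slightly less exhaustive mode
rects = ss.process()               # (x, y, w, h) windows likely to contain objects
print(len(rects), "region proposals")
```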
R-CNN
Given an image with multiple objects, we generate some regions of interest using a proposal method (in R-CNN's case, selective search) and warp the regions to a fixed size. We forward each region through a convolutional neural network (such as AlexNet), whose features are fed to an SVM that makes a classification decision for each region and to a regressor that predicts a correction for each bounding box. This correction adjusts the proposed region, which may be roughly in the right position but not at the exact size and orientation.
Although the model produces good results, it suffers from one major issue: it is quite slow and computationally expensive. Imagine that in an average case we produce 2000 regions, which we need to store on disk, and we forward each one of them through the CNN for multiple passes until it is trained. To fix some of these problems, an improved model comes into play, called Fast R-CNN.
Fast RCNN
The idea is straightforward. Instead of passing all regions through the convolutional layers one by one, we pass the entire image once and produce a feature map. Then we take the region proposals as before (using some external method) and, in a sense, project them onto the feature map. Now we have the regions on the feature map instead of the original image, and we can forward them through some fully connected layers to output the classification decision and the bounding box correction.
Note that the projection of the region proposals is implemented using a special layer (the ROI pooling layer), which is essentially a type of max-pooling with a pool size that depends on the input, so that the output always has the same size.
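A quick way to see this fixed-size behaviour is torchvision's ready-made ROI pooling op; the feature-map size, the stride of 16 and the box coordinates below are made-up numbers for illustration:

```python
import torch
from torchvision.ops import roi_pool

# feature map produced by the backbone for one image: (batch, channels, H, W)
feature_map = torch.randn(1, 256, 32, 32)

# region proposals in (x1, y1, x2, y2) image coordinates, prefixed by the batch index
rois = torch.tensor([[0, 16.0, 16.0, 128.0, 96.0],
                     [0, 64.0, 32.0, 200.0, 180.0]])

# spatial_scale maps image coordinates onto the downsampled feature map;
# here we assume the backbone reduced the image by a factor of 16
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- same size regardless of region size
```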
Faster RCNN
We can take this a step further. Using the feature maps produced by the convolutional layers, we infer region proposals with a Region Proposal Network (RPN) rather than relying on an external system. Once we have those proposals, the remaining procedure is the same as in Fast R-CNN (forward to the ROI layer, classify with a softmax head and predict the bounding box correction). The tricky part is how to train the whole model, as the region proposal network and the detection head are separate tasks that need to be learned jointly; a sketch of the proposal head follows below.
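Here is a minimal sketch of what such a Region Proposal Network head can look like (the channel count and the number of anchors are assumptions for illustration): a small convolution over the shared feature map followed by two 1x1 convolutions, one scoring "object vs. background" per anchor and one regressing 4 box offsets per anchor.

```python
import torch
import torch.nn as nn

class RegionProposalHead(nn.Module):
    """Minimal RPN head sketch: per-anchor objectness score and box offsets."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)      # object vs background
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # per-anchor offsets

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.box_deltas(x)

head = RegionProposalHead()
scores, deltas = head(torch.randn(1, 256, 32, 32))
print(scores.shape, deltas.shape)  # (1, 9, 32, 32) and (1, 36, 32, 32)
```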
As the name suggests, Faster R-CNN turns out to be much faster than the previous models and is the variant preferred in many real-world applications.
Localization and object detection is a super active and interesting area of research, due to the pressing demand from real-world applications that require excellent performance in computer vision tasks (self-driving cars, robotics). Companies and universities come up with new ideas on how to improve the accuracy on a regular basis.
There is another class of models for localization and object detection, called single shot detectors, which have become very popular in recent years because they are even faster and generally require less computation. Sure, they are less accurate, but they are ideal for embedded systems and similar power-constrained applications.
YOLO - You only look once (Single shot detectors)
YOLO!!! So do we only live once? I sure do not know. What I know is that we only have to LOOK once. Wait what?
That’s right. If you want to detect and localize objects in an image, there is no need to go through the whole process of proposing regions of interest, classifying them and correcting their bounding boxes, which is exactly what models like R-CNN and Faster R-CNN do, as we saw.
Do we really need all that complexity and computation? Well, if we want top-notch accuracy we certainly do. Luckily there is another, simpler way to perform such a task: processing the image only once and outputting the predictions immediately. These types of models are called single shot detectors.
Single shot detectors
Instead of having a dedicated system to propose regions of interest, we use a set of predefined boxes in which to look for objects, and the image is forwarded through a stack of convolutional layers that predict class scores and bounding-box offsets. For each predefined box we predict a number of bounding boxes, each with its own confidence score, we detect one object centered in that box, and we output a set of probabilities over the possible classes. Once we have all that, we simply (and maybe naively) keep only the boxes with a high confidence score. And it works, with very impressive results actually. To illustrate the overall flow even better, let's use one of the most popular single shot detectors, called YOLO.
You only look once (YOLO)
There have been 3 versions of the model so far, with each new one improving on the previous in terms of both speed and accuracy. The number of predefined cells and the number of predicted bounding boxes per cell are defined based on the input size and the number of classes. In our case, we are going to use the actual numbers used to evaluate the model on the PASCAL VOC dataset.
First, we divide the image into a grid of 13x13, resulting in 169 cells in total.
For every one of the cells, the model predicts 5 bounding boxes (x, y, w, h), each with a confidence score; it detects one object regardless of the number of boxes; and it outputs 20 probabilities, one for each of the 20 classes.
In total, we have 169 * 5 = 845 bounding boxes, and the shape of the output tensor of the model is going to be (13, 13, 5*5 + 20) = (13, 13, 45). The whole essence of the YOLO models is to build this (13, 13, 45) tensor. To accomplish that, it uses a CNN backbone and 2 fully connected layers to perform the actual regression.
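A small sketch of how that tensor is laid out (the raw output below is random noise standing in for the network's real prediction):

```python
import torch

S, B, C = 13, 5, 20          # grid size, boxes per cell, classes (PASCAL VOC)
depth = B * 5 + C            # 5 values (x, y, w, h, confidence) per box + class probabilities

raw = torch.randn(1, S * S * depth)                # stand-in for the network's flat output
grid = raw.view(1, S, S, depth)                    # reshape into the (13, 13, 45) tensor
boxes = grid[..., :B * 5].reshape(1, S, S, B, 5)   # per-box (x, y, w, h, confidence)
class_probs = grid[..., B * 5:]                    # per-cell probabilities for the 20 classes
print(grid.shape, boxes.shape, class_probs.shape)  # (1,13,13,45) (1,13,13,5,5) (1,13,13,20)
```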
The final prediction is extracted after keeping only the bounding boxes with a high confidence score (higher than a threshold such as 0.3).
Because the model may output duplicate detections for the same object, we use a technique called non-maximum suppression to remove duplicates. In a simple implementation, we sort the predictions by confidence score and, iterating from the highest, keep a box only if it does not overlap an already-kept box of the same class beyond some IoU threshold.
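In practice one usually relies on a standard IoU-based implementation such as torchvision's; the boxes and scores below are made-up detections for a single class:

```python
import torch
from torchvision.ops import nms

# hypothetical detections for one class: boxes in (x1, y1, x2, y2) format
boxes = torch.tensor([[100., 100., 210., 210.],
                      [105., 105., 215., 215.],   # heavy overlap with the first box
                      [300., 300., 400., 400.]])
scores = torch.tensor([0.9, 0.75, 0.8])

# keep the highest-scoring box and drop any box overlapping it with IoU > 0.5
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the duplicate detection is suppressed
```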
As far as the actual model is concerned, the architecture is quite simple, as it consists of only convolutional and pooling layers, without any fancy tricks. We train the model using a multi-part loss function, which includes a classification loss, a localization loss and a confidence loss.
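Here is a hedged sketch of such a multi-part loss, simplified to one predicted box per grid cell (the weighting factors follow the commonly cited λ_coord = 5 and λ_noobj = 0.5 convention, but the shapes and everything else are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, pred_conf, pred_cls,
                   true_boxes, true_conf, true_cls,
                   lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of a YOLO-style multi-part loss, assuming one box per grid cell.
    pred_boxes/true_boxes: (N, S, S, 4), pred_conf/true_conf: (N, S, S),
    pred_cls/true_cls: (N, S, S, C)."""
    obj = true_conf > 0  # cells that actually contain an object

    # localization loss: only cells responsible for an object contribute
    loc = F.mse_loss(pred_boxes[obj], true_boxes[obj], reduction="sum")

    # confidence loss: object cells at full weight, empty cells down-weighted
    conf = (F.mse_loss(pred_conf[obj], true_conf[obj], reduction="sum")
            + lambda_noobj * F.mse_loss(pred_conf[~obj], true_conf[~obj], reduction="sum"))

    # classification loss on object cells only
    cls = F.mse_loss(pred_cls[obj], true_cls[obj], reduction="sum")

    return lambda_coord * loc + conf + cls

S, C = 13, 20
loss = detection_loss(torch.rand(2, S, S, 4), torch.rand(2, S, S), torch.rand(2, S, S, C),
                      torch.rand(2, S, S, 4), (torch.rand(2, S, S) > 0.9).float(),
                      torch.rand(2, S, S, C))
```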
The most recent versions of YOLO have introduced some special tricks to improve accuracy and reduce training and inference time. Some examples are batch normalization, anchor boxes, dimension clusters and others. The power of YOLO is not its spectacular accuracy or the very clever ideas behind it; it is its superb speed, which makes it ideal for embedded systems and low-power applications. That's why self-driving cars and surveillance cameras are its most common real-world use cases.
As deep learning continues to play along with computer vision (and it surely will), we can expect many more models to be tailored for low-power systems, even if they sometimes sacrifice accuracy. And don't forget the whole Internet of Things. This is where these models really shine.