Object Detection in Computer Vision: An Intuitive Approach

Prashanth Rajan

Technical Lead at Dover Corporation, fitness enthusiast, health and wellness coach guiding people towards holistic living

Published: September 26, 2021

Introduction

Ever wondered how computers perceive the world? Have you ever been intrigued by how self-driving cars navigate the raging chaos of traffic? Metropolitan city dwellers will surely know what I am talking about!

In this article we delve into the exciting world of computer vision. Whereas humans view real-world objects as a confluence of colors and shapes, computers see them as arrays of numbers.

Our perspective

Computer Perspective

A grayscale image is essentially a two-dimensional array of cells called pixels. In a standard 8-bit image, each pixel takes a value from 0 to 255, where 0 is black, 255 is white, and the values in between are shades of gray.
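To make this concrete, here is a minimal sketch of a grayscale image as an array of numbers. It assumes NumPy is available, and the 4x4 pixel values are made up purely for illustration.

```python
import numpy as np

# A tiny 4x4 grayscale "image" (made-up values).
# Each cell is one pixel: 0 is black, 255 is white.
gray = np.array(
    [
        [0, 64, 128, 255],
        [0, 64, 128, 255],
        [0, 64, 128, 255],
        [0, 64, 128, 255],
    ],
    dtype=np.uint8,
)

print(gray.shape)              # (height, width)
print(gray.min(), gray.max())  # darkest and brightest pixel values
print(gray[0, 3])              # the pixel in row 0, column 3
```

Each row of the array is one row of pixels, so this toy image fades from black on the left to white on the right.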

A color image is simply a stack of such arrays! In other words, it is a multidimensional array with one layer per color channel: red, green and blue (RGB).

Color Image distilled into array channels
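The same idea can be sketched in code for a color image. This is only an illustration, assuming NumPy and the common (height, width, channel) layout; the pixel values are invented.

```python
import numpy as np

# A 2x3 color image stored as (height, width, channels), all black to start.
height, width = 2, 3
rgb = np.zeros((height, width, 3), dtype=np.uint8)

rgb[0, 0] = [255, 0, 0]      # a pure red pixel
rgb[0, 1] = [0, 255, 0]      # a pure green pixel
rgb[1, 2] = [255, 255, 255]  # a white pixel (all channels at maximum)

# Splitting the stack gives one 2-D array per color channel.
red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]

print(rgb.shape)  # (2, 3, 3)
print(red)        # intensity of red at every pixel
```

Each channel on its own looks exactly like a grayscale image; it is the stack of three that produces color.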

Now that we know how images are structured, let's understand how computers decode these images and process them for identification. We've all encountered various forms of computer vision, be it computers examining CAT and PET scans for malignant tumours, the ubiquitous face recognition software or, of course, the self-driving car. Suppose we were to feed an image of a cat to a computer tasked with classifying it as such. How would the computer go about it? How does technology like Google Lens accurately pinpoint what the user is looking for based on an image search?

Convolutional Neural Networks

Enter convolutional neural networks (CNNs)! The mathematics behind them is quite tedious, so I'll skip over that part. In essence, CNNs work by gathering distinct features of an image, much as a microscope is used to examine organisms, running the pixel values through a neural net, computing the deviation of the prediction from the ground truth, and iterating until the machine has a fair idea of what it is looking at. In our cat example, the algorithm will try to capture features such as face shape, ear structure, fur texture, eye features and other relevant details that will help in identifying future images of a cat. This is pretty much akin to how we learnt to identify animals in kindergarten. The image below will help make this clear.

The process starts with convolution. This is simply a mathematical operation in which a small filter slides over the image to pick out the features that will help classify it. It is computationally efficient because the same small filter is reused across the whole image, so the network does not need a separate weight for every background and foreground pixel just to identify an object. The output of a convolution is called a feature map (shown as red-bordered boxes). This is followed by max pooling, another operation that cuts down the dimensionality of a feature map to reduce the memory load on the system.
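Here is a rough sketch of those two operations, written in plain NumPy rather than a deep learning library. The function names `conv2d` and `max_pool2d`, the toy image and the hand-picked edge filter are my own assumptions for illustration; in a real CNN the filter values are learned during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the patch by the kernel element-wise and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked filter that responds to vertical edges (an assumption).
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d(image, kernel)  # 4x4 feature map
pooled = max_pool2d(feature_map)     # 2x2 after pooling

print(feature_map)
print(pooled)
```

The feature map is large only where the image changes from dark to bright, and pooling halves each dimension while keeping the strongest responses.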

All these operations precede the final stage of deep neural network training. By passing the final pixel values through a series of weights and activation functions, an output is generated and its deviation from the ground truth is calculated. This deviation helps the network pinpoint the weights that are responsible for the most error and adjust them to get the closest matching prediction. This is done with two further mathematical procedures called backpropagation and gradient descent. Links are given further below.
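The training loop described above can be sketched with a single artificial neuron. This is a deliberately tiny stand-in, not a full CNN: the data, the learning rate `lr` and the number of steps are all made-up assumptions, and NumPy is used in place of a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for flattened feature values: 4 samples, 2 features (made up).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])  # ground truth labels

w = rng.normal(scale=0.1, size=2)   # weights, randomly initialised
b = 0.0                             # bias
lr = 1.0                            # learning rate (an arbitrary choice)

def sigmoid(z):
    """Activation function squashing any number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(p, target):
    """Cross-entropy: the deviation of predictions from the ground truth."""
    eps = 1e-12
    return -np.mean(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))

losses = []
for step in range(2000):
    p = sigmoid(X @ w + b)         # forward pass: weights, then activation
    losses.append(loss_fn(p, y))
    error = p - y                  # how far each prediction is off
    grad_w = X.T @ error / len(y)  # backpropagation: blame assigned to each weight
    grad_b = error.mean()
    w -= lr * grad_w               # gradient descent: nudge weights downhill
    b -= lr * grad_b

predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
print(losses[0], losses[-1])
print(predictions)
```

The loss shrinks as the loop repeats, which is the "iterate until the machine has a fair idea" step in miniature.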

Now back to our regularly scheduled programming. This is how neural networks process convolved data:

In a nutshell, this is how computers identify the objects they encounter in the world. In practice, however, this is not enough for many applications such as face recognition, automated industrial robots and self-driving cars. Traditional CNNs simply do not have the bandwidth to identify an object instantly and, worse still, cannot determine its location, which is the main focus of this article. The real-world applications listed above run on an even more powerful tool in the computer vision arsenal: object detection.

Object Detection

Let's return to our cat example. Given the limitations of traditional CNNs, it is also important to locate the object's position in the image. Self-driving cars detect obstacles en route this way and activate the necessary mechanisms to navigate past them. To be feasible, this requires a computation time that closely matches the reaction time of a human driver, unless you want to run over a cat crossing the car's path.

In such cases simple image classification is no longer going to cut it. We also require the precise location of the object in the image. The basic framework of the CNN remains the same, but with a twist. Many techniques for locating objects exist; a popular one, and the one covered here, is semantic segmentation. In this technique, the ground truth images on which predictions are based are segmented pixel-wise, and the neural net adjusts its weights not only to classify the object but also to determine its location.

The process starts with separating the image's foreground from the background. The original image is segmented pixel-wise based on our region of interest (ROI), in this case the cat's face. The location of the pixel-segmented cat face is also captured and stored as our ground truth. Any deviation of the neural network's predictions, that is, how far off they are, is calculated against this segmented ground truth and in turn compensated for to produce an accurate estimate of the object's location!
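As a rough illustration of comparing a prediction against a segmented ground truth, the sketch below uses two made-up binary masks. The helper names `iou` and `bounding_box` are my own; intersection over union is one common overlap measure, used here as an assumption rather than as the specific metric behind any particular model.

```python
import numpy as np

# Ground truth mask: 1 where a pixel belongs to the object, 0 for background.
truth = np.zeros((6, 6), dtype=int)
truth[1:4, 2:5] = 1  # a 3x3 region of interest (made up)

# A made-up prediction that is shifted one column to the right.
pred = np.zeros((6, 6), dtype=int)
pred[1:4, 3:6] = 1

def iou(a, b):
    """Intersection over union: overlap between two masks, from 0 to 1."""
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(intersection / union) if union else 1.0

def bounding_box(mask):
    """Smallest box (top, left, bottom, right) containing every object pixel."""
    rows, cols = np.nonzero(mask)
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())

pixel_error = float(np.mean(truth != pred))  # fraction of mislabelled pixels

print(iou(truth, pred))     # overlap score
print(pixel_error)          # how far off the prediction is, pixel by pixel
print(bounding_box(truth))  # object location recovered from the mask
```

A perfect prediction would give an overlap of 1 and a pixel error of 0; training pushes the network toward that.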

Convolution and pooling together act as a downsampling process, meaning the pixels are trimmed down to the bare essentials so that the network only looks for what is necessary. However, to make our predictions, this downsampled feature map now has to be upsampled to the original input size to recover the spatial detail that was lost. This is achieved through transpose convolution and upsampling (the inverse of max pooling). Again, these are mathematically intense processes and beyond the scope of this topic.
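Without going into the mathematics, the two upsampling ideas can be sketched in a few lines of NumPy. The function names and the 2x2 input are assumptions for illustration; in a real network the transpose convolution kernel is learned rather than fixed.

```python
import numpy as np

def upsample_nearest(feature_map, factor=2):
    """Enlarge a feature map by repeating every value factor x factor times."""
    return feature_map.repeat(factor, axis=0).repeat(factor, axis=1)

def conv_transpose2d(feature_map, kernel, stride=2):
    """Transpose convolution: stamp a scaled copy of the kernel per input value."""
    h, w = feature_map.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            top, left = i * stride, j * stride
            out[top:top + kh, left:left + kw] += feature_map[i, j] * kernel
    return out

small = np.array([[1.0, 2.0], [3.0, 4.0]])  # a downsampled 2x2 feature map

up = upsample_nearest(small)                        # fixed rule, 4x4 output
learned = conv_transpose2d(small, np.ones((2, 2)))  # kernel of ones as a stand-in

print(up)
print(learned)
```

With a kernel of all ones and a stride of 2, the transpose convolution reproduces plain nearest-neighbour upsampling; with learned kernel values it can fill in the enlarged map in a smarter way.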

Conclusion

Object detection goes a step beyond mere image classification and is a very intriguing technique that lets a computer navigate real-world obstacles. However, it has the following downsides:

A large amount of data is required for training.
Drawing bounding boxes prior to pixel-wise labelling, and storing their coordinates, is a tedious task.
Retraining is needed whenever new information comes in. For instance, a U-Net trained on cat images will not be able to identify an image of a dog. Class disparities are a major Achilles' heel. However, an ingenious technique called Siamese networks has emerged; it relies on one-shot learning, which removes the need for retraining on new data. I will cover this in my next blog on how face recognition works.

So that was a very broad distillation of object detection! Despite some of its shortcomings, it is a powerful tool in the hands of data scientists and can be moulded to suit an industrial need. While pre-trained models exist as open source code on GitHub and Kaggle to help offset the enormous training times required, it ultimately boils down to the experience of a machine learning engineer to build a model that suits the desired application.

References
