Object Detection in Computer Vision: An Intuitive Approach


Introduction

Ever wondered how computers perceive the world? Have you ever been intrigued by how self-driving cars navigate the raging chaos of traffic? Metropolitan city dwellers will surely know what I am talking about!

Let's delve into the exciting world of computer vision. Whereas humans see objects as a confluence of colors and shapes, computers see them as arrays of numbers.

Our perspective

Human Perspective

Computer Perspective


A grayscale image is essentially a two-dimensional array of cells called pixels. Each pixel takes a value from 0 to 255, where 0 is black, 255 is white, and every value in between is a shade of gray.

A color image is simply a stack of such arrays! In other words, it is a multidimensional array with one layer per color channel (red, green and blue, or RGB).
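To make the array picture concrete, here is a minimal sketch. It assumes NumPy, which the article does not name, and the tiny 2x3 image and its pixel values are made up purely for illustration.

```python
import numpy as np

# A hypothetical 2x3 grayscale image: one 8-bit value (0-255) per pixel.
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# A color image stacks three such arrays, one per channel (R, G, B).
red = np.full((2, 3), 255, dtype=np.uint8)
green = np.zeros((2, 3), dtype=np.uint8)
blue = np.zeros((2, 3), dtype=np.uint8)
color = np.stack([red, green, blue], axis=-1)

print(gray.shape)   # (2, 3): height x width
print(color.shape)  # (2, 3, 3): height x width x channels
print(color[0, 0])  # the top-left pixel: full red, no green, no blue
```

Putting the channels on the last axis (height, width, channels) is one common layout. Some libraries put the channel axis first instead, so it is worth checking which one your tool expects.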

Color Image distilled into array channels

Now that we know how images are structured, let's understand how computers decode these images and process them for identification. We've all encountered various forms of computer vision at work, from computers examining CAT and PET scans for malignant tumors to the ubiquitous face recognition software and, of course, the self-driving car. Suppose we were to feed an image of a cat to a computer tasked with classifying it as such. How would the computer go about it? How does technology like Google Lens accurately pinpoint what the user is looking for based on an image search?

Convolutional Neural Networks

Enter convolutional neural networks (CNNs)! The mathematics behind them is quite tedious, so I'll skip over that part. In essence, CNNs work by extracting distinct features of an image (much like using a microscope to examine organisms), running those values through a neural net, computing the deviation of its prediction from the ground truth, and repeating until the machine has a fair idea of what it is looking at. In our cat example, the algorithm will try to capture features such as face shape, ear structure, fur texture, eye features and other relevant details that will help in identifying future images of a cat. This is pretty much akin to how we learnt to identify animals in kindergarten. The image below will help make this clear.


The process starts with convolution. This is simply a mathematical operation in which a small filter slides across the image to pick out relevant features that will help classify it. It is computationally efficient because the same small filter is reused across the whole image, and it lets the network focus on the features that matter rather than on every background pixel. The output of a convolution is called a feature map (shown as red-bordered boxes). This is followed by max pooling, another operation that cuts down the dimensionality of a feature map to reduce the memory load on the system.
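For readers who want to see those two operations in action, here is a rough sketch in NumPy (my choice of library, not something the pipeline above depends on). The 6x6 image, the hand-picked edge filter and the function names `convolve2d` and `max_pool2d` are all hypothetical; a real CNN learns its filters during training instead of having them written by hand.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Hypothetical 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = convolve2d(image, kernel)  # shape (4, 4), strongest at the edge
pooled = max_pool2d(feature_map)         # shape (2, 2)
print(feature_map)
print(pooled)
```

The feature map is large only where the filter sits over the edge and zero over the flat regions. Pooling then halves each dimension while keeping the strongest responses.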

All these operations precede the final stage: training the deep neural network. By passing the final feature values through a series of weights and activation functions, an output is generated, and the deviation of that output from the ground truth is calculated. This deviation helps the network pinpoint the weights that are responsible for the most error and adjust them to get the closest matching prediction. This is done with two more mathematical techniques called backpropagation and gradient descent.
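The predict, measure, adjust loop can be sketched with a single artificial neuron. This is a deliberately tiny stand-in, not a full CNN: the four data points, the learning rate and the choice of a sigmoid activation with a cross-entropy loss are all assumptions made for illustration, and with only one layer "backpropagation" reduces to a single application of the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 4 samples with 3 features each
# (stand-ins for flattened feature-map values).
x = np.array([[0.0, 0.1, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.1],
              [1.0, 0.7, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])  # ground truth labels

w = rng.normal(scale=0.1, size=3)   # weights start out small and random
b = 0.0
learning_rate = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(pred, target, eps=1e-12):
    """Average deviation of the predictions from the ground truth."""
    return -np.mean(target * np.log(pred + eps)
                    + (1 - target) * np.log(1 - pred + eps))

losses = []
for step in range(200):
    pred = sigmoid(x @ w + b)            # forward pass
    losses.append(cross_entropy(pred, y))
    error = pred - y                     # gradient of the loss w.r.t. x @ w + b (times n)
    grad_w = x.T @ error / len(y)        # chain rule back to each weight
    grad_b = error.mean()
    w -= learning_rate * grad_w          # gradient descent update
    b -= learning_rate * grad_b

final_pred = sigmoid(x @ w + b)
print(losses[0], losses[-1])
print(final_pred.round(2))
```

The weights tied to the most error receive the largest gradients and therefore the largest corrections, which is the intuition described above.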


Now back to our regularly scheduled programming. This is how neural networks process convolved data:


In a nutshell, this is how computers identify the objects they encounter in the world. In practice, however, this alone is infeasible for many applications such as face recognition, automated industrial robots and self-driving cars. Traditional CNNs simply do not have the bandwidth to identify an object instantly and, worse still, they cannot determine its location, which is the main focus of this topic. The real-world applications I just listed run on an even more powerful tool in the arsenal of computer vision called object detection.

Object Detection

Let's return to our cat example. Given the limitations of traditional CNNs, we also need to locate where the object sits in the image. This is how self-driving cars detect obstacles en route and activate the necessary mechanisms to navigate past them. To be feasible, the computation has to be fast enough to match the reaction time of a human driver, unless you want to run over a cat crossing the car's path.


In such cases, simple image classification is no longer going to cut it. We also require the precise location of the object in the image. The basic framework of the CNN remains the same, but with a twist. Many techniques exist for locating objects; a popular one, and the one covered here, is semantic segmentation, which labels every pixel rather than just drawing a box around the object. In this technique, the ground truth images we base predictions on are segmented pixel-wise, and the neural net adjusts its weights not only to classify the object but also to determine its location.


The process starts with separating the image's foreground from the background. The original image is segmented pixel-wise based on our region of interest (ROI), in this case the cat's face. The location of the pixel-segmented cat face is also captured and stored as our ground truth. Any deviation of our neural network's predictions, that is, how far off they are, is calculated against this segmented ground truth and in turn corrected to make an accurate estimate of the object's location!
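Here is a small sketch of how such a deviation could be measured. The two 4x4 grids are invented, and the two measures used (pixel-wise cross-entropy and intersection over union, or IoU) are common choices that I am assuming for illustration; the article itself does not commit to a particular one.

```python
import numpy as np

# Hypothetical 4x4 ground truth mask: 1 marks ROI pixels, 0 marks background.
ground_truth = np.array([[0, 0, 0, 0],
                         [0, 1, 1, 0],
                         [0, 1, 1, 0],
                         [0, 0, 0, 0]])

# Hypothetical network output: per-pixel probability of belonging to the ROI.
predicted = np.array([[0.1, 0.1, 0.2, 0.1],
                      [0.1, 0.9, 0.8, 0.6],
                      [0.2, 0.7, 0.4, 0.1],
                      [0.1, 0.1, 0.1, 0.1]])

def pixelwise_cross_entropy(pred, target, eps=1e-7):
    """Average per-pixel deviation; this is the kind of value training minimizes."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

def iou(pred, target, threshold=0.5):
    """Overlap between the thresholded prediction and the ground truth mask."""
    pred_mask = pred >= threshold
    target_mask = target.astype(bool)
    intersection = np.logical_and(pred_mask, target_mask).sum()
    union = np.logical_or(pred_mask, target_mask).sum()
    return float(intersection / union) if union else 1.0

loss = pixelwise_cross_entropy(predicted, ground_truth)
overlap = iou(predicted, ground_truth)
print(loss)
print(overlap)  # 3 pixels shared out of 5 covered in total, so 0.6
```

A perfect prediction would drive the loss towards zero and the IoU to 1. The stray pixel at the right edge and the missed pixel in the bottom right corner of the ROI are what pull this example down.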

Convolution followed by pooling is a downsampling process, meaning the image is trimmed down to the bare essentials so the network only looks for what is necessary. However, to make a prediction for every pixel, the downsampled feature map has to be upsampled back to the original input size to recover the spatial resolution lost along the way. This is achieved through transpose convolution and upsampling (roughly the inverse of max pooling). Again, these are mathematically intense processes and beyond the scope of this topic.
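Without going into the mathematics, the two upsampling ideas can be sketched in a few lines of NumPy. The 2x2 feature map, the all-ones kernel and the function names `upsample_nearest` and `transpose_conv2d` are hypothetical, chosen so the result is easy to check by eye; in a real network the transpose convolution kernel is learned.

```python
import numpy as np

def upsample_nearest(feature_map, factor=2):
    """Repeat every value factor times along both axes."""
    return feature_map.repeat(factor, axis=0).repeat(factor, axis=1)

def transpose_conv2d(feature_map, kernel, stride=2):
    """Stamp a scaled copy of the kernel into the output for every input value."""
    h, w = feature_map.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            top, left = i * stride, j * stride
            out[top:top + kh, left:left + kw] += feature_map[i, j] * kernel
    return out

# Hypothetical 2x2 downsampled feature map.
small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

upsampled = upsample_nearest(small)                    # shape (4, 4)
learned_up = transpose_conv2d(small, np.ones((2, 2)))  # shape (4, 4)

print(upsampled)
print(learned_up)
```

With an all-ones 2x2 kernel and a stride of 2, the transpose convolution reproduces plain nearest-neighbor upsampling exactly. Any other kernel gives a different way of filling in the enlarged map, which is why it can be trained.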

Conclusion

Object detection goes a step beyond mere image classification and is a very intriguing technique that lets a computer navigate real-world obstacles. However, it has the following downsides:

  1. A large amount of data is required for training.
  2. Annotation is tedious: bounding boxes have to be drawn and their coordinates stored before the pixel-wise segmentation.
  3. Retraining needs to be done whenever new information comes in. For instance, a U-Net trained on cat images will not be able to identify an image of a dog. Class disparities are a major Achilles' heel. However, an ingenious technique called Siamese networks has emerged; it relies on one-shot learning, which removes much of the need for retraining on new data. I will cover this topic in my next blog on how face recognition works.

So that was a very broad distillation of object detection! Despite some of its shortcomings, it is a powerful tool in the hands of data scientists that can be moulded to suit an industrial need. While pre-trained models exist as open-source code on GitHub and Kaggle to help offset the enormous training times required, it ultimately boils down to the experience of a machine learning engineer to build a model that suits the desired application.

