From Sight to Vision
Complex lifeforms on our planet share five senses: the powers of Sight, Sound, Smell, Taste and Touch. And of all these, "sight" is perhaps the most important and critical to our survival. By some estimates, around 30%-40% of our brain activity is dedicated to processing information related to sight, far more than for any other sense. Vision is one of the holy grails of our advancing AI capability, and a lot of work has gone into defining Vision as an AI-feasible goal. Compared to Sight, Sound and Touch, senses like Taste and Smell are still not well represented in AI, something that the makers of "The Matrix" realized long ago and presented so expertly.
This is probably because of the many real-world applications for automating vision. Before we go any further, I think it's important we define the difference between Sight and Vision. We often use these terms loosely in our language, and hence risk underestimating the complexity involved in converting "Sight" into "Vision". Sight happens in our eyes, whereas Vision is enabled in our brain after the information from our sight is processed. This gives rise to actions driven by insights derived from sight. Vision always involves the application of Intelligence (Human or Artificial) to sight. Sight is what a camera can see and capture. Vision is when the image is processed to identify faces, read text, predict the speed of a moving vehicle, and so on.
Understanding this basic definition will help us better align our expectations of this rapidly advancing technology, and also help us draft business goals that can benefit the most from the application of Computer Based AI Empowered Vision.
SIGHT + INTELLIGENCE = VISION
Now that we have successfully differentiated Sight from Vision, we can now "see" (pun intended) where AI fits into the whole scheme of things. In its broadest sense, Vision consists of 3 steps: Detection, Classification and Identification.
If you have ever watched a toddler play with Shapes and Blocks you would understand these steps better.
Detection:
This is the very first step involved in developing Vision. And because it's so important and complex in enabling Vision, we train on it the longest. Babies start learning this just a few months after they are born, by learning to focus their eyes on different objects. This is a purely subconscious activity and does not involve any intelligence. It is also very critical for our survival, helping us develop split-second reflex actions without needing to compute and analyze anything. When you have a fast-moving projectile headed towards your eye, it's more important that you dodge first to prevent it from injuring your eye, rather than determining its identity.
Similarly, a computer-driven Vision system also starts with Detection. In fact, artificial systems built for object detection have existed for a long time now. One common application of object detection can be found in cameras, where it enables Auto Focus. Although detection technology has improved substantially over the years, its basic construct remains the same and involves what is called Edge Detection. According to Wikipedia, "Edge detection includes a variety of mathematical methods that aim at identifying edges, defined as curves in a digital image at which the image brightness has discontinuities". There are different ways of doing this, but one of the most popular involves applying a Sobel operator to produce a greyscale image that emphasizes edges.
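To make the idea concrete, here is a minimal numpy sketch of Sobel edge detection. The toy image and the naive convolution loop are purely illustrative; real systems use optimized library routines:

```python
import numpy as np

# Sobel kernels: KX responds to horizontal brightness changes, KY to vertical
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
KY = KX.T

def convolve2d(image, kernel):
    """Naive 3x3 correlation with edge padding (for illustration only)."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def sobel_edges(image):
    """Gradient magnitude: bright wherever brightness changes sharply."""
    gx = convolve2d(image, KX)
    gy = convolve2d(image, KY)
    return np.hypot(gx, gy)

# Toy image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edges(img)
```

Running this, `edges` is large only along the boundary between the dark and bright halves, which is exactly the "discontinuity in image brightness" the Wikipedia definition describes.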
Classification:
After Detection comes Classification. This is where we classify what we just saw. A toddler takes her first steps towards vision by learning to classify what she sees: bright colors, shapes and faces. Learning to classify lets us differentiate a human face from, say, a triangle. Virtually all things seen by the eye (except that fast-moving projectile heading towards it) are passed through this classification step, and most of it happens subconsciously, without the need to think about it unless it's something new. This step needs some intelligence to map the visual characteristics that enable classification.
AI empowered Vision that masters classification is one where all the detected objects are grouped into labelled or unlabelled "Clusters". Clusters are nothing but similar-looking objects that are grouped together. Going back to our toddler playing with shapes, she learns to detect and classify a triangle even before learning that it's called a Triangle. As an example, let's say I am interested in counting the number of cars passing through a traffic junction. I would start by learning to classify objects in the images as cars and non-cars. The type or model of the car is not needed and not important.
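The unlabelled-cluster idea can be sketched with a minimal k-means loop. The 2-D "feature vectors" below are hypothetical stand-ins for whatever features a real system would extract from detected objects:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: group similar feature vectors into k clusters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers

# Hypothetical feature vectors for six detected objects, forming two
# tight groups (think "car-like" vs "non-car" shapes)
objects = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                    [5.0, 5.2], [5.1, 4.9], [4.9, 5.0]])
labels, centers = kmeans(objects, k=2)
```

No labels were supplied, yet the two groups separate cleanly, just as the toddler groups triangles together before knowing the word "Triangle".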
Some functions that are commonly used in AI empowered image classification tasks include sigmoid and softmax, which squash a model's raw outputs into probability scores.
While the underlying math for each of these functions is different and out of scope for this article, all we need to be aware of from a real-world application perspective is that the output of each of these functions is a probability score. This probability determines whether an object falls into a Cluster or not, via a set "Threshold level". Setting a high threshold makes the model more conservative in classification: it reduces false positives (i.e., incorrectly predicting the positive class), but may increase false negatives (i.e., missing true positives). On the other hand, setting a low threshold makes the model more likely to predict the positive class, but can also lead to more false positives. Often the threshold is defined by a business's tolerance for error.
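Here is a small sketch of that trade-off. The probability scores and ground-truth labels are made up for illustration; a real classifier would produce the scores:

```python
import numpy as np

# Hypothetical probability scores from a classifier for the class "car"
scores = np.array([0.95, 0.80, 0.62, 0.40, 0.15])
truth  = np.array([1,    1,    0,    1,    0])   # 1 = actually a car

def classify(scores, threshold):
    """Predict the positive class when the score clears the threshold."""
    return (scores >= threshold).astype(int)

for threshold in (0.5, 0.9):
    pred = classify(scores, threshold)
    false_pos = int(np.sum((pred == 1) & (truth == 0)))
    false_neg = int(np.sum((pred == 0) & (truth == 1)))
    print(f"threshold={threshold}: FP={false_pos}, FN={false_neg}")
```

With this toy data, raising the threshold from 0.5 to 0.9 eliminates the false positive (the non-car scored 0.62) but misses more real cars: exactly the conservative-vs-permissive trade-off a business has to price.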
Identification:
The next step after generic classification is Identification. A toddler identifying whether the face she is seeing belongs to her mother or to a stranger involves processing quite nuanced visual attributes. Not surprisingly, this is quite complex and energy intensive. For example, she sees her mother's face thousands of times before associating it with her mother's visual characteristics. She then draws on the skills learned from her classification exercise to decide whether the face is her mother's or not. And unlike classification, where objects can be classified subconsciously, identification needs conscious involvement.
Drawing inspiration from this knowledge, AI empowered Vision also trains models on Identification by presenting thousands of images of the target objects from different angles, in different lighting conditions, etc., so the algorithm learns their visual characteristics in all scenarios. And just like human identification, computer-based identification tasks are computationally intensive, as they require the model to distinguish between instances that may look very similar, e.g., differentiating between different faces. The AI models popular for decoding such visual information are called CNNs (Convolutional Neural Networks); a few popular CNN architectures include AlexNet, VGG, ResNet and Inception.
Just like the Threshold level for controlling the accuracy of classification, Identification involves controlling its efficiency through "hyperparameters". These are different knobs on a CNN model that affect its accuracy at identifying objects: parameters like the learning rate, dropout rate, number of epochs, early-stopping patience, etc. The impact of altering these parameters on your AI model is something I will cover in a future article. And similar to the error tolerance for classification, your business's tolerance for identification errors determines the values of these hyperparameters.
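A CNN is too big to sketch here, but three of those knobs (learning rate, epochs, patience) appear in even the tiniest training loop. The one-parameter model below is purely illustrative, fitting y = 3x by gradient descent:

```python
import numpy as np

def train(lr=0.1, epochs=100, patience=5):
    """Fit y = w*x by gradient descent, stopping early when loss stalls."""
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 50)
    y = 3.0 * x                               # the "true" relationship
    w, best, waited = 0.0, np.inf, 0
    for epoch in range(epochs):
        grad = np.mean(2 * (w * x - y) * x)   # d(MSE)/dw
        w -= lr * grad                        # learning rate sizes each step
        loss = np.mean((w * x - y) ** 2)
        if loss < best - 1e-9:                # meaningful improvement?
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:            # early stopping kicks in
                break
    return w, epoch + 1

w, epochs_used = train()
```

Turning the same knobs on a real CNN has the same flavor: too high a learning rate and training diverges, too few epochs and it underfits, and patience decides how long you tolerate no improvement before giving up.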
So finally, with the Identification of an object, a Visual is born! We are seeing its application in many domains, from unlocking our phones to detecting cancer in CT images. While it may sound like we have mastered Artificial Vision, there is one area yet to be mastered: making vision-based predictions that enable actions. Simply put, it's replicating our ability to predict whether a cup positioned precariously at the edge of a table will fall onto the floor or not. And that will be the next big goldmine for us Visuals!
#AI #artificial #intelligence #Vision #Sight #ComputerisedVision #CNN #Objects #identification #Classification #mindandmachine #image
PS: If you found this topic interesting, then I would like to point you to a free session on "Visual Intelligence and Object Detection Using AI" that I conducted on 19th Oct 2024. You can also join my new AI driven initiative called Learning Twice to be notified of upcoming sessions hosted by AI enthusiasts like me.