From Sight to Vision
Complex lifeforms on our planet share five senses: the powers of Sight, Sound, Smell, Taste and Touch. And of all these, "sight" is perhaps the most important and critical to our survival. By some estimates, around 30%-40% of our brain activity is dedicated to processing information related to sight, far more than for any other sense. Vision is one of the holy grails of our advancing AI capability, and a lot of work has gone into defining Vision as an AI-feasible goal. Compared to Sight, Sound and Touch, senses like Taste and Smell are still not well represented in AI, something that the makers of "The Matrix" realized long ago and presented so expertly.
This is probably because of the many real-world applications for automating vision. Before we go any further, I think it's important we define the difference between Sight and Vision. We often use these terms loosely in our language, and hence risk underestimating the complexity involved in converting "Sight" into "Vision". Sight happens in our eyes, whereas Vision is enabled in our brain after the information from our sight is processed. This gives rise to actions driven by insights derived from sight. Vision always involves the application of Intelligence (Human or Artificial) to sight. Sight is what a camera can see and capture. Vision is when the image is processed to identify faces, read text, predict the speed of a moving vehicle, and so on.
Understanding this basic definition will help us better align our expectations of this rapidly advancing technology, and also help us draft business goals that can benefit the most from the application of Computer Based AI Empowered Vision.
SIGHT + INTELLIGENCE = VISION
Now that we have successfully differentiated Sight from Vision, we can now "see" (pun intended) where AI fits into the whole scheme of things. In its broadest sense, Vision consists of 3 steps: Detection, Classification and Identification.
If you have ever watched a toddler play with Shapes and Blocks you would understand these steps better.
Detection:
This is the very first step involved in developing Vision. And because it's so important and complex in enabling Vision, we train on it the longest. Babies start learning this just a few months after they are born, by learning to focus their eyes on different objects. This is a purely subconscious activity and does not involve any intelligence. It is also very critical for our survival, helping us develop split-second reflex actions without needing to compute and analyze anything. When you have a fast-moving projectile headed towards your eye, it's more important that you dodge first to prevent it from injuring your eye, rather than determining its identity.
Similarly, a computer-driven Vision system also starts with Detection. In fact, artificial systems built for object detection have existed for a long time now. One common application of object detection can be found in cameras, where it enables Auto Focus. Although detection technology has improved substantially over the years, its basic construct remains the same and involves what is called Edge Detection. According to Wikipedia, "Edge detection includes a variety of mathematical methods that aim at identifying edges, defined as curves in a digital image at which the image brightness has discontinuities". There are different ways of doing this, but one of the most popular involves applying a Sobel operator to produce a greyscale image that emphasizes edges.
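To make the idea concrete, here is a minimal numpy sketch of Sobel edge detection. The toy image and the naive convolution loop are purely illustrative; real systems use optimized library routines:

```python
import numpy as np

# Sobel kernels: KX responds to horizontal brightness changes, KY to vertical
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
KY = KX.T

def convolve2d(image, kernel):
    """Naive 3x3 correlation with edge padding (for illustration only)."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def sobel_edges(image):
    """Gradient magnitude: bright wherever brightness changes sharply."""
    gx = convolve2d(image, KX)
    gy = convolve2d(image, KY)
    return np.hypot(gx, gy)

# Toy image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edges(img)
```

Running this, `edges` is large only along the boundary between the dark and bright halves, which is exactly the "discontinuity in image brightness" the Wikipedia definition describes.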
Classification:
After Detection comes Classification. This is where we classify what we just saw. A toddler takes her first steps towards vision by learning to classify what she sees: bright colors, shapes and faces. Learning to classify lets us differentiate a human face from, say, a triangle. Virtually all things seen by the eye (except that fast-moving projectile heading towards it) are passed through this classification step, and most of it happens subconsciously, without the need to think about it unless it's something new. This step needs some intelligence to map the visual characteristics that enable classification.
AI empowered Vision that masters classification is one where all the detected objects are grouped into labelled or unlabelled "Clusters". Clusters are nothing but similar-looking objects that are grouped together. Going back to our toddler playing with shapes, she learns to detect and classify a triangle even before learning that it's called a Triangle. As an example, let's say I am interested in counting the number of cars passing through a traffic junction. I would start by learning to classify objects in the images as cars and non-cars. The type or model of the car is not needed and not important.
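The unlabelled-cluster idea can be sketched with a minimal k-means loop. The 2-D "feature vectors" below are hypothetical stand-ins for whatever features a real system would extract from detected objects:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: group similar feature vectors into k clusters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers

# Hypothetical feature vectors for six detected objects, forming two
# tight groups (think "car-like" vs "non-car" shapes)
objects = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                    [5.0, 5.2], [5.1, 4.9], [4.9, 5.0]])
labels, centers = kmeans(objects, k=2)
```

No labels were supplied, yet the two groups separate cleanly, just as the toddler groups triangles together before knowing the word "Triangle".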
Some functions that are commonly used in AI empowered image classification tasks include sigmoid and softmax, which squash a model's raw outputs into probability scores.
While the underlying math for each of these functions is different and out of scope for this article, all we need to be aware of from a real-world application perspective is that the output of each of these functions is a probability score. This probability determines whether an object falls into a Cluster or not, via a set "Threshold level". Setting a high threshold makes the model more conservative in classification: it reduces false positives (i.e., incorrectly predicting the positive class), but may increase false negatives (i.e., missing true positives). On the other hand, setting a low threshold makes the model more likely to predict the positive class, but can also lead to more false positives. Often the threshold is defined by a business's tolerance for error.
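Here is a small sketch of that trade-off. The probability scores and ground-truth labels are made up for illustration; a real classifier would produce the scores:

```python
import numpy as np

# Hypothetical probability scores from a classifier for the class "car"
scores = np.array([0.95, 0.80, 0.62, 0.40, 0.15])
truth  = np.array([1,    1,    0,    1,    0])   # 1 = actually a car

def classify(scores, threshold):
    """Predict the positive class when the score clears the threshold."""
    return (scores >= threshold).astype(int)

for threshold in (0.5, 0.9):
    pred = classify(scores, threshold)
    false_pos = int(np.sum((pred == 1) & (truth == 0)))
    false_neg = int(np.sum((pred == 0) & (truth == 1)))
    print(f"threshold={threshold}: FP={false_pos}, FN={false_neg}")
```

With this toy data, raising the threshold from 0.5 to 0.9 eliminates the false positive (the non-car scored 0.62) but misses more real cars: exactly the conservative-vs-permissive trade-off a business has to price.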
Identification:
The next step after generic classification is Identification. A toddler identifying whether the face she is seeing belongs to her mother or to a stranger involves processing quite nuanced visual attributes. Not surprisingly, this is quite complex and energy intensive. For example, she sees her mother's face thousands of times before associating it with her mother's visual characteristics. She then draws on the skills learned from her classification exercise to decide whether the face is her mother's or not. And unlike classification, where objects can be classified subconsciously, identification needs conscious involvement.
Drawing inspiration from this knowledge, AI empowered Vision also trains models on Identification by presenting thousands of images of the target objects from different angles, in different lighting conditions, etc., so the algorithm learns their visual characteristics in all scenarios. And just like human identification, computer-based identification tasks are computationally intensive, as they require the model to distinguish between instances that may look very similar, e.g., differentiating between different faces. The AI models popular for decoding such visual information are called CNNs (Convolutional Neural Networks); a few popular CNN architectures include AlexNet, VGG, ResNet and Inception.
Just like the Threshold level for controlling the accuracy of classification, Identification involves controlling its efficiency through "hyperparameters". These are different knobs on a CNN model that affect its accuracy at identifying objects: parameters like the learning rate, dropout rate, number of epochs, early-stopping patience, etc. The impact of altering these parameters on your AI model is something I will cover in a future article. And similar to the error tolerance for classification, your business's tolerance for identification errors determines the values of these hyperparameters.
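A CNN is too big to sketch here, but three of those knobs (learning rate, epochs, patience) appear in even the tiniest training loop. The one-parameter model below is purely illustrative, fitting y = 3x by gradient descent:

```python
import numpy as np

def train(lr=0.1, epochs=100, patience=5):
    """Fit y = w*x by gradient descent, stopping early when loss stalls."""
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 50)
    y = 3.0 * x                               # the "true" relationship
    w, best, waited = 0.0, np.inf, 0
    for epoch in range(epochs):
        grad = np.mean(2 * (w * x - y) * x)   # d(MSE)/dw
        w -= lr * grad                        # learning rate sizes each step
        loss = np.mean((w * x - y) ** 2)
        if loss < best - 1e-9:                # meaningful improvement?
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:            # early stopping kicks in
                break
    return w, epoch + 1

w, epochs_used = train()
```

Turning the same knobs on a real CNN has the same flavor: too high a learning rate and training diverges, too few epochs and it underfits, and patience decides how long you tolerate no improvement before giving up.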
So finally, with the Identification of an object, a Visual is born! We are seeing its application in many domains, from unlocking our phones to detecting cancer in CT images. While it may sound like we have mastered Artificial Vision, there is one area yet to be mastered: making vision-based predictions that enable actions. Simply put, it's replicating our ability to predict whether a cup positioned precariously at the edge of a table will fall onto the floor or not. And that will be the next big goldmine for us Visuals!
#AI #artificial #intelligence #Vision #Sight #ComputerisedVision #CNN #Objects #identification #Classification #mindandmachine #image
PS: If you found this topic interesting, then I would like to point you to a free session on "Visual Intelligence and Object Detection Using AI" that I conducted on 19th Oct 2024. You can also join my new AI driven initiative called Learning Twice to be notified of upcoming sessions hosted by AI enthusiasts like me.