Overview of Computer Vision
Background
This article is for people who wonder what Computer Vision is all about and why there is so much hype around it. If you are already an expert who practices computer vision and related technologies on a day-to-day basis, you may not find it useful or interesting.
1. Introduction
Before we jump into the details of the technologies that power Computer Vision, let’s try to understand what it tries to achieve.
Computer Vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images and videos. It seeks to understand and automate tasks that the human visual system can do. This is the Wikipedia definition of Computer Vision.
To put it in simpler terms, Computer Vision is an attempt to let computers mimic the way we humans view and analyze images and videos.
Computers typically help us automate a variety of tasks, many of them highly complicated.
But what do you think about this problem of enabling computers to understand digital images and videos?
How complex do you think it can become?
To understand the background and complexity of this problem, let’s look at a small example. Imagine that the two identity cards in the picture above are shown to some people. Let’s say there are two different ways of uniquely identifying someone from a card.
- One way is to remember the long 32-digit numerical identifier
- The other way is to remember the face captured in the picture
As humans, which do we prefer? It is far easier to remember a face than a long numerical identifier, right?
Now think about computers. Let’s say you are developing a small application that compares the details on two cards and checks whether they belong to the same person or not.
As we know, for a computer to process any piece of information, it has to be represented digitally. Now imagine what it takes to represent and store the long numerical identifier versus the picture. Pictures are stored as RGB or grayscale values for every pixel. It is far easier for a computer to store and process the numerical value than even a small picture.
Even if you manage to write code that compares the values of every single pixel, a small variation at one single pixel can break the comparison and the identification. A task that we humans consider simple, identifying someone by face, becomes a really hard one for computers to solve.
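To make the pixel-comparison problem concrete, here is a minimal sketch in plain Python. The two tiny 3x3 grayscale "images" are made-up values for illustration, and the helper names `exact_match` and `mean_abs_diff` are hypothetical, not taken from any library.

```python
# Two tiny 3x3 grayscale "images" (values 0-255). Hypothetical data.
image_a = [
    [12, 200, 34],
    [90, 155, 60],
    [17, 18, 240],
]

# The same picture, except one pixel is brighter by a single intensity level.
image_b = [row[:] for row in image_a]
image_b[1][1] += 1


def exact_match(a, b):
    """Naive check: every pixel must be identical."""
    return a == b


def mean_abs_diff(a, b):
    """Average absolute difference per pixel; small values mean 'nearly equal'."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


print(exact_match(image_a, image_b))    # False, although the images look the same
print(mean_abs_diff(image_a, image_b))  # about 0.11, a tiny difference
```

Even a tolerance-based score like this only handles small noise. It would still fail for a different photo of the same person, which is why learned features are needed.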
Hopefully this gives you a sense of the complexity involved in making computers mimic human behavior here.
2.Applications
Again, before we jump into the details of the technology, it is good to understand some of the applications of Computer Vision.
As we have seen, enabling computers to mimic human behavior by understanding pictures and videos is a hard problem to solve. There have been several developments in the fields of computer science and neuroscience aimed at letting computers mimic the human brain.
But over time, instead of going after this broad objective, researchers have focused on specific tasks and started achieving greater success.
The picture above shows some of the specific tasks that researchers have attempted to solve, with some success already. For example, a problem as simple as letting a computer program tell a cat from a dog can require training a deep learning algorithm with a very large number of labeled images (i.e. labeled as cat or dog). Projects such as ImageNet focus on building a repository of images with predefined labels for this purpose, so that others can train their deep learning algorithms on these datasets.
For a moment, ignore the underlying deep learning techniques that make these tasks possible. Let’s try to understand how these building blocks help us solve various real-world business problems.
Object detection and localization, for example, is used in several domains:
- Retail, to detect which items are placed in the aisles of a supermarket and which are in someone’s shopping cart, as Amazon Go does.
- Autonomous driving, to detect the objects, humans, obstacles, etc. on the road ahead of a self-driving vehicle.
- Industrial automation, where robots detect the objects they encounter in the factory.
- Medical and life sciences, to detect complex patterns in the scans of cancer patients, etc.
Similarly, all the other techniques listed in the picture above find applications in several real-world business setups. For example, video activity detection, intrusion detection, etc. help with video surveillance through CCTV footage.
3.Technology
Now that we have some background on computer vision, let’s review some of the technologies that are used behind the scenes.
3.1.Neural networks
The development of neural networks was an early major breakthrough in the attempt to mimic the human brain. They let us put together a number of simple neurons in a network to solve some of these problems.
The picture above shows the basic building block of neural networks:
- Typically used with structured, low-dimensional data
- Every node in a layer is connected to every node in the subsequent layer
- Raw data (features) is fed into the input layer
- The output layer can produce a binary, multi-class or continuous value, depending on the prediction problem
- The compute power required grows rapidly with the input dimension, because every additional input adds a weight to each node in the next layer
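The fully connected structure described in the list above can be sketched in a few lines of plain Python. This is only an illustration: the weights are arbitrary hand-picked numbers, not trained values, and `dense_layer` and `dense_param_count` are hypothetical helper names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """One fully connected layer: every input feeds every neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def dense_param_count(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Three raw features -> hidden layer with two neurons -> one output neuron.
features = [0.5, -1.0, 2.0]
hidden = dense_layer(features, [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]], [0.0, 0.1])
output = dense_layer(hidden, [[1.0, -1.0]], [0.0])
print(output)  # a single value between 0 and 1 (binary prediction)

# The parameter count grows with the input dimension.
print(dense_param_count([3, 2, 1]))      # 11
print(dense_param_count([1000, 16, 1]))  # 16033
```

With a sigmoid on the output neuron this is a binary classifier. A multi-class or continuous output would use a different final activation.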
For more details on neural networks you can refer to one of my previous articles or many other resources available online.
3.2.Convolutional neural networks
A convolutional neural network (CNN) is one of the network types used in deep learning.
Many of you might assume that deep learning is just about training neural networks with a large number of layers. While it is true that a large number of layers is typically used to solve complex problems, deep learning is about more than the number of layers.
Techniques like the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) are what really let deep learning solve complex problems: CNNs for computer vision problems with images, as in our case, and RNNs for problems where the order of the data matters, such as NLP or video. For video, you can think of an RNN being used in conjunction with a CNN.
In fact, going deeper with more layers is not just a question of available computational power. We also run into challenges like vanishing gradients while training models with more layers.
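The vanishing gradient problem can be illustrated with a deliberately simplified calculation. The sketch below assumes a chain of single-neuron sigmoid layers, all with weight 1.0 and pre-activation 0.0 (where the sigmoid derivative is at its maximum of 0.25). Real networks are more complicated, and `backprop_factor` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_factor(num_layers, pre_activation=0.0, weight=1.0):
    """Factor by which a gradient is scaled after passing back through
    num_layers single-neuron sigmoid layers (simplified chain rule)."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight * sigmoid_grad(pre_activation)
    return factor


for depth in (2, 10, 30):
    print(depth, backprop_factor(depth))
```

Each layer multiplies the gradient by at most 0.25 in this setup, so after 30 layers almost nothing is left for the earliest layers to learn from.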
Now let’s try to understand what a CNN is and how it helps us solve problems like object detection in images. Representing and storing images typically requires a lot of storage. Even a small image has a huge number of pixels.
Imagine if we had to use a fully connected neural network like the one shown above, with each pixel feeding into a neuron in the network. We would end up with millions of connections and a huge network. Moreover, learning from every single pixel separately does not help us. We would end up overfitting the model to the dataset. Rather, the network should be able to detect small features of an image, like edges, curves, etc., and learn from those.
Hence we use a mathematical operation called convolution, which can be applied as a matrix computation on the image representation. It helps the network learn features like edges, lines, etc.
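As a rough sketch of the convolution idea, the plain-Python example below slides a 3x3 kernel over a small grayscale image. The image and the vertical-edge kernel are hand-picked for illustration. In a real CNN the kernel values are learned during training, and `convolve2d` here is a hypothetical helper, not a library function.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the
    element-wise products at each position. CNN layers compute this form,
    which is strictly speaking a cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 4x6 image: dark left half (0), bright right half (10).
image = [[0, 0, 0, 10, 10, 10] for _ in range(4)]

# Kernel that responds to a dark-to-bright transition from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # [0, 30, 30, 0] for each of the two rows
```

The feature map is non-zero only where the edge is, and the same nine kernel weights are reused at every position. That reuse is why a convolutional layer needs far fewer parameters than a fully connected one.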
4.CNN network architectures
There are several libraries, like TensorFlow, OpenCV, etc., which provide reusable functions for data processing, setting up the network, training the model, etc. Choosing the right network architecture is one of the biggest challenges in training CNN models. There are several well-known and widely used architectures, like VGG, LeNet, AlexNet, ResNet, Inception, etc., which have proven to work well on several challenges and problems. Hence it is worthwhile to try reusing one of these architectures instead of devising a new one when we have to train a CNN model.
Another set of challenges with CNNs comes from the need for a large volume of labeled data and an abundance of compute power to train them. The key point to emphasize here is the labeled dataset, as it takes a lot of effort to generate labeled data for computer vision problems.
Hence it is always advisable to understand the business problem at hand and check whether there is a real need to train a new model or whether a pre-trained model can be reused.
Another really useful technique here is transfer learning. For example, a model pre-trained on a large volume of data from ImageNet can be reused, with a transfer learning approach, in a completely different setup with a different dataset.
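The mechanics of transfer learning can be sketched in plain Python: keep the "pre-trained" part frozen and train only a small new output layer on the new dataset. Everything here is a toy assumption. `FROZEN_WEIGHTS` is a hand-picked stand-in for real pre-trained convolutional layers, the six data points are made up, and the helper names are hypothetical.

```python
import math

# Stand-in for a pre-trained feature extractor. In practice this would be the
# layers of a network trained on a large dataset such as ImageNet. These
# hand-picked weights are never updated below.
FROZEN_WEIGHTS = [[1.0, -1.0], [1.0, 1.0]]


def extract_features(x):
    return [sum(w * v for w, v in zip(row, x)) for row in FROZEN_WEIGHTS]


def predict(head_w, head_b, x):
    z = sum(w * f for w, f in zip(head_w, extract_features(x))) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(head_w, head_b, data):
    """Average binary cross-entropy of the new output layer."""
    total = 0.0
    for x, y in data:
        p = predict(head_w, head_b, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train_head(data, steps=200, lr=0.1):
    """Gradient descent on the new output layer only."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            error = predict(head_w, head_b, x) - y
            feats = extract_features(x)
            grad_w = [g + error * f for g, f in zip(grad_w, feats)]
            grad_b += error
        head_w = [w - lr * g / len(data) for w, g in zip(head_w, grad_w)]
        head_b -= lr * grad_b / len(data)
    return head_w, head_b


# Small labeled dataset for the new task (made-up values).
new_task_data = [([2, 1], 1), ([3, 0], 1), ([1, 0], 1),
                 ([0, 1], 0), ([1, 3], 0), ([0, 2], 0)]

loss_before = mean_loss([0.0, 0.0], 0.0, new_task_data)
head_w, head_b = train_head(new_task_data)
loss_after = mean_loss(head_w, head_b, new_task_data)
print(loss_before, loss_after)
```

Because only the three parameters of the new output layer are trained, far less labeled data and compute are needed than for training the whole network from scratch.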
Conclusion
With this article I have just scratched the surface and provided a high-level overview. The idea is to introduce you to Computer Vision technology, so that you can recognize where it applies when you have a real-world problem to solve.
I would strongly advise reviewing the lectures from deeplearning.ai to learn more about the topic.