
Overview of Computer Vision

Vivek Murugesan

Principal Software Engineering Manager | Data Leader at Microsoft

Published: December 31, 2020

Background

This article is for people who wonder what Computer Vision is all about and why there is so much hype around it. If you are already an expert who practices computer vision and related technologies on a day-to-day basis, you may not find it useful or interesting.

1. Introduction

Before we jump into the details of the technologies that power Computer Vision, let’s try to understand what it tries to achieve.

Computer Vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images and videos. It seeks to understand and automate tasks that the human visual system can do. This is the Wikipedia definition of Computer Vision.

To put it in simpler terms, Computer Vision is an attempt to let computers mimic the way humans view and analyze images and videos.

Computers typically help us automate a variety of tasks, many of which are highly complicated.

But what do you think about this problem of enabling computers to understand digital images and videos?

How complex do you think it can become?

To understand the background and complexity of this problem, let’s look at a small example. Imagine that the two identity cards shown in the picture above are shown to some people. Let’s say there are two different ways of uniquely identifying someone from a card.

One way is to remember the long 32-digit numerical identifier
The other way is to remember the face captured in the picture

As humans, which do we prefer? It is far easier to remember a face than a long numerical identifier, right?

Now think about computers. Let’s say you are developing a small application that compares the details from two different cards and checks whether they refer to the same person or not.

As we know, for a computer to process any piece of information, it has to be represented digitally. Now imagine what it takes to represent and store this long numerical identifier and the picture. Pictures are stored as RGB or grayscale values for every pixel. It is therefore far easier for a computer to store and process the numerical value than even a small picture.
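To make the storage gap concrete, here is a rough back-of-the-envelope sketch. The 100 x 100 photo size and the one-byte-per-digit encoding are assumptions chosen for illustration, not details of any real identity card.

```python
# Rough storage comparison: a 32-digit identifier versus a small photo.
# The photo size below is an assumed value, picked only for illustration.

ID_DIGITS = 32
id_bytes = ID_DIGITS  # one byte per digit when stored as ASCII text

WIDTH, HEIGHT = 100, 100
grayscale_bytes = WIDTH * HEIGHT * 1  # one 8-bit intensity per pixel
rgb_bytes = WIDTH * HEIGHT * 3        # one byte each for red, green and blue

print(f"identifier : {id_bytes} bytes")
print(f"grayscale  : {grayscale_bytes} bytes")
print(f"RGB        : {rgb_bytes} bytes")
print(f"The RGB photo is roughly {rgb_bytes // id_bytes}x larger than the identifier")
```

Even this tiny uncompressed photo needs hundreds of times more storage than the identifier, before any processing has been done on it.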

Even if you manage to write a piece of code that compares the values of every single pixel, a small variation at just one pixel can break the comparison and the identification. The simple task of identifying someone by face, which we humans take for granted, becomes a really hard one for computers to solve.
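The brittleness of a pixel-by-pixel comparison can be sketched in a few lines. The 4 x 4 image and its values are made up, and the averaged difference is shown only as a contrast to exact equality; it is not how real face recognition works.

```python
def images_identical(a, b):
    """Naive check: two images match only if every pixel value is equal."""
    return a == b

def mean_abs_difference(a, b):
    """Average per-pixel difference; a small value means 'nearly the same'."""
    total = sum(
        abs(pa - pb)
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return total / (len(a) * len(a[0]))

# A tiny 4 x 4 grayscale "photo" (values 0-255), invented for this sketch.
photo = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]

# The same photo with a single pixel brightened by 1, as sensor noise might do.
rescanned = [row[:] for row in photo]
rescanned[2][1] += 1

print(images_identical(photo, photo))         # True
print(images_identical(photo, rescanned))     # False
print(mean_abs_difference(photo, rescanned))  # 0.0625
```

One changed pixel out of sixteen is enough to make the exact comparison fail, even though the two images are visually indistinguishable.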

Hopefully this gives you a sense of the complexity involved in making computers mimic human behavior here.

2. Applications

Again, before we jump into the details of the technology, it is good to understand some of the applications of Computer Vision.

As we have seen, enabling computers to mimic human behavior by understanding pictures and videos is a hard problem to solve. There have been several developments in the fields of computer science and neuroscience aimed at letting computers mimic the human brain.

But over time, instead of going after this broad objective, researchers have focused on specific tasks and started achieving greater success.

The picture above shows some of the tasks that researchers have attempted to solve and where some success has already been achieved. For example, even a problem as simple as letting a computer program tell a cat from a dog requires training a deep learning algorithm with a large number of labeled images (i.e. each labeled as cat or dog). Projects such as ImageNet focus on building a repository of images with predefined labels for this purpose, helping people train their deep learning algorithms on these datasets.

For a moment, ignore the underlying deep learning techniques that make these possible. Let’s try to understand how these building blocks help us solve various real-world business applications.

Object detection and localization, for example, is used in several domains:

Retail, to detect which items are placed in the aisles of a supermarket and which are in a shopper’s cart, as Amazon Go does.
Autonomous driving, to detect the objects, humans, obstacles, etc. on the road ahead of a self-driving vehicle.
Industrial automation, for robots to detect the objects they encounter in a factory.
Medical and life sciences, to detect complex patterns in the scans of cancer patients, etc.

Similarly, all the other techniques listed in the picture above find applications in several real-world business setups. For example, video activity detection, intrusion detection, etc. help in video surveillance through CCTV footage.

3. Technology

Now that we have some background on computer vision, let’s review some of the technologies that are used behind the scenes.

3.1. Neural networks

The development of neural networks was the first major breakthrough in the attempt to mimic the human brain. They let us put together a bunch of simple neurons in a network to solve some of these problems.

The picture above shows the basic building block of neural networks:

Typically used with structured, low-dimensional data
Every node in a layer is connected to every node in the subsequent layer
Raw data (features) is fed into the input layer
The output layer can produce a binary, multi-class or continuous value depending on the prediction problem
The number of weights, and with it the compute power required, increases drastically as the input dimension grows
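The properties listed above can be sketched as a tiny fully connected network. The layer sizes, the sigmoid activation and the random weights are all assumptions for illustration; a real network would learn its weights through training.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases):
    """Fully connected layer: every input feeds every output neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Untrained layer with random weights, used here only as a placeholder."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
features = [0.5, -1.2, 3.0]       # raw features fed into the input layer
w1, b1 = random_layer(3, 4, rng)  # hidden layer with 4 neurons
w2, b2 = random_layer(4, 1, rng)  # single output for a binary prediction

hidden = dense_layer(features, w1, b1)
output = dense_layer(hidden, w2, b2)

n_weights = 3 * 4 + 4 * 1  # one weight per connection between layers
print(output, n_weights)
```

Because every node connects to every node in the next layer, the weight count is the product of adjacent layer sizes, which is why larger inputs quickly become expensive.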

For more details on neural networks you can refer to one of my previous articles or many other resources available online.

3.2. Convolutional Neural networks

Convolutional neural networks (CNNs) are one of the architectures used in deep learning.

Many of you might assume that deep learning is just about training neural networks with a large number of layers. While it is true to some extent that a large number of layers is typically used to solve complex problems, deep learning is about more than just the number of layers.

Techniques like the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) are what really let deep learning solve complex problems: CNNs for computer vision problems involving images, as in our case, and RNNs for any problem where the order of the data matters, such as NLP, videos, etc. In the case of video, you can think of an RNN being used in conjunction with a CNN.

In fact, going deeper with more layers is not just a matter of available computational power. We also encounter challenges like vanishing gradients, where the error signal used to update the early layers shrinks toward zero, while training models with more layers.
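A simplified way to see the vanishing gradient effect: during backpropagation the gradient is multiplied by each layer’s activation derivative, and for a sigmoid that derivative is at most 0.25. The sketch below assumes a chain of single sigmoid neurons with weight 1.0 and pre-activation 0.0, which is a deliberate simplification rather than a realistic network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_scale(depth, pre_activation=0.0, weight=1.0):
    """Factor applied to a gradient after passing back through `depth`
    sigmoid layers, in a simplified chain of one neuron per layer."""
    return (weight * sigmoid_derivative(pre_activation)) ** depth

for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers -> gradient scaled by {gradient_scale(depth):.3e}")
```

Under these assumptions the signal reaching the first layer of a 20-layer chain is many orders of magnitude smaller than at the output, so those early layers barely learn.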

Now let’s try to understand what a CNN is and how it helps us solve problems like object detection with images. Representing and storing images typically requires a lot of storage. Even an image of small dimensions has a huge number of pixels.

Imagine we had to use a fully connected neural network like the one shown above, with each pixel feeding into a neuron in the input layer. We would end up with millions of neurons and a huge number of connections in the network. Moreover, learning from every single pixel separately does not help; we end up overfitting the model to the dataset. Rather, the network should be able to detect small features of the images like edges, curves, etc. and learn from those.
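To get a feel for the size of such a network, the sketch below counts the weights in just the first fully connected layer. The image sizes and the 1000 hidden units are assumed values for illustration, and biases are ignored.

```python
def dense_weight_count(height, width, channels, hidden_units):
    """Weights in one fully connected layer that sees every pixel value."""
    return height * width * channels * hidden_units

for side in (28, 224, 1024):
    n = dense_weight_count(side, side, 3, 1000)
    print(f"{side} x {side} RGB image -> {n:,} weights in the first layer")
```

For a 224 x 224 RGB image this already gives about 150 million weights in a single layer, each of which must be learned from data.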

Hence we use a mathematical operation called convolution, which can be applied as a matrix computation on the image representation. It helps the network learn features like edges, lines, etc.
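As a minimal sketch of the idea, the code below slides a 3 x 3 kernel over a small grayscale image. The image and the hand-written vertical-edge kernel are assumptions for illustration; in a CNN the kernel values are learned during training. Like most deep learning libraries, this computes the sliding-window product without flipping the kernel (strictly speaking, a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the
    element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# 5 x 6 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

# Hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0, 27, 27, 0]
```

The output is zero over the flat regions and large where the brightness changes, so the kernel acts as an edge detector. It also has only nine weights no matter how large the image is.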

4. CNN network architectures

There are several libraries like TensorFlow, OpenCV, etc. which provide reusable functions for data processing, setting up the network, training the model, etc. Choosing the right network architecture is really the biggest challenge in training CNN models. There are several famous and widely used architectures like VGG, LeNet, AlexNet, ResNet, Inception, etc., which have proven to work well on several challenges and problems. Hence, for problems where we have to train a CNN model, it is worthwhile to try reusing one of these architectures instead of devising a new one.

Another set of challenges with CNNs arises from the need for a large volume of labeled data and an abundance of compute power to train them. The key point to emphasize here is the need for a labeled dataset, as it takes a lot of effort to generate one for computer vision problems.

Hence it is always advisable to understand the business problem at hand and check whether there is a real need to train a new model or whether a pre-trained model can be reused.

Another really useful technique here is transfer learning. For example, a pre-trained model that was trained on a large volume of data from ImageNet can be reused, through a transfer learning approach, in a completely different setup with a different dataset.
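The core idea of transfer learning can be sketched without any deep learning library: keep the pre-trained layers frozen and train only a small new output layer on the new dataset. Everything below is a toy stand-in. The frozen weights, the dataset and the function names are hypothetical, and a real setup would reuse a CNN trained on a large dataset instead of a hand-written 2 x 2 matrix.

```python
import math

# Stand-in for a pre-trained network. These weights are made up for this
# sketch and are never updated below, which is what "frozen" means.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Frozen layer: ReLU(W x), reused as-is on the new task."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=500, lr=0.5):
    """Train only a new logistic-regression output layer on frozen features."""
    feats = [extract_features(x) for x in samples]
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    f = extract_features(x)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.5

# Small labeled dataset for the new task (hypothetical values).
samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]

head_w, head_b = train_head(samples, labels)
print([predict(x, head_w, head_b) for x in samples])
```

Only the three numbers in the new output layer are learned here, which is why this approach can work with far less labeled data and compute than training a full network from scratch.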

Conclusion

With this article I have just scratched the surface and provided a high-level overview. The idea is to introduce readers to Computer Vision technology, so that they can recognize where it applies when there is a real-world problem at hand to solve.

I would strongly advise reviewing the lectures from deeplearning.ai to learn further details.

