How Convolutional Neural Networks (CNNs) for Image Classification Work
Tejas Shastrakar
Masters Student Mechatronics and Robotics | ADAS and Computer Vision Enthusiast | Ex TCSer | Machine Learning | Deep Learning
What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN), also known as ConvNet, is a specialized type of deep learning algorithm mainly designed for tasks that necessitate object recognition, including image classification, detection, and segmentation. CNNs are employed in a variety of practical scenarios, such as autonomous vehicles, security camera systems, and others.
The importance of CNNs
There are several reasons why CNNs are important in the modern world, as highlighted below:
Inspiration Behind CNN and Parallels With The Human Visual System
Convolutional neural networks were inspired by the layered architecture of the human visual cortex, and below are some key similarities and differences:
Illustration of the correspondence between areas of the primary visual cortex and the layers in a convolutional neural network
CNNs mimic the human visual system, but they are simpler: they lack its complex feedback mechanisms and rely on supervised rather than unsupervised learning. Despite these differences, they have driven major advances in computer vision.
Key Components of a CNN
The convolutional neural network is made of four main parts.
But how do CNNs learn with those parts?
They help the CNNs mimic how the human brain operates to recognize patterns and features in images:
This section dives into each of these components through the example of classifying a handwritten digit.
Convolution layers
This is the first building block of a CNN. As the name suggests, the main mathematical operation performed is convolution: a sliding-window function applied to a matrix of pixels representing an image. The sliding function is called a kernel or a filter; the two terms are used interchangeably.
In the convolution layer, several filters of equal size are applied, and each filter is used to recognize a specific pattern from the image, such as the curving of the digits, the edges, the whole shape of the digits, and more.
Put simply, in the convolution layer, we use small grids (called filters or kernels) that move over the image. Each small grid is like a mini magnifying glass that looks for specific patterns in the photo, like lines, curves, or shapes. As it moves across the photo, it creates a new grid that highlights where it found these patterns.
For example, one filter might be good at finding straight lines, another might find curves, and so on. By using several different filters, the CNN can get a good idea of all the different patterns that make up the image.
Let’s consider this 32x32 grayscale image of a handwritten digit. The values in the matrix are given for illustration purposes.
Also, let’s consider the kernel used for the convolution. It is a matrix with a dimension of 3x3. The weight of each element of the kernel is represented in the grid: zero weights are shown in the black cells and ones in the white cells.
Do we have to manually find these weights?
In real life, the weights of the kernels are determined during the training process of the neural network.
Using these two matrices, we can perform the convolution operation by taking the element-wise product of the kernel and each image patch and summing the results:
The dimension of the convolved matrix depends on the size of the sliding window: the larger the kernel, the smaller the output dimension.
Application of the convolution task using a stride of 1 with 3x3 kernel
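As a minimal sketch of the operation described above (the image and kernel values here are illustrative, not the ones from the figures), a stride-1 convolution can be written in NumPy:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no-padding) 2D convolution via a sliding window."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1  # output height
    ow = (image.shape[1] - kw) // stride + 1  # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride : i * stride + kh,
                          j * stride : j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

# Toy 5x5 "image" and a 3x3 vertical-edge kernel (illustrative values)
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (3, 3): (5 - 3) // 1 + 1 = 3
```

Note how the output is smaller than the input: with a 3x3 kernel and stride 1, a 5x5 image yields a 3x3 feature map, matching the dimension formula above.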
Another name associated with the kernel in the literature is feature detector because the weights can be fine-tuned to detect specific features in the input image.
For instance:
The more convolution layers the network has, the better it is at detecting more abstract features.
Activation function
A ReLU activation function is applied after each convolution operation. This function helps the network learn non-linear relationships between the features in the image, making the network more robust for identifying different patterns. It also helps mitigate the vanishing gradient problem.
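ReLU itself is a one-line function: it zeroes out negative activations and passes positive ones through unchanged. A quick sketch:

```python
import numpy as np

def relu(x):
    """ReLU: zero out negative activations, keep positives unchanged."""
    return np.maximum(0, x)

# Applied element-wise to a small example feature map
feature_map = np.array([[-6.0, 2.0],
                        [ 0.5, -1.0]])
print(relu(feature_map))  # negatives become 0, positives pass through
```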
Pooling layer
The goal of the pooling layer is to extract the most significant features from the convolved matrix. This is done by applying an aggregation operation that reduces the dimension of the feature map (the convolved matrix), hence reducing the memory used while training the network. Pooling also helps mitigate overfitting.
The most common aggregation functions that can be applied are:
Below is an illustration of each of these aggregation functions:
Also, the dimension of the feature map becomes smaller as the pooling function is applied.
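The two most common aggregation functions, max and average pooling, can be sketched with a non-overlapping 2x2 window (the feature-map values below are illustrative):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling; `mode` selects max or average aggregation."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            window = feature_map[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 0.],
               [4., 6., 1., 1.],
               [0., 2., 5., 7.],
               [1., 1., 3., 2.]])
print(pool2d(fm, mode="max"))  # 4x4 feature map shrinks to 2x2
```

Each 2x2 window collapses to a single value, so the 4x4 feature map becomes 2x2, which is exactly the dimension reduction the layer is designed to achieve.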
The feature map produced by the last pooling layer is flattened into a one-dimensional vector so that it can be processed by the fully connected layers.
Fully connected layers
These layers form the final part of the convolutional neural network, and their input is the flattened one-dimensional vector generated by the last pooling layer. ReLU activation functions are applied to them for non-linearity.
Finally, a softmax prediction layer is used to generate probability values for each of the possible output labels, and the final label predicted is the one with the highest probability score.
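The softmax step can be sketched as follows; the class scores here are hypothetical final-layer outputs for the ten digits, not values from the article's example:

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical final-layer scores for digits 0-9
logits = np.array([1.2, 0.3, 4.1, -0.5, 0.0, 0.8, 2.2, 0.1, 0.4, 1.0])
probs = softmax(logits)
print(probs.argmax())  # predicted digit = index of the highest probability
```

Because the largest score here belongs to index 2, the network would predict the digit "2"; the probabilities always sum to 1 by construction.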
Overfitting and Regularization in CNNs
Overfitting is a common challenge in machine learning models and CNN deep learning projects. It happens when the model learns the training data too well ("learning by heart"), including its noise and outliers. This leads to a model that performs well on the training data but badly on new, unseen data.
This can be observed when the performance on the training data is much higher than the performance on the validation or test data, and a graphical illustration is given below:
Underfitting Vs. Overfitting
Deep learning models, especially Convolutional Neural Networks (CNNs), are particularly susceptible to overfitting due to their capacity for high complexity and their ability to learn detailed patterns in large-scale data.
Several regularization techniques can be applied to mitigate overfitting in CNNs, and some are illustrated below:
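One widely used regularization technique is dropout, which randomly zeroes a fraction of activations during training so the network cannot rely on any single unit. A minimal sketch of the standard "inverted dropout" formulation (the layer size and drop rate here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: randomly zero a fraction p of units during training."""
    if not training:
        return activations  # dropout is a no-op at inference time
    mask = rng.random(activations.shape) >= p       # keep each unit with prob 1 - p
    return activations * mask / (1.0 - p)           # rescale so the expected value is unchanged

x = np.ones(10)
print(dropout(x, p=0.5))  # roughly half the entries zeroed, survivors scaled up
```

The rescaling by 1 / (1 - p) keeps the expected activation the same between training and inference, which is why no correction is needed at test time.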
Practical Applications of CNNs
Convolutional Neural Networks have revolutionized the field of computer vision, leading to significant advancements in many real-world applications. Below are a few examples of how they are applied.
Conclusion
This article has provided a complete overview of what a CNN in deep learning is, along with its crucial role in image recognition and classification tasks.
It started by highlighting the inspiration drawn from the human visual system for the design of CNNs and then explored the key components that allow these networks to learn and make predictions.
The issue of overfitting was acknowledged as a significant challenge to CNNs' generalization capability, and a variety of regularization strategies for mitigating it and improving CNNs' overall performance were outlined.
Finally, some major deep learning CNN frameworks have been mentioned, along with the unique features of each one and how they compare to each other.