Convolutions, Pooling & Flattening

When building neural networks for visual tasks like image recognition, object detection or boundary detection, convolutional neural networks work very effectively. Do you know why?

Let us take a high-resolution colour image of dimensions 1000 px × 1000 px. Since the image is in colour, it has 3 channels (i.e. Red, Green & Blue), so there are 1000 × 1000 × 3 = 3 million input features to train on.

Training a neural network on 3 million features is problematic because:

  1. Computation becomes very expensive
  2. The accuracy of the network will take a hit: so many features demand a huge number of parameters, which makes the network prone to overfitting

To solve this we use convolutional neural networks (CNNs). They use convolutions, filtering & pooling to share parameters, exploit sparse connections and extract the required features from the image. Doing these operations (convolving & pooling) on volumes (i.e. 3-D matrices) of images makes the network computationally efficient.

After these steps, the CNN passes the output to a fully connected layer (i.e. a deep neural network). From there on, we get the same process & advantages as a regular DNN.

Let's dive deeper into the following in this article:

  1. Convolutions
  2. Pooling
  3. Fully Connected Layers
  4. A CNN Example

Convolutions

In the first layers of a CNN, we extract features from an image using filters. As we apply a filter to the 3-D volume of an image, we get a much smaller output volume in which a particular feature of the image is emphasised.

Let us take an example to understand this: vertical edge detection. A basic vertical edge detection filter could be as follows.

vedge_filter = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]        

The filter above is a 3 × 3 matrix. It slides across the 1000 × 1000 image one 3 × 3 window at a time; at each position, an element-wise product between the window and the filter is computed and summed into a single output value. This both reduces the size of the image and emphasises the vertical edges in it. We can also add activation functions to convolution layers to get more non-linearity (more on this at the end of the article).
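
To make the sliding-window arithmetic concrete, here is a minimal NumPy sketch of a stride-1, no-padding convolution; the conv2d helper and the random single-channel test image are illustrative, not from the original article.

import numpy as np

def conv2d(image, filt, stride=1):
    # At each position: element-wise product of the window and the filter, then sum
    f = filt.shape[0]
    out_h = (image.shape[0] - f) // stride + 1
    out_w = (image.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride : i * stride + f,
                           j * stride : j * stride + f]
            out[i, j] = np.sum(window * filt)
    return out

vedge = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
image = np.random.rand(1000, 1000)       # one channel of the example image
print(conv2d(image, vedge).shape)        # (998, 998)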

There are many more filters that can be applied. A few others are the horizontal edge detection filter, the Sobel filter & the Scharr filter.

Padding

When the above convolution operation is carried out, the corner and border pixels are neglected: they fall inside far fewer filter windows than the middle pixels.

To include them in as many operations as the middle pixels and give them the importance they need, we add dummy rows and columns (typically zeros) around the image. This is called padding.

By default the filter steps over the image one pixel at a time, but the step size, called the stride, is another hyper-parameter you can define and tune.
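
As a quick illustration, zero padding can be added with NumPy's np.pad; the tiny 4 × 4 image here is an arbitrary example.

import numpy as np

img = np.arange(16.0).reshape(4, 4)
padded = np.pad(img, pad_width=1)  # one border row/column of zeros on every side
print(padded.shape)                # (6, 6): border pixels now fall inside more windows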

The shape of the final output after a convolution is given by:

import math

def conv_output_shape(nh_prev, nw_prev, nc_prev, f, p, s, n_filters):
    # f: filter size, p: padding size, s: stride size
    # nh_prev, nw_prev, nc_prev: height, width, channels of the previous layer
    nh = math.floor((nh_prev + 2 * p - f) / s) + 1
    nw = math.floor((nw_prev + 2 * p - f) / s) + 1
    nc = n_filters                     # one output channel per filter
    # Each filter has shape f x f x nc_prev
    # Activation shape for this layer: nh x nw x nc
    # Total no. of weights: f * f * nc_prev * n_filters (plus n_filters biases)
    return nh, nw, nc
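
For instance, the 1000 × 1000 × 3 image from earlier, pushed through a hypothetical layer of ten 3 × 3 filters with no padding and stride 1:

print(conv_output_shape(1000, 1000, 3, f=3, p=0, s=1, n_filters=10))  # (998, 998, 10)
# Weights for this layer: 3 * 3 * 3 * 10 = 270 (plus 10 biases)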

Pooling

Pooling is another operation a CNN uses to shrink the image and make feature detection more robust by emphasising the strongest features.

There are many types of pooling available. Let us take max pooling as an example. In max pooling we take a sub-matrix of the image of size pool_height × pool_width and keep only its maximum element. We do this for all the sub-matrices of the image and build a new, smaller image out of the maxima.

pool_height and pool_width are hyper-parameters to tune.

Since pooling involves no weights, there are no parameters here for gradient descent to learn.
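
As a concrete sketch, here is a minimal max pooling implementation over a single channel; the helper name max_pool_2d and the 2 × 2, stride-2 defaults are assumptions for the example, not fixed by the article.

import numpy as np

def max_pool_2d(image, pool_h=2, pool_w=2, stride=2):
    # Slide a pool_h x pool_w window over the image and keep each window's max
    h, w = image.shape
    out_h = (h - pool_h) // stride + 1
    out_w = (w - pool_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride : i * stride + pool_h,
                           j * stride : j * stride + pool_w]
            out[i, j] = window.max()
    return out

print(max_pool_2d(np.arange(16.0).reshape(4, 4)))  # [[ 5.  7.] [13. 15.]]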

Fully Connected Layer

Once the image has passed through the convolution and pooling stages, it has shrunk to a much smaller scale and the features in it have been emphasised.

These shrunk images with emphasised features are fed to a deep neural network. The image is flattened (the 3-D volume is spread out into a 1-D vector) and passed as input to the first layer of the network.
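
For example, flattening in NumPy is just a reshape; the 5 × 5 × 16 volume below is an arbitrary illustrative shape.

import numpy as np

volume = np.random.rand(5, 5, 16)  # e.g. the output volume of the conv/pool stages
flat = volume.reshape(-1)          # spread every value into one long vector
print(flat.shape)                  # (400,) -- the input size of the first dense layer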

From there, gradient descent and backpropagation happen just like in any other neural network (you can refer to the previous articles to see how that works and how to optimise it).

A CNN Example

You can see in the below image how an image passes through various phases of a CNN.

[Image: an example CNN pipeline, from the input image through convolution and pooling layers to a fully connected network]

You can see that, from left to right, as the number of filters increases, the height and width of the image reduce while its depth increases. The depth is simply the number of filter outputs (i.e. if you apply 20 filters, the depth will be 20 matrices, one output per filter).

You can also see that at the end the final volume is passed to a deep neural network.

Convolution in detail:

How does backprop work in convolutions, you might ask. It is not very different from regular backprop; the only difference is that all the operations are applied on volumes instead of matrices.

[Image: one convolution step, where an input is convolved with two filters, each result passed through ReLU and stacked into the output volume]

In the above case, an image is convolved with two different filters, and each filter outputs a separate 2-D matrix. Each of these is passed through an activation function (ReLU in this case) and then they are stacked together to form the final output volume of the convolution.

Please also observe that each filter has its own weights and biases; these are what get updated in the backprop step.
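
Putting the diagram into code: below is a hedged sketch of one convolution layer with two filters, each with its own bias, followed by ReLU and stacked along the depth axis. It reuses the conv2d helper sketched earlier, works on a single channel for brevity, and all names and shapes are illustrative.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def conv_layer(image, filters, biases):
    # Convolve with each filter, add its bias, apply ReLU, then stack by depth
    maps = [relu(conv2d(image, f) + b) for f, b in zip(filters, biases)]
    return np.stack(maps, axis=-1)   # shape: out_h x out_w x n_filters

filters = [np.random.randn(3, 3) for _ in range(2)]  # two 3 x 3 filters
biases = np.random.randn(2)                          # one bias per filter
out = conv_layer(np.random.rand(8, 8), filters, biases)
print(out.shape)                                     # (6, 6, 2)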
