Deep learning--CNN: Padding, strided convolution, convolution over volume, pooling layer
In order to build deep neural networks, one modification to the basic convolution operation that you really need is padding. Recall that if you take a 6x6 input image and convolve it with a 3x3 filter, you end up with a 4x4 output image. The general rule is that if you convolve an nxn input image with an fxf filter, you get an (n-f+1)x(n-f+1) output image. There are two downsides to this. The first is that every time you apply the convolution operator your image shrinks, so you can only do this a few times before your image gets really small. The second is that pixels at the corners and edges of the input image are touched by only a few 3x3 regions; in particular, the upper-left corner pixel is used in only one 3x3 region, whereas a pixel in the middle of the image is overlapped by many 3x3 regions. So pixels on the corners and edges of the input image are used much less in the output, and you're throwing away a lot of the information near the edge of the image.
So, in order to fix both of these problems, what you can do is pad the image before applying the convolution operation. In this case, you can pad the input image with an additional border of one pixel around the edges, so the 6x6 image becomes an 8x8 image, and if you convolve an 8x8 image with a 3x3 filter you now get not a 4x4 but a 6x6 output image. So you've managed to preserve the original input size of 6x6.
Generally speaking, given an nxn input image and an fxf filter, if you apply the convolution operation you'll end up with an (n-f+1)x(n-f+1) output image. If you pad the input image with an additional border of p pixels around the edges and then apply the convolution, you'll get an (n+2p-f+1)x(n+2p-f+1) output image. For the output size to be the same as the input size, you need n+2p-f+1=n, which gives p=(f-1)/2; this is an integer when f is odd.
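As a minimal sketch of these size rules (using NumPy; the helper name `conv2d_valid` is my own, not from the article):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' convolution (really cross-correlation): slide the f x f kernel
    over the n x n image and sum the element-wise products at each position."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.random.randn(6, 6)               # n = 6
kernel = np.random.randn(3, 3)              # f = 3
print(conv2d_valid(image, kernel).shape)    # (4, 4): n - f + 1

p = (3 - 1) // 2                            # p = (f - 1) / 2 = 1 for a "same" convolution
padded = np.pad(image, p)                   # zero border of p pixels on every side
print(conv2d_valid(padded, kernel).shape)   # (6, 6): the input size is preserved
```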
Strided convolution is another basic building block of convolutions as used in convolutional neural networks. Let's say you want to convolve a 7x7 image with a 3x3 filter, but instead of doing it in the usual way we're going to use a stride of 2. That means you take the element-wise product in the upper-left 3x3 region as usual, multiply and add, and that gives you 91; but then instead of stepping the blue box over by one step you step it over by two steps, do the usual product and sum, and that gives you 100. When you go to the next row you again take two steps down instead of one, and the product and sum gives you 69; stepping over by two more steps gives you 91, and so on. So in this example we convolve a 7x7 input image with a 3x3 filter and get a 3x3 output image. The input and output dimensions turn out to be governed by the following formula: if you convolve an nxn input image with an fxf filter using padding p and stride s, the output has size (floor((n+2p-f)/s)+1) x (floor((n+2p-f)/s)+1).
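A minimal NumPy sketch of strided convolution and the output-size formula (function name is my own):

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Convolution (cross-correlation) with padding p and stride s.
    Output size per side: floor((n + 2p - f) / s) + 1."""
    image = np.pad(image, pad)
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride      # the window jumps by `stride` each step
            out[i, j] = np.sum(image[r:r+f, c:c+f] * kernel)
    return out

image = np.random.randn(7, 7)
kernel = np.random.randn(3, 3)
print(conv2d(image, kernel, stride=2).shape)   # (3, 3): floor((7 - 3) / 2) + 1 = 3
```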
If you read a typical math or signal processing textbook, there is an inconsistency in notation: the way convolution is defined there, before doing the element-wise product and summing there is one other step, which is to flip the filter on the horizontal as well as the vertical axis, and you then multiply out the elements of this flipped matrix to compute the upper-left element of the 4x4 output. Then you shift those 9 numbers by one to compute the next element of the output matrix, shift by one again and compute, and so on. The way we've defined the convolution operation in these articles skips this flipping or mirroring step, so technically the operation we're actually doing is called cross-correlation rather than convolution; but in the deep learning literature, by convention, we just call it the convolution operation.
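To see the difference concretely, here is a small sketch using SciPy (a library choice of mine, not mentioned in the article), which exposes both operations; flipping the kernel turns cross-correlation into the textbook convolution:

```python
import numpy as np
from scipy import signal

image = np.random.randn(6, 6)
kernel = np.random.randn(3, 3)

# What deep learning calls "convolution" is really cross-correlation:
cc = signal.correlate2d(image, kernel, mode='valid')

# The textbook convolution first flips the kernel horizontally and vertically:
conv = signal.convolve2d(image, kernel, mode='valid')
flipped = np.flip(kernel)                        # flip both axes
print(np.allclose(conv, signal.correlate2d(image, flipped, mode='valid')))  # True
```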
You've seen how convolution over 2D images works; now let's see how you can implement convolutions not just over 2D images but over three-dimensional volumes. Let's say you want to detect features not in a gray-scale image but in an RGB image, so instead of a 6x6 image the input might be 6x6x3, where the 3 corresponds to the three color channels; you can think of the input volume as a stack of three 6x6 images. To detect edges or some other feature in this image, you convolve it not with a 3x3 filter as before but with a 3x3x3 filter, so the filter itself also has three layers corresponding to the red, green and blue channels, and it is sometimes drawn as a cube. The number of channels of the input volume must match the number of channels of your filter. To compute the output of this convolution operation, you take the 3x3x3 filter and first place it in the upper-left-most position of the input volume; this 3x3x3 filter has 27 numbers, and you take each of these 27 numbers, multiply them with the corresponding numbers from the red, green and blue channels of the image, then add up all those products, and this gives you the first number of the output.
Then, to compute the next output, you take the cube and slide it over by one step, again do the 27 multiplications, add up the 27 products to get the next output, and so on.
Generally speaking, if you have an nxnxn_c input volume and convolve it with an fxfxn_c filter, you get an (n-f+1)x(n-f+1) output matrix, and if you have n'_c filters you get a stack of n'_c such (n-f+1)x(n-f+1) outputs, i.e., an (n-f+1)x(n-f+1)xn'_c volume.
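A minimal NumPy sketch of convolving an n x n x n_c volume with n'_c filters of size f x f x n_c (the helper name is my own):

```python
import numpy as np

def conv_volume(volume, filters):
    """volume: (n, n, n_c); filters: (n_c_prime, f, f, n_c).
    Returns an (n - f + 1, n - f + 1, n_c_prime) output volume."""
    n, _, n_c = volume.shape
    n_c_prime, f, _, _ = filters.shape
    out = np.zeros((n - f + 1, n - f + 1, n_c_prime))
    for k in range(n_c_prime):                   # one 2-D output slice per filter
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                # 27 multiplications for a 3x3x3 filter, then sum to a single number
                out[i, j, k] = np.sum(volume[i:i+f, j:j+f, :] * filters[k])
    return out

volume = np.random.randn(6, 6, 3)                # RGB input
filters = np.random.randn(2, 3, 3, 3)            # two 3x3x3 filters
print(conv_volume(volume, filters).shape)        # (4, 4, 2)
```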
Let's look at how to build one layer of a convolutional neural network. Recall that you take a 6x6x3 input volume, convolve it with two different 3x3x3 filters, and obtain two different 4x4 outputs. The final thing to turn this into a convolutional network layer is that for each of these outputs you add a bias b_i, which is a real number; with Python broadcasting you add b_i to every element of the output matrix, and then apply a non-linearity, say a ReLU, which gives you a 4x4 output after the bias and the non-linearity. Remember that one step of forward propagation is Z^[1]=W^[1]*a^[0]+b^[1] and a^[1]=g(Z^[1]); here the input volume plays the role of a^[0], the weights W^[1] are the stack of 3x3x3 filters, and the stack of outputs after applying the bias and the non-linearity is exactly a^[1].
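Continuing the previous sketch, the layer just adds a per-filter bias and a non-linearity (here Z_conv stands in for the convolution result and is faked with random numbers so the snippet runs on its own):

```python
import numpy as np

# Suppose Z_conv is the (4, 4, 2) result of convolving a^[0] with two filters.
Z_conv = np.random.randn(4, 4, 2)
b1 = np.random.randn(2)            # one real-valued bias b_i per filter

Z1 = Z_conv + b1                   # Python broadcasting adds b_i to every element of its slice
a1 = np.maximum(Z1, 0)             # ReLU non-linearity: a^[1] = g(Z^[1])
print(a1.shape)                    # (4, 4, 2)
```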
Let's recap all the parameters of one layer. Suppose layer i has filter size f^[i], padding p^[i], stride s^[i], and n^[i]_c filters. If the input size is n^[i-1]_h x n^[i-1]_w x n^[i-1]_c, then the output size is n^[i]_h x n^[i]_w x n^[i]_c, where the number of channels in the output equals the number of filters. n^[i]_h is computed as floor((n^[i-1]_h + 2*p^[i] - f^[i]) / s^[i]) + 1, and n^[i]_w is computed in the same way. The weights are the stack of all the filters, i.e., of size f^[i] x f^[i] x n^[i-1]_c x n^[i]_c, where n^[i-1]_c is the number of channels of each filter, which must equal the number of channels of the input volume, and n^[i]_c is the number of filters. If you have m input volumes and apply the convolution to each of them, you get m output volumes, and after applying the bias and non-linearity to these outputs you get A^[i] of size m x n^[i]_h x n^[i]_w x n^[i]_c.
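A small plain-Python sketch of this bookkeeping (the function name and example values are my own):

```python
import math

def conv_layer_shapes(n_prev, c_prev, f, p, s, n_filters):
    """Output size and parameter count for one convolutional layer."""
    n_out = math.floor((n_prev + 2 * p - f) / s) + 1     # n^[i]_h = n^[i]_w for square inputs
    weights = f * f * c_prev * n_filters                 # W^[i]: f x f x n^[i-1]_c x n^[i]_c
    biases = n_filters                                   # one bias per filter
    return (n_out, n_out, n_filters), weights + biases

shape, n_params = conv_layer_shapes(n_prev=39, c_prev=3, f=3, p=0, s=1, n_filters=10)
print(shape, n_params)                                   # (37, 37, 10) 280
```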
Now let's go through a concrete example of a deep convolutional network (ConvNet). Let's say you have a 39x39x3 input volume for classification. You convolve it with 10 different 3x3x3 filters with stride 1 and padding 0, which generates a 37x37x10 volume in the second layer; then you take that volume and convolve it with 20 different 5x5x10 filters with stride 2 and padding 0 to obtain a 17x17x20 volume in the third layer; then you convolve that with 40 different 5x5x20 filters with stride 2 and padding 0 to get a 7x7x40 volume. Finally, you take this volume, flatten it, and feed it into a logistic regression unit or a softmax unit, depending on whether you're trying to recognize cat vs. no cat or any one of ten different objects, and this gives the final output y hat of the neural network. To be clear, this last step takes all the numbers in the last volume, unrolls them into a very long vector, and feeds it to a softmax or logistic regression unit to make the prediction for the final output.
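A quick plain-Python check of the shapes in this example (the layer specs come from the text; the helper function is my own):

```python
import math

def conv_out(n, f, p, s):
    """Output side length of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

n, c = 39, 3                                         # 39x39x3 input volume
for f, s, p, n_filters in [(3, 1, 0, 10), (5, 2, 0, 20), (5, 2, 0, 40)]:
    n = conv_out(n, f, p, s)
    c = n_filters
    print(f"{n}x{n}x{c}")                            # 37x37x10, 17x17x20, 7x7x40

print(n * n * c)                                     # 1960 numbers, unrolled into a long
                                                     # vector for softmax / logistic regression
```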
Other than convolutional layers, a ConvNet often uses pooling layers to reduce the size of the representation, to save computation, and to make some of the features it detects a bit more robust. Let's say you have a 4x4 input image and apply max pooling with a 2x2 filter and a stride of 2. You place the filter over the upper-left corner of the input image and pick the maximum number in that 2x2 region to obtain the first element of the output matrix. Then you shift the filter right by 2 steps and again pick the maximum of the new 2x2 region to obtain the second element of the output; then you shift down by 2 steps and move back to the left for the third element; and finally you shift right by 2 steps for the fourth element of the output.
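A minimal NumPy sketch of 2x2 max pooling with stride 2 (the helper name is my own):

```python
import numpy as np

def max_pool(image, f=2, stride=2):
    """Slide an f x f window with the given stride and keep the max of each region."""
    n = image.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride
            out[i, j] = np.max(image[r:r+f, c:c+f])
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
print(max_pool(x))      # [[9. 2.]
                        #  [6. 3.]]
```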
So, here is the intuition for what max pooling is doing. If you think of the 4x4 input as some set of features, i.e., the activations in some layer of the network, then a large number means that a particular feature has perhaps been detected; the upper-left quadrant having this particular feature might mean a vertical edge or maybe a cat eye. What the max operation does is, so long as the feature is detected anywhere in one of these quadrants, it remains preserved in the output of max pooling. One interesting property of max pooling is that it has a set of hyperparameters but no parameters to learn.
To summarize, the hyperparameters for pooling are the filter size f and the stride s. If your input volume is n_h x n_w x n_c, then the output volume is (floor((n_h-f)/s)+1) x (floor((n_w-f)/s)+1) x n_c. The number of output channels equals the number of input channels because pooling is applied to each channel independently. Although you can add padding, it is very rarely used. One thing about pooling is that there are no parameters to learn, so when you implement backpropagation you'll find there are no parameters that backpropagation adapts through max pooling. There are just these hyperparameters, which you set once, maybe by hand or using cross-validation, and beyond that you're done; it's just a fixed function.
You now know pretty much all the building blocks, so let's look at building a full ConvNet. Let's say you have a 32x32x3 input RGB image, maybe for handwritten digit recognition. You take the volume and convolve it with 6 different 5x5x3 filters with stride 1 to generate a 28x28x6 volume, then apply max pooling with 2x2 filters and stride 2 to get a 14x14x6 volume; you then convolve that with 16 different 5x5x6 filters with stride 1 to obtain a 10x10x16 volume, and apply max pooling with stride 2 to get a 5x5x16 volume. Then you flatten this volume into a long vector of 400 elements and feed it to a fully connected layer with 120 units, where each of the 400 elements in the vector is connected to each of the 120 units. Next you take the 120 units and add another fully connected layer with a smaller number of units, say 84, and finally you have 84 real numbers that you can feed to a softmax unit; if you're recognizing the handwritten digits 0, 1, 2, ..., up to 9, this would be a softmax with 10 outputs.
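Here is a sketch of this architecture in PyTorch (a framework choice of mine, not mentioned in the article); the layer sizes follow the text, and PyTorch uses channels-first tensors:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1),   # -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                               # -> 14x14x6
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1),  # -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                               # -> 5x5x16
    nn.Flatten(),                                                        # -> 400-element vector
    nn.Linear(400, 120),                                                 # fully connected
    nn.ReLU(),
    nn.Linear(120, 84),                                                  # fully connected
    nn.ReLU(),
    nn.Linear(84, 10),                                                   # 10 outputs (softmax applied in the loss)
)

x = torch.randn(1, 3, 32, 32)      # one 32x32 RGB image
print(model(x).shape)              # torch.Size([1, 10])
```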
It turns out in the ConvNet literature there are two slightly inconsistent conventions about what to call a layer. One convention is that a convolutional layer together with the pooling layer that immediately follows it counts as one layer; the other counts the convolutional layer as one layer and the pooling layer as another. When people report the number of layers in a neural network, they usually report only the layers that have weights, that have parameters; because a pooling layer has no weights and no parameters, only a few hyperparameters, I'm going to use the convention that puts the convolution and pooling together as one layer.
Finally, let's talk about why convolutions are so useful when you include them in your neural networks. Let's say you have a 32x32x3 input RGB image, and you use 6 different 5x5x3 filters to convolve with the input image, which gives you a 28x28x6 output volume. If you flatten the input volume into 3072 units in one neural network layer and unroll the output volume into 4704 units in another layer, and you were to connect every one of these neurons, the weight matrix would have 3072*4704, roughly 14 million, parameters. Whereas in the convolutional layer, each 5x5x3 filter has only 5*5*3=75 parameters plus one bias parameter, so there are 6*76=456 parameters. So the number of parameters in this convolutional layer remains quite small.
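The arithmetic, spelled out in plain Python:

```python
# Fully connected: every one of the 32*32*3 input units connects to every one
# of the 28*28*6 output units.
fc_params = (32 * 32 * 3) * (28 * 28 * 6)
print(fc_params)            # 14450688  (~14 million)

# Convolutional layer: 6 filters of size 5x5x3, each with one bias.
conv_params = 6 * (5 * 5 * 3 + 1)
print(conv_params)          # 456
```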
There are two reasons why a ConvNet has a relatively small number of parameters. One is parameter sharing, which is motivated by the observation that a feature detector, such as a vertical edge detector, that is useful in one part of the image is probably useful in other parts of the image. What I mean is that if you've figured out, say, a 3x3 filter for detecting vertical edges, then every 3x3 region of the input image (say with stride 1) can share that same 3x3 filter. The second way a ConvNet gets away with relatively few parameters is by having sparse connections. Here's what I mean: if you look at, say, the 0 in the upper-left of the output matrix, it is computed by a 3x3 convolution, so it depends only on a 3x3 grid of input cells; it is as if that output unit is connected to only 9 out of the 6*6=36 input features. Through these two mechanisms, the neural network has a lot fewer parameters, which allows it to be trained with a smaller training set and makes it less prone to overfitting.
Finally, let's put it all together and see how you can train a ConvNet. Let's say you want to build a cat detector, and you have a labeled training set where x is an image and y can be a binary label or one of k classes. And let's say you've chosen a convolutional neural network structure, maybe one that starts with the image, then has convolutional and pooling layers, then some fully connected layers, followed by a softmax output that produces y hat. The convolutional layers and fully connected layers have weight and bias parameters, and any setting of those parameters lets you define a cost function: after randomly initializing the parameters w and b, you can compute the cost J as the sum of the losses of your neural network's predictions over the entire training set, maybe divided by m. To train your network, all you need to do is use gradient descent or some other algorithm like momentum, RMSprop, or Adam to optimize all the parameters so as to reduce J.
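A minimal PyTorch training-loop sketch under these assumptions (the tiny stand-in model and synthetic data are mine, just so the snippet runs end to end; in practice you would use the ConvNet above and a real labeled training set):

```python
import torch
import torch.nn as nn

# Tiny stand-in model and synthetic mini-batch.
model = nn.Sequential(nn.Conv2d(3, 6, 5), nn.ReLU(), nn.Flatten(),
                      nn.Linear(6 * 28 * 28, 10))
x = torch.randn(8, 3, 32, 32)                  # a mini-batch of 8 RGB images
y = torch.randint(0, 10, (8,))                 # integer class labels

criterion = nn.CrossEntropyLoss()              # softmax + cross-entropy: the cost J
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                        # gradient descent (here Adam) on J
    optimizer.zero_grad()
    loss = criterion(model(x), y)              # forward pass and loss on the batch
    loss.backward()                            # backpropagation computes all gradients
    optimizer.step()                           # update every weight and bias
```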
Chen Yang