Deep learning--CNN: classic ConvNet, residual networks, inception network
Chen Yang


There are a few classic neural network architectures: LeNet-5, AlexNet, and VGG-16. First, let's look at the following LeNet-5 example. Say you start with a 32x32x1 grayscale image, and the goal of the network is to recognize handwritten digits. In the first step, you use a set of six 5x5 filters with a stride of 1 and end up with a 28x28x6 volume. Then the network applies average pooling; back when this paper (cited in the bottom left corner of the following slide) was written, people used average pooling much more, and if you were building a modern variant you would probably use max pooling instead. In this example, the network uses average pooling with a 2x2 filter and a stride of 2, so you end up with a 14x14x6 volume. It then applies 16 different 5x5x6 filters with a stride of 1 and obtains a 10x10x16 volume, and does average pooling again with a 2x2 filter and a stride of 2, ending up with a 5x5x16 volume. Multiplying out 5x5x16 gives 400, and the next layer is a fully connected layer that connects each of these 400 nodes to every one of 120 neurons. There is another fully connected layer with 84 neurons, and the final step uses these 84 features to make the prediction y hat, where y hat takes on 10 possible values corresponding to the digits 0 through 9. In a modern version of this network, we would use a softmax layer with a 10-way classification in the output layer.
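To make the shapes concrete, here is a minimal PyTorch sketch of this LeNet-5-style network (my own illustrative code, not the original implementation; I keep the average pooling from the paper but use ReLU in place of the original non-linearities, and channels come first in PyTorch's NxCxHxW convention):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5-style network following the shapes in the walkthrough above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1),   # 32x32x1 -> 28x28x6
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),       # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5, stride=1),   # -> 10x10x16
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),       # -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # 5*5*16 = 400
            nn.Linear(400, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),                  # softmax applied by the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 1, 32, 32)          # one 32x32 grayscale image
print(LeNet5()(x).shape)               # torch.Size([1, 10])
```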

This network was small by modern standards, with about 60 thousand parameters. Today you often see neural networks with 10 million to a hundred million parameters, literally about a thousand times bigger. One thing you do see is that as you go deeper into the LeNet-5 network, the height and width tend to go down, whereas the number of channels tends to increase. Another pattern you see in this neural network is one or more convolutional layers followed by a pooling layer, then one or more convolutional layers followed by another pooling layer, then some fully connected layers, and then the output. This arrangement of layers is quite common.

Now, let's look at the following AlexNet example. It starts with a 227x227x3 RGB image, and the first layer applies a set of 96 11x11x3 filters with a stride of 4; because it uses a large stride, the volume shrinks substantially, to 55x55x96. It then applies max pooling with a 3x3 filter and a stride of 2, which reduces the volume to 27x27x96, and then performs a 5x5 same convolution, so with padding, ending up with 27x27x256. Max pooling again reduces the height and width, giving 13x13x256, and then another same convolution with 384 filters gives 13x13x384. Several more 3x3 same convolutions followed by a max pooling give you 6x6x256; multiplying out 6x6x256 gives 9216, so we unroll it into a vector of 9216 nodes. Finally, it has a few fully connected layers and then uses a softmax to output which one of 1000 classes the object could be.
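Again, a minimal PyTorch sketch tracing these shapes; the article only says "a few fully connected layers", so the two 4096-unit layers here follow the original AlexNet paper, and details such as local response normalization and dropout are omitted:

```python
import torch
import torch.nn as nn

# AlexNet-style sketch following the shape walkthrough above (1000 ImageNet classes).
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),   # same conv -> 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 13x13x256
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),  # -> 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),  # -> 13x13x384
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),  # -> 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 6x6x256 = 9216
    nn.Flatten(),
    nn.Linear(9216, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                                     # softmax via the loss
)

x = torch.randn(1, 3, 227, 227)
print(alexnet(x).shape)                                        # torch.Size([1, 1000])
```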

This type of network has a lot of similarities to LeNet, but it was much bigger, with about 60 million parameters. It had a lot more hidden units and was trained on a lot more data, which allowed it to achieve its remarkable performance. Another aspect that made this architecture much better than LeNet is the use of the ReLU activation function.

Let's look at the third architecture example, called the VGG or VGG-16 network. A remarkable thing about VGG-16 is that instead of having so many hyperparameters, it uses a much simpler design where you focus on just having convolutional layers that are 3x3 filters with a stride of 1, always using same padding, and make all your max-pooling layers 2x2 with a stride of 2. In the following example, [CONV 64]x2 means you perform the same convolution operation twice on the input image: the first time you pad the 224x224x3 input and convolve it with 64 different 3x3x3 filters to generate a 224x224x64 volume, and the second time you again pad the generated volume and convolve it with 64 different 3x3x64 filters to obtain a 224x224x64 output. You can interpret [CONV 128]x2, [CONV 256]x3, and [CONV 512]x3 in a similar way. The VGG network in the example is pretty large, with about 138 million parameters, but the simplicity of the architecture makes it quite appealing. You can tell the architecture is really quite uniform: a few convolutional layers followed by a pooling layer, which reduces the height and width. Also, if you look at the number of filters in the convolutional layers, it goes from 64 to 128, then 256, then 512, roughly doubling at every step. Doubling the number of filters through every stack of convolutional layers was another simple principle used to design this architecture.
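Here is a sketch of the convolutional part of VGG-16 built from a small helper, assuming the standard five-stack layout described above; the fully connected layers and softmax at the end are omitted:

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """[CONV out_ch] x num_convs: 3x3 'same' convolutions, then a 2x2 max pool."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16 convolutional trunk: 224x224x3 -> 7x7x512, doubling the filters each stack.
vgg16_features = nn.Sequential(
    vgg_block(3, 64, 2),     # [CONV 64]x2  -> 112x112x64 after pooling
    vgg_block(64, 128, 2),   # [CONV 128]x2 -> 56x56x128
    vgg_block(128, 256, 3),  # [CONV 256]x3 -> 28x28x256
    vgg_block(256, 512, 3),  # [CONV 512]x3 -> 14x14x512
    vgg_block(512, 512, 3),  # [CONV 512]x3 -> 7x7x512
)
```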

Very, very deep networks are difficult to train because of vanishing and exploding gradient problems. Residual networks use skip connections, which allow you to take the activation from one layer and feed it to another layer much deeper in the neural network. This enables you to train very deep networks, sometimes networks of over a hundred layers.

Here are two layers of a neural network, where you start off with some activation a^[i]. The first thing you do is apply a linear operator to it, governed by the equation z^[i+1]=W^[i+1]*a^[i]+b^[i+1], and after that you apply the ReLU non-linearity to get a^[i+1], governed by a^[i+1]=g(z^[i+1]). In the next layer you again apply the linear and ReLU steps, so two layers later the activation is a^[i+2]. In other words, for information from a^[i] to flow to a^[i+2], it needs to go through all of these steps, which I'll call the main path of this set of layers. With a residual net we make a change: we take a^[i], fast-forward it, and copy it much further along, into layer i+2 of the network, adding it just before the ReLU non-linearity in layer i+2 is applied. I'll call this the shortcut; rather than following the main path, the information from a^[i] can now take the shortcut to go much deeper into the neural network. What this means is that the equation a^[i+2]=g(z^[i+2]) goes away and we instead have a^[i+2]=g(z^[i+2]+a^[i]). Each node in layer i+2 applies a linear function and a ReLU non-linearity, so a^[i] is injected after the linear part and before the ReLU part.
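A minimal sketch of such a residual block in PyTorch; the equations above use generic linear layers, and here I use 3x3 same convolutions (as a typical ResNet does) so that z^[i+2] and a^[i] have matching shapes and can be added:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two linear+ReLU steps with a shortcut: a[i] is added to z[i+2] before the ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, a):
        z1 = self.conv1(a)            # z[i+1] = W[i+1] a[i] + b[i+1]
        a1 = self.relu(z1)            # a[i+1] = g(z[i+1])
        z2 = self.conv2(a1)           # z[i+2] = W[i+2] a[i+1] + b[i+2]
        return self.relu(z2 + a)      # a[i+2] = g(z[i+2] + a[i]), the shortcut

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)     # torch.Size([1, 64, 56, 56])
```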

Let's look at the following network. To turn the plain network into a residual network, you add these skip connections, or shortcuts, so that every two layers form a residual block. Here there are actually five residual blocks stacked together to form a residual network. It turns out that if you use standard optimization algorithms, such as gradient descent or fancier optimization algorithms, to train a plain network without the extra shortcuts, you find that as you increase the number of layers the training error tends to decrease for a while but then tends to go back up. In theory, as you make a neural network deeper it should only do better and better on the training set, but in reality, making a plain network very deep means your optimization algorithm has a much harder time training it, so in practice the training error gets worse. With a ResNet, however, even as the number of layers gets deeper the training error can keep going down. Taking these activations through shortcuts and letting them reach deeper layers really helps with the vanishing/exploding gradient problems, so ResNets allow you to train much deeper neural networks without an appreciable loss in performance.

So, why do ResNets work? Let's say you have an input X that you feed into some neural network, and it outputs some activation a^[i]. You're going to modify this network to make it a little deeper by adding a couple of layers, giving the output a^[i+2], and let's make these two additional layers a residual block with an extra shortcut. For the sake of argument, say that throughout this network we're using the ReLU activation function, so all the activations are greater than or equal to zero. Now we have a^[i+2] = g(z^[i+2]+a^[i]) = g(W^[i+2]*a^[i+1]+b^[i+2]+a^[i]). Notice something: if you're using L2 regularization or weight decay, that will tend to shrink the value of W^[i+2], and if W^[i+2] is equal to zero and, for the sake of argument, b is also equal to zero, then we have a^[i+2] = g(a^[i]) = a^[i], because here g(a^[i]) means the ReLU activation function applied to a non-negative quantity, so you just get back a^[i]. What this shows is that the identity function is easy for a residual block to learn; it's easy to get a^[i+2] equal to a^[i] because of the extra skip connection. And what that means is that adding these two layers to your network doesn't really hurt its ability to do as well as the simpler network without them.

But the goal is not just to avoid hurting performance: you can imagine that if the hidden units in the two additional layers actually learn something useful, then you can do even better than the identity function. What goes wrong in very deep plain nets without these skip connections is that as you make the network deeper and deeper, it becomes very difficult for it to choose parameters that learn even the identity function, which is why adding a lot of layers ends up making the result worse.

In terms of designing ConvNet architectures, one of the ideas that really helps is the 1x1 convolution. Let's say you convolve a 6x6x1 image with a 1x1 filter, as shown in the following slide: you end up just taking the image and multiplying each element by two (the value in the filter). So a convolution with a 1x1 filter doesn't seem terribly useful, since you just multiply by some number, but if you have a 6x6x32 image, then a convolution with a 1x1x32 filter does something that makes much more sense. In particular, what a 1x1 convolution does is look at each of the 36 positions, take the element-wise product between the 32 numbers in the input volume and the 32 numbers in the filter, and then apply a ReLU non-linearity. More generally, if you have not just one filter but multiple filters, you end up with an output of 6x6x#filters.

Here is an example illustrating why the 1x1 convolution is useful. Let's say you have a 28x28x192 input volume. If you want to shrink the height and width you can use a pooling layer, but what if you want to reduce the number of channels? You can take 32 filters of size 1x1x192, convolve them with the input volume, and obtain a 28x28x32 output volume. We'll see later how this idea of one-by-one convolutions lets you shrink, keep the same, or increase the number of channels, and this is very useful for building the inception network.
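A one-line check of that channel reduction, assuming PyTorch's channels-first layout:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)                # 28x28x192 input volume
shrink = nn.Conv2d(192, 32, kernel_size=1)     # 32 filters of size 1x1x192
y = torch.relu(shrink(x))
print(y.shape)                                 # torch.Size([1, 32, 28, 28])
```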

When designing a layer for a ConvNet, you might have to pick: do you want a 1x1 filter, or 3x3, or 5x5, or do you want a pooling layer? What the inception network says is: why not do them all? This makes the network architecture more complicated, but it also works remarkably well. For the sake of the example, say you have a 28x28x192 input volume. An inception layer says that instead of choosing what filter size you want in a convolutional layer, or even whether you want a convolutional layer or a pooling layer, let's do them all: you can use a 1x1 convolution that outputs a 28x28x64 volume, and maybe you also try a 3x3 same convolution that outputs a 28x28x128 volume, and you also try a 5x5 same convolution that outputs a 28x28x32 volume, and maybe you don't want a convolutional layer at all, so you apply pooling, which has some other output, say a 28x28x32 volume. Finally, you stack all the output volumes together to obtain the inception layer. With an inception module like this, you input one volume and output, in this case, if you add up all these numbers, 64+128+32+32=256, so the output is a 28x28x256 volume.

The basic idea is that instead of needing to pick which of these filter sizes or pooling you want and committing to that, you can do them all, just concatenate all the outputs, and let the network learn whatever parameters it wants and use whatever combinations of these filter sizes it wants.

It turns out there is a problem with this inception layer: computational cost. Let's say we have a 28x28x192 input volume and apply a 5x5 same convolution with 32 filters (each of size 5x5x192) to obtain a 28x28x32 output. The computational cost is the number of output values multiplied by the number of multiplications needed for each one, that is, 28x28x32 times 5x5x192, which is about 120 million.

If instead you first use a 1x1 convolution with 16 filters and then apply a 5x5 same convolution, as in the following slide, you shrink the channel size in the intermediate layer (also called the bottleneck layer) and reduce the computational cost: the total cost is the sum of the 1x1 convolution and the 5x5 same convolution, which is about 12.4 million.
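A quick back-of-the-envelope check of those two numbers, counting one multiplication per filter weight per output position:

```python
# Multiply counts for producing a 28x28x32 output from a 28x28x192 input:
direct = 28 * 28 * 32 * (5 * 5 * 192)      # 5x5 convolution applied directly
reduce_cost = 28 * 28 * 16 * (1 * 1 * 192) # 1x1 convolution down to 16 channels
conv5_cost = 28 * 28 * 32 * (5 * 5 * 16)   # 5x5 convolution on the bottleneck
print(f"{direct:,}")                       # 120,422,400  (~120 million)
print(f"{reduce_cost + conv5_cost:,}")     # 12,443,648   (~12.4 million)
```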

Basically, the inception module takes the activation output from the previous layer. The example we worked through in depth in the previous slide is a 1x1 convolution followed by a 5x5 same convolution on that activation, giving a 28x28x32 output volume. To save computation you can likewise do a 1x1 convolution followed by a 3x3 same convolution, outputting a 28x28x128 volume, and you may also want a plain 1x1 convolution, say outputting 28x28x64. Finally there is the pooling branch, where we do something slightly unusual: in order to concatenate all the outputs at the end, we use 'same' padding for the pooling so the output height and width is still 28x28, and then follow it with a 1x1 convolution to shrink the number of channels, giving a 28x28x32 volume. Finally, you take all of these output blocks and do channel concatenation.
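A sketch of this module in PyTorch; the output widths (64, 128, 32, 32) follow the example above, while the intermediate 1x1 width in the 3x3 branch (96 here) is an illustrative choice, and only the 16-channel bottleneck in the 5x5 branch comes from the earlier cost example:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel
    dimension: 28x28x(64+128+32+32) = 28x28x256 for a 28x28x192 input."""
    def __init__(self, in_ch=192):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)        # 1x1 -> 64
        self.branch3 = nn.Sequential(                             # 1x1 then 3x3 -> 128
            nn.Conv2d(in_ch, 96, kernel_size=1), nn.ReLU(),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                             # 1x1 then 5x5 -> 32
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(                         # 'same' pool, then 1x1 -> 32
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat([torch.relu(o) for o in outs], dim=1)    # channel concatenation

x = torch.randn(1, 192, 28, 28)
print(InceptionModule()(x).shape)   # torch.Size([1, 256, 28, 28])
```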

What the inception network does is put a lot of these inception modules together. You'll notice there are a lot of repeated blocks, and if you look at one of them, it is basically the inception module. It turns out there is one last detail to the inception network: there are additional side branches. The last few layers of the network are fully connected layers followed by a softmax layer that makes a prediction; what these side branches do is take some hidden layer and pass it through a few fully connected layers followed by a softmax layer to make a prediction as well. This helps ensure that the features computed in the hidden units, even at intermediate layers, are not too bad for predicting the output class of an input image, and it appears to have a regularizing effect that helps prevent the network from overfitting.

If you're building a computer vision application, rather than training the weights from scratch from random initialization, you often make much faster progress if you download weights that someone else has already trained for a network architecture and use that as pretraining, transferring it to the new task you're interested in. Let's say you're building a cat detector that recognizes your own pet cat. Say your cats are named Tigger and Misty; then you have a classification problem with three classes: is this input picture Tigger, Misty, or neither? You probably don't have a lot of pictures of Tigger or Misty, so your training set will be small. I recommend you go online and download some open source implementation of a neural network, and download not just the code but also the weights. There are a lot of networks you can download that have been trained on, for example, the ImageNet (www.image-net.org) dataset, which has a thousand different classes, so the network might have a softmax unit that outputs one of a thousand possible classes. What you can do is get rid of that softmax layer and create your own softmax unit that outputs Tigger, Misty, or neither. When you have a small training dataset, you can freeze the parameters of all the earlier layers and just train your own softmax layer. But if you have a larger training dataset you can freeze fewer layers; the more data you have for your task, the more layers you can train.
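As a concrete (hypothetical) example, here is how you might do this in PyTorch, with torchvision's pretrained ResNet-18 standing in for the downloaded network; the article doesn't name a specific architecture, and the weights enum assumes a recent torchvision:

```python
import torch.nn as nn
import torchvision.models as models

# Hypothetical Tigger/Misty/neither classifier built on an ImageNet-pretrained network.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)   # download the weights

for param in model.parameters():                # freeze all the earlier layers
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 3)   # new 3-way output layer (trainable)

# With a larger training set, unfreeze some of the later layers as well, e.g.:
# for param in model.layer4.parameters():
#     param.requires_grad = True
```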

Most computer vision problems could use more data, so data augmentation is one of the techniques for improving the performance of computer vision systems; in practice, for almost all computer vision tasks, having more data helps. Commonly used data augmentation techniques include horizontally flipping a picture, for example the one on the left in the following slide, or random cropping to get an image like the one on the right; in theory, you could also use rotation, shearing, or other distortions of your image, introducing some local warping, and so on.

The second type of data augmentation that is commonly used is color shifting: given the original picture, you can add different distortions to the R, G, and B channels.
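A minimal torchvision sketch combining the flipping, cropping, and color-shifting augmentations above; the specific crop size and jitter strengths are illustrative choices:

```python
import torchvision.transforms as T

# Training-time augmentation pipeline covering the techniques described above.
train_transform = T.Compose([
    T.RandomHorizontalFlip(),                       # horizontal mirroring
    T.RandomResizedCrop(224),                       # random cropping
    T.ColorJitter(brightness=0.4, contrast=0.4,     # color shifting / distortion
                  saturation=0.4, hue=0.1),
    T.ToTensor(),
])
```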

We will move on to object detection in the next talk.
