Summary - ImageNet Classification with Deep Convolutional Neural Networks
Based on the paper "ImageNet Classification with Deep Convolutional Neural Networks". You can find it at https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
This article is a summary of the paper published by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton in 2012. The main objective of the paper was the presentation of a deep convolutional neural network designed to classify 1.2 million high-resolution images into 1,000 different classes. The design and implementation of this neural network was part of a contest (the ImageNet Large Scale Visual Recognition Challenge), in which it obtained the best results reported up to that moment on the provided training data set.
The designers of this network faced two main challenges: the size of the training data set and the training and response times. Convolutional networks have a large learning capacity and, thanks to their architecture, are flexible in terms of input data; compared with standard feedforward networks of similar size, they have fewer connections and parameters, which makes them easier to train.
The training data set is a subset of ImageNet (over 15 million labeled images in over 22,000 categories). This subset consists of approximately 1.2 million training images labeled with 1,000 categories, plus 50,000 validation images and 150,000 test images. The images were rescaled to a fixed size of 256 x 256 in the RGB color model.
The CNN architecture implemented consisted of:
Activation:
The activation function used in the different layers of the CNN is the ReLU, since training times with this function are significantly faster than with saturating non-linearities such as the hyperbolic tangent (tanh).
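As a minimal illustration (using PyTorch purely as a modern stand-in; the paper used its own GPU implementation), the ReLU is simply f(x) = max(0, x), while tanh saturates for large inputs:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 96, 55, 55)   # a batch of hypothetical feature maps

relu = nn.ReLU()                 # f(x) = max(0, x): no saturation for positive inputs
tanh = nn.Tanh()                 # saturating non-linearity the paper compares against

y_relu = relu(x)                 # gradients stay useful for all positive activations
y_tanh = tanh(x)                 # gradients vanish as |x| grows
```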
1. Using multiple GPUs:
Because of the size of the training data set, the authors opted to work with two GPUs in parallel, communicating only in certain layers and without going through host memory. For example, the layer 3 filters take as input all the layer 2 responses from both GPUs, which increases processing efficiency while still mixing information across the two halves of the network.
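The sketch below illustrates the idea of that cross-GPU connection, assuming PyTorch and two devices (the paper used a custom cuda-convnet implementation; the device names, shapes and padding here are illustrative assumptions, and pooling is omitted for brevity):

```python
import torch
import torch.nn as nn

# Conceptual two-GPU split; falls back to CPU if fewer devices are available.
dev0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dev1 = torch.device("cuda:1") if torch.cuda.device_count() > 1 else dev0

# Layer 2: each GPU holds half of the 256 kernels and sees only its own half.
conv2_a = nn.Conv2d(48, 128, kernel_size=5, padding=2).to(dev0)
conv2_b = nn.Conv2d(48, 128, kernel_size=5, padding=2).to(dev1)

# Layer 3: 384 kernels that look at all 256 layer-2 maps, so the two halves
# must be gathered onto one device before the convolution (the "crossed" step).
conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1).to(dev0)

x_a = torch.randn(1, 48, 27, 27, device=dev0)   # hypothetical layer-1 outputs
x_b = torch.randn(1, 48, 27, 27, device=dev1)

y_a, y_b = conv2_a(x_a), conv2_b(x_b)
y = torch.cat([y_a, y_b.to(dev0)], dim=1)       # cross-GPU communication
z = conv3(y)                                    # 384 maps computed from both halves
```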
2. Local Response Normalization:
Although the ReLU function does not require input normalization to avoid saturation, the authors decided to use local response normalization to help generalization. Normalization is applied only in certain layers, after the activation function.
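A minimal sketch of this local response normalization, using PyTorch's built-in layer with the constants reported in the paper (k = 2, n = 5, alpha = 1e-4, beta = 0.75); the surrounding shapes are assumptions:

```python
import torch
import torch.nn as nn

# Local response normalization with the constants reported in the paper,
# applied after the ReLU of certain layers.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

activations = torch.relu(torch.randn(1, 96, 55, 55))   # hypothetical layer-1 output
normalized = lrn(activations)
```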
3. Overlapping Pooling:
Traditional pooling schemes do not overlap their windows. In this network the pooling windows are made to overlap (the window is larger than the stride), which slightly reduces the error rate and makes the network a little harder to overfit.
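A small sketch of the difference, assuming PyTorch: traditional pooling uses a stride equal to the window size, while the paper's pooling uses a 3 x 3 window with a stride of 2, so neighbouring windows overlap:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                        # hypothetical feature maps

traditional = nn.MaxPool2d(kernel_size=2, stride=2)   # non-overlapping: stride = window
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # the paper's choice: stride < window

print(traditional(x).shape)   # torch.Size([1, 96, 27, 27])
print(overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
```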
1. General architecture.
Figure 1 of the paper shows the following layers (a minimal code sketch of the full stack is given after this list):
· The input image is 224 x 224 with 3 channels (RGB).
· The network is split across two GPUs working in parallel, with half of the kernels (neurons) placed on each one.
· The first convolution layer filters the input with 96 kernels (48 on each GPU) of size 11 x 11 x 3 and a stride of 4 pixels, followed by response normalization and overlapping max-pooling.
· The second convolution layer has 256 kernels (distributed like the previous layer) of size 5 x 5 x 48, again followed by response normalization and max-pooling.
· The third convolution layer has 384 kernels of size 3 x 3 x 256; here the information is crossed between the GPUs, since it takes input from both halves of the second layer. It includes no normalization or pooling.
· The fourth convolution layer has 384 kernels of size 3 x 3 x 192, without normalization or pooling.
· The fifth convolution layer has 256 kernels of size 3 x 3 x 192, followed by max-pooling.
· Finally, there are two fully connected layers with 4,096 neurons each, split in half across the two GPUs.
· The output layer is a 1000-way softmax, with one unit per class.
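The sketch below strings these layers together on a single device, ignoring the two-GPU split; PyTorch and the padding values are assumptions borrowed from common re-implementations, not taken from the paper itself:

```python
import torch
import torch.nn as nn

# Single-device sketch of the layer stack described above.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # layer 1
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # layer 2
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # layer 3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # layer 4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # layer 5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                            # fully connected 1
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                   # fully connected 2
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                            # 1000-way output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
logits = model(torch.randn(1, 3, 224, 224))   # -> shape [1, 1000]
```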
2. Overfitting
To reduce overfitting, two additional techniques were applied: data augmentation and dropout.
Data Augmentation
This technique enlarges the training data set by generating new images that retain the original labels. The first form of data augmentation consists of image translations and horizontal reflections: 224 x 224 patches (and their horizontal reflections) are extracted from the original 256 x 256 images, which increases the size of the training set by a factor of 2048.
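A hedged sketch of this crop-and-flip augmentation using torchvision transforms (the library and the exact pipeline are assumptions; the paper extracted the patches with its own code):

```python
from torchvision import transforms

# Random 224 x 224 crops plus horizontal reflections of 256 x 256 images.
train_transform = transforms.Compose([
    transforms.Resize(256),                  # images are stored at 256 x 256
    transforms.RandomCrop(224),              # random 224 x 224 patch
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal reflection
    transforms.ToTensor(),
])
```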
The second form of data augmentation alters the intensities of the RGB channels by adding multiples of their principal components, scaled by the corresponding eigenvalues and a random variable drawn for each image.
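A sketch of this color augmentation in NumPy, assuming the eigenvalues and eigenvectors of the RGB covariance have already been computed over the training set (the function name and the clipping are illustrative assumptions):

```python
import numpy as np

def pca_color_jitter(image: np.ndarray, eigvals: np.ndarray,
                     eigvecs: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Add multiples of the RGB principal components to every pixel.

    `image` is H x W x 3 in [0, 1]; `eigvals` (3,) and `eigvecs` (3 x 3, one
    component per column) are assumed to come from a PCA over the RGB values
    of the whole training set. The alphas are drawn once per image.
    """
    alphas = np.random.normal(0.0, sigma, size=3)   # alpha_i ~ N(0, 0.1)
    delta = eigvecs @ (alphas * eigvals)            # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return np.clip(image + delta, 0.0, 1.0)         # same shift applied to every pixel
```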
Dropout
During training, the output of each neuron in the first two fully connected layers is set to zero with probability 0.5 ("dropped out"), so the network cannot rely on the presence of any particular neuron. Without this technique, the network exhibits substantial overfitting; with it, the number of iterations required to converge roughly doubles.
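A minimal dropout sketch in PyTorch; note that PyTorch uses "inverted" dropout (scaling the surviving activations during training) rather than halving outputs at test time as the paper does, but the two are equivalent in expectation:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the values are zero; survivors are scaled by 2

drop.eval()
print(drop(x))   # at test time all neurons participate, as in the paper
```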
Conclusion
The paper "ImageNet Classification with Deep Convolutional Neural Networks" presents a convolutional neural network design together with several techniques that reduce training time and overfitting, and therefore the associated error.
Personal notes
The implementation of this CNN is a beautiful example of several techniques used in machine learning, such as convolution, pooling and optimization, combined with an emphasis on reaching high accuracy efficiently. The handling of multiple GPUs is a nice way to take advantage of parallel hardware.