Convolutional Neural Networks: How Artificial Intelligence Sees
Juan David Tuta Botero
Data Science | Machine Learning | Artificial Intelligence
Recently, the idea of using artificial intelligence to analyze images and videos has become a trending topic, even among people with no connection to machine learning or software development at all. It is common to hear conversations about self-driving cars, deepfake videos, bots able to detect diseases, and so on. But even once we understand neural networks and how to optimize them, analyzing an image or a video is quite a challenge: each input to the NN (Neural Network) is a pixel, and if the image is in color, each pixel carries 3 values. So not only do we face the problem of finding a big dataset to work with, but each example in the data consists of thousands of inputs, which demands enormous computing power. The solution to this problem was found using CNNs (Convolutional Neural Networks), and that is exactly what this article is about; more specifically, it covers a paper published in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, researchers from the University of Toronto, who participated in the ImageNet LSVRC-2010 contest, where they achieved the best top-1 and top-5 error rates.
Convolutional neural network
A CNN is a type of artificial neural network, trained with supervised learning, whose layers process their inputs in a way that imitates the visual cortex of the human brain, identifying different characteristics in the input that ultimately make the network able to recognize objects and "see". To do this, the CNN contains several specialized hidden layers arranged in a hierarchy: the first layers detect lines and curves, and the layers grow more specialized until the deepest ones recognize complex shapes such as a face or the silhouette of an animal.
The filters applied at each layer are known as kernels. Each kernel is a matrix of a specific size, corresponding to a hyperparameter the user can choose, and it works by transforming the input as shown in the figure.
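As a minimal sketch of how a kernel transforms its input, here is a plain NumPy convolution over a grayscale image. The 3×3 vertical-edge kernel below is an illustrative choice, not one from the paper, and the sliding-window loop is for clarity, not speed:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Valid (no-padding) 2D convolution of a grayscale image with a kernel.
    (Strictly speaking this is cross-correlation, which is what most
    deep-learning frameworks compute under the name 'convolution'.)"""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A vertical-edge kernel applied to an image that is dark on the left,
# bright on the right: every output entry flags the vertical edge.
image = np.array([[0, 0, 10, 10],
                  [0, 0, 10, 10],
                  [0, 0, 10, 10],
                  [0, 0, 10, 10]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(convolve2d(image, kernel))
```

Stacking many such kernels, each learned rather than hand-designed, is what lets the early layers pick up lines and curves.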
Normalization
In the paper, the authors use an activation function called ReLU. But why talk about the activation function in a section on normalization? Normalization is a procedure where we take the data and try to make the distribution of its points uniform. That way, when we run the optimization, the derivatives from backward propagation can move smoothly through the hyperspace that forms the error function; if the normalization step is skipped, the gradient vectors can start to diverge because of the shape of that surface and fail to find the point of minimum loss.
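To make "uniforming the distribution" concrete, here is a sketch of the standard zero-mean, unit-variance normalization applied per feature. The toy data and function name are illustrative, not from the paper:

```python
import numpy as np

def standardize(x):
    """Shift each feature (column) to zero mean and scale it to unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Two features on wildly different scales: without normalization, the
# second feature would dominate the gradient steps.
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])
normed = standardize(data)
```

After this step both features contribute on the same scale, which is why gradient descent traverses the error surface more smoothly.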
The ReLU activation function has the desirable property that it does not require input normalization to prevent saturation: if at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, the authors still found that the following local normalization scheme aids generalization.
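The scheme referred to is the paper's local response normalization: each activation is divided by a term summed over n adjacent kernel maps at the same spatial position. Below is a sketch using the paper's constants (k = 2, n = 5, α = 10⁻⁴, β = 0.75); the random activations stand in for real ReLU outputs:

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize activations a of shape (channels, height, width) across
    the channel axis, following Krizhevsky et al. (2012):
    b_i = a_i / (k + alpha * sum_{j near i} a_j^2) ** beta."""
    num_channels = a.shape[0]
    b = np.empty_like(a)
    for i in range(num_channels):
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# ReLU outputs for 96 kernel maps on a small 8x8 feature map
acts = np.maximum(0, np.random.randn(96, 8, 8))
normed = local_response_norm(acts)
```

Because the denominator is always greater than 1 here, each activation shrinks in proportion to the activity of its neighboring kernel maps, a kind of lateral inhibition.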
Procedure
Well, now that we have some idea of how a convolutional neural network works differently from a traditional neural network, let's talk a little about what they built. The net contains 8 layers with weights: the first 5 are convolutional and the other 3 are fully connected. The output of the last fully-connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels.
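The 1000-way softmax mentioned above turns the final layer's raw scores into a probability distribution over the classes. A minimal, numerically stable sketch (using a toy 3-class score vector instead of 1000):

```python
import numpy as np

def softmax(z):
    """Stable softmax: subtract the max before exponentiating so large
    scores cannot overflow; the result sums to 1 over the classes."""
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs of the last layer
probs = softmax(scores)             # e.g. the predicted class is argmax
```

The predicted label is simply the class with the highest probability, and the top-5 prediction is the five highest.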
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map).
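A handy formula implied by this description: for input width W, kernel size K, padding P, and stride S, the output width is (W − K + 2P)/S + 1. (A widely noted quirk of the paper is that the stated 224×224 input does not divide cleanly; the arithmetic works out with a 227×227 input, which gives the layer's 55×55 output maps.)

```python
def conv_output_size(w, k, p, s):
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# First layer: 227x227 input, 11x11 kernel, no padding, stride 4 -> 55
print(conv_output_size(227, 11, 0, 4))
```

The same formula lets you check the neuron counts quoted for the remaining layers.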
An illustration of the architecture of the CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer parts at the top of the figure while the other runs the layer parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.
Results
The network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%. The best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2%, with an approach that averages the predictions produced by six sparse-coding models trained on different features; since then, the best published results are 45.7% and 25.7%, with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features.
The researchers also entered a version of the model in the ILSVRC-2012 competition, held two years after the contest discussed in this article. Because of the new conditions of that contest, they could not report results for every model they tried, but the interesting point is that they reached really good loss levels even though they had far less machine power at their disposal than their challengers.
Finally, they also report their error rates on the Fall 2009 version of ImageNet, with 10,184 categories and 8.9 million images. On this dataset they follow the convention in the literature of using half of the images for training and half for testing. Since there is no established test set, their split necessarily differs from the splits used by previous authors, but this does not affect the results appreciably. Their top-1 and top-5 error rates on this dataset are 67.4% and 40.9%, attained by the net described above but with an additional, sixth convolutional layer over the last pooling layer. The best published results on this dataset are 78.1% and 60.9%.
Comparison of error rates on the ILSVRC-2012 validation and test sets. In italics are the best results achieved by others. Models with an asterisk (*) were "pre-trained" to classify the entire ImageNet 2011 Fall release.
Eight ILSVRC-2010 test images and the five labels their model considered most probable. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5).
Conclusions
Their results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. It is notable that the network's performance degrades if a single convolutional layer is removed. To simplify their experiments, they did not use any unsupervised pre-training, even though they expect it would help, especially once enough computational power is available to significantly increase the size of the network without a corresponding increase in the amount of labeled data.
Personal Notes
Looking at what the article presents, we can clearly see the foundation of a new technology and the development of convolutional neural networks. At that moment, image recognition was something really hard to accomplish, but these days it is being integrated into countless applications, and it is not hard to believe that a few years from now it will be part of our day-to-day lives.