Going Deeper with Convolutions (Inception | GoogLeNet)
1. What is an Inception model?
Inception is an image recognition model that has been shown to attain greater than 78.1% accuracy on the ImageNet dataset. The model is the culmination of many ideas developed by multiple researchers over the years.
The model comprises symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, concatenations, dropouts, and fully connected layers. Batch normalization is used extensively throughout the model and applied to activation inputs. Loss is computed using Softmax.
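As a quick way to try the model, here is a minimal sketch that runs a pretrained Inception-v3 from torchvision; the torchvision weights and the `weights` API (torchvision ≥ 0.13) are assumptions about the reader's setup, not part of the original papers.

```python
import torch
from torchvision import models

# Load a pretrained Inception-v3 (the `weights` API needs torchvision >= 0.13).
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.eval()  # disables the auxiliary classifier output used during training

x = torch.randn(1, 3, 299, 299)  # stand-in for a preprocessed 299x299 image batch
with torch.no_grad():
    logits = model(x)  # shape: [1, 1000] ImageNet class scores
probs = torch.softmax(logits, dim=1)  # predictions go through softmax
print(probs.argmax(dim=1))
```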
2. Inception V1:
Stacking many deep convolutional layers in a model caused it to overfit the data. To avoid this, the Inception-V1 model applies multiple filters of different sizes at the same level. Thus, instead of stacking deep layers, the Inception models use parallel layers, making the network wider rather than deeper.
The Inception module depicted above simultaneously performs 1×1 convolutions, 3×3 convolutions, 5×5 convolutions, and a 3×3 max pooling operation.
Thereafter, it concatenates the outputs of all these operations into a single feature map that feeds the next stage. The architecture therefore does not follow the sequential approach in which every operation, such as pooling or convolution, is performed one after the other.
The Inception module with dimension reduction works in a similar manner to the naïve one, with one difference: 1×1 convolutions are applied before the 3×3 and 5×5 convolutions. A 1×1 convolution leaves the spatial dimensions of the feature map unchanged but reduces the number of channels, which cuts the computational cost of the larger convolutions that follow while preserving accuracy.
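For concreteness, here is a minimal PyTorch sketch of an Inception module with dimension reduction; the per-branch channel counts in the usage example are illustrative (borrowed from GoogLeNet's inception(3a) block), not prescribed by this article.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Conv2d(in_ch, ch1x1, kernel_size=1)
        # Branch 2: 1x1 channel reduction, then 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, ch3x3red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3red, ch3x3, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 channel reduction, then 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, ch5x5red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5red, ch5x5, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # Padding keeps the spatial size identical across branches,
        # so the outputs can be concatenated along the channel axis.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Example with channel counts borrowed from GoogLeNet's inception(3a) block.
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = m(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28]) -> 64 + 128 + 32 + 32 channels
```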
Inception architecture:
3. Inception-V2:
In the Inception-V2 architecture, the 5×5 convolution is replaced by two stacked 3×3 convolutions. This reduces computation time, because a 5×5 convolution is about 2.78 times more expensive than a 3×3 convolution (25 weights per filter versus 9, and 25/9 ≈ 2.78), while the two stacked 3×3 convolutions cover the same 5×5 receptive field with only 18 weights. Using two 3×3 layers instead of a 5×5 layer thus improves the performance of the architecture.
This architecture also factorizes n×n convolutions into a 1×n convolution followed by an n×1 convolution. As discussed above, a 3×3 convolution can be converted into a 1×3 convolution followed by a 3×1 convolution, which is 33% cheaper in terms of computational complexity than the full 3×3 (6 weights per filter instead of 9); see the sketch below.
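A small PyTorch sketch makes the savings concrete; the channel counts (64 in, 64 out) are arbitrary assumptions used only to count parameters.

```python
import torch.nn as nn

in_ch, out_ch = 64, 64

# 5x5 convolution vs. two stacked 3x3 convolutions (same receptive field)
conv5x5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2, bias=False)
two_3x3 = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
)

# 3x3 convolution vs. 1x3 followed by 3x1 (asymmetric factorization)
conv3x3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
asym = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1), bias=False),
    nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1), padding=(1, 0), bias=False),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(conv5x5), n_params(two_3x3))  # 102400 vs 73728 (~28% fewer)
print(n_params(conv3x3), n_params(asym))     # 36864 vs 24576  (~33% fewer)
```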
To deal with the problem of the representational bottleneck, the filter banks of the module were expanded (made wider) instead of the module being made deeper. This prevents the loss of information that occurs when depth is increased instead.
4. Inception-V3:
Inception-v3 mainly focuses on consuming less computational power by modifying the previous Inception architectures. The idea was proposed in the paper Rethinking the Inception Architecture for Computer Vision, published in 2015 and co-authored by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jonathon Shlens.
Inception-v3 architecture:
Inception-v3 is similar to Inception-v2, with some updates to the loss function, the optimizer, and batch normalization.
What's new?
These are the main updates in Inception-v3 with respect to Inception-v2:
- The RMSProp optimizer is used for training.
- The 7×7 convolutions in the stem are factorized into stacks of 3×3 convolutions.
- Batch normalization is applied in the auxiliary classifier.
- Label smoothing is added as a regularizing component of the loss function to prevent overfitting (see the sketch after this list).
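As a concrete illustration of the label smoothing update, here is a minimal PyTorch sketch; it relies on the `label_smoothing` argument of `nn.CrossEntropyLoss` (available since PyTorch 1.10), and the batch size is an arbitrary assumption (ε = 0.1 is the value used in the paper).

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # epsilon = 0.1 as in the paper

logits = torch.randn(8, 1000)           # batch of 8, 1000 ImageNet classes
targets = torch.randint(0, 1000, (8,))  # hard integer labels
loss = criterion(logits, targets)       # targets are smoothed internally toward
                                        # a uniform distribution over the classes
print(loss.item())
```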
5. Inception-v4:
In Inception-v4 the network was made deeper, the stem (the initial part of the Inception architecture, before the Inception blocks) was changed, and uniform choices were made for the Inception blocks.
What's new?
- A modified, more elaborate stem.
- Uniform Inception blocks: three module families (Inception-A, Inception-B, and Inception-C).
- Dedicated reduction blocks (Reduction-A and Reduction-B) that change the grid size between the module families.
6. Inception-ResNet v2:
Inspired by the performance of ResNet, residual connections are introduced into the Inception modules.
The input and the concatenated output after several operations must have the same dimensions, so padding is applied in each operation, and a 1×1 convolution is applied at the end to make the number of channels match, as shown in the sketch below.
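Here is a minimal PyTorch sketch of a residual Inception block in this spirit; the branch structure and channel counts are illustrative (loosely modeled on an Inception-ResNet-A block), not the exact Inception-ResNet-v2 specification.

```python
import torch
import torch.nn as nn

class InceptionResBlock(nn.Module):
    def __init__(self, in_ch=320):
        super().__init__()
        # Parallel branches; padding keeps the spatial size unchanged.
        self.b1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
        )
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1),
            nn.Conv2d(32, 48, kernel_size=3, padding=1),
            nn.Conv2d(48, 64, kernel_size=3, padding=1),
        )
        # Final 1x1 convolution restores the input channel count so the
        # residual addition is dimensionally valid.
        self.project = nn.Conv2d(32 + 32 + 64, in_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        branches = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.relu(x + self.project(branches))  # residual (skip) connection

block = InceptionResBlock(320)
print(block(torch.randn(1, 320, 35, 35)).shape)  # torch.Size([1, 320, 35, 35])
```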