Unraveling the Power of Vision: A Deep Dive into the Different Types of CNNs
Nourhan Moustafa
British Council Women in STEM Scholarship Awardee 2022/2023 | AI/ML Applied Researcher | Data Science Enthusiast | STEM Ambassador 100+ hrs of Engagement @ STEM Learning UK
Types of CNN Architectures:
Welcome to the fascinating world of convolutional neural networks (CNNs)! In today's ever-evolving tech landscape, where visual data is abundant and invaluable, understanding the diverse types of CNNs is like holding the keys to unlocking the limitless potential of image analysis and recognition. These specialised neural networks have reshaped industries, from healthcare to autonomous vehicles, and are at the forefront of cutting-edge innovations. So, let's embark on a journey through the unique architectures and applications of various CNN types, where each one plays a distinct role in transforming the way we see and understand the visual world.
1. ResNet
The Residual Network (ResNet), introduced by He et al. in 2015 and published at CVPR 2016, emerged as a milestone in deep learning, winning the ILSVRC 2015 challenge by slashing the top-5 error rate to 3.6% from the 6.7% achieved the previous year. An ensemble combining ResNet with models such as GoogLeNet Inception pushed the error rate down to roughly 3.0% in the 2016 contest.
The defining attribute of ResNet lies in its identity skip connections within residual blocks, enabling the training of very deep CNN architectures. In a residual block, the input is added to its transformation via a direct link that bypasses the transformation layers, thus forming the "skip identity connection". This split of the transformation function into an identity term and a residual term effectively simplifies the learning of residual feature maps, making the training of very deep models more stable.
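To make the idea concrete, here is a minimal sketch of a residual block in PyTorch. The class name, channel counts, kernel sizes, and the absence of downsampling are simplifying assumptions, not the exact ResNet configuration; the point is only that the transformation branch F(x) is computed and the original input x is added back before the final activation:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Residual block sketch: output = ReLU(F(x) + x), with F as two 3x3 convolutions."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                                  # the skip (identity) connection
        out = self.relu(self.bn1(self.conv1(x)))      # first weight layer + BN + ReLU
        out = self.bn2(self.conv2(out))               # second weight layer + BN
        return self.relu(out + identity)              # add the input back, then activate
```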
As presented in the figure below, and similar to GoogLeNet's Inception module, ResNet comprises multiple stacked residual blocks. The model which clinched the ILSVRC title had 152 weight layers, about eight times deeper than VGGNet-19. Importantly, it demonstrated that residual connections were critical to the improved accuracy of such deep networks, as networks without these connections exhibited higher error rates.
ResNet follows each weight layer in the residual block with batch normalization and a ReLU activation layer. Notably, He et al. later demonstrated a shift from this "post-activation" design to a "pre-activation" one, where the normalization and ReLU layers precede the weight layers. This "unhindered" identity connection further strengthens deep networks' feature learning capacity. In fact, the modified design allowed the training of networks with a depth of 200 layers without overfitting, an improvement over the original ResNet design.
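A corresponding sketch of the pre-activation ordering, under the same simplifying assumptions as the block above, moves batch normalization and ReLU ahead of each convolution so the identity path carries the input through completely untouched:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block sketch: BN and ReLU precede each weight layer."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))      # BN -> ReLU -> conv
        out = self.conv2(self.relu(self.bn2(out)))    # BN -> ReLU -> conv
        return x + out                                # identity path stays untouched
```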
2. LeNet-5
LeNet-5, proposed by Yann LeCun and colleagues in 1998, is one of the earliest CNNs. It was used primarily for digit recognition tasks, such as recognizing digits for postal mail sorting, and was trained on the MNIST dataset. It is called LeNet-5 because it is composed of five weight layers overall. As shown in the figure below, the architecture comprises a pair of convolutional layers, each immediately followed by a subsampling (average-pooling) layer for feature extraction. A further convolutional layer follows, trailed by two fully connected layers towards the end of the model; these function as classifiers, analyzing the features that have been extracted. LeNet-5 also influenced the design of later CNN architectures.
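Below is a rough LeNet-5-style sketch in PyTorch for 32x32 grayscale inputs (MNIST images are commonly padded from 28x28 to 32x32). It follows the layer sequence described above, but the original network used sigmoid/tanh-style nonlinearities and trainable subsampling rather than modern pooling layers, so treat this as an approximation rather than a faithful reproduction:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 32x32 -> 28x28
            nn.AvgPool2d(kernel_size=2),                   # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # 14x14 -> 10x10
            nn.AvgPool2d(kernel_size=2),                   # 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # 5x5 -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of one 32x32 digit image produces 10 class scores.
logits = LeNet5()(torch.randn(1, 1, 32, 32))   # logits.shape == (1, 10)
```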
3. AlexNet
AlexNet, a seminal work by Krizhevsky et al. in 2012, played a vital role in revitalizing the application of deep neural networks, particularly CNNs, in the field of image processing and computer vision. Its remarkable performance in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) set a new precedent and encouraged the extensive adoption of CNNs in subsequent competitions.
As shown in the figure below, AlexNet comprised eight layers with tunable parameters: five convolutional and three fully connected. This increased depth significantly expanded the number of adjustable parameters compared to preceding architectures, enhancing the model's capacity to learn complex features.
Another feature that set AlexNet apart from its predecessors was its implementation of regularization techniques, including the introduction of dropout after the first two fully connected layers and the use of data augmentation. The deployment of dropout led to a decrease in overfitting, thereby improving the model's generalization capabilities when dealing with unseen data.
The output layer of the AlexNet model, also a fully connected layer, was designed for classifying input images into one of the thousand categories from the ImageNet dataset. Thus, it consisted of 1,000 units.
Furthermore, AlexNet employed the Rectified Linear Unit (ReLU) nonlinearity after each convolutional and fully connected layer. This was a significant advancement over the traditionally used tanh function, primarily because it substantially enhanced the efficiency of the training process. This combination of depth, novel regularization methods, and the usage of ReLU nonlinearity contributed to the exceptional performance and impact of the AlexNet architecture.
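The condensed sketch below ties these pieces together: five convolutional layers with ReLU, three fully connected layers with dropout applied in the first two, and a 1,000-way output. It follows the single-branch layout used by modern reimplementations (such as torchvision's) rather than the original two-GPU, two-branch arrangement, so the exact channel counts should be read as an approximation:

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # 1,000-way ImageNet output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# A 224x224 RGB input yields a (batch, 1000) tensor of class scores.
scores = AlexNetSketch()(torch.randn(1, 3, 224, 224))   # scores.shape == (1, 1000)
```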
4. VGGNet
Introduced by Simonyan and Zisserman in 2014, the VGGNet architecture quickly gained recognition as one of the most influential CNNs. While it did not claim victory in the ILSVRC'14 classification task, its inherent simplicity and its use of small convolutional kernels to build considerably deeper networks contributed to its enduring popularity.
Despite this simplicity, the VGGNet architecture offered remarkable results. The network strictly utilised 3x3 convolutional kernels in tandem with intermediate max-pooling layers for feature extraction, followed by three fully connected layers at the end for the classification task. In this architecture, every convolutional layer was succeeded by a Rectified Linear Unit (ReLU) layer.
As presented in the figure below, the strategic choice of smaller kernels led to fewer parameters, thereby enhancing efficiency in both training and testing. Furthermore, stacking multiple 3x3 kernels enlarged the effective receptive field (two layers simulate a 5x5 kernel, three layers a 7x7 kernel). This allowed the network to gain the benefits of a larger receptive field while maintaining the advantages of smaller convolutions. Notably, using smaller filters enabled the construction of deeper networks, thereby enhancing performance on visual tasks. This insight underscores the key tenet of VGGNet: the use of deeper networks for improved feature learning.
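A quick check of the parameter argument, assuming C input and C output channels and ignoring biases: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution while using fewer weights, and they add an extra ReLU nonlinearity in between:

```python
import torch.nn as nn

C = 64  # an assumed channel count

# Two stacked 3x3 convolutions: effective 5x5 receptive field.
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
)
# A single 5x5 convolution with the same receptive field.
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)

def count_weights(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(count_weights(stacked_3x3), count_weights(single_5x5))
# 2 * (3*3*C*C) = 73,728 weights versus 5*5*C*C = 102,400 weights for C = 64
```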
VGGNet-16, one of the best-performing configurations, consisted of 138 million parameters. Emulating AlexNet, the VGGNet architecture also applied dropout in the first two fully connected layers to mitigate overfitting, further bolstering the model's capacity to generalize to unseen data (Khan et al., 2018).
5. GoogLeNet
GoogLeNet, introduced by Szegedy et al. in 2014 and published at CVPR 2015, emerged as a groundbreaking model that ventured into an intricate architecture incorporating multiple network branches. Its innovative design, coupled with a top-5 error rate of 6.7%, clinched victory in the ILSVRC'14 competition. The discussion here is focused on this iteration of GoogLeNet.
As illustrated in the figure below, its architecture comprises 22 weight layers and uses the unique "Inception Module" as its foundation, leading some to refer to it as the "Inception Network". Unlike the sequential processing of its predecessors, this module operates in parallel. The core concept is to process basic blocks, typical in conventional convolutional networks, in parallel and amalgamate their output feature representations. This approach enables stacking of multiple Inception Modules without fretting over individual layer design. One major hurdle was the high-dimensional feature output resulting from concatenating all feature representations. To counteract this, the complete Inception Module introduces dimensionality reduction using a 1x1 convolution operation before applying the 3x3 and 5x5 convolution filters. This effectively reduces feature dimensions and enhances the Inception Module's performance.
A glance at the Inception Module reveals the motive behind bundling diverse operations into one block. Features are extracted using a range of filter sizes (1x1, 3x3, 5x5) corresponding to different receptive fields, encoding features at various levels from the input. The inclusion of a max-pooling branch further aids the feature representation. Coupled with the ReLU nonlinearity after each convolutional layer, this equips GoogLeNet with exceptional capabilities in modeling nonlinear relationships.
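A simplified Inception-style module might look like the sketch below. The branch widths are passed in as arguments; the example configuration at the end is illustrative, loosely following the first Inception block of GoogLeNet, and the sketch omits details such as the exact weight initialization:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # 1x1 convolution branch.
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, kernel_size=1), nn.ReLU(inplace=True))
        # 1x1 dimensionality reduction followed by a 3x3 convolution.
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1 dimensionality reduction followed by a 5x5 convolution.
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2), nn.ReLU(inplace=True),
        )
        # Max-pooling branch with a 1x1 projection.
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate all branch outputs along the channel dimension.
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)], dim=1
        )

# Illustrative widths: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = module(torch.randn(1, 192, 28, 28))   # out.shape == (1, 256, 28, 28)
```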
GoogLeNet stacks nine Inception Modules, yielding a 22-layer deep network. Similar to the Network in Network (NiN) architecture, GoogLeNet employs global average pooling followed by a fully connected layer for classification, ensuring swift computations, high classification accuracy, and fewer parameters. It also introduces several output branches (auxiliary classifiers) in intermediate layers for improved gradient flow. GoogLeNet incorporates dropout before the final fully connected layer in each output branch for regularization.
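A minimal sketch of such a classification head, assuming 1,024 feature channels, a 40% dropout rate, and 1,000 classes (all illustrative values), shows how global average pooling collapses each feature map to a single number before one fully connected layer produces the class scores:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # global average pooling: (N, 1024, H, W) -> (N, 1024, 1, 1)
    nn.Flatten(),              # -> (N, 1024)
    nn.Dropout(p=0.4),         # regularization before the classifier
    nn.Linear(1024, 1000),     # single fully connected classification layer
)

scores = head(torch.randn(2, 1024, 7, 7))   # scores.shape == (2, 1000)
```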
Despite appearing more complex than predecessors like AlexNet and VGGNet, GoogLeNet has significantly fewer parameters (around 6 million, compared to roughly 62 million in AlexNet and 138 million in VGGNet), proving that efficient design choices can yield a CNN architecture that achieves high accuracy with an impressively small memory footprint (Khan et al., 2018).
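These figures can be sanity-checked against torchvision's reference implementations, assuming a recent torchvision release; the counts differ slightly from the original papers, and GoogLeNet's total depends on whether its auxiliary classifiers are included:

```python
import torchvision.models as models

def n_params(model) -> int:
    # Total number of learnable parameters in the model.
    return sum(p.numel() for p in model.parameters())

for name, model in [
    ("AlexNet", models.alexnet(weights=None)),
    ("VGG-16", models.vgg16(weights=None)),
    ("GoogLeNet", models.googlenet(weights=None, aux_logits=False, init_weights=True)),
    ("ResNet-152", models.resnet152(weights=None)),
]:
    print(f"{name:>10}: {n_params(model) / 1e6:.1f}M parameters")
```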
Each of these architectures introduced new concepts to CNNs, such as the Inception Module in GoogLeNet or skip connections in ResNet, which have helped to shape the current state of deep learning.
Reference: Khan, S., Rahmani, H., Shah, S.A.A. and Bennamoun, M. (2018). A Guide to Convolutional Neural Networks for Computer Vision. Synthesis Lectures on Computer Vision, 8(1), pp.1–207. doi: https://doi.org/10.2200/s00822ed1v01y201712cov015