Decoding the CNN Architecture: Unveiling the Power and Precision of Convolutional Neural Networks - Part Ⅰ

Architecture of Convolutional Neural Networks (CNN)

As discussed in the previous article, CNNs are a class of deep learning models that have shown exceptional performance in image analysis, video processing, and computer vision. The architecture of a CNN is designed to take advantage of the 2D structure of its input, such as an image or an audio spectrogram. It does this through two special types of feature-extraction layers: convolutional layers and pooling layers.
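As a rough sketch of how these layers typically fit together (the framework choice, layer sizes, and class count below are illustrative assumptions, not taken from the article):

```python
import torch
import torch.nn as nn

# A minimal CNN: convolution -> pooling -> fully connected classifier.
# Input is assumed to be a batch of single-channel 28x28 images (hypothetical sizes).
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # feature extraction
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),   # spatial down-sampling: 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),    # fully connected layer over 10 hypothetical classes
)

x = torch.randn(4, 1, 28, 28)      # dummy batch of 4 images
print(model(x).shape)              # torch.Size([4, 10])
```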

Convolutional Layer

Convolutional layers are the integral building blocks of a CNN. Their primary function is to identify and respond to local patterns or features present in the previous layer. As shown in the figure below, this is accomplished by processing a three-dimensional input tensor and producing an output tensor through a set of learnable filters, or kernels. Each filter slides over the width and height of the input tensor, computing a dot product between the filter elements and the corresponding input values. The outcome is a two-dimensional activation map that records how the filter responds at every spatial location. As the network learns, the filters are tuned to activate when they encounter particular visual characteristics, such as an edge or a specific color contrast.

A 2D Image Filter

To visualize this, consider a filter (depicted in green) of size 2 x 2 aligned with a matching area (represented in orange) within a 4 x 4 input feature map. In the figure below, this process is shown in stages (a) through (i). At each alignment, the overlapping elements are multiplied and summed to produce a single data point (shown in blue) on the output feature map. The filter then slides across the input feature map with a stride of 1, horizontally or vertically, and the corresponding output value is computed at every step until the full output feature map is filled in.

Convolutional Layer Architecture – Source: (Khan et al., 2018)
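To make the sliding-window arithmetic above concrete, here is a minimal NumPy sketch of a single 2 x 2 filter convolved over a small input with stride 1 (the array values and filter weights are made up for illustration):

```python
import numpy as np

def conv2d(feature_map: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide `kernel` over `feature_map`, taking a dot product at each position."""
    kh, kw = kernel.shape
    out_h = (feature_map.shape[0] - kh) // stride + 1
    out_w = (feature_map.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

x = np.arange(16).reshape(4, 4)          # 4x4 input feature map
k = np.array([[1, 0], [0, -1]])          # 2x2 filter
print(conv2d(x, k))                      # 3x3 output activation map
```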

Pooling Layer

Pooling layers are usually inserted between successive convolutional layers in a CNN architecture. Their function is to progressively reduce the spatial size (width and height) of the representation, which lowers the number of parameters and the amount of computation in the network, provides a degree of translation invariance, and helps prevent overfitting by producing an abstracted form of the representation. A pooling layer operates independently on every depth slice (feature channel) of the input and resizes it spatially: it aggregates the activations within a local region (for example, a rectangle) into a single value. The method of combination is determined by a pooling function, and, as with the convolutional layer, the size of the pooling window and the stride must be specified. Max pooling, for instance, selects the highest activation within the chosen window, as depicted in the max-pooling figure below. The window is then moved across the input feature map in steps determined by the stride (1 in that figure). Given a pooling window of size f x f and a stride s, each output dimension is ((input size − f) / s) + 1. The most common choices are max pooling (using the maximum operator) and average pooling (using the average operator), both of which are hand-crafted rather than learned.
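As a quick check of that output-size arithmetic, a small sketch (assuming no padding, which the article does not discuss):

```python
def pooled_size(n: int, f: int, s: int) -> int:
    """Output width/height for an n x n input, an f x f pooling window, and stride s (no padding)."""
    return (n - f) // s + 1

print(pooled_size(4, 2, 1))  # 3 -> a 4x4 map pooled with a 2x2 window and stride 1 gives 3x3
print(pooled_size(4, 2, 2))  # 2 -> the common stride-2 setting halves each dimension
```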

Max Pooling

Max pooling is the most widely used pooling operation. It works by defining a spatial neighborhood, usually a 2 x 2 window, and taking the maximum element from the rectified feature map within that window. It can be thought of as a "feature detector" that retains only the highest value in a given region of the feature map and discards the rest, as shown in the figure below:

Max Pooling - Source: (Khan et al., 2018)
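A minimal NumPy sketch of max pooling with a 2 x 2 window and stride 1, mirroring the operation in the figure (the input values are illustrative, chosen so the top-left window matches the (1, 3, -2, 0) block discussed below):

```python
import numpy as np

def max_pool2d(feature_map: np.ndarray, f: int = 2, stride: int = 1) -> np.ndarray:
    """Take the maximum of each f x f window as it slides with the given stride."""
    out_h = (feature_map.shape[0] - f) // stride + 1
    out_w = (feature_map.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = window.max()
    return out

x = np.array([[ 1, 3, 2, 1],
              [-2, 0, 1, 4],
              [ 5, 1, 0, 2],
              [ 0, 2, 3, 1]])
print(max_pool2d(x))   # each output entry is the largest value in its 2x2 window
```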

Average Pooling

Average pooling, as its name suggests, calculates the average of the elements in a feature map region. Unlike max pooling, which keeps only the maximum response, average pooling blends all of the responses in the window and therefore retains more of the overall information. For instance, in figure (a) above, the maximum element of (1, 3, -2, 0) is 3, whereas the average pooling response is 0.5. Nevertheless, average pooling is less common in practice than max pooling, because max pooling more effectively captures the dominant features in a feature map and better preserves the activations that matter most for discrimination; it also provides a mild form of regularization that can mitigate overfitting. Historically, foundational deep learning architectures such as AlexNet employed max pooling, and their success cemented its use. Empirically, models using max pooling often outperform those using average pooling on standard benchmarks, further bolstering its popularity in the deep learning community.
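Reusing that same 2 x 2 block from the figure, a quick sketch contrasting the two operators:

```python
import numpy as np

block = np.array([[ 1, 3],
                  [-2, 0]])   # one 2x2 pooling region from the example figure

print(block.max())    # 3    -> max pooling keeps only the strongest activation
print(block.mean())   # 0.5  -> average pooling blends all four responses
```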

Fully Connected Layer

As presented in the cover image, a fully connected layer connects every neuron in one layer to every neuron in the next layer; each unit is densely connected to all the units of the previous layer. It is in principle the same as a layer in a traditional multi-layer perceptron (MLP), and it corresponds essentially to a convolution layer with filters of size 1 x 1. Fully connected layers usually come last in a CNN architecture, where their purpose is to use the extracted features to classify the input image into one of the classes defined by the training dataset. The operation can be represented as a matrix multiplication followed by the addition of a bias vector and the application of an element-wise nonlinear function:

y = f(Wx + b)

where x and y are the vectors of input and output activations, respectively, W denotes the matrix containing the weights of the connections between the layer units, b represents the bias term vector, and f is the element-wise nonlinear function.

Fully Connected Layer – Source: (Saleem et al., 2022)
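As a minimal NumPy sketch of that operation (the layer sizes and the choice of ReLU as the nonlinearity are illustrative assumptions):

```python
import numpy as np

def fully_connected(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """y = f(Wx + b): matrix multiply, add bias, apply an element-wise nonlinearity (ReLU here)."""
    return np.maximum(0, W @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(128)           # input activations, e.g. flattened feature maps
W = rng.standard_normal((10, 128))     # weight matrix: 10 output units, 128 inputs
b = np.zeros(10)                       # bias vector
print(fully_connected(x, W, b).shape)  # (10,)
```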

Reference: Khan, S., Rahmani, H., Shah, S.A.A. and Bennamoun, M. (2018). A Guide to Convolutional Neural Networks for Computer Vision. Synthesis Lectures on Computer Vision, 8(1), pp. 1–207. doi: https://doi.org/10.2200/s00822ed1v01y201712cov015


Kajal Singh


Well summarised. Besides Support Vector Machines, between 1980 and 2010, researchers worked on expanding MultiLayer Perceptrons (MLPs), which were invented by Ivakhnenko and Lapa in 1965 and began to be called Deep Learning Networks (DLNs) in 1986. As mentioned in a previous blog, a one-layer Perceptron network consists of an input layer connected to a hidden layer, which is connected to an output layer of Perceptrons (or vertices). The Perceptron multiplies incoming signals by their weights and adds them together. If the sum of the weighted signals exceeds a specified value, the Perceptron "fires". Activation functions, such as Tanh, ReLU, and Sigmoid, are used to determine if a Perceptron fires. Artificial Neural Networks (ANNs) are simply Perceptrons or other similar neurons that may have different activation functions. DLNs have more than one hidden layer and are complex due to the non-linear nature of activation functions, making them unexplainable "black boxes". Researchers like Hinton, LeCun and Schmidhuber popularized variants of DLNs, e.g., Fully Connected Networks, Autoencoders, Convolutional Neural Networks, Recurrent Neural Networks, Long Short-Term Memory, and Deep Belief Networks.
