BxD Primer Series: Convolutional Neural Networks
Hey there!
Welcome to BxD Primer Series where we are covering topics such as Machine learning models, Neural Nets, GPT, Ensemble models, Hyper-automation in ‘one-post-one-topic’ format. Today’s post is on Convolutional Neural Networks. Let’s get started:
The What:
CNNs are inspired by how the visual cortex of the brain processes visual information. They use convolutional layers to extract features from the input image, followed by pooling layers to reduce dimensionality. Finally, fully connected layers are used to classify the input.
Key components of CNNs:
Peculiarities of Convolutional Neural Networks:
Note: CNNs are very similar to traditional feed-forward neural networks, but they are specifically designed to first “convolve” the input in tasks involving images, video frames, audio, etc. Convolution reduces the number of parameters through weight sharing, which lets the network reach good results more efficiently.
Anatomy of a CNN:
Convolutional Layer in a CNN:
Convolutional layer applies a set of filters to input image or feature map, where each filter slides across the input and computes a dot product between its weights and corresponding local region of input.
Output of convolution operation is a set of feature maps, where each feature map corresponds to a single filter and captures a particular pattern in input data. These filters are learned during training process using back-propagation and gradient descent.
The size of filters is typically smaller than the size of input image, and the filters are applied with a certain stride and padding.
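Note 1: There are two types of padding: valid padding (no padding, so the output is smaller than the input) and same padding (the input is padded with zeros so that, at stride 1, the output has the same size as the input).
Note 2: The choice of stride and padding has a significant impact on the performance of a CNN.
Note 3: In a stricter sense, the term ‘kernel’ refers to a single 2-D weight matrix that slides over one channel of the input, while a ‘filter’ typically consists of multiple kernels, one per input channel.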
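To make the sliding dot product, stride and padding concrete, here is a minimal NumPy sketch for a single-channel image and a single kernel. The function name `conv2d_single` and the toy 5x5 input are assumptions made for illustration; real frameworks use much faster implementations.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1, padding=0):
    """Slide one kernel over a 2-D image (cross-correlation, as CNN layers do)."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")  # zero padding on all sides
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for j in range(out_h):
        for k in range(out_w):
            top, left = j * stride, k * stride
            region = image[top:top + kh, left:left + kw]
            out[j, k] = np.sum(region * kernel)  # dot product with the local region
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
valid = conv2d_single(image, kernel)              # valid padding: output shrinks
same = conv2d_single(image, kernel, padding=1)    # same padding for a 3x3 kernel
strided = conv2d_single(image, kernel, stride=2)  # stride 2 skips every other position
print(valid.shape, same.shape, strided.shape)
```

The output size along each axis follows (input + 2 x padding - kernel) // stride + 1, which is why padding of 1 keeps a 5x5 input at 5x5 for a 3x3 kernel.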
Types of Kernels:
Kernel type depends on which positions in the kernel are activated (given non-zero weights). For example, a 3x3 kernel has 9 positions, which can be activated column-wise, row-wise, diagonally, or in a different configuration as per requirements. Common kernel types are:
• Identity kernel simply performs the identity operation and is used to preserve original information in input.
• Edge Detection Kernels:
• Blur and Smoothing Kernels:
• Sharpening Kernels:
• Embossing Kernel enhances the edges in an image by simulating a 3D embossed effect.
• Custom Kernels
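As an illustration of the kernel types listed above, the sketch below defines one commonly used 3x3 matrix for each type and applies it to a tiny image. The specific weights (a Laplacian-style edge kernel, a box blur, and so on) are standard textbook choices picked here as assumptions, not values given in this post. It needs NumPy 1.20 or newer for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# One representative 3x3 kernel per type (illustrative choices).
KERNELS = {
    "identity": np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float),
    "edge": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
    "blur": np.ones((3, 3)) / 9.0,
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "emboss": np.array([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]], dtype=float),
}

def apply_kernel(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel, stride 1."""
    windows = sliding_window_view(image, kernel.shape)  # (out_h, out_w, kh, kw)
    return np.einsum("jkuv,uv->jk", windows, kernel)

# A vertical step edge: left half dark (0), right half bright (1).
step = np.zeros((5, 6))
step[:, 3:] = 1.0

edges = apply_kernel(step, KERNELS["edge"])
print(edges[0])  # non-zero only around the step
```

On a perfectly flat image the edge kernel returns zeros, while the other four kernels (whose weights sum to 1) return the image unchanged.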
Choosing Number and Size of Filters in Convolutional Layer:
Number of filters determines the depth of output feature map. More filters capture more diverse features in input image and increase the expressiveness of network. However, more filters also mean more parameters and more computation, which can slow down training and make the network prone to overfitting.
The size of filters determines the receptive field of neurons in the layer. A larger filter size captures more global features, while a smaller filter size captures more local features.
Choice of filter size depends on the scale of features you want to capture in the image.
It is common to start with small number of filters in first layer of network and gradually increase number of filters in deeper layers. Low-level features captured by early layers are combined to form more complex features in deeper layers.
Filter size can also vary across layers, although many modern CNNs keep small filters (such as 3x3) throughout and rely on depth to enlarge the region of the image that each neuron sees.
Final choice is usually done by trial and error, using cross-validation and grid search techniques to find optimal number and size of filters for given task and dataset.
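The trade-off between filter count, filter size and parameter count can be checked with a little arithmetic. The helper below is a small sketch (the name `conv_layer_params` is made up for this example) that counts the learnable parameters of one convolutional layer, assuming one bias per filter.

```python
def conv_layer_params(kernel_h, kernel_w, in_channels, num_filters):
    """Learnable parameters in one conv layer: weights plus one bias per filter."""
    weights = kernel_h * kernel_w * in_channels * num_filters
    biases = num_filters
    return weights + biases

# 3x3 filters on an RGB image: doubling the filters doubles the parameters.
print(conv_layer_params(3, 3, 3, 32))  # 896
print(conv_layer_params(3, 3, 3, 64))  # 1792
# Same number of filters, but 5x5 instead of 3x3.
print(conv_layer_params(5, 5, 3, 32))  # 2432
```

Parameter count grows linearly with the number of filters and with the filter area, which is one reason small filters are attractive when many filters are needed.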
Receptive Field of a Neuron:
Receptive field of a neuron refers to the region of input image that influences activation of that neuron. It is determined by the filter sizes and strides (including pooling) of all preceding layers of network.
A filter slides over the image one stride at a time, computing a dot product between filter weights and the values in receptive field. The result of this dot product is a single value, which is the output of filter for that particular location in image.
As the filter is applied to different locations in image, each output neuron gets its own receptive field, centred on a different part of the image.
In first layer of the network, receptive field is typically small, because the filters are small. As the network becomes deeper, each neuron sees the outputs of neurons that already cover a region of the input, so the receptive field grows with every additional convolutional or pooling layer, even if the filters themselves stay small.
In a typical CNN, the receptive field in output layer is usually large enough to capture entire input image, which enables the network to recognize objects regardless of their position in the image.
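The growth of the receptive field with depth can be computed layer by layer. The sketch below uses the usual recurrence (each layer adds (kernel size - 1) times the product of the strides before it); the function name and the example layer stacks are assumptions chosen for illustration, and padding and dilation are ignored.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a neuron after a stack of layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between neighbouring neurons
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Three stacked 3x3 convolutions with stride 1 see a 7x7 region.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# A 3x3 conv, a 2x2 pooling with stride 2, then another 3x3 conv.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

Note how the pooling layer makes the following convolution widen the receptive field twice as fast, without any change in filter size.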
Difference between 1D, 2D, and 3D Convolution:
This difference has to do with the dimensionality of input data.
In 1D convolutional layer, the input data is a one-dimensional sequence, such as a time series or a sequence of words. The filter slides along the input sequence in one direction.
In 2D convolutional layer, the input data is a two-dimensional image, such as a grayscale or color image. The filter slides over the image in two dimensions.
In 3D convolutional layer, the input data is a three-dimensional volume, such as a video. The filter slides over volume in three dimensions, capturing shapes, movements, and spatial relationships.
These convolutions can also be used in combination in a single CNN architecture. For example, a 2D CNN may be used to process the individual frames of a video, followed by a 3D CNN that processes sequences of frames to capture temporal patterns.
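The only thing that changes between 1D, 2D and 3D convolution is the number of axes the filter slides along. The sketch below, which assumes NumPy 1.20 or newer and uses a made-up helper name `correlate_nd`, applies the same code to a sequence, an image and a video-like volume (single channel, stride 1, no padding).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_nd(data, kernel):
    """Valid cross-correlation where the kernel slides along every axis of data."""
    windows = sliding_window_view(data, kernel.shape)  # out_shape + kernel.shape
    # Sum over the trailing window axes against the kernel axes.
    return np.tensordot(windows, kernel, axes=kernel.ndim)

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # 1D: a short time series
image = np.ones((6, 6))                        # 2D: a grayscale image
video = np.ones((8, 6, 6))                     # 3D: 8 frames of 6x6 pixels

out_1d = correlate_nd(signal, np.array([1.0, 0.0, -1.0]))
out_2d = correlate_nd(image, np.ones((3, 3)))
out_3d = correlate_nd(video, np.ones((2, 3, 3)))
print(out_1d.shape, out_2d.shape, out_3d.shape)
```

In the 3D case the kernel spans 2 frames as well as a 3x3 spatial patch, so each output value mixes information across time and space.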
Pooling Layer:
Pooling layers are typically placed after each set of convolutional layers in a CNN. Purpose of pooling layer is to reduce the dimensionality of feature maps and thereby reduce computation and prevent overfitting.
It works by dividing the input feature map into a grid of non-overlapping regions called windows. For example, if we have an input feature map of size 4x4 and we use a 2x2 pooling layer with a stride of 2, the output feature map will have size 2x2.
Three types of pooling layers are typically used:
• Max Pooling selects the maximum value of pixel within the window.
• Average Pooling takes the average pixel value within the window. It tends to blur the input information more than max pooling.
• L2 Pooling takes the square root of sum of squared values of pixels within the window.
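The three pooling types can be compared side by side on the 4x4 example mentioned above. This is a minimal NumPy sketch; the function name `pool2d` and the sample values are assumptions, and it only handles non-overlapping windows (stride equal to window size).

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Pool a 2-D feature map over non-overlapping size x size windows."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]  # drop ragged edges
    # Axes 1 and 3 index positions inside each window.
    windows = trimmed.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((windows ** 2).sum(axis=(1, 3)))
    raise ValueError(f"unknown pooling mode: {mode}")

feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [9, 11, 10, 12],
                        [13, 15, 14, 16]], dtype=float)

print(pool2d(feature_map, mode="max"))      # [[ 7.  8.] [15. 16.]]
print(pool2d(feature_map, mode="average"))  # [[ 4.  5.] [12. 13.]]
print(pool2d(feature_map, mode="l2"))
```

All three reduce the 4x4 map to 2x2; they differ only in how the four values of each window are summarised.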
Shallow v/s Deep CNN:
The depth of CNN refers to the number of layers it has.
Purpose of Residual Connections:
Residual connections, also known as skip connections, are a technique to improve the training of very deep networks. Basic idea is to add shortcut connections between layers so that output of a layer can be directly added to the output of a later layer, bypassing several layers in between.
When a CNN is very deep, the gradients of loss function have to pass backwards through many layers and can become very small by the time they reach the early layers. This is the problem of vanishing gradients: early layers receive tiny, almost meaningless weight updates, and the network struggles to learn useful features.
By using residual connections, gradients are able to flow more easily through the network via the shortcuts, allowing all layers, including the early ones, to keep learning.
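A residual block can be sketched in a few lines. In the simplified version below, two small dense layers stand in for the convolutional layers being bypassed; the function names and shapes are assumptions for illustration only. Because the output is F(x) + x, the derivative of the output with respect to x always contains an identity term, which is what lets gradients pass through the shortcut.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two layers standing in for conv layers."""
    out = relu(x @ w1)    # first bypassed layer
    out = out @ w2        # second bypassed layer
    return relu(out + x)  # shortcut: add the block input back in

x = np.array([[1.0, 2.0, 3.0]])

# With all-zero weights F(x) is zero, so the block simply passes x through.
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))

# With random weights the block learns a correction on top of x.
rng = np.random.default_rng(0)
print(residual_block(x, rng.normal(size=(3, 3)), rng.normal(size=(3, 3))))
```

The zero-weight case shows why such blocks are easy to stack: a block that has learned nothing yet behaves like an identity mapping (for non-negative inputs) instead of destroying the signal.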
Choosing Batch Size:
Batch size determines the number of samples that are propagated through a CNN before weights are updated (using back-propagation) during training. It affects both the speed and quality of training process.
There is no one-size-fits-all answer for choosing batch size. Here are some rules of thumb:
Note: Learning rate is a hyper-parameter that determines the step size at which parameters of CNN are updated during training.
The How:
Suppose we have an input image $X$, which is represented as a 3-dimensional tensor with dimensions $(H, W, C)$ and we want to classify the image into one of $K$ possible classes.
Where, $H$ is the height of the image, $W$ is its width, and $C$ is the number of channels.
Convolution: Apply a set of $F$ filters, each of size $(K_h \times K_w \times C)$, to the input image to obtain a set of feature maps $Z^{(1)}$, where $Z^{(1)}_{i,j,k}$ represents the activation of i’th filter at position $(j,k)$.
Each filter has its own set of learnable parameters, which are updated during training using back-propagation. Assuming a stride of 1 and no padding, the convolution operation can be expressed as:
$$Z^{(1)}_{i,j,k} = \sum_{u=1}^{K_h} \sum_{v=1}^{K_w} \sum_{c=1}^{C} W_{u,v,c,i} \, X_{j+u-1,\,k+v-1,\,c} + b_i$$
Where $W_{u,v,c,i}$ represents the weight of filter at position $(u,v,c)$ for i’th filter, and $b_i$ represents the bias term for i’th filter.
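The convolution step can be written almost word for word in NumPy. The sketch below assumes stride 1, no padding, an image stored as (H, W, C) and filter weights stored as (Kh, Kw, C, F) to match the index order of W_{u,v,c,i}; the function name `conv_forward` is made up for this example, and the explicit loops favour readability over speed.

```python
import numpy as np

def conv_forward(X, W, b):
    """Feature maps Z of shape (F, H-Kh+1, W-Kw+1) for one image.

    X: (H, W, C) image, W: (Kh, Kw, C, F) filter weights, b: (F,) biases.
    """
    height, width, _ = X.shape
    kh, kw, _, num_filters = W.shape
    Z = np.zeros((num_filters, height - kh + 1, width - kw + 1))
    for i in range(num_filters):          # i'th filter
        for j in range(Z.shape[1]):       # vertical position
            for k in range(Z.shape[2]):   # horizontal position
                patch = X[j:j + kh, k:k + kw, :]  # local region, all channels
                Z[i, j, k] = np.sum(patch * W[:, :, :, i]) + b[i]
    return Z

X = np.ones((5, 5, 3))      # toy 5x5 RGB image
W = np.ones((3, 3, 3, 2))   # two 3x3x3 filters
b = np.array([0.0, 1.0])
Z = conv_forward(X, W, b)
print(Z.shape)              # (2, 3, 3)
```

Each filter sums over all three channels of its 3x3 window, so with all-ones inputs the first feature map is 27 everywhere and the second is 28 because of its bias.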
Activation Function: Apply a nonlinear activation function $f$ (for example ReLU) element-wise to the feature maps $Z^{(1)}$ to introduce nonlinearity into the network:
$$A^{(1)}_{i,j,k} = f\left(Z^{(1)}_{i,j,k}\right)$$
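Pooling: Apply a pooling operation to the feature maps $A^{(1)}$ to reduce their spatial resolution and extract higher-level features. Suppose we use max pooling with a pooling window size of $P_h \times P_w$ and non-overlapping windows:
$$P^{(1)}_{i,j,k} = \max_{1 \le p \le P_h,\; 1 \le q \le P_w} A^{(1)}_{i,\,(j-1)P_h + p,\,(k-1)P_w + q}$$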
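Fully Connected Layers: Flatten the pooled feature maps $P^{(1)}$ into a 1-D vector and pass it through one or more fully connected layers to map the features to output classes.
Suppose we have $L$ fully connected layers with weights $W^{(l)}$ and biases $b^{(l)}$, and the output of l’th layer is denoted as $Z^{(l)}$. Here the index $l$ counts only the fully connected layers, and $A^{(0)}$ is the flattened $P^{(1)}$:
$$Z^{(l)} = W^{(l)} A^{(l-1)} + b^{(l)}, \qquad A^{(l)} = f\left(Z^{(l)}\right) \quad \text{for } l < L$$
Output of last fully connected layer is passed through a softmax function to obtain a probability distribution over $K$ output classes:
$$A^{(L)}_i = \frac{\exp\left(Z^{(L)}_i\right)}{\sum_{j=1}^{K} \exp\left(Z^{(L)}_j\right)}$$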
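The flatten, fully connected and softmax steps can be sketched as follows. The layer sizes, the random weights and the use of ReLU for the hidden layers are assumptions picked for the example; subtracting the maximum inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)  # avoids overflow in exp, same output
    exps = np.exp(shifted)
    return exps / exps.sum()

def dense_forward(pooled, weights, biases):
    """Flatten pooled feature maps and run them through fully connected layers."""
    a = pooled.reshape(-1)  # 1-D feature vector
    last = len(weights) - 1
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        # Hidden layers use ReLU (an assumption); the last layer uses softmax.
        a = softmax(z) if l == last else np.maximum(z, 0.0)
    return a

rng = np.random.default_rng(0)
pooled = rng.normal(size=(2, 3, 3))  # e.g. 2 feature maps of 3x3
weights = [rng.normal(size=(8, 18)), rng.normal(size=(4, 8))]  # 18 -> 8 -> 4 classes
biases = [np.zeros(8), np.zeros(4)]

probs = dense_forward(pooled, weights, biases)
print(probs, probs.sum())
```

The output is a vector of K = 4 non-negative values that sum to 1, which is what allows it to be read as class probabilities.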
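Training: During training, we minimize a loss function $\mathcal{L}$ with respect to the network parameters using back-propagation and stochastic gradient descent. The loss function measures the discrepancy between predicted output probabilities and true class labels. For classification, this is typically the cross-entropy loss:
$$\mathcal{L} = -\sum_{i=1}^{K} y_i \log A^{(L)}_i$$
Where $y_i$ is the true label for class $i$ (1 for the correct class and 0 otherwise), and $A^{(L)}_i$ is the predicted probability for class $i$.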
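A loss that compares predicted probabilities with a one-hot true label is usually the cross-entropy loss; the short sketch below assumes that choice. The clipping constant `eps` is an assumption added only to avoid taking the log of zero.

```python
import numpy as np

def cross_entropy(y_true, probs, eps=1e-12):
    """Cross-entropy between a one-hot label vector and predicted probabilities."""
    return -np.sum(y_true * np.log(np.clip(probs, eps, 1.0)))

y_true = np.array([0.0, 1.0, 0.0])    # the correct class is class 1
confident = np.array([0.1, 0.8, 0.1])
unsure = np.array([0.4, 0.3, 0.3])

print(cross_entropy(y_true, confident))  # small loss
print(cross_entropy(y_true, unsure))     # larger loss
```

Only the probability assigned to the correct class matters here, and the loss grows quickly as that probability approaches zero.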
We can update the weights and biases of network using back-propagation and stochastic gradient descent.
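Gradients of loss function $\mathcal{L}$ with respect to the weights and biases can be computed using chain rule:
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial Z^{(l)}} \frac{\partial Z^{(l)}}{\partial W^{(l)}}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \frac{\partial \mathcal{L}}{\partial Z^{(l)}} \frac{\partial Z^{(l)}}{\partial b^{(l)}}$$
Update the weights and biases using gradients and learning rate $\eta$:
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial b^{(l)}}$$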
Repeat the forward and backward passes on mini-batches of training data until convergence or a stopping criterion is met.
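The forward pass, gradient computation and weight update on mini-batches can be sketched end to end on a toy problem. To keep it short, the example trains only a single softmax layer on made-up 2-D data (so it is not a full CNN, and back-propagation through convolutional layers is left out); the learning rate, batch size and epoch count are arbitrary assumptions. It relies on the fact that, for softmax with cross-entropy, the gradient of the loss with respect to the pre-softmax values is simply (predicted probabilities - true labels).

```python
import numpy as np

def softmax_rows(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def mean_loss(X, Y, W, b):
    probs = softmax_rows(X @ W + b)
    return -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))

def train(X, Y, lr=0.5, batch_size=20, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))               # shuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            probs = softmax_rows(X[idx] @ W + b)      # forward pass
            grad_logits = (probs - Y[idx]) / len(idx) # gradient w.r.t. pre-softmax values
            W -= lr * X[idx].T @ grad_logits          # chain rule, then SGD step
            b -= lr * grad_logits.sum(axis=0)
    return W, b

# Toy data: class is decided by the sign of the first feature.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
labels = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[labels]                                 # one-hot labels

initial_loss = mean_loss(X, Y, np.zeros((2, 2)), np.zeros(2))
W, b = train(X, Y)
final_loss = mean_loss(X, Y, W, b)
accuracy = np.mean(softmax_rows(X @ W + b).argmax(axis=1) == labels)
print(initial_loss, final_loss, accuracy)
```

In a real CNN the same loop applies, with the gradient passed further back through the pooling and convolutional layers so that the filter weights are updated as well.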
Classifier v/s Detector in CNN:
A classifier is a type of model that is trained to classify images into one of several predefined categories or classes. Goal of a classifier is to learn a mapping from input image to the correct output label, based on features extracted by the network.
A detector is a type of model that is trained to detect the presence and location of objects in an image, as well as classify them. Goal of a detector is to identify regions in input image that contain objects of interest, and then classify them into one of several predefined classes.
To build a detector using a CNN, additional components are added to the network, such as a ‘region proposal network’ or ‘anchor-based detectors’, which generate candidate object regions in input image. These regions are then passed through a classification layer, which assigns a label to each region.
Output of a detector is typically a set of bounding boxes that indicate the location of objects in image, along with their corresponding class labels and confidence scores.
The final bounding boxes are selected using the ‘non-maximum suppression’ technique, which filters out overlapping detections and retains only the most confident ones.
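Non-maximum suppression itself is a short greedy procedure. The sketch below is a plain-Python version, assuming boxes given as (x1, y1, x2, y2) corners and an overlap threshold of 0.5 on intersection-over-union (IoU); both conventions are assumptions for this example, and detection libraries provide optimised equivalents.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes to keep, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is dropped as a duplicate detection, while the third box is far away and survives.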
The Why:
Reasons for using CNNs:
The Why Not:
Reasons for not using CNNs:
Time for you to support:
In next edition, we will cover Deconvolutional Neural Networks.
Let us know your feedback!
Until then,
Have a great time!