BxD Primer Series: Convolutional Neural Networks

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a 'one-post-one-topic' format. Today's post is on Convolutional Neural Networks. Let's get started:

The What:

CNNs are inspired by the visual cortex of the brain, which processes visual information. They use convolutional layers to extract features from the input image, followed by pooling layers to reduce dimensionality. Finally, fully connected layers are used to classify the input.

Key components of CNNs:

  1. Convolutional layers apply filters (also called kernels) to the input image, sliding them across the image and computing a dot product to produce a feature map. The filters are learned during training and can detect different types of patterns and features in an image, such as edges, corners, and textures.
  2. Pooling layers downsample the output of convolutional layers by taking the maximum or average value of each small region of the feature map. This reduces the dimensionality of the data and helps to prevent overfitting.
  3. Activation functions introduce nonlinearity into the model, allowing it to learn complex relationships between input and output. Commonly used functions are ReLU, sigmoid, and tanh.
  4. Fully connected layers take the output of the convolutional and pooling layers and flatten it into a vector, which is then fed into a traditional neural network for classification.
  5. Dropout is a regularization technique that randomly drops a fraction of neurons during training, forcing the network to learn more robust features and preventing overfitting.
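
To make these components concrete, here is a minimal sketch of a small CNN classifier in PyTorch. The layer sizes, the 32x32 RGB input, and the 10-class output are illustrative assumptions, not specifics from this post:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Minimal CNN: two conv/pool stages, dropout, and a fully connected head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                    # activation function
            nn.MaxPool2d(2),                              # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                        # flatten feature maps into a vector
            nn.Dropout(0.5),                     # dropout regularization
            nn.Linear(32 * 8 * 8, num_classes),  # fully connected layer (assumes 32x32 input)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(1, 3, 32, 32))  # one RGB image, 32x32
print(logits.shape)  # torch.Size([1, 10])
```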

Peculiarities of Convolutional Neural Networks:

  1. CNNs share filter parameters across the entire image. The same filter is applied at every location in the image, rather than learning a separate set of weights for each location.
  2. CNNs can recognize objects in an image regardless of their position, because the filters are applied locally and the same filters are used across the entire image. This property is known as translation invariance.
  3. CNNs learn a hierarchy of features, starting from low-level features like edges and corners and building up to more complex features like shapes and textures. This is achieved by using multiple layers of filters, which progressively abstract the input data and capture complex patterns.
  4. The datasets used to train a CNN may not be representative of the full range of variations that occur in the real world. To address this, data augmentation techniques artificially increase the size of the training set by applying transformations to the input data, such as rotating, flipping, or scaling images.
  5. Due to the high computational cost of training CNNs from scratch, pre-trained models are often used as a starting point for new tasks. These models can be fine-tuned on a smaller dataset to recognize a specific set of objects or perform a different task. This is called transfer learning, and we will cover it extensively in a separate edition.

Note: CNNs are very similar to traditional feed-forward neural networks, but they are specifically designed to first "convolve" the input in tasks involving images, video frames, audio, and the like. Convolution reduces the number of parameters and achieves better results efficiently.

Anatomy of a CNN:

[Image: anatomy of a typical CNN]

Convolutional Layer in a CNN:

The convolutional layer applies a set of filters to an input image or feature map; each filter slides across the input and computes a dot product between its weights and the corresponding local region of the input.

The output of the convolution operation is a set of feature maps, where each feature map corresponds to a single filter and captures a particular pattern in the input data. These filters are learned during training using back-propagation and gradient descent.

The size of the filters is typically smaller than the size of the input image, and the filters are applied with a certain stride and padding.

  • Stride determines how far the filter moves across the input with each convolution operation. For example, a stride of 2 means the filter moves 2 pixels at a time, effectively reducing the size of the output feature map. By default, most CNNs use a stride of 1.
  • Padding adds extra pixels (usually zeros) around the edge of the input, before applying the convolutional filter, to ensure that information from the edges is not lost or underrepresented.

Note 1: There are two types of padding: valid and same padding.

  • Valid padding means no padding is added to the input.
  • Same padding means the input is padded with zeros so that the output feature map has the same dimensions as the input.

Note 2: The choice of stride and padding has a significant impact on the performance of a CNN.

  • A larger stride results in a smaller output feature map, which reduces computation but also results in a loss of information.
  • Padding helps to preserve spatial information in the input, but also increases the computation.
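
The standard output-size formula ties these choices together: for input size H, filter size K, padding P, and stride S, the output size along one dimension is floor((H + 2P - K) / S) + 1. A quick sketch (the example values are assumptions for illustration):

```python
def conv_output_size(h, kernel, stride=1, padding=0):
    """Spatial size of a conv/pool output along one dimension."""
    return (h + 2 * padding - kernel) // stride + 1

print(conv_output_size(32, kernel=3, stride=1, padding=1))  # 32 -> "same" padding
print(conv_output_size(32, kernel=3, stride=2, padding=0))  # 15 -> "valid" padding, stride 2
```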

Note 3: In a stricter sense, the term 'kernel' refers to a single matrix that slides over the image, while a 'filter' typically consists of multiple kernels (one per input channel).

Types of Kernels:

The kernel type depends on which of its weights are active. For example, a 3x3 kernel has 9 weights, which can be arranged column-wise, row-wise, diagonally, or in other configurations as required. Common kernel types are:

• Identity Kernel performs the identity operation and is used to preserve the original information in the input.

• Edge Detection Kernels:

  • Sobel Kernel detects vertical and horizontal edges by convolving the input with two separate kernels.
  • Prewitt Kernel also detects vertical and horizontal edges.
  • Scharr Kernel provides more sensitivity to diagonal edges compared to Sobel and Prewitt kernels.

• Blur and Smoothing Kernels:

  • Gaussian Kernel applies a blur to the input, effectively reducing noise and smoothing the image.
  • Box (or Average) Kernel performs a simple averaging operation on the input, which helps to blur and smooth the image.

• Sharpening Kernels:

  • Laplacian Kernel enhances edges in an image by emphasizing sudden intensity changes.
  • Unsharp Masking Kernel sharpens an image by subtracting a blurred version of the image from the original.

• Embossing Kernel enhances the edges in an image by simulating a 3D embossed effect.

• Custom Kernels
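
As a hedged illustration of edge-detection kernels in practice, here is a sketch that applies Sobel kernels to a grayscale image using NumPy and SciPy; the random image is a stand-in for real data:

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels for vertical and horizontal edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

image = np.random.rand(64, 64)  # placeholder grayscale image

# "same" mode keeps the output the same size as the input
gx = convolve2d(image, sobel_x, mode="same", boundary="symm")
gy = convolve2d(image, sobel_y, mode="same", boundary="symm")
edges = np.hypot(gx, gy)  # gradient magnitude at each pixel
```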

Choosing Number and Size of Filters in Convolutional Layer:

The number of filters determines the depth of the output feature map. More filters capture more diverse features in the input image and increase the expressiveness of the network. However, more filters also mean more parameters and more computation, which can slow down training and make the network prone to overfitting.

The size of the filters determines the receptive field of the neurons in the layer. A larger filter size captures more global features, while a smaller filter size captures more local features.

The choice of filter size depends on the scale of the features you want to capture in the image.

  • If you want to detect small details like edges and corners, use a smaller filter size.
  • If you want to detect larger features like shapes and textures, use a larger filter size.

It is common to start with a small number of filters in the first layer of the network and gradually increase the number of filters in deeper layers. Low-level features captured by early layers are combined to form more complex features in deeper layers.

It is also common to use a smaller filter size in early layers and gradually increase the filter size in deeper layers.

The final choice is usually made by trial and error, using cross-validation and grid search to find the optimal number and size of filters for a given task and dataset.

Receptive Field of a Neuron:

The receptive field of a neuron refers to the region of the input image that influences the activation of that neuron. It is determined by the sizes and strides of the filters in the preceding layers of the network.

A filter slides over the image one stride at a time, computing a dot product between the filter weights and the values in its receptive field. The result of this dot product is a single value: the output of the filter for that particular location in the image.

As the filter is applied to different locations in the image, the receptive field of the corresponding neuron changes.

In the first layer of the network, the receptive field is typically small because the filters are small. As the network becomes deeper, each neuron aggregates the outputs of many earlier neurons, resulting in a larger receptive field.

In a typical CNN, the receptive field in the output layer is usually large enough to cover the entire input image, which enables the network to recognize objects regardless of their position in the image.
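
The growth of the receptive field can be computed with the standard recurrence: each layer adds (kernel - 1) times the product of all earlier strides. A small sketch (the three-layer configuration is an illustrative assumption):

```python
def receptive_field(layers):
    """Receptive field size from a list of (kernel, stride) pairs."""
    rf, jump = 1, 1  # jump = distance between adjacent outputs, in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Example: three 3x3 conv layers, the second with stride 2
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 9
```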

Difference between 1D, 2D, and 3D Convolution:

This difference has to do with the dimensionality of the input data.

In a 1D convolutional layer, the input is a one-dimensional sequence, such as a time series or a sequence of words. The filter slides along the input sequence in one direction.

In a 2D convolutional layer, the input is a two-dimensional image, such as a grayscale or color image. The filter slides over the image in two dimensions.

In a 3D convolutional layer, the input is a three-dimensional volume, such as a video. The filter slides over the volume in three dimensions, capturing shapes, movements, and spatial relationships.

These convolutions can also be used in combination in a single CNN architecture. For example, a 2D CNN may be used to process the individual frames of a video, followed by a 3D CNN that processes sequences of frames to capture temporal patterns.
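
A quick sketch of the three cases in PyTorch, showing the expected input shapes: (batch, channels) followed by 1, 2, or 3 spatial dimensions. The channel counts and sizes are arbitrary:

```python
import torch
import torch.nn as nn

conv1d = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3)
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3)

print(conv1d(torch.randn(1, 8, 100)).shape)         # (1, 16, 98)        - sequence
print(conv2d(torch.randn(1, 3, 64, 64)).shape)      # (1, 16, 62, 62)    - image
print(conv3d(torch.randn(1, 3, 16, 64, 64)).shape)  # (1, 16, 14, 62, 62) - video clip
```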

Pooling Layer:

Pooling layers are typically placed after each set of convolutional layers in a CNN. The purpose of a pooling layer is to reduce the dimensionality of the feature maps, thereby reducing computation and preventing overfitting.

It works by dividing the input feature map into a grid of non-overlapping regions called windows. For example, if we have an input feature map of size 4x4 and use a 2x2 pooling window with a stride of 2, the output feature map will have size 2x2.

Three types of pooling layers are typically used:

• Max Pooling selects the maximum pixel value within the window.

  • This has the effect of retaining the most salient features in each region while discarding the rest.
  • It also helps to reduce the sensitivity of the network to small variations in the input.

• Average Pooling takes the average pixel value within the window. It tends to blur the input information more than max pooling.

• L2 Pooling takes the square root of the sum of squared pixel values within the window.
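
A minimal NumPy sketch of the 4x4 to 2x2 example above, implementing all three pooling types over non-overlapping 2x2 windows:

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Pool a (H, W) array over non-overlapping 2x2 windows."""
    h, w = x.shape
    windows = x.reshape(h // 2, 2, w // 2, 2)  # split into 2x2 blocks
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "avg":
        return windows.mean(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((windows ** 2).sum(axis=(1, 3)))
    raise ValueError(mode)

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x, "max"))  # [[ 5.  7.] [13. 15.]]
print(pool2x2(x, "avg"))  # [[ 2.5  4.5] [10.5 12.5]]
```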

Shallow v/s Deep CNN:

The depth of a CNN refers to the number of layers it has.

  • A shallow CNN has only a few convolutional layers followed by one fully connected layer. It is often used when limited training data is available, or when the input data has a simple structure.
  • A deep CNN has many convolutional layers followed by one or more fully connected layers.

Purpose of Residual Connections:

Residual connections, also known as skip connections, are a technique to improve the training of very deep networks. The basic idea is to add shortcut connections between layers so that the output of a layer can be added directly to the output of a later layer, bypassing several layers in between.

When a CNN is very deep, it becomes difficult for the network to learn useful features. This is the problem of vanishing gradients: gradients of the loss function become very small in early layers, leading to small, meaningless weight updates.

By using residual connections, gradients can flow more easily through the network, allowing for better learning of features in later layers.
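
A hedged sketch of a basic residual block in PyTorch, modeled on the ResNet idea; the channel count and the two-convolution structure are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x : the shortcut lets gradients bypass the conv stack."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # shortcut: add the input directly to the output

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```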

Choosing Batch Size:

Batch size determines the number of samples propagated through a CNN before the weights are updated (using back-propagation) during training. It affects both the speed and the quality of the training process.

  • A larger batch size takes advantage of parallel processing, so data is processed faster, but it can compromise performance and the model's generalization to unseen data.
  • A smaller batch size takes longer to train because it requires more iterations to update the weights, but it is more effective at reducing overfitting.

There is no one-size-fits-all answer for choosing batch size. Here are some rules of thumb:

  1. The batch size should fit within the memory constraints of the GPU or CPU used for training.
  2. A larger dataset may warrant a larger batch size to make the most of parallel processing.
  3. A more complex model or input may require a smaller batch size to prevent overfitting.
  4. A larger batch size may require a larger learning rate to prevent the model from getting stuck in a suboptimal solution.
  5. A larger batch size may be preferred to speed up the training process, even if it leads to slightly worse performance.

Note: The learning rate is a hyper-parameter that determines the step size at which the parameters of a CNN are updated during training.

  • If the learning rate is too small, the CNN will take a long time to converge, and training may get stuck in local minima.
  • If the learning rate is too large, the CNN may overshoot the optimal weights and diverge from the optimal solution.

The How:

Suppose we have an input image X, represented as a 3-dimensional tensor with dimensions (H, W, C), and we want to classify the image into one of K possible classes.

Where,

  • H is the height of the image
  • W is the width of the image
  • C is the number of channels (3 for RGB images)

Convolution: Apply a set of F filters, each of size (K_h x K_w x C), to the input image to obtain a set of feature maps Z(1), where Z_{i,j,k}(1) represents the activation of the i'th filter at position (j, k).

Each filter has its own set of learnable parameters, which are updated during training using back-propagation. The convolution operation can be expressed as:

Z^{(1)}_{i,j,k} = \sum_{u=1}^{K_h} \sum_{v=1}^{K_w} \sum_{c=1}^{C} W_{u,v,c,i} \cdot X_{j+u-1,\, k+v-1,\, c} + b_i

Where W_{u,v,c,i} represents the weight at position (u, v, c) of the i'th filter, and b_i represents the bias term for the i'th filter.

Activation Function: Apply a nonlinear activation function f element-wise to the feature maps Z(1) to introduce nonlinearity into the network:

A^{(1)}_{i,j,k} = f\left( Z^{(1)}_{i,j,k} \right)

Pooling: Apply a pooling operation to the feature maps A(1) to reduce their spatial resolution and extract higher-level features. Suppose we use max pooling with a pooling window of size P_h x P_w:

P^{(1)}_{i,j,k} = \max_{0 \le u < P_h,\; 0 \le v < P_w} A^{(1)}_{i,\; j \cdot P_h + u,\; k \cdot P_w + v}

Fully Connected Layers: Flatten the pooled feature maps P(1) into a 1-D vector and pass it through one or more fully connected layers to map the features to the output classes.

Suppose we have L fully connected layers with weights W(l) and biases b(l), and the output of the l'th layer is denoted Z(l).

The output of the last fully connected layer is passed through a softmax function to obtain a probability distribution over the K output classes:

A^{(L)}_i = \frac{ e^{ Z^{(L)}_i } }{ \sum_{j=1}^{K} e^{ Z^{(L)}_j } }

Training: During training, we minimize a loss function L with respect to the network parameters using back-propagation and stochastic gradient descent. The loss function measures the discrepancy between the predicted output probabilities and the true class labels; a common choice is the cross-entropy loss:

L = -\sum_{i=1}^{K} y_i \log A^{(L)}_i

Where y_i is the true (one-hot) label for class i, and A^{(L)}_i is the predicted probability for class i.

We can update the weights and biases of network using back-propagation and stochastic gradient descent.

Gradients of the loss function with respect to the weights and biases can be computed using the chain rule:

\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial Z^{(l)}} \cdot \frac{\partial Z^{(l)}}{\partial W^{(l)}}, \qquad \frac{\partial L}{\partial b^{(l)}} = \frac{\partial L}{\partial Z^{(l)}} \cdot \frac{\partial Z^{(l)}}{\partial b^{(l)}}

Update the weights and biases using the gradients and the learning rate η:

W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial L}{\partial b^{(l)}}

Repeat the forward and backward passes on mini-batches of training data until convergence or a stopping criterion is met.
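
To ground the last few steps, here is a minimal NumPy sketch of the output stage: softmax, cross-entropy loss, and one SGD update of the final layer's weights. The shapes, seed, and learning rate are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

K = 4                         # number of classes (illustrative)
rng = np.random.default_rng(0)
p = rng.standard_normal(8)    # flattened pooled features (illustrative size)
W = rng.standard_normal((K, 8)) * 0.1
b = np.zeros(K)
y = np.eye(K)[2]              # one-hot true label

# Forward pass of the final layer
z = W @ p + b
a = softmax(z)
loss = -np.sum(y * np.log(a))  # cross-entropy

# Backward pass: for softmax + cross-entropy, dL/dz = a - y
dz = a - y
dW = np.outer(dz, p)
db = dz

# SGD update with learning rate eta
eta = 0.1
W -= eta * dW
b -= eta * db
```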

Classifier v/s Detector in CNN:

A classifier is a model trained to classify images into one of several predefined categories or classes. The goal of a classifier is to learn a mapping from the input image to the correct output label, based on the features extracted by the network.

A detector is a model trained to detect the presence and location of objects in an image, as well as classify them. The goal of a detector is to identify regions in the input image that contain objects of interest, and then classify them into one of several predefined classes.

To build a detector using a CNN, additional layers are added to the network, such as a 'region proposal network' or 'anchor-based detectors', which generate candidate object regions in the input image. These regions are then passed through a classification layer, which assigns a label to each region.

The output of a detector is typically a set of bounding boxes indicating the locations of objects in the image, along with their corresponding class labels and confidence scores.

Overlapping bounding boxes are filtered using the 'non-maximum suppression' technique, which retains only the most confident detections.
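
A hedged sketch of greedy, IoU-based non-maximum suppression in NumPy; the 0.5 overlap threshold and the toy boxes are illustrative assumptions:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. boxes: (N, 4) as [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]  # most confident first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap box i too much; keep the rest for the next round
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] - the second box overlaps the first too much
```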

The Why:

Reasons for using CNNs:

  1. Specifically designed to handle images.
  2. Shared weights in convolutional layers greatly reduce the number of parameters and improve the generalization ability of the network.
  3. Pre-trained CNNs can be used as a starting point for a new task, allowing for fast training with less data.
  4. State-of-the-art performance on a wide range of computer vision tasks.

The Why Not:

Reasons for not using CNNs:

  1. CNNs are designed specifically for image processing tasks and may not be suitable for other types of data or tasks that do not involve spatial relationships.
  2. They require large amounts of computation, both during training and inference, making them impractical in certain situations (e.g., mobile devices).
  3. They can be prone to overfitting if the model is too complex or if there is insufficient training data.
  4. It is challenging to diagnose and correct errors because a CNN takes a 'black box' approach to arriving at its decisions.
  5. Pre-trained CNNs may not be directly transferable to new domains or tasks, requiring significant re-training or fine-tuning to achieve good performance.

Time for you to support:

  1. Reply to this email with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In the next edition, we will cover Deconvolutional Neural Networks.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #Convolutional #neuralnetworks #primer
