Decoding the CNN Architecture: Unveiling the Power and Precision of Convolutional Neural Networks - Part II
Nourhan Moustafa
British Council Women in STEM Scholarship Awardee 2022/2023 | AI/ML Applied Researcher | Data Science Enthusiast | STEM Ambassador 100+ hrs of Engagement @ STEM Learning UK
Architecture of Convolutional Neural Networks (CNN)
As discussed in the previous article, feature extraction in CNNs identifies important patterns and features, like edges and textures, in images through convolutional and pooling layers. These features are gradually abstracted, enabling the network to understand complex visual information. This process facilitates accurate image analysis and recognition tasks, while the classification part is handled by the fully connected layers. Now, let's explore activation functions, loss functions, and optimizers. These components work together during the training process to adjust the model's parameters and minimize the loss, resulting in a neural network that can make accurate predictions or classifications on unseen data.
1- Activation Functions
Activation functions are used to introduce non-linearity into the CNN model. Without them, a CNN would only be able to learn linear relationships. The activation function takes a real-numbered input and confines it within a limited range such as [0, 1] or [-1, 1]. Introducing a nonlinear function after the weight layers is crucial because it empowers a neural network to learn nonlinear correlations. Without these nonlinearities, a stack of weight layers would amount to nothing more than a linear transformation from the input domain to the output domain.
An activation function can also be seen as a decision-making or selection tool, determining whether a neuron should fire based on its inputs. To facilitate the backpropagation of errors, the activation functions used in deep networks are usually differentiable. The most frequently used activation functions in deep neural networks are described below.
Sigmoid: The sigmoid activation function accepts a real-numbered input and produces a result within the interval [0, 1]. It is calculated as:
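For reference, the sigmoid is commonly written as (x denotes the input):

f(x) = \frac{1}{1 + e^{-x}}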
Tanh: The tanh activation function employs the hyperbolic tangent operation to confine input values within the range [-1, 1]. It can be expressed in the following manner:
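Its standard form is:

f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}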
Algebraic Sigmoid Function: The algebraic sigmoid function also maps the input within the range of [-1, 1]. It is formulated as:
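A commonly used form (the exact notation may vary between sources) is:

f(x) = \frac{x}{\sqrt{1 + x^{2}}}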
The Rectified Linear Unit (ReLU): ReLU is a unique activation function, valued for its efficient computation. The ReLU function assigns a zero value to negative inputs and keeps positive inputs unchanged. The ReLU activation draws its inspiration from the processing mechanisms of the human visual cortex (Hahnloser et al., 2000). Its effectiveness and wide usage have spawned several variants, which will be discussed next. These adaptations rectify some of the ReLU activation function's limitations; for instance, the Leaky ReLU does not fully suppress negative inputs to zero. ReLU itself can be described as:
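In its basic form:

f(x) = \max(0, x)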
The Noisy ReLU: It introduces a twist to the conventional ReLU by injecting, for positive inputs, a sample drawn from a Gaussian distribution with zero mean and a variance that depends on the input value, σ(x). Its mathematical representation is as follows:
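A commonly cited form, where \epsilon denotes the injected Gaussian noise (notation assumed here), is:

f(x) = \max(0, \; x + \epsilon), \quad \epsilon \sim \mathcal{N}(0, \sigma(x))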
The Leaky ReLU: It rectifies the issue of completely nullifying the output for negative inputs, inherent in the standard ReLU function. Instead of setting the output to zero, the Leaky ReLU outputs a scaled-down version of the negative input. This is governed by the following equation, where c is a constant, referred to as the leak factor, typically set to a small value:
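With leak factor c (for example, c = 0.01), this can be written as:

f(x) = \max(c \cdot x, \; x), \quad 0 < c < 1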
Parametric Linear Units: These operate in a manner akin to the Leaky ReLU, the distinguishing feature being that the adjustable leak parameter is established during the course of network training. Here, a represents the leak factor, which is systematically learned throughout the training process, as follows:
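It takes the same form as the Leaky ReLU, but with a learnable leak factor a:

f(x) = \max(a \cdot x, \; x)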
2- Loss Functions
The loss function is used to measure the inconsistency between the predicted (output of the model) and actual data. It guides the optimizer to reach the global minimum. Common choices for loss functions include mean squared error for regression problems and cross-entropy for classification.
Mean Squared Error (MSE):
Mean squared error (MSE) is a concept used in statistics to measure the difference between two sets of values, and it is a popular loss function for regression problems. It quantifies the discrepancy between the predicted and actual outcomes, offering a simple way to check how well a model or prediction is working. Mathematically, it is defined as the average of the squared differences between the predicted and actual values. It can be calculated using the following steps, which are summarized in the formula after the list:
1. Subtract the predicted outcome from the actual outcome (this is called an "error" - it's the difference between what really happened and what was predicted to happen).
2. Square each of these errors (which means multiplying the number by itself). Squaring is done to make sure all errors are positive (because it doesn't matter if the prediction was too high or too low, only how much it was off).
3. Calculate the average (or "mean") of these squared errors.
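Putting these steps together, for n samples with actual values y_i and predicted values \hat{y}_i (symbols chosen here for illustration):

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2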
The result is the MSE. One important aspect of the MSE is its relation to the variance of the target values. When the target values have a large variance, the MSE might give a large value even when the model predicts the values quite accurately. This is one of the reasons why the MSE is sensitive to outliers, as they tend to increase the variance. Hence, the lower the MSE, the higher the quality of the predictions. If the MSE is zero, the predictions are perfect - they exactly match the actual outcomes. On the other hand, if the MSE is large, the predicted values frequently deviate from the actual values.
Cross Entropy
Cross-entropy is a measure from the field of information theory that quantifies the dissimilarity between two probability distributions. It is commonly used as a loss function for classification tasks, especially when the output of a model is a probability distribution, which is often the case with models that output probabilities for each class. The cross-entropy loss is also referred to as "Log Loss" or "Soft-Max Loss". In a classification problem with n classes, the model outputs, for each instance in the dataset, a predicted probability distribution over these classes, which is compared against the actual (true-label) distribution. It is defined mathematically as follows:
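For a single instance, a standard formulation is the following, where y_i is the true probability of class i (typically a one-hot label) and \hat{y}_i is the predicted probability (symbols chosen here for illustration); the loss over the dataset is the average of this quantity across all instances:

CE = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)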
Cross-entropy loss is a method used to score how well the guesses matched the actual correct answers. The better the guesses, the lower the score; if the guesses are inaccurate, the score goes higher. This comparison is a bit more involved than a simple right-or-wrong check: it is more like estimating the likelihood of each answer being correct. Cross-entropy loss measures the difference between what was guessed and what is true. It is a little like checking how close a dart lands to the bullseye on a dartboard: the closer to the bullseye (i.e., the true answer), the better the score (i.e., the lower the loss).
3- Optimizers
Optimizers are algorithms used to change the attributes of the neural network such as weights and learning rate to reduce the losses. Optimizers help to get results faster. Common optimization methods, such as Batch Gradient Descent, Mini-Batch Gradient Descent, and Stochastic Gradient Descent, form the backbone of improving neural network performance by refining the network parameters to minimize the cost function. Techniques such as Momentum and ADAM (Adaptive Moment Estimation) further enhance these optimization methods by providing nuanced adjustments, helping to prevent local oscillations and accelerating convergence. Together, these methods significantly elevate the efficiency and effectiveness of neural network training.
Gradient-Based Optimization: This optimization forms the cornerstone of training neural networks. It leverages the concept of a gradient, which is a vector that points in the direction of the greatest rate of increase of a function. In the context of neural networks, this function is the cost function, a measure of the error between the network's predictions and the actual values. The goal is to minimize this cost function, and stepping against the gradient is the way to achieve this:
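In its generic form, with \theta the parameters, \eta the learning rate, and J(\theta) the cost function (standard notation, used here for illustration):

\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta)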
Batch Gradient Descent: This is the simplest form of gradient descent, where the gradient is computed over the entire dataset before a step is taken in the parameter space. While this method can converge to a global minimum for convex error surfaces and to a local minimum for non-convex surfaces, it can be computationally intensive and inefficient for large datasets.
Mini-Batch Gradient Descent: To overcome the computational inefficiencies of batch gradient descent, mini-batch gradient descent is often employed. Here, the gradient is computed over a small randomly-selected subset of the dataset. This results in a more efficient computation and also helps to avoid potential correlations in successive samples that might adversely affect the gradient calculation.
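To make the difference concrete, here is a minimal NumPy sketch of mini-batch gradient descent on a small synthetic linear-regression problem (the dataset, the learning rate of 0.1, and the batch size of 32 are illustrative assumptions, not values from the article):

import numpy as np

# Synthetic regression data: 1000 samples, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)      # parameters to learn
lr = 0.1             # learning rate
batch_size = 32

for epoch in range(20):
    indices = rng.permutation(len(X))          # shuffle to avoid correlated batches
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        error = Xb @ w - yb                    # prediction error on this mini-batch
        grad = 2 * Xb.T @ error / len(batch)   # gradient of the mean squared error
        w -= lr * grad                         # gradient descent step

print("learned weights:", w)                   # should be close to true_w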
Stochastic Gradient Descent (SGD): SGD takes this concept a step further and updates the parameters for each individual sample in the training dataset. This can cause significant fluctuation in the gradient, which can sometimes help escape a local minimum but can also lead to overshooting the global minimum. Overall, it often helps find a better minimum. The update rule for SGD is as follows:
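Using a single training example (x^{(i)}, y^{(i)}) per update (standard notation, used here for illustration):

\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})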
Momentum: As shown in the figure below, momentum is a technique used to assist SGD in avoiding local oscillations and to hasten convergence past saddle points. It can be challenging for optimization algorithms to escape from a saddle point, as the gradient is close to zero along one dimension, causing the optimizer to get stuck and slowing down the optimization process. Momentum helps the optimizer overcome this challenge by maintaining a moving average of past gradients, which builds up "momentum" and keeps the optimizer moving in the direction of the accumulated gradient. This method introduces a term that considers the previous gradients to dampen the oscillations, along with a hyperparameter that determines how much the past gradients influence the next step. The update rule with momentum can be written as:
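A common formulation, with \gamma the momentum coefficient (often around 0.9) and v_t the accumulated velocity (notation assumed here), is:

v_t = \gamma \, v_{t-1} + \eta \, \nabla_{\theta} J(\theta)
\theta \leftarrow \theta - v_t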
ADAM (Adaptive Moment Estimation): ADAM is an optimization algorithm that computes adaptive learning rates for each parameter, unlike SGD, which maintains a single learning rate for all weight updates. ADAM adapts the parameter-wise learning rate based on a moving average of the gradients and of the squared gradients, which enables more subtle and efficient updates and thus accelerates the convergence of the training process. The figure below presents a visual representation of an optimization landscape: local minima are points where the loss is lower than at the surrounding points but higher than the global minimum, while the global minimum is the point with the lowest loss value on the entire surface. By adjusting the learning rates based on historical gradient information, ADAM is able to achieve better performance and stability compared to optimization algorithms that use a constant learning rate.
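For reference, the standard ADAM update equations, where g_t is the gradient at step t and \beta_1, \beta_2, and \epsilon are hyperparameters (commonly 0.9, 0.999, and 10^{-8}; notation assumed here), are:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^{2}
\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}
\theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}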
In conclusion, this article has shed light on the critical components of CNN classification operations, focusing on activation functions, loss functions, and optimizers. Activation functions were explored as vital tools introducing non-linearity and enabling neural networks to learn complex relationships. Various activation functions, including Sigmoid, Tanh, ReLU, Leaky ReLU, and others, were detailed, offering insights into their strengths and use cases.
The discussion extended to loss functions, emphasizing their role in quantifying the difference between predicted and actual data. The Mean Squared Error (MSE) and Cross-Entropy Loss were elucidated, illustrating their applicability in regression and classification tasks, respectively. Understanding these loss functions is pivotal for assessing model performance and guiding optimization efforts.
Lastly, optimizers, such as Gradient-Based Optimization, Batch Gradient Descent, Mini-Batch Gradient Descent, and Stochastic Gradient Descent, were examined in the context of updating neural network parameters to minimize the cost function. Momentum and ADAM optimization techniques were introduced as enhancements to the traditional gradient descent methods, facilitating faster convergence and escaping local optima.
By unraveling the intricacies of activation, loss, and optimization in CNN classification, this article equips readers with the essential knowledge to harness the power of convolutional neural networks in various applications. The synergy of these components underpins the successful deployment of CNNs in real-world scenarios, from computer vision tasks to natural language processing and beyond.
Note: Backpropagation of errors, often referred to as simply "backpropagation," is a fundamental training algorithm used in artificial neural networks. It involves the iterative process of propagating the error or the difference between the predicted output and the actual target value backward through the network's layers. This process allows the network to adjust its internal parameters (weights and biases) in the direction that minimizes the error, ultimately enabling the network to improve its accuracy in making predictions or classifications during supervised learning tasks.
Reference: Khan, S., Rahmani, H., Shah, S.A.A. and Bennamoun, M. (2018). A Guide to Convolutional Neural Networks for Computer Vision. Synthesis Lectures on Computer Vision, 8(1), pp.1–207. doi: https://doi.org/10.2200/s00822ed1v01y201712cov015.