Why Batch Normalization is Essential for Deep Learning Models
Today, while going through a Kaggle notebook that implemented a denoising autoencoder with image data, I noticed that Batch Normalization was applied after every convolution layer with a ReLU activation function.
from tensorflow.keras.layers import Conv2D, BatchNormalization
from tensorflow.keras import regularizers

def conv_block(x, filters, kernel_size, strides=2):
    # Convolution with ReLU activation and L2 weight regularization
    x = Conv2D(filters=filters,
               kernel_size=kernel_size,
               strides=strides,
               padding='same',
               activation='relu',
               kernel_regularizer=regularizers.l2(0.001))(x)
    # Normalize the activations of the current mini-batch
    x = BatchNormalization()(x)
    return x
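For context, here is a minimal sketch of how such a block might be stacked into an encoder; the input shape and filter counts below are assumptions for illustration, not taken from the original notebook:

from tensorflow.keras import Input, Model

inputs = Input(shape=(64, 64, 1))                  # hypothetical 64x64 grayscale images
x = conv_block(inputs, filters=32, kernel_size=3)  # strides=2 halves spatial dims: 32x32x32
x = conv_block(x, filters=64, kernel_size=3)       # 16x16x64
encoder = Model(inputs, x)
encoder.summary()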
I wanted to share an explanation of why Batch Normalization is used, especially when working with the ReLU activation function in deep neural networks.
What is an Activation Function?
An activation function is a mathematical operation applied to the output of each neuron in a neural network. It introduces non-linearity into the model, which allows the network to learn complex patterns. Without an activation function, the network would essentially be a linear model, which limits its ability to capture complex relationships in the data.
What is ReLU Activation?
The ReLU (Rectified Linear Unit) activation function is one of the most popular choices, defined as:

ReLU(x) = max(0, x)

ReLU helps by introducing non-linearity, making the model capable of learning more complex patterns. However, it comes with two important challenges: linearity in the positive region and vanishing gradients in the negative region.
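As a quick illustration (a NumPy sketch of my own, not from the notebook), ReLU zeroes out negative values and passes positive values through unchanged:

import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.maximum(0.0, x))   # [0.  0.  0.  0.5 3. ] -- ReLU(x) = max(0, x)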
Linearity and Vanishing Gradient: Why is it a Problem?
1. Linearity:
When the input to ReLU is positive, the output equals the input (i.e., ReLU(x) = x), so the function is linear in that region. If most of a layer's pre-activations are positive, the layer behaves almost like a linear transformation, which limits the model's ability to learn complex patterns, especially in deep networks.
2. Vanishing Gradient:
When the input to ReLU is negative, the output is 0, and so is the gradient used for updating weights during backpropagation. Weights feeding neurons that consistently receive negative inputs therefore stop updating, resulting in "dead" neurons (often called the dying ReLU problem). This worsens in deep networks, leading to poor learning and slower convergence.
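To make this concrete, here is a small sketch (my own illustration, assuming TensorFlow is available) that uses tf.GradientTape to show ReLU's gradient is exactly zero wherever the input is negative:

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.5, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x)            # track the constant so we can differentiate with respect to it
    y = tf.nn.relu(x)
grads = tape.gradient(y, x)
print(grads.numpy())         # [0. 0. 1. 1.] -- neurons with negative inputs receive no update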
What is "Batch Normalization"?
As the name suggests, Batch Normalization is literally about normalizing the activations of neurons in each batch of data. But what does this normalization process involve?
Normalization is a process that adjusts the distribution of data. In the context of Batch Normalization:
1. Normalization refers to adjusting the activations (outputs) from each layer of the network so that they have a mean of zero and a standard deviation of one for each mini-batch. This is done by calculating the mean and variance of the activations within the current batch and then applying the formula:

x̂ = (x − μ_B) / √(σ_B² + ε)

Where x is an activation in the current mini-batch, μ_B is the batch mean, σ_B² is the batch variance, and ε is a small constant added for numerical stability.

This process standardizes the activations, making them more stable and ensuring that they fall within a consistent range.
2. After normalizing, Batch Normalization introduces two additional parameters: scale (γ) and shift (β), which allow the network to learn the optimal distribution for the activations through training.
The Batch Normalization formula with scale (γ) and shift (β) is:

y = γ · x̂ + β
The scale (γ) parameter adjusts the variance of the activations, while the shift (β) parameter allows the network to modify the mean. These parameters give the model flexibility, enabling it to adapt the activations to better fit the underlying data distribution, which helps the network learn more effectively. By learning the optimal scale and shift values, Batch Normalization ensures that the network doesn't get constrained by a fixed mean and variance, leading to faster convergence and better overall performance.
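Putting the two steps together, here is a minimal NumPy sketch of the Batch Normalization forward pass (an illustration of the formulas above; the real Keras layer additionally tracks running statistics for inference, which is omitted here):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, features); statistics are computed per feature over the batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift with the learnable parameters

x = np.random.randn(128, 4) * 3.0 + 5.0     # activations with mean ~5 and std ~3
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # roughly [0 0 0 0] and [1 1 1 1]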
Why is Scaling important?
Because it lets the network learn the optimal distribution: the primary purpose of introducing γ and β is to give the network the freedom to learn the distribution of activations that best fits the task and data at hand. By allowing both the mean and the variance of the activations to be learned, the model gains flexibility; in the extreme case, it can even recover the original, un-normalized activations if that turns out to work best.
Now, let’s see how Batch Normalization solves the issues introduced by ReLU activation.
Batch Normalization Reduces Linearity and Prevents Vanishing Gradients
Mathematically, after normalization the activations have a mean of zero and a standard deviation of one. The scaling and shifting parameters (γ and β) then let the network adjust the normalized activations, ensuring that the inputs to ReLU are neither predominantly positive (which would make the layer behave almost linearly) nor predominantly negative (which would zero out the gradients). As a result, the activations are kept within a range that avoids both problems, the gradients remain usable, and backpropagation can proceed properly.
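As a final sanity check, here is a small sketch (my own example with made-up numbers) showing that a Keras BatchNormalization layer pulls a heavily shifted batch of pre-activations back to roughly zero mean and unit variance, so ReLU would see both positive and negative values:

import numpy as np
import tensorflow as tf

# Pre-activations that are almost all positive (mean ~5): ReLU would act nearly linearly here
pre_act = tf.constant((np.random.randn(256, 64) * 3.0 + 5.0).astype("float32"))

bn = tf.keras.layers.BatchNormalization()
normed = bn(pre_act, training=True)   # training=True -> normalize with this batch's statistics

print(float(tf.reduce_mean(pre_act)), float(tf.math.reduce_std(pre_act)))   # ~5.0, ~3.0
print(float(tf.reduce_mean(normed)), float(tf.math.reduce_std(normed)))     # ~0.0, ~1.0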