Why Batch Normalization is Essential for Deep Learning Models
Today, while going through a Kaggle notebook that implemented a denoising autoencoder with image data, I noticed that Batch Normalization was applied after every convolution layer with a ReLU activation function.
from tensorflow.keras.layers import Conv2D, BatchNormalization
from tensorflow.keras import regularizers

def conv_block(x, filters, kernel_size, strides=2):
    # Convolution with ReLU activation and L2 weight regularization
    x = Conv2D(filters=filters,
               kernel_size=kernel_size,
               strides=strides,
               padding='same',
               activation='relu',
               kernel_regularizer=regularizers.l2(0.001))(x)
    # Normalize the activations of the current mini-batch
    x = BatchNormalization()(x)
    return x
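For context, here is a minimal sketch of how such a block might be stacked into an encoder; the input shape and filter counts below are assumptions for illustration, not taken from the original notebook:

from tensorflow.keras import Input, Model

inputs = Input(shape=(64, 64, 1))                  # hypothetical 64x64 grayscale images
x = conv_block(inputs, filters=32, kernel_size=3)  # strides=2 halves spatial dims: 32x32x32
x = conv_block(x, filters=64, kernel_size=3)       # 16x16x64
encoder = Model(inputs, x)
encoder.summary()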
I wanted to share an explanation of why Batch Normalization is used, especially when working with the ReLU activation function in deep neural networks.
What is an Activation Function?
An activation function is a mathematical operation applied to the output of each neuron in a neural network. It introduces non-linearity into the model, which allows the network to learn complex patterns. Without an activation function, the network would essentially be a linear model, which limits its ability to capture complex relationships in the data.
What is ReLU Activation?
The ReLU (Rectified Linear Unit) activation function is one of the most popular choices, defined as:

ReLU(x) = max(0, x)

ReLU helps by introducing non-linearity, making the model capable of learning more complex patterns. However, it comes with two important challenges: linearity in the positive region and vanishing gradients in the negative region.
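As a quick illustration (a NumPy sketch of my own, not from the notebook), ReLU zeroes out negative values and passes positive values through unchanged:

import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.maximum(0.0, x))   # [0.  0.  0.  0.5 3. ] -- ReLU(x) = max(0, x)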
Linearity and Vanishing Gradient: Why is it a Problem?
1. Linearity:
When the input to ReLU is positive, the output equals the input (i.e., ReLU(x) = x), so the function is linear in that region. If most of a layer's pre-activations are positive, the layer behaves almost like a linear transformation, which limits the model's ability to learn complex patterns, especially in deep networks.
2. Vanishing Gradient:
When the input to ReLU is negative, the output is 0, and so is the gradient used for updating weights during backpropagation. Weights feeding neurons that consistently receive negative inputs therefore stop updating, resulting in "dead" neurons (often called the dying ReLU problem). This worsens in deep networks, leading to poor learning and slower convergence.
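To make this concrete, here is a small sketch (my own illustration, assuming TensorFlow is available) that uses tf.GradientTape to show ReLU's gradient is exactly zero wherever the input is negative:

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.5, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x)            # track the constant so we can differentiate with respect to it
    y = tf.nn.relu(x)
grads = tape.gradient(y, x)
print(grads.numpy())         # [0. 0. 1. 1.] -- neurons with negative inputs receive no update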
What is "Batch Normalization"?
As the name suggests, Batch Normalization is literally about normalizing the activations of neurons in each batch of data. But what does this normalization process involve?
Normalization is a process that adjusts the distribution of data. In the context of Batch Normalization:
1. Normalization refers to adjusting the activations (outputs) from each layer of the network so that they have a mean of zero and a standard deviation of one for each mini-batch. This is done by calculating the mean and variance of the activations within the current batch and then applying the formula:

x̂ = (x − μ_B) / √(σ_B² + ε)

Where x is an activation in the current mini-batch, μ_B is the batch mean, σ_B² is the batch variance, and ε is a small constant added for numerical stability.

This process standardizes the activations, making them more stable and ensuring that they fall within a consistent range.
2. After normalizing, Batch Normalization introduces two additional parameters: scale (γ) and shift (β), which allow the network to learn the optimal distribution for the activations through training.
The Batch Normalization formula with scale (γ) and shift (β) is:

y = γ · x̂ + β
The scale (γ) parameter adjusts the variance of the activations, while the shift (β) parameter allows the network to modify the mean. These parameters give the model flexibility, enabling it to adapt the activations to better fit the underlying data distribution, which helps the network learn more effectively. By learning the optimal scale and shift values, Batch Normalization ensures that the network doesn't get constrained by a fixed mean and variance, leading to faster convergence and better overall performance.
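Putting the two steps together, here is a minimal NumPy sketch of the Batch Normalization forward pass (an illustration of the formulas above; the real Keras layer additionally tracks running statistics for inference, which is omitted here):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, features); statistics are computed per feature over the batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift with the learnable parameters

x = np.random.randn(128, 4) * 3.0 + 5.0     # activations with mean ~5 and std ~3
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # roughly [0 0 0 0] and [1 1 1 1]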
Why is Scaling important?
Because it lets the network learn the optimal distribution: the primary purpose of introducing γ and β is to give the network the freedom to learn the distribution of activations that best fits the task and data at hand. By allowing both the mean and the variance of the activations to be learned, the model gains flexibility; in the extreme case, it can even recover the original, un-normalized activations if that turns out to work best.
Now, let’s see how Batch Normalization solves the issues introduced by ReLU activation.
Batch Normalization Reduces Linearity and Prevents Vanishing Gradients
Mathematically, after normalization the activations have a mean of zero and a standard deviation of one. The scaling and shifting parameters (γ and β) then let the network adjust the normalized activations, ensuring that the inputs to ReLU are neither predominantly positive (which would make the layer behave almost linearly) nor predominantly negative (which would zero out the gradients). As a result, the activations are kept within a range that avoids both problems, the gradients remain usable, and backpropagation can proceed properly.
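As a final sanity check, here is a small sketch (my own example with made-up numbers) showing that a Keras BatchNormalization layer pulls a heavily shifted batch of pre-activations back to roughly zero mean and unit variance, so ReLU would see both positive and negative values:

import numpy as np
import tensorflow as tf

# Pre-activations that are almost all positive (mean ~5): ReLU would act nearly linearly here
pre_act = tf.constant((np.random.randn(256, 64) * 3.0 + 5.0).astype("float32"))

bn = tf.keras.layers.BatchNormalization()
normed = bn(pre_act, training=True)   # training=True -> normalize with this batch's statistics

print(float(tf.reduce_mean(pre_act)), float(tf.math.reduce_std(pre_act)))   # ~5.0, ~3.0
print(float(tf.reduce_mean(normed)), float(tf.math.reduce_std(normed)))     # ~0.0, ~1.0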