Layer Normalization

Layer Norm, Batch Norm & Covariate Shift:

Continuing from my last post on batch normalization, here are a few things about layer normalization that are helpful when working with neural network architectures like Transformers, RNNs, and feedforward networks.


Why LayerNorm? Problems with BatchNorm:

Most of the issues with Batch Norm arise from its dependency on the batch size used while training the network.

  • Hard to use with sequence data: sequences vary in length, which makes computing consistent batch statistics awkward.
  • Doesn't work well with small batch sizes: Batch Norm estimates the mean & variance from the samples in a batch, so statistics computed over small batches don't represent the overall data well (see the short sketch after this list).
  • Parallelization: Batch Norm couples the samples in a batch through shared statistics, which makes it harder to split computation across devices without synchronizing those statistics.
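
A short sketch of the small-batch problem (synthetic data, purely illustrative): statistics estimated from a tiny batch bounce around much more than statistics over the full dataset, and those noisy estimates are exactly what Batch Norm has to normalize with.

```python
import torch

torch.manual_seed(0)
data = torch.randn(10_000)            # stand-in for one activation/feature over a dataset
print(data.mean())                    # "true" mean, close to 0

# Means estimated from a few random mini-batches of size 4 vary widely.
for _ in range(3):
    idx = torch.randint(0, len(data), (4,))
    print(data[idx].mean())           # noisy estimates Batch Norm would have to use
```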


What is Layer Normalization?

  • Layer normalization is similar to batch normalization: it is a way to reduce covariate shift in neural networks, allowing them to be trained faster and to achieve better performance.
  • In simple terms, covariate shift refers to changes in the distribution of a neural network's activations as it trains, caused by changes in the data distribution such as its scale, mean, or variance.
  • Batch normalization computes the mean and variance of the activations as an average over the samples in the batch, so its performance depends on the mini-batches used to train the model.
  • Layer normalization, however, computes the mean and variance (that is, the normalization terms) of the activations in such a way that the normalization terms are the same for every hidden unit in a layer.
  • In other words, layer normalization uses a single mean and variance for all the hidden units in a layer. This is in contrast to batch normalization, which maintains individual mean and variance values for each hidden unit in a layer.
  • Moreover, unlike batch normalization, layer normalization does not average over the samples in the batch; instead, it computes different normalization terms for different inputs. By having a mean and variance per sample, layer normalization removes the dependency on the mini-batch size.


  • Benefits of Layer Norm: it can handle sequence data (as in RNNs), it works with any batch size, and it does not hinder parallelizing training; a minimal sketch contrasting the two normalizations follows this list.
  • Layer Norm doesn't work well with CNNs; Batch Norm is generally preferred for CNNs.
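
To make the per-unit vs. per-layer distinction concrete, here is a minimal PyTorch sketch (my own illustration, not from the notebook linked below; shapes and values are arbitrary). Batch Norm normalizes each feature across the batch, while Layer Norm normalizes each sample across its features.

```python
import torch
import torch.nn as nn

# Toy activations: a batch of 4 samples, each with 8 features.
x = torch.randn(4, 8)

# BatchNorm1d: one mean/variance per feature, computed across the batch dimension.
bn = nn.BatchNorm1d(num_features=8)
# LayerNorm: one mean/variance per sample, computed across the feature dimension.
ln = nn.LayerNorm(normalized_shape=8)

bn.train()                      # use batch statistics (training mode)
print(bn(x).mean(dim=0))        # ~0 for every feature (column-wise normalization)
print(ln(x).mean(dim=-1))       # ~0 for every sample  (row-wise normalization)
```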


Visual Understanding:

[Figure: Batch Normalization]
[Figure: Layer Normalization]
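
Since the original figures don't survive here, the tiny numeric example below (arbitrary values) stands in for them: Batch Norm averages down each column (per feature, across samples), while Layer Norm averages across each row (per sample, across features).

```python
import torch

# A tiny activation matrix: 2 samples (rows) x 3 features (columns).
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Batch Norm statistics: per feature, across the batch (columns).
print(x.mean(dim=0), x.var(dim=0, unbiased=False))   # means: [2.5, 3.5, 4.5]

# Layer Norm statistics: per sample, across the features (rows).
print(x.mean(dim=1), x.var(dim=1, unbiased=False))   # means: [2.0, 5.0]
```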

Covariate Shift:

  • Covariate Shift refers to changes in the distribution of activations or features within a neural network as the model goes through training.
  • In simpler terms, it's the phenomenon where the statistical properties of the inputs to a neural network change over time. This change can be caused by various factors, such as changes in the data distribution, changes in the model's parameters, or the inherent non-stationarity of the data.
  • For instance, during the training of a neural network, the distribution of data that it sees can change as the model adapts to new examples. This can lead to differences in the scale, mean, or variance of the activations within the network. When this happens, the network may need to continuously adapt to these changes, making training slower and less stable.
  • Why Does Batch Norm Work? by DeepLearning.AI: a visual explanation of covariate shift using the black cat vs. colored cat example.


Reducing Covariate Shift:

Batch Normalization (BatchNorm):

  • Batch normalization is a technique used to mitigate covariate shift.
  • It works by normalizing (scaling and shifting) the activations within each mini-batch of data during training (a rough sketch follows this list).
  • This helps stabilize the distribution of activations, making training more efficient and enabling the use of higher learning rates.
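
As a rough sketch of what "normalizing (scaling and shifting) the activations within each mini-batch" means, here is a hand-rolled training-time batch-norm step (illustration only; it ignores the running statistics a real implementation keeps for inference, and the names are mine):

```python
import torch

def batch_norm_step(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Statistics are computed per feature, over the batch.
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta                  # learnable scale and shift

x = torch.randn(32, 16)
gamma, beta = torch.ones(16), torch.zeros(16)
y = batch_norm_step(x, gamma, beta)
```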

Layer Normalization (LayerNorm):

  • Layer normalization is similar to batch normalization but operates at a different level.
  • While batch normalization normalizes activations across a mini-batch, layer normalization normalizes activations across the features at each layer.
  • In other words, it normalizes the activations of a single training example using statistics computed across that example's features, rather than relying on statistics computed over a mini-batch (a matching sketch follows this list).
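
And the matching hand-rolled layer-norm step, identical except that the statistics are taken over each example's features instead of over the batch (again an illustration with made-up names; in practice you would use something like PyTorch's nn.LayerNorm):

```python
import torch

def layer_norm_step(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Statistics are computed per sample, over the features.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # normalize each example independently
    return gamma * x_hat + beta                  # learnable per-feature scale and shift

x = torch.randn(1, 16)                           # works even for a single example
gamma, beta = torch.ones(16), torch.zeros(16)
y = layer_norm_step(x, gamma, beta)
```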


Benefits of Layer Normalization:

Layer normalization offers several advantages:

  • Reducing Covariate Shift: Layer normalization, like batch normalization, helps reduce the effects of covariate shift by ensuring that the mean and variance of the activations within each layer remain relatively constant during training. This stabilizes the training process.
  • Independence from Batch Size: Unlike batch normalization, layer normalization is not dependent on the mini-batch size. It is often used in scenarios where batch sizes are small or even when processing single examples (as in RNNs); a short sketch of this follows the list.
  • Applicability to Different Architectures: Layer normalization is used in a wide range of neural network architectures, including Transformers, RNNs, and feedforward networks.
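
A quick sketch of the batch-size-independence point (contrived example, arbitrary values): Layer Norm produces the same output for a given example whether it is processed alone or inside a larger batch, because its statistics never involve the other samples.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(8)
x = torch.randn(8)

alone = ln(x.unsqueeze(0))                               # batch of 1
in_batch = ln(torch.stack([x, torch.randn(8), torch.randn(8)]))[0]
print(torch.allclose(alone[0], in_batch, atol=1e-6))     # True: per-sample statistics
```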


In summary, covariate shift, which is the change in the distribution of neural network activations during training, can hinder the training process and negatively impact model performance. Techniques like layer normalization, by ensuring stable statistics of activations at each layer, help alleviate this problem and make training more efficient and effective, ultimately leading to better model performance.


For more details, here is my notebook on BatchNorm & LayerNorm:

GitHub Notebook Link

  • Batch Norm, Layer Norm, and Covariate Shift Explained!
  • Training & Testing Differences in Batch Norm & Layer Norm.


