Layer Normalization

Layer normalization is a crucial technique in transformer models that helps stabilize and accelerate training by normalizing the inputs to each layer. It ensures that the model processes information consistently, regardless of the input’s scale or distribution. Building on concepts like self-attention, multi-head attention, and positional encoding, layer normalization enhances the efficiency and robustness of transformers. In this blog, we’ll dive into how it works and why transformers use it instead of batch normalization.

Let’s quickly discuss what normalization is and why it’s useful in deep learning. If you’ve been studying machine learning for a while, you’re likely familiar with concepts like normalization and standardization. But to recap, normalization in deep learning refers to the process of transforming data to conform to specific statistical properties.

There are various forms of normalization. One common type is standardization, where each data point is adjusted by subtracting the mean of its column and then dividing by the standard deviation. This transformation results in a new column where the mean is zero and the standard deviation is one. Another type is min-max normalization, where data is scaled to fit within a given range. These techniques are applied to ensure that the data is on a consistent scale, which is crucial in machine learning.
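To make this concrete, here is a minimal NumPy sketch of both techniques applied to a single, made-up feature column (the specific values are just for illustration):

```python
import numpy as np

# A made-up feature column
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()
print(standardized.mean(), standardized.std())  # ~0.0 and 1.0

# Min-max normalization: rescale values into the [0, 1] range
min_max = (x - x.min()) / (x.max() - x.min())
print(min_max.min(), min_max.max())  # 0.0 and 1.0
```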

In deep learning, normalization can be applied in two key areas:

  1. Input Data: The data that you feed into the neural network (e.g., features like f1, f2, f3) can be normalized before entering the network. This step is similar to what you’ve done in traditional machine learning.
  2. Hidden Layer Activations: You can also normalize the activations (outputs) from hidden layers in the neural network. This is often done to stabilize and accelerate training, especially in deep networks.

Let’s dive into why normalization is so important in deep learning. Imagine you’re training a neural network, and as you update the weights, some of them start getting really big. When that happens, the activations tied to those weights also become large, making it harder for your model to learn effectively. It slows things down and can cause problems in training.

Normalization helps fix this by keeping activations within a stable range. This not only makes the training process more stable but also speeds it up, allowing your model to learn more efficiently.

Another big benefit of normalization is that it prevents a problem called internal covariate shift. This happens when the input data’s distribution changes as it moves through the layers of your network, which can confuse the model. By normalizing the activations, you keep things consistent, so the model can keep learning without getting thrown off.

On top of that, some types of normalization, like batch normalization, even help with regularization, which means your model can generalize better to new data.

Great, so now that you’ve got a good understanding of why normalization is important, let’s move on to a quick review of batch normalization.

Batch Normalization:

Batch normalization is designed to address a specific problem known as internal covariate shift. To understand this, imagine a neural network with multiple hidden layers. During training, the distribution of activations in these layers can change because the weights are updated. This shifting distribution makes it challenging for the network to learn effectively, leading to unstable training.

Internal covariate shift refers to this problem where the distribution of activations changes during training due to the constant updates to the network’s weights. This shifting can cause each subsequent layer to receive inputs with varying distributions, which can hinder stable learning.

Batch normalization mitigates this issue by normalizing the activations within each layer, ensuring they follow a consistent distribution with a mean of zero and a standard deviation of one. This helps maintain stability in training and allows the network to learn more effectively. By normalizing activations, batch normalization reduces the impact of internal covariate shift and improves convergence speed, making the training process smoother and more predictable.

How Does Batch Normalization Work?

Let’s break down how batch normalization works with an example. We have a simple dataset with two features and some rows of data. We’ll use this dataset to train a neural network.

We have a dataset with features f1 and f2 and a neural network with one hidden layer. The goal is to apply batch normalization to the output of the nodes in the hidden layer. Since we are processing data in batches, let’s assume for simplicity that our batch size is 5. This means we will feed five rows of data to the neural network at once.

Steps in Batch Normalization:

  1. Calculate Activations:

  • For each row in the batch, calculate the activations for the nodes in the hidden layer.
  • For simplicity, let’s focus on three nodes and their activations z1, z2, and z3. Assume we calculate these values for each row in the batch.

  2. Normalize Activations:

  • Step 1: Compute the mean (μ) and standard deviation (σ) for each activation value across the batch.
  • For z1: Find the mean μ1 and standard deviation σ1.
  • For z2: Find the mean μ2 and standard deviation σ2.
  • For z3: Find the mean μ3 and standard deviation σ3.

  • Step 2: Normalize each activation value by subtracting the node’s batch mean and dividing by its batch standard deviation:

  z_norm = (z − μ) / σ

  • Apply this to all values for z1, z2, and z3.
  • Step 3: After normalization, apply scaling and shifting. Each node has two learnable parameters, gamma (γ) and beta (β), giving z_out = γ × z_norm + β.

Here, γ and β are initially set to 1 and 0, respectively, but are adjusted during training.

  3. Replace Values:

  • Replace the original activation values with the normalized values for each node.
  • Continue this for all rows in the batch.

  4. Apply Activation Function:

  • After normalization, pass the values through the activation function to get the final output for each node.

This is how batch normalization works.
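To make these steps concrete, here is a minimal NumPy sketch of batch normalization for one hidden layer with three nodes and a batch of five rows. The activation values are made up, and real framework implementations additionally track running statistics for inference; a small epsilon is included for numerical stability:

```python
import numpy as np

# Pre-activations z1, z2, z3 for a batch of 5 rows (made-up numbers):
# each row is one sample, each column is one hidden node.
Z = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 0.5, 1.5],
    [4.0, 3.0, 2.5],
    [0.5, 1.0, 3.5],
    [3.0, 2.5, 0.5],
])

eps = 1e-5               # small constant for numerical stability
gamma = np.ones(3)       # learnable scale, initialized to 1
beta = np.zeros(3)       # learnable shift, initialized to 0

# Per-node mean and std, computed down each column (i.e., across the batch)
mu = Z.mean(axis=0)
sigma = Z.std(axis=0)

# Normalize, then scale and shift
Z_norm = (Z - mu) / np.sqrt(sigma**2 + eps)
Z_out = gamma * Z_norm + beta

print(Z_out.mean(axis=0))  # ~0 for each node
print(Z_out.std(axis=0))   # ~1 for each node
```

The activation function (for example, ReLU) would then be applied to Z_out, as described in step 4 above.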

Why don’t we use Batch Normalization in Transformers?

In the context of transformer architectures, the choice of normalization technique plays a crucial role in the model’s performance. The primary reason layer normalization is preferred over batch normalization in transformers is that batch normalization does not work effectively with self-attention mechanisms. To put it simply, batch normalization struggles with sequential data, which is a fundamental aspect of transformer models.

Let’s delve deeper into this with a well-known diagram of the transformer architecture. You’ll notice that the normalization step is applied right after the attention mechanism.

But why not use batch normalization here? To illustrate this, I’ll demonstrate what happens when you apply batch normalization directly to the self-attention mechanism.

A Brief Recap of Self-Attention

In a self-attention mechanism, you typically start with a sentence, like “river bank,” and break it down into individual words or tokens. For simplicity, let’s consider two tokens: “river” and “bank.” Each word is then represented by an embedding vector. Although embedding vectors are usually high-dimensional, let’s assume we are using a four-dimensional vector for this example. Importantly, each word’s embedding vector has the same dimensionality.

Next, these embedding vectors are fed into the self-attention mechanism. The role of self-attention is to generate contextual embeddings from these vectors. What you see here is the contextual embedding vector for the word “river,” which accounts for the presence of the word “bank” alongside it, and vice versa. Notice that the dimensionality of the output vectors remains consistent with the input — both are four-dimensional in this example.
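To make the shapes concrete, below is a bare-bones scaled dot-product self-attention sketch in NumPy. The embeddings and projection matrices are random placeholders, not learned values; the point is only that two 4-dimensional input vectors come out as two 4-dimensional contextual vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4                                   # embedding dimension, as in the example
X = rng.normal(size=(2, d))             # embeddings for the two tokens "river" and "bank"

# Learned query/key/value projection matrices (random placeholders here)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
contextual = weights @ V

print(contextual.shape)  # (2, 4): same dimensionality as the input embeddings
```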

Adding Complexity: Batch Processing in Self-Attention

Now, let’s introduce a bit of complexity. Up until now, we’ve considered feeding one sentence at a time into the self-attention module. But what if we want to process multiple sentences simultaneously? This is where batching comes into play. I’ll show you how you can send more than one sentence through the self-attention module in the form of batches.

Let’s assume we’re working on a sentiment analysis task with a dataset containing sentences like these:

To train the self-attention module on this data, we’ve decided to process two sentences at a time — meaning our batch size is set to two. So, the first two sentences will be processed together as one batch, followed by the next two sentences.

Each word in these sentences is represented by an embedding vector, just like before, and for simplicity, let’s keep the dimensionality of each embedding vector at three. Below is a diagram showing the embedding of each word:

In the context of self-attention, especially when dealing with sequences of different lengths, padding plays a crucial role. Let’s delve into how padding works and why it’s necessary, using a practical example.

Imagine we have two sentences:

  1. “Hi Nitish”
  2. “How are you today?”

These sentences vary in length — one has two words, and the other has four. However, in self-attention, we need to ensure that the number of words (tokens) in each sentence is equal when processing them together. This is where padding comes in.
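As a rough sketch (with made-up 3-dimensional embeddings, matching the dimensionality used above), padding the shorter sentence with all-zero vectors brings both sequences to the same length:

```python
import numpy as np

# Made-up 3-dimensional embeddings for each token
sent1 = np.array([[0.2, 0.1, 0.4],      # "Hi"
                  [0.5, 0.3, 0.9]])     # "Nitish"
sent2 = np.array([[0.7, 0.2, 0.1],      # "How"
                  [0.4, 0.6, 0.3],      # "are"
                  [0.1, 0.8, 0.5],      # "you"
                  [0.9, 0.4, 0.2]])     # "today"

max_len = max(len(sent1), len(sent2))   # 4 tokens

# Pad the shorter sentence with all-zero embedding vectors
pad_rows = max_len - len(sent1)
sent1_padded = np.vstack([sent1, np.zeros((pad_rows, 3))])

print(sent1_padded.shape, sent2.shape)  # both (4, 3)
```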

With both sentences now of equal length, they can be fed into the self-attention module. This module will calculate contextual embeddings for each word based on the entire sequence.

The above embeddings can be represented using matrices, like this:

After passing these matrices through the self-attention block, you obtain the following contextual embeddings.

Now, we stack both contextual embeddings together, as shown below:

Now, let’s apply batch normalization to this setup by calculating the mean and standard deviation across each column (feature). But ask yourself: do these statistics truly represent the data? The answer is no, because the padding introduces many zeros, and those zeros drag the means toward zero and distort the standard deviations. This is the main issue: batch normalization is not effective when the batch contains many padded (zero) values.
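A small numerical check makes the problem visible. Stacking the two padded embedding matrices into one batch and computing batch-norm style statistics per feature shows how the zero rows pull the means toward zero (the numbers reuse the illustrative padded matrices from the earlier sketch):

```python
import numpy as np

# Two embedding matrices of shape (4 tokens, 3 features);
# the last two rows of the first sentence are zero padding.
batch = np.vstack([
    [[0.2, 0.1, 0.4],
     [0.5, 0.3, 0.9],
     [0.0, 0.0, 0.0],   # padding
     [0.0, 0.0, 0.0]],  # padding
    [[0.7, 0.2, 0.1],
     [0.4, 0.6, 0.3],
     [0.1, 0.8, 0.5],
     [0.9, 0.4, 0.2]],
])

# Batch-norm style statistics: one mean per feature, across all tokens in the batch
mu_with_pad = batch.mean(axis=0)
mu_real = batch[(batch != 0).any(axis=1)].mean(axis=0)  # ignoring the padded rows

print(mu_with_pad)  # dragged toward zero by the padding rows
print(mu_real)      # statistics of the actual tokens are noticeably larger
```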

Conclusion: Why Batch Normalization Doesn’t Work in Self-Attention

The fundamental issue is that padding tokens, although necessary for alignment, are not part of the original data. They introduce a lot of zeros into the dataset, which can mislead the normalization process. This is why batch normalization is not applied in the context of self-attention in transformers.

The Solution to Batch Normalization’s Problem: Layer Normalization

To explain layer normalization, I’ll use the same setup I previously used for batch normalization. We have the exact same neural network and the same data, so the foundation is consistent. We’re still calculating z1, z2, and z3, which represent the pre-activation values for the first, second, and third nodes, respectively, for all data points.

One key similarity is that each node has its own gamma (γ) and beta (β) parameters. This setup is identical to what we saw with batch normalization. However, the main difference arises in how normalization is applied.

In batch normalization, the normalization is done across the batch. For example, if we consider the rows as our batch, normalization is applied in this direction. But with layer normalization, normalization occurs across the features. This means normalization happens across each row, rather than across the batch.

To compute the mean and standard deviation for layer normalization, you calculate them for each row across all of its features. For instance, you’ll compute the mean (μ1) and standard deviation (σ1) for the first row, then μ2 and σ2 for the second row, and so on for every row in the batch.

Once you have these values, you normalize each element by subtracting the mean of its row and dividing by the standard deviation. For example, to normalize the first element in the first row, you’d write:

normalized_value = (7 − μ1) / σ1

Let’s assume this value is p. The next step involves scaling and shifting this value using the gamma and beta parameters specific to z1:

output = γ1 × p + β1

This process is repeated for each element in the row. By following this approach, you normalize the entire data set across the features, rather than across the batch.
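Here is a minimal NumPy sketch of layer normalization on the same kind of activation matrix used in the batch normalization example (values are made up; the first element is 7 to match the formula above, and a small epsilon is added for numerical stability, as frameworks do):

```python
import numpy as np

# Activations: 5 rows (samples), 3 hidden nodes (made-up values)
Z = np.array([
    [7.0, 2.0, 3.0],
    [2.0, 0.5, 1.5],
    [4.0, 3.0, 2.5],
    [0.5, 1.0, 3.5],
    [3.0, 2.5, 0.5],
])

eps = 1e-5
gamma = np.ones(3)   # per-node learnable scale, initialized to 1
beta = np.zeros(3)   # per-node learnable shift, initialized to 0

# Layer norm: mean and std are computed across the features of each row
mu = Z.mean(axis=1, keepdims=True)       # one mean per row
sigma = Z.std(axis=1, keepdims=True)     # one std per row

Z_norm = (Z - mu) / np.sqrt(sigma**2 + eps)
Z_out = gamma * Z_norm + beta

print(Z_out.mean(axis=1))  # ~0 for each row
print(Z_out.std(axis=1))   # ~1 for each row
```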

Key Difference Between Batch Normalization and Layer Normalization

The main difference between batch normalization and layer normalization is in the direction of normalization. Batch normalization normalizes across the batch, while layer normalization normalizes across the features. This distinction is crucial, particularly in the context of transformers, where layer normalization plays a vital role in handling the sequential nature of the data.
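The difference really is just the axis along which the statistics are computed. A tiny comparison on one hypothetical matrix makes this explicit:

```python
import numpy as np

Z = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2 samples (rows) x 3 features (columns)

# Batch norm: one mean/std per feature, computed down each column (across the batch)
bn_mean, bn_std = Z.mean(axis=0), Z.std(axis=0)

# Layer norm: one mean/std per sample, computed along each row (across the features)
ln_mean, ln_std = Z.mean(axis=1), Z.std(axis=1)

print(bn_mean, ln_mean)  # [2.5 3.5 4.5] vs [2. 5.]
```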

In summary, layer normalization offers a more logical and effective approach in transformer architectures, ensuring that the data is accurately normalized, even in the presence of padding, thereby enhancing the model’s performance and stability.
