Why is Transformer Preferred Over RNN? - Transformer Part 1: Embedding & Positional Encoding
We have now reached the main focus of this lecture: the Transformer model. The Transformer is a Sequence-to-Sequence (Seq2Seq) architecture that significantly improves upon traditional models like RNNs, primarily through the use of multi-head attention and self-attention mechanisms. These innovations allow it to better capture long-range dependencies and process sequences in parallel.
The architecture of the Transformer consists of several key components, and we will cover them one topic at a time.
Today, we focus on the first chapter: Embedding & Positional Encoding.
Embedding Vs. Encoding
In the Transformer model, embedding is used to convert input tokens and output tokens into vectors of a fixed dimension d_model (the model dimension, as in Seq2Seq), which allows the model to distinguish between words. This step is similar to other sequence transduction models like Seq2Seq. The embedding outputs are scaled by multiplying them by sqrt(d_model), which helps keep training stable: it puts the embeddings on a scale comparable to the positional encodings that are added to them, so the positional signal does not overwhelm the token signal.
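To make this concrete, here is a minimal sketch of an embedding lookup with the sqrt(d_model) scaling; the vocabulary size, token IDs, and the use of NumPy are illustrative assumptions, not the original model code.

```python
import numpy as np

vocab_size = 10000   # hypothetical vocabulary size
d_model = 512        # model dimension, as in the original Transformer

# In a real model this table is learned; here it is just randomly initialized.
rng = np.random.default_rng(0)
embedding_table = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(vocab_size, d_model))

def embed(token_ids):
    """Look up token embeddings and scale them by sqrt(d_model)."""
    return embedding_table[token_ids] * np.sqrt(d_model)

token_ids = np.array([5, 42, 7])   # a toy 3-token input sequence
print(embed(token_ids).shape)      # (3, 512)
```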
However, positional encoding, particularly sinusoidal positional encoding, is a unique addition in the Transformer model. Unlike traditional sequence transduction models like Seq2Seq, the Transformer processes all words in parallel, which means it doesn't inherently capture the sequential order of the tokens. In Seq2Seq, the arrows from each token to the next implied their order, but since the Transformer lacks this sequential processing, we must explicitly encode positional information. This is where positional encoding comes in: it is added to the embedded vectors to give the model information about the position of each word in the sequence. Let’s now explore how sinusoidal positional encoding works.
Conditions of Positional Encoding
First of all, here are the conditions that a positional encoding method should satisfy to be considered a proper positional encoding:

1. It produces a deterministic and unique encoding for each position.
2. It is constant and independent of the input data (it depends only on the position, not on the sequence content).
3. It scales to sequences of arbitrary length.
4. It keeps the relative distance between positions consistent.
5. It keeps the absolute distance between positions consistent.
Sinusoidal Positional Encoding
Sinusoidal positional encoding is defined as:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where PE(pos, 2i) refers to the positional encoding at a particular position pos in a sequence, specifically for the 2i-th dimension of the encoding.
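As a minimal sketch of this definition (NumPy, with an arbitrary sequence length and d_model; the function name is mine):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]         # index of each (2i, 2i+1) pair
    angle = pos / 10000 ** (2 * i / d_model)     # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # (50, 512)
```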
This method uses the sine function for even indices and the cosine function for odd indices to create the positional encodings. But how did the idea of using trigonometric functions for positional encoding come about?
The use of trigonometric functions like sine and cosine was inspired by the need to encode the position of tokens in a way that was both continuous and periodic, while also being computationally feasible. Trigonometric functions were chosen because of their periodicity and smoothness, which allow for unique, continuous encodings that can repeat over time (helpful for longer sequences) while still providing enough distinctiveness between positions.
Since we've established that trigonometry, specifically sine and cosine functions, is ideal for positional encoding, you might now be wondering why we use both sine and cosine functions rather than just one of them. Let’s explore this.
If we were to use only sine for positional encoding, it would violate condition 1, which requires each positional encoding value to be unique. The reason for this is that the sine function has periodicity, meaning it repeats the same values at regular intervals. For example, sin(x)=sin(x+2π), so if we only used sine, multiple positions in the sequence could end up with the same value, leading to non-unique encodings and failing to satisfy the uniqueness requirement.
To prevent this issue, we scale the position by dividing it by 10000^(2i/d_model), so that each dimension pair oscillates at a different frequency. The lowest-frequency pair has a wavelength of 10000·2π positions, so for sequences of any practical length its values stay well within a single period and never repeat; combined across all dimensions, this guarantees that the sine and cosine values for each position form a unique encoding.
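To see how widely these frequencies are spread, here is a small check of ω and its wavelength for a few dimension-pair indices (d_model = 512 is just an example):

```python
import numpy as np

d_model = 512                                   # example model dimension

for i in [0, 127, 255]:                         # a few dimension-pair indices
    omega = 1.0 / 10000 ** (2 * i / d_model)
    wavelength = 2 * np.pi / omega              # positions per full period
    print(f"i={i:3d}  omega={omega:.6g}  wavelength ~ {wavelength:,.0f} positions")

# Approximate output:
# i=  0  omega ~ 1.0      wavelength ~ 6 positions
# i=127  omega ~ 0.01     wavelength ~ 600 positions
# i=255  omega ~ 0.0001   wavelength ~ 60,000 positions
```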
However, using both sine and cosine functions is necessary. Why?
In translation, knowing the relative positions of words in a sequence is crucial. For example, in the sentence "I am going to eat," the position of the word "eat" relative to "am going to" significantly affects the meaning of the sentence. Therefore, positional encoding must capture not just the absolute positions of words, but also the relative positional differences between them. To achieve this, we need a system that can model these shifts effectively, and this is where sinusoidal positional encoding comes into play.
Rotation Matrix
To understand how sinusoidal positional encoding captures shifts, first consider the 2D rotation matrix:

R(φ) = [ cos(φ)  -sin(φ) ]
       [ sin(φ)   cos(φ) ]

This matrix performs a linear transformation that rotates a vector in 2D space. When applied to a vector [cos(θ), sin(θ)], it shifts the vector by an angle φ:

R(φ) · [ cos(θ) ]  =  [ cos(θ)·cos(φ) - sin(θ)·sin(φ) ]  =  [ cos(θ + φ) ]
       [ sin(θ) ]     [ sin(θ)·cos(φ) + cos(θ)·sin(φ) ]     [ sin(θ + φ) ]
This operation of shifting by φ is analogous to the idea in positional encoding, where we need to "translate" the position of a word by a relative positional difference. Just as the rotation matrix shifts the vector by a certain amount, sinusoidal positional encoding shifts the positional encoding in a way that reflects the relative difference between positions in the sequence.
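A quick numerical check of this rotation property (the angles are arbitrary):

```python
import numpy as np

def rotation_matrix(phi):
    """2D rotation matrix R(phi)."""
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

theta, phi = 0.7, 0.3                            # arbitrary angles, in radians
v = np.array([np.cos(theta), np.sin(theta)])

rotated = rotation_matrix(phi) @ v
expected = np.array([np.cos(theta + phi), np.sin(theta + phi)])
print(np.allclose(rotated, expected))            # True: rotating shifts the angle by phi
```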
Translation Property in Positional Encoding
The goal of positional encoding is to allow the model to understand relative positional shifts in a sequence. This means that the positional encoding of a word must capture how far it is from other words, not just its absolute position. Mathematically, we express this idea as:

PE_(pos+Δstep) = T(Δstep) · PE_pos

Where T(Δstep) is the translation function that applies a shift of Δstep positions, and depends only on Δstep, not on pos.
Now, let's see why sinusoidal positional encoding works well for the conditions of positional encoding.
1. Definition of Positional Encoding
Refer to the PE(pos, 2i) definition above. For the proof we'll do next, I write one (2i, 2i+1) pair of PE_pos as a 2D vector:

PE_pos = [ sin(ω·pos) ]
         [ cos(ω·pos) ]

where ω = 1/10000^(2i/d_model).
2. Translating Positional Encoding by Δstep
To account for the difference in positions, we apply a shift to the positional encoding using the properties of trigonometric functions. The positional encoding for position pos + Δstep is given by:

PE_(pos+Δstep) = [ sin(ω·(pos + Δstep)) ]
                 [ cos(ω·(pos + Δstep)) ]

Expanding this using the angle-addition identities:

sin(ω·pos + ω·Δstep) = sin(ω·pos)·cos(ω·Δstep) + cos(ω·pos)·sin(ω·Δstep)
cos(ω·pos + ω·Δstep) = cos(ω·pos)·cos(ω·Δstep) - sin(ω·pos)·sin(ω·Δstep)

This can be rewritten in matrix form:

PE_(pos+Δstep) = [  cos(ω·Δstep)  sin(ω·Δstep) ] · [ sin(ω·pos) ]
                 [ -sin(ω·Δstep)  cos(ω·Δstep) ]   [ cos(ω·pos) ]
3. Using the Rotation Matrix
The transformation matrix T(Δstep), which encodes the translation, is:

T(Δstep) = [  cos(ω·Δstep)  sin(ω·Δstep) ]
           [ -sin(ω·Δstep)  cos(ω·Δstep) ]

Therefore, the translation of the positional encoding can be written as:

PE_(pos+Δstep) = T(Δstep) · PE_pos
T(Δstep) may look different from the rotation matrix, but the difference comes only from how we wrote PE_pos. Because its components are ordered [sin(ω·pos), cos(ω·pos)] rather than [cos, sin], the signs land in different places; if we instead write the positional encoding as a row vector, i.e. PE_pos = [sin(ω·pos), cos(ω·pos)], and multiply it on the right by the rotation matrix, we get exactly the same result. Functionally, T(Δstep) is still a rotation by the angle ω·Δstep.
So, what’s important here is that the translation property of PE_pos holds, and this choice of convention doesn't affect the property.
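Here is a small numerical check of this translation property for a single (sin, cos) pair; the frequency, position, and shift are arbitrary choices:

```python
import numpy as np

omega = 1.0 / 10000 ** (2 * 3 / 512)   # frequency for pair i=3 with d_model=512 (example)
pos, delta = 17, 5                     # arbitrary position and shift

def pe_pair(p):
    """(sin, cos) pair of the positional encoding at position p for this frequency."""
    return np.array([np.sin(omega * p), np.cos(omega * p)])

def T(d):
    """Translation matrix T(Δstep) for this frequency."""
    return np.array([[ np.cos(omega * d), np.sin(omega * d)],
                     [-np.sin(omega * d), np.cos(omega * d)]])

print(np.allclose(pe_pair(pos + delta), T(delta) @ pe_pair(pos)))   # True
```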
Now that we know the idea behind sinusoidal positional encoding, let's examine whether this approach meets the conditions of positional encoding.
Does Sinusoidal Positional Encoding Meet the Conditions?
1. Deterministic and Unique Encoding for Each Position
The sinusoidal positional encoding is deterministic and unique for each position because it uses the sine function for even indices and the cosine function for odd indices, with the position pos scaled by the frequency ω. For the low-frequency dimensions, one period of sin(ω·pos) and cos(ω·pos) covers far more positions than any realistic sequence, so these values do not repeat, and the combination of values across all dimensions gives every position a distinct vector. As a result, each position has a unique encoding that can always be computed consistently, ensuring that no two positions share the same representation.
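A quick empirical check of uniqueness, using an arbitrarily chosen sequence length and d_model (the column ordering here groups sines and cosines rather than interleaving them, which does not affect the check):

```python
import numpy as np

seq_len, d_model = 5000, 512
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / 10000 ** (2 * i / d_model)
pe = np.concatenate([np.sin(angle), np.cos(angle)], axis=1)   # (5000, 512)

# Every position should map to a distinct encoding vector.
print(np.unique(pe, axis=0).shape[0] == seq_len)              # True
```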
2. Constant and Independent of Input Data
The positional encoding PE_pos depends on the position index pos, the model dimension d_model, and the dimension index i. These factors are related to the position of the token in the sequence, not the specific sequence content. Therefore, PE_pos remains constant for a given position, regardless of the input sequence. The values of pos, d_model, and i are fixed, ensuring that the encoding is independent of the sequence itself and only reflects the position of the token.
3. Scalability for Sequences of Arbitrary Length
The sinusoidal encoding is scalable because the encoding for any position pos is computed directly from the formula involving pos and the fixed frequencies ω. As the position increases, the encoding simply continues to follow the sinusoidal pattern, so the approach can handle sequences of arbitrary length. Because the frequencies span wavelengths from 2π up to 10000·2π, the same fixed set of frequencies covers both short and long sequences without requiring a fixed length or structure.
4. Consistent Relative Distance Between Positions
The translation property of sinusoidal encoding ensures that the relative distance between positions is preserved. When a shift Δstep occurs, the encoding for any two positions will maintain the same relative difference, no matter where those positions occur in the sequence. This guarantees that the model can always recognize relative shifts consistently.
5. Consistent Absolute Distance Between Positions
The absolute distance between positions remains consistent for the same reasons mentioned in point 2. Since PE_pos depends solely on pos, d_model, and i, the encoding for each position is determined by these fixed variables, and not by the content of the sequence. Therefore, the difference in encoding between any two positions will always be the same, ensuring that the absolute distance between positions is consistent across different sequences.
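As a small illustration of conditions 4 and 5, the relationship between two encodings depends only on their offset, not on where they sit in the sequence; here is a sketch using the dot product as the comparison (positions and offset are arbitrary):

```python
import numpy as np

d_model, offset = 512, 7
i = np.arange(d_model // 2)[None, :]

def pe(p):
    """Positional encoding at position p (sines then cosines; order doesn't affect the dot product)."""
    angle = p / 10000 ** (2 * i / d_model)
    return np.concatenate([np.sin(angle), np.cos(angle)], axis=1).ravel()

# The dot product between encodings separated by `offset` is the same
# no matter where the pair starts in the sequence.
print(np.isclose(pe(10) @ pe(10 + offset), pe(200) @ pe(200 + offset)))   # True
```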
Special Property of Sinusoidal Positional Encoding: Complementary Phase Shifts
While it's not mandatory to know, an interesting aspect of sinusoidal positional encoding is its complementary phase shift between the sine and cosine functions.
Sine and cosine are phase-shifted versions of each other, meaning they start at different points in their respective cycles. This phase difference captures different aspects of the positional relationship. Specifically, the low-frequency components of sine and cosine change more gradually, capturing larger-scale positional differences between tokens that are far apart in the sequence. On the other hand, the high-frequency components change rapidly, capturing fine-grained positional differences between closely spaced tokens.
To illustrate this, the heatmap below visualizes the positional encodings for a sequence of tokens.
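To reproduce such a heatmap yourself, here is a minimal matplotlib sketch (sequence length and d_model are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

seq_len, d_model = 100, 128
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / 10000 ** (2 * i / d_model)

pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

plt.figure(figsize=(10, 4))
plt.imshow(pe, cmap="RdBu", aspect="auto")
plt.xlabel("Encoding dimension")
plt.ylabel("Position in sequence")
plt.colorbar(label="PE value")
plt.title("Sinusoidal positional encodings")
plt.show()
```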
In the Transformer model, the embedding vectors and positional encodings are summed element-wise, separately for the encoder and the decoder. The input token embeddings are combined with the positional encodings on the encoder side, and the same happens for the target token embeddings on the decoder side. These summed vectors are then passed into their respective encoder and decoder layers.
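Putting the pieces together, here is a minimal sketch of what gets fed into the first encoder layer; the vocabulary size, token IDs, and function name are illustrative assumptions:

```python
import numpy as np

vocab_size, d_model, max_len = 10000, 512, 5000
rng = np.random.default_rng(0)
embedding_table = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(vocab_size, d_model))

# Precompute the sinusoidal positional encodings once, up to max_len positions.
pos = np.arange(max_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / 10000 ** (2 * i / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

def encoder_input(token_ids):
    """Scaled token embeddings plus the positional encodings for their positions."""
    emb = embedding_table[token_ids] * np.sqrt(d_model)   # (seq_len, d_model)
    return emb + pe[:len(token_ids)]                      # element-wise sum

x = encoder_input(np.array([12, 873, 4, 55]))
print(x.shape)   # (4, 512)
```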
Conclusion
Sinusoidal positional encoding is ideal for transformers because it efficiently represents positional information while addressing the unique needs of transformer models. Since transformers process tokens in parallel rather than sequentially, it's crucial to incorporate positional encoding that is deterministic, unique, and independent of input data, making it scalable to sequences of any length. The complementary phase shifts between sine and cosine functions allow the encoding to capture both long-range and short-range positional relationships, helping the model distinguish fine-grained and larger-scale dependencies. This approach is computationally efficient, requires no learnable parameters, and can be easily applied to variable-length sequences, making it perfectly suited for transformers' parallelized architecture.