Why is Transformer Preferred Over RNN? - Transformer Part 1: Embedding & Positional Encoding
We have now reached the main focus of this lecture: the Transformer model. The Transformer is a Sequence-to-Sequence (Seq2Seq) architecture that significantly improves upon traditional models like RNNs, primarily through the use of multi-head attention and self-attention mechanisms. These innovations allow it to better capture long-range dependencies and process sequences in parallel.
The architecture of the Transformer consists of several key components, and we will cover them one topic at a time.
Today, we focus on the first chapter: Embedding & Positional Encoding.
Embedding Vs. Encoding
In the Transformer model, embedding is used to convert input tokens and output tokens into vectors of a fixed dimension d_model (the model dimension, as in Seq2Seq), which allows the model to distinguish between words. This step is similar to other sequence transduction models like Seq2Seq. The embedding outputs are scaled by multiplying them by sqrt(d_model), which helps keep training stable: it puts the embeddings on a scale comparable to the positional encodings that are added to them, so the positional signal does not overwhelm the token signal.
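To make this concrete, here is a minimal sketch of an embedding lookup with the sqrt(d_model) scaling; the vocabulary size, token IDs, and the use of NumPy are illustrative assumptions, not the original model code.

```python
import numpy as np

vocab_size = 10000   # hypothetical vocabulary size
d_model = 512        # model dimension, as in the original Transformer

# In a real model this table is learned; here it is just randomly initialized.
rng = np.random.default_rng(0)
embedding_table = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(vocab_size, d_model))

def embed(token_ids):
    """Look up token embeddings and scale them by sqrt(d_model)."""
    return embedding_table[token_ids] * np.sqrt(d_model)

token_ids = np.array([5, 42, 7])   # a toy 3-token input sequence
print(embed(token_ids).shape)      # (3, 512)
```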
However, positional encoding, particularly sinusoidal positional encoding, is a unique addition in the Transformer model. Unlike traditional sequence transduction models like Seq2Seq, the Transformer processes all words in parallel, which means it doesn't inherently capture the sequential order of the tokens. In Seq2Seq, the arrows from each token to the next implied their order, but since the Transformer lacks this sequential processing, we must explicitly encode positional information. This is where positional encoding comes in: it is added to the embedded vectors to give the model information about the position of each word in the sequence. Let’s now explore how sinusoidal positional encoding works.
Conditions of Positional Encoding
First of all, here are the conditions that a positional encoding method should satisfy to be considered a proper positional encoding:

1. It produces a deterministic and unique encoding for each position.
2. It is constant and independent of the input data (it depends only on the position, not on the sequence content).
3. It scales to sequences of arbitrary length.
4. It keeps the relative distance between positions consistent.
5. It keeps the absolute distance between positions consistent.
Sinusoidal Positional Encoding
Sinusoidal positional encoding is defined as:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where PE(pos, 2i) refers to the positional encoding at a particular position pos in a sequence, specifically for the 2i-th dimension of the encoding.
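As a minimal sketch of this definition (NumPy, with an arbitrary sequence length and d_model; the function name is mine):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]         # index of each (2i, 2i+1) pair
    angle = pos / 10000 ** (2 * i / d_model)     # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # (50, 512)
```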
This method uses the sine function for even indices and the cosine function for odd indices to create the positional encodings. But how did the idea of using trigonometric functions for positional encoding come about?
The use of trigonometric functions like sine and cosine was inspired by the need to encode the position of tokens in a way that was both continuous and periodic, while also being computationally feasible. Trigonometric functions were chosen because of their periodicity and smoothness, which allow for unique, continuous encodings that can repeat over time (helpful for longer sequences) while still providing enough distinctiveness between positions.
Since we've established that trigonometry, specifically sine and cosine functions, is ideal for positional encoding, you might now be wondering why we use both sine and cosine functions rather than just one of them. Let’s explore this.
If we were to use only sine for positional encoding, it would violate condition 1, which requires each positional encoding value to be unique. The reason for this is that the sine function has periodicity, meaning it repeats the same values at regular intervals. For example, sin(x)=sin(x+2π), so if we only used sine, multiple positions in the sequence could end up with the same value, leading to non-unique encodings and failing to satisfy the uniqueness requirement.
To prevent this issue, we scale the position by dividing it by 10000^(2i/d_model), so that each dimension pair oscillates at a different frequency. The lowest-frequency pair has a wavelength of 10000·2π positions, so for sequences of any practical length its values stay well within a single period and never repeat; combined across all dimensions, this guarantees that the sine and cosine values for each position form a unique encoding.
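To see how widely these frequencies are spread, here is a small check of ω and its wavelength for a few dimension-pair indices (d_model = 512 is just an example):

```python
import numpy as np

d_model = 512                                   # example model dimension

for i in [0, 127, 255]:                         # a few dimension-pair indices
    omega = 1.0 / 10000 ** (2 * i / d_model)
    wavelength = 2 * np.pi / omega              # positions per full period
    print(f"i={i:3d}  omega={omega:.6g}  wavelength ~ {wavelength:,.0f} positions")

# Approximate output:
# i=  0  omega ~ 1.0      wavelength ~ 6 positions
# i=127  omega ~ 0.01     wavelength ~ 600 positions
# i=255  omega ~ 0.0001   wavelength ~ 60,000 positions
```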
However, using both sine and cosine functions is necessary. Why?
In translation, knowing the relative positions of words in a sequence is crucial. For example, in the sentence "I am going to eat," the position of the word "eat" relative to "am going to" significantly affects the meaning of the sentence. Therefore, positional encoding must capture not just the absolute positions of words, but also the relative positional differences between them. To achieve this, we need a system that can model these shifts effectively, and this is where sinusoidal positional encoding comes into play.
Rotation Matrix
To understand how sinusoidal positional encoding captures shifts, first consider the 2D rotation matrix:

R(φ) = [ cos(φ)  -sin(φ) ]
       [ sin(φ)   cos(φ) ]

This matrix performs a linear transformation that rotates a vector in 2D space. When applied to a vector [cos(θ), sin(θ)], it shifts the vector by an angle φ:

R(φ) · [ cos(θ) ]  =  [ cos(θ)·cos(φ) - sin(θ)·sin(φ) ]  =  [ cos(θ + φ) ]
       [ sin(θ) ]     [ sin(θ)·cos(φ) + cos(θ)·sin(φ) ]     [ sin(θ + φ) ]
This operation of shifting by φ is analogous to the idea in positional encoding, where we need to "translate" the position of a word by a relative positional difference. Just as the rotation matrix shifts the vector by a certain amount, sinusoidal positional encoding shifts the positional encoding in a way that reflects the relative difference between positions in the sequence.
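A quick numerical check of this rotation property (the angles are arbitrary):

```python
import numpy as np

def rotation_matrix(phi):
    """2D rotation matrix R(phi)."""
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

theta, phi = 0.7, 0.3                            # arbitrary angles, in radians
v = np.array([np.cos(theta), np.sin(theta)])

rotated = rotation_matrix(phi) @ v
expected = np.array([np.cos(theta + phi), np.sin(theta + phi)])
print(np.allclose(rotated, expected))            # True: rotating shifts the angle by phi
```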
Translation Property in Positional Encoding
The goal of positional encoding is to allow the model to understand relative positional shifts in a sequence. This means that the positional encoding of a word must capture how far it is from other words, not just its absolute position. Mathematically, we express this idea as:

PE_(pos+Δstep) = T(Δstep) · PE_pos

Where T(Δstep) is the translation function that applies a shift of Δstep positions, and depends only on Δstep, not on pos.
Now, let's see why sinusoidal positional encoding works well for the conditions of positional encoding.
1. Definition of Positional Encoding
Refer to the PE(pos, 2i) definition above. For the proof we'll do next, I write one (2i, 2i+1) pair of PE_pos as a 2D vector:

PE_pos = [ sin(ω·pos) ]
         [ cos(ω·pos) ]

where ω = 1/10000^(2i/d_model).
2. Translating Positional Encoding by Δstep
To account for the difference in positions, we apply a shift to the positional encoding using the properties of trigonometric functions. The positional encoding for position pos + Δstep is given by:

PE_(pos+Δstep) = [ sin(ω·(pos + Δstep)) ]
                 [ cos(ω·(pos + Δstep)) ]

Expanding this using the angle-addition identities:

sin(ω·pos + ω·Δstep) = sin(ω·pos)·cos(ω·Δstep) + cos(ω·pos)·sin(ω·Δstep)
cos(ω·pos + ω·Δstep) = cos(ω·pos)·cos(ω·Δstep) - sin(ω·pos)·sin(ω·Δstep)

This can be rewritten in matrix form:

PE_(pos+Δstep) = [  cos(ω·Δstep)  sin(ω·Δstep) ] · [ sin(ω·pos) ]
                 [ -sin(ω·Δstep)  cos(ω·Δstep) ]   [ cos(ω·pos) ]
3. Using the Rotation Matrix
The transformation matrix T(Δstep), which encodes the translation, is:

T(Δstep) = [  cos(ω·Δstep)  sin(ω·Δstep) ]
           [ -sin(ω·Δstep)  cos(ω·Δstep) ]

Therefore, the translation of the positional encoding can be written as:

PE_(pos+Δstep) = T(Δstep) · PE_pos
T(Δstep) may look different from the rotation matrix, but the difference comes only from how we wrote PE_pos. Because its components are ordered [sin(ω·pos), cos(ω·pos)] rather than [cos, sin], the signs land in different places; if we instead write the positional encoding as a row vector, i.e. PE_pos = [sin(ω·pos), cos(ω·pos)], and multiply it on the right by the rotation matrix, we get exactly the same result. Functionally, T(Δstep) is still a rotation by the angle ω·Δstep.
So, what’s important here is that the translation property of PE_pos holds, and this choice of convention doesn't affect the property.
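Here is a small numerical check of this translation property for a single (sin, cos) pair; the frequency, position, and shift are arbitrary choices:

```python
import numpy as np

omega = 1.0 / 10000 ** (2 * 3 / 512)   # frequency for pair i=3 with d_model=512 (example)
pos, delta = 17, 5                     # arbitrary position and shift

def pe_pair(p):
    """(sin, cos) pair of the positional encoding at position p for this frequency."""
    return np.array([np.sin(omega * p), np.cos(omega * p)])

def T(d):
    """Translation matrix T(Δstep) for this frequency."""
    return np.array([[ np.cos(omega * d), np.sin(omega * d)],
                     [-np.sin(omega * d), np.cos(omega * d)]])

print(np.allclose(pe_pair(pos + delta), T(delta) @ pe_pair(pos)))   # True
```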
Now that we know the idea behind sinusoidal positional encoding, let's examine whether this approach meets the conditions of positional encoding.
Does Sinusoidal Positional Encoding Meet the Conditions?
1. Deterministic and Unique Encoding for Each Position
The sinusoidal positional encoding is deterministic and unique for each position because it uses the sine function for even indices and the cosine function for odd indices, with the position pos scaled by the frequency ω. For the low-frequency dimensions, one period of sin(ω·pos) and cos(ω·pos) covers far more positions than any realistic sequence, so these values do not repeat, and the combination of values across all dimensions gives every position a distinct vector. As a result, each position has a unique encoding that can always be computed consistently, ensuring that no two positions share the same representation.
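A quick empirical check of uniqueness, using an arbitrarily chosen sequence length and d_model (the column ordering here groups sines and cosines rather than interleaving them, which does not affect the check):

```python
import numpy as np

seq_len, d_model = 5000, 512
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / 10000 ** (2 * i / d_model)
pe = np.concatenate([np.sin(angle), np.cos(angle)], axis=1)   # (5000, 512)

# Every position should map to a distinct encoding vector.
print(np.unique(pe, axis=0).shape[0] == seq_len)              # True
```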
2. Constant and Independent of Input Data
The positional encoding PE_pos depends on the position index pos, the model dimension d_model, and the dimension index i. These factors are related to the position of the token in the sequence, not the specific sequence content. Therefore, PE_pos remains constant for a given position, regardless of the input sequence. The values of pos, d_model, and i are fixed, ensuring that the encoding is independent of the sequence itself and only reflects the position of the token.
3. Scalability for Sequences of Arbitrary Length
The sinusoidal encoding is scalable because the encoding for any position pos is computed directly from the formula involving pos and the fixed frequencies ω. As the position increases, the encoding simply continues to follow the sinusoidal pattern, so the approach can handle sequences of arbitrary length. Because the frequencies span wavelengths from 2π up to 10000·2π, the same fixed set of frequencies covers both short and long sequences without requiring a fixed length or structure.
4. Consistent Relative Distance Between Positions
The translation property of sinusoidal encoding ensures that the relative distance between positions is preserved. When a shift Δstep occurs, the encoding for any two positions will maintain the same relative difference, no matter where those positions occur in the sequence. This guarantees that the model can always recognize relative shifts consistently.
5. Consistent Absolute Distance Between Positions
The absolute distance between positions remains consistent for the same reasons mentioned in point 2. Since PE_pos depends solely on pos, d_model, and i, the encoding for each position is determined by these fixed variables, and not by the content of the sequence. Therefore, the difference in encoding between any two positions will always be the same, ensuring that the absolute distance between positions is consistent across different sequences.
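As a small illustration of conditions 4 and 5, the relationship between two encodings depends only on their offset, not on where they sit in the sequence; here is a sketch using the dot product as the comparison (positions and offset are arbitrary):

```python
import numpy as np

d_model, offset = 512, 7
i = np.arange(d_model // 2)[None, :]

def pe(p):
    """Positional encoding at position p (sines then cosines; order doesn't affect the dot product)."""
    angle = p / 10000 ** (2 * i / d_model)
    return np.concatenate([np.sin(angle), np.cos(angle)], axis=1).ravel()

# The dot product between encodings separated by `offset` is the same
# no matter where the pair starts in the sequence.
print(np.isclose(pe(10) @ pe(10 + offset), pe(200) @ pe(200 + offset)))   # True
```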
Special Property of Sinusoidal Positional Encoding: Complementary Phase Shifts
While it's not mandatory to know, an interesting aspect of sinusoidal positional encoding is its complementary phase shift between the sine and cosine functions.
Sine and cosine are phase-shifted versions of each other, meaning they start at different points in their respective cycles. This phase difference captures different aspects of the positional relationship. Specifically, the low-frequency components of sine and cosine change more gradually, capturing larger-scale positional differences between tokens that are far apart in the sequence. On the other hand, the high-frequency components change rapidly, capturing fine-grained positional differences between closely spaced tokens.
To illustrate this, the heatmap below visualizes the positional encodings for a sequence of tokens.
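To reproduce such a heatmap yourself, here is a minimal matplotlib sketch (sequence length and d_model are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

seq_len, d_model = 100, 128
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / 10000 ** (2 * i / d_model)

pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

plt.figure(figsize=(10, 4))
plt.imshow(pe, cmap="RdBu", aspect="auto")
plt.xlabel("Encoding dimension")
plt.ylabel("Position in sequence")
plt.colorbar(label="PE value")
plt.title("Sinusoidal positional encodings")
plt.show()
```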
In the Transformer model, the embedding vectors and positional encodings are summed element-wise, separately for the encoder and the decoder. The input token embeddings are combined with the positional encodings on the encoder side, and the same happens for the target token embeddings on the decoder side. These summed vectors are then passed into their respective encoder and decoder layers.
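Putting the pieces together, here is a minimal sketch of what gets fed into the first encoder layer; the vocabulary size, token IDs, and function name are illustrative assumptions:

```python
import numpy as np

vocab_size, d_model, max_len = 10000, 512, 5000
rng = np.random.default_rng(0)
embedding_table = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(vocab_size, d_model))

# Precompute the sinusoidal positional encodings once, up to max_len positions.
pos = np.arange(max_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / 10000 ** (2 * i / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

def encoder_input(token_ids):
    """Scaled token embeddings plus the positional encodings for their positions."""
    emb = embedding_table[token_ids] * np.sqrt(d_model)   # (seq_len, d_model)
    return emb + pe[:len(token_ids)]                      # element-wise sum

x = encoder_input(np.array([12, 873, 4, 55]))
print(x.shape)   # (4, 512)
```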
Conclusion
Sinusoidal positional encoding is ideal for transformers because it efficiently represents positional information while addressing the unique needs of transformer models. Since transformers process tokens in parallel rather than sequentially, it's crucial to incorporate positional encoding that is deterministic, unique, and independent of input data, making it scalable to sequences of any length. The complementary phase shifts between sine and cosine functions allow the encoding to capture both long-range and short-range positional relationships, helping the model distinguish fine-grained and larger-scale dependencies. This approach is computationally efficient, requires no learnable parameters, and can be easily applied to variable-length sequences, making it perfectly suited for transformers' parallelized architecture.