Improving Transformer Architecture with Rotary Positional Embeddings
Introduction
The Transformer architecture has been the bedrock of natural language processing since the groundbreaking “Attention Is All You Need” paper in 2017. Since then, however, the core architecture has seen relatively little change. In 2021, a game-changing improvement emerged in the form of Rotary Positional Embeddings, or “Rotary PE” (often abbreviated RoPE), introduced in the RoFormer paper.
In this post, we’ll delve into the fascinating world of Rotary PE, exploring how it combines the strengths of both absolute and relative positional embeddings.
The Need for Positional Embeddings
Transformers, by default, are order-invariant, treating sentences as unordered sets of tokens. This causes issues, as sentences with different word orders have identical representations. To preserve order information, positional embeddings are essential.
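To make this concrete, here is a minimal NumPy sketch (using identity query/key/value projections, an assumption made purely for brevity) showing that permuting the input tokens merely permutes the attention outputs, so without positional information the model cannot tell the orderings apart:

```python
import numpy as np

def attention(x):
    # Scaled dot-product self-attention with identity projections,
    # kept deliberately simple for this illustration.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 "tokens" with embedding dimension 8
perm = np.array([2, 0, 3, 1])    # the same tokens in a different order

out = attention(x)
out_permuted = attention(x[perm])

# Reordering the input only reorders the output rows: no order information.
print(np.allclose(out[perm], out_permuted))  # True
```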
Absolute Positional Embeddings
The conventional approach is absolute positional embeddings, where a vector of the same dimension as word embeddings represents each position in a sentence. These vectors are either learned or derived from sinusoidal functions.
However, they have limitations: learned embeddings are tied to a fixed maximum sequence length, and models trained with absolute positions often generalize poorly to sequences longer than those seen during training.
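For reference, here is a short sketch of the sinusoidal variant from “Attention Is All You Need”; the max_len and d_model values below are illustrative choices, not parameters from this post:

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len: int, d_model: int) -> np.ndarray:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# These vectors are added to the word embeddings, one per position.
print(sinusoidal_positional_embeddings(max_len=512, d_model=64).shape)  # (512, 64)
```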
Relative Positional Embeddings
An alternative approach is relative positional embeddings, which represent the positional relationship between pairs of tokens. For instance, the T5 model adds learned biases to the attention scores based on the relative distance between tokens. This method preserves relative positions but presents engineering challenges, especially for longer sequences: a bias is needed for every query-key pair, which scales quadratically with sequence length and complicates optimizations such as key-value caching.
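As a rough illustration of the idea (not T5’s exact implementation, which additionally buckets distances logarithmically), a learned scalar bias indexed by the clipped relative distance can be added to every attention logit:

```python
import numpy as np

max_distance = 8
# One learnable bias per relative distance in [-max_distance, max_distance];
# random values stand in for learned parameters here.
bias_table = np.random.default_rng(0).normal(size=2 * max_distance + 1)

def relative_bias(seq_len: int) -> np.ndarray:
    positions = np.arange(seq_len)
    relative = positions[None, :] - positions[:, None]       # key pos - query pos
    relative = np.clip(relative, -max_distance, max_distance)
    return bias_table[relative + max_distance]                # (seq_len, seq_len)

# The bias matrix is added to the attention logits before the softmax:
#   scores = q @ k.T / sqrt(d) + relative_bias(seq_len)
print(relative_bias(5).shape)  # (5, 5)
```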
Enter Rotary Positional Embeddings
Rotary PE offers an innovative solution. Instead of adding positional vectors to word embeddings, it rotates the query and key vectors: a token at the second position is rotated by an angle θ, one at the third position by 2θ, and so on, so that later positions undergo proportionally greater rotations. This approach combines the advantages of absolute and relative embeddings.
Matrix Formulation
The rotation is mathematically represented as a matrix multiplication: each position is assigned a block-diagonal rotation matrix that is applied to the query and key vectors.
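As a sketch of that formulation (following the RoFormer paper), each 2x2 block of the matrix rotates one pair of dimensions, with the position m scaling the rotation angle and each pair using its own frequency:

```latex
% Rotation of one pair of dimensions (x_1, x_2) at position m.
% The full RoPE matrix is block-diagonal, with one such block per pair
% of dimensions, each pair using its own frequency theta_i.
\begin{pmatrix} x_1' \\ x_2' \end{pmatrix}
=
\begin{pmatrix}
\cos m\theta & -\sin m\theta \\
\sin m\theta & \phantom{-}\cos m\theta
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
```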
However, because this matrix is extremely sparse, in practice it is much more efficient to implement the rotation with elementwise vector operations, such as multiplications and additions, which produce the same result with far less computation, as the sketch below illustrates.
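Here is a minimal NumPy sketch of that elementwise form; the adjacent-pair dimension layout and the base of 10000 follow the RoFormer paper, though exact pairing conventions differ between implementations (some libraries rotate the two halves of the vector instead):

```python
import numpy as np

def rotary_embed(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary positional embeddings to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, 0::2], x[:, 1::2]        # paired dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin     # the 2x2 rotations, expressed as
    out[:, 1::2] = x1 * sin + x2 * cos     # elementwise multiplies and adds
    return out

# Queries and keys are rotated before computing attention scores.
q = np.random.default_rng(0).normal(size=(6, 8))
print(rotary_embed(q).shape)  # (6, 8)
```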
Key Properties
One key property of Rotary PE is that the dot product between a rotated query and key depends only on the relative distance between their positions, not on where they sit in the sequence. In addition, this dot product tends to decay as the relative distance grows, which aligns with the intuitive idea that closely related words should have stronger interactions than distant ones.
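A quick numeric check of the relative-position behaviour, using a single 2D pair for simplicity: rotating a query by m*theta and a key by n*theta leaves their dot product dependent only on the offset m - n, not on the absolute positions.

```python
import numpy as np

def rotate2d(v: np.ndarray, angle: float) -> np.ndarray:
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Same offset (3 positions apart) at two different absolute positions.
near = rotate2d(q, 5 * theta) @ rotate2d(k, 2 * theta)
far = rotate2d(q, 40 * theta) @ rotate2d(k, 37 * theta)
print(np.isclose(near, far))  # True: only the relative distance matters
```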
Experiments and Conclusion
Experimental results show that models using Rotary PE converge faster, reaching lower loss in fewer training steps than comparable models using sinusoidal embeddings. Researchers have replicated these findings across various model architectures and training setups, indicating the robustness of Rotary PE.
In conclusion, Rotary Positional Embeddings are a game-changer in the world of language models, offering a powerful way to preserve order information efficiently. This innovation bridges the gap between absolute and relative positional embeddings, promising better performance and more robust models in natural language processing tasks. As we continue to explore this exciting development, the future of language models appears even more promising.