Improving Transformer Architecture with Rotary Positional Embeddings

Introduction

The Transformer architecture has been the bedrock of natural language processing since the groundbreaking “Attention Is All You Need” paper in 2017, yet its core design has seen relatively little change since then. A notable improvement arrived in 2021 with Rotary Positional Embeddings, or “Rotary PE” (RoPE), introduced in the RoFormer paper and since adopted by many large language models.

In this post, we’ll delve into the fascinating world of Rotary PE, exploring how it combines the strengths of both absolute and relative positional embeddings.

The Need for Positional Embeddings

Self-attention, by itself, has no notion of token order: it treats a sentence as an unordered set of tokens, so two sentences that use the same words in different orders would produce the same set of representations. To preserve order information, positional embeddings are essential.
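To make this concrete, here is a minimal NumPy sketch (my own illustration, using toy identity projections rather than learned ones) showing that plain self-attention is blind to word order: shuffling the input tokens simply shuffles the outputs the same way.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections (toy example)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])                           # query-key dot products
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
    return weights @ x                                                # weighted sum of values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))        # 5 tokens, 8-dim embeddings
perm = rng.permutation(5)               # a different word order

out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# The outputs are just permuted copies of each other: without positional
# information, the model cannot tell the two word orders apart.
print(np.allclose(out[perm], out_shuffled))   # True
```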

Absolute Positional Embeddings

The conventional approach is absolute positional embeddings, where a vector of the same dimension as word embeddings represents each position in a sentence. These vectors are either learned or derived from sinusoidal functions.

However, they have limitations: learned position vectors impose a fixed maximum sequence length, and absolute positions say nothing directly about how far apart two tokens are from each other.
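For reference, here is a minimal NumPy sketch of the sinusoidal variant from the original Transformer paper: each position gets a fixed vector built from sines and cosines at geometrically spaced frequencies, which is then added to the word embeddings.

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, dim):
    """One fixed dim-sized vector per position (dim assumed even)."""
    positions = np.arange(max_len)[:, None]               # (max_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, dim, 2) / dim)   # (dim/2,) frequencies
    angles = positions * freqs                            # (max_len, dim/2)
    pe = np.zeros((max_len, dim))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions: cosine
    return pe

# Added to the word embeddings before the first attention layer:
word_embeddings = np.random.default_rng(0).normal(size=(12, 64))
inputs = word_embeddings + sinusoidal_positional_embeddings(12, 64)
```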

Relative Positional Embeddings

An alternative approach is relative positional embeddings, which encode the positional relationship between pairs of tokens rather than each token’s absolute position. For instance, the T5 model adds learned biases to the attention scores based on the relative distance between each query and key. This method preserves relative positions but presents engineering challenges, especially for longer sequences, since the number of token pairs (and hence biases) grows quadratically with sequence length.
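To illustrate the idea in simplified form (this sketch uses a plain clipped-distance lookup rather than T5’s exact bucketing scheme, and the names are my own), a relative-position bias is just a learned scalar per relative distance that gets added to the matrix of attention scores:

```python
import numpy as np

def relative_position_bias(seq_len, max_distance, bias_table):
    """Look up a bias for every (query, key) pair of positions.

    bias_table holds one learned scalar per clipped relative distance in
    [-max_distance, max_distance]; T5 itself uses a fancier log-spaced bucketing.
    """
    q_pos = np.arange(seq_len)[:, None]
    k_pos = np.arange(seq_len)[None, :]
    rel = np.clip(k_pos - q_pos, -max_distance, max_distance)
    return bias_table[rel + max_distance]          # (seq_len, seq_len) bias matrix

seq_len, max_distance = 6, 4
bias_table = np.random.default_rng(0).normal(size=2 * max_distance + 1)
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
scores = scores + relative_position_bias(seq_len, max_distance, bias_table)
# One bias per pair of positions: the extra bookkeeping grows quadratically
# with sequence length.
```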

Enter Rotary Positional Embeddings

Rotary PE offers an innovative solution. Instead of adding positional vectors to the word embeddings, it rotates the query and key vectors by an amount proportional to each token’s position: a word near the start of the sentence is rotated by a small angle, while words at later positions undergo proportionally greater rotations (a token at position m is rotated by m·θ). Because the attention score between two rotated vectors then depends only on the difference of their angles, this approach combines the advantages of absolute and relative embeddings.
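Concretely (a sketch of the standard RoPE identity, with R(·) denoting a rotation matrix), the query at position m and the key at position n are rotated before the dot product, and the rotations cancel down to the relative offset n − m:

```latex
\langle R(m\theta)\,q,\; R(n\theta)\,k \rangle
  = q^{\top} R(m\theta)^{\top} R(n\theta)\, k
  = q^{\top} R\big((n - m)\theta\big)\, k
```

The absolute positions m and n enter only through the rotation angles, yet the resulting score depends only on their difference, which is exactly the “absolute in form, relative in effect” behaviour described above.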

Matrix Formulation

The rotation is mathematically represented as a matrix multiplication.

Two-dimensional example
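Following the standard RoPE formulation, the rotation for a token vector x = (x₁, x₂) at position m in the two-dimensional case is:

```latex
f(x, m) =
\begin{pmatrix}
\cos m\theta & -\sin m\theta \\
\sin m\theta & \cos m\theta
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
```

In higher dimensions, the vector is split into d/2 such pairs and each pair is rotated by its own frequency θᵢ, giving a block-diagonal rotation matrix.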

In practice, however, that rotation matrix is sparse and block-diagonal, so it is far more efficient to apply the rotation with elementwise vector operations (multiplications and additions) than with an explicit matrix multiplication.
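Here is a minimal NumPy sketch of that elementwise form (function and variable names are my own): each pair of dimensions is rotated using only multiplications and additions with precomputed sines and cosines.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even.

    Dimension pair (2i, 2i+1) at position p is rotated by p * base**(-2i/dim),
    using only elementwise multiplies and adds.
    """
    seq_len, dim = x.shape
    freqs = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,) rotation frequencies
    angles = np.arange(seq_len)[:, None] * freqs        # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]              # the two halves of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin           # a 2D rotation per pair,
    out[:, 1::2] = x_even * sin + x_odd * cos           # expressed with vector ops
    return out

q = np.random.default_rng(0).normal(size=(16, 64))      # 16 positions, 64-dim queries
q_rot = apply_rope(q)                                   # keys get the same treatment
```

In a real model this is applied to the query and key projections inside each attention head, with the cosine and sine tables computed once and reused.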

Key Properties

One key property of Rotary PE is long-term decay: all else being equal, the dot product between a query and a key tends to be larger when the two tokens are close together and smaller when they are far apart. This aligns with the intuitive idea that closely related, nearby words should interact more strongly.
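As a rough, self-contained check of this decay (a toy illustration, not a result from the post): for a query and key whose dimension pairs are identical unit vectors, the rotary dot product reduces to a sum of cosines of (distance × frequency), which peaks at distance 0 and, on the whole, shrinks as the distance grows.

```python
import numpy as np

dim, base = 64, 10000.0
freqs = base ** (-np.arange(0, dim, 2) / dim)   # the same per-pair frequencies as above

def score_at_distance(delta):
    """Rotary dot product between matching unit pairs placed `delta` positions apart."""
    return float(np.sum(np.cos(delta * freqs)))

for delta in [0, 1, 2, 4, 8, 16, 32, 64]:
    print(delta, round(score_at_distance(delta), 2))
# Largest at delta = 0, then broadly decreasing (with some oscillation),
# so nearby tokens get a head start in the attention scores.
```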

Experiments and Conclusion

Experimental results show that models using Rotary PE train faster than those using sinusoidal embeddings. Researchers have replicated these findings across various model architectures and training setups, indicating the robustness of Rotary PE.

In conclusion, Rotary Positional Embeddings are a game-changer in the world of language models, offering a powerful way to preserve order information efficiently. This innovation bridges the gap between absolute and relative positional embeddings, promising better performance and more robust models in natural language processing tasks. As we continue to explore this exciting development, the future of language models appears even more promising.
