Improving Transformer Architecture with Rotary Positional Embeddings
Introduction
The Transformer architecture has been the bedrock of natural language processing since the groundbreaking “Attention Is All You Need” paper in 2017. Since then, however, the core architecture has seen relatively little change. In 2021, a game-changing improvement emerged in the form of Rotary Positional Embeddings, or “Rotary PE” (often abbreviated RoPE), introduced in the RoFormer paper.
In this post, we’ll delve into the fascinating world of Rotary PE, exploring how it combines the strengths of both absolute and relative positional embeddings.
The Need for Positional Embeddings
Transformers, by default, are order-invariant, treating sentences as unordered sets of tokens. This causes issues, as sentences with different word orders have identical representations. To preserve order information, positional embeddings are essential.
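To make this concrete, here is a minimal NumPy sketch (using identity query/key/value projections, an assumption made purely for brevity) showing that permuting the input tokens merely permutes the attention outputs, so without positional information the model cannot tell the orderings apart:

```python
import numpy as np

def attention(x):
    # Scaled dot-product self-attention with identity projections,
    # kept deliberately simple for this illustration.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 "tokens" with embedding dimension 8
perm = np.array([2, 0, 3, 1])    # the same tokens in a different order

out = attention(x)
out_permuted = attention(x[perm])

# Reordering the input only reorders the output rows: no order information.
print(np.allclose(out[perm], out_permuted))  # True
```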
Absolute Positional Embeddings
The conventional approach is absolute positional embeddings, where a vector of the same dimension as word embeddings represents each position in a sentence. These vectors are either learned or derived from sinusoidal functions.
However, they have limitations: learned embeddings are tied to a fixed maximum sequence length, and models trained with absolute positions often generalize poorly to sequences longer than those seen during training.
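For reference, here is a short sketch of the sinusoidal variant from “Attention Is All You Need”; the max_len and d_model values below are illustrative choices, not parameters from this post:

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len: int, d_model: int) -> np.ndarray:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# These vectors are added to the word embeddings, one per position.
print(sinusoidal_positional_embeddings(max_len=512, d_model=64).shape)  # (512, 64)
```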
Relative Positional Embeddings
An alternative approach is relative positional embeddings, which represent the positional relationship between pairs of tokens. For instance, the T5 model adds learned biases to the attention scores based on the relative distance between tokens. This method preserves relative positions but presents engineering challenges, especially for longer sequences: a bias is needed for every query-key pair, which scales quadratically with sequence length and complicates optimizations such as key-value caching.
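As a rough illustration of the idea (not T5’s exact implementation, which additionally buckets distances logarithmically), a learned scalar bias indexed by the clipped relative distance can be added to every attention logit:

```python
import numpy as np

max_distance = 8
# One learnable bias per relative distance in [-max_distance, max_distance];
# random values stand in for learned parameters here.
bias_table = np.random.default_rng(0).normal(size=2 * max_distance + 1)

def relative_bias(seq_len: int) -> np.ndarray:
    positions = np.arange(seq_len)
    relative = positions[None, :] - positions[:, None]       # key pos - query pos
    relative = np.clip(relative, -max_distance, max_distance)
    return bias_table[relative + max_distance]                # (seq_len, seq_len)

# The bias matrix is added to the attention logits before the softmax:
#   scores = q @ k.T / sqrt(d) + relative_bias(seq_len)
print(relative_bias(5).shape)  # (5, 5)
```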
Enter Rotary Positional Embeddings
Rotary PE offers an innovative solution. Instead of adding positional vectors to word embeddings, it rotates the query and key vectors: a token at the second position is rotated by an angle θ, one at the third position by 2θ, and so on, so that later positions undergo proportionally greater rotations. This approach combines the advantages of absolute and relative embeddings.
Matrix Formulation
The rotation is mathematically represented as a matrix multiplication: each position is assigned a block-diagonal rotation matrix that is applied to the query and key vectors.
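As a sketch of that formulation (following the RoFormer paper), each 2x2 block of the matrix rotates one pair of dimensions, with the position m scaling the rotation angle and each pair using its own frequency:

```latex
% Rotation of one pair of dimensions (x_1, x_2) at position m.
% The full RoPE matrix is block-diagonal, with one such block per pair
% of dimensions, each pair using its own frequency theta_i.
\begin{pmatrix} x_1' \\ x_2' \end{pmatrix}
=
\begin{pmatrix}
\cos m\theta & -\sin m\theta \\
\sin m\theta & \phantom{-}\cos m\theta
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
```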
However, because this matrix is extremely sparse, in practice it is much more efficient to implement the rotation with elementwise vector operations, such as multiplications and additions, which produce the same result with far less computation, as the sketch below illustrates.
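Here is a minimal NumPy sketch of that elementwise form; the adjacent-pair dimension layout and the base of 10000 follow the RoFormer paper, though exact pairing conventions differ between implementations (some libraries rotate the two halves of the vector instead):

```python
import numpy as np

def rotary_embed(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary positional embeddings to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, 0::2], x[:, 1::2]        # paired dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin     # the 2x2 rotations, expressed as
    out[:, 1::2] = x1 * sin + x2 * cos     # elementwise multiplies and adds
    return out

# Queries and keys are rotated before computing attention scores.
q = np.random.default_rng(0).normal(size=(6, 8))
print(rotary_embed(q).shape)  # (6, 8)
```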
Key Properties
One key property of Rotary PE is that the dot product between a rotated query and key depends only on the relative distance between their positions, not on where they sit in the sequence. In addition, this dot product tends to decay as the relative distance grows, which aligns with the intuitive idea that closely related words should have stronger interactions than distant ones.
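A quick numeric check of the relative-position behaviour, using a single 2D pair for simplicity: rotating a query by m*theta and a key by n*theta leaves their dot product dependent only on the offset m - n, not on the absolute positions.

```python
import numpy as np

def rotate2d(v: np.ndarray, angle: float) -> np.ndarray:
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Same offset (3 positions apart) at two different absolute positions.
near = rotate2d(q, 5 * theta) @ rotate2d(k, 2 * theta)
far = rotate2d(q, 40 * theta) @ rotate2d(k, 37 * theta)
print(np.isclose(near, far))  # True: only the relative distance matters
```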
Experiments and Conclusion
Experimental results show that models using Rotary PE converge faster, reaching lower loss in fewer training steps than comparable models using sinusoidal embeddings. Researchers have replicated these findings across various model architectures and training setups, indicating the robustness of Rotary PE.
In conclusion, Rotary Positional Embeddings are a game-changer in the world of language models, offering a powerful way to preserve order information efficiently. This innovation bridges the gap between absolute and relative positional embeddings, promising better performance and more robust models in natural language processing tasks. As we continue to explore this exciting development, the future of language models appears even more promising.