Engineers Guide to AI - Understanding Positional Encoding

Positional encodings are essential for Transformer models because they allow the model to make use of the order of the words in the input sequence.

They give the model information about the relative positions of the words in the sentence. This is necessary because the self-attention mechanism in the Transformer has no inherent sense of word order, and word order matters in most language tasks.

The intuition behind positional encoding is to provide the Transformer model with a notion of word order, since unlike RNNs or LSTMs, Transformers do not inherently understand sequence order. They process all tokens in parallel, which is great for computational efficiency, but it means they have no built-in way to tell which word comes before another. To rectify this, positional encoding injects information about word order into the model.

For example, the sentence "The cat sat on the mat" has a different meaning than the sentence "The mat sat on the cat."

That's where positional encoding comes in. We add a vector to each word's embedding that represents its position in the sentence. The key is that these vectors are designed in a way that the model can learn to use them to understand the order of the words.

A common choice for these vectors is a fixed sinusoidal pattern: each dimension of the encoding is a sine or cosine wave of a different wavelength, evaluated at the token's position (loosely analogous to writing the position out in binary, where each bit flips at a different rate). For example, for a position 'pos' and dimension 'i', the positional encoding is defined as follows:

# "Pos" stands for position and "i" stands for each 
# dimension in the positional encoding. Every dimension 
# corresponds to a wave (sinusoid). The wavelengths of 
# these waves increase geometrically, which means each one 
# is a certain multiple of the previous one. 
# They start from a wavelength of 2π and go up to a 
# wavelength of 10000 multiplied by 2π.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))        

where 'd_model' is the dimensionality of the embeddings. These functions are fixed rather than learned, and they have the useful property that the encoding of one position can be related to the encoding of nearby positions by a simple transformation, which lets the model capture a wide range of positional relationships.
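
As a quick illustration of that geometric progression of wavelengths, here is a minimal sketch in Python/NumPy (the chosen indices are arbitrary, just for illustration):

import numpy as np

d_model = 512  # embedding size used in the original Transformer paper

# Each pair of dimensions (2i, 2i+1) shares one frequency;
# its wavelength is 2 * pi * 10000^(2i / d_model).
for i in [0, 1, 64, 128, 255]:
    wavelength = 2 * np.pi * 10000 ** (2 * i / d_model)
    print(f"pair {i} (dims {2*i},{2*i+1}): wavelength ~ {wavelength:,.1f}")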




All of this might sound complicated, but it really just amounts to building a big table of numbers. Imagine you're working with an embedding that has 512 dimensions (so d_model = 512 and the dimension index runs from 0 to 511). 'i' here is the index of the value you're calculating: even indices use 'sin' and odd indices use 'cos', alternating back and forth across the vector.

'pos' is the position of a token, running from 0 up to the maximum sequence length (sentence length) we've chosen for this model, minus one. So, what we're essentially doing here is figuring out an angle (measured in radians) for each (position, dimension) pair. If the maximum sequence length is 128 and d_model is 512, we end up with a 2D array of size 128x512.

Sinusoidal functions are mathematical functions that describe a smooth, periodic oscillation. They are based on the sine and cosine functions from trigonometry, which describe the coordinates of a point on a circle as it moves around the circle at a constant speed. Two of the most common sinusoidal functions are: y = sin(x) and y = cos(x)

To make this more concrete, let's consider a very simple example in JavaScript:

function positionalEncoding(position, dimensions) {
  // Build the encoding vector for a single position.
  const posEnc = new Array(dimensions).fill(0).map((_, i) => {
    // Each pair of dimensions (2i, 2i+1) shares one frequency.
    const angle = position / Math.pow(10000, (2 * Math.floor(i / 2)) / dimensions);
    // Even dimensions use sine, odd dimensions use cosine.
    return i % 2 === 0 ? Math.sin(angle) : Math.cos(angle);
  });

  return posEnc;
}

// Get the positional encoding for the first
// position in a sentence with embedding size 16
console.log(positionalEncoding(1, 16));

Or a more efficient, vectorized version in Python using the popular NumPy library:

# Import the numpy library, which provides support for arrays and mathematical functions
import numpy as np


# Define the positional encoding function
def positional_encoding(position, d_model):
    # np.arange(position)[:, np.newaxis] builds a column vector of positions 0..position-1
    # np.arange(d_model)[np.newaxis, :] builds a row vector of dimension indices 0..d_model-1
    # np.power calculates 10000 to the power of (2 * (i // 2) / d_model) for every dimension index i
    # Broadcasting the division gives a 2D array where each row corresponds to a position 'pos'
    # and each column corresponds to a dimension 'i'
    # Each entry is the angle in radians for that (position, dimension) pair
    pos = np.arange(position)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))

    # For even indices, we apply the sin function to the radian angles
    # This converts the angles to values between -1 and 1
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # For odd indices, we apply the cos function
    # Like with sin, this converts the angles to values between -1 and 1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Return the resulting 2D array: the positional encodings for the given
    # number of positions and model dimension
    return angle_rads


# Call the positional encoding function to get the positional encodings for the first 10 positions,
# using a model dimension of 300. The output is a 10x300 2D array, where each row is the positional
# encoding for a position and each column is a different dimension.
pos_encoding = positional_encoding(10, 300)

        

This gives you a 10 x 300 matrix, where each row is the positional encoding for one position in the sentence and can be added to the word embeddings of the tokens at that position.

Once we add these positional encodings to the word embeddings, the transformer model can learn to use this information to understand word order and can better capture the meaning of the sentence.
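
As a minimal sketch of that last step, continuing directly from the NumPy block above (the random vectors simply stand in for real word embeddings):

# A sentence of 10 tokens with embedding size 300; random vectors
# stand in for real word embeddings here.
word_embeddings = np.random.randn(10, 300)

# The encoder input is simply the element-wise sum of the two matrices.
encoder_input = word_embeddings + positional_encoding(10, 300)

print(encoder_input.shape)  # (10, 300)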

Why use sine and cosine functions?

The authors of the original Transformer paper chose this function for positional encoding based on a hypothesis: they believed it would help the model learn to attend to relative positions more easily.

Let's consider the concept of a "fixed offset" - this is just a constant value that you might add to the position (pos) of a token. We denote this fixed offset as 'k'.

So, when you add this fixed offset to a position, you get a new position (pos + k). According to the hypothesis, the positional encoding at this new position, PE(pos+k), can be expressed as a straightforward transformation (a linear function) of the positional encoding at the original position, PE(pos).

In other words, a model using this positional encoding could learn to shift its attention from one position to another just by applying a simple, predictable transformation to the positional encodings. This makes it easier for the model to understand and utilize the concept of relative position.
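
Here is a small numerical check of that claim (my own sketch, not code from the paper): for a single sine/cosine pair, the encoding at position pos + k is exactly a 2x2 rotation of the encoding at position pos, and the rotation matrix depends only on the offset k, not on pos.

import numpy as np

d_model, i = 512, 3                      # pick one (sin, cos) dimension pair
freq = 1 / 10000 ** (2 * i / d_model)    # its shared frequency

def pe_pair(pos):
    # The (2i, 2i+1) slice of the positional encoding at position pos.
    return np.array([np.sin(pos * freq), np.cos(pos * freq)])

pos, k = 42, 7
# A rotation matrix that depends only on the offset k...
rotation = np.array([[np.cos(k * freq),  np.sin(k * freq)],
                     [-np.sin(k * freq), np.cos(k * freq)]])

# ...maps PE(pos) onto PE(pos + k) for any pos.
print(np.allclose(pe_pair(pos + k), rotation @ pe_pair(pos)))  # True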

A simple trick to improve LLM context length

Relatively simple experiments (hacks) with positional encodings have recently helped improve the context length of LLMs. This is a big deal: a larger context length lets you put more information in the prompt, and tasks like text summarization and code generation benefit massively from it.

A technique has been found for training large language models (LLMs) that can handle longer sequences of text than their original training window. This is important because LLMs like GPT-3 are typically trained on relatively short sequences (e.g., 1024 tokens) but may need to handle much longer sequences when they're actually used (e.g., when generating a long story). Let me try and break this down.

  1. Limited context window: When training LLMs, we usually only consider a limited number of words (tokens) at a time, often due to hardware memory constraints. This is the "context window". For example, we might train a model on sequences of 1024 tokens.
  2. Running beyond the window length: If we try to generate text that's longer than this window length, the model's performance might degrade because it hasn't been trained on such long sequences. This is a limitation of how LLMs are typically trained.
  3. Dividing the positional encoding values: One trick that has been discovered to mitigate this issue is to modify the positional encodings of the tokens. Specifically, the position indices used to compute the encodings are divided by a scaling factor, which "tricks" the model into treating the longer sequence as if it fit inside the familiar training range (a minimal sketch of this follows after the list).
  4. Performance degradation: This trick, while it allows the model to handle longer sequences, tends to come with some degradation in the model's performance. This is because it effectively reduces the distinctness of the positional encodings, making it harder for the model to distinguish between different positions.
  5. NTK-Aware method: This is a more sophisticated method that was proposed to handle longer sequences. It adjusts the positional encodings in a more nuanced way, taking into account the current sequence length. However, it also has some drawbacks, such as a tendency to suffer from "catastrophic perplexity blowup" (a sharp increase in the model's uncertainty) with longer sequences.
  6. Dynamic parameter adjustment: The solution proposed here is to adjust the parameters used in the NTK-Aware method dynamically, based on the current sequence length. This means that as the sequence gets longer, the parameters change in such a way that the degradation in performance is reduced.
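
Here is a minimal sketch of the "dividing" trick from point 3, shown with the sinusoidal encoding from earlier (in practice this interpolation idea is usually applied to RoPE-based models, and the scale factor of 4 is just an illustrative choice):

import numpy as np

# Suppose the model was trained on a 1024-token context but we now want 4096.
train_len, new_len = 1024, 4096
scale = new_len / train_len  # 4.0

# Treat position t as t / scale, squeezing all 4096 positions back into
# the 0..1023 range the model saw during training.
d_model = 512
pos = (np.arange(new_len) / scale)[:, np.newaxis]
i = np.arange(d_model)[np.newaxis, :]
angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

interpolated_pe = angle_rads  # shape (4096, 512)
print(interpolated_pe.shape)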

The overall goal of these techniques is to enable LLMs to handle longer sequences without a significant drop in performance. This is an ongoing area of research in the field of NLP, and there are likely to be further developments in the future.

No Positional Encoding

A new paper claims we can do without positional encodings altogether: https://arxiv.org/abs/2203.16634

This research demonstrates that even without explicit positional encoding, these models can still perform competitively. This has been observed across different datasets, model sizes, and sequence lengths, making it a robust phenomenon.

It appears that these models without explicit positional encodings are capable of developing an implicit understanding of absolute positions within the network. This suggests that the models are, in some manner, compensating for the absence of explicit positional information.

This may seem surprising at first, but consider the structure of transformers, which employ what's known as "causal attention." This form of attention ensures that each token attends only to its predecessors and not to its successors. This setup inherently provides some positional information: the number of predecessors each token can attend to can serve as a rough estimate of its absolute position.
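
A tiny sketch of that intuition (my own illustration): with a causal mask, the number of tokens each position is allowed to attend to equals its 1-based position, so absolute position is implicitly recoverable.

import numpy as np

seq_len = 6
# Causal mask: entry (i, j) is 1 if token i may attend to token j (j <= i).
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

# Counting each row's visible tokens recovers the token's absolute position.
print(causal_mask.sum(axis=1))  # [1 2 3 4 5 6]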

Therefore, the paper's findings indicate that causal transformer-based language models might be capable of inferring positional awareness from the structure of the attention mechanism itself, not solely from explicit positional encodings.

Types of Positional Encodings

These are methods that can be used to imbue Transformer models with a sense of positional awareness, each with its own trade-offs and characteristics.

  1. NoPE - Stands for "No Positional Encoding". It's a scenario where a Transformer model doesn't use any explicit positional encoding scheme, relying instead on the inherent causal attention mechanism (which allows each token to attend only to preceding tokens) to infer a token's position within a sequence.
  2. RoPE - Stands for "Rotary Position Embedding". Instead of adding a positional vector to the word embeddings, it rotates the query and key vectors by an angle proportional to each token's position. Because a rotation preserves vector norms, the embeddings themselves are not distorted, yet the attention score between two tokens ends up depending on their relative position (a sketch follows after this list).
  3. ALiBi - Stands for "Attention with Linear Biases". It does away with added positional encodings entirely and instead adds a fixed, non-learned penalty to each attention score that grows linearly with the distance between the query and the key, with a different slope per attention head. Distant tokens are penalized more, which gives the model a sense of distance and lets it extrapolate to sequences longer than those seen during training.
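
To make the RoPE idea from point 2 more tangible, here is a minimal NumPy sketch (my own simplified illustration, not a production implementation): it rotates consecutive dimension pairs of a query or key vector by position-dependent angles, and the resulting dot product depends only on the relative offset between the two positions.

import numpy as np

def rope_rotate(x, pos, base=10000):
    # Rotate consecutive pairs of dimensions of x by an angle proportional
    # to the token position 'pos'; each pair gets its own frequency.
    d = x.shape[-1]
    freqs = 1.0 / base ** (2 * np.arange(d // 2) / d)
    theta = pos * freqs
    x1, x2 = x[0::2], x[1::2]
    rotated = np.empty_like(x)
    rotated[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    rotated[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return rotated

# The score between a query and a key depends only on their relative offset:
q, k = np.random.randn(64), np.random.randn(64)
score_a = rope_rotate(q, 10) @ rope_rotate(k, 13)    # positions 10 and 13
score_b = rope_rotate(q, 110) @ rope_rotate(k, 113)  # same offset of 3
print(np.allclose(score_a, score_b))  # True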

Timeline

  1. Encoder-decoder models require Positional Encodings (PEs): Early transformer models were encoder-decoder architectures, like the original Transformer proposed in the "Attention is All You Need" paper. These models definitely needed positional encodings (PEs) because neither the encoder nor the decoder inherently understands the order of tokens in a sequence. PEs were used to provide this order information.
  2. Decoder-only models become popular: With the advent of models like GPT and GPT-2, researchers started shifting toward models that only used the decoder part of the original Transformer architecture. This "decoder-only" model is also called an autoregressive or causal language model, where each token can only attend to previous tokens. Even though these models still used PEs, the causal attention mechanism provided some inherent sense of token order.
  3. ALiBi introduces the question of sequence length generalization in PEs: The ALiBi (Attention with Linear Biases) technique was proposed to deal with the limitations of traditional PEs when handling sequences longer than the model was trained on. By replacing added positional encodings with distance-based biases on the attention scores, ALiBi lets a model trained on short sequences extrapolate to longer ones, putting sequence length generalization front and center.
  4. NoPE seems to work decently well (only tested in-distro): Research found that "No Positional Encoding" (NoPE) models, which don't use any explicit PEs, still perform decently on tasks within their training distribution ("in-distro"). This suggests that the causal attention mechanism, by itself, might provide enough positional information for the model to function. However, the robustness of NoPE models to sequence lengths outside their training distribution ("out-of-distro") is still an open question.

If you find this series useful, please do subscribe. I have a fun and interesting lineup of posts planned, covering not only the technical details but also the latest advancements and the developing culture around AI.
