Engineers Guide to AI - Understanding Positional Encoding
Positional encodings matter for Transformer models because they give the model access to the order of the words in the input sequence.
The self-attention mechanism has no inherent sense of word order: unlike RNNs or LSTMs, a Transformer processes all the words in a sequence in parallel. That parallelism is great for computational efficiency, but on its own the model cannot tell which word comes before another, and word order carries meaning in many language tasks.
For example, the sentence "The cat sat on the mat" has a different meaning than the sentence "The mat sat on the cat."
That's where positional encoding comes in. We add a vector to each word's embedding that represents its position in the sentence. The key is that these vectors are designed in a way that the model can learn to use them to understand the order of the words.
A common choice for these vectors is a fixed sinusoidal pattern, in which each dimension follows a sine or cosine wave of a different frequency (an idea often compared to counting in binary, where the low bits flip faster than the high ones). For a position 'pos' and dimension index 'i', the positional encoding is defined as follows:
# "Pos" stands for position and "i" stands for each
# dimension in the positional encoding. Every dimension
# corresponds to a sinusoid. The wavelengths of
# these sinusoids form a geometric progression, which means
# each wavelength is a constant multiple of the previous one.
# They start from a wavelength of 2π and go up to a
# wavelength of 10000 multiplied by 2π.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where 'd_model' is the dimensionality of the embeddings. These functions were chosen because they let the model easily learn to attend by relative position, and because their wavelengths span a wide range of positional relationships, from very short to very long.
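To make the numbers concrete, here is a tiny worked example (added for illustration, not part of the original article) that evaluates the two formulas above for pos = 1 and a toy model size of d_model = 4:
import math

# Evaluate PE(pos, 2i) and PE(pos, 2i+1) for pos = 1, d_model = 4
d_model = 4
pos = 1
pe = []
for i in range(d_model // 2):
    angle = pos / (10000 ** (2 * i / d_model))
    pe.append(math.sin(angle))   # dimension 2i   -> sin
    pe.append(math.cos(angle))   # dimension 2i+1 -> cos

print(pe)  # approximately [0.8415, 0.5403, 0.0100, 0.9999]
Notice how the first sin/cos pair oscillates quickly as the position changes while the second pair changes slowly; that spread of frequencies is what lets the model tell nearby and far-apart positions apart.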
All of this might sound complicated, but it's really just about filling in a big table of numbers. Imagine you're working with an embedding that has 512 dimensions (d_model = 512), so each position gets a vector with dimension indices running from 0 to 511. The even-numbered dimensions use 'sin' and the odd-numbered dimensions use 'cos', alternating all the way across the vector.
'pos' is the token's position in the sequence, running from 0 up to the maximum sequence length we've chosen for the model. What we're essentially doing is computing an angle (measured in radians) for every (position, dimension) pair. With a maximum sequence length of 128 and d_model = 512, the result is a 2D array of size 128 x 512.
Sinusoidal functions are mathematical functions that describe a smooth, periodic oscillation. They are based on the sine and cosine functions from trigonometry, which describe the coordinates of a point on a circle as it moves around the circle at a constant speed. Two of the most common sinusoidal functions are: y = sin(x) and y = cos(x)
To make this more concrete, let's consider a very simple example in JavaScript:
function positionalEncoding(position, dimensions) {
  // Build one encoding vector for a single position.
  // Even indices use sin, odd indices use cos, and the
  // frequency decreases as the dimension index grows.
  const posEnc = new Array(dimensions).fill(0).map((_, i) => {
    const angle = position / Math.pow(10000, (2 * Math.floor(i / 2)) / dimensions);
    return i % 2 === 0 ? Math.sin(angle) : Math.cos(angle);
  });
  return posEnc;
}
// Get the positional encoding for position 1 of a
// sentence, with an embedding size of 16
console.log(positionalEncoding(1, 16));
Or a more efficient, vectorized one in Python using the popular NumPy library:
# Import the numpy library, which provides support for arrays and mathematical functions
import numpy as np

# Define the positional encoding function
def positional_encoding(position, d_model):
    # np.arange(position)[:, np.newaxis] is a column vector of positions 0..position-1
    # np.arange(d_model)[np.newaxis, :] is a row vector of dimension indices 0..d_model-1
    # Dividing the positions by 10000^(2 * (i // 2) / d_model) broadcasts the two together,
    # producing a 2D array where each row corresponds to a position 'pos'
    # and each column corresponds to a dimension 'i'.
    # Each entry is the angle, in radians, for that (position, dimension) pair.
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)
    )

    # For even dimension indices, apply the sin function to the radian angles,
    # which maps them into the range between -1 and 1
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # For odd dimension indices, apply the cos function,
    # which likewise maps them into the range between -1 and 1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Return the resulting 2D array: the positional encodings for all positions
    return angle_rads

# Call the positional encoding function to get the encodings for the first 10 positions,
# using a model dimension of 300. The output is a 10x300 2D array, where each row is the
# positional encoding for one position and each column is a different dimension.
pos_encoding = positional_encoding(10, 300)
This gives you a 10 x 300 matrix, where each row is the positional encoding for one position in the sentence; each row can then be added to the word embedding of the token at that position.
Once we add these positional encodings to the word embeddings, the transformer model can learn to use this information to understand word order and can better capture the meaning of the sentence.
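As a quick sketch of that step (the embedding values below are random placeholders, and positional_encoding is the NumPy function defined above), the addition is simply element-wise:
import numpy as np

seq_len, d_model = 10, 300
# Placeholder word embeddings for a 10-token sentence (random for illustration)
word_embeddings = np.random.randn(seq_len, d_model)

# Positional encodings from the function defined earlier
pos_encoding = positional_encoding(seq_len, d_model)

# The Transformer's input is simply the element-wise sum of the two
model_input = word_embeddings + pos_encoding
print(model_input.shape)  # (10, 300)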
Why use sine and cosine functions?
The authors of the original Transformer paper chose this function based on a hypothesis: they believed it would help the model learn to attend to relative positions more easily.
Let's consider the concept of a "fixed offset" - this is just a constant value that you might add to the position (pos) of a token. We denote this fixed offset as 'k'.
So, when you add this fixed offset to a position, you get a new position (pos + k). According to the hypothesis, the positional encoding at this new position, PE(pos + k), can be expressed as a straightforward transformation (a linear function) of the positional encoding at the original position, PE(pos).
In other words, the idea is that a model using this positional encoding could learn to shift its attention from one position to another just by applying a simple, predictable transformation to the positional encodings. That makes it easier for the model to understand and use the concept of relative position.
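Here is a small numerical check of that hypothesis (a sketch added for illustration; the particular dimension, position, and offset are arbitrary). For any one sin/cos pair, the encoding at position pos + k is a rotation of the encoding at position pos, and the rotation matrix depends only on the offset k, not on pos:
import numpy as np

d_model, i = 512, 4                      # arbitrary dimension pair (2i, 2i+1)
w = 1.0 / 10000 ** (2 * i / d_model)     # the frequency used by that pair
pos, k = 7, 5                            # arbitrary position and fixed offset

pe_pos = np.array([np.sin(w * pos), np.cos(w * pos)])
pe_pos_plus_k = np.array([np.sin(w * (pos + k)), np.cos(w * (pos + k))])

# Rotation matrix that depends only on the offset k
rotation = np.array([[np.cos(w * k),  np.sin(w * k)],
                     [-np.sin(w * k), np.cos(w * k)]])

print(np.allclose(rotation @ pe_pos, pe_pos_plus_k))  # True
Because such a rotation is a linear transformation, attending to "the token k positions back" becomes something the model can express with a fixed set of weights.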
A simple trick to improve LLM context length
Relatively simple recent experiments (hacks, really) with positional encodings have helped extend the context length of LLMs. This is a big deal, because a larger context length lets you put more information in the prompt; tasks like text summarization and code generation benefit massively from it.
Techniques have been found that let large language models (LLMs) handle longer sequences of text than their original training window. This is important because LLMs are typically trained on relatively short sequences (2,048 tokens in GPT-3's case) but may need to handle much longer ones when they're actually used, for example when generating a long story. Let me try to break this down.
The overall goal of these techniques is to enable LLMs to handle longer sequences without a significant drop in performance. This is an ongoing area of research in the field of NLP, and there are likely to be further developments in the future.
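The article doesn't name a specific technique, but one widely discussed example of this kind of hack is position interpolation: positions beyond the training window are scaled down so they fall back inside the range the model was trained on. Below is a minimal sketch of that idea, applied for simplicity to the sinusoidal encoding from earlier (in practice it is usually applied to rotary embeddings; the function name and numbers here are illustrative):
import numpy as np

def interpolated_positional_encoding(seq_len, d_model, train_len):
    # Scale positions so a sequence longer than the training window
    # maps back into the position range the model saw during training
    scale = min(1.0, train_len / seq_len)
    positions = np.arange(seq_len)[:, np.newaxis] * scale
    dims = np.arange(d_model)[np.newaxis, :]
    angle_rads = positions / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return angle_rads

# A 4096-token input squeezed into the position range of a 2048-token training window
pe = interpolated_positional_encoding(4096, 300, train_len=2048)
print(pe.shape)  # (4096, 300)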
No Positional Encoding
A new paper claims we can do without positional encodings altogether: https://arxiv.org/abs/2203.16634
This research demonstrates that even without explicit positional encoding, these models can still perform competitively. This has been observed across different datasets, model sizes, and sequence lengths, making it a robust phenomenon.
It appears that these models without explicit positional encodings are capable of developing an implicit understanding of absolute positions within the network. This suggests that the models are, in some manner, compensating for the absence of explicit positional information.
This may seem surprising at first, but consider the structure of decoder-style transformer language models, which employ what's known as "causal attention." This form of attention ensures that each token attends only to its predecessors and not to its successors. This setup inherently provides some positional information: the number of predecessors each token can attend to serves as a rough estimate of its absolute position.
Therefore, the paper's findings indicate that causal transformer-based language models might be capable of inferring positional awareness from the structure of the attention mechanism itself, not solely from explicit positional encodings.
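A tiny illustration of that argument (added here, not taken from the paper): in a causal attention mask, the number of positions each token is allowed to attend to equals its absolute position plus one, so position information is implicitly available to the model.
import numpy as np

seq_len = 5
# Lower-triangular causal mask: token i may attend only to tokens 0..i
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# Each row sum counts how many tokens a position can see (itself plus its
# predecessors), which is a direct proxy for its absolute position
print(causal_mask.sum(axis=1))  # [1. 2. 3. 4. 5.]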
Types of Positional Encodings
There are several methods that can be used to imbue Transformer models with a sense of positional awareness, each with its own trade-offs and characteristics.
[Figure: a timeline of positional encoding methods]
If you find this series useful, please do subscribe. I have a fun and interesting lineup of posts planned, covering not only the technical details but also the latest advancements and the developing culture around AI.