Deep Dive into the Positional Encodings of the Transformer Neural Network Architecture: With Code!
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Introduction
This is the continuation of my series of blogs on Language Models. This blog is about the positional encodings, a.k.a. positional embeddings, of the Transformer neural network architecture. The article is organized as follows: I first recapitulate the flow of the discussion on language models and some of the important concepts covered so far in this series of articles, and then turn to why positional encodings are needed, the intuition behind them, and their mathematical formulation.
2. Transformer Encoder-Decoder Architecture:
In my article on “The Evolution of Language Models”, discussed here, I talked about the n-gram language model and the Recurrent Neural Network architecture. I discussed in detail why Transformers score over Recurrent Neural Networks, and followed that with a discussion of the Transformer neural network architecture.
It was pointed out that the Transformer architecture comprises two components, an Encoder and a Decoder, both of which build a contextual understanding of language. If we stack Encoders, we get “Bidirectional Encoder Representations from Transformers” (BERT); if we stack Decoders, we get the “Generative Pre-trained Transformer” (GPT). The article highlighted that Encoder-only models are good for tasks that require understanding of the input, such as sentiment classification and named entity recognition (NER), whereas Decoder-only models are good for generative tasks such as text generation – for example, language models.
Encoder-Decoder models are good for sequence-to-sequence tasks such as text summarization or translation.
3. Transformer Components:
In my article Demystifying Multi-Head Attention in the Transformer Neural Network Architecture, I had pointed out that since computers do not understand words and can only work with numbers/vectors/matrices, we must first tokenize the text before passing it into the model. Once the words are tokenized, the token ids are passed to an embedding layer, which is a trainable vector embedding space – a higher-dimensional space where each token is represented as a vector and occupies a unique location in that space. The embedding vectors “learn” to encode the meaning and the context of the individual tokens (words) in the sentence.
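To make this concrete, here is a minimal sketch of the tokenize-then-embed step, assuming PyTorch's nn.Embedding as the trainable embedding layer and a toy whitespace "tokenizer" with a made-up vocabulary (real pipelines use subword tokenizers such as BPE or WordPiece):

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary and a naive whitespace "tokenizer" for illustration only.
vocab = {"the": 0, "man": 1, "drove": 2, "woman": 3, "to": 4, "store": 5}
sentence = "the man drove the woman to the store"
token_ids = torch.tensor([vocab[w] for w in sentence.split()])   # shape: (seq_len,)

d_model = 512   # embedding size used in "Attention Is All You Need"
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

word_embeddings = embedding(token_ids)   # (seq_len, d_model), trainable vectors
print(word_embeddings.shape)             # torch.Size([8, 512])
```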
Looking at the sample sequence below, it can be noticed that each word has been mapped to a token id, and each token id has been mapped to a vector. In the original Transformer paper, “Attention Is All You Need”, the length of the embedding vector was 512. As an example, in the figure below, one can see that the embedding vectors corresponding to closely related words are closely spaced in the embedding space.
We then add the positional encoding to the embedding vector, which is how the model receives information about the word order.
After adding the positional encoding to the original word embedding, we generate a query, a key and a value vector for each word. We pass each vector – comprising the word embedding plus the positional encoding – into three separate neural network units, each with its own trainable set of weights. The resulting vectors are then split into multiple heads to perform multi-head attention, which I have discussed in detail in my blog on multi-head attention: Demystifying Multi-Head Attention in the Transformer Neural Network Architecture.
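As a rough illustration of that step, the sketch below projects the (embedding + positional encoding) vectors into query, key and value vectors with three separate linear layers and splits them into heads. It assumes PyTorch and illustrative sizes (d_model = 512, 8 heads) and is a simplified single-sentence version, not the full batched implementation:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
d_head = d_model // num_heads          # 64 dimensions per head

seq_len = 8
x = torch.randn(seq_len, d_model)      # word embeddings + positional encodings

# Three separate trainable projections, one each for query, key and value.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)       # each: (seq_len, d_model)

# Split each projection into num_heads smaller vectors for multi-head attention.
Q = Q.view(seq_len, num_heads, d_head).transpose(0, 1)   # (num_heads, seq_len, d_head)
K = K.view(seq_len, num_heads, d_head).transpose(0, 1)
V = V.view(seq_len, num_heads, d_head).transpose(0, 1)

scores = Q @ K.transpose(-2, -1) / d_head ** 0.5          # (num_heads, seq_len, seq_len)
attention = torch.softmax(scores, dim=-1) @ V             # (num_heads, seq_len, d_head)
```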
4. Why do we need Positional Encoding a.k.a. Positional Embedding in the Transformer Neural Network Architecture?
It should be underscored that in the Transformer network, the attention block processes the embedding vectors in parallel when transforming each of them into query, key and value vectors. Therefore, the order of the words is not known anywhere in that computation. The order information has to be modelled explicitly – and it is modelled through the positional encodings.
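The sketch below illustrates this point, assuming a toy single-head attention with identity projections: shuffling the input embeddings simply shuffles the output rows, so attention by itself cannot tell one word order from another.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)            # embeddings WITHOUT positional encoding

def self_attention(x):
    # Simplified single-head attention with identity projections, for illustration.
    scores = x @ x.T / d_model ** 0.5
    return F.softmax(scores, dim=-1) @ x

perm = torch.randperm(seq_len)               # shuffle the "words"
out, out_perm = self_attention(x), self_attention(x[perm])

# The output for the shuffled sequence is just the shuffled output of the original:
print(torch.allclose(out[perm], out_perm, atol=1e-6))   # True
```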
The order information between the words is significant in English or any other language. For example, consider the two sentences:

1. The man drove the woman to the store.
2. The woman drove the man to the store.

One can imagine that by reversing the word order in the sentence above, we are essentially swapping who is driving – the man versus the woman. Without the order information, transforming the embeddings into contextually rich embeddings is meaningless. Therefore, the positional embedding is superimposed (added) onto the word embedding before processing in the attention block.
5. Physical intuition of the Positional Embeddings a.k.a. Positional Encodings
Positional encodings can be looked upon as identifiers that tell the Transformer where each word/input lies within the sequence of words. These identifiers are then added to the initial vector representation of the input.
This addition of the positional encodings to the initial vector representation can be physically understood as follows:
Let us say we have the words “King” and “Queen” – both words share close context – and let us assume that the embedding vectors representing “King” and “Queen” are laid out in three-dimensional space as below:
By identifying the order of a word in the sentence, we offset the word embedding of that word as shown in the figure below. This multi-dimensional shifting helps the Transformer capture the order information.
Thus, in short, positional embeddings are order (position) identifiers added to the original embedding vectors so that the Transformer architecture can incorporate the order of the sequence. The positional embeddings must fulfil certain requirements, discussed below, which will then lead to the mathematical formulation of the positional embeddings.
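As a toy illustration of this superposition, the sketch below adds small, hand-picked 3-D positional offsets (purely hypothetical numbers) to hypothetical 3-D embeddings of “King” and “Queen”:

```python
import torch

# Illustrative 3-D vectors only (real models use hundreds of dimensions).
king  = torch.tensor([0.90, 0.80, 0.10])    # hypothetical embedding of "King"
queen = torch.tensor([0.88, 0.82, 0.12])    # hypothetical embedding of "Queen" (nearby)

# Hypothetical bounded positional offsets for positions 3 and 7 in a sentence.
pos_3 = torch.tensor([0.05, -0.02, 0.04])
pos_7 = torch.tensor([-0.03, 0.06, -0.01])

king_at_3  = king + pos_3      # positional information superimposed on the embedding
queen_at_7 = queen + pos_7

# The small offsets shift each vector slightly, encoding order without
# destroying the semantic closeness of "King" and "Queen".
print(torch.dist(king, queen), torch.dist(king_at_3, queen_at_7))
```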
6. Requirements of the Positional Embedding
Let us intuitively understand some of the requirements of the positional embeddings:
I. 1st Requirement:
Firstly, the positional embedding for a given position should be the same identifier irrespective of the sequence length or of what the input is. That is, the positional embedding for a given position in a sentence must remain the same even if a different word appears at that position. This is illustrated in the figure below:
II. 2nd Requirement
Since positional embeddings are offsets that get added to the original word embedding, they cannot be too large. If the positional embeddings (offset vectors) are too large, the embedding vector will be offset by a very large distance, destroying the overall concept of the embedding space, wherein closely related word vectors must remain close to one another.
Thus, one can imagine that if the positional encoding were represented as a vector whose magnitude grows with the position of the word in the sequence, this could not be correct, as it might offset the embedding vector by a huge amount, as stated above. The positional embedding vectors therefore have to be bounded – that is, the offsets introduced to incorporate the order information must be small/bounded.
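A quick back-of-the-envelope check, assuming a 512-dimensional embedding and a hypothetical naive scheme that encodes position k as a vector filled with the value k, shows how such an unbounded offset would swamp the word embedding:

```python
import torch

d_model = 512
word_embedding = torch.randn(d_model)             # a typical word embedding, roughly unit-scale entries

# Hypothetical naive scheme: encode position k as a vector filled with the value k.
naive_offset_at_pos_500 = torch.full((d_model,), 500.0)

print(word_embedding.norm())            # about sqrt(512) ≈ 22.6
print(naive_offset_at_pos_500.norm())   # 500 * sqrt(512) ≈ 11314 – the offset swamps the embedding
```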
7. Formulation of Positional Embedding Vectors
From the discussion in section 6, it is clear that the function used to mathematically model the positional embeddings in the Transformer neural network architecture has to be bounded in the values it can take – like the sine or cosine functions, which take values between -1 and +1. These are periodic functions, and they have the upside of being defined for arbitrarily large inputs – so they are well suited to sentences of any length, while always keeping their values between -1 and +1!
Another question that arises is: can we choose only sine or cosine functions?
If we choose only a sine or only a cosine function as the positional embedding, the outcome will repeat for different positions, since these functions are periodic. To obtain a solution satisfying the requirements explained in section 6, it is best to choose a combination of sines and cosines with different frequencies.
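To see the effect of periodicity, here is a small illustrative check (plain Python, with hypothetical frequencies): a single sinusoid can assign nearly identical values to two different positions, whereas a vector of sinusoids with different frequencies tells them apart.

```python
import math

# A single sinusoid is periodic, so two different positions can receive
# (almost) the same value:
print(math.sin(1.0), math.sin(1.0 + 2 * math.pi))      # practically identical

# A vector of sinusoids with different (hypothetical) frequencies separates them:
freqs = [1.0, 1 / 10, 1 / 100]
pos_a, pos_b = 1.0, 1.0 + 2 * math.pi
print([math.sin(pos_a * f) for f in freqs])
print([math.sin(pos_b * f) for f in freqs])            # differs in the lower-frequency components
```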
Taking these requirements into consideration, the function is taken as:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

Here:

pos is the position of the word in the sequence,
i indexes the pair of dimensions of the positional encoding vector (even dimensions use the sine, odd dimensions the cosine), and
d_model is the dimension of the embedding vector (512 in the original paper).

That is: each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π.
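Below is a minimal sketch of this sinusoidal positional encoding in PyTorch, following the formula above (the max_len of 50 is arbitrary; the final commented line shows how the encodings would typically be added to the word embeddings):

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    # 10000^(2i / d_model) for each pair of dimensions i = 0, 1, ..., d_model/2 - 1
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions -> sine
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions  -> cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)             # torch.Size([50, 512])
print(pe.min(), pe.max())   # bounded between -1 and +1

# The encodings are simply added to the word embeddings:
# word_embeddings = embedding(token_ids) + pe[:seq_len]
```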
8. GitHub Repository - Colab notebooks with code
Positional Encoding Colab Notebooks: https://github.com/ajaytaneja-learner/transformers-notebooks