Demystifying Multihead Attention in the Transformer Neural Network Architecture – With Code
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Introduction
This is a continuation of my series of blogs on Large Language Models. In this article I am going to talk about the multi-head attention mechanism in the Transformer neural network architecture and code out its essential mathematical ingredients in a Colab notebook.
The highlight of this article is the source code (Section 4, in my GitHub) of the multi-head attention block, which I have not covered in any of my blogs on LLMs until now.
This is the 9th blog in my series on Large Language Models; all of the earlier articles can be found on my LinkedIn profile.
2. Multihead Attention: Intuition and Calculation Details
The Transformer model introduced in the paper “Attention Is All You Need” dramatically improved performance on NLP tasks over earlier-generation models such as Recurrent Neural Networks and LSTMs. As explained in my article on the Evolution of Language Models [https://www.dhirubhai.net/pulse/evolution-language-models-my-notes-ajay-taneja/?trackingId=d5WQH95MRp2iA%2B6b8ky0UQ%3D%3D], the power of the Transformer lies in its ability to compute a measure of the relevance/context of all the words in a sentence: the attention matrix expresses how each word in a sentence is related to the rest of the words in that sentence.
How the Transformer works: an overview
The Transformer is split into two distinct parts: the encoder and the decoder. These components work in conjunction with each other and share a number of similarities.
Input Embedding and positional encoding:
To start with, since computers do not understand words and only understand numbers/vectors/matrices, the text must first be tokenized before it is passed into the model. Once the words are tokenized, these token ids are passed to an embedding layer, which is a trainable vector embedding space: a high-dimensional space where each token is represented as a vector and occupies a unique location in that space. These embedding vectors learn to encode the meaning and context of the individual tokens in the sentence.
Looking at a sample sequence, each word is mapped to a token id and each token id is mapped to a vector. In the original Transformer paper the vector size was 512; as shown in the figure below, the embedding vectors corresponding to closely related words are closely spaced in the embedding space.
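As a concrete sketch of this step (assuming PyTorch and purely illustrative values for the vocabulary size and token ids, not the ones used in my notebook), the embedding lookup might look like this:

import torch
import torch.nn as nn

# Illustrative values: a small vocabulary and d_model = 512 as in the original paper.
vocab_size = 10000
d_model = 512

embedding = nn.Embedding(vocab_size, d_model)  # trainable embedding table

# A toy "tokenized" sentence: each word has already been mapped to a token id.
token_ids = torch.tensor([[7, 421, 358, 902, 13]])   # shape: (batch=1, seq_len=5)

token_embeddings = embedding(token_ids)               # shape: (1, 5, 512)
print(token_embeddings.shape)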
We then add the positional encoding, which preserves the information about word order.
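A minimal sketch of the sinusoidal positional encoding from “Attention Is All You Need”, continuing the PyTorch setup above (my notebook may implement this differently):

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) positional encoding matrix from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    position = torch.arange(seq_len).unsqueeze(1)                              # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The positional encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(5, 512)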
Self-Attention:
The embedding vectors and the positional encodings are summed, and the sum is passed to the self-attention layer. By adding the positional encoding, the word-order information is preserved and thus the relevance of each word's position in the sentence is maintained.
In self-attention, the model analyses the relationships between the tokens in the input sentence. We use three separate neural network layers, each with its own set of weights, to generate the query, key and value vectors respectively. These weights are the learnable parameters updated during training. You might refer to my blog on the Evolution of Language Models (hyperlinked earlier in this article) to understand the physical interpretation of the query, key and value vectors.
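A minimal sketch of these three projections and the scaled dot-product attention they feed into (single head, PyTorch, illustrative dimensions):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 512   # embedding dimension, as in the original paper

# Three separate linear layers, each with its own learnable weights,
# produce the query, key and value vectors from the same input.
W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

x = torch.randn(1, 5, d_model)   # (batch, seq_len, d_model): embeddings + positional encoding

Q, K, V = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention (here d_k = d_model because there is a single head):
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # (1, 5, 5) raw attention scores
weights = F.softmax(scores, dim=-1)                      # each row sums to 1
output = weights @ V                                     # contextualised representations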
Multihead self-attention:
The above process does not happen just once: the Transformer architecture uses multi-head attention, which means that multiple sets (or heads) of self-attention weights are learnt in parallel. The number of attention heads varies from model to model, but it is typically on the order of 12 to 100.
Intuition behind multiple attention heads
The intuition behind multi-headed attention in the Transformer is that each head learns a different aspect of language. For example, one head might capture the relationships between people, another might focus on the context of the sentence, another might relate nouns to numerical values, and yet another might pick up rhymes between words (or tokens), and so on.
Which aspect of language each head learns is not decided ahead of the training process: the weights of each attention head are randomly initialised, the network is given sufficient training data, and during training each head learns some aspect of language. In my experience it is generally not easy to interpret which aspect of language a particular head is learning or has learnt.
Multi-head Attention – calculations in a nutshell
The attention scores in the attention matrix lie between 0 and 1. They are obtained by taking the dot product of the Query and Key matrices, scaling the result to reduce its variance, and passing it through a softmax so that each row becomes a probability distribution. In the notation of the original paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
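Putting the pieces together, a compact multi-head attention module might look like the sketch below (PyTorch; the Colab notebook linked in Section 4 contains my full version, so treat this as an illustrative outline rather than the notebook code):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # One projection per role; the heads are obtained by reshaping.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # final output projection

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape

        # Project and split into heads: (batch, num_heads, seq_len, head_dim)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)        # one attention matrix per head
        context = weights @ V                      # (batch, num_heads, seq_len, head_dim)

        # Concatenate the heads back together and apply the output projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.W_o(context)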
The Decoder portion:
Next, to the decoder portion we pass the contextually aware embeddings of the English sentence (the encoder output), along with a start token denoting the start of the sentence, and start generating the French words one after the other. At the nth time step, to obtain the nth French word, we pass the encoder output for the English sentence plus the French words that were generated up to the (n-1)th time step. This is illustrated in the figures below.
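A highly simplified sketch of that autoregressive loop (greedy decoding; the decoder, encoder_output and token names here are illustrative placeholders, not my actual notebook code):

import torch

def greedy_decode(decoder, encoder_output, start_token_id, end_token_id, max_len=50):
    # encoder_output: contextually aware representation of the English sentence.
    # decoder: a callable mapping (generated French ids, encoder_output) to next-token logits.
    generated = [start_token_id]
    for _ in range(max_len):
        tgt = torch.tensor([generated])              # French tokens generated up to step n-1
        logits = decoder(tgt, encoder_output)        # (1, len(generated), vocab_size)
        next_id = int(logits[0, -1].argmax())        # most probable next French token
        generated.append(next_id)
        if next_id == end_token_id:                  # stop at the end-of-sentence token
            break
    return generated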
3. Code Flow of the Overall Multihead Attention Mechanism
4. Coding the Mathematics in Multihead Attention [Colab Notebook]
Multihead-Attention Colab Notebooks – GitHub: https://github.com/ajaytaneja-learner/transformers-notebooks