Demystifying Multihead Attention in the Transformer Neural Network Architecture – With Code

1. Introduction

This is the continuation of my series of blogs on Large Language Models – in this article I discuss the multi-head attention mechanism in the Transformer neural network architecture and code out its essential mathematical ingredients in a Colab notebook. The article is organized as follows:


  • In Section 2, I discuss the inner workings of the Transformer, with particular emphasis on multi-head attention.


  • In Section 3, I provide code snippets from my Colab notebook illustrating the calculation details in the multi-head attention block.


  • Section 4 points to my GitHub repository, which has two new Colab notebooks containing the code for each step of the multi-head attention mechanism. The ingredients in both notebooks are essentially the same, but the second notebook is organized into classes and functions with docstring explanations.


The highlight of this article is the source code (Section 4, on my GitHub) of the multi-head attention block, which I have not covered in any of my blogs on LLMs until now.


This is the ninth blog in my series on Large Language Models – all of these articles can be found on my LinkedIn profile, and the hyperlinks are as follows:


i) ChatGPT and how it works – my notes: Part 1

ii) Step 2 of ChatGPT training demystified: Part 2 of the ChatGPT series of my notes

iii) Step 3 of ChatGPT training demystified: Part 3 of the ChatGPT series of my notes

iv) Step 1 of ChatGPT Training Demystified: Part 4 of ChatGPT series of my notes

v) Foundational Principles of Deep Learning – my notes

vi) The Evolution of Language Models: my notes

vii) Demystifying Self Attention in the Transformer Neural Network Architecture – With Code

viii) LangChain – Essential Concepts – my notes


2. Multihead Attention: Intuition and Calculation Details

The Transformer model introduced in the paper “Attention Is All You Need” dramatically improved performance on NLP tasks over earlier-generation models such as Recurrent Neural Networks and LSTMs. As explained in my article on the Evolution of Language Models [https://www.dhirubhai.net/pulse/evolution-language-models-my-notes-ajay-taneja/?trackingId=d5WQH95MRp2iA%2B6b8ky0UQ%3D%3D], the power of the Transformer lies in its ability to produce a measure of the relevance/context of all the words in a sentence – the attention matrix expresses how each word in a sentence is related to the rest of the words in that sentence.

Figure: Google’s paper on Attention Is All You Need


Figure: Attention Scores (Attention Matrix)


How the Transformer works: an overview

The Transformer is split into 2 distinct parts: the Encoder and the Decoder. These components work in conjunction with each other, and they share a number of similarities.

Figure: Transformer – Encoder-Decoder


Figure: Encoder and Decoder portion of the Transformer


Input Embedding and positional encoding:

To start with, since computers do not understand words and only understand numbers/vectors/matrices, we must first tokenize the words before passing the text to the model. Once the words are tokenized, these tokens/numbers are passed to an embedding layer, which is a trainable vector embedding space – a high-dimensional space where each token is represented as a vector and occupies a unique location. These embedding vectors learn to encode the meaning and context of the individual tokens in the sentence.


Looking at the sample sequence below, each word has been mapped to a token id and each token id has been mapped to a vector. In the original Transformer paper, the vector size was 512. As an example, the embedding vectors corresponding to closely related words are closely spaced in the embedding space, as shown in the second figure below.


Figure: Each word mapped to a token id and each token mapped to a vector



Figure: Closely related words are closely spaced in the embedding space (note: in reality the embedding space has a much higher dimension than 3)
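
Before moving on to the positional encoding, the snippet below is a minimal sketch (not the notebook code) of the tokenize-then-embed step in PyTorch. The vocabulary size and token ids are made up purely for illustration; only the embedding size of 512 comes from the original paper.

```python
# A minimal sketch of the tokenize -> embed step in PyTorch.
# The vocabulary size and token ids below are made up for illustration;
# only d_model = 512 comes from the original Transformer paper.
import torch
import torch.nn as nn

vocab_size = 10_000                               # assumed toy vocabulary size
d_model = 512                                     # embedding size used in the original paper

# Toy "tokenization": in practice a trained tokenizer maps words/sub-words to ids.
token_ids = torch.tensor([[11, 57, 842, 305]])    # e.g. "I Love Deep Learning" -> 4 token ids

embedding = nn.Embedding(vocab_size, d_model)     # trainable embedding table
word_embeddings = embedding(token_ids)            # shape: (1, 4, 512)
print(word_embeddings.shape)
```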


We then add the positional encoding, through which we preserve the information about word order.

Figure: Positional encoding added to the word embedding to maintain the relevance of the word order.
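
For readers who want to see the arithmetic, here is a short sketch of the sinusoidal positional encoding used in the original paper; adding it element-wise to the word embeddings from the previous snippet gives the encoder input. The function name is my own.

```python
# A sketch of the sinusoidal positional encoding from "Attention Is All You Need".
# The function name is illustrative; the formula follows the paper.
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                      # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=512)
# The input to the self-attention layer is the element-wise sum: word_embeddings + pe
```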


Self-Attention:

The embedding vector and the positional encoding are summed, and the sum is passed to the self-attention layer. By adding the positional encoding, the word order information is preserved and thus the relevance of the position of the word in the sentence is maintained.


In self-attention, the model analyses the relationships between the tokens in the input sentence. We use three separate neural network layers, each with its own set of weights, to generate the query, key and value vectors respectively; these weights are the learnable parameters updated during training. You might refer to my blog on the Evolution of Language Models (hyperlinked in the introduction section of this article) for the physical interpretation of the query, key and value vectors.
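
As a concrete illustration, the snippet below sketches the three linear layers that produce the query, key and value vectors. The variable names are mine, not from the paper or my notebooks.

```python
# A sketch of generating the query, key and value vectors with three separate
# linear layers, one learnable weight matrix each. Names and shapes are illustrative.
import torch
import torch.nn as nn

d_model = 512
x = torch.randn(1, 4, d_model)       # (batch, seq_len, d_model): embeddings + positional encoding

w_q = nn.Linear(d_model, d_model)    # learnable weights for the queries
w_k = nn.Linear(d_model, d_model)    # learnable weights for the keys
w_v = nn.Linear(d_model, d_model)    # learnable weights for the values

q, k, v = w_q(x), w_k(x), w_v(x)     # each of shape (1, 4, 512)
```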


Multihead self-attention:

The above process does not happen just once – the Transformer architecture uses multi-head attention, which means that multiple sets (or heads) of self-attention weights are learnt in parallel. The number of attention heads varies from model to model, but it is typically of the order of 12 to 100.


Intuition behind multiple attention heads

The intuition behind multi-headed attention in the Transformer is that each head learns a different aspect of language. For example, one head might capture the relationships between people, another might focus on the context of the sentence, another might relate nouns to numerical values, and yet another might pick up on rhyming between words (or tokens), and so on.

Which aspect of language each head learns is obviously not decided ahead of the training process – the weights of each attention head are randomly initialised, the network is given sufficient training data, and during training each head picks up some aspect of language. In my view it is generally not easy to interpret which aspect of language a particular head is learning or has learnt.


Multi-head Attention – calculations in a nutshell

  • To form the attention heads, the query, key and value vectors are broken down into “h” pieces, each piece belonging to one attention head. Thus, if the original word vector is of size 512, the per-head query, key and value vectors will be of size 512 / h; so, if the number of attention heads is 8, the size of the query, key and value vectors per head will be 512 / 8 = 64.


  • An attention matrix is generated for each attention head, of size sequence length × sequence length. So, if we have the sentence “I Love Deep Learning”, the size of the attention matrix for each head will be 4 × 4.


  • The attention matrix is generated from the mathematical operation QK^T; the product is scaled by the square root of the head dimension (to control the variance) and passed through a softmax to convert the numbers into probabilities, i.e. the attention matrix is softmax(QK^T / √d_k).


  • The attention matrix expresses how each word in a sentence is related to the rest of the words in that sentence. Thus, the model learns the relevance of each word to every other word, no matter where they appear in the input. The attention matrix shown below clearly illustrates how each word in a sentence is related to all the other words:


Figure: Attention Scores


The attention scores illustrated in the attention matrix above lie between 0 and 1; they are obtained by taking the dot product of the query and key matrices (scaled to reduce the variance) and passing the result through a softmax to obtain probabilities between 0 and 1.


  • It may be noted from the above matrix, for example, that the word “deep” is strongly connected with (paying attention to) the word “learning”, and thus the score (0.7) is high; similarly, the word “learning” is strongly connected to the word “deep”.


  • This attention matrix is multiplied by the value matrix to get the resultant contextually rich (transformed) word embeddings, as illustrated in the equation below. A code sketch walking through all of these steps follows the figure.

Getting the contextually rich (transformed) embedding vectors
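
The snippet below is a compact sketch tying together the calculation steps listed above – splitting the 512-dimensional vectors into 8 heads of size 64, forming a 4 × 4 attention matrix per head with softmax(QK^T / √d_k), and multiplying by the values. It is a simplified illustration, not the exact code from my notebooks (see Section 4 for those).

```python
# A compact sketch of the multi-head attention steps listed above.
# Shapes follow the running example: d_model = 512, 8 heads of size 64,
# and a 4-token sentence such as "I Love Deep Learning".
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, d_model, num_heads = 1, 4, 512, 8
head_dim = d_model // num_heads                       # 512 / 8 = 64

x = torch.randn(batch, seq_len, d_model)              # embeddings + positional encoding

w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
w_o = nn.Linear(d_model, d_model)                     # final output projection

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    return t.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))

# One (seq_len x seq_len) attention matrix per head: softmax(QK^T / sqrt(d_k))
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)    # (1, 8, 4, 4)
attention = F.softmax(scores, dim=-1)                      # each row sums to 1

# Weight the value vectors by the attention scores, then merge the heads back.
context = attention @ v                                    # (1, 8, 4, 64)
context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
output = w_o(context)                                      # contextually rich embeddings, (1, 4, 512)
print(output.shape)
```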


The Decoder portion:

Moving next to the decoder portion, we pass the contextually aware English embeddings, along with a start token denoting the start of the sentence, and begin generating the French words one after the other. At the nth time step, to obtain the nth French word, we pass in the English sentence as processed by the encoder unit of the architecture, plus the French words generated up to the (n−1)th time step. This is illustrated in the figures below.


Figure: Working of the Decoder portion of the Transformer for a language translation task
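
One detail implied by this step-by-step generation is that, in the standard Transformer decoder, a causal (look-ahead) mask prevents each position from attending to words that have not been generated yet. The snippet below is a minimal illustration of such a mask; it is an assumption of standard decoder behaviour, not code from my notebooks.

```python
# A minimal illustration of the causal "look-ahead" mask used in a standard
# Transformer decoder so that position n cannot attend to positions after n.
import torch

seq_len = 4
# True above the diagonal marks future positions that must be hidden.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)                    # raw decoder self-attention scores
scores = scores.masked_fill(causal_mask, float("-inf"))   # block attention to future words
attention = torch.softmax(scores, dim=-1)                 # each row attends only to current/past tokens
```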


3. Code Flow of the Overall Multihead Attention Mechanism


Figure: Colab notebook extracts - a
Figure: Colab notebook extracts - b

4. Coding the Mathematics in Multihead Attention [Colab Notebook]

Multihead-Attention Colab Notebooks – GitHub: https://github.com/ajaytaneja-learner/transformers-notebooks
