Demystifying Multihead Attention in the Transformer Neural Network Architecture – With Code
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Introduction
This is a continuation of my series of blogs on Large Language Models. In this article I am going to talk about the multi-head attention mechanism in the Transformer neural network architecture and code out its essential mathematical ingredients in a Colab notebook.
The highlight of this article is the source code (Section 4, in my GitHub) of the multi-head attention block, which I have not covered in any of my blogs on LLMs until now.
This is the 9th blog in my series on Large Language Models; all of the earlier articles can be found on my LinkedIn profile.
2. Multihead Attention: Intuition and Calculation Details
The Transformer model introduced in the paper “Attention Is All You Need” dramatically improved performance on NLP tasks over earlier-generation models such as Recurrent Neural Networks and LSTMs. As explained in my article on the Evolution of Language Models [https://www.dhirubhai.net/pulse/evolution-language-models-my-notes-ajay-taneja/?trackingId=d5WQH95MRp2iA%2B6b8ky0UQ%3D%3D], the power of the Transformer lies in its ability to compute a measure of the relevance/context of all the words in a sentence: the attention matrix expresses how each word in a sentence is related to the rest of the words in that sentence.
How the Transformer works: an overview
The Transformer is split into two distinct parts: the encoder and the decoder. These components work in conjunction with each other and share a number of similarities.
Input Embedding and positional encoding:
To start with, since computers do not understand words and only understand numbers/vectors/matrices, the text must first be tokenized before it is passed into the model. Once the words are tokenized, these token ids are passed to an embedding layer, which is a trainable vector embedding space: a high-dimensional space where each token is represented as a vector and occupies a unique location in that space. These embedding vectors learn to encode the meaning and context of the individual tokens in the sentence.
Looking at a sample sequence, each word is mapped to a token id and each token id is mapped to a vector. In the original Transformer paper the vector size was 512; as shown in the figure below, the embedding vectors corresponding to closely related words are closely spaced in the embedding space.
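As a concrete sketch of this step (assuming PyTorch and purely illustrative values for the vocabulary size and token ids, not the ones used in my notebook), the embedding lookup might look like this:

import torch
import torch.nn as nn

# Illustrative values: a small vocabulary and d_model = 512 as in the original paper.
vocab_size = 10000
d_model = 512

embedding = nn.Embedding(vocab_size, d_model)  # trainable embedding table

# A toy "tokenized" sentence: each word has already been mapped to a token id.
token_ids = torch.tensor([[7, 421, 358, 902, 13]])   # shape: (batch=1, seq_len=5)

token_embeddings = embedding(token_ids)               # shape: (1, 5, 512)
print(token_embeddings.shape)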
We then add the positional encoding, which preserves the information about word order.
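A minimal sketch of the sinusoidal positional encoding from “Attention Is All You Need”, continuing the PyTorch setup above (my notebook may implement this differently):

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) positional encoding matrix from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    position = torch.arange(seq_len).unsqueeze(1)                              # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The positional encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(5, 512)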
Self-Attention:
The embedding vectors and the positional encodings are summed, and the sum is passed to the self-attention layer. By adding the positional encoding, the word-order information is preserved and thus the relevance of each word's position in the sentence is maintained.
In self-attention, the model analyses the relationships between the tokens in the input sentence. We use three separate neural network layers, each with its own set of weights, to generate the query, key and value vectors respectively. These weights are the learnable parameters updated during training. You might refer to my blog on the Evolution of Language Models (hyperlinked earlier in this article) to understand the physical interpretation of the query, key and value vectors.
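A minimal sketch of these three projections and the scaled dot-product attention they feed into (single head, PyTorch, illustrative dimensions):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 512   # embedding dimension, as in the original paper

# Three separate linear layers, each with its own learnable weights,
# produce the query, key and value vectors from the same input.
W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

x = torch.randn(1, 5, d_model)   # (batch, seq_len, d_model): embeddings + positional encoding

Q, K, V = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention (here d_k = d_model because there is a single head):
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # (1, 5, 5) raw attention scores
weights = F.softmax(scores, dim=-1)                      # each row sums to 1
output = weights @ V                                     # contextualised representations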
Multihead self-attention:
The above process does not happen just once: the Transformer architecture uses multi-head attention, which means that multiple sets (or heads) of self-attention weights are learnt in parallel. The number of attention heads varies from model to model, but it is typically on the order of 12 to 100.
Intuition behind multiple attention heads
The intuition behind multi-headed attention in the Transformer is that each head learns a different aspect of language. For example, one head might capture the relationships between people, another might focus on the context of the sentence, another might relate nouns to numerical values, and yet another might pick up rhymes between words (or tokens), and so on.
Which aspect of language each head learns is not decided ahead of the training process: the weights of each attention head are randomly initialised, the network is given sufficient training data, and during training each head learns some aspect of language. In my experience it is generally not easy to interpret which aspect of language a particular head is learning or has learnt.
Multi-head Attention – calculations in a nutshell
The attention scores in the attention matrix lie between 0 and 1. They are obtained by taking the dot product of the Query and Key matrices, scaling the result to reduce its variance, and passing it through a softmax so that each row becomes a probability distribution. In the notation of the original paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
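Putting the pieces together, a compact multi-head attention module might look like the sketch below (PyTorch; the Colab notebook linked in Section 4 contains my full version, so treat this as an illustrative outline rather than the notebook code):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # One projection per role; the heads are obtained by reshaping.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # final output projection

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape

        # Project and split into heads: (batch, num_heads, seq_len, head_dim)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)        # one attention matrix per head
        context = weights @ V                      # (batch, num_heads, seq_len, head_dim)

        # Concatenate the heads back together and apply the output projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.W_o(context)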
The Decoder portion:
Next, to the decoder portion we pass the contextually aware embeddings of the English sentence (the encoder output), along with a start token denoting the start of the sentence, and start generating the French words one after the other. At the nth time step, to obtain the nth French word, we pass the encoder output for the English sentence plus the French words that were generated up to the (n-1)th time step. This is illustrated in the figures below.
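A highly simplified sketch of that autoregressive loop (greedy decoding; the decoder, encoder_output and token names here are illustrative placeholders, not my actual notebook code):

import torch

def greedy_decode(decoder, encoder_output, start_token_id, end_token_id, max_len=50):
    # encoder_output: contextually aware representation of the English sentence.
    # decoder: a callable mapping (generated French ids, encoder_output) to next-token logits.
    generated = [start_token_id]
    for _ in range(max_len):
        tgt = torch.tensor([generated])              # French tokens generated up to step n-1
        logits = decoder(tgt, encoder_output)        # (1, len(generated), vocab_size)
        next_id = int(logits[0, -1].argmax())        # most probable next French token
        generated.append(next_id)
        if next_id == end_token_id:                  # stop at the end-of-sentence token
            break
    return generated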
3. Code Flow of the Overall Multihead Attention Mechanism
4. Coding the Mathematics in Multihead Attention [Colab Notebook]
Multihead-Attention Colab Notebooks – GitHub: https://github.com/ajaytaneja-learner/transformers-notebooks