QKV and Multi-Head Attention in LLMs
In the realm of Natural Language Processing (NLP), Large Language Models (LLMs) like GPT-3 and BERT have revolutionized how machines understand and generate human language. At the heart of these models lie QKV and Multi-Head Attention. The idea sounded cryptic to me at the very beginning, and it took me a few weeks to figure it out.
The following is how they are explained in the original paper, "Attention Is All You Need".
Query (Q): a representation of the current word, used to ask "which other words are relevant to me?"
Key (K): a representation of each word that the query is matched against; how well a query matches a key determines the attention weight.
Value (V): the actual content of each word, which is summed up, weighted by those attention weights, to form the output.
Well, it is still a bit hard to digest. Let’s look at an example below.
A simple sentence like "Tom is going to fish at the river bank" is easy for us to understand. To let computers understand it, we need to encode every word into numbers, a process called word embedding. Assuming a simple six-dimensional space, the word "River" can be represented by the embedding [-0.9, 0.9, -0.2, 0.4, 0.2, 0.6]. Words with higher "similarity" will be close to each other. For example, Group 1) River, Fish, and Fisherman. Group 2) Hospital, PostOffice, and Restaurant. It becomes interesting when we try to figure out where to put the word "Bank". It is a polysemous word that can be interpreted differently depending on the context of the sentence it appears in. Should it be closer to Group 1 or Group 2?
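To make "closeness" concrete, here is a minimal NumPy sketch. Only the "river" vector comes from the text above; the other embeddings and the cosine-similarity measure are assumptions added purely for illustration.

```python
import numpy as np

# Hypothetical 6-dimensional embeddings; only "river" uses the vector from
# the text, the rest are invented for illustration.
embeddings = {
    "river": np.array([-0.9, 0.9, -0.2, 0.4, 0.2, 0.6]),
    "fish":  np.array([-0.8, 0.8, -0.1, 0.5, 0.1, 0.5]),
    "money": np.array([ 0.7, -0.6, 0.8, -0.3, 0.4, -0.2]),
}

def cosine_similarity(a, b):
    """Close to 1.0 for vectors pointing the same way, low or negative otherwise."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["river"], embeddings["fish"]))   # high (~0.99)
print(cosine_similarity(embeddings["river"], embeddings["money"]))  # low (negative)
```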
Now, let’s look at the sentence again,
Tom is going to fish at the river bank.
When we read it, we know "bank" cannot be the place where you withdraw money. Why? Well, the presence of the words "River" and "Fish" contributes more to our understanding of the context than the rest of the sentence. Therefore, they should have high attention scores and be closer to "bank".
How does a computer determine that it should pay more attention to "River" and "Fish" and not the others? This is where Q (Query) and K (Key) come in. They are two linear transformations that help answer the question: within this sentence, how similar is each word to every other word?
Firstly, both take the same input embeddings (let's put the positional embedding aside for now), assumed here to be 6-dimensional.
Applying the linear transformations of K and Q to the input embeddings gives us the keys and queries, as in the sketch below.
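A rough sketch of that step, with random matrices as stand-ins for the learned projection weights (an assumption; real models learn these during training). V is obtained with the same kind of projection, so it is included here as well. The nine rows correspond to the nine words of the example sentence.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 9, 6                   # 9 tokens, 6-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))   # input embeddings (stand-in values)

# Learned projection matrices; random stand-ins here.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q   # queries
K = X @ W_k   # keys
V = X @ W_v   # values
print(Q.shape, K.shape, V.shape)          # (9, 6) (9, 6) (9, 6)
```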
The output then goes through the steps of MatMul, Scale, Mask, and SoftMax to get the attention weights, and a final MatMul with V. The result is a weighted sum of the values, where the weights are determined by how well each key matches the query. So the new embedding, compared to the original one, captures more contextual relationships.
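Here is a minimal NumPy sketch of those steps. The function name and the random Q/K/V inputs are my own; the mask handling follows the usual convention of setting masked positions to a large negative number before the softmax.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """MatMul -> Scale -> (optional) Mask -> SoftMax -> MatMul with V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # MatMul + Scale
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # Mask: block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # SoftMax over each row
    return weights @ V, weights                # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(9, 6)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)   # (9, 6) (9, 9)
```

Each row of `attn` tells us, for one word, how much weight it places on every other word in the sentence.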
For example, the word "bank" has its highest attention scores with "bank", "river", and "fish", so the model will focus more on these input words.
Why do we have to go through this complicated QKV transformation?
If we are asked to describe what is in a picture, rather than scanning from the top left corner, pixel by pixel, our brain will immediately focus on the most prominent elements, like a boy in the scene. This process is highly efficient and effective, demonstrating the power of attention.
If you consider one set of QKV linear projections as a so-called attention head, then multi-head attention is simply having multiple sets of QKV and concatenating their outputs. The benefit of multiple heads is that they can capture different aspects of similarity. For instance, one head can focus on nearby nouns, while another might look at the verb-object relation. Back to the picture example above, one "head" might detect the boy, and another the ball.
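A minimal sketch of that idea, again with random stand-in weights; the dimensions (a 12-dimensional model split across 2 heads) and the final output projection `W_o` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 9, 12, 2
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))   # input embeddings (stand-in values)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Each head gets its own Q/K/V projections; the per-head outputs are
# concatenated and mixed back together with an output projection W_o.
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(d_model, d_model))
multi_head_out = np.concatenate(heads, axis=-1) @ W_o
print(multi_head_out.shape)   # (9, 12)
```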
This is an intuitive explanation of QKV and multi-head attention. If you want to know the mathematical part of it, the original paper “Attention Is All You Need” is a good place to start. Have fun!