Attention Mechanisms

The attention mechanism has significantly improved model performance on tasks such as machine translation and text summarization. It lets a model focus on specific parts of the input, capturing long-range dependencies more effectively and making use of the full context of a sentence. Word embeddings alone assign similar vectors to similar words, but an ambiguous term like ‘bank’ (a financial institution or the side of a river) can only be resolved from the surrounding sentence, and attention mechanisms provide exactly this kind of contextual understanding. This article will delve into two key forms of the attention mechanism: self-attention and multi-head attention.

Self-Attention

Figure 1. Scaled Dot-product Attention

Figure 1 is taken from the paper “Attention is All You Need” by Vaswani et al.

We work with three matrices: Q (Query), K (Key), and V (Value). These can be thought of in terms of a dictionary, where the Key and Value pairs make up the stored data and a Query is used to look values up. To retrieve a Value for a given Query, we take the dot product of the Query with every Key. The result is a measure of similarity between the Query and each Key: a Key that is highly similar to the Query yields a high attention score, indicating a strong association, while a dissimilar Key yields a low attention score, signifying a weak association. This lets the model focus on the most relevant information in the data.
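As a rough illustration of this lookup intuition, here is a minimal sketch with hand-picked toy vectors (the numbers are arbitrary and only meant to show how the dot product turns similarity into attention weights):

```python
import torch

# One query vector and three key vectors (toy numbers chosen by hand).
query = torch.tensor([1.0, 0.0, 1.0])
keys = torch.tensor([
    [1.0, 0.0, 1.0],    # very similar to the query
    [0.5, 0.5, 0.0],    # somewhat similar
    [-1.0, 0.0, -1.0],  # dissimilar
])

# Dot product of the query with every key gives raw similarity scores.
scores = keys @ query                    # shape: (3,)
weights = torch.softmax(scores, dim=0)   # normalize into a probability distribution
print(scores)   # higher score -> stronger association
print(weights)  # most of the weight goes to the first (most similar) key
```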

Figure 2. Formula for self-attention

The formula presented in Figure 2, Attention(Q, K, V) = softmax(QK^T / √d_k) · V, expresses the same computation as the diagram in Figure 1.

Figure 3. Self-Attention Detailed Diagram

Each word in a sentence is first transformed into a dense vector by an embedding layer, which captures the semantic meaning of the word. This yields an input of shape (seq, d_model), where 'seq' is the sequence length (the number of words in the sentence) and 'd_model' is the dimension of the embedding vector.

The embeddings then undergo a linear transformation to form the Q, K, and V matrices, using three distinct weight matrices that the model learns during training. Specifically: Q = Input · W^Q, K = Input · W^K, and V = Input · W^V, where W^Q, W^K, and W^V are the learned weight matrices for the Query, Key, and Value, respectively.
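A minimal sketch of these projections in PyTorch (the sizes seq = 4 and d_model = 8 are arbitrary choices, and the random input stands in for the output of the embedding layer):

```python
import torch
import torch.nn as nn

seq, d_model = 4, 8               # 4 tokens, embedding dimension 8 (arbitrary choices)
x = torch.randn(seq, d_model)     # stand-in for the (seq, d_model) embedded input

# Three learned weight matrices W^Q, W^K, W^V, here as linear layers without bias.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

Q = w_q(x)   # Q = Input · W^Q, shape (seq, d_model)
K = w_k(x)   # K = Input · W^K, shape (seq, d_model)
V = w_v(x)   # V = Input · W^V, shape (seq, d_model)
```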

Next, the dot product of Q and the transpose of K (K^T) is computed to assess the similarity between each Query and each Key. The result is scaled by the square root of d_k, the Key dimension, which keeps the scores from growing too large as the dimension increases. A softmax function is then applied, converting each row of scores into a probability distribution over the Keys; these probabilities indicate how much attention each Value should receive. Finally, a weighted sum of the Values is computed using this distribution, so the vector produced by the attention mechanism is a weighted representation of the input, determined by the attention scores.
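Putting these steps together, a minimal scaled dot-product attention function might look like the following (the tensor shapes are arbitrary toy values, and batching and masking are omitted for brevity):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq, seq) similarity matrix
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                                  # weighted sum of value vectors

# Toy usage with random tensors of shape (seq, d_model) = (4, 8).
Q = torch.randn(4, 8)
K = torch.randn(4, 8)
V = torch.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```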

What's self in self-attention? The term “self” in self-attention signifies that the attention scores are computed within the input sequence itself. It allows the model to focus on its own input, considering the entire sequence to capture dependencies between elements, regardless of their positions in the sequence.

Multi-head Attention

Figure 4. Multi-head Attention

Figure 4 is taken from the paper “Attention is All You Need” by Vaswani et al.

The self-attention mechanism we discussed above is a form of single-head attention. In this setup, the attention operation is executed once on the input, and the dimensionality of the model (d_model) is equivalent to the key dimension (d_k).

When we transition to multi-head attention, the process evolves. The model’s embeddings are partitioned into h distinct segments, or “heads”. Each head independently performs the attention operation, allowing the model to capture various types of information from different perspectives in the input data. This enhances the model’s ability to understand complex patterns and dependencies.

Figure 5. Formula for Multi-head Attention

Here, everything works as in self-attention above, except for a few points (a short code sketch follows the list):

  • In multi-head attention, the model’s embeddings are partitioned into h distinct segments, or “heads”, and each head independently performs the attention operation, allowing the model to capture different types of information from different perspectives in the input data.
  • Each head i uses its own weight matrices W_i^Q, W_i^K, and W_i^V for the Query, Key, and Value respectively. These are learned during training.
  • The outputs of all heads are concatenated and then linearly transformed using another learned weight matrix W^O.
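A compact sketch of multi-head attention under these assumptions: d_model = 8 is split across h = 2 heads so that d_k = d_model / h = 4, one shared projection per role is split per head (which is equivalent to keeping separate W_i^Q, W_i^K, W_i^V matrices), and batching and masking are again omitted:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the number of heads"
        self.h, self.d_k = h, d_model // h
        # One projection per role; splitting its output per head is equivalent
        # to keeping separate W_i^Q, W_i^K, W_i^V matrices for each head.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # final W^O

    def forward(self, x):                  # x: (seq, d_model)
        seq = x.size(0)

        # Project, then split the last dimension into h heads of size d_k.
        def split(t):
            return t.view(seq, self.h, self.d_k).transpose(0, 1)   # (h, seq, d_k)

        Q, K, V = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention, computed independently per head.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)     # (h, seq, seq)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ V                                        # (h, seq, d_k)

        # Concatenate the heads back to (seq, d_model) and apply W^O.
        concat = heads.transpose(0, 1).reshape(seq, self.h * self.d_k)
        return self.w_o(concat)

# Toy usage: sequence of 4 tokens, d_model = 8, h = 2 heads.
mha = MultiHeadAttention(d_model=8, h=2)
out = mha(torch.randn(4, 8))   # shape (4, 8)
```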

The Transformer architecture, introduced in the paper “Attention is All You Need” by Vaswani et al., uses multiple multi-head attention units. That architecture will be the focus of the next article.
