Attention Mechanisms

The attention mechanism has significantly improved model performance on tasks such as machine translation and text summarization. It lets a model focus on specific parts of the input, capturing long-range dependencies more effectively and making use of the full context of a sentence. Word embeddings alone assign similar vectors to similar words, but an ambiguous term like ‘bank’ (a financial institution or the side of a river) can only be resolved from the surrounding sentence, and attention mechanisms provide exactly this kind of contextual understanding. This article will delve into two key forms of the attention mechanism: self-attention and multi-head attention.

Self-Attention

Figure 1. Scaled Dot-product Attention

Figure 1 is taken from the paper “Attention is All You Need” by Vaswani et al.

We work with three matrices: Q (Query), K (Key), and V (Value). These can be thought of in terms of a dictionary, where the Key and Value pairs make up the stored data and a Query is used to look values up. To retrieve a Value for a given Query, we take the dot product of the Query with every Key. The result is a measure of similarity between the Query and each Key: a Key that is highly similar to the Query yields a high attention score, indicating a strong association, while a dissimilar Key yields a low attention score, signifying a weak association. This lets the model focus on the most relevant information in the data.
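As a rough illustration of this lookup intuition, here is a minimal sketch with hand-picked toy vectors (the numbers are arbitrary and only meant to show how the dot product turns similarity into attention weights):

```python
import torch

# One query vector and three key vectors (toy numbers chosen by hand).
query = torch.tensor([1.0, 0.0, 1.0])
keys = torch.tensor([
    [1.0, 0.0, 1.0],    # very similar to the query
    [0.5, 0.5, 0.0],    # somewhat similar
    [-1.0, 0.0, -1.0],  # dissimilar
])

# Dot product of the query with every key gives raw similarity scores.
scores = keys @ query                    # shape: (3,)
weights = torch.softmax(scores, dim=0)   # normalize into a probability distribution
print(scores)   # higher score -> stronger association
print(weights)  # most of the weight goes to the first (most similar) key
```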

Figure 2. Formula for self-attention

The formula presented in Figure 2, Attention(Q, K, V) = softmax(QK^T / √d_k) · V, expresses the same computation as the diagram in Figure 1.

Figure 3. Self-Attention Detailed Diagram

Each word in a sentence is first transformed into a dense vector by an embedding layer, which captures the semantic meaning of the word. This yields an input of shape (seq, d_model), where 'seq' is the sequence length (the number of words in the sentence) and 'd_model' is the dimension of the embedding vector.

The embeddings then undergo a linear transformation to form the Q, K, and V matrices, using three distinct weight matrices that the model learns during training. Specifically: Q = Input · W^Q, K = Input · W^K, and V = Input · W^V, where W^Q, W^K, and W^V are the learned weight matrices for the Query, Key, and Value, respectively.
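A minimal sketch of these projections in PyTorch (the sizes seq = 4 and d_model = 8 are arbitrary choices, and the random input stands in for the output of the embedding layer):

```python
import torch
import torch.nn as nn

seq, d_model = 4, 8               # 4 tokens, embedding dimension 8 (arbitrary choices)
x = torch.randn(seq, d_model)     # stand-in for the (seq, d_model) embedded input

# Three learned weight matrices W^Q, W^K, W^V, here as linear layers without bias.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

Q = w_q(x)   # Q = Input · W^Q, shape (seq, d_model)
K = w_k(x)   # K = Input · W^K, shape (seq, d_model)
V = w_v(x)   # V = Input · W^V, shape (seq, d_model)
```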

Next, the dot product of Q and the transpose of K (K^T) is computed to assess the similarity between each Query and each Key. The result is scaled by the square root of d_k, the Key dimension, which keeps the scores from growing too large as the dimension increases. A softmax function is then applied, converting each row of scores into a probability distribution over the Keys; these probabilities indicate how much attention each Value should receive. Finally, a weighted sum of the Values is computed using this distribution, so the vector produced by the attention mechanism is a weighted representation of the input, determined by the attention scores.
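Putting these steps together, a minimal scaled dot-product attention function might look like the following (the tensor shapes are arbitrary toy values, and batching and masking are omitted for brevity):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq, seq) similarity matrix
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                                  # weighted sum of value vectors

# Toy usage with random tensors of shape (seq, d_model) = (4, 8).
Q = torch.randn(4, 8)
K = torch.randn(4, 8)
V = torch.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```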

What's self in self-attention? The term “self” in self-attention signifies that the attention scores are computed within the input sequence itself. It allows the model to focus on its own input, considering the entire sequence to capture dependencies between elements, regardless of their positions in the sequence.

Multi-head Attention

Figure 4. Multi-head Attention

Figure 4 is taken from the paper “Attention is All You Need” by Vaswani et al.

The self-attention mechanism we discussed above is a form of single-head attention. In this setup, the attention operation is executed once on the input, and the dimensionality of the model (d_model) is equivalent to the key dimension (d_k).

When we transition to multi-head attention, the process evolves. The model’s embeddings are partitioned into h distinct segments, or “heads”. Each head independently performs the attention operation, allowing the model to capture various types of information from different perspectives in the input data. This enhances the model’s ability to understand complex patterns and dependencies.

Figure 5. Formula for Multi-head Attention

Here, everything works as in self-attention above, except for a few points (a short code sketch follows the list):

  • In multi-head attention, the model’s embeddings are partitioned into h distinct segments, or “heads”, and each head independently performs the attention operation, allowing the model to capture different types of information from different perspectives in the input data.
  • Each head i uses its own weight matrices W_i^Q, W_i^K, and W_i^V for the Query, Key, and Value respectively. These are learned during training.
  • The outputs of all heads are concatenated and then linearly transformed using another learned weight matrix W^O.
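A compact sketch of multi-head attention under these assumptions: d_model = 8 is split across h = 2 heads so that d_k = d_model / h = 4, one shared projection per role is split per head (which is equivalent to keeping separate W_i^Q, W_i^K, W_i^V matrices), and batching and masking are again omitted:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the number of heads"
        self.h, self.d_k = h, d_model // h
        # One projection per role; splitting its output per head is equivalent
        # to keeping separate W_i^Q, W_i^K, W_i^V matrices for each head.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # final W^O

    def forward(self, x):                  # x: (seq, d_model)
        seq = x.size(0)

        # Project, then split the last dimension into h heads of size d_k.
        def split(t):
            return t.view(seq, self.h, self.d_k).transpose(0, 1)   # (h, seq, d_k)

        Q, K, V = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention, computed independently per head.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)     # (h, seq, seq)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ V                                        # (h, seq, d_k)

        # Concatenate the heads back to (seq, d_model) and apply W^O.
        concat = heads.transpose(0, 1).reshape(seq, self.h * self.d_k)
        return self.w_o(concat)

# Toy usage: sequence of 4 tokens, d_model = 8, h = 2 heads.
mha = MultiHeadAttention(d_model=8, h=2)
out = mha(torch.randn(4, 8))   # shape (4, 8)
```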

The Transformer architecture, introduced in the paper “Attention is All You Need” by Vaswani et al., uses multiple multi-head attention units. That architecture will be the focus of the next article.
