Attention Mechanisms
Shradha Agarwal
SWE ops, Newfold Digital | IIITD MTech CSE (AI) | Devops | AWS Certified SA-Associate
The attention mechanism has significantly improved the performance of models on tasks like machine translation and text summarization. It allows a model to focus on specific parts of the input and thereby capture long-range dependencies more effectively, using the full context of a sentence. Embeddings assign similar vectors to similar words, but for ambiguous terms like ‘bank’ (referring to either a financial institution or the side of a river), the complete context of the sentence is needed to resolve the meaning; attention mechanisms provide exactly this context. This article will delve into two key forms of the attention mechanism: self-attention and multi-head attention.
Self-Attention
Figure 1 is taken from the paper “Attention is All You Need” by Vaswani et al.
We work with three matrices: Q (Query), K (Key), and V (Value). These can be thought of in terms of a dictionary, where Key-Value pairs make up the stored data. To retrieve a Value for a given Query Q, we take the dot product of the Query with all the Keys. The result is a measure of similarity between the Query and each Key: a Key that is highly similar to the Query yields a high attention score, indicating a strong association, while a Key that is dissimilar to the Query yields a low attention score, signifying a weak association. This allows the model to focus on the most relevant information in the data.
The formula presented in Figure 2 is the algebraic form of the diagrammatic representation in Figure 1.
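For reference, the scaled dot-product attention formula from the paper, which Figure 2 reproduces, can be written as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
```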
Each word in a sentence is first transformed into a dense vector through an embedding layer, which captures the semantic meaning of the words. This yields an input matrix of shape (seq, d_model), where 'seq' is the sequence length (the number of words in the sentence) and 'd_model' is the dimension of the embedding vector.
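As a minimal sketch of this embedding step (in PyTorch, with illustrative values for the vocabulary size, sequence length, and d_model that are not from the article):

```python
import torch
import torch.nn as nn

# Illustrative values: vocabulary size, sentence length, embedding dimension
vocab_size, seq, d_model = 10_000, 6, 512

embedding = nn.Embedding(vocab_size, d_model)      # maps each token id to a dense d_model-dim vector
token_ids = torch.randint(0, vocab_size, (seq,))   # stand-in for a tokenized sentence
x = embedding(token_ids)                           # shape: (seq, d_model)
print(x.shape)                                     # torch.Size([6, 512])
```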
The embeddings then undergo a linear transformation to form the Q, K, and V matrices, using three distinct weight matrices that the model learns during training. Specifically: Q = Input · W^Q, K = Input · W^K, and V = Input · W^V, where W^Q, W^K, and W^V are the learned weight matrices for the Query, Key, and Value, respectively.
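One way to express these learned projections, continuing the sketch above (the bias-free nn.Linear layers stand in for W^Q, W^K, and W^V; this is an illustration, not the article's code):

```python
import torch
import torch.nn as nn

seq, d_model = 6, 512
x = torch.randn(seq, d_model)            # embedded input from the previous step

# Learned weight matrices W^Q, W^K, W^V as bias-free linear layers
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(x)   # (seq, d_model)
K = W_k(x)   # (seq, d_model)
V = W_v(x)   # (seq, d_model)
```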
Next, the dot product of Q and the transpose of K (K^T) is computed to assess the similarity between each Query and each Key. The result is scaled by the square root of d_k, the dimension of the key vectors. A softmax function is then applied, converting the scaled scores into a probability distribution; these probabilities indicate how much attention each Value should receive. A weighted sum of the Values is then computed using this distribution. The final vector produced by the attention mechanism is thus a weighted representation of the input, determined by the attention scores.
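Putting these steps together, a minimal sketch of scaled dot-product attention (assuming the shapes used above) might look like:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    # Similarity of every Query with every Key, scaled by sqrt(d_k): (seq, seq)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax turns each row of scores into a probability distribution
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the Values according to those probabilities
    return weights @ V, weights

seq, d_model = 6, 512
Q, K, V = (torch.randn(seq, d_model) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # torch.Size([6, 512]) torch.Size([6, 6])
```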
What's self in self-attention? The term “self” in self-attention signifies that the attention scores are computed within the input sequence itself. It allows the model to focus on its own input, considering the entire sequence to capture dependencies between elements, regardless of their positions in the sequence.
Multi-head Attention
Figure 4 is taken from the paper “Attention is All You Need” by Vaswani et al.
The self-attention mechanism discussed above is a form of single-head attention. In this setup, the attention operation is executed once over the input, and the key dimension (d_k) is equal to the model dimension (d_model).
When we transition to multi-head attention, the process evolves. The model’s embeddings are partitioned into h distinct segments, or “heads”. Each head independently performs the attention operation, allowing the model to capture various types of information from different perspectives in the input data. This enhances the model’s ability to understand complex patterns and dependencies.
In multi-head attention, everything works much as in the self-attention described above, with a few differences: the embedding dimension d_model is split across h heads, so each head operates on vectors of dimension d_k = d_model / h; each head computes scaled dot-product attention independently with its own learned projections; and the h head outputs are concatenated and passed through a final learned projection W^O, as sketched below.
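A minimal sketch of how the split, per-head attention, concatenation, and final projection can fit together (again in PyTorch, with illustrative d_model = 512 and h = 8; this is not the article's code):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the number of heads"
        self.h = h
        self.d_k = d_model // h
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # final output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq, _ = x.shape
        # Project, then split d_model into h heads of size d_k: (h, seq, d_k)
        Q = self.W_q(x).view(seq, self.h, self.d_k).transpose(0, 1)
        K = self.W_k(x).view(seq, self.h, self.d_k).transpose(0, 1)
        V = self.W_v(x).view(seq, self.h, self.d_k).transpose(0, 1)

        # Scaled dot-product attention independently in each head
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)   # (h, seq, seq)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ V                                      # (h, seq, d_k)

        # Concatenate the heads back to (seq, d_model) and apply W^O
        concat = heads.transpose(0, 1).contiguous().view(seq, -1)
        return self.W_o(concat)

mha = MultiHeadAttention(d_model=512, h=8)
x = torch.randn(6, 512)
print(mha(x).shape)   # torch.Size([6, 512])
```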
The Transformer architecture, introduced in the paper “Attention is All You Need” by Vaswani et al., utilizes several multi-head attention network units. This architecture will be the focus of discussion in the next article.