Transformer Architecture
Shradha Agarwal
SWE Ops, Newfold Digital | IIITD MTech CSE (AI) | DevOps | AWS Certified SA-Associate
The Transformer is a groundbreaking model architecture introduced in the seminal paper “Attention is All You Need” by Vaswani et al. It revolutionized the field of natural language processing (NLP) and has since been the foundation for many state-of-the-art models, including BERT and GPT.
It consists of an encoder and a decoder, each composed of multiple identical layers. Each layer uses multi-head self-attention and position-wise fully connected feed-forward networks, with residual connections and layer normalization applied at each step. A distinctive feature of the Transformer is its use of positional encodings to inject information about the position of words in the sequence, since the model itself has no inherent sense of order. Because there is no recurrence, the model can process all words in the sequence in parallel, leading to efficient training.
Key Components of Transformer Architecture
Refer to my previous article to learn about the self-attention and multi-head attention mechanisms.
Masked multi-head attention
The objective is to make the model causal, meaning that the output at a given position should depend only on the preceding words. In essence, the model must be prevented from seeing future words. To achieve this, we set the upper triangle of the attention-score matrix, which is passed through a softmax during the attention calculation, to negative infinity. The softmax then drives the values at these positions to zero, since the exponential of negative infinity (e^(-inf)) approaches zero.
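As a minimal sketch of this masking step in PyTorch (a single (seq, seq) score matrix is assumed here; in practice the mask is applied to every head):

```python
import torch

def causal_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """Turn raw attention scores of shape (seq, seq) into causal attention weights."""
    seq = scores.size(-1)
    # Lower-triangular mask: True where attending to a position is allowed.
    allowed = torch.tril(torch.ones(seq, seq)).bool()
    # Set the upper triangle (future positions) to -inf before the softmax.
    masked_scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(masked_scores, dim=-1)

# Example: a 4-token sequence with random scores.
weights = causal_attention_weights(torch.randn(4, 4))
print(weights)  # each row sums to 1; entries above the diagonal are 0
```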
Input embedding
Each word in the input sentence is mapped to a unique vector in a high-dimensional space, known as an embedding. This process transforms the discrete words into continuous vectors that capture semantic meanings and relationships among words.
The dimension of the input embedding is denoted as (seq, d_model), where ‘seq’ represents the sequence length or the number of words, and ‘d_model’ is the size of the embedding. This transformation allows the model to process the input in subsequent layers effectively.
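For illustration, a sketch using PyTorch's nn.Embedding (the vocabulary size and d_model below are arbitrary):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512            # illustrative values
embedding = nn.Embedding(vocab_size, d_model)

# A toy "sentence" of 6 token ids -> continuous vectors of shape (seq, d_model).
token_ids = torch.tensor([5, 42, 7, 891, 3, 2])
x = embedding(token_ids)
print(x.shape)                               # torch.Size([6, 512])
```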
Position Embedding
It is used to capture the order of words in a sentence. It assigns a unique vector to each position in the sequence, enabling the model to recognize patterns based on word positions. This is crucial as Transformers, unlike RNNs, do not inherently understand the sequential nature of the data.
The position embeddings are computed with the formulas shown in Figure 2: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) for even dimensions and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) for odd dimensions. These embeddings are calculated once and reused for all sentences. The dimension of the position embedding is (seq, d_model).
For each position in a word's embedding, a Positional Encoding (PE) value is calculated. For instance, PE(0,0), PE(0,1), PE(0,2), …, PE(0, d_model-1) would be computed for the first word's embedding. The sine formula is applied at even locations, while the cosine formula is used at odd locations.
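A sketch of this computation, following the (seq, d_model) convention used above (d_model is assumed to be even):

```python
import torch

def positional_encoding(seq: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings of shape (seq, d_model)."""
    pos = torch.arange(seq, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices
    angle = pos / (10000.0 ** (i / d_model))                    # (seq, d_model/2)
    pe = torch.zeros(seq, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even locations use sine
    pe[:, 1::2] = torch.cos(angle)   # odd locations use cosine
    return pe

pe = positional_encoding(seq=6, d_model=512)
print(pe.shape)  # torch.Size([6, 512]); computed once and reused for every sentence
```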
Add and norm
Together, these operations help mitigate the vanishing-gradient problem and improve the learning process.
Add: This is the residual connection or shortcut path that bypasses the sub-layers (like multi-head attention and feed-forward neural network), allowing the input of the sub-layer to be added to its output.
Norm: After the addition, layer normalization is performed. It standardizes the features of the output for each position (i.e., for each word, across the embedding dimension), enhancing the model’s stability and performance.
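A minimal sketch of the pattern LayerNorm(x + Sublayer(x)), where sublayer_output stands for the output of either the multi-head attention or the feed-forward sub-layer:

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer_output: torch.Tensor) -> torch.Tensor:
        # "Add": the shortcut path lets the sub-layer input bypass the sub-layer.
        # "Norm": standardize the features of each position.
        return self.norm(x + sublayer_output)

add_norm = AddAndNorm(d_model=512)
x = torch.randn(6, 512)                        # (seq, d_model)
print(add_norm(x, torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```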
Feed Forward Layer
It is a position-wise fully connected network applied to each position independently: two linear transformations with a ReLU activation in between, expanding from d_model to an inner dimension d_ff and projecting back to d_model.
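A sketch of this sub-layer (d_ff = 2048 is the inner dimension used in the paper):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network: FFN(x) = max(0, x·W1 + b1)·W2 + b2."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied to each position independently; shape stays (seq, d_model).
        return self.net(x)

ffn = FeedForward()
print(ffn(torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```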
Training Process of Transformer
The base diagram of the Transformer architecture in Figure 1 is taken from “Attention Is All You Need” by Vaswani et al.
Unlike Recurrent Neural Networks (RNNs), where each word corresponds to a separate timestep, the Transformer processes the entire training sequence in a single step.
The encoder input is prepended with <SOS> and appended with <EOS>. These special tokens mark the start and end of the sentence: Start Of Sentence and End Of Sentence.
An input embedding is generated for each token of the input and added to the positional encoding. The resulting input has dimension (seq, d_model) and is fed to the encoder block.
Within the encoder, the multi-head attention mechanism operates as a form of self-attention, given that the Key (K), Query (Q), and Value (V) all originate from the same sentence. The encoder’s output, of dimension (seq, d_model), encapsulates the semantics of the word, its position, and its relation to other words within the same sentence. This output subsequently serves as K and V for the decoder’s multi-head attention.
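For illustration, the self-attention call inside the encoder can be sketched with PyTorch's nn.MultiheadAttention (a batch dimension is added only because the module expects one):

```python
import torch
import torch.nn as nn

d_model, num_heads, seq = 512, 8, 6
self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, seq, d_model)        # (batch, seq, d_model): the encoder input
# Self-attention: Q, K and V all come from the same sentence.
enc_out, _ = self_attn(query=x, key=x, value=x)
print(enc_out.shape)  # torch.Size([1, 6, 512]); later reused as K and V in the decoder
```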
The decoder input sentence is prefixed with an <SOS> token, indicating the start of the sentence. For each token in the sentence, an input embedding is generated and combined with the position embedding, giving a tensor of dimension (seq, d_model). This is used three times, as K, Q, and V, by the masked multi-head attention mechanism, which operates as self-attention since K, Q, and V all come from the decoder input sentence.
The multi-head attention mechanism within the decoder functions as cross-attention, differing from self-attention in that the query Q is derived from the decoder input, while K and V are sourced from the encoder output. The decoder’s output is of dimension (seq, d_model). It is processed through a linear layer to map the generated embeddings to corresponding words. This transformation alters the decoder output from (seq, d_model) to (seq, vocab_size), where ‘vocab_size’ represents the size of the vocabulary. Following the application of a softmax function, words from the vocabulary can be selected based on the highest probability for each row/token in (seq, vocab_size).
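A sketch of the final projection step (vocab_size below is arbitrary):

```python
import torch
import torch.nn as nn

seq, d_model, vocab_size = 6, 512, 10_000     # illustrative sizes
decoder_output = torch.randn(seq, d_model)

# The linear layer maps each position's embedding to scores over the vocabulary.
projection = nn.Linear(d_model, vocab_size)
logits = projection(decoder_output)           # (seq, vocab_size)
probs = torch.softmax(logits, dim=-1)         # one distribution per row/token
predicted_ids = probs.argmax(dim=-1)          # highest-probability word for each position
print(logits.shape, predicted_ids.shape)      # torch.Size([6, 10000]) torch.Size([6])
```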
Inference Process in Transformer
The Spanish translation (the target sentence) is not available during the inference phase. We therefore need ‘T’ timesteps for inference, in contrast to the single timestep required for training. At the first timestep, we pass <SOS> as the decoder input and the decoder output gives the first translated word.
At each subsequent timestep, the word generated at the previous timestep is appended to the decoder input. After the softmax, the word corresponding to the last position is selected based on the highest probability. This iterative process continues until the decoder outputs the <EOS> token, signifying that the translation is complete.
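A sketch of this greedy decoding loop, assuming a hypothetical model(src_ids, tgt_ids) that returns one row of logits per decoder position; the <SOS>/<EOS> token ids are illustrative:

```python
import torch

def greedy_decode(model, src_ids, sos_id=1, eos_id=2, max_len=50):
    """Autoregressive inference: generate one token per timestep until <EOS>."""
    tgt_ids = [sos_id]                                      # timestep 0: only <SOS>
    for _ in range(max_len):
        logits = model(src_ids, torch.tensor(tgt_ids))      # (len(tgt_ids), vocab_size)
        next_id = int(logits[-1].softmax(dim=-1).argmax())  # pick the word for the last position
        tgt_ids.append(next_id)                             # feed it back in at the next timestep
        if next_id == eos_id:                               # <EOS> signals the translation is done
            break
    return tgt_ids
```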