Transformer Architecture

The Transformer is a groundbreaking model architecture introduced in the seminal paper “Attention is All You Need” by Vaswani et al. It revolutionized the field of natural language processing (NLP) and has since been the foundation for many state-of-the-art models, including BERT and GPT.

It consists of an encoder and a decoder, each composed of multiple identical layers. Each layer uses multi-head self-attention and position-wise fully connected feed-forward networks, with residual connections and layer normalization applied at each step. A unique feature of the Transformer is its use of positional encodings to inject information about the position of words in the sequence, as the model itself doesn’t have any inherent sense of order. This allows it to process all words in the sequence in parallel, leading to efficient training.

Figure 1. Training Process of Transformer

Key Components of Transformer Architecture

Refer to my previous article to learn about self-attention and multi-head attention mechanisms.

Masked multi-head attention

The objective is to make the model causal, meaning that the output at a given position should depend only on the preceding words; the model must not be exposed to future words. To achieve this, the upper triangle of the attention score matrix, which is passed through a softmax during the attention calculation, is set to negative infinity. The softmax then drives the values at these positions towards zero, since the exponential of negative infinity (e^(-inf)) approaches zero.
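
A minimal sketch of this masking step, assuming PyTorch and made-up tensor sizes (not the exact implementation from the paper):

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration only.
seq, d_k = 4, 8
scores = torch.randn(seq, seq) / d_k ** 0.5          # raw attention scores Q·K^T / sqrt(d_k)

# Boolean mask that is True above the diagonal, i.e. at future positions.
future = torch.triu(torch.ones(seq, seq), diagonal=1).bool()

# Set future positions to -inf so the softmax drives their weights to zero.
masked_scores = scores.masked_fill(future, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)  # each row attends only to positions up to and including its own index
```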

Input embedding

Each word in the input sentence is mapped to a unique vector in a high-dimensional space, known as an embedding. This process transforms the discrete words into continuous vectors that capture semantic meanings and relationships among words.

The dimension of the input embedding is denoted as (seq, d_model), where ‘seq’ represents the sequence length or the number of words, and ‘d_model’ is the size of the embedding. This transformation allows the model to process the input in subsequent layers effectively.
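
As a minimal sketch of this lookup, assuming PyTorch and hypothetical sizes for the vocabulary, embedding dimension, and sequence length:

```python
import torch
import torch.nn as nn

# Hypothetical values for illustration.
vocab_size, d_model, seq = 10_000, 512, 6

embedding = nn.Embedding(vocab_size, d_model)

# Token ids for one sentence of length `seq`.
token_ids = torch.randint(0, vocab_size, (seq,))
x = embedding(token_ids)   # shape: (seq, d_model)
print(x.shape)             # torch.Size([6, 512])
```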

Position Embedding

Figure 2. Formulas for calculating position embedding

It is used to capture the order of words in a sentence. It assigns a unique vector to each position in the sequence, enabling the model to recognize patterns based on word positions. This is crucial as Transformers, unlike RNNs, do not inherently understand the sequential nature of the data.

The computation of position embeddings is facilitated by the formulas provided in Figure 2. These embeddings are calculated once and subsequently reused across all sentences. The dimension of the position embedding is denoted as (seq, d_model).

For each word position, a Positional Encoding (PE) value is computed for every dimension of the embedding. For instance, PE(0,0), PE(0,1), PE(0,2), …, PE(0,d_model-1) are computed for the first position. The first formula (sine) is applied at even dimensions, while the second formula (cosine) is applied at odd dimensions.
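
A minimal sketch of these sinusoidal encodings, assuming the standard formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) referenced in Figure 2, with made-up sizes:

```python
import math
import torch

def positional_encoding(seq: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(seq, d_model)
    position = torch.arange(seq, dtype=torch.float).unsqueeze(1)   # (seq, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-math.log(10000.0) / d_model)
    )                                                              # (d_model/2,)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe                                      # shape: (seq, d_model)

pe = positional_encoding(seq=6, d_model=512)
print(pe.shape)  # torch.Size([6, 512]); computed once and reused for every sentence
```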

Add and norm

These operations collectively help mitigate the problem of vanishing gradients and improve the learning process.

Add: This is the residual connection or shortcut path that bypasses the sub-layers (like multi-head attention and feed-forward neural network), allowing the input of the sub-layer to be added to its output.

Norm: After the addition, layer normalization is performed. It standardizes the features of the output independently at each position (i.e., for each word), enhancing the model’s stability and performance.
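
A minimal sketch of an Add & Norm step, assuming PyTorch and a stand-in linear layer in place of the real sub-layer (multi-head attention or the feed-forward network):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
d_model, seq = 512, 6

layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for any sub-layer

x = torch.randn(seq, d_model)            # input to the sub-layer
out = layer_norm(x + sublayer(x))        # residual connection, then layer normalization
print(out.shape)                         # torch.Size([6, 512])
```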

Feed Forward Layer

It is a position-wise fully connected network applied independently at each position: two linear transformations with a ReLU activation in between, expanding from d_model to an inner dimension d_ff and projecting back to d_model.
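
A minimal sketch, assuming PyTorch and the sizes used in the original paper (d_model = 512, d_ff = 2048):

```python
import torch
import torch.nn as nn

d_model, d_ff, seq = 512, 2048, 6

# Two linear layers with a ReLU in between, applied to every position independently.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(seq, d_model)
print(ffn(x).shape)   # torch.Size([6, 512]); same shape in and out
```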

Training Process of Transformer

The base image of the Transformer architecture in Figure 1 is taken from "Attention Is All You Need" by Vaswani et al.

Unlike Recurrent Neural Networks (RNNs), where each word is processed at its own timestep, the Transformer processes the entire training sequence in a single timestep, i.e., one forward pass.

The input sentence on the encoder side is prepended with <SOS> and appended with <EOS>. These are special tokens that mark the start and end of the sentence: Start Of Sentence and End Of Sentence.

Input embedding is generated for each token of the input. This is added with the position encoding. The dimension of the input is (seq, d_model). This is then fed to the encoder block.

Within the encoder, the multi-head attention mechanism operates as a form of self-attention, given that the Key (K), Query (Q), and Value (V) all originate from the same sentence. The encoder’s output, of dimension (seq, d_model), encapsulates the semantics of the word, its position, and its relation to other words within the same sentence. This output subsequently serves as K and V for the decoder’s multi-head attention.

The decoder input sentence is prefixed with an <SOS> token, indicating the commencement of the sentence. For each token in the sentence, an input embedding is generated and combined with the position embedding. The resulting dimension is (seq, d_model). This is replicated thrice and supplied to the masked multi-head attention mechanism, which operates as self-attention since it utilizes K, Q, and V from the decoder input sentence exclusively.

The multi-head attention mechanism within the decoder functions as cross-attention, differing from self-attention in that the query Q is derived from the decoder input, while K and V are sourced from the encoder output. The decoder’s output is of dimension (seq, d_model). It is processed through a linear layer to map the generated embeddings to corresponding words. This transformation alters the decoder output from (seq, d_model) to (seq, vocab_size), where ‘vocab_size’ represents the size of the vocabulary. Following the application of a softmax function, words from the vocabulary can be selected based on the highest probability for each row/token in (seq, vocab_size).
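
A minimal sketch of this final projection and selection step, assuming PyTorch and hypothetical sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration.
seq, d_model, vocab_size = 6, 512, 10_000

decoder_output = torch.randn(seq, d_model)   # (seq, d_model) from the decoder stack
projection = nn.Linear(d_model, vocab_size)

logits = projection(decoder_output)          # (seq, vocab_size)
probs = F.softmax(logits, dim=-1)
predicted_ids = probs.argmax(dim=-1)         # highest-probability token for each row
print(logits.shape, predicted_ids.shape)     # torch.Size([6, 10000]) torch.Size([6])
```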

Inference Process in Transformer

The Spanish translation is not available during the inference phase, so inference requires ‘T’ timesteps, in contrast to the single timestep needed for training. At the first timestep, we pass <SOS> as the decoder input and the decoder outputs the first translated word.

At each subsequent timestep, the word generated at the previous timestep is appended to the decoder input. After the softmax is applied, the word with the highest probability at the last position is selected as the next output. This iterative process continues until the decoder outputs the <EOS> token, signifying the completion of the translation.
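
A minimal sketch of this greedy decoding loop, assuming a hypothetical `model` object with `encode` and `decode` methods and made-up token ids (SOS_ID, EOS_ID); this is not a specific library API:

```python
import torch

SOS_ID, EOS_ID, MAX_LEN = 1, 2, 50   # hypothetical special-token ids and length limit

def greedy_translate(model, src_ids: torch.Tensor) -> list[int]:
    memory = model.encode(src_ids)                 # encoder output: K and V for cross-attention
    out_ids = [SOS_ID]
    for _ in range(MAX_LEN):
        tgt = torch.tensor(out_ids).unsqueeze(0)   # current decoder input: (1, t)
        logits = model.decode(tgt, memory)         # (1, t, vocab_size)
        next_id = int(logits[0, -1].argmax())      # pick the highest probability at the last position
        out_ids.append(next_id)
        if next_id == EOS_ID:                      # stop once <EOS> is generated
            break
    return out_ids
```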
