Transformer Architecture
Shradha Agarwal
SWE Ops, Newfold Digital | IIITD MTech CSE (AI) | DevOps | AWS Certified SA-Associate
The Transformer is a groundbreaking model architecture introduced in the seminal paper “Attention is All You Need” by Vaswani et al. It revolutionized the field of natural language processing (NLP) and has since been the foundation for many state-of-the-art models, including BERT and GPT.
It consists of an encoder and a decoder, each composed of multiple identical layers. Each layer uses multi-head self-attention and position-wise fully connected feed-forward networks, with residual connections and layer normalization applied at each step. A distinctive feature of the Transformer is its use of positional encodings to inject information about the position of words in the sequence, since the model itself has no inherent sense of order. Because there is no recurrence, the model can process all words in the sequence in parallel, leading to efficient training.
Key Components of Transformer Architecture
Refer to my previous article to learn about the self-attention and multi-head attention mechanisms.
Masked multi-head attention
The objective is to make the model causal, meaning that the output at a given position should depend only on the preceding words. In essence, the model must be prevented from seeing future words. To achieve this, we set the upper triangle of the attention-score matrix, which is passed through a softmax during the attention calculation, to negative infinity. The softmax then drives the values at these positions to zero, since the exponential of negative infinity (e^(-inf)) approaches zero.
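As a minimal sketch of this masking step in PyTorch (a single (seq, seq) score matrix is assumed here; in practice the mask is applied to every head):

```python
import torch

def causal_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """Turn raw attention scores of shape (seq, seq) into causal attention weights."""
    seq = scores.size(-1)
    # Lower-triangular mask: True where attending to a position is allowed.
    allowed = torch.tril(torch.ones(seq, seq)).bool()
    # Set the upper triangle (future positions) to -inf before the softmax.
    masked_scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(masked_scores, dim=-1)

# Example: a 4-token sequence with random scores.
weights = causal_attention_weights(torch.randn(4, 4))
print(weights)  # each row sums to 1; entries above the diagonal are 0
```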
Input embedding
Each word in the input sentence is mapped to a unique vector in a high-dimensional space, known as an embedding. This process transforms the discrete words into continuous vectors that capture semantic meanings and relationships among words.
The dimension of the input embedding is denoted as (seq, d_model), where ‘seq’ represents the sequence length or the number of words, and ‘d_model’ is the size of the embedding. This transformation allows the model to process the input in subsequent layers effectively.
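For illustration, a sketch using PyTorch's nn.Embedding (the vocabulary size and d_model below are arbitrary):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512            # illustrative values
embedding = nn.Embedding(vocab_size, d_model)

# A toy "sentence" of 6 token ids -> continuous vectors of shape (seq, d_model).
token_ids = torch.tensor([5, 42, 7, 891, 3, 2])
x = embedding(token_ids)
print(x.shape)                               # torch.Size([6, 512])
```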
Position Embedding
It is used to capture the order of words in a sentence. It assigns a unique vector to each position in the sequence, enabling the model to recognize patterns based on word positions. This is crucial as Transformers, unlike RNNs, do not inherently understand the sequential nature of the data.
The position embeddings are computed with the formulas shown in Figure 2: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) for even dimensions and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) for odd dimensions. These embeddings are calculated once and reused for all sentences. The dimension of the position embedding is (seq, d_model).
For each position in a word's embedding, a Positional Encoding (PE) value is calculated. For instance, PE(0,0), PE(0,1), PE(0,2), …, PE(0, d_model-1) would be computed for the first word's embedding. The sine formula is applied at even locations, while the cosine formula is used at odd locations.
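A sketch of this computation, following the (seq, d_model) convention used above (d_model is assumed to be even):

```python
import torch

def positional_encoding(seq: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings of shape (seq, d_model)."""
    pos = torch.arange(seq, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices
    angle = pos / (10000.0 ** (i / d_model))                    # (seq, d_model/2)
    pe = torch.zeros(seq, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even locations use sine
    pe[:, 1::2] = torch.cos(angle)   # odd locations use cosine
    return pe

pe = positional_encoding(seq=6, d_model=512)
print(pe.shape)  # torch.Size([6, 512]); computed once and reused for every sentence
```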
Add and norm
Together, these operations help mitigate the vanishing-gradient problem and improve the learning process.
Add: This is the residual connection or shortcut path that bypasses the sub-layers (like multi-head attention and feed-forward neural network), allowing the input of the sub-layer to be added to its output.
Norm: After the addition, layer normalization is performed. It standardizes the features of the output for each position (i.e., for each word, across the embedding dimension), enhancing the model’s stability and performance.
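A minimal sketch of the pattern LayerNorm(x + Sublayer(x)), where sublayer_output stands for the output of either the multi-head attention or the feed-forward sub-layer:

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer_output: torch.Tensor) -> torch.Tensor:
        # "Add": the shortcut path lets the sub-layer input bypass the sub-layer.
        # "Norm": standardize the features of each position.
        return self.norm(x + sublayer_output)

add_norm = AddAndNorm(d_model=512)
x = torch.randn(6, 512)                        # (seq, d_model)
print(add_norm(x, torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```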
Feed Forward Layer
It is a position-wise fully connected network applied to each position independently: two linear transformations with a ReLU activation in between, expanding from d_model to an inner dimension d_ff and projecting back to d_model.
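A sketch of this sub-layer (d_ff = 2048 is the inner dimension used in the paper):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network: FFN(x) = max(0, x·W1 + b1)·W2 + b2."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied to each position independently; shape stays (seq, d_model).
        return self.net(x)

ffn = FeedForward()
print(ffn(torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```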
Training Process of Transformer
The base diagram of the Transformer architecture in Figure 1 is taken from “Attention Is All You Need” by Vaswani et al.
Unlike Recurrent Neural Networks (RNNs), where each word corresponds to a separate timestep, the Transformer processes the entire training sequence in a single step.
The encoder input is prepended with <SOS> and appended with <EOS>. These special tokens mark the start and end of the sentence: Start Of Sentence and End Of Sentence.
An input embedding is generated for each token of the input and added to the positional encoding. The resulting input has dimension (seq, d_model) and is fed to the encoder block.
Within the encoder, the multi-head attention mechanism operates as a form of self-attention, given that the Key (K), Query (Q), and Value (V) all originate from the same sentence. The encoder’s output, of dimension (seq, d_model), encapsulates the semantics of the word, its position, and its relation to other words within the same sentence. This output subsequently serves as K and V for the decoder’s multi-head attention.
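For illustration, the self-attention call inside the encoder can be sketched with PyTorch's nn.MultiheadAttention (a batch dimension is added only because the module expects one):

```python
import torch
import torch.nn as nn

d_model, num_heads, seq = 512, 8, 6
self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, seq, d_model)        # (batch, seq, d_model): the encoder input
# Self-attention: Q, K and V all come from the same sentence.
enc_out, _ = self_attn(query=x, key=x, value=x)
print(enc_out.shape)  # torch.Size([1, 6, 512]); later reused as K and V in the decoder
```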
The decoder input sentence is prefixed with an <SOS> token, indicating the start of the sentence. For each token in the sentence, an input embedding is generated and combined with the position embedding, giving a tensor of dimension (seq, d_model). This is used three times, as K, Q, and V, by the masked multi-head attention mechanism, which operates as self-attention since K, Q, and V all come from the decoder input sentence.
The multi-head attention mechanism within the decoder functions as cross-attention, differing from self-attention in that the query Q is derived from the decoder input, while K and V are sourced from the encoder output. The decoder’s output is of dimension (seq, d_model). It is processed through a linear layer to map the generated embeddings to corresponding words. This transformation alters the decoder output from (seq, d_model) to (seq, vocab_size), where ‘vocab_size’ represents the size of the vocabulary. Following the application of a softmax function, words from the vocabulary can be selected based on the highest probability for each row/token in (seq, vocab_size).
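A sketch of the final projection step (vocab_size below is arbitrary):

```python
import torch
import torch.nn as nn

seq, d_model, vocab_size = 6, 512, 10_000     # illustrative sizes
decoder_output = torch.randn(seq, d_model)

# The linear layer maps each position's embedding to scores over the vocabulary.
projection = nn.Linear(d_model, vocab_size)
logits = projection(decoder_output)           # (seq, vocab_size)
probs = torch.softmax(logits, dim=-1)         # one distribution per row/token
predicted_ids = probs.argmax(dim=-1)          # highest-probability word for each position
print(logits.shape, predicted_ids.shape)      # torch.Size([6, 10000]) torch.Size([6])
```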
Inference Process in Transformer
The Spanish translation (the target sentence) is not available during the inference phase. We therefore need ‘T’ timesteps for inference, in contrast to the single timestep required for training. At the first timestep, we pass <SOS> as the decoder input and the decoder output gives the first translated word.
At each subsequent timestep, the word generated at the previous timestep is appended to the decoder input. After the softmax, the word corresponding to the last position is selected based on the highest probability. This iterative process continues until the decoder outputs the <EOS> token, signifying that the translation is complete.
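A sketch of this greedy decoding loop, assuming a hypothetical model(src_ids, tgt_ids) that returns one row of logits per decoder position; the <SOS>/<EOS> token ids are illustrative:

```python
import torch

def greedy_decode(model, src_ids, sos_id=1, eos_id=2, max_len=50):
    """Autoregressive inference: generate one token per timestep until <EOS>."""
    tgt_ids = [sos_id]                                      # timestep 0: only <SOS>
    for _ in range(max_len):
        logits = model(src_ids, torch.tensor(tgt_ids))      # (len(tgt_ids), vocab_size)
        next_id = int(logits[-1].softmax(dim=-1).argmax())  # pick the word for the last position
        tgt_ids.append(next_id)                             # feed it back in at the next timestep
        if next_id == eos_id:                               # <EOS> signals the translation is done
            break
    return tgt_ids
```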