In this article, I will explain the concept in more detail, focusing on the behind-the-scenes architecture and the mathematics of the Transformer.
This diagram represents the architecture of a Transformer model. It shows both the encoder (left) and decoder (right) structures.
Let's break down each step and component:
Encoder (Left Block)
1. Input Embedding
- What It Is: Each input word/token is represented by a vector (embedding) that captures its meaning. This embedding is learned during training.
- Process: The input embeddings are combined with positional encodings to form the input to the encoder.
2. Positional Encoding
- Purpose: Transformers have no inherent understanding of the sequential order of tokens, unlike recurrent models (e.g., RNNs). Positional encoding injects information about the position of tokens in a sequence so the model can understand the order.
- How It Works: Positional encodings are added to input embeddings, allowing the model to learn the relative or absolute positions of tokens.
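Below is a minimal NumPy sketch of the sinusoidal positional encoding described in the original Transformer paper; the sequence length and `d_model` used here are purely illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    # Each pair of dimensions shares one frequency: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions  -> cosine
    return pe

# Positional encodings are simply added to the token embeddings.
embeddings = np.random.randn(3, 8)                            # 3 tokens, d_model = 8 (illustrative)
encoder_input = embeddings + sinusoidal_positional_encoding(3, 8)
```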
3. Encoder Block
- The encoder block consists of multiple identical layers (stacked Nx times). Each layer has two main sub-components:
a. Multi-Head Self-Attention
- Function: Runs several self-attention operations (heads) in parallel. It helps the model focus on different parts of the input sequence by computing attention scores for each token in parallel.
- How It Works: Each token's embedding is projected into Query, Key, and Value vectors; attention is computed independently in each head, and the heads' outputs are concatenated and linearly projected. The attention computation itself is broken down step by step later in this article.
b. Add & Norm (Residual Connection + Layer Normalization)
- Purpose: A residual connection adds the original input of the multi-head attention sub-layer back to its output to help preserve information and prevent vanishing gradient issues.
- Layer Normalization: Ensures the outputs are normalized for stable training.
c. Feed-Forward Network
- Structure: Consists of two linear layers with a ReLU activation in between.
- Function: Introduces non-linearity and allows for complex transformations of the input.
- Process: The output of the feed-forward sub-layer is added back to its input with another normalization step, as sketched below.
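To make the Add & Norm and feed-forward sub-layers concrete, here is a minimal NumPy sketch of one encoder layer. The `attn_fn` argument stands in for the multi-head self-attention sub-layer (a single-head version is sketched later in this article), and the weight shapes are illustrative rather than the paper's actual dimensions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: two linear layers with a ReLU in between."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, attn_fn, w1, b1, w2, b2):
    # Sub-layer 1: self-attention, then residual connection + layer normalization
    x = layer_norm(x + attn_fn(x))
    # Sub-layer 2: feed-forward network, then residual connection + layer normalization
    x = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
    return x

# Illustrative usage with random weights and a placeholder attention function.
x = np.random.randn(3, 8)                         # 3 tokens, d_model = 8
w1, b1 = np.random.randn(8, 32), np.zeros(32)     # FFN expands, then projects back
w2, b2 = np.random.randn(32, 8), np.zeros(8)
out = encoder_layer(x, lambda t: t, w1, b1, w2, b2)   # identity stands in for attention
```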
Decoder (Right Block)
The decoder block is also composed of multiple identical layers (stacked Nx times).
a. Masked Multi-Head Attention
- Function: Ensures that each position in the output sequence can only attend to previous positions and not future ones. This masking prevents the model from "cheating" by looking at future words during training (see the mask sketch after this list).
b. Add & Norm
- Function: Similar to the encoder, it has a residual connection and normalization to stabilize training.
c. Multi-Head Attention (Encoder-Decoder Attention)
- Purpose: This layer helps the decoder focus on relevant parts of the input sequence. It takes the encoder's output as keys and values, with the decoder's output as the query.
- Process: The decoder attends over the encoder's representations to generate context-aware output representations.
d. Add & Norm
- Function: Another residual connection and normalization to ensure stable gradients.
e. Feed-Forward Network
- Function: Similar to the encoder, applies two linear transformations with a ReLU activation.
f. Add & Norm
- Purpose: Adds the residual connection and normalizes the output.
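To illustrate the masking used in the decoder's masked self-attention, here is a small NumPy sketch of a look-ahead (causal) mask; positions above the diagonal are set to negative infinity so that, after the softmax, each token effectively attends only to itself and earlier tokens. The function name and sizes are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: -inf above the diagonal, 0 elsewhere."""
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)

# The mask is added to the raw attention scores before the softmax, e.g.:
#   scores = (Q @ K.T) / np.sqrt(d_k) + causal_mask(seq_len)
# Future positions then receive a probability of ~0 after the softmax.
print(causal_mask(3))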
Linear Layer and Softmax (Output Generation)
- Linear Layer: Transforms the output of the last decoder block to a vector that matches the vocabulary size.
- Softmax: Converts the vector into probabilities for each possible token in the vocabulary, allowing the model to predict the next word in the sequence.
- Purpose: The final output is a probability distribution over the vocabulary for each position in the output sequence. The token with the highest probability is selected as the predicted next word.
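Here is a minimal NumPy sketch of this final step, assuming a decoder output of shape (sequence length x d_model) and an illustrative vocabulary size; the projection weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 8, 100                        # illustrative sizes
decoder_output = np.random.randn(3, d_model)        # 3 positions from the last decoder block
W_out = np.random.randn(d_model, vocab_size)        # linear projection to vocabulary size

logits = decoder_output @ W_out                     # (3, vocab_size)
probs = softmax(logits)                             # probability over the vocabulary per position
next_token = probs[-1].argmax()                     # greedy pick of the most likely next token
```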
Summary
- Encoder: Processes the input sequence and generates contextualized representations using self-attention and feed-forward layers.
- Decoder: Uses the encoder's output and previous outputs to generate the final sequence step-by-step, attending to both its own context (masked attention) and the encoder's context (encoder-decoder attention).
- Output: The decoder outputs probabilities for each token, and the highest probability is chosen as the next token in sequence generation.
Let's discuss the major components in detail.
This diagram illustrates the attention mechanism in the Transformer architecture. Here's a step-by-step breakdown:
Step 1: Input Text
- The process starts with an input sentence, "Sky is Blue," which is fed into the system.
Step 2: Tokenization and Embedding
- The input text is split into individual words or tokens: "Sky," "is," and "Blue."
- Each token is converted into a corresponding vector embedding, represented as orange rectangles. These embeddings capture the semantic meaning of each word.
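As a toy illustration of this lookup, the snippet below uses an invented three-word vocabulary and a random embedding table standing in for weights that would be learned during training.

```python
import numpy as np

# Toy vocabulary and embedding table (random values stand in for learned weights).
vocab = {"Sky": 0, "is": 1, "Blue": 2}
d_model = 8
embedding_table = np.random.randn(len(vocab), d_model)

tokens = ["Sky", "is", "Blue"]
token_ids = [vocab[t] for t in tokens]        # tokenization -> integer ids
embeddings = embedding_table[token_ids]       # lookup -> (3, d_model) matrix of word vectors
```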
Step 3: Positional Encoding
- Positional encoding vectors are added to the word embeddings to provide information about the position of each word in the sentence.
- The positional encoding ensures that the model understands the order of the words (e.g., that "Sky" comes before "is").
Step 4: Self-Attention / Multi-Head Attention
Self-attention involves several steps:
a. Linear Transformation
- The word embeddings are linearly transformed into three different matrices:
  - Query (Q) Matrix: Represents what the word is trying to find in other words.
  - Key (K) Matrix: Represents the features that other words may have.
  - Value (V) Matrix: Contains the actual word representation.
- These transformations are applied to each word to generate Q, K, and V matrices (colored Green, Red, and Blue, respectively).
b. Dot Product and Scaling
- For each word, its Q vector is multiplied (dot product) with the K vectors of all the words, producing a scalar score for each pair. This operation measures how relevant each other word is to the query word.
- The scores are then scaled by dividing by the square root of the key dimension (√d_k), which keeps the dot products from growing too large and stabilizes training.
c. Softmax Operation
- The scaled scores are passed through a softmax function. This converts the scores into a probability distribution, giving values between 0 and 1.
- The softmax output indicates the attention weight each word should give to the others.
d. Weighted Sum of Value Vectors
- The softmax-generated weights are multiplied with the corresponding V vectors.
- The weighted V vectors are summed to create a final attention representation for each word.
e. Final Output from the Attention Layer
- The summed vectors form the output representation of the sentence from the attention mechanism. This output encapsulates the relationships and importance of the words in context.
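Putting steps a through e together, here is a minimal NumPy sketch of single-head scaled dot-product self-attention over the three-token example; the projection matrices are random stand-ins for learned weights, and multi-head attention simply runs several such computations in parallel and concatenates the results.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # a. Linear transformation into Query, Key, and Value matrices
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # b. Dot product of every query with every key, scaled by sqrt(d_k)
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)              # (seq_len, seq_len)
    # c. Softmax turns the scores into attention weights
    weights = softmax(scores)
    # d./e. Weighted sum of the value vectors gives the output representation
    return weights @ V

seq_len, d_model = 3, 8                            # "Sky is Blue" -> 3 tokens (illustrative sizes)
x = np.random.randn(seq_len, d_model)              # embeddings + positional encodings
W_q = np.random.randn(d_model, d_model)            # random stand-ins for learned weights
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)
output = self_attention(x, W_q, W_k, W_v)          # (3, d_model) context-aware vectors
```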
Key Points Highlighted:
- (1) Input text processing.
- (2) Tokenization and generation of word embeddings.
- (3) Addition of positional encoding vectors to the word embeddings.
- (4) Creation of Q, K, and V matrices via linear transformation.
- (5) Dot product and scaling of the Q and K vectors.
- (6) Application of the softmax function.
- (7) Calculation of weighted sums with V vectors.
- (8) Summing the results to form the final representation.
This flow enables the Transformer model to focus on different parts of the input sentence when generating outputs, allowing it to capture complex relationships between words.