BxD Primer Series: Transformer Models

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a 'one-post-one-topic' format. Today's post is on Transformer Models. Let's get started:

Introduction to Transformer:

The name "transformer" comes from a key component of its architecture, known as self-attention or transformer attention mechanism.

This mechanism allows the model to capture relationships between different positions in the input sequence, letting it attend to and "transform" the representation of the sequence based on context and dependencies.

The term "transformer" was introduced by Google researchers in a paper "Attention is All You Need " in 2017, where they presented a novel architecture for sequence-to-sequence tasks that relied heavily on self-attention mechanisms. The name stuck and has since been widely used to refer to this specific architecture.

We have already covered the basics of the attention mechanism in the previous edition and will focus on elements specific to the Transformer architecture in this edition. Please make sure to read the previous edition first.

Note: Traditional recurrent neural networks, LSTMs, and GRUs were widely used in sequence-to-sequence tasks before transformers came into the picture. However, they have limitations in capturing long-range dependencies in sequences and cannot learn in parallel. Transformers address this with a mechanism called self-attention, or scaled dot-product attention.

The What:

The Transformer model is a neural network architecture that consists of two main components: an Encoder and a Decoder. Both components can also work in isolation, depending on the task at hand.

The Encoder is responsible for processing the input sequence, such as a sentence or a document, and converting it into a set of context vectors that capture the key information in the input. It typically consists of several layers of self-attention and feed-forward neural networks, which allow the model to capture both the content and the order of the input sequence.

The Decoder is responsible for generating the output sequence, such as a translated sentence or a summary of the input, based on the context vectors produced by the Encoder. It typically consists of several layers of self-attention and feed-forward neural networks, as well as an additional multi-head attention mechanism that allows the model to focus on different parts of the input sequence while generating output.

The Encoder and Decoder are trained jointly to minimize a loss function by adjusting the weights of the neural network using back-propagation. The trained model can then be used to generate output sequences for new input sequences.

Transformer models are known for their ability to handle long input and output sequences, to capture complex relationships between input and output, and to generalize to new input and output sequences.

Key Features of Transformer Models:

Unique features that make transformers well-suited for sequence-to-sequence tasks:

  1. Self-attention mechanism is used to compute context vectors in both the encoder and decoder. This enables the model to weigh the importance of each token based on its relevance to other tokens, capturing long-range dependencies. It also allows the model to process the entire sequence in parallel, which is faster and more efficient.
  2. Multi-Head Attention is used in transformers, where self-attention is performed multiple times in parallel with different learned weight matrices. This is done to capture different types of dependencies and attend to different parts of the input sequence simultaneously, allowing for richer and more fine-grained representations.
  3. Positional encoding: Since transformer models process input sequences in parallel, they do not have an inherent sense of order like RNNs. To address this, transformer models use positional encoding to explicitly encode the position of each token in the input sequence. This allows the model to capture both the content and the order of the input sequence and handle variable-length input sequences without relying on recurrent connections.
  4. Masking is used to prevent the decoder from accessing future tokens in the output sequence. This ensures that the model only generates output based on the information available to it at each step.
  5. Residual connections are utilized in transformers to facilitate the flow of information through the model. These connections allow gradients to propagate more effectively during training and mitigate the vanishing gradient problem. Additionally, layer normalization is applied to normalize the outputs of each sub-layer, enhancing stability and accelerating training.
  6. Pre-training: Transformers are often pre-trained on large corpora using unsupervised learning techniques, such as masked language modeling or next-word prediction. This enables the model to learn general language representations and then be fine-tuned on specific downstream tasks with smaller labeled datasets or Reinforcement Learning from Human Feedback (RLHF).

Applications of Transformer Models:

Transformer Models perform extraordinarily well on sequence-to-sequence tasks:

  1. Language Translation, where the input is a sentence in one language and the output is the corresponding sentence in another language. These models are trained on large parallel corpora of sentences in different languages.
  2. Text Summarization, where the input is a long document and the output is a summary of the document. These models are trained on large datasets of documents and corresponding summaries.
  3. Question Answering, where the input is a question and the output is an answer to the question. These models are trained on large datasets of questions and corresponding answers.
  4. Speech Transcription, where the input is an audio signal and the output is a transcription of the speech. These models are trained on large datasets of audio recordings and corresponding transcriptions.
  5. Dialogue Generation, where the input is a prompt or context and the output is a response that continues the conversation. These models are trained on large datasets of conversational data.
  6. Sentiment Analysis, where the input is a text passage and the output is a prediction of the sentiment expressed in the passage (e.g. positive, negative, neutral).

And many more…

Self-Attention Mechanism:

Self-attention computes attention scores between all pairs of tokens in a sequence to capture their dependencies. It assigns weights to different tokens based on their relevance to other tokens in the sequence.

Self-attention mechanism has three components:

  • Query (Q): Representation of a token used to derive attention scores w.r.t. other tokens.
  • Key (K): Representation of other tokens used to compute attention scores.
  • Value (V): Value associated with each token, which is weighted by attention scores.

Attention scores are calculated by taking the scaled dot product between query and key vectors and applying a softmax function to obtain a distribution over all tokens. The weighted sum of the value vectors, using the attention weights, yields the final representation of the token:

Attention(Q, K, V) = softmax( (Q × K^T) / √d_k ) × V

Where:

  • Q represents the query matrix
  • K represents the key matrix
  • V represents the value matrix
  • d_k is the dimensionality of the keys, used to scale down the dot products
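To make the computation above concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function names, the toy shapes (4 tokens, dimension 8), and the random inputs are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # attended outputs and the attention weights

# Toy usage: 4 tokens, key/query/value dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```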

Positional Encoding:

Positional encoding is a set of additional embeddings added to the input embeddings to differentiate tokens based on their positions. Positional encodings are combined with token embeddings to create the input representation for the transformer.

A commonly used method for positional encoding uses sine and cosine functions with different frequencies:

PE(pos, 2i) = sin( pos / 10000^(2i / d_model) )

PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

Where,

  • pos is the position index in the sequence
  • i is the index of the dimension within the embedding
  • d_model is the dimensionality of the token embeddings

Sine and cosine functions with varying frequencies ensure that different positions receive unique encoding patterns. The different frequencies capture different scales or rates of change along the sequence, helping the model distinguish between positions more effectively.
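As an illustration, the sine/cosine encoding above can be computed in a few lines of NumPy. The values of max_len and d_model below are arbitrary examples; this is a sketch of the standard formulation, not a complete implementation.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2), the even dimensions
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=64)
# The encoding is simply added to the token embeddings:
# x_input = token_embeddings + pe[:seq_len]
```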

Multi-Head Attention:

Multi-head attention is used to capture different types of dependencies and attend to different parts of the input sequence simultaneously. It involves performing self-attention multiple times in parallel, with each head having its own weight matrices for the query, key, and value projections. The outputs of all heads are concatenated and linearly transformed to obtain the final attention output.

The mathematical equations for multi-head attention are as follows:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) × W_O

head_i = Attention(Q × W_i(Q), K × W_i(K), V × W_i(V))

Where,

  • h is the number of attention heads
  • W_i(Q), W_i(K), W_i(V), and W_O are learnable weight matrices.

Single-headed attention means there is only one set of attention weights used to compute the context vector at each step, so the model can only focus on a single aspect of the input sequence at a time.

Multi-headed attention uses multiple sets of attention weights to compute the context vector. Each attention head focuses on a different aspect of the input sequence, which allows the model to capture different types of information at different levels of granularity. For example, some attention heads may focus on the global context of the input sequence, while others focus on specific local features.
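A minimal sketch of the multi-head computation, reusing the scaled_dot_product_attention function from the earlier sketch. The head count, dimensions, and random weight matrices are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (seq_len, d_model); W_Q/W_K/W_V: lists of per-head (d_model, d_k) matrices;
    # W_O: (h*d_k, d_model) output projection
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        out, _ = scaled_dot_product_attention(Q, K, V)  # defined in the earlier sketch
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_O         # concatenate heads, then project

# Toy usage: d_model=64, h=8 heads, d_k=8 (hypothetical sizes)
d_model, h, d_k = 64, 8, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(10, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
y = multi_head_attention(X, W_Q, W_K, W_V, W_O)          # (10, 64)
```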

Attention Mask:

The attention mask controls the flow of information during self-attention. It determines which positions or tokens in the input sequence should be attended to or ignored by the model.

The attention mask ensures that the model attends only to valid positions and ignores future or padding tokens. It is a square matrix of shape (N, N), where N is the length of the input sequence.

Let's denote this matrix as M. Each element M[i, j] of mask matrix can take two values:

  1. Valid position: M[i, j] = 1
  2. Invalid or padding position: M[i, j] = 0

The attention mask is applied during the attention score calculation: scores at positions where M[i, j] = 0 are set to a very large negative value (effectively -inf) before the softmax activation is applied. After the softmax, the corresponding attention weights are effectively zero, indicating that the model should not attend to those positions.

Padding tokens are sometimes added to ensure that all input sequences have the same length. Encoder masking is used to mask out padding tokens in the input sequence.

Decoder masking, on the other hand, is used to mask out future tokens in the output sequence.
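Here is a small sketch of the two kinds of masks and of masked softmax. The pad_id convention and the -1e9 constant are illustrative choices for pushing masked scores toward zero weight after the softmax.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular mask: position i may attend only to positions j <= i
    return np.tril(np.ones((n, n)))

def padding_mask(token_ids, pad_id=0):
    # 1 for real tokens, 0 for padding (pad_id is an illustrative convention)
    valid = (np.asarray(token_ids) != pad_id).astype(float)
    return valid[None, :] * np.ones((len(token_ids), 1))  # broadcast to (n, n)

def masked_softmax(scores, mask):
    # Masked positions get a large negative score, so their softmax weight is ~0
    scores = np.where(mask == 1, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(2).normal(size=(5, 5))
w = masked_softmax(scores, causal_mask(5))  # weights above the diagonal are ~0
```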

Basic Architecture Diagram:

[Diagram: the encoder-decoder transformer architecture, with a stack of encoder layers feeding the encoder-decoder attention sub-layers of a stack of decoder layers.]

The How:

Processing of an input sequence by an encoder-decoder transformer model, in sequential order:


Input Encoding: Let the input sequence be denoted as X = {x1, x2, …, xn}, where each xi represents a token.

Each token is embedded into a continuous vector representation using an embedding matrix E of dimensions (d_model × V), where d_model is the embedding dimension and V is the vocabulary size.

The embedded input sequence is denoted as X_embed = {e1, e2, …, en}.
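A small sketch of this embedding lookup. For convenient row lookup the matrix is stored here as (V × d_model), the transpose of the convention above; the vocabulary size, dimensions, and token ids are arbitrary examples, and in practice E would be learned.

```python
import numpy as np

V, d_model = 10000, 64
E = np.random.default_rng(3).normal(size=(V, d_model))  # each row is one token embedding

token_ids = [12, 845, 3]        # X = {x1, x2, x3}
X_embed = E[token_ids]          # (3, d_model) = {e1, e2, e3}
# Positional encodings (see the earlier sketch) are added before entering the encoder:
# X_input = X_embed + positional_encoding(len(token_ids), d_model)
```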


Encoder: The encoder consists of L identical layers. Each layer has two sub-layers, self-attention and a feed-forward network, each followed by a residual connection and layer normalization.

a) Self-Attention: For each token e_i in the input sequence, compute the query, key, and value vectors as follows:

Query = e_i × W_Q

Key = e_i × W_K

Value = e_i × W_V

Where W_Q, W_K, W_V are learnable weight matrices of dimension (d_model × d_k), and d_k is the dimension of the key, query, and value vectors.

The self-attention mechanism calculates attention weights for token e_i by taking the scaled dot product between its query vector and the key vectors of all tokens:

AttentionWeights(i, j) = softmax_j( (Query_i · Key_j) / √d_k )

The attended output for token e_i is obtained by weighting the value vectors of all tokens:

AttendedOutput_i = Σ_j AttentionWeights(i, j) × Value_j

b) Residual Connection and Layer Normalization: The output of the self-attention sub-layer is combined with the input embedding through a residual connection:

Residual_i = e_i + AttendedOutput_i

Layer normalization is applied to normalize the output:

Normalized_i = LayerNorm(Residual_i)

c) Feed-Forward Network: The normalized output is passed through a feed-forward neural network consisting of two linear transformations with a ReLU activation in between:

FFN(x) = max(0, x × W_1 + b_1) × W_2 + b_2

Where,

  • W_1, W_2 are learnable weight matrices
  • b_1, b_2 are learnable bias terms

d) Residual Connection and Layer Normalization: The output of the feed-forward sub-layer is combined with the normalized output using a residual connection, followed by layer normalization:

EncoderOutput_i = LayerNorm(Normalized_i + FFN(Normalized_i))
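Putting sub-layers (a) to (d) together, one encoder layer can be sketched as below. The attn_fn argument stands in for the self-attention computation from the earlier sketches, and the weight shapes and random values are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's features to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between: max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_fn, W1, b1, W2, b2):
    # (a)+(b): self-attention, then residual connection and layer normalization
    x = layer_norm(x + attn_fn(x))
    # (c)+(d): feed-forward network, then residual connection and layer normalization
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))

# Toy usage with random weights and an identity stand-in for the attention sub-layer
rng = np.random.default_rng(4)
d_model, d_ff = 64, 256
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
x = rng.normal(size=(10, d_model))
out = encoder_layer(x, lambda t: t, W1, b1, W2, b2)  # (10, 64)
```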

Decoder: The decoder also consists of L identical layers, but with an additional sub-layer compared to the encoder.

a) Self-Attention: Similar to the encoder, query, key, and value vectors are computed for each token in the decoder, with decoder masking applied so that each position attends only to earlier positions:

Query = DecoderEmbedding × W_Q

Key = DecoderEmbedding × W_K

Value = DecoderEmbedding × W_V

Attention weights and attended output are calculated as before.

b) Encoder-Decoder Attention: The decoder attends to the encoder output using the encoder-decoder attention mechanism. Queries come from the decoder's self-attention output, while keys and values come from the encoder output:

Query = DecoderSelfAttentionOutput × W_Q

Key = EncoderOutput × W_K

Value = EncoderOutput × W_V

Attention weights and attended output are calculated as before.

c) Residual Connection and Layer Normalization are applied to the outputs of the self-attention and encoder-decoder attention sub-layers, similar to the encoder.

d) Feed-Forward Network: The normalized output is passed through a feed-forward neural network, similar to the encoder.

e) Residual Connection and Layer Normalization are applied to the output of the feed-forward network, similar to the encoder.
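A corresponding sketch of one decoder layer, reusing layer_norm and feed_forward from the encoder sketch. Here self_attn_fn and cross_attn_fn are hypothetical stand-ins for masked self-attention and encoder-decoder attention.

```python
def decoder_layer(y, enc_out, self_attn_fn, cross_attn_fn, W1, b1, W2, b2):
    # (a)+(c): masked self-attention over decoder tokens, plus residual and layer norm
    y = layer_norm(y + self_attn_fn(y))
    # (b)+(c): encoder-decoder attention, where queries come from the decoder state
    # and keys/values come from the encoder output, plus residual and layer norm
    y = layer_norm(y + cross_attn_fn(y, enc_out))
    # (d)+(e): feed-forward network, plus residual and layer norm
    return layer_norm(y + feed_forward(y, W1, b1, W2, b2))
```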


Output Projection: The final output of the decoder is projected to the output vocabulary using a linear projection layer, followed by a softmax to obtain token probabilities:

Output = softmax(DecoderOutput × W_O + b_O)

where W_O and b_O are a learnable weight matrix and bias term.


Training and Inference: The model is trained to minimize a defined loss function, such as cross-entropy loss, using back-propagation and gradient descent. Model parameters are updated iteratively.

During inference or generation, the model predicts output tokens auto-regressively.

Generated tokens are fed back as input to predict subsequent tokens until a special end-of-sequence token is generated or a predefined maximum sequence length is reached.

Note 1: Temperature scaling is a simple technique to control the diversity of transformer output. During generation, the model outputs a probability distribution over the vocabulary. The temperature parameter scales the logits before the softmax, controlling how peaked or flat that distribution is. A higher temperature leads to more diverse and creative text, while a lower temperature leads to more conservative and repetitive text.
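A minimal sketch of auto-regressive generation with temperature scaling. Here model_step is a hypothetical function that returns next-token logits for a given prefix, and eos_id and max_len are illustrative parameters.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    # Divide logits by the temperature before softmax:
    # T > 1 flattens the distribution (more diverse), T < 1 sharpens it (more conservative)
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.default_rng().choice(len(probs), p=probs))

def generate(model_step, start_ids, eos_id, max_len=50, temperature=1.0):
    # model_step(ids) -> logits over the vocabulary for the next token (hypothetical interface)
    ids = list(start_ids)
    while len(ids) < max_len:
        next_id = sample_next_token(model_step(ids), temperature)
        ids.append(next_id)
        if next_id == eos_id:   # stop at the end-of-sequence token
            break
    return ids
```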

Note 2: Residual connections, also known as skip connections, address the issue of vanishing gradients and allow information and gradients to propagate more effectively during training.

Note 3: Layer normalization is used to normalize the outputs of each sub-layer, improving stability and convergence during training. It ensures that the hidden-state representations in each layer have a mean of zero and a variance of one.

  • Layer normalization has advantages over batch normalization: it does not depend on batch statistics, which makes it suitable for tasks with variable-length sequences, and it behaves consistently across different batch sizes.

Beam Search:

Beam search is a decoding algorithm used in sequence generation tasks. It helps select the most probable output sequence given the input sequence and a trained transformer model. Instead of greedily selecting the highest-probability token at each step, beam search maintains a beam of the top-k candidates and explores multiple possible paths to find the most likely sequence.

The beam search algorithm works as follows:

Initialization: Set the beam width (k), which determines the number of candidate sequences to consider at each decoding step. Initialize the beam with a start token or the input sequence.

Decoding Steps: Perform a series of steps until a termination condition is met (reaching the maximum sequence length or encountering an end token).

a. Generate candidates: For each candidate sequence in the beam, predict the probabilities of the next possible tokens using the transformer model, via the softmax over the vocabulary.

b. Update the beam: For each candidate, multiply its token probability by the probability of its parent sequence (the cumulative probability) to score the sequence so far. Keep the top-k sequences with the highest cumulative probabilities; in practice, log-probabilities are summed for numerical stability.

c. Early stopping: If any of the top-k sequences ends with an end token, it is considered a complete sequence. You can also enforce a minimum length criterion to avoid overly short sequences. If the number of complete sequences reaches k, terminate the decoding process.

Final Selection: Once the termination condition is met, select the sequence with the highest cumulative probability from the set of complete sequences as the final output.

Beam search provides a trade-off between the quality and diversity of generated sequences. By adjusting the beam width (k), you can control the level of exploration versus exploitation. A minimal code sketch follows the notes below.

  • Larger beam width explores more options but may generate similar or redundant sequences.
  • Smaller beam width focuses on the most probable sequences but may result in less diversity.
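Here is a minimal beam-search sketch under those assumptions. model_step is a hypothetical function returning next-token log-probabilities (summing log-probabilities is equivalent to multiplying probabilities and is numerically more stable), and k and max_len are illustrative.

```python
import numpy as np

def beam_search(model_step, start_ids, eos_id, k=3, max_len=20):
    # model_step(ids) -> log-probabilities over the vocabulary for the next token
    # Each beam entry is (cumulative log-probability, list of token ids)
    beams = [(0.0, list(start_ids))]
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, ids in beams:
            log_probs = model_step(ids)
            # Expand each beam with its k most probable next tokens
            for tok in np.argsort(log_probs)[-k:]:
                candidates.append((score + log_probs[tok], ids + [int(tok)]))
        # Keep the top-k candidates by cumulative log-probability
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, ids in candidates[:k]:
            # Sequences ending in the end token are complete; others stay in the beam
            (complete if ids[-1] == eos_id else beams).append((score, ids))
        if len(complete) >= k or not beams:
            break
    best = max(complete or beams, key=lambda c: c[0])
    return best[1]
```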

Pre-Training in Transformer Models:

Pre-training is the unsupervised training of a transformer model on a large, diverse dataset, such as a collection of web pages or a corpus of books. During pre-training, the model learns to recognize patterns in the input text and encode that information into context vectors.

After pre-training, the model can be fine-tuned on a smaller dataset for a specific task. During fine-tuning, the model weights are adjusted to optimize performance on the new task, while preserving the general language understanding learned during pre-training.

Two main pre-training objectives:

  1. Masked Language Modeling: Given an input sequence X = {x1, x2, ..., xn}, where xi represents the i'th token in the sequence, a certain percentage of tokens are randomly masked. The objective is to predict the original value of the masked tokens based on the context provided by the non-masked tokens; the model aims to maximize the likelihood of the original tokens (a minimal masking sketch follows this list).
  2. Next Sentence Prediction: Given a pair of sentences (S1, S2), the objective is to predict whether S2 follows S1 in the original text. This is modeled as a binary classification task, where the model aims to maximize the probability of the correct sentence ordering.
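As a small illustration of the masked language modeling objective, here is one way the training inputs could be prepared. The mask_id, the 15% masking rate, and the -100 "ignore" label are illustrative conventions; BERT additionally replaces some selected tokens with random or unchanged tokens.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=None):
    # Randomly replace a fraction of tokens with a [MASK] id; the model is then
    # trained to predict the original ids at exactly those positions.
    rng = rng or np.random.default_rng()
    ids = np.asarray(token_ids).copy()
    positions = rng.random(len(ids)) < mask_prob
    labels = np.where(positions, ids, -100)  # -100 marks positions to ignore in the loss
    ids[positions] = mask_id
    return ids, labels

inputs, labels = mask_tokens([101, 2054, 2003, 1996, 3007, 102], mask_id=103)
```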

Examples of pre-trained transformer models:

  1. BERT (Bidirectional Encoder Representations from Transformers): BERT introduced the concept of masked language modeling and next sentence prediction. It is trained on large amounts of unlabeled text data, and the model learns to predict masked words in a sentence.
  2. GPT (Generative Pre-trained Transformer): GPT uses autoregressive language modeling, where the model predicts next word in a sequence given the preceding context. GPT has remarkable capabilities in natural language understanding and generation. It has been used for machine translation, question answering, and even creative writing.
  3. T5 (Text-To-Text Transfer Transformer): T5 is a versatile transformer model trained in a text-to-text framework, where various NLP tasks are cast as text generation tasks. It achieved impressive results by unifying different tasks under a single framework.
  4. ViT (Vision Transformer): ViT applies the transformer architecture to computer vision tasks. Instead of using convolutional neural networks, ViT treats images as sequences of patches and feeds them through a transformer model. Its performance is competitive with CNNs.

The Why:

Reasons to use Encoder-Decoder Transformer Models:

  1. Effective at processing and generating sequences of data, such as natural language text and speech.
  2. Can capture long-term dependencies in input sequence and generate coherent and contextually appropriate output.
  3. Can be trained in parallel, which makes them more efficient and scalable than other sequence models and well-suited for large-scale applications.
  4. Can be pre-trained on large amounts of data, and the resulting model can be fine-tuned on a specific task. This transfer learning approach improves the performance of the model when training data is limited.
  5. Transformer architecture helps prevent errors from propagating through the model during training and inference.
  6. Effectively handle long sequences due to their self-attention mechanism.

The Why Not:

Reasons to not use Transformer Models:

  1. Requires large amounts of data and computational resources to train effectively.
  2. Challenging to understand why the model is making specific predictions or generating specific output, because of the many stacked self-attention layers.
  3. Struggle with tasks that require world knowledge, common sense reasoning or understanding metaphors.
  4. Struggle to understand complex scientific literature, due to lack of training data for these domains.

Time for you to support:

  1. Reply to this article with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In the next edition, we will cover Transfer Learning Techniques.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #transformers #architecture #neuralnetworks #primer

