Why Decoder-only Transformers?

In natural language processing (NLP), Transformer architectures have revolutionized the way machines understand and generate human language. At the heart of these architectures are the encoder and decoder blocks, the components that drive contextual understanding and output generation, respectively. Among the various Transformer variants, decoder-only models have drawn significant attention for their efficiency and effectiveness, particularly in text generation. This article introduces encoder-only, encoder-decoder, and decoder-only models, highlighting the characteristics and advantages of each, and explains why most large language models (LLMs) today are built on a decoder-only architecture.

Encoder and Decoder

  • Encoder: Captures the meaning of the entire input sentence and produces a rich, contextual representation of each word.
  • Decoder: Uses these contextual representations to generate the output sentence, word by word, ensuring the translation is contextually and grammatically correct.

Imagine we want to translate the English sentence "I love cats" to French, which is "J'aime les chats".

1. Input: "I love cats" (English)

2. Encoder processing: take the input and apply

  • Tokenization: ["I", "love", "cats"]
  • Embedding: [Vector(I), Vector(love), Vector(cats)]
  • Positional encoding and self-attention layers, which produce a contextual representation of each token.

3. Encoder output: contextual vectors representing ["I", "love", "cats"]

4. Decoder processing:

  • Initial input: "<start>"
  • Step-by-step generation, attending to the encoder output at every step (sketched in code after this list):
  • "<start>" -> "J'"
  • "J'" -> "aime"
  • "J'aime" -> "les"
  • "J'aime les" -> "chats"
  • "J'aime les chats" -> "<end>"

5. Output: "J'aime les chats"
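The same step-by-step loop can be written down concretely. The sketch below is illustrative only, assuming the Hugging Face transformers library and the publicly available Helsinki-NLP/opus-mt-en-fr translation checkpoint (neither is prescribed by the walkthrough above); it runs the encoder once, then greedily decodes one token at a time while attending to the encoder output.

```python
# Illustrative sketch of the encoder-decoder translation loop above, assuming the
# Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-fr checkpoint.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Steps 1-3: tokenize the English input and run the encoder once to obtain
# contextual vectors for every input token.
inputs = tokenizer("I love cats", return_tensors="pt")
encoder_out = model.get_encoder()(**inputs)

# Step 4: start from the decoder start token and greedily append one token at a
# time, attending to the fixed encoder output at every step.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    logits = model(encoder_outputs=encoder_out,
                   attention_mask=inputs["attention_mask"],
                   decoder_input_ids=decoder_ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy choice
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:              # "<end>"
        break

# Step 5: the generated French sentence, e.g. "J'aime les chats".
print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```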

Overview of Language Model Architectures

Encoder-only

Encoder-only transformers, like BERT (Bidirectional Encoder Representations from Transformers), consist solely of a stack of encoder layers. These models are designed to understand and process input sequences by attending to all tokens bidirectionally.

Applications:

  • Text Classification: Tasks like sentiment analysis where the entire input sequence needs to be understood as a whole.
  • Named Entity Recognition (NER): Identifying and classifying entities within a text.
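As a quick illustration of the classification use case, here is a minimal sketch assuming the Hugging Face transformers library; by default its sentiment pipeline loads a DistilBERT (BERT-family) checkpoint fine-tuned for sentiment, which is an assumption of this example rather than something stated above.

```python
# Minimal sketch: an encoder-only (BERT-family) model used for sentiment analysis,
# assuming the Hugging Face transformers library and its default sentiment checkpoint.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I love cats"))  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```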

Encoder-Decoder Models

Encoder-decoder transformers, like T5 or BART, consist of two separate stacks: an encoder that processes the input sequence and a decoder that generates the output sequence.

Applications:

  • Machine Translation: They excel at tasks requiring the transformation of one sequence into another, such as translating text from one language to another.
  • Summarization: Generating a summary of a long document.
  • Question Answering: Where the encoder processes the context and the decoder generates the answer.
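For a concrete feel of the sequence-to-sequence use case, here is a minimal summarization sketch; the transformers library and the public facebook/bart-large-cnn checkpoint are assumptions chosen only for illustration.

```python
# Minimal sketch: an encoder-decoder model (BART) used for summarization, assuming
# the Hugging Face transformers library and the facebook/bart-large-cnn checkpoint.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "Transformers process text with attention instead of recurrence. The encoder "
    "reads the whole input and builds contextual vectors, while the decoder uses "
    "those vectors to generate the output sequence one token at a time."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```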

Decoder-Only Models

Decoder-only transformers, like GPT (Generative Pre-trained Transformer), consist solely of a stack of decoder layers. Each layer includes self-attention mechanisms and feed-forward neural networks. The self-attention mechanism in each layer allows the model to attend to all previous tokens in the sequence.

Applications:

  • Autoregressive Text Generation: They are particularly well-suited for tasks that involve generating text, such as language modeling, text completion, and creative writing. For example, generating a paragraph of text based on a given prompt.
  • Conversational AI: Useful in chatbots and dialogue systems where the model generates responses based on the conversation history.
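A minimal sketch of autoregressive generation with a decoder-only model follows; the transformers library and the public "gpt2" checkpoint are assumptions used purely for illustration.

```python
# Minimal sketch: a decoder-only model (GPT-2) continuing a prompt token by token,
# assuming the Hugging Face transformers library and the public "gpt2" checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time, in a quiet village,", max_new_tokens=30)
print(result[0]["generated_text"])
```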

How Do Decoder-Only Models Work?

Key Components

Token Input Layer: This is where the model receives the sequence of input tokens (words or subwords).

Decoder Block Layer:

  • Positional Encoding: Adds information about the position of each token so the model understands the order of words in the sequence.
  • Masked Multi-Head Self-Attention Layer: Lets each token attend to the tokens that precede it. The "masked" aspect ensures the model predicts each token without looking at future tokens, preserving the model's autoregressive nature.
  • Fully Connected Layers: Further process the information from the self-attention layer.

Output Layer: A fully connected layer followed by a softmax function to predict the next token in the sequence.

Depending on the requirements, a Decoder-Only Model can have multiple Decoder Blocks stacked on top of each other. The real driving force behind these models is the Masked Self-Attention mechanism. This mechanism allows the model to focus on different parts of the input sequence when predicting each token, facilitating the generation of contextually relevant text.
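To make these pieces concrete, here is a minimal, self-contained sketch of a decoder block with masked self-attention in PyTorch; the dimensions, layer counts, and GELU/LayerNorm details are illustrative assumptions, not the configuration of any specific model.

```python
# Minimal sketch of a decoder-only stack: embeddings + positional encodings,
# decoder blocks with masked (causal) self-attention, and a softmax output head.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + a)                  # residual + layer norm
        return self.ln2(x + self.ff(x))      # fully connected (feed-forward) sub-layer

vocab, d_model, T = 1000, 256, 8
tok_emb = nn.Embedding(vocab, d_model)       # token input layer
pos_emb = nn.Embedding(T, d_model)           # positional encoding (learned, for brevity)
blocks = nn.Sequential(*[DecoderBlock() for _ in range(2)])  # stacked decoder blocks
head = nn.Linear(d_model, vocab)             # output layer before softmax

tokens = torch.randint(0, vocab, (1, T))
x = tok_emb(tokens) + pos_emb(torch.arange(T))
logits = head(blocks(x))                     # (1, T, vocab)
next_token_probs = logits[:, -1].softmax(-1) # distribution over the next token
```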

During inference or production, Decoder-Only Models employ algorithms like Greedy Search or Sampling to choose the most appropriate words for generating the next part of the text. This ability to generate text makes Decoder-Only Models particularly useful in creating content that requires a high degree of contextual understanding and coherence, making them ideal for applications that involve human-like interaction and content creation.
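As a toy illustration of the difference between these decoding strategies (using made-up scores rather than a real model): greedy search always takes the highest-probability token, while sampling draws from the whole distribution.

```python
# Toy illustration of greedy search vs. sampling over a made-up 4-token vocabulary.
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])           # scores a model might emit
probs = torch.softmax(logits, dim=-1)                  # next-token distribution

greedy_choice = torch.argmax(probs).item()             # always the most likely token
sampled_choice = torch.multinomial(probs, 1).item()    # random draw, weighted by probs

print(probs, greedy_choice, sampled_choice)
```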


If the decoder alone already gives a model its generative ability, wouldn't adding an encoder component on top only help?

Causal Decoder (CD) vs Encoder-Decoder (ED)

In the study "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?" by Wang et al. (ICML 2022), researchers compared different language model architectures and training methods. They found that:

  • Causal Decoder-Only Models: Trained with a plain language-modeling objective to predict the next word, these models showed the strongest zero-shot generalization to new tasks straight after self-supervised pretraining.
  • Non-Causal Models: Models trained to fill in missing words (masked language modeling) and then multitask-finetuned on labeled data performed best overall once that extra finetuning was applied.

Choosing between decoder-only and encoder-decoder architectures involves several factors, as outlined below:

Cost of Training:

  • Encoder-Decoder Models: Achieve maximum potential with multitask finetuning on labeled data, which is costly, especially for larger models.
  • Decoder-Only Models: Perform well with zero-shot generalization using self-supervised learning on large datasets, making them cost-effective.

Emergent Ability:

  • The models in the study (with around 5 billion parameters and 170 billion tokens) are not large enough to exhibit emergent abilities.
  • Emergent abilities refer to sophisticated capabilities that arise naturally as models scale, enabling complex reasoning and task decomposition.
  • These abilities reduce the performance gap between decoder-only and encoder-decoder models without multitask finetuning.

In-Context Learning from Prompts:

  • Prompting techniques (like few-shot examples) help models understand tasks better.
  • In decoder-only models, prompts act more directly: the same stack that reads the prompt also generates the output, so there is no separate encoder representation in between.
  • Encoder-decoder models require careful tuning of the encoder for optimal performance with prompts.
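The sketch below shows what few-shot prompting looks like for a decoder-only model. The transformers library and the small "gpt2" checkpoint are assumptions for illustration, and a model this small may not answer reliably, but the mechanism is the point: the task is specified entirely inside the prompt.

```python
# Few-shot (in-context) prompting sketch: the task is described inside the prompt
# and a decoder-only model simply continues the text. "gpt2" is used only as a
# small, public stand-in; tiny models may answer unreliably.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = (
    "Review: The movie was fantastic. Sentiment: positive\n"
    "Review: I wasted two hours of my life. Sentiment: negative\n"
    "Review: The plot surprised me in the best way. Sentiment:"
)
print(generator(prompt, max_new_tokens=2)[0]["generated_text"])
```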

Efficiency Optimization:

  • Decoder-only models cache and reuse the Key (K) and Value (V) projections of previous tokens (the KV cache), enhancing efficiency and reducing computational cost during inference.
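A brief sketch of what that reuse looks like in practice, assuming the transformers library and the "gpt2" checkpoint: past key/value tensors from one forward pass are handed back so the next step only processes the newly generated token.

```python
# KV-cache sketch with a decoder-only model, assuming the Hugging Face transformers
# library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
out = model(ids, use_cache=True)                 # full pass: keys/values are cached
past = out.past_key_values
next_id = out.logits[:, -1].argmax(-1, keepdim=True)

# The next step feeds ONLY the new token; the cached K/V stand in for the prefix.
out2 = model(next_id, past_key_values=past, use_cache=True)
print(tokenizer.decode(next_id[0]))
```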

Autoregressive vs Bidirectional Attention:

  • Decoder-Only (Causal Decoder): Uses autoregressive (causal) attention; the resulting lower-triangular attention matrix stays full rank, which theoretically offers stronger expressive capability.
  • Encoder-Decoder: Uses bidirectional attention, which speeds up learning but may limit the model’s ability to learn deeper predictive patterns.
  • Experiments show that a mix of forward and backward attention (Forward-Backward attention) performs slightly better than full bidirectional attention but the difference is marginal with sufficient training.
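As a toy illustration of the structural difference (not the formal argument from the literature), the causal pattern is lower-triangular while bidirectional attention is unrestricted:

```python
# Toy comparison of the attention patterns: causal (autoregressive) vs. bidirectional.
import torch

T = 4
causal = torch.tril(torch.ones(T, T))    # row i attends only to positions <= i
bidirectional = torch.ones(T, T)         # every position attends to every position

print(causal)
print(bidirectional)
print(torch.linalg.matrix_rank(causal))  # lower-triangular => full rank, the intuition
                                         # behind the expressiveness argument
```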

Conclusion

The popularity of the decoder-only architecture comes from its simplicity, good zero-shot generalization, and lower training cost to reach reasonable performance. Much work has compared decoder-only and encoder-decoder architectures, but given sufficient training and model size, there is no hard evidence that one architecture is ultimately superior to the other in final performance.

References

Wang et al. (2022). What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? ICML 2022. https://proceedings.mlr.press/v162/wang22u/wang22u.pdf

Alammar, J. (2018). The Illustrated Transformer [Blog post]. https://jalammar.github.io/illustrated-transformer/

Wei et al. (2022). Emergent Abilities of Large Language Models. https://arxiv.org/abs/2206.07682



