Why Decoder-only Transformers?

In natural language processing (NLP), Transformer architectures have revolutionized the way machines understand and generate human language. At the heart of these architectures are the encoder and decoder blocks, the components that drive contextual understanding and output generation, respectively. Among the various Transformer variants, decoder-only models have drawn significant attention for their efficiency and effectiveness, particularly in text generation. This article introduces encoder-only, encoder-decoder, and decoder-only models, highlighting the characteristics and advantages of each, and explains why most large language models (LLMs) today are built on a decoder-only architecture.

Encoder and Decoder

  • Encoder: Captures the meaning of the entire input sentence and produces a rich, contextual representation of each word.
  • Decoder: Uses these contextual representations to generate the output sentence, word by word, ensuring the translation is contextually and grammatically correct.

Imagine we want to translate the English sentence "I love cats" to French, which is "J'aime les chats".

1. Input: "I love cats" (English)

2. Encoder processing: take the input and apply

  • Tokenization: ["I", "love", "cats"]
  • Embedding: [Vector(I), Vector(love), Vector(cats)]
  • Positional encoding and self-attention layers, which produce a contextual representation of each token.

3. Encoder output: contextual vectors representing ["I", "love", "cats"]

4. Decoder processing:

  • Initial input: "<start>"
  • Step-by-step generation, attending to the encoder output at every step (sketched in code after this list):
  • "<start>" -> "J'"
  • "J'" -> "aime"
  • "J'aime" -> "les"
  • "J'aime les" -> "chats"
  • "J'aime les chats" -> "<end>"

5. Output: "J'aime les chats"
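The same step-by-step loop can be written down concretely. The sketch below is illustrative only, assuming the Hugging Face transformers library and the publicly available Helsinki-NLP/opus-mt-en-fr translation checkpoint (neither is prescribed by the walkthrough above); it runs the encoder once, then greedily decodes one token at a time while attending to the encoder output.

```python
# Illustrative sketch of the encoder-decoder translation loop above, assuming the
# Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-fr checkpoint.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Steps 1-3: tokenize the English input and run the encoder once to obtain
# contextual vectors for every input token.
inputs = tokenizer("I love cats", return_tensors="pt")
encoder_out = model.get_encoder()(**inputs)

# Step 4: start from the decoder start token and greedily append one token at a
# time, attending to the fixed encoder output at every step.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    logits = model(encoder_outputs=encoder_out,
                   attention_mask=inputs["attention_mask"],
                   decoder_input_ids=decoder_ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy choice
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:              # "<end>"
        break

# Step 5: the generated French sentence, e.g. "J'aime les chats".
print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```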

Overview of Language Model Architectures

Encoder-only

Encoder-only transformers, like BERT (Bidirectional Encoder Representations from Transformers), consist solely of a stack of encoder layers. These models are designed to understand and process input sequences by attending to all tokens bidirectionally.

Applications:

  • Text Classification: Tasks like sentiment analysis where the entire input sequence needs to be understood as a whole.
  • Named Entity Recognition (NER): Identifying and classifying entities within a text.
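As a quick illustration of the classification use case, here is a minimal sketch assuming the Hugging Face transformers library; by default its sentiment pipeline loads a DistilBERT (BERT-family) checkpoint fine-tuned for sentiment, which is an assumption of this example rather than something stated above.

```python
# Minimal sketch: an encoder-only (BERT-family) model used for sentiment analysis,
# assuming the Hugging Face transformers library and its default sentiment checkpoint.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I love cats"))  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```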

Encoder-Decoder Models

Encoder-decoder transformers, like T5 or BART, consist of two separate stacks: an encoder that processes the input sequence and a decoder that generates the output sequence.

Applications:

  • Machine Translation: They excel at tasks requiring the transformation of one sequence into another, such as translating text from one language to another.
  • Summarization: Generating a summary of a long document.
  • Question Answering: Where the encoder processes the context and the decoder generates the answer.
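For a concrete feel of the sequence-to-sequence use case, here is a minimal summarization sketch; the transformers library and the public facebook/bart-large-cnn checkpoint are assumptions chosen only for illustration.

```python
# Minimal sketch: an encoder-decoder model (BART) used for summarization, assuming
# the Hugging Face transformers library and the facebook/bart-large-cnn checkpoint.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "Transformers process text with attention instead of recurrence. The encoder "
    "reads the whole input and builds contextual vectors, while the decoder uses "
    "those vectors to generate the output sequence one token at a time."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```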

Decoder-Only Models

Decoder-only transformers, like GPT (Generative Pre-trained Transformer), consist solely of a stack of decoder layers. Each layer includes self-attention mechanisms and feed-forward neural networks. The self-attention mechanism in each layer allows the model to attend to all previous tokens in the sequence.

Applications:

  • Autoregressive Text Generation: They are particularly well-suited for tasks that involve generating text, such as language modeling, text completion, and creative writing. For example, generating a paragraph of text based on a given prompt.
  • Conversational AI: Useful in chatbots and dialogue systems where the model generates responses based on the conversation history.
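A minimal sketch of autoregressive generation with a decoder-only model follows; the transformers library and the public "gpt2" checkpoint are assumptions used purely for illustration.

```python
# Minimal sketch: a decoder-only model (GPT-2) continuing a prompt token by token,
# assuming the Hugging Face transformers library and the public "gpt2" checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time, in a quiet village,", max_new_tokens=30)
print(result[0]["generated_text"])
```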

How Do Decoder-Only Models Work?

Key Components

Token Input Layer: This is where the model receives the sequence of input tokens (words or subwords).

Decoder Block Layer:

  • Positional Encoding: Adds information about the position of each token so the model understands the order of words in the sequence.
  • Masked Multi-Head Self-Attention Layer: Lets each token attend to the tokens that precede it. The "masked" aspect ensures the model predicts each token without looking at future tokens, preserving the model's autoregressive nature.
  • Fully Connected Layers: Further process the information from the self-attention layer.

Output Layer: A fully connected layer followed by a softmax function to predict the next token in the sequence.

Depending on the requirements, a Decoder-Only Model can have multiple Decoder Blocks stacked on top of each other. The real driving force behind these models is the Masked Self-Attention mechanism. This mechanism allows the model to focus on different parts of the input sequence when predicting each token, facilitating the generation of contextually relevant text.
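To make these pieces concrete, here is a minimal, self-contained sketch of a decoder block with masked self-attention in PyTorch; the dimensions, layer counts, and GELU/LayerNorm details are illustrative assumptions, not the configuration of any specific model.

```python
# Minimal sketch of a decoder-only stack: embeddings + positional encodings,
# decoder blocks with masked (causal) self-attention, and a softmax output head.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + a)                  # residual + layer norm
        return self.ln2(x + self.ff(x))      # fully connected (feed-forward) sub-layer

vocab, d_model, T = 1000, 256, 8
tok_emb = nn.Embedding(vocab, d_model)       # token input layer
pos_emb = nn.Embedding(T, d_model)           # positional encoding (learned, for brevity)
blocks = nn.Sequential(*[DecoderBlock() for _ in range(2)])  # stacked decoder blocks
head = nn.Linear(d_model, vocab)             # output layer before softmax

tokens = torch.randint(0, vocab, (1, T))
x = tok_emb(tokens) + pos_emb(torch.arange(T))
logits = head(blocks(x))                     # (1, T, vocab)
next_token_probs = logits[:, -1].softmax(-1) # distribution over the next token
```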

During inference or production, Decoder-Only Models employ algorithms like Greedy Search or Sampling to choose the most appropriate words for generating the next part of the text. This ability to generate text makes Decoder-Only Models particularly useful in creating content that requires a high degree of contextual understanding and coherence, making them ideal for applications that involve human-like interaction and content creation.
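As a toy illustration of the difference between these decoding strategies (using made-up scores rather than a real model): greedy search always takes the highest-probability token, while sampling draws from the whole distribution.

```python
# Toy illustration of greedy search vs. sampling over a made-up 4-token vocabulary.
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])           # scores a model might emit
probs = torch.softmax(logits, dim=-1)                  # next-token distribution

greedy_choice = torch.argmax(probs).item()             # always the most likely token
sampled_choice = torch.multinomial(probs, 1).item()    # random draw, weighted by probs

print(probs, greedy_choice, sampled_choice)
```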


If the decoder alone already gives a model its generative ability, wouldn't adding an encoder component on top only help?

Causal Decoder (CD) vs Encoder-Decoder (ED)

In the study "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?" by Wang et al. (ICML 2022), researchers compared different language model architectures and training methods. They found that:

  • Causal Decoder-Only Models: Trained with a plain language-modeling objective to predict the next word, these models showed the strongest zero-shot generalization to new tasks straight after self-supervised pretraining.
  • Non-Causal Models: Models trained to fill in missing words (masked language modeling) and then multitask-finetuned on labeled data performed best overall once that extra finetuning was applied.

Choosing between decoder-only and encoder-decoder architectures involves several factors, as outlined below:

Cost of Training:

  • Encoder-Decoder Models: Achieve maximum potential with multitask finetuning on labeled data, which is costly, especially for larger models.
  • Decoder-Only Models: Perform well with zero-shot generalization using self-supervised learning on large datasets, making them cost-effective.

Emergent Ability:

  • The models in the study (with around 5 billion parameters and 170 billion tokens) are not large enough to exhibit emergent abilities.
  • Emergent abilities refer to sophisticated capabilities that arise naturally as models scale, enabling complex reasoning and task decomposition.
  • These abilities reduce the performance gap between decoder-only and encoder-decoder models without multitask finetuning.

In-Context Learning from Prompts:

  • Prompting techniques (like few-shot examples) help models understand tasks better.
  • In decoder-only models, prompts act more directly: the same stack that reads the prompt also generates the output, so there is no separate encoder representation in between.
  • Encoder-decoder models require careful tuning of the encoder for optimal performance with prompts.
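The sketch below shows what few-shot prompting looks like for a decoder-only model. The transformers library and the small "gpt2" checkpoint are assumptions for illustration, and a model this small may not answer reliably, but the mechanism is the point: the task is specified entirely inside the prompt.

```python
# Few-shot (in-context) prompting sketch: the task is described inside the prompt
# and a decoder-only model simply continues the text. "gpt2" is used only as a
# small, public stand-in; tiny models may answer unreliably.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = (
    "Review: The movie was fantastic. Sentiment: positive\n"
    "Review: I wasted two hours of my life. Sentiment: negative\n"
    "Review: The plot surprised me in the best way. Sentiment:"
)
print(generator(prompt, max_new_tokens=2)[0]["generated_text"])
```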

Efficiency Optimization:

  • Decoder-only models cache and reuse the Key (K) and Value (V) projections of previous tokens (the KV cache), enhancing efficiency and reducing computational cost during inference.
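A brief sketch of what that reuse looks like in practice, assuming the transformers library and the "gpt2" checkpoint: past key/value tensors from one forward pass are handed back so the next step only processes the newly generated token.

```python
# KV-cache sketch with a decoder-only model, assuming the Hugging Face transformers
# library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
out = model(ids, use_cache=True)                 # full pass: keys/values are cached
past = out.past_key_values
next_id = out.logits[:, -1].argmax(-1, keepdim=True)

# The next step feeds ONLY the new token; the cached K/V stand in for the prefix.
out2 = model(next_id, past_key_values=past, use_cache=True)
print(tokenizer.decode(next_id[0]))
```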

Autoregressive vs Bidirectional Attention:

  • Decoder-Only (Causal Decoder): Uses autoregressive (causal) attention; the resulting lower-triangular attention matrix stays full rank, which theoretically offers stronger expressive capability.
  • Encoder-Decoder: Uses bidirectional attention, which speeds up learning but may limit the model’s ability to learn deeper predictive patterns.
  • Experiments show that a mix of forward and backward attention (Forward-Backward attention) performs slightly better than full bidirectional attention but the difference is marginal with sufficient training.
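As a toy illustration of the structural difference (not the formal argument from the literature), the causal pattern is lower-triangular while bidirectional attention is unrestricted:

```python
# Toy comparison of the attention patterns: causal (autoregressive) vs. bidirectional.
import torch

T = 4
causal = torch.tril(torch.ones(T, T))    # row i attends only to positions <= i
bidirectional = torch.ones(T, T)         # every position attends to every position

print(causal)
print(bidirectional)
print(torch.linalg.matrix_rank(causal))  # lower-triangular => full rank, the intuition
                                         # behind the expressiveness argument
```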

Conclusion

The popularity of the decoder-only architecture comes from its simplicity, good zero-shot generalization, and lower training cost to reach reasonable performance. Much work has compared decoder-only and encoder-decoder architectures, but given sufficient training and model size, there is no hard evidence that one architecture is ultimately superior to the other in final performance.

References

Wang et al. (2022). What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? ICML 2022. https://proceedings.mlr.press/v162/wang22u/wang22u.pdf

Alammar, J. (2018). The Illustrated Transformer [Blog post]. https://jalammar.github.io/illustrated-transformer/

Wei et al. (2022). Emergent Abilities of Large Language Models. https://arxiv.org/abs/2206.07682



