LLM Transformer Overview ...for the busy AI Engineer

[Reposted from https://www.anup.io/p/llm-transformer-overview]

Introduction

Large Language Models (LLMs) are built on the Transformer architecture introduced in the Attention Is All You Need (AIAYN) paper in 2017.

Transformers are based solely on attention mechanisms, dispensing with the recurrence of Recurrent Neural Networks (RNNs) and the convolutions of Convolutional Neural Networks (CNNs). RNNs were traditionally the workhorse of Natural Language Processing (NLP), while CNNs dominated Computer Vision.

Before the Transformer, the dominant approach to sequence-to-sequence modelling (also referred to as sequence transduction) was built on recurrent encoder-decoder architectures. Sequence-to-sequence modelling transforms an input sequence into an output sequence. Examples include:

  • Text Translation - Converting text between languages
  • Text Summarisation - Condensing long text into shorter versions
  • Conversation - Generating responses to questions

The Transformer described in the AIAYN paper is based on an encoder-decoder architecture, illustrated in Figure 1 of the paper.

Transformers operate on tokens, which are units of data like words, sub-words, or characters. For example:

  • The string "tokenisation" is decomposed into the sub-words "token" and "isation".
  • A short and common word like "the" is represented as a single token.

As a rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text.
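
As an illustration, here is a minimal sketch using the tiktoken library (an assumption on my part: any byte-pair-encoding tokeniser would do, and the exact split points and token counts depend on the tokeniser's vocabulary):

```python
# pip install tiktoken  (assumed to be installed)
import tiktoken

# Load a byte-pair-encoding tokeniser; the vocabulary is model-specific.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "tokenisation"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")
```

A short, common word like "the" maps to a single token, while longer or rarer words split into several pieces.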

The three main variants of Transformers are:

BART vs BERT vs GPT

  • BART - Bidirectional and Auto-Regressive Transformer
  • BERT - Bidirectional Encoder Representations from Transformers
  • GPT - Generative Pre-trained Transformer

Key Architectural Concepts

Bidirectional means that the transformer attends to a single token by looking at tokens to the left (before) and to the right (after) to fully understand the sequence. Bidirectionality corresponds to the encoder stack and the multi-head attention layer in the Transformer architecture.

Encoders that look bidirectionally are good at understanding input.

Auto-Regressive means that the value at a particular time (or position in a sequence) depends on its own previous values. Auto-Regressive predictions correspond to the decoder stack (and the Masked Multi-Head Attention layer).

Decoders that mask all words (more accurately, tokens) after the current position are well suited to generation, producing output one token at a time.
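
The difference between the two attention patterns can be visualised with a toy mask over a 5-token sequence (a minimal NumPy sketch, not the actual AIAYN implementation):

```python
import numpy as np

seq_len = 5  # a toy sequence of 5 tokens

# Encoder (bidirectional): every position may attend to every other position,
# so the attention mask allows the full sequence.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

# Decoder (auto-regressive): a causal mask hides positions to the right,
# so position i can only attend to positions 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print("Encoder (bidirectional) mask:\n", bidirectional_mask)
print("Decoder (causal) mask:\n", causal_mask)
```

In the Masked Multi-Head Attention layer, positions where the mask is 0 receive a score of negative infinity before the softmax, so they contribute nothing to the output.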

Comparison of Transformer Variants:

  • BERT - encoder-only; uses bidirectional attention; best suited to understanding input (e.g. classification and search).
  • GPT - decoder-only; uses auto-regressive (masked) attention; best suited to generating text one token at a time.
  • BART - encoder-decoder; pairs a bidirectional encoder with an auto-regressive decoder; well suited to sequence-to-sequence tasks such as translation and summarisation.

Modern LLMs are based on the Decoder-only GPT architecture. Examples include:

  • OpenAI's GPT-series models
  • Anthropic's Claude-series models
  • Meta's Llama-series models

These GPT-based LLMs power many current Generative AI (GenAI) applications.

Original Transformer Processing Pipeline

Here is what happens when you feed a sequence of words into the Transformer described in the AIAYN paper (a toy sketch of these steps follows the list):

  1. Tokenisation: Splits input text into token units.
  2. Embedding: Transforms each token into a vector (a list of numbers) using a learned embedding representation.
  3. Positional Encoding: Adds sequence position information to each token, i.e., keeps track of word positions.
  4. Residual Connection: Adds each sub-layer's input back to its output, preserving information and easing gradient flow through deep stacks of layers.
  5. Layer Normalisation: Normalises activations to stabilise and speed up training.
  6. Multi-Headed Attention: Lets each token attend to other tokens from multiple learned perspectives (heads) in parallel.
  7. Feed Forward Neural Network: Applies a position-wise transformation to each token's representation after attention.
  8. Encoder Block: Processes input bidirectionally for understanding.
  9. Decoder Block: Generates tokens based on the previous sequence (auto-regressive).
  10. Linear Projection: Calculates raw scores (logits) for vocabulary tokens.
  11. Softmax: Converts logits into probability distributions for token selection.
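
To make these steps concrete, here is a minimal, self-contained NumPy sketch of a single decoder-style block with random weights (illustrative only: real models learn these weights, use multiple attention heads, and stack many such blocks; step 8, the encoder block, is omitted because this sketch is decoder-only). The numbered comments map to the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 50, 16, 4

# 1-2. Toy token IDs (as if produced by a tokeniser) and a random embedding table.
token_ids = np.array([3, 17, 42, 8])
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids]                      # shape: (seq_len, d_model)

# 3. Sinusoidal positional encoding, as defined in the AIAYN paper.
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
x = x + np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)           # 11. numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(z, eps=1e-5):                        # 5. normalise each token's features
    return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + eps)

# 6 + 9. Single-head causal self-attention (a real decoder block uses several heads).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d_model)
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
attn_out = softmax(np.where(causal, scores, -np.inf)) @ v

# 4 + 5. Residual connection followed by layer normalisation.
x = layer_norm(x + attn_out)

# 7. Position-wise feed-forward network (one hidden layer with ReLU), plus residual + norm.
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
x = layer_norm(x + np.maximum(0, x @ W1) @ W2)

# 10-11. Linear projection to vocabulary logits (tied to the embedding table for brevity),
# then softmax to get a probability distribution for the next token.
logits = x @ embedding_table.T                      # shape: (seq_len, vocab_size)
next_token_probs = softmax(logits[-1])
print("Most likely next token id:", int(next_token_probs.argmax()))
```

Sampling the next token from next_token_probs, appending it to the input, and repeating is the auto-regressive loop that GPT-style models use to generate text.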


Transformers have revolutionised machine learning, laying the foundation for modern Generative AI and reshaping the future of AI-driven innovation. If you want to learn more, I’d highly recommend the following resources:
