LLM Transformer Overview ...for the busy AI Engineer
[Reposted from https://www.anup.io/p/llm-transformer-overview]
Introduction
Large Language Models (LLMs) are built on the Transformer architecture introduced in the Attention Is All You Need (AIAYN) paper in 2017.
Transformers are based solely on attention mechanisms and dispense with recurrence (as used in Recurrent Neural Networks or RNNs) and convolutions (as used in Convolutional Neural Networks or CNNs). RNNs are typically used in Natural Language Processing or NLP, and CNNs are used for Computer Vision.
Before the Transformer architecture, the dominant machine learning (ML) approach was sequence-to-sequence modelling (also referred to as sequence transduction), which transforms an input sequence into an output sequence. Examples include machine translation (e.g. translating English to French) and text summarisation.
The Transformer described in the AIAYN paper is based on an encoder-decoder architecture (see Figure 1 of the paper).
Transformers operate on tokens, which are units of data such as words, sub-words, or characters. For example, a word like "tokenization" might be split into the sub-word tokens "token" and "ization".
As a rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text.
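To see this in practice, here is a minimal sketch using OpenAI's tiktoken library (my choice of library; any BPE tokenizer would illustrate the same idea):

```python
# Minimal tokenization sketch using the tiktoken library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

text = "Transformers operate on tokens, not raw characters."
token_ids = enc.encode(text)

print(f"Characters: {len(text)}")
print(f"Tokens:     {len(token_ids)}")
print(f"Chars per token: {len(text) / len(token_ids):.2f}")  # roughly 4 for English text

# Inspect how the text was split into sub-word pieces
print([enc.decode([t]) for t in token_ids])
```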
The three main variants of Transformers are encoder-only (e.g. BERT), decoder-only (e.g. GPT), and encoder-decoder (e.g. T5 and the original Transformer from the AIAYN paper).
Key Architectural Concepts
Bidirectional means that, to represent a given token, the transformer attends to tokens both to the left (before) and to the right (after) of it, so it can use the full sequence as context. Bidirectionality corresponds to the encoder stack and the Multi-Head Attention layer in the Transformer architecture.
Encoders that look bidirectionally are good at understanding input.
Auto-Regressive means that the value at a particular time (or position in a sequence) depends on its own previous values. Auto-Regressive predictions correspond to the decoder stack (and the Masked Multi-Head Attention layer).
Decoders that mask all tokens after the current position in the sequence are good at generation, producing one word (or, more accurately, one token) at a time.
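To make the difference concrete, here is a small sketch of my own (using PyTorch, which the article does not prescribe) of the attention masks behind these two behaviours: an encoder uses no mask and sees the whole sequence, while a decoder applies a causal (lower-triangular) mask so each position can only attend to itself and earlier positions.

```python
# Sketch of bidirectional vs. causal (masked) attention, using PyTorch.
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8
q = k = v = torch.randn(seq_len, d_model)  # toy query/key/value for one sequence

scores = q @ k.T / d_model**0.5  # scaled dot-product attention scores

# Encoder (bidirectional): no mask, every position attends to every other position.
encoder_weights = F.softmax(scores, dim=-1)

# Decoder (auto-regressive): causal mask hides positions to the right of the current one.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
decoder_weights = F.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(encoder_weights.round(decimals=2))  # dense: full rows of non-zero weights
print(decoder_weights.round(decimals=2))  # lower-triangular: zeros above the diagonal
```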
Comparison of Transformer Variants:
- Encoder-only (e.g. BERT): bidirectional attention; good at understanding input.
- Decoder-only (e.g. GPT): auto-regressive, masked attention; good at generating output one token at a time.
- Encoder-Decoder (e.g. T5, the original Transformer): combines both; suited to sequence-to-sequence tasks such as translation.
Modern LLMs are based on the decoder-only GPT architecture. Examples include OpenAI's GPT series, Meta's Llama, Anthropic's Claude, and Mistral.
These GPT-based LLMs power many current Generative AI (GenAI) applications.
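As an illustration of what "decoder-only and auto-regressive" means in practice, here is a minimal greedy decoding loop (a sketch of my own, assuming the Hugging Face transformers library and the small GPT-2 checkpoint, neither of which the article prescribes):

```python
# Greedy, token-by-token generation with a decoder-only model (GPT-2 via Hugging Face).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate 20 new tokens, one at a time
        logits = model(input_ids).logits                          # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # append and repeat

print(tokenizer.decode(input_ids[0]))
```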
Original Transformer Processing Pipeline
What happens when you put a sequence of words into a Transformer, as described in the AIAYN paper:
1. The input text is split into tokens, and each token is mapped to an embedding vector.
2. Positional encodings are added to the embeddings so the model knows the order of the tokens.
3. The encoder stack applies Multi-Head Attention and feed-forward layers to build a representation of the input.
4. The decoder stack attends to the encoder output and, via Masked Multi-Head Attention, to the tokens it has generated so far.
5. A final linear layer and softmax turn the decoder output into a probability distribution over the vocabulary, from which the next token is chosen.
6. The predicted token is appended to the output, and the process repeats until the sequence is complete.
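To ground these steps, here is a compact sketch (my own simplification in PyTorch; single head, single layer, no feed-forward block or layer norm, so it is not the full model from the paper) covering embedding, sinusoidal positional encoding, one self-attention step, and the final projection to vocabulary logits:

```python
# Toy pass through the main stages of the original Transformer pipeline (simplified).
import torch
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 16, 6

token_ids = torch.randint(0, vocab_size, (seq_len,))        # 1. tokenized input
embedding = torch.nn.Embedding(vocab_size, d_model)
x = embedding(token_ids)                                     #    token embeddings

# 2. Sinusoidal positional encoding, as defined in the AIAYN paper
pos = torch.arange(seq_len).unsqueeze(1).float()
i = torch.arange(0, d_model, 2).float()
angles = pos / (10000 ** (i / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(angles)
pe[:, 1::2] = torch.cos(angles)
x = x + pe

# 3. One (single-head) self-attention step
q_proj, k_proj, v_proj = (torch.nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = q_proj(x), k_proj(x), v_proj(x)
attn = F.softmax(q @ k.T / d_model**0.5, dim=-1)
x = attn @ v

# 4/5. Final linear layer + softmax over the vocabulary
to_vocab = torch.nn.Linear(d_model, vocab_size)
probs = F.softmax(to_vocab(x), dim=-1)
print(probs.shape)  # (seq_len, vocab_size): a next-token distribution per position
```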
Transformers have revolutionised machine learning, laying the foundation for modern Generative AI and reshaping the future of AI-driven innovation. If you want to learn more, I’d highly recommend the following resources: