Read the white paper “Foundational Large Language Models & Text Generation”.
Transformer
The transformer architecture was developed at Google in 2017 for use in a translation model.
It’s a sequence-to-sequence model capable of converting sequences from one domain into sequences in another domain. For example, translating French sentences to English sentences. The original transformer architecture consists of two parts: an encoder and a decoder. The encoder converts the input text (e.g., a French sentence) into a representation, which is then passed to the decoder. The decoder uses this representation to generate the
output text (e.g., an English translation) autoregressively.
Notably, the size of the output of the transformer encoder is linear in the size of its input. Figure 1 shows the design of the original transformer architecture.
The transformer consists of multiple layers. A layer in a neural network comprises a set of parameters that perform a specific transformation on the data. In the diagram you can see examples of layers, including Multi-Head Attention, Add & Norm, Feed-Forward, Linear, and Softmax. The layers can be sub-divided into input, hidden, and output layers. The input layer (e.g., Input/Output Embedding) is the layer where the raw data enters the
network. Input embeddings are used to represent the input tokens to the model. Output embeddings are used to represent the output tokens that the model predicts. For example, in a machine translation model, the input embeddings would represent the words in the source language, while the output embeddings would represent the words in the target language. The output layer (e.g., Softmax) is the final layer that produces the output of the network. The hidden layers (e.g., Multi-Head Attention) are between the input and output layers and are
where the magic happens.
Step 1:
Input preparation and embedding
1. Normalization (Optional): Standardizes text by removing redundant whitespace, accents, etc.
2. Tokenization: Breaks the sentence into words or subwords and maps them to integer token IDs from a vocabulary.
3. Embedding: Converts each token ID to its corresponding high-dimensional vector, typically using a lookup table. These can be learned during the training process.
4. Positional Encoding: Adds information about the position of each token in the sequence to help the transformer understand word order.
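As a rough illustration, here is a minimal sketch of these steps in Python. The whitespace tokenizer and six-word vocabulary are toy assumptions; real models use learned subword tokenizers (e.g., BPE or SentencePiece), and the embedding table is learned during training.

```python
import numpy as np

# Toy vocabulary and embedding table (both invented for illustration).
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sits": 3, "on": 4, "mat": 5}
d_model = 8  # embedding dimension (tiny here; thousands in real models)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def prepare(sentence: str) -> np.ndarray:
    # 1. Normalization: lowercase and collapse redundant whitespace.
    text = " ".join(sentence.lower().split())
    # 2. Tokenization: map each word to its integer token ID.
    ids = [vocab.get(w, vocab["<unk>"]) for w in text.split()]
    # 3. Embedding: look up each ID's vector in the embedding table.
    return embedding_table[ids]  # shape: (seq_len, d_model)

x = prepare("The cat sits on the mat")
print(x.shape)  # (6, 8); step 4, positional encoding, is added on top
```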
Step 2:
Multi-head attention
After converting input tokens into embedding vectors, you feed these embeddings into the multi-head attention module (see Figure 1). Self-attention is a crucial mechanism in transformers; it enables them to focus on specific parts of the input sequence relevant to the task at hand and to capture long-range dependencies within sequences more effectively than traditional RNNs.
To understand multi-head attention, first consider single-head attention, as described below:
Self-attention is a core mechanism in transformer-based models that allows a model to determine relationships between different words in a sentence. For instance, in the sentence "The tiger jumped out of a tree to get a drink because it was thirsty," self-attention identifies that "the tiger" and "it" refer to the same entity, creating a connection between them. This is achieved through the following steps:
- Creating Queries, Keys, and Values: Each word embedding is multiplied by learned weight matrices to generate query, key, and value vectors. These vectors enable the model to identify relevant words (query), assign relevance (key), and hold word information (value).
- Calculating Scores: The model calculates scores by taking the dot product between query and key vectors, which reveals how strongly each word relates to others.
- Normalization: Scores are stabilized by dividing by the square root of the key dimension and applying a softmax function, producing attention weights that show word connections.
- Weighted Values: Each value vector is weighted by its attention score, resulting in a context-aware representation for each word.
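These steps can be written out compactly. Below is a minimal single-head self-attention sketch in NumPy; the weight matrices here are random stand-ins for what would be learned parameters.

```python
import numpy as np

def softmax(scores, axis=-1):
    e = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Create queries, keys, and values from the input embeddings.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    # Calculate scores, then normalize by sqrt(d_k) and apply softmax.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Weight the values to get context-aware representations.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
x = rng.normal(size=(seq_len, d_model))          # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)    # (6, 4)
```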
To handle complex patterns, multi-head attention employs multiple sets of Q, K, and V matrices. Each head processes relationships differently, and the combined output provides a richer representation, enhancing the model's understanding of intricate language patterns, long-range dependencies, and tasks like translation and summarization.
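A sketch of the multi-head version under the same toy assumptions: each head has its own projection matrices, and the concatenated head outputs are mixed by an output projection (called W_o here, also a learned matrix in a real model).

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads, W_o):
    outs = []
    for W_q, W_k, W_v in heads:  # one (Q, K, V) projection set per head
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        outs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 2
d_k = d_model // n_heads
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(d_model, d_model))
x = rng.normal(size=(seq_len, d_model))
print(multi_head_attention(x, heads, W_o).shape)  # (6, 8)
```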
Step 3:
Layer Normalization and Residual Connections
In a transformer model, each layer includes mechanisms called layer normalization and residual connections to help the model learn better and more effectively. Here’s how they work:
- Layer Normalization: This process adjusts the values (or activations) within a layer to have a similar scale. It calculates the mean and variance of the activations, then uses these to standardize the values. This keeps the model stable, speeds up training, and improves overall performance by reducing the risk of large shifts in activations across layers.
- Residual Connections: These connections add the original input of a layer to its output. By "skipping" over parts of the layer, residual connections help the model avoid issues like "vanishing gradients" (where information fades as it moves through layers) and make learning easier for the model.
In the transformer, Add and Norm refers to this combined process of adding residual connections and normalizing values. It’s applied to both the multi-head attention module and the feedforward layer.
- Feedforward Layer: After the multi-head attention and normalization, the data goes into a feedforward layer. This layer processes each position in the input sequence individually, adding more non-linear complexity (often using functions like ReLU or GELU). This step helps the model develop richer, more powerful representations of the data.
Finally, another Add and Norm step is applied after the feedforward layer, keeping the model stable and effective, even with many layers.
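Here is a minimal sketch of these Add & Norm and feedforward steps; in a real transformer the layer norm also has learned gain and bias parameters, and the FFN weights are learned, both simplified or randomized here for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardize each position's activations to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection ("Add") followed by layer normalization ("Norm").
    return layer_norm(x + sublayer_out)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN with a ReLU non-linearity.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32
x = rng.normal(size=(seq_len, d_model))
attn_out = rng.normal(size=(seq_len, d_model))  # stand-in for attention output
h = add_and_norm(x, attn_out)                   # first Add & Norm
ffn_out = feed_forward(h,
                       rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                       rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
y = add_and_norm(h, ffn_out)                    # second Add & Norm
print(y.shape)  # (6, 8)
```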
Step 4:
Encoder, Decoder, and Recent Decoder-Only Models
- Encoder: Imagine we’re using a transformer to translate the sentence “The cat sits on the mat” into French. The encoder’s job is to understand this sentence fully. First, each word in the input sentence is tokenized, embedded, and given positional encodings (to retain word order). Using self-attention, the encoder ensures each word can reference others to grasp context. For instance, “cat” can relate to “sits” and “mat,” forming a complete idea of “The cat sits on the mat.” The encoder’s output (Z) is a context-rich representation of the whole sentence, ready for the decoder to use.
- Decoder: Now the decoder’s job is to generate the translated sentence in French, e.g., “Le chat est assis sur le tapis.” It begins with a start-of-sequence token and generates one token at a time: “Le,” then “chat,” then “est,” and so on. At each step, the decoder uses masked self-attention (sketched after this list) so each new token only looks at the ones generated before it, maintaining an ordered flow. Through encoder-decoder cross-attention, it also refers back to the encoder’s output Z, focusing on the relevant English words. This continues until the decoder reaches an end-of-sequence token, completing the translation.
- Decoder-Only Models in Recent LLMs: In tasks like text generation, summarization, or question-answering, many modern large language models use a decoder-only approach. For example, when using GPT-4 to generate a paragraph, the input prompt “Write a summary of climate change impacts” is processed by a single decoder. The decoder starts with the prompt and generates each following word in the response using masked self-attention, looking back at the words it has already generated to maintain a logical flow. The process repeats until it completes the response, without a separate encoder module.
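The masked self-attention used by the decoder can be sketched with a causal mask that blocks attention to future positions; the inputs below are random stand-ins.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Mask out the upper triangle: position i may not attend to j > i.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_k = 5, 4
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out = causal_attention(Q, K, V)
# Row i of the attention weights is zero beyond column i, so each token
# only "sees" itself and earlier tokens, enabling autoregressive generation.
```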
Training the Transformer:
Preparing the Data
- Data Collection: A large and diverse dataset is gathered from sources like books, websites, articles, etc., to ensure the model learns from a variety of language structures and contexts.
- Data Preprocessing: This step involves cleaning the data by removing special characters, standardizing formats, and breaking down the text into tokens (the smallest units of meaning, like words or parts of words).
- Tokenization: Text is divided into tokens, and each token is assigned an ID. A vocabulary list is created for all unique tokens that the model will recognize.
Encoding Input Sequences
- Each token is converted into an embedding, a numerical representation that the model can process.
- Positional Encoding: Since transformers don’t have a built-in sense of order (unlike RNNs), positional encodings are added to these embeddings to give each word a unique position in the sequence. This allows the model to understand the order of words, which is crucial for language.
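As an illustration, the sinusoidal positional encodings from the original transformer paper can be computed as below; many recent models instead learn positional embeddings or use relative schemes such as RoPE.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

# The encodings are simply added to the token embeddings:
seq_len, d_model = 6, 8
embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
x = embeddings + positional_encoding(seq_len, d_model)
```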
Self-Attention Mechanism and Multi-head Attention:
- The embedded sequences pass through the self-attention and multi-head attention layers described in Step 2, building context-aware representations of each token.
Residual Connections and Layer Normalization:
- Each sub-layer’s output is added to its input and normalized (the Add & Norm step from Step 3), keeping training stable across many stacked layers.
Training Objective (Next-Token Prediction):
- At each position, the model predicts the next token in the sequence; the cross-entropy between its predicted distribution and the actual next token is the training loss (see the sketch after this outline).
Backpropagation and Optimization:
- The loss is backpropagated through the network, and an optimizer (typically a variant of Adam) updates the weights to reduce it.
Repeating Across Layers and Epochs:
- This forward-backward cycle repeats across all transformer layers and over many passes (epochs) through the training data until the loss converges.
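A minimal sketch of the next-token prediction loss, using random logits as a stand-in for real model output: each position is trained to predict the token that follows it, and the cross-entropy of that prediction is what backpropagation minimizes.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def next_token_loss(logits, token_ids):
    # logits: (seq_len, vocab_size) model outputs; token_ids: (seq_len,).
    # Position t predicts token t+1, so drop the last logit, shift targets.
    probs = softmax(logits[:-1])
    targets = token_ids[1:]
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return nll.mean()  # average cross-entropy over positions

rng = np.random.default_rng(0)
vocab_size, seq_len = 50, 6
logits = rng.normal(size=(seq_len, vocab_size))     # stand-in model output
token_ids = rng.integers(vocab_size, size=seq_len)  # a training sequence
print(next_token_loss(logits, token_ids))
```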
Fine-Tuning:
- Fine-tuning an LLM is a costly undertaking, and it is done quite differently from fine-tuning other ML models.
- After training, an LLM can be further specialized with supervised fine-tuning (SFT) on task-specific datasets, enhancing its performance in targeted applications. Examples include:
- Instruction-tuning: Teaching the model to follow specific instructions, like summarizing text, writing code, or crafting poetry in a given style. Each data point consists of an input (prompt) and a demonstration (target response). For example: questions (prompt) and answers (target response); translations from one language (prompt) to another language (target response); a document to summarize (prompt) and the corresponding summary (target response).
- Dialogue-tuning: Fine-tuning on conversational data to improve multi-turn interactions, where the model learns to respond effectively in dialogue.
- Safety-tuning: Implementing safeguards to reduce biased or harmful outputs, often using methods like reinforcement learning from human feedback (RLHF) and human validation to promote safe, ethical responses.
Reinforcement learning from human feedback
Typically, after performing SFT, a second stage of fine-tuning occurs which is called reinforcement learning from human feedback (RLHF). This is a very powerful fine-tuning technique that enables an LLM to better align with human-preferred responses (i.e. making its responses more helpful, truthful, safer, etc.).
While this is simple in theory, in practice fully fine-tuning an LLM with billions of parameters for both SFT and RLHF is prohibitively expensive, so instead we use parameter-efficient fine-tuning (PEFT).
Adapter-based fine-tuning and Low-Rank Adaptation (LoRA) are efficient techniques for fine-tuning large language models with minimal resource requirements:
- Adapter-based Fine-tuning: This approach adds small modules, called adapters, to a pretrained model. Only the adapter parameters are trained, leaving the rest of the model unchanged. This method significantly reduces the number of parameters to be trained compared to traditional fine-tuning, making it more efficient.
- Low-Rank Adaptation (LoRA): LoRA fine-tunes the model using two smaller matrices to approximate updates to the original weight matrix, rather than retraining the entire model. It freezes the original weights and trains these smaller update matrices, minimizing computational resources and inference latency. LoRA can be further optimized with variants like QLoRA, which uses quantized weights for even greater efficiency. One key advantage of LoRA is that these update modules are plug-and-play, allowing you to swap or replace them easily based on the task, which simplifies model transfer and specialization.
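A minimal LoRA sketch, following the shapes from the LoRA paper: the pretrained weight W stays frozen, and only the two small matrices A and B are trained; B starts at zero so that training begins from the original model’s behavior.

```python
import numpy as np

d_in, d_out, r = 512, 512, 8   # rank r is much smaller than the layer size
rng = np.random.default_rng(0)

W = rng.normal(size=(d_in, d_out))     # pretrained weight, frozen
A = rng.normal(size=(d_in, r)) * 0.01  # trainable, small random init
B = np.zeros((r, d_out))               # trainable, zero init => W' = W at start

def lora_forward(x):
    # Original path plus the low-rank update; only A and B get gradients.
    return x @ W + (x @ A) @ B

# Parameter savings: d_in*d_out frozen vs. r*(d_in + d_out) trainable.
print(W.size, A.size + B.size)  # 262144 vs. 8192
```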
Once training is done, you cannot readily use the model for inference or day-to-day jobs; the next thing you need to do to get the desired output is prompting, a.k.a. prompt engineering.
Prompt engineering is crucial for maximizing the effectiveness of large language models (LLMs) by guiding them to produce desired outputs. It involves providing clear instructions, examples, and context to the model, helping it generate responses that are factual, creative, or tailored to specific tasks. Examples of prompt engineering include using structured instructions, background details, and emphasizing important information.
Key prompt techniques include:
- Few-shot prompting: Provides a few examples along with the task description. The model uses these examples to generate its response. For instance, if given a few countries and their capitals, the model can generate the capital of a new country.
- Zero-shot prompting: Involves giving the LLM just the task description without examples. The model uses its existing knowledge to generate the response. This method relies solely on the model's pre-trained data and can be less reliable than few-shot prompting.
- Chain-of-thought prompting: Encourages the model to break down complex tasks into smaller, logical steps. It demonstrates how to solve problems by reasoning step-by-step, which helps the model explain its reasoning before generating an answer.
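For illustration, here is what each style might look like as a raw prompt; these examples are invented, not taken from the white paper.

```python
# Zero-shot: task description only, no examples.
zero_shot = "Translate to French: The cat sits on the mat."

# Few-shot: a handful of examples establish the input/output pattern.
few_shot = """France -> Paris
Japan -> Tokyo
Italy -> Rome
Australia ->"""

# Chain-of-thought: cue the model to reason step by step before answering.
chain_of_thought = (
    "Q: A shop has 3 boxes with 12 apples each and sells 9 apples. "
    "How many apples are left?\n"
    "A: Let's think step by step."
)
```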
In addition to prompt engineering, sampling techniques influence the quality and diversity of LLM outputs:
- Greedy search: Chooses the most probable token at each step, which can lead to repetitive outputs.
- Random sampling: Selects tokens randomly based on their probabilities, adding creativity but increasing the risk of nonsensical outputs.
- Temperature sampling: Adjusts the diversity of outputs based on a temperature parameter. Higher temperatures favor more varied outputs.
- Top-K sampling: Samples from the top K most probable tokens, controlling randomness.
- Top-P (nucleus) sampling: Samples from the smallest dynamic subset of tokens whose cumulative probability reaches P, offering flexibility in choosing diverse or confident words.
- Best-of-N sampling: Generates multiple responses and selects the best one based on a predetermined metric, useful for tasks requiring high accuracy.
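These strategies can be sketched over a single next-token distribution; the logits below are random stand-ins for real model output.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=10)  # scores for a 10-token toy vocabulary

# Greedy search: always take the single most probable token.
greedy = int(np.argmax(logits))

# Temperature sampling: T > 1 flattens the distribution (more diverse),
# T < 1 sharpens it (more confident).
T = 0.8
probs = softmax(logits / T)
temperature_sample = rng.choice(len(probs), p=probs)

# Top-K sampling: keep only the K most probable tokens, renormalize, sample.
K = 5
top_k = np.argsort(probs)[-K:]
top_k_sample = top_k[rng.choice(K, p=probs[top_k] / probs[top_k].sum())]

# Top-P (nucleus) sampling: keep the smallest set of tokens whose
# cumulative probability reaches P, renormalize, sample.
P = 0.9
order = np.argsort(probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(probs[order]), P)) + 1
nucleus = order[:cutoff]
top_p_sample = nucleus[rng.choice(cutoff, p=probs[nucleus] / probs[nucleus].sum())]
```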
By combining prompt engineering with appropriate sampling methods and hyperparameters, you can guide LLMs to produce relevant, creative, and coherent outputs, enhancing their effectiveness for a variety of applications.
Next, we will look at the white paper on Prompt Engineering.