GPU and AI Model Optimizations
If you're involved with AI, you've likely been inundated with expert commentary on techniques like RAG, knowledge graphs, fine-tuning, prompt engineering, and instruction-tuning, along with recommended recipes for success, such as prompt engineering => RAG => fine-tuning, to unlock the value of GenAI.
Despite the abundance of information available on building AI solutions, productionizing LLMs at scale is hard and humbling. Successful implementations require a 'learn to do' rather than a 'learn to know' mindset. There are no shortcuts or easy DIY recipes.
Teams often struggle to move beyond limited POCs. In my experience, one key reason is the lack of focus and understanding of how these systems operate at a deeper level, including GPU, CPU, memory, and networking.
As an aside, if you need a quick overview of the various hardware technologies powering AI, you can refer here.
While knowledge of high-level concepts is undoubtedly important, there's no substitute for understanding AI at the system level. Knowing what to look for and what changes to make to achieve that extra 1% efficiency can sometimes require weeks or even months of painstaking work. Even when you are aware of system-level optimizations, many optimizations may not yield the expected return on investment and could remain purely textbook exercises. It's crucial to have a thorough understanding of how AI workloads interact at the system level to truly unlock their potential.
Where relevant, I've included additional references and guidance on implementation. Some techniques were developed in-house based on our custom needs or because they weren't supported by the ML framework we were using. In some cases, we found that the open-source implementations were inefficient, requiring significant modifications. Feel free to reach out if you'd like more details. Please note that this collection is not exhaustive—there are countless other techniques and concepts waiting to be explored. I hope you find it valuable.
Let's dive into the exciting world of optimizing AI systems for performance and efficiency. From model architecture to deployment, every step offers opportunities to enhance performance and improve efficiency. Whether you're training foundational models, fine-tuning existing ones, or simply focusing on inference, the goal remains the same: to create robust, high-performing AI solutions.
GPU-level optimizations
In the AI world, a kernel is a function launched from the host to run on the GPU. The nature of a kernel, whether it is compute-bound or memory-bound, largely determines the optimizations needed. Operations are classified as compute-bound or memory-bound based on the balance between computation and memory accesses. This balance is commonly quantified by the arithmetic intensity, the number of arithmetic operations performed per byte of memory accessed. A kernel is compute-bound if its arithmetic intensity is above the GPU's FLOPS-to-memory-bandwidth (ops:byte) ratio, meaning it performs many computations relative to the amount of data it accesses from memory. Kernels with lower arithmetic intensity are memory-bound.
As a rule of thumb, you want the kernel to be compute-bound rather than memory-bound to drive maximum efficiency and FLOPS throughput.
Memory-bound operations are primarily constrained by memory access times, with the actual computation taking comparatively little time. Their performance depends heavily on the number of memory accesses rather than the computational workload. Vector addition, softmax, and batch normalization are examples of memory-bound kernels with low arithmetic intensity. In contrast, compute-bound kernels, such as the Fast Fourier Transform and matrix multiplication with large inner dimensions, are dominated by computation rather than memory access.
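To make this concrete, here is a small Python sketch that estimates the arithmetic intensity of a matrix multiplication and compares it against a GPU's ops:byte ratio. The hardware numbers are assumptions (roughly A100-class: about 312 TFLOPS of FP16 compute and about 2 TB/s of HBM bandwidth); substitute the specs of your own GPU.

# Assumed GPU spec (roughly A100-class); replace with your own hardware numbers.
PEAK_FLOPS = 312e12                             # FP16 tensor-core FLOPS
MEM_BANDWIDTH = 2.0e12                          # bytes/s of HBM bandwidth
OPS_BYTE_RATIO = PEAK_FLOPS / MEM_BANDWIDTH     # ~156 FLOPs per byte moved

def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C (ideal case)
    return flops / bytes_moved

# Large square matmul: intensity ~1365 >> 156, so it is compute-bound
print(matmul_arithmetic_intensity(4096, 4096, 4096), OPS_BYTE_RATIO)
# Batch-1 decode-style matrix-vector product: intensity ~1 << 156, so it is memory-bound
print(matmul_arithmetic_intensity(1, 4096, 4096), OPS_BYTE_RATIO)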
Several techniques and approaches can significantly increase GPU utilization (compute-bound) and enhance algorithm efficiency. Many of these methods may already be available by default or can be explicitly enabled in the AI framework or compiler you are using. Understanding these concepts is crucial, especially when developing new code or optimizing existing code for better performance and efficiency.
Here are some specific techniques for optimizing memory access and improving compute efficiency.
1. Operator Fusion
In many computational tasks, especially in neural network operations, sequences of operations are performed on tensors. Operator fusion merges these sequences into a single, more complex operation, reducing the overhead of scheduling and executing multiple separate operations. It also keeps intermediate results on-chip instead of issuing separate memory LOADs and STOREs for each pointwise operator when there is a data dependency between them.
Here is a sample,
Without Operator Fusion:
import torch
import torch.nn.functional as F

A = torch.randn(100, 100)
running_mean = torch.zeros(100)
running_var = torch.ones(100)
B = F.relu(A)                                   # first kernel: writes B out to memory
C = F.batch_norm(B, running_mean, running_var)  # second kernel: reads B back from memory
With Operator Fusion:
A fused operation can be created that performs both ReLU and batch normalization in one step.
import torch
import torch.nn.functional as F

class FusedReLUAndBatchNorm(torch.autograd.Function):
    # Forward-only sketch of the fused op; a full implementation would also define
    # backward(), or the fusion would be generated by a compiler such as torch.compile.
    @staticmethod
    def forward(ctx, input, running_mean, running_var):
        relu_output = torch.relu(input)
        return F.batch_norm(relu_output, running_mean, running_var, training=True)

A = torch.randn(100, 100)
running_mean = torch.zeros(100)
running_var = torch.ones(100)
C = FusedReLUAndBatchNorm.apply(A, running_mean, running_var)
2. Kernel Fusion
This is the same concept as operator fusion but at the granularity of kernels: merging kernels that operate on the same data minimizes transfers across kernel boundaries and also reduces kernel-launch overhead.
Here is a sample,
Without Kernel Fusion:
__global__ void add(float* A, float* B, float* D, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        D[idx] = A[idx] + B[idx];
    }
}
__global__ void multiply(float* D, float* C, float* E, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        E[idx] = D[idx] * C[idx];
    }
}
With Kernel Fusion:
A single kernel performs both addition and multiplication.
__global__ void fused_add_multiply(float* A, float* B, float* C, float* E, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float temp = A[idx] + B[idx];
        E[idx] = temp * C[idx];
    }
}
// Launching the fused kernel
fused_add_multiply<<<numBlocks, blockSize>>>(A, B, C, E, N);
3. Recompute
Recompute optimization involves recomputing one or more values that were computed earlier, which might seem counterintuitive. However, if storing and loading these computed values involves significant memory LOAD/STORE operations, it can make the process memory-bound. Since the computational capabilities of GPUs far exceed their data transfer capabilities to and from memory, recomputation can result in significant performance gains.
For example, during training, the activations computed during the forward pass are needed again during the backward pass to compute gradients with respect to the loss function. Storing and then accessing these activations from memory can create a significant bottleneck. Instead, recomputing the activations during the backward pass, rather than fetching them from memory, can significantly speed up the process by leveraging the GPU's superior computational power.
Here is a sample
Without Recompute Optimization
import torch
import torch.nn.functional as F
# Forward and backward pass without recompute
x = torch.randn(64, 784, requires_grad=True) # Batch size 64, input size 784
y = torch.relu(x)
loss = y.sum()
loss.backward()
With Recompute Optimization
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint
# Function whose activations will be recomputed during the backward pass
def relu_fn(x):
    return F.relu(x)

# Forward and backward pass with recompute (activation checkpointing)
x = torch.randn(64, 784, requires_grad=True)  # Batch size 64, input size 784
y = checkpoint(relu_fn, x, use_reentrant=False)
loss = y.sum()
loss.backward()
4. Spatial Blocking
The key idea behind spatial blocking is to divide (or "tile") large data structures into smaller, more manageable blocks that fit into the GPU's cache or shared memory. Each block is processed independently, which helps keep data on-chip for as long as possible, reducing the frequency of slow global memory accesses. This improves cache utilization, reduces memory bandwidth requirements, and enables better parallelism.
Here is a sample
Without Spatial Blocking
__global__ void matrixMul(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // A and B are read from global memory on every iteration, with no on-chip reuse
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
With Spatial Blocking
#define BLOCK_SIZE 16
__global__ void matrixMulTiled(float* A, float* B, float* C, int N) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = by * BLOCK_SIZE + ty;
    int col = bx * BLOCK_SIZE + tx;
    float sum = 0.0f;
    for (int t = 0; t < (N + BLOCK_SIZE - 1) / BLOCK_SIZE; ++t) {
        if (row < N && t * BLOCK_SIZE + tx < N) {
            As[ty][tx] = A[row * N + t * BLOCK_SIZE + tx];
        } else {
            As[ty][tx] = 0.0f;
        }
        if (t * BLOCK_SIZE + ty < N && col < N) {
            Bs[ty][tx] = B[(t * BLOCK_SIZE + ty) * N + col];
        } else {
            Bs[ty][tx] = 0.0f;
        }
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            sum += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();
    }
    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}
5. Compress Data
Data compression is a crucial technique in GPU optimization aimed at reducing memory usage, storage requirements, and data transfer between GPUs as well as between GPU and CPU. This approach is particularly effective for handling large datasets or models that exceed GPU memory capacity or when dealing with sparse data structures.
The choice of compressor depends on your dataset and on your requirements for compression ratio versus compression/decompression speed.
I've used two compression schemes for DL workloads,
LZ4 is a versatile compressor designed for arbitrary data formats. It operates at the byte level, emphasizing simplicity and speed. For optimal performance on GPUs, it's recommended to partition datasets into blocks and leverage GPU thread blocks for concurrent compression and decompression operations. This approach maximizes throughput by exploiting the parallel processing capabilities of modern GPUs.
Cascaded compressor is tailored for numerical and analytical datasets. It excels in scenarios where data has a structured format, allowing for efficient parallelization. This algorithm achieves high throughput on modern GPUs, making it suitable for applications requiring rapid data compression and decompression.
While cascaded compression offers high throughput and efficient parallelization, it may not be suitable for all types of data inputs. Particularly, it may not perform optimally with unstructured datasets such as strings or data lacking clear patterns. In such cases, LZ4 is a better choice.
As a side note, deflate is another compressor, but I don't have any hands-on experience with it.
For NVIDIA GPUs, the nvCOMP CUDA library is a great place to get started.
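As a rough illustration of the block-partitioning idea, here is a minimal CPU-side sketch using the lz4 Python package; the GPU path with nvCOMP follows the same pattern, with each block handed to a GPU thread block. The 1MB chunk size and the toy data are arbitrary choices for illustration.

import lz4.frame
import numpy as np

def compress_in_blocks(array: np.ndarray, block_bytes: int = 1 << 20):
    # Split the tensor's raw bytes into fixed-size blocks and compress each block
    # independently, mirroring how GPU compressors parallelize across thread blocks.
    raw = array.tobytes()
    blocks = [raw[i:i + block_bytes] for i in range(0, len(raw), block_bytes)]
    compressed = [lz4.frame.compress(b) for b in blocks]
    ratio = len(raw) / sum(len(c) for c in compressed)
    return compressed, ratio

# Structured, repetitive data compresses far better than random data
data = np.tile(np.arange(1024, dtype=np.float32), 4096)
_, ratio = compress_in_blocks(data)
print(f"compression ratio: {ratio:.1f}x")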
6. Occupancy
GPUs rely on massive parallelism to hide latency, and occupancy is a key metric of thread-level parallelism: it measures how fully GPU resources (threads, registers, and shared memory) are being used, expressed as the ratio of active warps to the maximum number of warps a streaming multiprocessor can support. Instruction-Level Parallelism (ILP), by contrast, measures how many independent instructions a single thread can issue without waiting on earlier results.
Higher occupancy and ILP improve compute and memory efficiency by giving the scheduler other work to issue while LOAD/STORE operations are in flight; threads waiting on data dependencies or barriers are suspended until the hazards are resolved. Tools like NVIDIA's ncu (Nsight Compute) and nvprof help you understand and fine-tune the parameters that influence occupancy and ILP on NVIDIA GPUs.
Here is an example; the launch configuration together with the kernel's per-thread register and shared-memory usage determines how many blocks fit on each SM, and therefore its occupancy:
__global__ void matrixMultiplication(float* A, float* B, float* C, int N) {
    // Block dimensions chosen at launch time (e.g., 16x16 = 256 threads per block)
    // directly affect how many blocks can be resident per SM.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
Model-level optimizations
Before diving into specific techniques, let's cover how inference works and address some common inefficiencies.
LLM inference consists of two distinct phases: the prefill phase, which processes the input prompt, and the decode or generate phase, which generates output tokens autoregressively.
During the prefill phase, all tokens in the input sequence are processed in parallel, leading to high GPU utilization even with a small batch size. In contrast, the decode phase processes only a single token in each autoregressive pass, resulting in very low GPU utilization.
From my experience, the decode cost per token can be 192-214 times higher than the prefill cost per token. This disparity produces very different latencies for the time to first token (TTFT) and the time between tokens (TBT), especially with chain-of-thought queries where users don't see the intermediate outputs.
Such disproportionate resource utilization implies that for every request, there are periods of high compute utilization due to efficient prefill operations, followed by potentially long periods of inefficient decode operations. This results in poor overall GPU utilization.
LLM inference faces inefficiencies because the decode phase is memory-bound, and the use of pipeline parallelism leads to significant pipeline bubbles in LLMs.
One common challenge is balancing batch sizes to shift the decode phase from memory-bound to compute-intensive. However, scaling up the batch size beyond a limit is impractical due to the KV-cache footprint of each request. For example, we can fit a maximum batch size of 18 requests at a sequence length of 1K for the LLaMA-13B model on an A6000 GPU. Therefore, within the practical range of batch sizes today, the decode phase remains memory-bound.
Another challenge unique to LLMs is the variability of user prompt lengths and output lengths, which forces the inference stack to constantly adjust memory usage and processing strategies to remain efficient.
1. Intermixing prefill with decode tokens
At any given time, LLM inference involves processing varying lengths of prefills and decoding tokens, resulting in different and unpredictable computation times. For example, the question "What is the longest river in the world?" may have a simple answer, "Nile," whereas "Why are rivers ecologically important?" could require a longer response. Depending on how requests are pipelined and handled, the response to the first question might be delayed due to the generation of the second question's response, leading to high inference times or increased latency between token generations.
One effective method to improve GPU utilization can be to create a compute group (CG) that combines a portion of the original prefill tokens along with decodes, instead of having separate prefill-only and decode-only token blocks. The intermixing of prefill and decode within a CG is possible because the prefill operation is highly parallelizable. By processing these compute blocks of prefill and decode, we can efficiently utilize GPU resources during both phases. Additionally, this strategy helps to eliminate runtime variance across requests, leading to more consistent and predictable performance. By implementing this pipeline, we were able to decrease the time between tokens by 11.2X while improving GPU utilization by 1.6X.
2. Quantization
LLMs are continually growing larger, presenting formidable challenges for deployment and real-time inferencing due to constraints in arithmetic density, memory capacity, and especially memory bandwidth between GPUs and high-bandwidth memory (HBM) as well as lower memory tiers.
For instance, the GPT-3 model contains approximately 175 billion parameters with a memory footprint of more than 300GB, making it impractical to deploy on a single GPU such as the NVIDIA H100. Another challenge is the increase in inference latency, which can be prohibitive for many real-world applications.
Model compression using quantization can be an all-around win, reducing model size and compute needs with relatively little effort; however, there are various nuances that must be considered to hit the accuracy mark.
Quantization involves reducing the precision of the model's weights and activations from high-precision formats like FP32 (32-bit floating point) to lower-precision formats such as FP16 (16-bit floating point), INT8 (8-bit integer), INT4 (4-bit integer), or lower. For example, reducing the precision of tensors from 32-bit to 8-bit decreases the memory required to store them by a factor of 4. The cost of a multiply-accumulate operation also scales roughly quadratically with bit width, so the same reduction can cut compute cost by up to 16x. During inference, this not only reduces the in-memory footprint but also minimizes data movement from lower memory tiers to the GPU.
Here are some technical details and practical considerations one must be aware of:
2.1. Symmetric quantization
In symmetric quantization, the quantized values are symmetrically distributed around zero. This means that the range of values for the quantized integers is centered at zero, and the positive and negative ranges are equal. A single scale factor is used to map the floating-point values to the integer range. This is simpler to implement and is hardware efficient, however, it may not utilize the full integer range effectively if the original data is not symmetrically distributed around zero.
2.2. Asymmetric quantization
In asymmetric quantization, the quantized values are not symmetrically distributed around zero. This allows for a more flexible representation, especially useful when the data range is not centered around zero. It is more complex to implement due to the need for an additional zero-point parameter.
2.3. Quantization parameters
Quantization involves three primary parameters: the scale factor (S), the zero-point (Z), and the bit-width (b), which together determine how floating-point values are mapped to and represented in integer form.
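Here is a minimal PyTorch sketch of symmetric and asymmetric INT8 quantization, showing how the scale factor and zero-point fall out of the tensor's range. The min/max calibration used here is the naive choice, purely for illustration.

import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                        # 127 for INT8
    scale = x.abs().max() / qmax                      # single scale; zero-point is implicitly 0
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def quantize_asymmetric(x: torch.Tensor, bits: int = 8):
    qmin, qmax = 0, 2 ** bits - 1                     # [0, 255] for unsigned 8-bit
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(torch.round(-x.min() / scale))   # integer that represents the real value 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

x = torch.randn(4, 4)
q_sym, s = quantize_symmetric(x)
x_hat_sym = q_sym.float() * s                         # dequantize: only rounding error remains
q_asym, s, z = quantize_asymmetric(x)
x_hat_asym = (q_asym.float() - z) * s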
2.4. Quantization Grid and Error Management
These parameters define the limits of the quantization grid (gmin and gmax). Values outside this range are clipped, introducing clipping errors, which is problematic for models where extreme values hold significant meaning. To mitigate clipping errors, one can enlarge the scale factor to expand the quantization range; however, doing so increases rounding errors, potentially degrading model precision.
2.5. Choosing Quantization Parameters
To strike a balance between clipping and rounding errors, the quantization range is typically calibrated on representative data rather than set naively to the observed min/max, for example by choosing the scale that minimizes the mean squared error between the original and quantized tensors, or by clipping at a chosen percentile of the observed values.
2.6. Quantization considerations
The next consideration is quantization granularity, the level at which we apply quantization, as well as how we apply it, e.g. static vs. dynamic.
Per-Tensor Quantization
This is the most straightforward approach, applying the same quantization parameters (scale factor, zero-point, bit-width) to all elements within a single tensor. Per-tensor quantization struggles with tensors containing a wide range of values: a single quantization range might not capture the nuances of all elements, leading to significant accuracy loss for LLMs that rely on precise numerical representations.
Group-wise Quantization (GWQ)
GWQ partitions a tensor into smaller groups (channels in an activation or weight matrix) and calculates quantization parameters for each group individually. This allows a better fit to the distribution of values within each group. Compared to per-tensor quantization, GWQ preserves accuracy better by accommodating the varying ranges of values across groups, but this comes at the cost of added complexity for group management and potentially higher memory usage during the quantization process.
Per-Layer Quantization (PLQ)
PLQ offers a balance between the simplicity of per-tensor quantization and GWQ's group-specific approach. It can achieve better efficiency gains than per-tensor quantization while maintaining acceptable accuracy for many LLM tasks. However, PLQ might not capture the full range of value distributions within a layer, especially for complex LLM architectures with diverse layers.
Besides these, there are two additional approaches which can come in handy in certain scenarios,
Dynamic Quantization (DQ)
DQ adjusts quantization parameters (such as scale factors and zero-points) during inference based on the data distribution encountered at runtime. Unlike static quantization where parameters are fixed, dynamic quantization allows flexibility to adapt to varying input ranges and distributions. This approach is particularly beneficial in scenarios where input data ranges widely during inference, making fixed quantization parameters less effective.
Adaptive Quantization (AQ)
AQ goes a step further by dynamically adjusting quantization strategies or parameters based on feedback or runtime conditions. This approach considers factors beyond just data distribution, such as Real-time performance metrics (e.g., inference latency, throughput) and adaptation to hardware capabilities or operational constraints (e.g., memory limitations, computational resources).
2.7. Handling Outliers in Quantization
Weights typically follow a normal distribution centered around zero. They are stable and usually devoid of outliers, making them straightforward to quantize without significant loss in efficiency or accuracy. Activations, however, are often non-Gaussian and may contain outliers, particularly in scenarios with sparse or highly dynamic data. Quantizing tensors that contain outliers can reduce efficiency and accuracy because of the non-uniform distribution and potentially large-magnitude values.
Managing outliers in activation tensors is therefore essential for optimizing quantization in neural networks. Strategies such as clipping, scaling, and the calibration approaches discussed above can mitigate the impact of outliers on quantization efficiency and accuracy.
LLM.int8() is an elegant solution to the outlier problem. It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. This means that outlier features are processed in a FP16 format to retain their precision, while the other values are processed in an INT8 format. As outliers represent about 0.1% of values, this effectively reduces the memory footprint of the LLM while handling outliers.
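LLM.int8() is implemented in the bitsandbytes library and exposed through Hugging Face Transformers. A rough sketch of loading a model with it is shown below; the model name is a placeholder and the exact keyword arguments can vary across library versions.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"    # placeholder; any causal LM works

# Mixed-precision INT8: weights stored in INT8, outlier features handled in FP16
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                   # spread layers across available GPUs
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))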
2.8. Quantization Techniques: Post-Training vs. Quantization-Aware Training
There are two main approaches to quantization: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Let's dive into their differences and their suitability for LLMs.
Post-Training Quantization (PTQ): PTQ involves training an LLM in full precision (FP32) and then applying quantization techniques after training is complete. This is a simpler approach, requiring minimal modifications to the existing training pipeline. PTQ is straightforward to implement and computationally less expensive compared to QAT. Since the model wasn't trained with quantization in mind, PTQ can lead to accuracy loss, especially if the quantization process is aggressive (using very low bit-widths). This is because the quantization scheme might not capture the full range of values effectively, particularly for models sensitive to numerical precision like LLMs.
Quantization-Aware Training (QAT): QAT integrates quantization techniques into the training process itself. The model is trained with simulated quantization steps, allowing it to learn representations that are more robust to the quantization errors introduced during deployment. QAT can potentially achieve higher accuracy compared to PTQ for the same level of quantization (bit-width), especially for LLMs. This is because the model learns to adapt to the lower precision format during training. QAT requires modifying the training pipeline to incorporate quantization steps, which can be more complex than PTQ. Additionally, it might involve longer training times due to the additional computations involved in simulating quantization.
The optimal choice between PTQ and QAT depends on several factors, including the accuracy you need to preserve, how aggressive the target bit-width is, and the compute budget and training-pipeline changes you can afford.
Here are three popular quantization methods that I have used extensively, though there are many more to evaluate and choose from depending on your own unique needs.
SmoothQuant: Enhancing Activation Quantization
SmoothQuant addresses the challenge of quantizing LLMs by focusing on the quantization of activations, the intermediate results computed during model inference. Activations often exhibit wide distributions with outliers, extreme values that complicate the quantization process. SmoothQuant's key idea is to migrate quantization difficulty from activations to weights through a mathematically equivalent per-channel scaling transformation, smoothing out activation outliers so that both weights and activations can be quantized to INT8 (W8A8).
SmoothQuant enables the efficient post-training quantization of even very large LLMs. By reducing computational complexity and memory requirements, it facilitates faster inference speeds and extends the reach of these powerful models to devices with limited resources.
SPQR: Hybrid Compression for Model Size Reduction
SPQR (Sparse-Quantized Representation) introduces a hybrid compression strategy designed to significantly reduce the size of LLMs while preserving their accuracy. The approach integrates two main techniques: quantizing the bulk of the weights to very low bit-widths (around 3-4 bits) using small quantization groups, and isolating the small fraction of hard-to-quantize weights into a sparse, higher-precision structure.
SPQR also addresses the challenge posed by outlier weights—those with a substantial impact on model performance. It employs specialized strategies to handle outliers by storing them in higher precision formats, ensuring critical model accuracy is preserved during compression.
This two-pronged approach of sparsity and targeted quantization with outlier handling enables SPQR to achieve near-lossless compression of LLMs. It opens avenues for deploying sophisticated language models on a wide range of devices, from edge computing devices to cloud servers, without compromising performance.
GPTQ: Layer-by-Layer Quantization for Precision Optimization
GPTQ represents an innovative approach to optimizing LLMs through layer-by-layer quantization. This technique quantizes the weights one layer at a time, using approximate second-order (Hessian-based) information to choose quantized values that minimize each layer's output reconstruction error, which makes very low bit-widths practical with minimal accuracy loss.
By optimizing quantization at the layer level, GPTQ enhances the overall efficiency of LLMs while maintaining high accuracy. It supports the deployment of complex language models in diverse environments, ranging from real-time applications on edge devices to large-scale deployments in cloud infrastructures.
SmoothQuant, SPQR, and GPTQ offer unique strategies to enhance efficiency and performance across various deployment scenarios. These techniques not only reduce the computational and memory footprint of LLMs but also democratize access to their capabilities, enabling broader adoption in resource constrained environments.
3. Sparsity
Sparsity is a powerful technique used in LLM model compression. It focuses on identifying and zeroing out elements within a model that have minimal impact on its overall accuracy. Imagine a large weight matrix in an LLM: some elements have values very close to zero and contribute little to the model's output. Sparsity techniques identify these "unimportant" elements and set them to zero, effectively pruning them from the model.
The benefits of sparsity include a smaller memory footprint, fewer effective FLOPs (zeroed elements can be skipped), and faster inference on hardware with native sparsity support (for example, structured 2:4 sparsity on recent NVIDIA tensor cores).
One popular technique is magnitude-based sparsity. Here, elements in a tensor (a multidimensional array) are pruned based on their absolute values (magnitudes): elements with very small magnitudes are considered less important and are removed. This approach can achieve remarkable compression rates, significantly reducing the model size with minimal impact on accuracy.
In practice, applying sparsity before quantization yields much better model accuracy. For example, at a 40% sparsity level, 8-bit quantization can reduce the total memory footprint by 6.5X.
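Here is a short sketch of magnitude-based pruning using PyTorch's built-in torch.nn.utils.prune utilities on a toy linear layer; the 40% level mirrors the example above.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)

# Zero out the 40% of weights with the smallest absolute values (magnitudes)
prune.l1_unstructured(layer, name="weight", amount=0.4)
prune.remove(layer, "weight")   # make the pruning permanent by baking the zeros into the tensor

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")   # ~40%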
4. LoRA (Low-Rank Adaptation)
Fine-tuning LLMs can be computationally expensive due to the massive number of parameters involved. LoRA addresses this challenge by significantly reducing the number of parameters that need to be trained during fine-tuning. Instead of modifying the entire weight matrix directly (billions of parameters), it introduces a pair of much smaller, low-rank matrices. These low-rank matrices capture the essential information needed for adaptation, significantly reducing the number of trainable parameters. LoRA reduces training time because only a small subset of parameters needs updating, leading to faster convergence during fine-tuning. It also significantly reduces the memory footprint required during fine-tuning, which makes it particularly beneficial for resource-constrained environments.
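A minimal sketch of the idea, wrapping a frozen linear layer with a trainable rank-r update; the rank, scaling, and initialization choices here are illustrative defaults, not prescriptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Output = frozen W x + (alpha / r) * B A x, where A and B are small low-rank matrices.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable parameters vs ~16.8M in the full layer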
5. QLoRA (Quantized Low-Rank Adaptation)
QLoRA builds upon the success of LoRA for efficient LLM fine-tuning. While LoRA significantly reduces the number of trainable parameters by using low-rank matrices, QLoRA takes it a step further by incorporating quantization: the frozen base-model weights are stored in a 4-bit format (NF4), while the small LoRA adapters are trained in higher precision. This drastically reduces memory usage compared to full fine-tuning with floating-point parameters, allowing fine-tuning on devices with limited memory. Quantization also reduces the computational and memory-transfer cost associated with training and inference. Despite the reduction in precision, QLoRA aims to maintain the model's ability to adapt to new fine-tuning tasks while preserving overall performance; this balance between efficiency and accuracy is crucial for practical applications.
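In practice, QLoRA is usually set up through bitsandbytes and PEFT. A rough sketch under those assumptions follows; the model id, target modules, and hyperparameters are placeholders, and exact argument names can differ between library versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 storage for the frozen base weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)  # only the LoRA adapters remain trainable
model.print_trainable_parameters()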
6. 1-bit LLM
The 1-bit LLM deserves a special mention because of the remarkable compression it offers while reportedly matching the accuracy of a 16-bit floating-point LLM baseline (according to claims made by the authors of the BitNet b1.58 paper). It optimizes the storage and processing requirements of language models by constraining every weight of the LLM to a ternary value {-1, 0, 1}. A 7B-parameter model in 32-bit precision requires about 26GB of memory, which rules out deployment on a single device in many cases, e.g. mobile phones. The same model as a 1-bit LLM would need only about 0.815GB of memory. I am uncertain when such a model will be available or whether anyone has been able to train one following the methodology of the BitNet b1.58 paper. It would be particularly useful for IoT devices, with their extremely limited memory and processing power, and for real-time applications where instant responses are crucial, such as advanced text prediction and auto-correction.
I have not yet seen a working implementation; it remains conceptual. My team attempted to replicate the results with our limited resources, but we couldn't achieve the accuracy described in the paper, such as matching the 16-bit floating-point baseline.
7. FlashAttention and FlashAttention-2
FlashAttention, as introduced in the FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness paper, is an I/O-aware exact attention algorithm that leverages established techniques such as tiling (see earlier about Spatial Blocking) to minimize the number of memory reads and writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. By utilizing tiling, Flash Attention efficiently partitions the computation to keep most of the data on-chip, thereby reducing the need to access slower off-chip memory.
Additionally, FlashAttention employs a recompute strategy during the backward pass: instead of reading the intermediate attention matrix from HBM, it recalculates attention on-chip. Despite the increase in floating-point operations (FLOPs), this recomputation significantly speeds up the process compared to the standard approach, which relies on extensive memory accesses.
According to the authors, Flash Attention not only runs up to 7.6 times faster on models like GPT-2 but also uses considerably less memory. The algorithm achieves linear memory usage with respect to sequence length, in contrast to the quadratic memory complexity of standard attention mechanisms. This efficiency is primarily due to the dramatically reduced HBM access, making Flash Attention both faster and more memory-efficient.
FlashAttention-2 builds upon the success of FlashAttention by reducing non-matmul operations and addressing inefficiencies in parallelization, thread partitioning, and shared memory access on GPUs. This is crucial because matrix multiplications, thanks to Tensor Cores, can be significantly faster (up to 16X) than non-matmul operations. FlashAttention-2 parallelizes both the forward and backward passes across the sequence-length dimension, in addition to FlashAttention's batch and head parallelization. It also partitions the computation better between the warps of a GPU thread block to reduce shared memory access. All this adds up to roughly a 2X speed improvement over FlashAttention.
FlashAttention has been adopted widely by the AI community and is integrated with popular ML frameworks. PyTorch 2.2 supports both FlashAttention and FlashAttention-2 through scaled_dot_product_attention; see torch.backends.cuda.enable_flash_sdp.
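A minimal sketch of hitting the fused attention path through PyTorch's scaled_dot_product_attention; the tensor shapes are illustrative, and on supported GPUs and dtypes PyTorch dispatches this call to a FlashAttention kernel.

import torch
import torch.nn.functional as F

torch.backends.cuda.enable_flash_sdp(True)   # allow the FlashAttention backend

batch, heads, seq_len, head_dim = 4, 16, 2048, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: the (seq_len x seq_len) attention matrix is never materialized in HBM
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([4, 16, 2048, 64])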
8. PagedAttention
The paper titled Efficient Memory Management for Large Language Model Serving with PagedAttention addresses the inefficiency of static memory allocation in LLM serving systems. These systems typically pre-allocate a fixed amount of contiguous memory to store the KV cache needed for generating a response. This approach suffers from several limitations: memory is reserved for the maximum possible output length even when responses turn out to be short (internal fragmentation), differently sized reservations leave unusable gaps in GPU memory (external fragmentation), and memory cannot be shared across requests, for example during parallel sampling from the same prompt.
PagedAttention introduces the concept of logical and physical memory blocks stored in the KV Cache. Similar to virtual memory in operating systems, this design provides the illusion of contiguous RAM even though data is physically scattered across memory. Logical blocks represent the data the model accesses, while physical blocks indicate the actual memory locations.
The paper illustrates storing the KV cache of two requests simultaneously: with PagedAttention, the KV cache no longer needs to occupy contiguous memory space.
To leverage PagedAttention in a distributed environment, the paper introduces vLLM. vLLM achieves up to 24 times higher throughput compared to the Hugging Face Transformers library (HF) and up to 3.5 times higher throughput compared to Hugging Face Text Generation Inference (TGI).
vLLM seamlessly supports most popular open-source models, including Llama and Mistral.
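Getting started with vLLM takes only a few lines; a rough sketch is below, with the model name and sampling parameters as placeholders.

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain PagedAttention in one sentence.",
           "Why is the decode phase of LLM inference memory-bound?"]
outputs = llm.generate(prompts, params)   # continuous batching + paged KV cache under the hood
for out in outputs:
    print(out.outputs[0].text)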
9. Computational Parallelism
Tensor Parallelism and Pipeline Parallelism are techniques used to distribute computational workloads across multiple devices during the training and inference of Large Language Models (LLMs). Both methods enhance the efficiency and scalability of processing large models but in distinct ways.
Tensor Parallelism involves sharding the model weights across different GPUs, typically connected through a high-performance interconnect like NVLink. Each GPU holds a portion of the model's weights and performs operations on its assigned part, freeing up memory to accommodate larger contexts. The results are then combined to produce the final output. This method is effective for operations that can be divided into smaller, independent tasks, such as matrix multiplications, leveraging the parallel processing power of multiple GPUs.
Pipeline Parallelism divides the model into different segments (or stages), assigning each to a separate device. Each device processes its segment sequentially, passing intermediate results to the next device in the pipeline. To maximize efficiency, pipeline parallelism can overlap the processing of different batches; while one batch is processed by the first segment, another batch can be processed by subsequent segments. This approach is suitable for very deep neural networks where the model can be logically divided into stages, reducing memory usage on individual devices and enabling the handling of large models that cannot fit entirely on a single GPU.
In practice, these methods are often combined to optimize the training and inference of large-scale models. Tensor Parallelism handles the parallelization of operations within each segment, while Pipeline Parallelism manages the sequential processing of these segments across devices. This hybrid approach maximizes the utilization of available hardware resources, ensuring efficient and scalable model training and inference.
10. Continuous Batching
While tensor parallelism and pipeline parallelism tackle the model itself, continuous batching focuses on optimizing how individual requests are processed during inference. Traditional inference treats each LLM request as a separate job, which means the model's massive parameters are reloaded for every request, wasting processing power and increasing latency.
Incoming requests are placed in a queue. The system monitors this queue and forms batches based on predefined criteria such as maximum batch size or time limits. Continuous batching works at the token level: it processes one token at a time for each request in the batch, and a new request can be added as soon as a slot frees up when a previous request finishes generating its output. This keeps the GPU constantly busy. When a request finishes (signaled by an EOS token), its slot in the batch is released, and a new request can be inserted and begin processing its first token, as sketched below.
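Here is a toy scheduler loop that captures the slot-recycling idea. It is purely illustrative and not tied to any serving framework; the Request class below is a stand-in whose step() decodes one token and whose is_finished() stands in for EOS detection.

import random
from collections import deque

class Request:
    # Toy stand-in for a decoding request: each step() emits one token.
    def __init__(self, prompt, max_new_tokens):
        self.prompt, self.tokens, self.budget = prompt, [], max_new_tokens
    def step(self):
        self.tokens.append("tok")                  # pretend we decoded one token
    def is_finished(self):
        return len(self.tokens) >= self.budget     # stand-in for hitting the EOS token

MAX_BATCH = 4
waiting = deque(Request(f"prompt-{i}", random.randint(2, 10)) for i in range(16))
active = []

while waiting or active:
    # Fill any free slots immediately; no need to wait for the whole batch to drain
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())
    # One decode iteration: every active request advances by exactly one token
    for req in active:
        req.step()
    # Requests that just finished free their slot for the next waiting request
    active = [req for req in active if not req.is_finished()]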
11. Pre-Attention Input Tensor Batching
Pre-attention batching in LLMs is a technique that builds upon continuous batching to further improve efficiency, particularly when dealing with requests of varying lengths or complexities. Here is the concept:
Continuous batching processes requests on a token-by-token basis, improving efficiency by keeping model parameters loaded and maximizing GPU utilization. However, naively coalescing requests into one batched computation is only possible when they are in the same phase and have the same length.
Let's say we have two requests
R1 : I am going to the
R2: Where is
R1 has already gone through an execution pass and its current input is the single token 'the' (decode phase), while R2 is a new request with two input tokens (prefill phase). R1 and R2 are in different phases and have different input tensor shapes, [1, H] and [2, H] respectively, so we can't coalesce the two for the attention operation: attention must be computed separately per request, because the model has to consider all relevant context within a single request before generating its next token.
The pipeline therefore batches the operations that don't depend on sequence shape, such as the QKV projections and feed-forward layers, across R1 and R2, splits out the attention computation per request, and merges the results before the next layer.
By batching these independent operations, pre-attention batching further optimizes GPU utilization and can lead to faster processing than continuous batching alone, while allowing efficient handling of requests with different lengths and complexities.
12. Speculative Decoding
Optimizing the inference cost of large language models (LLMs) is crucial for reducing the expenses of generative AI and enhancing its adoption. The standard method for generating text from a language model involves autoregressive sampling, where decoding N tokens requires N serial model runs. Although GPUs theoretically can compute all future tokens in parallel, each token's generation is dependent on the previous one, resulting in sequential processing.
Speculative decoding is an optimization technique for inference that makes informed guesses about future tokens while generating the current token within a single forward pass. It includes a validation mechanism to ensure the accuracy of these speculated tokens, guaranteeing that the overall output matches that of regular decoding.
In speculative decoding, a small and cost-effective draft model generates a candidate sequence of K tokens. This draft sequence is then batch processed through the larger model, nearly as efficiently as processing a single token. When evaluating the logits from the large model, tokens are sampled from left to right; wherever the sampled token matches the draft token, we accept it and skip to the next position immediately. On a mismatch, the rest of the draft is discarded, and additional computational cost is incurred for the subsequent tokens.
This method is effective because most draft tokens are accepted, being straightforward and easily predictable by the smaller model. Accepted tokens enable rapid progression through these parts. For challenging tokens where the large model disagrees, the process reverts to its original speed, slightly slower due to increased computations.
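A greatly simplified sketch of the draft-then-verify loop is shown below, using greedy acceptance and a batch size of one. Here draft_model and target_model are assumed to be callables returning next-token logits of shape (batch, sequence, vocab); real implementations, such as the roughly 50-line version on the PyTorch website, also handle probabilistic acceptance and the extra bonus token.

import torch

@torch.no_grad()
def speculative_generate(target_model, draft_model, tokens, k=4, steps=32):
    # Greedy draft-and-verify: the draft proposes k tokens, the target checks them in one pass.
    for _ in range(steps):
        # 1) Draft model proposes k tokens autoregressively (cheap, small model)
        draft = tokens.clone()
        for _ in range(k):
            next_tok = draft_model(draft)[:, -1, :].argmax(dim=-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)

        # 2) Target model scores the whole draft in a single forward pass (one HBM round trip)
        logits = target_model(draft)
        target_pred = logits[:, -k - 1:-1, :].argmax(dim=-1)   # target's choice at each draft position
        proposed = draft[:, -k:]

        # 3) Accept the longest matching prefix, then take the target's token at the first mismatch
        matches = (target_pred == proposed)[0].long()
        n_accept = int(matches.cumprod(dim=0).sum())
        tokens = torch.cat([tokens, proposed[:, :n_accept],
                            target_pred[:, n_accept:n_accept + 1]], dim=-1)
    return tokens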
Speculative decoding benefits significantly from sparsity (covered earlier and see Activation function below), which reduces computational overhead, enhances memory efficiency, enables parallel processing, facilitates selective activation, and improves overall scalability.
In my experience with Llama3 8B, speculative decoding accelerated inference by 1.8X. Although it requires additional compute and incurs some wasted computation, the savings in HBM round trips more than compensate for these costs.
This technique, consisting of about 50 lines of code, is available on the PyTorch website.
A recent advancement eliminates the need for a draft model. The open-source Medusa adds a speculator to the last layer of the base model, training multiple decoding heads within the same model. This approach removes the complexity of a separate model. While I haven't personally tested it yet, the results from the paper show promising outcomes, offering a simpler implementation alternative to the draft model approach.
13. Key-value caching
This optimization is standard in most ML libraries/frameworks, but it's important to understand the concept.
As discussed earlier, during the decode phase, each token is generated sequentially, with each token depending on the key and value tensors from all previous tokens. To avoid recomputing these tensors for every token at each time step, they can be cached in GPU memory. With each iteration, newly computed elements are added to the running cache, which is then used in subsequent iterations.
Monitoring and efficiently managing the KV cache is crucial for smooth inference. For instance, with a Llama2 7B model in 16-bit precision, a batch size of 3, and a context length of 4K, the size of the KV cache would be approximately 6GB.
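A quick back-of-the-envelope sketch of how that number comes about; the layer and head counts below are the published Llama2 7B configuration, and the formula is the standard per-token KV footprint (2 for K and V, times layers, KV heads, head dimension, and bytes per element).

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K and V for every layer
    return per_token * seq_len * batch / 1e9

# Llama2 7B: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes per element)
print(kv_cache_gb(32, 32, 128, seq_len=4096, batch=3))   # ~6.4 GB, matching the estimate above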
14. Grouped Query Attention (GQA)
Before diving into GQA, let's briefly discuss a related optimization called Multi-query attention (MQA).
In traditional multi-head attention (MHA), as proposed in the original Attention paper, the key and value elements, which hold crucial context information, must be reloaded for each new token. This becomes a bottleneck for long sequences due to repeated memory access, slowing down processing.
Multi-query attention (MQA) streamlines inference by sharing a single set of key and value elements across all attention heads. While maintaining multiple heads for the "query" elements, MQA uses one set of keys and values for all heads. This approach significantly reduces memory access, improves processing speed, especially for long sequences, and reduces the size of the KV-cache, allowing for larger batch sizes. However, while MQA enhances speed, sharing keys and values across all heads may lead to a drop in accuracy compared to MHA.
GQA acts as a middle ground between MHA and MQA, addressing MQA's limitations:
GQA divides the query heads from MHA into groups. Each group shares a single set of key-value elements. This creates a balance between speed and accuracy. Within each group, queries can still attend to different parts of the key-value pairs, allowing for some level of focused attention compared to MQA's single set for all queries. From an implementation standpoint, multiple GPUs are used during inference, with the KV cache and query heads distributed across them. Each GPU runs an attention layer with a single KV head and a group of query heads. Thus, from the perspective of a single GPU, the grouped-query attention (GQA) component functions similarly to multi-query attention (MQA).
By allowing some differentiation within groups, GQA retains a better ability to capture complex relationships than MQA's single key-value set for all queries, while still delivering faster processing and reduced memory requirements for inference.
If you're well-versed in multi-head attention code, enhancing it for GQA is pretty straightforward. It's important to keep the number of groups as a tunable parameter for experimentation.
Finding the optimal number of groups (and, in turn, the number of GPUs needed) for your use case is crucial. Too few groups might lead to accuracy issues like MQA, while too many might negate the speedup benefit.
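A minimal sketch of a grouped-query attention forward pass in PyTorch; shapes and the group count are illustrative, and the K/V heads are simply repeated to cover their query group.

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_groups):
    # q: (B, n_q_heads, T, d); k, v: (B, n_groups, T, d) with n_q_heads divisible by n_groups
    n_q_heads = q.shape[1]
    repeat = n_q_heads // n_groups
    # Each group of query heads shares one K/V head (MQA when n_groups == 1, MHA when == n_q_heads)
    k = k.repeat_interleave(repeat, dim=1)
    v = v.repeat_interleave(repeat, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, n_q_heads, n_groups, T, d = 2, 32, 8, 128, 64
q = torch.randn(B, n_q_heads, T, d)
k = torch.randn(B, n_groups, T, d)
v = torch.randn(B, n_groups, T, d)
print(grouped_query_attention(q, k, v, n_groups).shape)   # torch.Size([2, 32, 128, 64])

The n_groups knob sweeps the whole spectrum from MQA (one group) to full MHA (one group per query head), which makes it easy to explore the accuracy/speed trade-off.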
15. Activation function
Activation functions are crucial in neural networks, introducing non-linearity and enabling the modeling of complex patterns within data.
The original Transformer architecture employed the ReLU activation function. Subsequent variants such as SwiGLU, GELU, and SiLU have been introduced, claiming faster convergence and improved accuracy. However, non-ReLU activations often incur higher computational costs due to reduced sparsity. In contrast, ReLU induces activation sparsity, leading to significant savings in weight transfer (I/O) between GPUs and CPUs. This sparsity affects approximately 90-95% of rows in projection layer weights, enhancing computational efficiency.
A recent paper from Apple, titled ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, effectively highlights this concept. The authors provide activation sparsity measurements for various activation functions.
ReLU induces over 90% activation sparsity, while SiLU and GELU barely move the needle. The authors also demonstrate that when trained from scratch, there is no significant difference in performance between the different activation functions.
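Measuring activation sparsity is straightforward; the small sketch below compares the fraction of exactly-zero activations produced by ReLU versus GELU on random Gaussian inputs. It is illustrative only and not the paper's measurement protocol: in trained LLMs the ReLU zero fraction is far higher than on random data.

import torch
import torch.nn.functional as F

x = torch.randn(4096, 11008)   # stand-in for an FFN pre-activation tensor
relu_sparsity = (F.relu(x) == 0).float().mean().item()
gelu_sparsity = (F.gelu(x) == 0).float().mean().item()
print(f"ReLU zero fraction: {relu_sparsity:.1%}")   # ~50% on Gaussian input; 90%+ reported for trained LLMs
print(f"GELU zero fraction: {gelu_sparsity:.1%}")   # essentially 0%: outputs are small but nonzero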
Even if non-ReLU activations offer some advantages in accuracy and convergence, in my experience, these benefits are offset by the increased computational cost (FLOPS) required for inference, ultimately negating the net benefit.
For memory and computationally optimized models from training to inference, I recommend starting with ReLU (or Leaky ReLU if necessary). When experimenting with non-ReLU activations, it's crucial to establish clear success criteria across multiple dimensions such as cost, performance, and latency, and evaluate comprehensively.
Key System Metrics for Inference
Time To First Token (TTFT): Measures the time taken from receiving an input prompt to generating the first token of the response, indicating the model's responsiveness. Key factors include the model partitioning, scheduling algorithm, GPU FLOPs, and interconnect latency.
Time Per Output Token (TPOT): Also referred to as Time Between Tokens (TBT). Measures the average time taken to generate each token after the first one, which directly determines how engaged the user stays. For example, a TPOT of 50 milliseconds/token is 20 tokens per second, or roughly 1,200 tokens (about 900 words) per minute, comfortably faster than most people read.
The total latency for a response is TTFT plus TPOT multiplied by the number of output tokens.
Throughput: Total number of output tokens per second the inference setup can generate across all user requests.
Tail Latency: Represents the worst-case response times (e.g. at the 99th percentile) that only a small fraction of requests experience. For real-time use cases, high tail latency will lead to poor user experiences.
FLOPS Utilization: Ratio of the observed FLOPS to the GPU's theoretical maximum.
Cost Efficiency: The cost associated with running inferences, often measured in terms of compute hours or dollars per inference.
Closing thoughts
This concludes Part 2, where we explored various system and model optimizations for enhancing training and inference in the development of high-performing, cost-optimized AI systems. By being aware of and implementing these techniques based on their specific needs, organizations can effectively leverage advancements in AI to build efficient and scalable solutions.
To get started, I recommend focusing on some of the more accessible and impactful optimizations. Begin with Quantization, FlashAttention, Key-value caching, and LoRA.
These techniques are readily available and supported by most machine learning frameworks, making them relatively straightforward to integrate into your existing workflows.
Quantization can significantly reduce the computational load and memory footprint of your models, enabling faster inference and lower power consumption without substantial loss of accuracy. FlashAttention optimizes the attention mechanism in transformer models, leading to speed improvements in training and inference. Key-value caching helps manage memory more efficiently during sequential data processing, which can be especially beneficial for tasks involving long context windows. Lastly, LoRA offers a method for fine-tuning large models with minimal additional parameters, making it an efficient approach for domain-specific adaptations.
Once these foundational techniques are in place, you can explore more advanced optimizations tailored to your unique requirements to push the boundaries of performance and efficiency.
Iteration and experimentation pave the way to unlocking the full potential of AI and achieving strategic goals.