GPU and AI Model Optimizations
If you're involved with AI, you've likely been inundated with expert commentary on techniques like RAG, knowledge graphs, fine-tuning, prompt engineering, and instruction-tuning, along with recommended recipes for success, such as prompt engineering => RAG => fine-tuning, to unlock the value of GenAI.
Despite the abundance of information available on building AI solutions, productionizing LLMs at scale is hard and humbling. Successful implementations require a 'learn to do' rather than a 'learn to know' mindset. There are no shortcuts or easy DIY recipes.
Teams often struggle to move beyond limited POCs. In my experience, one key reason is the lack of focus and understanding of how these systems operate at a deeper level, including GPU, CPU, memory, and networking.
As an aside, if you need a quick overview of the various hardware technologies powering AI, you can refer here.
While knowledge of high-level concepts is undoubtedly important, there's no substitute for understanding AI at the system level. Knowing what to look for and what changes to make to achieve that extra 1% efficiency can sometimes require weeks or even months of painstaking work. Even when you are aware of system-level optimizations, many optimizations may not yield the expected return on investment and could remain purely textbook exercises. It's crucial to have a thorough understanding of how AI workloads interact at the system level to truly unlock their potential.
Where relevant, I've included additional references and guidance on implementation. Some techniques were developed in-house based on our custom needs or because they weren't supported by the ML framework we were using. In some cases, we found that the open-source implementations were inefficient, requiring significant modifications. Feel free to reach out if you'd like more details. Please note that this collection is not exhaustive—there are countless other techniques and concepts waiting to be explored. I hope you find it valuable.
Let's dive into the exciting world of optimizing AI systems for performance and efficiency. From model architecture to deployment, every step offers opportunities to enhance performance and improve efficiency. Whether you're training foundational models, fine-tuning existing ones, or simply focusing on inference, the goal remains the same: to create robust, high-performing AI solutions.
GPU-level optimizations
In the AI world, a kernel is a function launched from the host to run on the GPU. The nature of a kernel, whether it is compute-bound or memory-bound, largely determines the optimizations needed. Operations are classified as compute-bound or memory-bound based on the balance between computation and memory accesses. This balance is commonly quantified by the arithmetic intensity, the number of arithmetic operations performed per byte of memory accessed. A kernel is compute-bound if its arithmetic intensity is above the GPU's FLOPS-to-memory-bandwidth (ops:byte) ratio, meaning it performs many computations relative to the amount of data it accesses from memory. Kernels with lower arithmetic intensity are memory-bound.
As a rule of thumb, you want the kernel to be compute-bound rather than memory-bound to drive maximum efficiency and FLOPS throughput.
Memory-bound operations are primarily constrained by memory access times, with the actual computation taking comparatively little time. Their performance depends heavily on the number of memory accesses rather than the computational workload. Vector addition, softmax, and batch normalization are examples of memory-bound kernels with low arithmetic intensity. In contrast, compute-bound kernels, such as the Fast Fourier Transform and matrix multiplication with large inner dimensions, are dominated by computation rather than memory access.
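To make this concrete, here is a small Python sketch that estimates the arithmetic intensity of a matrix multiplication and compares it against a GPU's ops:byte ratio. The hardware numbers are assumptions (roughly A100-class: about 312 TFLOPS of FP16 compute and about 2 TB/s of HBM bandwidth); substitute the specs of your own GPU.

# Assumed GPU spec (roughly A100-class); replace with your own hardware numbers.
PEAK_FLOPS = 312e12                             # FP16 tensor-core FLOPS
MEM_BANDWIDTH = 2.0e12                          # bytes/s of HBM bandwidth
OPS_BYTE_RATIO = PEAK_FLOPS / MEM_BANDWIDTH     # ~156 FLOPs per byte moved

def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C (ideal case)
    return flops / bytes_moved

# Large square matmul: intensity ~1365 >> 156, so it is compute-bound
print(matmul_arithmetic_intensity(4096, 4096, 4096), OPS_BYTE_RATIO)
# Batch-1 decode-style matrix-vector product: intensity ~1 << 156, so it is memory-bound
print(matmul_arithmetic_intensity(1, 4096, 4096), OPS_BYTE_RATIO)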
Several techniques and approaches can significantly increase GPU utilization (compute-bound) and enhance algorithm efficiency. Many of these methods may already be available by default or can be explicitly enabled in the AI framework or compiler you are using. Understanding these concepts is crucial, especially when developing new code or optimizing existing code for better performance and efficiency.
Here are some specific techniques for optimizing memory access and improving compute efficiency.
1. Operator Fusion
In many computational tasks, especially in neural network operations, sequences of operations are performed on tensors. Operator fusion merges these sequences into a single, more complex operation, reducing the overhead of scheduling and executing multiple separate operations. It also keeps intermediate results on-chip instead of issuing separate memory LOADs and STOREs for each pointwise operator when there is a data dependency between them.
Here is a sample,
Without Operator Fusion:
import torch
import torch.nn.functional as F

A = torch.randn(100, 100)
running_mean = torch.zeros(100)
running_var = torch.ones(100)
B = F.relu(A)                                   # first kernel: writes B out to memory
C = F.batch_norm(B, running_mean, running_var)  # second kernel: reads B back from memory
With Operator Fusion:
A fused operation can be created that performs both ReLU and batch normalization in one step.
import torch
import torch.nn.functional as F

class FusedReLUAndBatchNorm(torch.autograd.Function):
    # Forward-only sketch of the fused op; a full implementation would also define
    # backward(), or the fusion would be generated by a compiler such as torch.compile.
    @staticmethod
    def forward(ctx, input, running_mean, running_var):
        relu_output = torch.relu(input)
        return F.batch_norm(relu_output, running_mean, running_var, training=True)

A = torch.randn(100, 100)
running_mean = torch.zeros(100)
running_var = torch.ones(100)
C = FusedReLUAndBatchNorm.apply(A, running_mean, running_var)
2. Kernel Fusion
This is the same concept as operator fusion but at the granularity of kernels: merging kernels that operate on the same data minimizes transfers across kernel boundaries and also reduces kernel-launch overhead.
Here is a sample,
Without Kernel Fusion:
__global__ void add(float* A, float* B, float* D, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        D[idx] = A[idx] + B[idx];
    }
}
__global__ void multiply(float* D, float* C, float* E, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        E[idx] = D[idx] * C[idx];
    }
}
With Kernel Fusion:
A single kernel performs both addition and multiplication.
__global__ void fused_add_multiply(float* A, float* B, float* C, float* E, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float temp = A[idx] + B[idx];
        E[idx] = temp * C[idx];
    }
}
// Launching the fused kernel
fused_add_multiply<<<numBlocks, blockSize>>>(A, B, C, E, N);
3. Recompute
Recompute optimization involves recomputing one or more values that were computed earlier, which might seem counterintuitive. However, if storing and loading these computed values involves significant memory LOAD/STORE operations, it can make the process memory-bound. Since the computational capabilities of GPUs far exceed their data transfer capabilities to and from memory, recomputation can result in significant performance gains.
For example, during training, the activations computed during the forward pass are needed again during the backward pass to compute gradients with respect to the loss function. Storing and then accessing these activations from memory can create a significant bottleneck. Instead, recomputing the activations during the backward pass, rather than fetching them from memory, can significantly speed up the process by leveraging the GPU's superior computational power.
Here is a sample
Without Recompute Optimization
import torch
import torch.nn.functional as F
# Forward and backward pass without recompute
x = torch.randn(64, 784, requires_grad=True) # Batch size 64, input size 784
y = torch.relu(x)
loss = y.sum()
loss.backward()
With Recompute Optimization
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint
# Function whose activations will be recomputed during the backward pass
def relu_fn(x):
    return F.relu(x)

# Forward and backward pass with recompute (activation checkpointing)
x = torch.randn(64, 784, requires_grad=True)  # Batch size 64, input size 784
y = checkpoint(relu_fn, x, use_reentrant=False)
loss = y.sum()
loss.backward()
4. Spatial Blocking
The key idea behind spatial blocking is to divide (or "tile") large data structures into smaller, more manageable blocks that fit into the GPU's cache or shared memory. Each block is processed independently, which helps keep data on-chip for as long as possible, reducing the frequency of slow global memory accesses. This improves cache utilization, reduces memory bandwidth requirements, and enables better parallelism.
Here is a sample
Without Spatial Blocking
__global__ void matrixMul(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // A and B are read from global memory on every iteration, with no on-chip reuse
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
With Spatial Blocking
#define BLOCK_SIZE 16
__global__ void matrixMulTiled(float* A, float* B, float* C, int N) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = by * BLOCK_SIZE + ty;
    int col = bx * BLOCK_SIZE + tx;
    float sum = 0.0f;
    for (int t = 0; t < (N + BLOCK_SIZE - 1) / BLOCK_SIZE; ++t) {
        if (row < N && t * BLOCK_SIZE + tx < N) {
            As[ty][tx] = A[row * N + t * BLOCK_SIZE + tx];
        } else {
            As[ty][tx] = 0.0f;
        }
        if (t * BLOCK_SIZE + ty < N && col < N) {
            Bs[ty][tx] = B[(t * BLOCK_SIZE + ty) * N + col];
        } else {
            Bs[ty][tx] = 0.0f;
        }
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            sum += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();
    }
    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}
5. Compress Data
Data compression is a crucial technique in GPU optimization aimed at reducing memory usage, storage requirements, and data transfer between GPUs as well as between GPU and CPU. This approach is particularly effective for handling large datasets or models that exceed GPU memory capacity or when dealing with sparse data structures.
The choice of compressor depends on your dataset and on your requirements for compression ratio versus compression/decompression speed.
I've used two compression schemes for DL workloads,
LZ4 is a versatile compressor designed for arbitrary data formats. It operates at the byte level, emphasizing simplicity and speed. For optimal performance on GPUs, it's recommended to partition datasets into blocks and leverage GPU thread blocks for concurrent compression and decompression operations. This approach maximizes throughput by exploiting the parallel processing capabilities of modern GPUs.
Cascaded compressor is tailored for numerical and analytical datasets. It excels in scenarios where data has a structured format, allowing for efficient parallelization. This algorithm achieves high throughput on modern GPUs, making it suitable for applications requiring rapid data compression and decompression.
While cascaded compression offers high throughput and efficient parallelization, it may not be suitable for all types of data inputs. Particularly, it may not perform optimally with unstructured datasets such as strings or data lacking clear patterns. In such cases, LZ4 is a better choice.
As a side note, deflate is another compressor, but I don't have any hands-on experience with it.
For NVIDIA GPUs, the nvCOMP CUDA library is a great place to get started.
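As a rough illustration of the block-partitioning idea, here is a minimal CPU-side sketch using the lz4 Python package; the GPU path with nvCOMP follows the same pattern, with each block handed to a GPU thread block. The 1MB chunk size and the toy data are arbitrary choices for illustration.

import lz4.frame
import numpy as np

def compress_in_blocks(array: np.ndarray, block_bytes: int = 1 << 20):
    # Split the tensor's raw bytes into fixed-size blocks and compress each block
    # independently, mirroring how GPU compressors parallelize across thread blocks.
    raw = array.tobytes()
    blocks = [raw[i:i + block_bytes] for i in range(0, len(raw), block_bytes)]
    compressed = [lz4.frame.compress(b) for b in blocks]
    ratio = len(raw) / sum(len(c) for c in compressed)
    return compressed, ratio

# Structured, repetitive data compresses far better than random data
data = np.tile(np.arange(1024, dtype=np.float32), 4096)
_, ratio = compress_in_blocks(data)
print(f"compression ratio: {ratio:.1f}x")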
6. Occupancy
GPUs rely on massive parallelism to hide latency, and occupancy is a key metric of thread-level parallelism: it measures how fully GPU resources (threads, registers, and shared memory) are being used, expressed as the ratio of active warps to the maximum number of warps a streaming multiprocessor can support. Instruction-Level Parallelism (ILP), by contrast, measures how many independent instructions a single thread can issue without waiting on earlier results.
Higher occupancy and ILP improve compute and memory efficiency by giving the scheduler other work to issue while LOAD/STORE operations are in flight; threads waiting on data dependencies or barriers are suspended until the hazards are resolved. Tools like NVIDIA's ncu (Nsight Compute) and nvprof help you understand and fine-tune the parameters that influence occupancy and ILP on NVIDIA GPUs.
Here is an example; the launch configuration together with the kernel's per-thread register and shared-memory usage determines how many blocks fit on each SM, and therefore its occupancy:
__global__ void matrixMultiplication(float* A, float* B, float* C, int N) {
    // Block dimensions chosen at launch time (e.g., 16x16 = 256 threads per block)
    // directly affect how many blocks can be resident per SM.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
Model-level optimizations
Before diving into specific techniques, let's cover how inference works and address some common inefficiencies.
LLM inference consists of two distinct phases: the prefill phase, which processes the input prompt, and the decode or generate phase, which generates output tokens autoregressively.
During the prefill phase, all tokens in the input sequence are processed in parallel, leading to high GPU utilization even with a small batch size. In contrast, the decode phase processes only a single token in each autoregressive pass, resulting in very low GPU utilization.
From my experience, the decode cost per token can be 192-214 times higher than the prefill cost per token. This disparity produces very different latencies for the time to first token (TTFT) and the time between tokens (TBT), especially with chain-of-thought queries where users don't see the intermediate outputs.
Such disproportionate resource utilization implies that for every request, there are periods of high compute utilization due to efficient prefill operations, followed by potentially long periods of inefficient decode operations. This results in poor overall GPU utilization.
LLM inference faces inefficiencies because the decode phase is memory-bound, and the use of pipeline parallelism leads to significant pipeline bubbles in LLMs.
One common challenge is balancing batch sizes to shift the decode phase from memory-bound to compute-intensive. However, scaling up the batch size beyond a limit is impractical due to the KV-cache footprint of each request. For example, we can fit a maximum batch size of 18 requests at a sequence length of 1K for the LLaMA-13B model on an A6000 GPU. Therefore, within the practical range of batch sizes today, the decode phase remains memory-bound.
Another challenge unique to LLMs is the variability of user prompt lengths and output lengths, which forces the inference stack to constantly adjust memory usage and processing strategies to remain efficient.
1. Intermixing prefill with decode tokens
At any given time, LLM inference involves processing varying lengths of prefills and decoding tokens, resulting in different and unpredictable computation times. For example, the question "What is the longest river in the world?" may have a simple answer, "Nile," whereas "Why are rivers ecologically important?" could require a longer response. Depending on how requests are pipelined and handled, the response to the first question might be delayed due to the generation of the second question's response, leading to high inference times or increased latency between token generations.
One effective method to improve GPU utilization can be to create a compute group (CG) that combines a portion of the original prefill tokens along with decodes, instead of having separate prefill-only and decode-only token blocks. The intermixing of prefill and decode within a CG is possible because the prefill operation is highly parallelizable. By processing these compute blocks of prefill and decode, we can efficiently utilize GPU resources during both phases. Additionally, this strategy helps to eliminate runtime variance across requests, leading to more consistent and predictable performance. By implementing this pipeline, we were able to decrease the time between tokens by 11.2X while improving GPU utilization by 1.6X.
2. Quantization
LLMs are continually growing larger, presenting formidable challenges for deployment and real-time inferencing due to constraints in arithmetic density, memory capacity, and especially memory bandwidth between GPUs and high-bandwidth memory (HBM) as well as lower memory tiers.
For instance, the GPT-3 model contains approximately 175 billion parameters with a memory footprint of more than 300GB, making it impractical to deploy on a single GPU such as the NVIDIA H100. Another challenge is the increase in inference latency, which can be prohibitive for many real-world applications.
Model compression using quantization can be an all-around win, reducing model size and compute needs with relatively little effort; however, there are various nuances that must be considered to hit the accuracy mark.
Quantization involves reducing the precision of the model's weights and activations from high-precision formats like FP32 (32-bit floating point) to lower-precision formats such as FP16 (16-bit floating point), INT8 (8-bit integer), INT4 (4-bit integer), or lower. For example, reducing the precision of tensors from 32-bit to 8-bit decreases the memory required to store them by a factor of 4. The cost of a multiply-accumulate operation also scales roughly quadratically with bit width, so the same reduction can cut compute cost by up to 16x. During inference, this not only reduces the in-memory footprint but also minimizes data movement from lower memory tiers to the GPU.
Here are some technical details and practical considerations one must be aware of:
2.1. Symmetric quantization
In symmetric quantization, the quantized values are symmetrically distributed around zero. This means that the range of values for the quantized integers is centered at zero, and the positive and negative ranges are equal. A single scale factor is used to map the floating-point values to the integer range. This is simpler to implement and is hardware efficient, however, it may not utilize the full integer range effectively if the original data is not symmetrically distributed around zero.
2.2. Asymmetric quantization
In asymmetric quantization, the quantized values are not symmetrically distributed around zero. This allows for a more flexible representation, especially useful when the data range is not centered around zero. It is more complex to implement due to the need for an additional zero-point parameter.
2.3. Quantization parameters
Quantization involves three primary parameters: the scale factor (S), the zero-point (Z), and the bit-width (b), which together determine how floating-point values are mapped to and represented in integer form.
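Here is a minimal PyTorch sketch of symmetric and asymmetric INT8 quantization, showing how the scale factor and zero-point fall out of the tensor's range. The min/max calibration used here is the naive choice, purely for illustration.

import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                        # 127 for INT8
    scale = x.abs().max() / qmax                      # single scale; zero-point is implicitly 0
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def quantize_asymmetric(x: torch.Tensor, bits: int = 8):
    qmin, qmax = 0, 2 ** bits - 1                     # [0, 255] for unsigned 8-bit
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(torch.round(-x.min() / scale))   # integer that represents the real value 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

x = torch.randn(4, 4)
q_sym, s = quantize_symmetric(x)
x_hat_sym = q_sym.float() * s                         # dequantize: only rounding error remains
q_asym, s, z = quantize_asymmetric(x)
x_hat_asym = (q_asym.float() - z) * s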
2.4. Quantization Grid and Error Management
These parameters define the limits of the quantization grid (gmin and gmax). Values outside this range are clipped, introducing clipping errors, which is problematic for models where extreme values hold significant meaning. To mitigate clipping errors, one can enlarge the scale factor to expand the quantization range; however, doing so increases rounding errors, potentially degrading model precision.
2.5. Choosing Quantization Parameters
To strike a balance between clipping and rounding errors, the quantization range is typically calibrated on representative data rather than set naively to the observed min/max, for example by choosing the scale that minimizes the mean squared error between the original and quantized tensors, or by clipping at a chosen percentile of the observed values.
2.6. Quantization considerations
The next consideration is quantization granularity, the level at which we apply quantization, as well as how we apply it, e.g. static vs. dynamic.
Per-Tensor Quantization
This is the most straightforward approach, applying the same quantization parameters (scale factor, zero-point, bit-width) to all elements within a single tensor. Per-tensor quantization struggles with tensors containing a wide range of values: a single quantization range might not capture the nuances of all elements, leading to significant accuracy loss for LLMs that rely on precise numerical representations.
Group-wise Quantization (GWQ)
GWQ partitions a tensor into smaller groups (channels in an activation or weight matrix) and calculates quantization parameters for each group individually. This allows a better fit to the distribution of values within each group. Compared to per-tensor quantization, GWQ preserves accuracy better by accommodating the varying ranges of values across groups, but this comes at the cost of added complexity for group management and potentially higher memory usage during the quantization process.
Per-Layer Quantization (PLQ)
PLQ offers a balance between the simplicity of per-tensor quantization and GWQ's group-specific approach. It can achieve better efficiency gains than per-tensor quantization while maintaining acceptable accuracy for many LLM tasks. However, PLQ might not capture the full range of value distributions within a layer, especially for complex LLM architectures with diverse layers.
Besides these, there are two additional approaches which can come in handy in certain scenarios,
Dynamic Quantization (DQ)
DQ adjusts quantization parameters (such as scale factors and zero-points) during inference based on the data distribution encountered at runtime. Unlike static quantization where parameters are fixed, dynamic quantization allows flexibility to adapt to varying input ranges and distributions. This approach is particularly beneficial in scenarios where input data ranges widely during inference, making fixed quantization parameters less effective.
Adaptive Quantization (AQ)
AQ goes a step further by dynamically adjusting quantization strategies or parameters based on feedback or runtime conditions. This approach considers factors beyond just data distribution, such as Real-time performance metrics (e.g., inference latency, throughput) and adaptation to hardware capabilities or operational constraints (e.g., memory limitations, computational resources).
2.7. Handling Outliers in Quantization
Weights typically follow a normal distribution centered around zero. They are stable and usually devoid of outliers, making them straightforward to quantize without significant loss in efficiency or accuracy. Activations, however, are often non-Gaussian and may contain outliers, particularly in scenarios with sparse or highly dynamic data. Quantizing tensors that contain outliers can reduce efficiency and accuracy because of the non-uniform distribution and potentially large-magnitude values.
Managing outliers in activation tensors is therefore essential for optimizing quantization in neural networks. Strategies such as clipping, scaling, and the calibration approaches discussed above can mitigate the impact of outliers on quantization efficiency and accuracy.
LLM.int8() is an elegant solution to the outlier problem. It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. This means that outlier features are processed in a FP16 format to retain their precision, while the other values are processed in an INT8 format. As outliers represent about 0.1% of values, this effectively reduces the memory footprint of the LLM while handling outliers.
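LLM.int8() is implemented in the bitsandbytes library and exposed through Hugging Face Transformers. A rough sketch of loading a model with it is shown below; the model name is a placeholder and the exact keyword arguments can vary across library versions.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"    # placeholder; any causal LM works

# Mixed-precision INT8: weights stored in INT8, outlier features handled in FP16
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                   # spread layers across available GPUs
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))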
2.8. Quantization Techniques: Post-Training vs. Quantization-Aware Training
There are two main approaches to quantization: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Let's dive into their differences and their suitability for LLMs.
Post-Training Quantization (PTQ): PTQ involves training an LLM in full precision (FP32) and then applying quantization techniques after training is complete. This is a simpler approach, requiring minimal modifications to the existing training pipeline. PTQ is straightforward to implement and computationally less expensive compared to QAT. Since the model wasn't trained with quantization in mind, PTQ can lead to accuracy loss, especially if the quantization process is aggressive (using very low bit-widths). This is because the quantization scheme might not capture the full range of values effectively, particularly for models sensitive to numerical precision like LLMs.
Quantization-Aware Training (QAT): QAT integrates quantization techniques into the training process itself. The model is trained with simulated quantization steps, allowing it to learn representations that are more robust to the quantization errors introduced during deployment. QAT can potentially achieve higher accuracy compared to PTQ for the same level of quantization (bit-width), especially for LLMs. This is because the model learns to adapt to the lower precision format during training. QAT requires modifying the training pipeline to incorporate quantization steps, which can be more complex than PTQ. Additionally, it might involve longer training times due to the additional computations involved in simulating quantization.
The optimal choice between PTQ and QAT depends on several factors, including the accuracy you need to preserve, how aggressive the target bit-width is, and the compute budget and training-pipeline changes you can afford.
Here are three popular quantization methods that I have used extensively, though there are many more to evaluate and choose from depending on your own unique needs.
SmoothQuant: Enhancing Activation Quantization
SmoothQuant addresses the challenge of quantizing LLMs by focusing on the quantization of activations, the intermediate results computed during model inference. Activations often exhibit wide distributions with outliers, extreme values that complicate the quantization process. SmoothQuant's key idea is to migrate quantization difficulty from activations to weights through a mathematically equivalent per-channel scaling transformation, smoothing out activation outliers so that both weights and activations can be quantized to INT8 (W8A8).
SmoothQuant enables the efficient post-training quantization of even very large LLMs. By reducing computational complexity and memory requirements, it facilitates faster inference speeds and extends the reach of these powerful models to devices with limited resources.
SPQR: Hybrid Compression for Model Size Reduction
SPQR (Sparse-Quantized Representation) introduces a hybrid compression strategy designed to significantly reduce the size of LLMs while preserving their accuracy. The approach integrates two main techniques: quantizing the bulk of the weights to very low bit-widths (around 3-4 bits) using small quantization groups, and isolating the small fraction of hard-to-quantize weights into a sparse, higher-precision structure.
SPQR also addresses the challenge posed by outlier weights—those with a substantial impact on model performance. It employs specialized strategies to handle outliers by storing them in higher precision formats, ensuring critical model accuracy is preserved during compression.
This two-pronged approach of sparsity and targeted quantization with outlier handling enables SPQR to achieve near-lossless compression of LLMs. It opens avenues for deploying sophisticated language models on a wide range of devices, from edge computing devices to cloud servers, without compromising performance.
GPTQ: Layer-by-Layer Quantization for Precision Optimization
GPTQ represents an innovative approach to optimizing LLMs through layer-by-layer quantization. This technique quantizes the weights one layer at a time, using approximate second-order (Hessian-based) information to choose quantized values that minimize each layer's output reconstruction error, which makes very low bit-widths practical with minimal accuracy loss.
By optimizing quantization at the layer level, GPTQ enhances the overall efficiency of LLMs while maintaining high accuracy. It supports the deployment of complex language models in diverse environments, ranging from real-time applications on edge devices to large-scale deployments in cloud infrastructures.
SmoothQuant, SPQR, and GPTQ offer unique strategies to enhance efficiency and performance across various deployment scenarios. These techniques not only reduce the computational and memory footprint of LLMs but also democratize access to their capabilities, enabling broader adoption in resource constrained environments.
3. Sparsity
Sparsity is a powerful technique used in LLM model compression. It focuses on identifying and zeroing out elements within a model that have minimal impact on its overall accuracy. Imagine a large weight matrix in an LLM: some elements have values very close to zero and contribute little to the model's output. Sparsity techniques identify these "unimportant" elements and set them to zero, effectively pruning them from the model.
The benefits of sparsity include a smaller memory footprint, fewer effective FLOPs (zeroed elements can be skipped), and faster inference on hardware with native sparsity support (for example, structured 2:4 sparsity on recent NVIDIA tensor cores).
One popular technique is magnitude-based sparsity. Here, elements in a tensor (a multidimensional array) are pruned based on their absolute values (magnitudes): elements with very small magnitudes are considered less important and are removed. This approach can achieve remarkable compression rates, significantly reducing the model size with minimal impact on accuracy.
In practice, applying sparsity before quantization yields much better model accuracy. For example, at a 40% sparsity level, 8-bit quantization can reduce the total memory footprint by 6.5X.
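Here is a short sketch of magnitude-based pruning using PyTorch's built-in torch.nn.utils.prune utilities on a toy linear layer; the 40% level mirrors the example above.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)

# Zero out the 40% of weights with the smallest absolute values (magnitudes)
prune.l1_unstructured(layer, name="weight", amount=0.4)
prune.remove(layer, "weight")   # make the pruning permanent by baking the zeros into the tensor

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")   # ~40%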
4. LoRA (Low-Rank Adaptation)
Fine-tuning LLMs can be computationally expensive due to the massive number of parameters involved. LoRA addresses this challenge by significantly reducing the number of parameters that need to be trained during fine-tuning. Instead of modifying the entire weight matrix directly (billions of parameters), it introduces a pair of much smaller, low-rank matrices. These low-rank matrices capture the essential information needed for adaptation, significantly reducing the number of trainable parameters. LoRA reduces training time because only a small subset of parameters needs updating, leading to faster convergence during fine-tuning. It also significantly reduces the memory footprint required during fine-tuning, which makes it particularly beneficial for resource-constrained environments.
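A minimal sketch of the idea, wrapping a frozen linear layer with a trainable rank-r update; the rank, scaling, and initialization choices here are illustrative defaults, not prescriptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Output = frozen W x + (alpha / r) * B A x, where A and B are small low-rank matrices.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable parameters vs ~16.8M in the full layer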
5. QLoRA (Quantized Low-Rank Adaptation)
QLoRA builds upon the success of LoRA for efficient LLM fine-tuning. While LoRA significantly reduces the number of trainable parameters by using low-rank matrices, QLoRA takes it a step further by incorporating quantization: the frozen base-model weights are stored in a 4-bit format (NF4), while the small LoRA adapters are trained in higher precision. This drastically reduces memory usage compared to full fine-tuning with floating-point parameters, allowing fine-tuning on devices with limited memory. Quantization also reduces the computational and memory-transfer cost associated with training and inference. Despite the reduction in precision, QLoRA aims to maintain the model's ability to adapt to new fine-tuning tasks while preserving overall performance; this balance between efficiency and accuracy is crucial for practical applications.
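In practice, QLoRA is usually set up through bitsandbytes and PEFT. A rough sketch under those assumptions follows; the model id, target modules, and hyperparameters are placeholders, and exact argument names can differ between library versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 storage for the frozen base weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)  # only the LoRA adapters remain trainable
model.print_trainable_parameters()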
6. 1-bit LLM
The 1-bit LLM deserves a special mention because of the remarkable compression it offers while reportedly matching the accuracy of a 16-bit floating-point LLM baseline (according to claims made by the authors of the BitNet b1.58 paper). It optimizes the storage and processing requirements of language models by constraining every weight of the LLM to a ternary value {-1, 0, 1}. A 7B-parameter model in 32-bit precision requires about 26GB of memory, which rules out deployment on a single device in many cases, e.g. mobile phones. The same model as a 1-bit LLM would need only about 0.815GB of memory. I am uncertain when such a model will be available or whether anyone has been able to train one following the methodology of the BitNet b1.58 paper. It would be particularly useful for IoT devices, with their extremely limited memory and processing power, and for real-time applications where instant responses are crucial, such as advanced text prediction and auto-correction.
I have not yet seen a working implementation; it remains conceptual. My team attempted to replicate the results with our limited resources, but we couldn't achieve the accuracy described in the paper, such as matching the 16-bit floating-point baseline.
7. FlashAttention and FlashAttention-2
FlashAttention, as introduced in the FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness paper, is an I/O-aware exact attention algorithm that leverages established techniques such as tiling (see earlier about Spatial Blocking) to minimize the number of memory reads and writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. By utilizing tiling, Flash Attention efficiently partitions the computation to keep most of the data on-chip, thereby reducing the need to access slower off-chip memory.
Additionally, FlashAttention employs a recompute strategy during the backward pass: instead of reading the intermediate attention matrix from HBM, it recalculates attention on-chip. Despite the increase in floating-point operations (FLOPs), this recomputation significantly speeds up the process compared to the standard approach, which relies on extensive memory accesses.
According to the authors, Flash Attention not only runs up to 7.6 times faster on models like GPT-2 but also uses considerably less memory. The algorithm achieves linear memory usage with respect to sequence length, in contrast to the quadratic memory complexity of standard attention mechanisms. This efficiency is primarily due to the dramatically reduced HBM access, making Flash Attention both faster and more memory-efficient.
FlashAttention-2 builds upon the success of FlashAttention by reducing non-matmul operations and addressing inefficiencies in parallelization, thread partitioning, and shared memory access on GPUs. This is crucial because matrix multiplications, thanks to Tensor Cores, can be significantly faster (up to 16X) than non-matmul operations. FlashAttention-2 parallelizes both the forward and backward passes across the sequence-length dimension, in addition to FlashAttention's batch and head parallelization. It also partitions the computation better between the warps of a GPU thread block to reduce shared memory access. All this adds up to roughly a 2X speed improvement over FlashAttention.
FlashAttention has been adopted widely by the AI community and is integrated with popular ML frameworks. PyTorch 2.2 supports both FlashAttention and FlashAttention-2 through scaled_dot_product_attention; see torch.backends.cuda.enable_flash_sdp.
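A minimal sketch of hitting the fused attention path through PyTorch's scaled_dot_product_attention; the tensor shapes are illustrative, and on supported GPUs and dtypes PyTorch dispatches this call to a FlashAttention kernel.

import torch
import torch.nn.functional as F

torch.backends.cuda.enable_flash_sdp(True)   # allow the FlashAttention backend

batch, heads, seq_len, head_dim = 4, 16, 2048, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: the (seq_len x seq_len) attention matrix is never materialized in HBM
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([4, 16, 2048, 64])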
8. PagedAttention
The paper titled Efficient Memory Management for Large Language Model Serving with PagedAttention addresses the inefficiency of static memory allocation in LLM serving systems. These systems typically pre-allocate a fixed amount of contiguous memory to store the KV cache needed for generating a response. This approach suffers from several limitations: memory is reserved for the maximum possible output length even when responses turn out to be short (internal fragmentation), differently sized reservations leave unusable gaps in GPU memory (external fragmentation), and memory cannot be shared across requests, for example during parallel sampling from the same prompt.
PagedAttention introduces the concept of logical and physical memory blocks stored in the KV Cache. Similar to virtual memory in operating systems, this design provides the illusion of contiguous RAM even though data is physically scattered across memory. Logical blocks represent the data the model accesses, while physical blocks indicate the actual memory locations.
The paper illustrates storing the KV cache of two requests simultaneously: with PagedAttention, the KV cache no longer needs to occupy contiguous memory space.
To leverage PagedAttention in a distributed environment, the paper introduces vLLM. vLLM achieves up to 24 times higher throughput compared to the Hugging Face Transformers library (HF) and up to 3.5 times higher throughput compared to Hugging Face Text Generation Inference (TGI).
vLLM seamlessly supports most popular open-source models, including Llama and Mistral.
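Getting started with vLLM takes only a few lines; a rough sketch is below, with the model name and sampling parameters as placeholders.

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain PagedAttention in one sentence.",
           "Why is the decode phase of LLM inference memory-bound?"]
outputs = llm.generate(prompts, params)   # continuous batching + paged KV cache under the hood
for out in outputs:
    print(out.outputs[0].text)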
9. Computational Parallelism
Tensor Parallelism and Pipeline Parallelism are techniques used to distribute computational workloads across multiple devices during the training and inference of Large Language Models (LLMs). Both methods enhance the efficiency and scalability of processing large models but in distinct ways.
Tensor Parallelism involves sharding the model weights across different GPUs, typically connected through a high-performance interconnect like NVLink. Each GPU holds a portion of the model's weights and performs operations on its assigned part, freeing up memory to accommodate larger contexts. The results are then combined to produce the final output. This method is effective for operations that can be divided into smaller, independent tasks, such as matrix multiplications, leveraging the parallel processing power of multiple GPUs.
Pipeline Parallelism divides the model into different segments (or stages), assigning each to a separate device. Each device processes its segment sequentially, passing intermediate results to the next device in the pipeline. To maximize efficiency, pipeline parallelism can overlap the processing of different batches; while one batch is processed by the first segment, another batch can be processed by subsequent segments. This approach is suitable for very deep neural networks where the model can be logically divided into stages, reducing memory usage on individual devices and enabling the handling of large models that cannot fit entirely on a single GPU.
In practice, these methods are often combined to optimize the training and inference of large-scale models. Tensor Parallelism handles the parallelization of operations within each segment, while Pipeline Parallelism manages the sequential processing of these segments across devices. This hybrid approach maximizes the utilization of available hardware resources, ensuring efficient and scalable model training and inference.
10. Continuous Batching
While tensor parallelism and pipeline parallelism tackle the model itself, continuous batching focuses on optimizing how individual requests are processed during inference. Traditional inference treats each LLM request as a separate job, which means the model's massive parameters are reloaded for every request, wasting processing power and increasing latency.
Incoming requests are placed in a queue. The system monitors this queue and forms batches based on predefined criteria such as maximum batch size or time limits. Continuous batching works at the token level: it processes one token at a time for each request in the batch, and a new request can be added as soon as a slot frees up when a previous request finishes generating its output. This keeps the GPU constantly busy. When a request finishes (signaled by an EOS token), its slot in the batch is released, and a new request can be inserted and begin processing its first token, as sketched below.
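Here is a toy scheduler loop that captures the slot-recycling idea. It is purely illustrative and not tied to any serving framework; the Request class below is a stand-in whose step() decodes one token and whose is_finished() stands in for EOS detection.

import random
from collections import deque

class Request:
    # Toy stand-in for a decoding request: each step() emits one token.
    def __init__(self, prompt, max_new_tokens):
        self.prompt, self.tokens, self.budget = prompt, [], max_new_tokens
    def step(self):
        self.tokens.append("tok")                  # pretend we decoded one token
    def is_finished(self):
        return len(self.tokens) >= self.budget     # stand-in for hitting the EOS token

MAX_BATCH = 4
waiting = deque(Request(f"prompt-{i}", random.randint(2, 10)) for i in range(16))
active = []

while waiting or active:
    # Fill any free slots immediately; no need to wait for the whole batch to drain
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())
    # One decode iteration: every active request advances by exactly one token
    for req in active:
        req.step()
    # Requests that just finished free their slot for the next waiting request
    active = [req for req in active if not req.is_finished()]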
11. Pre-Attention Input Tensor Batching
Pre-attention batching in LLMs is a technique that builds upon continuous batching to further improve efficiency, particularly when dealing with requests of varying lengths or complexities. Here is the concept:
Continuous batching processes requests on a token-by-token basis, improving efficiency by keeping model parameters loaded and maximizing GPU utilization. However, naively coalescing requests into one batched computation is only possible when they are in the same phase and have the same length.
Let's say we have two requests
R1 : I am going to the
R2: Where is
R1 has already gone through an execution pass and its current input is the single token 'the' (decode phase), while R2 is a new request with two input tokens (prefill phase). R1 and R2 are in different phases and have different input tensor shapes, [1, H] and [2, H] respectively, so we can't coalesce the two for the attention operation: attention must be computed separately per request, because the model has to consider all relevant context within a single request before generating its next token.
The pipeline therefore batches the operations that don't depend on sequence shape, such as the QKV projections and feed-forward layers, across R1 and R2, splits out the attention computation per request, and merges the results before the next layer.
By batching these independent operations, pre-attention batching further optimizes GPU utilization and can lead to faster processing than continuous batching alone, while allowing efficient handling of requests with different lengths and complexities.
12. Speculative Decoding
Optimizing the inference cost of large language models (LLMs) is crucial for reducing the expenses of generative AI and enhancing its adoption. The standard method for generating text from a language model involves autoregressive sampling, where decoding N tokens requires N serial model runs. Although GPUs theoretically can compute all future tokens in parallel, each token's generation is dependent on the previous one, resulting in sequential processing.
Speculative decoding is an optimization technique for inference that makes informed guesses about future tokens while generating the current token within a single forward pass. It includes a validation mechanism to ensure the accuracy of these speculated tokens, guaranteeing that the overall output matches that of regular decoding.
In speculative decoding, a small and cost-effective draft model generates a candidate sequence of K tokens. This draft sequence is then batch processed through the larger model, nearly as efficiently as processing a single token. When evaluating the logits from the large model, tokens are sampled from left to right; wherever the sampled token matches the draft token, we accept it and skip to the next position immediately. On a mismatch, the rest of the draft is discarded, and additional computational cost is incurred for the subsequent tokens.
This method is effective because most draft tokens are accepted, being straightforward and easily predictable by the smaller model. Accepted tokens enable rapid progression through these parts. For challenging tokens where the large model disagrees, the process reverts to its original speed, slightly slower due to increased computations.
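A greatly simplified sketch of the draft-then-verify loop is shown below, using greedy acceptance and a batch size of one. Here draft_model and target_model are assumed to be callables returning next-token logits of shape (batch, sequence, vocab); real implementations, such as the roughly 50-line version on the PyTorch website, also handle probabilistic acceptance and the extra bonus token.

import torch

@torch.no_grad()
def speculative_generate(target_model, draft_model, tokens, k=4, steps=32):
    # Greedy draft-and-verify: the draft proposes k tokens, the target checks them in one pass.
    for _ in range(steps):
        # 1) Draft model proposes k tokens autoregressively (cheap, small model)
        draft = tokens.clone()
        for _ in range(k):
            next_tok = draft_model(draft)[:, -1, :].argmax(dim=-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)

        # 2) Target model scores the whole draft in a single forward pass (one HBM round trip)
        logits = target_model(draft)
        target_pred = logits[:, -k - 1:-1, :].argmax(dim=-1)   # target's choice at each draft position
        proposed = draft[:, -k:]

        # 3) Accept the longest matching prefix, then take the target's token at the first mismatch
        matches = (target_pred == proposed)[0].long()
        n_accept = int(matches.cumprod(dim=0).sum())
        tokens = torch.cat([tokens, proposed[:, :n_accept],
                            target_pred[:, n_accept:n_accept + 1]], dim=-1)
    return tokens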
Speculative decoding benefits significantly from sparsity (covered earlier and see Activation function below), which reduces computational overhead, enhances memory efficiency, enables parallel processing, facilitates selective activation, and improves overall scalability.
In my experience with Llama3 8B, speculative decoding accelerated inference by 1.8X. Although it requires additional compute and incurs some wasted computation, the savings in HBM round trips more than compensate for these costs.
This technique, consisting of about 50 lines of code, is available on the PyTorch website.
A recent advancement eliminates the need for a draft model. The open-source Medusa adds a speculator to the last layer of the base model, training multiple decoding heads within the same model. This approach removes the complexity of a separate model. While I haven't personally tested it yet, the results from the paper show promising outcomes, offering a simpler implementation alternative to the draft model approach.
13. Key-value caching
This optimization is standard in most ML libraries/frameworks, but it's important to understand the concept.
As discussed earlier, during the decode phase, each token is generated sequentially, with each token depending on the key and value tensors from all previous tokens. To avoid recomputing these tensors for every token at each time step, they can be cached in GPU memory. With each iteration, newly computed elements are added to the running cache, which is then used in subsequent iterations.
Monitoring and efficiently managing the KV cache is crucial for smooth inference. For instance, with a Llama2 7B model in 16-bit precision, a batch size of 3, and a context length of 4K, the size of the KV cache would be approximately 6GB.
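A quick back-of-the-envelope sketch of how that number comes about; the layer and head counts below are the published Llama2 7B configuration, and the formula is the standard per-token KV footprint (2 for K and V, times layers, KV heads, head dimension, and bytes per element).

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K and V for every layer
    return per_token * seq_len * batch / 1e9

# Llama2 7B: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes per element)
print(kv_cache_gb(32, 32, 128, seq_len=4096, batch=3))   # ~6.4 GB, matching the estimate above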
14. Grouped Query Attention (GQA)
Before diving into GQA, let's briefly discuss a related optimization called Multi-query attention (MQA).
In traditional multi-head attention (MHA), as proposed in the original Attention paper, the key and value elements, which hold crucial context information, must be reloaded for each new token. This becomes a bottleneck for long sequences due to repeated memory access, slowing down processing.
Multi-query attention (MQA) streamlines inference by sharing a single set of key and value elements across all attention heads. While maintaining multiple heads for the "query" elements, MQA uses one set of keys and values for all heads. This approach significantly reduces memory access, improves processing speed, especially for long sequences, and reduces the size of the KV-cache, allowing for larger batch sizes. However, while MQA enhances speed, sharing keys and values across all heads may lead to a drop in accuracy compared to MHA.
GQA acts as a middle ground between MHA and MQA, addressing MQA's limitations:
GQA divides the query heads from MHA into groups. Each group shares a single set of key-value elements. This creates a balance between speed and accuracy. Within each group, queries can still attend to different parts of the key-value pairs, allowing for some level of focused attention compared to MQA's single set for all queries. From an implementation standpoint, multiple GPUs are used during inference, with the KV cache and query heads distributed across them. Each GPU runs an attention layer with a single KV head and a group of query heads. Thus, from the perspective of a single GPU, the grouped-query attention (GQA) component functions similarly to multi-query attention (MQA).
By allowing some differentiation within groups, GQA retains a better ability to capture complex relationships than MQA's single key-value set for all queries, while still delivering faster processing and reduced memory requirements for inference.
If you're well-versed in multi-head attention code, enhancing it for GQA is pretty straightforward. It's important to keep the number of groups as a tunable parameter for experimentation.
Finding the optimal number of groups (and, in turn, the number of GPUs needed) for your use case is crucial. Too few groups might lead to accuracy issues like MQA, while too many might negate the speedup benefit.
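A minimal sketch of a grouped-query attention forward pass in PyTorch; shapes and the group count are illustrative, and the K/V heads are simply repeated to cover their query group.

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_groups):
    # q: (B, n_q_heads, T, d); k, v: (B, n_groups, T, d) with n_q_heads divisible by n_groups
    n_q_heads = q.shape[1]
    repeat = n_q_heads // n_groups
    # Each group of query heads shares one K/V head (MQA when n_groups == 1, MHA when == n_q_heads)
    k = k.repeat_interleave(repeat, dim=1)
    v = v.repeat_interleave(repeat, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, n_q_heads, n_groups, T, d = 2, 32, 8, 128, 64
q = torch.randn(B, n_q_heads, T, d)
k = torch.randn(B, n_groups, T, d)
v = torch.randn(B, n_groups, T, d)
print(grouped_query_attention(q, k, v, n_groups).shape)   # torch.Size([2, 32, 128, 64])

The n_groups knob sweeps the whole spectrum from MQA (one group) to full MHA (one group per query head), which makes it easy to explore the accuracy/speed trade-off.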
15. Activation function
Activation functions are crucial in neural networks, introducing non-linearity and enabling the modeling of complex patterns within data.
The original Transformer architecture employed the ReLU activation function. Subsequent variants such as SwiGLU, GELU, and SiLU have been introduced, claiming faster convergence and improved accuracy. However, non-ReLU activations often incur higher computational costs due to reduced sparsity. In contrast, ReLU induces activation sparsity, leading to significant savings in weight transfer (I/O) between GPUs and CPUs. This sparsity affects approximately 90-95% of rows in projection layer weights, enhancing computational efficiency.
A recent paper from Apple, titled ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, effectively highlights this concept. The authors provide activation sparsity measurements for various activation functions.
ReLU induces over 90% activation sparsity, while SiLU and GELU barely move the needle. The authors also demonstrate that when trained from scratch, there is no significant difference in performance between the different activation functions.
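Measuring activation sparsity is straightforward; the small sketch below compares the fraction of exactly-zero activations produced by ReLU versus GELU on random Gaussian inputs. It is illustrative only and not the paper's measurement protocol: in trained LLMs the ReLU zero fraction is far higher than on random data.

import torch
import torch.nn.functional as F

x = torch.randn(4096, 11008)   # stand-in for an FFN pre-activation tensor
relu_sparsity = (F.relu(x) == 0).float().mean().item()
gelu_sparsity = (F.gelu(x) == 0).float().mean().item()
print(f"ReLU zero fraction: {relu_sparsity:.1%}")   # ~50% on Gaussian input; 90%+ reported for trained LLMs
print(f"GELU zero fraction: {gelu_sparsity:.1%}")   # essentially 0%: outputs are small but nonzero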
Even if non-ReLU activations offer some advantages in accuracy and convergence, in my experience, these benefits are offset by the increased computational cost (FLOPS) required for inference, ultimately negating the net benefit.
For memory and computationally optimized models from training to inference, I recommend starting with ReLU (or Leaky ReLU if necessary). When experimenting with non-ReLU activations, it's crucial to establish clear success criteria across multiple dimensions such as cost, performance, and latency, and evaluate comprehensively.
Key System Metrics for Inference
Time To First Token (TTFT): Measures the time taken from receiving an input prompt to generating the first token of the response, indicating the model's responsiveness. Key factors include the model partitioning, scheduling algorithm, GPU FLOPs, and interconnect latency.
Time Per Output Token (TPOT): Also referred to as Time Between Tokens (TBT). Measures the average time taken to generate each token after the first one, which directly determines how engaged the user stays. For example, a TPOT of 50 milliseconds/token is 20 tokens per second, or roughly 1,200 tokens (about 900 words) per minute, comfortably faster than most people read.
The total latency for a response is TTFT plus TPOT multiplied by the number of output tokens.
Throughput: Total number of output tokens per second the inference setup can generate across all user requests.
Tail Latency: Represents the worst-case response times (e.g. at the 99th percentile) that only a small fraction of requests experience. For real-time use cases, high tail latency will lead to poor user experiences.
FLOPS Utilization: Ratio of the observed FLOPS to the GPU's theoretical maximum.
Cost Efficiency: The cost associated with running inferences, often measured in terms of compute hours or dollars per inference.
Closing thoughts
This concludes Part 2, where we explored various system and model optimizations for enhancing training and inference in the development of high-performing, cost-optimized AI systems. By being aware of and implementing these techniques based on their specific needs, organizations can effectively leverage advancements in AI to build efficient and scalable solutions.
To get started, I recommend focusing on some of the more accessible and impactful optimizations. Begin with Quantization, FlashAttention, Key-value caching, and LoRA.
These techniques are readily available and supported by most machine learning frameworks, making them relatively straightforward to integrate into your existing workflows.
Quantization can significantly reduce the computational load and memory footprint of your models, enabling faster inference and lower power consumption without substantial loss of accuracy. FlashAttention optimizes the attention mechanism in transformer models, leading to speed improvements in training and inference. Key-value caching helps manage memory more efficiently during sequential data processing, which can be especially beneficial for tasks involving long context windows. Lastly, LoRA offers a method for fine-tuning large models with minimal additional parameters, making it an efficient approach for domain-specific adaptations.
Once these foundational techniques are in place, you can explore more advanced optimizations tailored to your unique requirements to push the boundaries of performance and efficiency.
Iteration and experimentation pave the way to unlocking the full potential of AI and achieving strategic goals.