LLM Inference - HW/SW Optimizations
This article follows my previous articles on large language model (LLM) training. In it, I explain the details of the LLM inference workflow, how it differs from training, the many hardware/software optimizations that make inference efficient, and the inference hardware landscape.
In the context of LLMs, inference refers to the process of getting a response from the trained LLM model for the user's query or prompts. Inference is a critical step in deploying LLMs. However, a lot goes on before the model is deployed for production.
First, the trained model undergoes several optimizations to reduce its memory footprint and computational intensity. The optimized model is compiled for the specific hardware (GPUs or inference accelerators). The compiled models are stored in file servers of the inference serving systems. An inference-serving system refers to the entire infrastructure and software ecosystem designed to manage and serve AI/ML models for inference. They vary in complexity and features but contain several key components.
The load balancers (software applications) distribute the user requests among many inference servers, which are SW applications hosted on the CPUs. A cluster of GPUs or inference accelerators is associated with each inference server. Inference servers copy the compiled models from the file system to accelerator memories. When they receive the user requests directed to them, they batch multiple requests together to improve the overall throughput and send the requests to the inference accelerator clusters to be executed. These servers also pre-process the input data and post-process the results. Other functions include monitoring the system's throughput, security, etc. The Triton Inference Server from Nvidia is an example of an inference server.
Inference-serving systems aim to provide fast, reliable, and scalable responses to inference requests in public clouds/data centers while keeping the total cost of ownership (TCO) low. This is a challenge for LLM models due to their gigantic size and the auto-regressive nature of LLM inference, where tokens are generated sequentially for every query.
As LLM inference workloads increase in public clouds and enterprises seek to have inference systems local in their data centers (to avoid paying premiums to public cloud providers for each query), there is a frenzy of activity in academia, start-ups, and hyperscaler research labs to optimize all facets of inference. This article tries to capture the latest on this topic.
Disclaimer: I am a Juniper employee, but the opinions expressed in this blog are my own and do not necessarily reflect those of my employer.
LLM Training Refresher
A quick refresher on LLM training (borrowed from my previous article) is given below.
To train an LLM using natural language text, large amounts of data are typically gathered from web scrapes, Wikipedia, GitHub, Stack Exchange, arXiv, etc. The vast amount of text from these data sets is first tokenized, often using methods like byte-pair encoding. Tokenization translates the raw text from the internet into a sequence of integers called tokens. A token generally represents a word. But it could be a subword, too. For example, the word "unhappy" might get broken into two tokens - one for the subword "un" and the other for the subword "happy."
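To make this concrete, here is a minimal sketch of tokenization using the open-source tiktoken library (assumed to be installed; any BPE tokenizer behaves similarly). The exact way a word like "unhappy" splits depends on the tokenizer's learned vocabulary.

```python
# Hypothetical illustration: inspecting how a BPE tokenizer splits text into tokens.
# Assumes the open-source `tiktoken` package; any BPE tokenizer would behave similarly.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # a GPT-style BPE vocabulary
ids = enc.encode("I am unhappy about tokenization")
print(ids)                                    # a sequence of integer token IDs
print([enc.decode([i]) for i in ids])         # the text piece each ID maps back to
# Common words map to a single token; rarer words are split into subword pieces.
# The exact split depends on the tokenizer's learned vocabulary.
```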
Depending on the dataset, there could be tens of thousands of unique tokens, and the dataset itself could map to hundreds of billions of tokens. Sequence or context length is the number of consecutive tokens the model will look at when predicting the next token during the training. The sequence length is ~2K in GPT-3 and ~4K in LLaMA2 (LLM from Meta).
To train the model, the tokens are broken into arrays of size batch_size (B) x sequence_length, and these batches are fed to large neural networks with transformer layers. The model is trained to predict the next word given the input sequence. The training often takes weeks, if not months, and requires large clusters of GPUs. Once the base model is trained, it typically goes through Supervised Fine-Tuning (SFT). This important step turns LLMs into assistants that answer human prompts! In supervised fine-tuning, human contractors create a curated dataset in the form of a prompt followed by a response, and the base model is retrained with this dataset. The trained SFT model now becomes an assistant capable of giving human-like responses to user prompts.
Attention is all you need!
All LLM models contain several layers of transformer blocks. The "attention" mechanism, described in the seminal "attention is all you need" paper, is at the core of these transformers. This allows the model to focus on the most relevant parts of the input sequence while predicting the output token. It enables the model to capture long-range dependencies and better understand the context.
"Attention," in human interactions, generally refers to our ability to focus on a few things and ignore other things that seem irrelevant at the time to get the context quickly. For example, "bank" could mean food bank, word bank, or financial institution. But, when we hear the sentence "I am going to the bank to deposit my paycheck," the brain immediately understands the word "bank" as a financial institution by taking clues from different parts of the sentence. "going to" implies the bank is a physical location. "deposit my paycheck," tells the place is a financial institution. We do this contextualization naturally, a skill without which we would not have been where we are today!
The transformer blocks in the LLM models teach the model to do the same. They were originally introduced for machine translation—translating a sentence from one language to another using the relevant context. Later, these transformer blocks literally transformed the world (no pun intended) by being able to predict the next word in the sequence with the context they had learned from the input sequences.
Understanding and appreciating the transformer architecture fully requires a deep machine learning/AI background, which is beyond the scope of this document. However, a general overview of the computational complexity during training/inference of the models with transformer layers is needed to understand some inference optimizations discussed later.
LLM Inferencing
As mentioned earlier, a trained LLM model is essentially a next-token predictor given an input sequence of tokens. To generate full responses to users' queries, the model (inference) server takes the output token from one iteration of inference, concatenates it to the user input sequence, and feeds it back into the model as the new input sequence to predict the next token. This process, where the output is fed back to the input to predict the next output, is called "auto-regressive" computation. This process repeats until it reaches a predefined stopping criteria.
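A minimal sketch of this auto-regressive loop is shown below; `model` is a placeholder for a trained LLM that returns next-token logits, not a real API.

```python
# A minimal sketch of the auto-regressive decode loop described above.
# `model` is a placeholder for a trained LLM that returns next-token logits.
import numpy as np

def generate(model, prompt_ids, eos_id, max_new_tokens=256):
    tokens = list(prompt_ids)                 # start with the user's prompt tokens
    for _ in range(max_new_tokens):           # stopping criterion #1: max length
        logits = model(tokens)                # forward pass over the current sequence
        next_id = int(np.argmax(logits))      # greedy pick (sampling is also common)
        tokens.append(next_id)                # feed the output back as input
        if next_id == eos_id:                 # stopping criterion #2: end-of-sequence
            break
    return tokens
```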
Steps to Predict the Next Token
Let's first look at the steps involved in predicting an output token from an input sequence using a trained model.
Key Vector (K): The key vector for a token determines the influence or 'attention' that token should have on other tokens in the sequence. It encodes information about the position and context of that token in the input sequence. This "key space" of the input sequence will be compared against the queries.
Value Vector (V): The value vector represents the actual information or content of the token, carrying the information that will be used in the output.
Query vector (Q): The query vector for each token is used to probe all the keys in the input sequence to determine how much focus or attention each part (token) of the input sequence should receive for this specific token.
The attention function can be simultaneously computed in the hardware on a set of queries, packed into a matrix Q. The keys and values are also packed into matrices K and V.
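For illustration, here is a minimal NumPy sketch of the scaled dot-product attention function, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, over the packed matrices; real models add batching, causal masking, and multiple heads, and the shapes below are illustrative.

```python
# A minimal sketch of scaled dot-product attention over packed Q, K, V matrices:
#   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # how much each query attends to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of the value vectors

seq_len, d_model = 8, 64
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)
print(attention(Q, K, V).shape)                   # (8, 64): one output vector per token
```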
This is an oversimplified explanation of LLM inference. The actual LLM models are huge, with multiple transformer blocks, each containing multi-head attention modules and feed-forward layers.
Steps to Generate the Full Sequence
Now, let's look at how the inference serving system generates the full response to the user.
When a user prompt exceeds the sequence length on which the model was trained, the inference server automatically truncates the input, typically keeping the most recent part of the prompt. This approach assumes that the most recent context is often the most relevant for generating the response. As LLMs advance, there's a growing need to handle longer prompts. Prompt compression techniques have emerged to address this challenge. One method used in LLMLingua involves compressing the prompt using a smaller language model to identify and remove less important tokens. This enables efficient inference from compressed prompts while maintaining the essence of the original input.
Generation happens in two phases. In the prefill phase, all the prompt tokens are processed in parallel in a single forward pass to populate the KV cache and produce the first output token. In the decode phase that follows, only one output token is generated during every iteration. The matrix-vector operations to generate a single token at a time underutilize the GPU compute ability compared to the prefill phase. In the decode phase, the speed at which the parameters and the key-value pairs are read from the GPU memory dominates the throughput. Thus, the decode phase is memory bound, while the prefill phase is compute bound.
LLM Inference System Metrics
The following key metrics are often used to compare/evaluate different LLM serving systems.
Time To First Token (TTFT): This measures the time taken from receiving an input prompt to generating the first token of the response. It's an important indicator of the model's responsiveness, particularly in real-time user-facing applications. This depends heavily on the scheduling algorithm the model server uses to feed the user inputs to the model, the partitioning of the model across the inference accelerators in the cluster, the accelerator's performance (FLOPs), and the interconnect latency.
Time Per Output Token (TPOT): This is the average time the model takes to generate individual tokens in response to the input prompt. This metric assesses how each user will perceive the model's speed. For example, a TPOT of 100 milliseconds/token would be 600 tokens per minute. Since some tokens could also represent partial words, 600 tokens typically contain ~450 words. 450 words per minute (WPM) is faster than the rate at which most of us can read (200-300 WPM).
TPOT and TTFT are important metrics to keep users engaged with the LLM applications. The total latency for the response is the sum of TTFT and the TPOT * number of tokens generated.
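A quick back-of-the-envelope check of these numbers (the TTFT, TPOT, and token count below are illustrative values, not measurements):

```python
# Back-of-the-envelope check of the numbers above (illustrative values only).
ttft_s   = 0.5      # time to first token, seconds
tpot_s   = 0.100    # time per output token: 100 ms/token
n_tokens = 250      # output tokens generated for a request

tokens_per_minute = 60 / tpot_s                      # 600 tokens/min at 100 ms/token
words_per_minute  = tokens_per_minute * 0.75         # ~0.75 words per token -> ~450 WPM
total_latency_s   = ttft_s + tpot_s * n_tokens       # TTFT + TPOT * output tokens
print(tokens_per_minute, words_per_minute, total_latency_s)
```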
Throughput: While the prefill phase is compute-intensive, the decode phase is memory-bound, where parameters and key-value pairs are read from the GPU memory to compute the next token. To increase the efficiency of GPUs, the inferencing systems often batch together multiple user prompts for inference. By batching, the memory read cost can be amortized across all the user requests. In addition, GPU computing can be used more efficiently by doing parallel computing across large batches.
The throughput of an inference server is the total number of output tokens per second the server can generate across all user requests.
Tail Latency: Refers to the latency at the high percentiles (e.g., the 99th percentile). It represents the worst-case response times that only a small fraction of requests experience. In user-facing real-time applications like chatbots/co-pilots, tail latency is critical. High tail latency can lead to poor user experiences, even if the average latency is low.
Model FLOPS Utilization (MFU): The ratio of the observed throughput to the hardware accelerator's theoretical maximum throughput.
The lowest TTFT and TPOT per user and the highest throughput are desired in any inference serving system. However, increasing the throughput using larger batches (more users) is not always feasible. There are inference accelerator memory limitations, as each user's inference process takes up a significant chunk of GPU memory to store the key-value pairs. Also, each user's prompts are of different lengths, creating complexities in scheduling.
Accelerator Memory Requirements
For the best inference metrics, ideally, all the weights of the trained model should be in the inference accelerator's external memory (typically, HBM memory). GPT-4, the largest LLM model deployed today, is rumored to have ~1.7T parameters. If the parameters are stored in 16-bit floating point representations, this translates to 3400 GB of memory.
If Nvidia's H100 GPUs, with 80GB memory capacity per GPU, are used as inference accelerators, the inferencing system would require ~43 GPUs (five to six 8-GPU servers) to store the parameters alone.
In addition to parameters, inferencing systems also store the key-value pairs to save on computing costs. The equation for the maximum memory size of the key-value (KV) pairs is:
Total size of KV cache in bytes = (batch_size) x 2 x (sequence_length) x (num_layers) x (hidden_size) x SizeOf(FP16)
The sequence_length is the maximum number of tokens (sum of input and output tokens) allowed per user request for the model. The factor of 2 accounts for storing both a key and a value per KV pair. These are usually stored in 16-bit precision format (2 bytes each). num_layers is the number of transformer attention layers. hidden_size is the model's hidden dimension, i.e., the size of each token's representation in the attention layers.
Since each user request in the batch has its own KV cache, the memory required is directly proportional to the number of requests in the batch or the batch size.
Since it is impossible to know a priori how many output tokens will be generated, inferencing systems can take a conservative approach of allocating space for the maximum sequence length's worth of KV pairs for each user in the batch and each attention layer.
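As a worked example of the formula above, the sketch below computes the worst-case KV cache size for LLaMA2-7B-like dimensions (32 layers, hidden size 4096, 4K max sequence length - assumed here for illustration):

```python
# A worked example of the KV-cache formula above, with LLaMA2-7B-like
# dimensions assumed for illustration (32 layers, hidden size 4096, 4K context).
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_size, bytes_per_value=2):
    # 2 = one key + one value per token, per layer; FP16 = 2 bytes per element
    return batch_size * 2 * seq_len * num_layers * hidden_size * bytes_per_value

GB = 1024 ** 3
for batch in (1, 8, 32, 64):
    size = kv_cache_bytes(batch, seq_len=4096, num_layers=32, hidden_size=4096)
    print(f"batch {batch:3d}: {size / GB:6.1f} GB of KV cache (worst case)")
# With conservative max-sequence-length allocation, the KV cache quickly rivals
# or exceeds the ~14 GB needed for the FP16 weights of a 7B-parameter model.
```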
As shown in Table 2, with this conservative approach, the KV cache dominates the total memory required at large batch sizes.
The more requests in a batch, the more inference accelerators are needed to distribute the KV-pairs and the parameters.
The batch size used by inference serving systems depends heavily on the number of inference accelerators each model server can use, the latency, and the TCO metrics the system targets.
Reducing the memory footprint of the model without compromising the inference metrics is a hot topic. The next section covers all the techniques (used/emerging) to improve inference efficiency and reduce memory footprint.
Optimizing the Inference Cost
Model Architecture Optimizations
Memory/Compute Optimizations
Throughput Optimizations
Hardware Optimizations
The following sections detail each of the above topics.
Model Architecture Optimizations
Attention Enhancements
All LLM models do multi-head attention in their transformer blocks. It allows the model to simultaneously attend to different aspects of the input sequence, leading to a more comprehensive understanding. A transformer with two attention heads has two sets of learnable parameters (weights) for queries, keys, and values.
Each attention head operates independently and calculates its own attention scores. This allows each head to focus on different types of relationships, like long-range dependencies, syntactic structures, or semantic similarities. The attention function output of each head is then concatenated, creating a single vector that captures the combined information from all the different perspectives. This then goes through the rest of the processing to predict the next token.
Multi-head attention with N heads increases the computational and memory complexity N times. But it generates more coherent, consistent, relevant, and meaningful outputs.
To overcome the increase in computing and memory costs, the attention layer dimension could be reduced linearly by the number of heads to keep the computational costs similar to single-attention head models (without significant performance loss). Multi-head attention models use tensor parallelism, where each head of a transformer block could be computed in separate tensor-parallel GPUs to reduce the inference time.
Multi-query attention is a variant of multi-head attention that gets similar results but with only one set of key-value pairs. It creates multiple "queries" (questions) for each token but uses the same set of keys and values. With shared keys/values, the parameters and KV cache values must be loaded only once from the memory, thus reducing the memory requirements.
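The KV cache savings from multi-query attention can be seen from a simple shape-level sketch (the sequence length, head count, and head dimension below are illustrative):

```python
# A shape-level sketch contrasting multi-head attention (MHA) with
# multi-query attention (MQA). Dimensions are illustrative.
num_heads, head_dim = 32, 128

# MHA: every head has its own K and V -> the KV cache holds all heads per token.
mha_kv_per_token = 2 * num_heads * head_dim   # keys + values, all heads
# MQA: all query heads share a single K/V head -> the cache shrinks by num_heads.
mqa_kv_per_token = 2 * 1 * head_dim

print("MHA KV values per token:", mha_kv_per_token)   # 8192
print("MQA KV values per token:", mqa_kv_per_token)   # 256 (32x smaller)
```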
Optimizing the attention mechanisms is a hot area of research, and new mechanisms are being developed to address complex relationships in the input data, especially with multi-modal inputs, etc. It is beyond the scope of this document to cover all the details.
Model Distillation
Model distillation is the process of transferring the "distilled" knowledge from larger trained models (teacher model) to a smaller and more efficient model (student model) so that the student model retains the performance of the teacher model while reducing the compute and memory resources needed for inference.
The student model is usually a truncated version of the teacher model where multiple transformer layers are removed. It can be trained using the output probabilities (soft targets) generated by the teacher model. Soft targets provide more information per training example, such as the relative probabilities of incorrect answers. In some cases, the student model is also trained to replicate intermediate representations (like activations of certain layers) of the teacher model.
Model distillation allows the deployment of language models on devices like smartphones and resource-constrained embedded systems, rather than only on the high-end GPUs/accelerators found in data centers. It can also be used in edge applications where rapid response times are crucial.
However, distillation is an emerging field. Some smaller models like BERT (100-200M parameters) have shown good results with distillation, with 40% fewer parameters in the student model and almost similar performance as the original BERT model. This is yet to be proven for hundreds of billions of parameter LLM models.
Speculative Execution
During the inference's decode phase, the output tokens per user request are generated one token at a time. The Nth token depends on all the (N-1) tokens before it. This process is heavily memory-bound. It also adds to the total latency of inference.
What if the model could generate multiple output tokens per iteration? That is where speculative execution, or speculative sampling, can help. A smaller, faster "draft" model generates K consecutive output tokens. Then the main model verifies these generated tokens in parallel. For example, suppose the draft model generates 3 tokens, and the first two tokens match what the main model would have predicted. In that case, the next iteration starts after appending these two tokens to the input, and the steps repeat. This adds compute overhead, but the assumption is that when the execution is memory-bound, the inference accelerators have idle cycles, and using them to do more parallel computation helps overall.
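A greedy sketch of the idea is below. `draft_model` and `target_model` are placeholders (not real APIs), and real speculative sampling uses a probabilistic acceptance rule rather than the exact-match check shown here.

```python
# A greedy sketch of speculative decoding with placeholder models.
def speculative_step(target_model, draft_model, tokens, k=3):
    # 1. The small, fast draft model proposes k tokens, one at a time.
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        t = draft_model(draft)
        proposed.append(t)
        draft.append(t)

    # 2. The target model scores all k positions in a single parallel forward
    #    pass. Here, target_model(prefix, num_positions=k) stands in for its
    #    greedy choice at each position, conditioned on the proposed prefix.
    verified = target_model(tokens, num_positions=k)

    # 3. Accept the longest prefix where draft and target agree; at the first
    #    mismatch, keep the target model's token instead.
    accepted = []
    for p, v in zip(proposed, verified):
        if p == v:
            accepted.append(p)
        else:
            accepted.append(v)
            break
    return tokens + accepted
```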
Memory/Compute Optimizations
Optimizing KV Cache
Batching is critical to improving the throughput of LLM inference systems. However, as seen in the previous sections, the KV cache size increases linearly with the batch size, and the allocation is conservative: while the scheduler knows the size of the input sequence, it does not know the length of the output sequence that will be generated, so it allocates for the worst case to ensure that the memory does not overflow.
While this approach is straightforward, it overprovisions the inference accelerator's memory for caching. As a result, the system either uses more accelerators than needed or runs smaller batches, reducing throughput.
PagedAttention addresses this problem using the operating system's well-known solution for memory fragmentation, i.e., virtual memory with paging. Instead of allocating a contiguous space in the memory for a request's KV cache, the memory is allocated in blocks dynamically. The blocks are not necessarily contiguous in the inference cluster's memory. The scheme flexibly manages the KV cache. It allows the sharing of memory between multiple requests and reduces the overall memory requirements by 2-4x for the same batch size. This increases the total throughput 2-4x over KV caches with fixed allocations.
This scheme doesn’t guarantee perfect memory utilization. Still, it significantly reduces the wastage from ahead-of-time allocation schemes used widely by all the inference frameworks before this scheme was published.
As with a virtual memory scheme, when no physical blocks are left in the accelerator's memory, it selects a few user requests and evicts their KV-pair values to the server's CPU memory. It then stops processing the evicted requests and stops accepting new ones. Once any active request completes execution, its memory blocks are freed, and the preempted requests are returned to the GPU memory to continue processing. All major LLM serving systems (including Nvidia's TensorRT-LLM) have adopted this method for throughput gains.
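The core bookkeeping behind paged KV caching can be illustrated with a toy block allocator; real implementations such as vLLM also handle block sharing, eviction to CPU memory, and the GPU-side address mapping.

```python
# A toy sketch of the block-table idea behind PagedAttention: each request's
# KV cache grows in fixed-size blocks allocated on demand from a shared pool,
# rather than as one large contiguous reservation.
BLOCK_TOKENS = 16                       # KV entries per physical block

class BlockAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))   # pool of physical block IDs
        self.block_tables = {}                          # request_id -> [block IDs]

    def append_token(self, request_id, token_index):
        table = self.block_tables.setdefault(request_id, [])
        if token_index % BLOCK_TOKENS == 0:             # current block full (or first token)
            if not self.free:
                raise MemoryError("no free blocks: preempt/evict a request")
            table.append(self.free.pop())               # grab a new, possibly non-contiguous block
        return table[-1], token_index % BLOCK_TOKENS    # (physical block, slot within block)

    def release(self, request_id):
        self.free.extend(self.block_tables.pop(request_id, []))  # free blocks on completion

alloc = BlockAllocator(num_physical_blocks=1024)
for i in range(40):                     # a request with 40 tokens uses only 3 blocks
    alloc.append_token("req-1", i)
print(alloc.block_tables["req-1"])
```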
Quantization
Nowadays, most models are trained with either 32-bit or 16-bit precision floating point numbers for weights and intermediate activations. Quantization refers to a technique that reduces the model's size and computational requirements by decreasing the precision of its parameters and/or activations. There are two forms of quantization: post-training quantization and quantization-aware training.
Post-training quantization (PTQ) applies quantization to the model weights (parameters) after it has been fully trained. This is done before the model is deployed for inference. This method converts the weights to lower precision, like 8-bit integers (INT8) or 8-bit floating points (FP8). Quantization is done by first calculating the scale factor from the range of weight values for each layer. The model weights for that layer are scaled using the scaling factor. The scaled values are then rounded to the nearest integer (when converting to INT8). Finally, these rounded values are cast to 8-bit integers.
Activations can also be quantized by passing representative data through the model, recording the activation values, and computing the ranges. Then, these ranges are mapped to a lower precision format similar to the weights.
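A minimal sketch of symmetric per-tensor INT8 post-training quantization, following the steps above (real frameworks typically quantize per channel or per layer and use calibration data):

```python
# A minimal sketch of symmetric per-tensor INT8 post-training quantization:
# compute a scale from the value range, scale, round, and cast.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0           # map the observed range onto [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale       # approximate reconstruction at run time

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02   # a fake weight matrix
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"memory: {w.nbytes/1e6:.0f} MB -> {q.nbytes/1e6:.0f} MB, max error {err:.5f}")
```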
As model sizes grow to hundreds of billions of parameters, outlier features of high magnitude start to emerge in the transformer layers, causing this simple per-layer quantization technique to fail. Proposed remedies include quantizing weights and activations at different precisions, and mixed-precision schemes where some activations stay in 16-bit while the rest use 8-bit.
Not all quantization techniques would result in memory/compute savings if the underlying hardware can't exploit it. For example, if the inference accelerator does not support INT4 format in matrix multiplications, then casting a weight to INT4 does not help.
Dynamic Quantization (Post Training Dynamic Quantization): In this method, the weights of the model (either all layers or selective layers) are quantized to lower precision numbers before the inference run. This is done by the inference server after it fetches the model from the storage system and before it uploads the model into the accelerator clusters.
In addition, the inference accelerator can also quantize the activations natively on the fly during the model execution. This method can adapt to the varying range of activation values, leading to better accuracy. Several inference accelerators support quantization in the hardware to speed up the dynamic quantization of the activations.
Quantization-Aware Training (QAT): In this method, after the initial training, the model is fine-tuned (re-trained) using lower precision weights. This technique is robust to quantization effects, leading to better accuracy. QAT is compute-intensive as it requires retraining. It also burdens the inference serving systems to save/manage many quantized versions of the same models, as the user requirements can vary.
Almost all inference frameworks support post-training quantization to speed up computation time and reduce memory footprint. I barely scratched the surface of this topic. For a deeper understanding of the many flavors of quantizations, refer to the documentation from various inference frameworks.
Logarithmic Number Format
At Hot Chips 2023, Dr. Bill Dally from Nvidia discussed a 4-bit log number format to continue scaling past INT8. With log numbers, multiplications and divisions essentially become additions and subtractions. This could reduce the energy needed for complex matrix multiplications. But addition/subtraction becomes more involved; some implementations use lookup tables, which are expensive and do not scale well for AI inference workloads. In his presentation, Bill Dally showcased Nvidia's novel technique for multiply-accumulation. Nvidia may use a logarithmic number system in some of its next-gen inference accelerators. A slew of start-ups, as well as research labs, are exploring various 4-bit number formats!
Pruning
In the trained LLM models with billions of parameters, some are more critical than others for performance. Pruning a network involves identifying and keeping these significant weights while discarding the less important ones. By pruning the weights, the model becomes compact, takes up less memory space, and needs fewer computational resources. The challenge is to remove as many weights as possible without impairing the network's ability to make accurate predictions.
Pruning is typically done after the model is trained. In this post-training pruning, the model needs to be fine-tuned after the pruning to regain the performance.
The techniques used are either weight magnitude-based pruning or structured pruning. The neural network weights are ranked based on their absolute values in weight magnitude-based pruning. Weights with the smallest absolute values (closest to zero) are considered the least important and are pruned or zeroed out. While the network becomes sparser, this sparsity is unstructured. This may or may not lead to actual computational efficiency improvements in the hardware as the hardware is not designed to skip over random zeros in the weights when doing matrix multiplications!
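A minimal sketch of unstructured magnitude-based pruning (the matrix size and sparsity level are illustrative):

```python
# A minimal sketch of unstructured magnitude-based pruning: zero out the
# smallest-magnitude fraction of weights. The resulting sparsity is random,
# which is exactly why plain hardware may not speed it up.
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    k = int(w.size * sparsity)                        # number of weights to remove
    threshold = np.sort(np.abs(w), axis=None)[k]      # magnitude cut-off
    mask = np.abs(w) >= threshold                     # keep only the largest weights
    return w * mask, mask

w = np.random.randn(1024, 1024).astype(np.float32)
pruned, mask = magnitude_prune(w, sparsity=0.5)
print(f"{100 * (1 - mask.mean()):.1f}% of weights zeroed")
```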
To benefit from the unstructured sparsity in the large weight matrices, the common inference frameworks support sparse matrix multiplications, where they can algorithmically map large sparse matrix multiplications to smaller dense matrix multiplications in the hardware. For example, in one approach, a large weight matrix can be broken into several smaller blocks, and the compiler identifies blocks that are entirely zero, skipping the storage and computations on these blocks.
Even if the framework does not support sparse matrix multiplications, sparse weight matrices can be compressed and stored in the accelerator memory using standard compression algorithms, and the hardware could natively decompress the weights before use. This reduces the memory requirements at the expense of decompression logic in the hardware.
Structured pruning involves removing structural components of the large models, like entire layers or attention heads. In some types of structured pruning, two out of every four weights in a weight matrix are pruned to zero to enable compressed weights to be stored in the accelerator memory with additional 2-bit indices for each weight, as shown in the diagram below.
Pruning can also be done before training. In this technique, weights are initialized randomly before training starts. A certain proportion of weights are selected to be pruned randomly. The pruned weights are held at zero during training. This pruning at initialization is inspired by the "Lottery Ticket Hypothesis," which suggests that within a large, randomly initialized network, there may exist smaller subnetworks ("winning tickets") that can achieve comparable performance to the full network when trained in isolation from the start. However, this method has not yet been used on LLM models.
Sparse attention
This is a technique used to reduce the computational complexity of attention functions. It limits the attention of each token to only a subset of previous tokens rather than attending to all tokens in the input sequence. For example, in linear sparse attention, each token attends to a fixed-size window of nearby tokens. By doing this, the quadratic complexity involved in computing the output is reduced to linear complexity. However, there is a delicate balance between model accuracy and compute optimization.
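A minimal sketch of a sliding-window sparse attention mask, where each token attends only to the previous few tokens (the window size is illustrative):

```python
# A minimal sketch of a sliding-window (local) sparse attention mask: token i
# may attend only to the previous `window` tokens instead of all i tokens,
# reducing the attention cost from quadratic to linear in sequence length.
import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)     # causal AND within the local window

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))                    # each row has at most 3 ones
```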
Throughput Optimizations
Model Partitioning
In my previous blog, "GPU Fabrics for GenAI Workloads," I extensively covered model partitioning for training. Training is a much more computationally intensive task with large data sets. It requires many model copies across thousands of GPUs (for data parallelism) that work in sync for each iteration. The gradients must be updated across all the model copies before the next iteration can start.
Inference is a less resource-intensive task compared to that. Each iteration of token generation goes through the forward pass of the model to compute the next token and update the KV cache. There is no gradient aggregation or parameter update for billions of parameters spanning large clusters before the next iteration. A rule of thumb is that each iteration of inference (one generated token) requires FLOPs of roughly one to two times the number of parameters in the model.
Thus, the size of the accelerator cluster is orders of magnitude less than what is required for training. For example, Table 3 states that we need about 4 A100 GPUs (or one GPU server) to host a GPT-3 model.
The internal details of GPT-4, the largest foundational model today, are not publicly available. But assuming linear scaling (GPT-4 is 10x of GPT-3), we may require ~38 A100 GPUs or 5 servers. More GPUs could be added than minimally required to improve the system's latency and throughput.
Many open-source and commercially available models for enterprises have less than 100B parameters. Using A100/H100 might be overkill, as shown in Table 3, where the LLaMA2 7B model can generate >300 tokens/second per request. For optimal user experience in real-time applications, 20-100 tokens/second is good enough.
LLaMA2 models can use less power-hungry accelerators like Nvidia's L4, with ~24GB of memory per accelerator. As seen in Table 4, a single L4 is good enough for the LLaMA2 7B/13B models, while the 70B model needs a five-accelerator cluster.
Pipeline and tensor parallelism are used to partition the model across the accelerators when the model needs more than one accelerator. Tensor parallelism is critical for inference as it decreases the latency by breaking up the computation in each layer across multiple GPUs. Attention blocks and multi-layer perceptron (MLP) layers are major components of transformers that can take advantage of tensor parallelism. In multi-head attention blocks, each head or group of heads can be assigned to a different device to be computed in parallel.
The model partitioning for inference does not need to match the partitioning done during training. Each inference serving system has its cluster topology and hardware. The topology/hardware-aware compilers in these systems partition the models to meet their performance, power, and latency targets. For example, when using the GPU servers from Nvidia, the compiler (TensorRT-LLM) tries to keep the high bandwidth tensor parallel partitions of a model layer within a GPU server where the GPUs communicate with each other through high-speed links.
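A shape-level sketch of tensor parallelism for a single linear layer is shown below; plain NumPy arrays stand in for the per-GPU shards and the collective communication step, and the dimensions are illustrative.

```python
# A shape-level sketch of tensor parallelism for one linear layer: the weight
# matrix is split column-wise across devices, each device computes a partial
# output, and the partials are concatenated (or all-gathered) afterward.
import numpy as np

hidden, ffn, num_devices = 4096, 16384, 4
x = np.random.randn(1, hidden)                     # one token's activations
W = np.random.randn(hidden, ffn)                   # full weight matrix (for reference)

shards = np.split(W, num_devices, axis=1)          # each "GPU" holds hidden x ffn/4
partials = [x @ w_shard for w_shard in shards]     # computed in parallel on each device
y = np.concatenate(partials, axis=1)               # the collective communication step

assert np.allclose(y, x @ W)                       # same result as the unpartitioned layer
```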
Continuous Batching
Batching improves the utilization of the accelerators. In static batching, the simplest technique, new requests can't be added to the batch until the inference on all the requests in it is complete. In other words, the scheduler works at the granularity of user requests. This is illustrated in the figure below. This technique is extremely inefficient for LLM inferencing, as each request in the batch is unique and may need a different number of iterations through the model to generate its response. Some requests in the batch finish earlier than others; if new requests are not scheduled in those idle slots, the system becomes inefficient, with GPUs underutilized.
Static batching is not efficient for auto-regressive inferencing. This problem is not present during training, as all requests in a training batch have the same sequence length and a single forward pass predicts the next token at every position of the sequence.
Iteration-level scheduling, as described in the Orca paper, overcomes this. Some frameworks refer to this as continuous or dynamic batching. The batch size is constant here, but the inference server's scheduler works at the iteration level granularity. At the end of an iteration of the new token generation, if the scheduler detects that one request has completed execution (all tokens for that request are generated), then it immediately returns the tokens of that request to the client, picks a new request and starts processing that request in the same slot as the completed request. Thus, this scheduling uses GPU resources more efficiently, and latency is also improved for user requests.
The above description is an oversimplified explanation for iteration-level scheduling. The actual implementation needs to account for differences in computing requirements of pre-fill versus the decoding phases and several other cases that are too deep for this blog.
The iteration-level batching can still create head-of-line (HOL) blocking for new requests, as a new user request cannot enter the execution phase until one of the current requests in the batch finishes execution. There is a valid reason to do it this way: Orca's scheme needs to maintain the KV cache only for the ongoing jobs, which is strictly equal to the batch size. If jobs are interleaved at the iteration level (a new job takes the slot of a previous job for the next iteration even if the previous job is not complete), it can give better TTFT for the new requests, but the GPU memory requirements shoot up as the inference cluster now needs to keep the KV cache values for all the active jobs.
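A toy sketch of iteration-level scheduling is shown below; `step` is a placeholder that decodes one token for each active request and reports which requests completed in that iteration.

```python
# A toy sketch of iteration-level (continuous) batching: after every decode
# iteration, finished requests leave the batch and queued requests take their
# slots immediately, instead of waiting for the whole batch to drain.
from collections import deque

def serve(step, waiting, max_batch_size):
    waiting = deque(waiting)           # queued request IDs
    active = []                        # requests currently in the batch
    while waiting or active:
        # Fill any free slots with waiting requests at the iteration boundary.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        finished = step(active)        # decode one token per active request
        # Completed requests free their slots (and KV-cache blocks) right away.
        active = [r for r in active if r not in finished]
```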
Hardware Optimizations
Custom Hardware vs GPUs
The choice of hardware for inference depends heavily on where the inference is being performed. Currently, most LLM inference happens within data centers and public clouds due to easy access to thousands of powerful GPUs and robust network infrastructure provided by cloud service providers. However, LLM inferencing on edge (where the endpoint devices are) holds significant promise for the future as processing data closer to users can significantly decrease latency with better privacy and security. The cost/power savings of edge inference could be greater with hardware accelerators optimized specifically for LLM inferencing.
Nvidia GPUs remain dominant in data center/public cloud inference, offering a mature ecosystem, high performance, and broad software support. Nvidia offers a wide range of GPUs with a trade-off between performance and power. While their high-end H100/H200 GPUs can also be used for high throughput inference on trillion parameter models like GPT-4, they do offer lower-end GPUs like A100/L4/L40S/T4 targeted for low-cost/low-power inference on medium-sized models. As they introduce next-generation GPUs for AI training, they continue to offer previous-generation GPU servers for inference.
A GPU server with 8 x H100 or A100 GPUs has a total memory of 80GB x 8 = 640GB. As seen from Table 3, this can hold many medium-sized models with decent batch sizes. If more than 8 GPUs are needed, one option is to use the DGX GPU pods (up to 256 GPU systems) built by Nvidia. But those are super expensive, costing millions of dollars.
An alternate option is to store most of the model parameters in the server (CPU) memory and stream them to the GPUs before use. This could cause large latencies in a typical server where PCIe links are used for communication between GPUs and CPUs. Nvidia's GH200 solves this problem with high-speed NVLink connections between the "Grace" CPU and the "Hopper" GPU. This enables the GPU to access ~480GB of LPDDR5X memory on the CPU over a 900GB/s link. On top of the 96-144GB of HBM attached to the Hopper GPU, this CPU memory gives plenty of room for inferencing on a single GH200 server card.
Nvidia GPUs' processing engines (SMs) contain tensor cores for matrix multiply-accumulate operations, which can perform massively parallel matrix operations and provide speedup and efficiency over standard floating-point/integer units. The high-end GPUs come with transformer engines that can analyze each layer of a transformer model and automatically choose the optimal precision format for that layer's activations.
Nvidia's TensorRT-LLM is an open-source high-performance inference optimizer that incorporates most of the techniques for inference run-time optimizations (continuous batching, paged attention, quantization, layer fusions, and many more).
AMD is also becoming a significant player in the GPU solutions space for LLM inference, offering a mix of powerful GPUs and tailored software. The company's Instinct series MI300X and MI300A accelerators are strong contenders to Nvidia's GPUs. AMD's SW stack has also improved significantly in recent years.
However, a good portion of the high-end GPU's die area is dedicated to graphics processing units like texture and raster engines, which are idling during inference. This logic not only adds to the die area but also the die cost and overall power. Even if the logic remains idle during LLM inference, it still consumes leakage power.
Also, GPUs contain powerful standard arithmetic units (not part of the tensor cores) capable of handling double/single-precision floating point numbers and INT32. This logic is mostly unused during inference. Increasingly, most LLM models are quantized to 16-bit floating point (FP16/BF16) and 8-bit integers for tensor operations during inference. Since the same GPUs are used for training and inference of many different types of models, the tensor cores in GPUs also continue to support 64-bit/32-bit matrix operations.
Dedicated hardware exclusively for AI inference can optimize the die area/cost by not having unused logic and supporting only the minimum numbering formats needed for its target inference applications. In addition to highly parallel matrix and vector processing units, the hardware can support weight matrix decompression, structural pruning, dynamic quantization, native support for linear transformation functions found in the LLM models, new 4-bit number formats, etc., to improve the overall efficiency of inference.
Almost all the hyperscalers are building high-performance AI accelerators to replace GPUs. Many start-ups are also targeting standalone/affordable inference system solutions with custom accelerators for data centers. The following sections briefly preview the landscape of non-GPU-based inference accelerators. It is not exhaustive by any means, as more players are entering/coming out of stealth mode almost every month!
Any custom hardware should still have some flexibility built into it through processing engines that can execute instructions - either a standard ISA like RISC-V or custom instruction sets. Without this flexibility, the hardware can't keep up with continuous innovations in the model landscape.
Google's TPUs
Google is at the forefront with its TPUs (Tensor Processing Units). TPUs contain thousands of matrix multiply-accumulators that are directly connected to form a large physical matrix. This is called a systolic array architecture. In addition, they also have vector processing units with flexible instruction sets.
The compiler first partitions and maps the large weight matrices of a trained model to different TPUs. Then, the inference server (host) transfers them to the TPU's high-bandwidth memory using specialized communication protocols. Each host typically connects to 4-8 TPUs.
During inference, the TPU loads the parameters and the input data from HBM memory into the matrix multiply units to perform the matrix operations. After processing sub-matrices, intermediate results are communicated to other matrix and vector processing units within the TPU, or across chips in a pod using dedicated high-speed interconnect networks, for further processing. Partial results are accumulated within these units to form the final output matrix and sent back to the host. TPUs can outperform GPUs when dealing with the large input batches and matrices found in foundational models.
The "lite" version of TPUv5 (TPUv5e) is targeted towards inference by optimizing the die area (halving the number of tensor cores), doubling the interconnect throughput, and running the cores faster. Google claims they get better power and performance per dollar when inferencing with a "lite" version for medium and latency-sensitive inference workloads.
Amazon's Trainium2/Inferentia2
Amazon builds custom high-performance AI accelerator chips for training and inference and deploys them in their AWS cloud.
Trainium2 is the second generation of their custom-designed chip built for training LLMs and other deep-learning models. The chip can also be used for high-performance inference. Its core (called NeuronCore) contains tensor, vector, and scalar processing units for matrix/vector and scalar processing. It also has a general-purpose single instruction multiple data (SIMD) engine with custom ISAs for added flexibility in executing the models. Inferentia2 is a scaled-down version of Trainium2 with half the number of cores.
These chips have custom high-speed links (NeuronLink) to connect with each other. Trainium2 chips can be connected together in 2D or 3D torus topology (similar to TPU pods) to make clusters of hundreds of thousands of these accelerator chips for foundational model training.
A node consisting of 12 Inferentia2 chips (192GB of total accelerator memory) connected with NeuronLinks can run inference on many large language models. Amazon deploys these inference modules in EC2 clusters.
Meta's MTIA
Meta unveiled details about its AI accelerator, MTIA, mid-last year. The ASIC has up to 128GB of LPDDR5 DRAM for off-chip memory and 64 processing engines that support heavily customized RISC-V instruction sets, plus hardware logic for vector/matrix and non-linear transformation processing. In the inference server, 12 of these accelerators are connected through a hierarchy of PCIe switches - probably not as fast/efficient as the custom links used by Amazon/Nvidia.
Meta claimed to show better performance per watt for low-complexity DLRM models, which are smaller and quite different from LLM models. GPU outperformed MTIA for larger models, which Meta attributes to software inefficiency and memory/interconnect bandwidth limitations. The results from MTIA are not bad, considering this is the first version of their architecture. It usually takes a few generations for the architecture to mature and address the workloads for which the chip is targeted. Although Mark Zuckerberg is loading up on thousands of Nvidia H100 GPUs, I believe Meta will continue to invest in high-performance AI training and inference chips and target LLM inference acceleration in their next-generation chipsets.
Intel's Gaudi2
Intel provides AI acceleration engines inside Xeon processors (CPUs) for small AI workloads. Its second-generation Gaudi2 AI accelerator chips are for high-performance training and inference. Gaudi2 has custom hardware for matrix multiplications and VLIW SIMD processing engines to accelerate other operations. Gaudi2 integrates RDMA over Converged Ethernet (RoCEv2) engines and has 24 x 100GbE Ethernet ports for chip-to-chip interconnect. This native integration of RoCE allows customers to use the same interconnect both inside the server and rack (scale-up) and across racks (scale-out) using standard Ethernet switches.
Qualcomm also offers data center inference cards using custom AI engines. However, not many details of their chip are available.
Recently, Microsoft joined the race with their Maia AI Accelerator for generative AI training/inference workloads. Their announcement suggests that Maia supports < 8-bit numbering formats (most probably using the MX data types unveiled at OCP 2023). They plan to deploy these accelerators in their Azure cloud this year.
Startup Ecosystem
Several startups offer inference chips/systems for data centers and low-power IOT applications. In this article, my main focus is on data center-grade inference accelerators.
The startup Groq has taken an interesting approach to inference by removing high-latency external memory accesses altogether and using only on-chip SRAMs to store the model parameters. It uses a dataflow architecture akin to a very long fixed pipeline with no reordering, arbitration, or scheduling anywhere, and the functional units execute in lockstep with fixed latencies. The compiler knows which tensors are stored in which SRAM and where the data will be in the pipeline at any cycle, so it schedules instructions such that the data arrives exactly when the instruction that operates on it executes. These chips are less expensive as they don't need HBM integration and complex packaging. Multiple chips in a node connect to each other through custom chip-to-chip interconnects in a Dragonfly topology to make longer pipelines. The lockstep execution is maintained across chips by synchronizing the chip-to-chip links.
A rack contains 9 nodes with 8 chips in each node. It needs 8 racks (~576 chips) to do the inference for the LLaMA2 70B model! While that is a lot of hardware, the company claims it can get 300 tokens per second at 1/10th the power of an H100 GPU server. This power estimate could be for a batch size of one. This architecture will have a hard time scaling to larger foundational models. However, as an edge inference system running open-source LLaMA models at low power, this could be an attractive solution for enterprises looking to have the systems on their premises for lower TCO and where rack space is not a concern.
Graphcore IPU chip architecture is also loosely based on this philosophy of avoiding external memory in favor of distributed on-chip SRAMs near the processing cores to reduce the memory bottleneck and make computing more efficient.
SambaNova, on the other hand, uses three tiers of memory (DDR, HBM, and on-chip SRAM) to give a single accelerator chip access to 1.6TB of memory, so that inference on trillion-parameter models could be done with a handful of these accelerators. However, they have not published inference metrics, and it remains to be seen how well their inference execution can hide the long latencies of the DDR memory.
In these systems built by startups, SW stacks and compilers play a critical role in translating and mapping the trained models to the accelerator's HW architecture. The challenge for these compilers is keeping up with the ever-changing landscape of LLM models and versions. Their custom inference servers must incorporate the latest model optimizations and batching/dynamic memory allocation techniques.
I've barely covered the inference system startups, focusing only on those with openly available hardware details. Several other startups, like Recogni, SimaAI, and Sapeon, to name a few, build custom hardware and SW stacks for inference.
In Memory Compute/PIM
Processing in memory (PIM) and in-memory computation have been topics of interest for a while, and they have gained fresh momentum in the context of LLMs. In the matrix-vector or matrix-matrix multiplications that make up the majority of computations in LLM inference, the parameters are read from memory, and the intermediate state is saved back to memory. Samsung, Hynix, and a few startups claim that moving the matrix multiplications and other transformer operations onto the memory die itself would improve inference performance and power consumption, as there is less data movement between the external memory and the accelerator's core. Initial results from a few prototypes were presented at the Hot Chips 2023 conference.
But, by adding logic to DRAM dies, we will give up close to half the memory capacity of the DRAM for the logic area. Memory capacity is a critical factor for large models. Further, adding the logic in the DRAM process node means the logic is not super-optimized for power, performance, or area. The power savings depend heavily on how much the workload can be parallelized across all the banks. It is also unclear how easy it is to integrate this software into the common AI/ML frameworks. Finally, both memory vendors do their own proprietary processing engines and software stacks. Even within the vendor, there is no consistency in the SW stacks for HBM versus GDDR/LPDDR memories. This technology might see more traction in the inference accelerators for mobile and IoT devices, which host smaller models and where saving every milliwatt matters.
Scale-Up/Scale-Out
Unlike LLM training, inference does not need large GPU clusters with thousands of GPUs where the GPUs need to work in lock-step for every iteration!
The cluster size depends on the underlying inference accelerator, the models it supports, how well the models were optimized, how efficiently the compiler can map the models to the underlying hardware, dynamic memory management, and the overall throughput the system must support.
Most GPUs/accelerators offer nodes with up to 8/12 chips connected to each other through their proprietary high-speed interconnects. As shown in Table 3, a single server is plenty for most inference workloads.
These nodes can connect to standard Ethernet or InfiniBand fabrics through NICs. When more than one node is needed, GPU-based systems must scale up/scale out over these standard fabrics, which adds switch cost, power, and latency.
This is where the cost/power advantage of the non-GPU inference accelerators with proprietary high-speed interconnects comes into play. These interconnects provide higher bandwidth in scale-out/scale-up systems at lower cost, with smaller latencies and better congestion control - for example, TPUv5e chips connected in a 2D torus topology within a pod. Not having to use standard Ethernet or InfiniBand switches, which come with their own price tag and power, could help the overall TCO. However, Google's TPUs and Amazon's Trainium/Inferentia chips can only be used inside their clouds.
For enterprises building their own inference systems, options beyond GPUs largely come from startups offering systems with potentially better cost and power efficiency. However, one can argue that even if the cost is lower, a software stack that is not mature enough to keep the hardware fully utilized across different models/versions and workloads could offset any cost savings. The inference landscape is evolving quickly, with many new optimizations that even larger companies have difficulty keeping up with. The inference ecosystem needs to mature before many startups can catch up on all the latest techniques and build SW stacks comparable to what Nvidia possesses.
Necessity is the mother of invention. If the GPU-based server costs continue to increase with monopoly from the two GPU vendors, the accelerators from some of these startups may find their way into public clouds, data centers, and enterprises in a few years.
Summary
In this article, I reviewed the LLM inference workflow and the many optimizations that go into reducing the memory footprint and computational complexity. I also reviewed a few inference accelerators and the pros/cons of using custom/non-GPU accelerators for inferencing. Although I tried to make this article comprehensive, I feel I barely touched on all the recent advances, and edge inference on IoT/mobile devices is not covered.
LLM inference chatbots are fast replacing Google search and are becoming essential tools that we can't live without. And there is no longer any dispute that enterprises, small and large, can benefit immensely from deploying LLMs that have access to internal data.
With LLM workloads growing exponentially, more enterprises may want to own the LLM fine-tuning and inferencing systems on their premises or data centers rather than pay hefty sums to public cloud operators. Even service providers might start offering inference in the network to get some of the market share from the public clouds! The public clouds will continue to invest in custom hardware solutions to reduce their dependencies on GPUs and to scale inference workloads cost-effectively. With this exploding demand, there will be more innovations on all fronts, including hardware accelerators and software optimization, to make inference sustainable and economical. Exciting times ahead!