Maximising GPU Utilisation for LLM Inference: A Comprehensive Guide
Introduction
Maximising the utilisation of GPUs during LLM inference phases is pivotal for enhancing both performance and cost efficiency. This guide offers a deep dive into the essential strategies and calculations necessary to determine whether your inference workload is compute-bound or memory-bound, thus allowing for more targeted optimisations.
Unpacking GPU Specs with the A10 Example
The A10 GPU sits between the T4 and A100 in both performance and cost, making it an ideal case study for this exploration.
The key specifications are:
- FP16 Tensor Core performance (compute bandwidth): 125 TFLOPS (teraflops, or a trillion floating point operations per second).
- GPU memory: 24 GB. We can estimate the size of a model in GB by multiplying its parameter count by 2 (bytes per FP16 parameter), so a 7B-parameter model will take up approximately 14 GB of memory.
- GPU memory bandwidth: 600 GB/s can be moved from GPU memory to the on-chip processing units.
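As a quick sanity check, the 2-bytes-per-parameter rule of thumb can be expressed in a couple of lines of Python (a minimal sketch; the function name is ours):

```python
# Rough memory footprint of model weights stored at FP16 (2 bytes per parameter).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))  # ~14.0 GB for a 7B-parameter model
```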
Calculating Operations per Byte and Arithmetic Intensity
Using the specifications above, we can calculate the ops:byte ratio for our hardware. This ratio reveals whether a system's performance is limited by its compute capability or by its memory throughput, guiding both your choice of GPU and your optimisation techniques.
ops:byte ratio = compute bandwidth / memory bandwidth
= 125 TFLOPS / 600 GB/s
= 208.3 ops/byte
This means that to take full advantage of our compute resources, we have to complete 208.3 floating point operations for every byte of memory we access.
If our system completes fewer than 208.3 operations per byte, it is memory bound, indicating that our system's speed and efficiency are limited by the rate of data transfer or the input-output operations it can manage.
Conversely, if we exceed 208.3 floating point operations per byte, our system becomes compute bound. Here, performance is limited not by memory bandwidth, but by the number of compute units on the chip.
To determine whether our system is memory bound or compute bound, we must evaluate the arithmetic intensity of Mistral 7B and compare it to the ops:byte ratio we calculated above for the A10 GPU.
Arithmetic intensity measures the ratio of compute operations required by an algorithm to the bytes it accesses, providing a measurement that is independent of hardware.
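As a minimal Python sketch of these two quantities (the constant and function names are ours; the figures come from the A10 specs quoted above):

```python
# A10 figures from the specifications above.
COMPUTE_BW_FLOPS = 125e12   # 125 TFLOPS of FP16 Tensor Core compute
MEMORY_BW_BYTES = 600e9     # 600 GB/s of GPU memory bandwidth

OPS_PER_BYTE = COMPUTE_BW_FLOPS / MEMORY_BW_BYTES
print(f"ops:byte ratio = {OPS_PER_BYTE:.1f}")   # ~208.3

def bound_type(arithmetic_intensity: float) -> str:
    # Below the hardware's ops:byte ratio the workload is memory bound,
    # above it the workload is compute bound.
    return "memory bound" if arithmetic_intensity < OPS_PER_BYTE else "compute bound"
```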
Case Study: Optimising Mistral 7B Model
The attention layers, responsible for weighting next-token predictions according to the relevance of previous tokens, are the most computationally intensive parts of the Mistral 7B LLM. Given that these layers demand the most computational resources during inference, we will focus on calculating the arithmetic intensity of this part of the model.
In the standard attention algorithm, where S = QK^T, P = softmax(S), and O = PV:
- N is the sequence length of the LLM, which sets the context window. For Mistral 7B, N = 4096.
- d is the dimension of a single attention head. For Mistral 7B, d = 128.
- Q, K, and V are all matrices used to compute attention. Their dimensions are N by d, or in our case 4096x128.
- S and P are both matrices calculated during the equation. Their dimensions are N by N, or in our case 4096x4096.
- O is the output matrix with the results of the attention calculation. O is an N by d matrix, or in our case 4096x128.
- HBM is high bandwidth memory. From the data sheet, we know that we have 24 GB of HBM on the A10 operating at 600 GB/s.
Based on the specifications above, the arithmetic intensity of Mistral 7B's attention is roughly 62 ops/byte, far below our A10's ops:byte ratio of 208.3.
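This 62 ops/byte figure can be reproduced with a back-of-the-envelope count. The FLOP and byte counts below are our own assumption, following the standard (unfused) attention computation at FP16; real kernels will differ somewhat, but the order of magnitude holds:

```python
# Approximate arithmetic intensity of standard (unfused) attention for one head,
# assuming FP16 (2-byte) values in HBM. S = Q @ K^T, P = softmax(S), O = P @ V.
N, d = 4096, 128            # Mistral 7B sequence length and head dimension

flops = 2 * N * N * d       # S = Q @ K^T (N x d times d x N matmul)
flops += 3 * N * N          # P = softmax(S), roughly a few ops per element
flops += 2 * N * N * d      # O = P @ V (N x N times N x d matmul)

bytes_moved = 2 * (2 * N * d)          # read Q and K
bytes_moved += 2 * N * N               # write S
bytes_moved += 2 * (2 * N * N)         # read S, write P
bytes_moved += 2 * N * N + 2 * N * d   # read P and V
bytes_moved += 2 * N * d               # write O

print(f"arithmetic intensity ~ {flops / bytes_moved:.0f} ops/byte")  # ~62
```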
The model is therefore memory bound. Essentially, in the time it takes to transfer a single byte from memory to the processor, we could have performed many more calculations on that byte.
This poses a significant issue. We're investing substantial resources to maintain our GPUs, yet we're not fully utilising the computational power available to us.
Optimisation Strategy
One approach is to delay processing for a few hundred milliseconds to accumulate multiple requests and handle them simultaneously, rather than processing each one as it comes. By batching, we enhance the model’s arithmetic intensity since more computations are done with the same number of memory loads and stores, effectively reducing the model’s dependency on memory.
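A minimal sketch of this idea is shown below; the request queue and the run_inference call are hypothetical placeholders for whatever serving stack is in use:

```python
import queue
import time

def collect_batch(requests: queue.Queue, max_batch_size: int = 4,
                  max_wait_s: float = 0.2) -> list:
    """Wait briefly to accumulate up to max_batch_size requests before running them."""
    batch = [requests.get()]                    # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Serving loop (run_inference stands in for a single batched forward pass):
# while True:
#     batch = collect_batch(request_queue)
#     results = run_inference(batch)   # one pass amortises the memory loads across the batch
```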
Now, the question is: how many sequences can we fit into the spare GPU memory at once? The A10 has 24 GB of memory, and the model weights occupy roughly 14 GB, leaving about 10 GB spare for the KV cache. A few more Mistral 7B parameters come into play:
- d, which can be notated as d_head, is the dimension of a single attention head. For Mistral 7B, d = 128.
- n_heads is the number of attention heads. For Mistral 7B, n_heads = 32.
- n_layers is the number of times the attention block appears. For Mistral 7B, n_layers = 32.
- d_model is the dimension of the model. d_model = d_head * n_heads. For Mistral 7B, d_model = 4096.
At half precision (FP16), each floating point number takes 2 bytes to store. There are two matrices to cache per token, K and V, and to calculate the KV cache size we multiply both by n_layers and d_model, yielding the following equation:
kv_cache_size = (2 * 2 * n_layers * d_model) bytes/token
= (4 * 32 * 4096) bytes/token
= 524288 bytes/token ~ 0.00052 GB/token
Given that the KV cache requires 524288 bytes per token, how large can the KV cache be in terms of tokens?
kv_cache_tokens = 10 GB / 0.00052 GB/token = 19,230 tokens
The KV cache can comfortably hold 19,230 tokens. Therefore, given Mistral 7B's sequence length of 4096 tokens, our system can concurrently process a batch of 4 sequences.
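Putting the KV cache arithmetic together in one place (a sketch using the numbers above; the ~10 GB of spare memory is the 24 GB card minus roughly 14 GB of weights):

```python
# KV cache sizing for Mistral 7B on an A10, using the figures above.
n_layers, d_model = 32, 4096
fp16_bytes = 2

kv_cache_bytes_per_token = 2 * fp16_bytes * n_layers * d_model   # K and V per layer
spare_memory_bytes = 10e9                      # 24 GB of GPU memory minus ~14 GB of weights
kv_cache_tokens = spare_memory_bytes / kv_cache_bytes_per_token
batch_size = int(kv_cache_tokens // 4096)      # 4096-token sequences

print(kv_cache_bytes_per_token)   # 524288 bytes/token
print(int(kv_cache_tokens))       # ~19,073 (the rounded 0.00052 GB/token figure gives ~19,230)
print(batch_size)                 # 4 concurrent sequences
```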
Conclusion
In conclusion, to fully utilise the compute capacity we're investing in, we aim to batch 4 requests at a time during inference to maximise KV cache utilisation, thereby enhancing throughput. If we're using large language models (LLMs) to asynchronously process a large queue of documents, batching is highly beneficial. By processing multiple items together, we can clear the queue much faster than handling each item individually, and we can time our inference calls to rapidly fill these batches, thus minimising latency effects.
However, this approach isn’t feasible for latency-sensitive applications. For projects with strict latency demands, we can use similar calculations to determine which GPUs will meet our requirements.
While these theoretical calculations are useful, it’s always critical to corroborate them with real-world benchmarks, which take into account additional factors such as communication costs and network delays.