Maximising GPU Utilisation for LLM Inference: A Comprehensive Guide
Introduction
Maximising the utilisation of GPUs during LLM inference phases is pivotal for enhancing both performance and cost efficiency. This guide offers a deep dive into the essential strategies and calculations necessary to determine whether your inference workload is compute-bound or memory-bound, thus allowing for more targeted optimisations.
Unpacking GPU Specs with the A10 Example
The A10 GPU sits between the T4 and A100 in both performance and cost, making it an ideal case study for this exploration.
The key specifications are:
- FP16 Tensor Core performance (compute bandwidth): 125 TFLOPS (teraflops, or a trillion floating point operations per second).
- GPU memory: 24 GB. We can estimate the size of a model in GB by multiplying its parameter count by 2 (bytes per FP16 parameter), so a 7B-parameter model will take up approximately 14 GB of memory.
- GPU memory bandwidth: 600 GB/s can be moved from GPU memory to the on-chip processing units.
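As a quick sanity check, the 2-bytes-per-parameter rule of thumb can be expressed in a couple of lines of Python (a minimal sketch; the function name is ours):

```python
# Rough memory footprint of model weights stored at FP16 (2 bytes per parameter).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))  # ~14.0 GB for a 7B-parameter model
```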
Calculating Operations per Byte and Arithmetic Intensity
Using the specifications above, we can calculate the ops:byte ratio for our hardware. This ratio reveals whether a system's performance is limited by its compute capability or by its memory throughput, guiding both your choice of GPU and your optimisation techniques.
ops:byte ratio = compute bandwidth / memory bandwidth
= 125 TFLOPS / 600 GB/s
= 208.3 ops/byte
This means that to take full advantage of our compute resources, we have to complete 208.3 floating point operations for every byte of memory we access.
If our system completes fewer than 208.3 operations per byte, it is memory bound, indicating that our system's speed and efficiency are limited by the rate of data transfer or the input-output operations it can manage.
Conversely, if we exceed 208.3 floating point operations per byte, our system becomes compute bound. Here, performance is limited not by memory bandwidth, but by the number of compute units on the chip.
To determine whether our system is memory bound or compute bound, we must evaluate the arithmetic intensity of Mistral 7B and compare it to the ops:byte ratio we calculated above for the A10 GPU.
Arithmetic intensity measures the ratio of compute operations required by an algorithm to the bytes it accesses, providing a measurement that is independent of hardware.
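As a minimal Python sketch of these two quantities (the constant and function names are ours; the figures come from the A10 specs quoted above):

```python
# A10 figures from the specifications above.
COMPUTE_BW_FLOPS = 125e12   # 125 TFLOPS of FP16 Tensor Core compute
MEMORY_BW_BYTES = 600e9     # 600 GB/s of GPU memory bandwidth

OPS_PER_BYTE = COMPUTE_BW_FLOPS / MEMORY_BW_BYTES
print(f"ops:byte ratio = {OPS_PER_BYTE:.1f}")   # ~208.3

def bound_type(arithmetic_intensity: float) -> str:
    # Below the hardware's ops:byte ratio the workload is memory bound,
    # above it the workload is compute bound.
    return "memory bound" if arithmetic_intensity < OPS_PER_BYTE else "compute bound"
```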
Case Study: Optimising Mistral 7B Model
The attention layers, responsible for weighting next-token predictions according to the relevance of previous tokens, are the most computationally intensive parts of the Mistral 7B LLM. Given that these layers demand the most computational resources during inference, we will focus on calculating the arithmetic intensity of this part of the model.
In the standard attention algorithm, where S = QK^T, P = softmax(S), and O = PV:
- N is the sequence length of the LLM, which sets the context window. For Mistral 7B, N = 4096.
- d is the dimension of a single attention head. For Mistral 7B, d = 128.
- Q, K, and V are all matrices used to compute attention. Their dimensions are N by d, or in our case 4096x128.
- S and P are both matrices calculated during the equation. Their dimensions are N by N, or in our case 4096x4096.
- O is the output matrix with the results of the attention calculation. O is an N by d matrix, or in our case 4096x128.
- HBM is high bandwidth memory. From the data sheet, we know that we have 24 GB of HBM on the A10 operating at 600 GB/s.
Based on the specifications above, the arithmetic intensity of Mistral 7B's attention is roughly 62 ops/byte, far below our A10's ops:byte ratio of 208.3.
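This 62 ops/byte figure can be reproduced with a back-of-the-envelope count. The FLOP and byte counts below are our own assumption, following the standard (unfused) attention computation at FP16; real kernels will differ somewhat, but the order of magnitude holds:

```python
# Approximate arithmetic intensity of standard (unfused) attention for one head,
# assuming FP16 (2-byte) values in HBM. S = Q @ K^T, P = softmax(S), O = P @ V.
N, d = 4096, 128            # Mistral 7B sequence length and head dimension

flops = 2 * N * N * d       # S = Q @ K^T (N x d times d x N matmul)
flops += 3 * N * N          # P = softmax(S), roughly a few ops per element
flops += 2 * N * N * d      # O = P @ V (N x N times N x d matmul)

bytes_moved = 2 * (2 * N * d)          # read Q and K
bytes_moved += 2 * N * N               # write S
bytes_moved += 2 * (2 * N * N)         # read S, write P
bytes_moved += 2 * N * N + 2 * N * d   # read P and V
bytes_moved += 2 * N * d               # write O

print(f"arithmetic intensity ~ {flops / bytes_moved:.0f} ops/byte")  # ~62
```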
The model is therefore memory bound. Essentially, in the time it takes to transfer a single byte from memory to the processor, we could have performed many more calculations on that byte.
This poses a significant issue. We're investing substantial resources to maintain our GPUs, yet we're not fully utilising the computational power available to us.
Optimisation Strategy
One approach is to delay processing for a few hundred milliseconds to accumulate multiple requests and handle them simultaneously, rather than processing each one as it comes. By batching, we enhance the model’s arithmetic intensity since more computations are done with the same number of memory loads and stores, effectively reducing the model’s dependency on memory.
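A minimal sketch of this idea is shown below; the request queue and the run_inference call are hypothetical placeholders for whatever serving stack is in use:

```python
import queue
import time

def collect_batch(requests: queue.Queue, max_batch_size: int = 4,
                  max_wait_s: float = 0.2) -> list:
    """Wait briefly to accumulate up to max_batch_size requests before running them."""
    batch = [requests.get()]                    # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Serving loop (run_inference stands in for a single batched forward pass):
# while True:
#     batch = collect_batch(request_queue)
#     results = run_inference(batch)   # one pass amortises the memory loads across the batch
```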
Now, the question is: how many sequences can we fit into the spare GPU memory at once? The A10 has 24 GB of memory, and the model weights occupy roughly 14 GB, leaving about 10 GB spare for the KV cache. A few more Mistral 7B parameters come into play:
- d, which can be notated as d_head, is the dimension of a single attention head. For Mistral 7B, d = 128.
- n_heads is the number of attention heads. For Mistral 7B, n_heads = 32.
- n_layers is the number of times the attention block appears. For Mistral 7B, n_layers = 32.
- d_model is the dimension of the model. d_model = d_head * n_heads. For Mistral 7B, d_model = 4096.
At half precision (FP16), each floating point number takes 2 bytes to store. There are two matrices to cache per token, K and V, and to calculate the KV cache size we multiply both by n_layers and d_model, yielding the following equation:
kv_cache_size = (2 * 2 * n_layers * d_model) bytes/token
= (4 * 32 * 4096) bytes/token
= 524288 bytes/token ~ 0.00052 GB/token
Given that the KV cache requires 524288 bytes per token, how large can the KV cache be in terms of tokens?
kv_cache_tokens = 10 GB / 0.00052 GB/token = 19,230 tokens
The KV cache can comfortably hold 19,230 tokens. Therefore, given Mistral 7B's sequence length of 4096 tokens, our system can concurrently process a batch of 4 sequences.
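Putting the KV cache arithmetic together in one place (a sketch using the numbers above; the ~10 GB of spare memory is the 24 GB card minus roughly 14 GB of weights):

```python
# KV cache sizing for Mistral 7B on an A10, using the figures above.
n_layers, d_model = 32, 4096
fp16_bytes = 2

kv_cache_bytes_per_token = 2 * fp16_bytes * n_layers * d_model   # K and V per layer
spare_memory_bytes = 10e9                      # 24 GB of GPU memory minus ~14 GB of weights
kv_cache_tokens = spare_memory_bytes / kv_cache_bytes_per_token
batch_size = int(kv_cache_tokens // 4096)      # 4096-token sequences

print(kv_cache_bytes_per_token)   # 524288 bytes/token
print(int(kv_cache_tokens))       # ~19,073 (the rounded 0.00052 GB/token figure gives ~19,230)
print(batch_size)                 # 4 concurrent sequences
```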
Conclusion
In conclusion, to fully utilise the compute capacity we're investing in, we aim to batch 4 requests at a time during inference to maximise KV cache utilisation, thereby enhancing throughput. If we're using large language models (LLMs) to asynchronously process a large queue of documents, batching is highly beneficial. By processing multiple items together, we can clear the queue much faster than handling each item individually, and we can time our inference calls to rapidly fill these batches, thus minimising latency effects.
However, this approach isn’t feasible for latency-sensitive applications. For projects with strict latency demands, we can use similar calculations to determine which GPUs will meet our requirements.
While these theoretical calculations are useful, it’s always critical to corroborate them with real-world benchmarks, which take into account additional factors such as communication costs and network delays.