Free GPUs Are Hiding In Your Data Center: Unlocking the GPU Resources You Don’t See
You've invested heavily in GPUs, the engines of modern AI. You diligently monitor NVIDIA SMI, that ubiquitous command-line tool, feeling like you have a handle on utilization. But what does that output really tell you? The uncomfortable truth is: likely far less than you think. While NVIDIA SMI offers a glimpse, it's akin to glancing at a car's dashboard fuel gauge and declaring you understand engine performance. It's a start, but it barely scratches the surface of true GPU utilization. Understanding and maximizing your GPU investment requires going far beyond basic readouts and diving deep into the complexities of resource allocation and dynamic workloads.
The NVIDIA SMI Illusion: A Time Slice in the Dark
NVIDIA SMI's “GPU util” is the go-to metric, and at first glance it seems to provide insight. But consider this: SMI primarily shows you a time slice – a snapshot of whether your GPU is performing any calculations at that precise moment. As Arthur Chiao noted in his insightful report, "tools such as nvidia-smi or other nvml-based tools may indicate that the device is fully occupied, which is rather confusing for users.” Imagine a city planner deciding whether to buy more buses by measuring how often there is at least one passenger on a bus. They would buy far more buses than needed! Seeing activity in SMI's output may be reassuring, but it's a far cry from understanding the percentage of your GPU's actual resources being used during that time slice.
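For the curious, here is a minimal sketch (using the open-source NVML Python bindings, not anything Rapt-specific) that prints exactly the number SMI reports. NVML itself documents this value as the percent of time during which one or more kernels was executing – the time slice described above.

```python
# Minimal sketch: what "GPU util" actually reports.
# Assumes the NVML Python bindings are installed (pip install nvidia-ml-py).
from pynvml import (
    nvmlInit, nvmlShutdown,
    nvmlDeviceGetHandleByIndex, nvmlDeviceGetUtilizationRates,
)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)  # first GPU
    util = nvmlDeviceGetUtilizationRates(handle)
    # NVML defines util.gpu as the percent of time over the last sample
    # period during which *one or more* kernels was executing -- a time
    # slice, not the fraction of SMs or CUDA cores actually doing work.
    print(f"GPU util (time-slice): {util.gpu}%")
    print(f"Memory util:           {util.memory}%")
finally:
    nvmlShutdown()
```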
Diving into NVIDIA SMI Limitations
Here is an example of SM utilization in a GPU for a typical LLaMA-3 inference run. It shows that the SMs are not fully utilized, yet GPU utilization appears to have peaked – at least according to the “GPU util” metric.
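One way to see this gap for yourself is to sample NVIDIA DCGM's profiling fields – SM_ACTIVE (field 1002) and SM_OCCUPANCY (field 1003) – alongside the coarse utilization number. The sketch below simply shells out to the dcgmi CLI and assumes DCGM is installed with its host engine running; it is an illustration, not part of Rapt's tooling.

```python
# Sketch: sample fine-grained SM metrics next to the coarse "GPU util".
# Assumes NVIDIA DCGM is installed and the dcgmi CLI is on PATH.
# Field 1002 = DCGM_FI_PROF_SM_ACTIVE, 1003 = DCGM_FI_PROF_SM_OCCUPANCY.
import subprocess

result = subprocess.run(
    ["dcgmi", "dmon", "-e", "1002,1003", "-c", "10"],  # collect 10 samples
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```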
Rapt AI’s unique core/SM-level allocations maximize usage of the SMs inside the GPU and, in turn, maximize true GPU utilization. With SM- and core-level allocations, Rapt can pack more AI jobs into a single GPU, enabling automatic sharing of GPU memory, CUDA cores, and SMs across jobs – whether for training, fine-tuning, or inference.
Rapt analyzes AI model workloads, predicts the precise GPU resources required, and automatically allocates the necessary SMs and GPU memory without user intervention. This lets more AI models share a GPU, maximizing true GPU utilization and ROI for your AI models.
Below is a snapshot of Rapt’s SM allocations per user (or per AI model) for two LLaMA-3 8B jobs, showing how Rapt packs both jobs onto a single GPU.
An A100 (40 GB) GPU has 108 SMs in total, of which only ~40 are currently in use. That leaves roughly 63% of the SMs free to accommodate additional AI models on the GPU – even though NVIDIA SMI reports 95% GPU utilization.
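The arithmetic is straightforward:

```python
# Rough arithmetic for the A100 example above.
total_sms = 108      # SMs on an A100
used_sms = 40        # approximate SMs busy during the run
free_fraction = (total_sms - used_sms) / total_sms
print(f"Free SMs: {total_sms - used_sms} (~{free_fraction:.0%})")
# ~63% of SMs idle, even while nvidia-smi reports ~95% "GPU utilization".
```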
As the nvitop output below shows, two AI model runs are sharing the GPU. The output reports GPU utilization at its maximum, but we know that in reality SM utilization is low – which means there is room to share the GPU across more AI models.
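The same check can be scripted directly against NVML instead of eyeballing nvitop. This sketch just lists the compute processes currently sharing GPU 0 and their memory footprints (again using the standard NVML Python bindings, not Rapt's own telemetry).

```python
# Sketch: list compute processes currently sharing GPU 0 via NVML.
# Assumes the NVML Python bindings are installed (pip install nvidia-ml-py).
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetComputeRunningProcesses,
)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)
    for proc in nvmlDeviceGetComputeRunningProcesses(handle):
        # usedGpuMemory can be None when the driver cannot attribute memory.
        mem_mib = (proc.usedGpuMemory or 0) / (1024 ** 2)
        print(f"pid={proc.pid}  gpu_mem={mem_mib:.0f} MiB")
finally:
    nvmlShutdown()
```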
Beyond SMI: Too Much Data, Too Few Actionable Insights?
More advanced techniques and management utilities offer a deluge of data – details on Streaming Multiprocessors (SMs), CUDA cores, memory usage, and even real-time power consumption. You can be bombarded with charts, graphs, and constantly updating metrics, painting a seemingly detailed picture of your GPU activity. But even with this wealth of information, the crucial challenge remains: what actionable insights can you truly derive and, critically, implement to improve your GPU efficiency?
The sheer volume and complexity of this data become overwhelming. And these metrics are not static; they are in constant flux, changing continuously and dynamically as AI workloads execute. Trying to manually analyze this torrent of real-time data and translate it into optimized GPU resource allocation is practically impossible for a human operator.
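As a small taste of that torrent, even a sketch that polls just three NVML counters once per second produces values that shift from sample to sample while a job runs. The loop below is illustrative only; production monitoring tracks far more fields.

```python
# Sketch: poll a few of the many NVML counters once per second.
# Even this small subset changes continuously while a workload runs.
# Assumes the NVML Python bindings are installed (pip install nvidia-ml-py).
import time
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetPowerUsage, nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetUtilizationRates,
)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)
    for _ in range(5):
        power_w = nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts
        mem = nvmlDeviceGetMemoryInfo(handle)
        util = nvmlDeviceGetUtilizationRates(handle)
        print(f"power={power_w:.0f} W  "
              f"mem_used={mem.used / 2**30:.1f} GiB  util={util.gpu}%")
        time.sleep(1)
finally:
    nvmlShutdown()
```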
This contrasts sharply with traditional CPU containerization. In CPU environments, workloads and resource allocation are often more static and predictable. Traditional CPU container orchestration can be effective because it often deals with relatively static resource envelopes. However, ML fine-tuning and inference are fundamentally different. They are characterized by highly dynamic and fine-grained resource requirements that change continuously throughout the workload lifecycle. Even with a deep understanding of the complex GPU metrics provided by advanced tools, the crucial missing piece remains: the automated, real-time mechanism to translate that data into granular and dynamically adjusted resource allocation that truly optimizes GPU utilization for these constantly evolving AI workloads.
The Granularity Gap: Virtualization's Limited View
Accessing and managing GPU resources at a granular level is inherently challenging. Virtualization techniques offer virtual GPUs (vGPUs), promising finer control. However, vGPU profiles primarily influence resource assignment on a time-slice basis. Imagine roommates trying to share an apartment by dividing access by time: one gets it during odd hours, the other during even hours. That wouldn’t work well for either! Far better would be for each to have a room and share central resources like the kitchen. This time-slicing, while providing some level of isolation, is inherently suboptimal for demanding AI workloads, often leading to increased context switching overhead, which ironically reduces efficiency.
The Prediction Puzzle: Sizing for the Unknown
Another critical challenge is predicting the precise resources needed for each AI workload. Traditional sizing often relies on rules of thumb, primarily focused on memory requirements. But memory is only one piece of the puzzle. AI workloads, especially fine-tuning and inference, are also heavily compute-bound. Sizing based solely on memory often results in significant misallocation – reserving too many resources in some areas while starving others, leading to overall underutilization. Furthermore, the dynamic nature of fine-tuning and inference runs means resource demands fluctuate wildly throughout a job's lifecycle, and modern LLM architectures like Mixture of Experts (MoE) amplify this variability, making resource prediction an even more complex puzzle.

Think of a hotel conference center. A traditional hotel with fixed-size conference rooms cannot accommodate as many meetings as one with adjustable walls that can be moved to create fewer, larger rooms or more, smaller rooms based on demand. Just as the flexible hotel maximizes revenue by dynamically resizing its rooms to pack in more non-overlapping groups, true GPU efficiency requires dynamically resizing GPU resource allocations to match the fluctuating needs of AI workloads – packing more of them onto your GPUs and maximizing your infrastructure ROI.
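To make the memory-only trap concrete, here is a back-of-the-envelope sketch for a hypothetical 8B-parameter decoder model. The formulas are the usual approximations – weights ≈ parameters × bytes per parameter, KV cache grows with batch size and sequence length, and compute ≈ 2 FLOPs per parameter per generated token – and every number is illustrative, not a measurement of any particular system.

```python
# Back-of-the-envelope sizing sketch for a hypothetical 8B decoder model.
# All figures are illustrative approximations, not measurements.

def estimate(params=8e9, layers=32, kv_heads=8, head_dim=128,
             batch=8, seq_len=4096, bytes_per_elem=2):
    # Memory side: weights plus KV cache (factor 2 = key + value tensors).
    weights_gib = params * bytes_per_elem / 2**30
    kv_cache_gib = (2 * layers * kv_heads * head_dim
                    * batch * seq_len * bytes_per_elem) / 2**30
    # Compute side: ~2 FLOPs per parameter per generated token.
    tflops_per_token = 2 * params / 1e12
    return weights_gib, kv_cache_gib, tflops_per_token

w, kv, tf = estimate()
print(f"weights ~{w:.0f} GiB, KV cache ~{kv:.0f} GiB, ~{tf:.3f} TFLOPs/token")
# Memory alone says "this fits"; it says nothing about how many SMs the
# job actually keeps busy, or how that changes as batch size fluctuates.
```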
Rapt: Precision Orchestration for True GPU Utilization
Recognizing these fundamental challenges years ago fueled the creation of Rapt. We understood that true GPU efficiency demands a holistic solution – one that moves beyond basic monitoring and tackles the core issues of accurate resource prediction, granular resource allocation, and dynamic workload management. Rapt was built to address these challenges head-on, providing a platform for allocating GPU resources with surgical precision and dynamically adjusting allocations throughout the lifespan of an AI run.
AI-Powered Prediction: The Compute Recommendation Engine
Rapt goes beyond simple rules and heuristics. At its heart is the AI-powered Compute Recommendation Engine, trained using reinforcement learning on hundreds of thousands of real-world AI runs. This engine learns the intricate resource consumption patterns of diverse AI workloads, continuously updating its profiles based on the latest models and frameworks. This AI-powered prediction capability allows Rapt to accurately forecast the resources needed for new fine-tuning and inference jobs before they even begin.
Granular Precision: Core-Level Allocation and Packing
Rapt operates at a level of granularity unmatched by traditional methods. For example, on NVIDIA H100 GPUs, Rapt can allocate resources core by core – assigning individual FP32 CUDA cores, Tensor Cores, and precise amounts of HBM memory down to the byte, tailoring resource allocation precisely to each job's needs.

Think back to the hotel conference center analogy. Rapt is like a conference center that can allocate precisely the number of chairs, stage area, and catering each group needs, and adjust as people arrive and leave. Just as that ideal hotel could dynamically resize rooms to perfectly fit different-sized meeting groups, Rapt can dynamically carve out and allocate precisely the right amount of GPU resources – down to individual cores and memory segments – to perfectly fit the specific needs of each AI workload. This granular control enables optimal packing of multiple AI runs onto a single GPU, ensuring no resource contention and eliminating performance-degrading context switches. When workloads demand more than a single GPU, Rapt scales seamlessly to manage tens of thousands of GPUs, maintaining efficiency even at massive scale.
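Rapt's allocator itself is proprietary, but a toy sketch helps show why packing along two dimensions (SMs and memory) beats memory-only placement. The greedy first-fit loop and the per-job numbers below are assumptions for illustration only – they are not Rapt's algorithm.

```python
# Toy illustration only: two-dimensional (SM + memory) first-fit packing.
# This is NOT Rapt's algorithm; it just shows how placing jobs by both
# compute and memory lets several of them share one GPU.

GPU = {"sms": 108, "mem_gib": 40}          # e.g. an A100 40GB
jobs = [                                    # hypothetical per-job demand
    {"name": "llama3-8b-infer-A", "sms": 40, "mem_gib": 17},
    {"name": "llama3-8b-infer-B", "sms": 40, "mem_gib": 17},
    {"name": "small-finetune",    "sms": 20, "mem_gib": 5},
]

free = dict(GPU)
placed, spilled = [], []
for job in jobs:
    # Place the job only if BOTH its SM and memory demands fit.
    if job["sms"] <= free["sms"] and job["mem_gib"] <= free["mem_gib"]:
        free["sms"] -= job["sms"]
        free["mem_gib"] -= job["mem_gib"]
        placed.append(job["name"])
    else:
        spilled.append(job["name"])

print("placed on GPU 0:", placed)
print("needs another GPU:", spilled)
print("leftover capacity:", free)
```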
Automated Orchestration: Seamless Deployment, Any Environment
Rapt automates the entire orchestration process, deploying and managing AI workloads seamlessly across your infrastructure – whether on-premises, in the cloud, or a hybrid combination. All of this complexity is managed from a single instance, simplifying your AI operations and freeing your teams to focus on innovation, not infrastructure management.
Illuminating the Black Box: Clear Visibility, Skyrocketing Utilization
The net result is a dramatic transformation in GPU utilization and ROI. Rapt customers move from a "black box" environment, where utilization for fine-tuning struggles to reach 20% and inference often languishes below 10%, to a state of near-peak efficiency. With Rapt, GPU utilization skyrockets to the high 90% range. This translates directly into a 3-5x increase in parallel fine-tuning jobs and a 10x or greater boost in production inference throughput – all on the same GPU infrastructure.
Unlocking Millions in ROI: Your GPUs Working Harder, For You
Imagine you've invested $10 million in GPU resources. Rapt doesn't just optimize utilization; it amplifies your investment. By unlocking previously hidden GPU capacity, Rapt can make your $10 million GPU cluster perform like a $30 million or even a $100 million cluster, significantly accelerating your AI initiatives and delivering a tangible, massive return on your AI infrastructure investment.
Stop Guessing, Start Utilizing – Unlock Your GPUs with Rapt
Stop navigating in the dark with misleading readouts and guesswork. True GPU efficiency demands clarity, precision, dynamic management, and AI-powered intelligence. Rapt provides the solution to open the black box, unlock the hidden potential of your GPUs, and finally achieve the utilization and ROI your AI investments deserve. It's time to stop guessing and start truly utilizing your GPUs – with Rapt.
About the Author
An industry veteran with over 23 years of experience in building products from the ground up and architecting enterprise systems. Most recently, he was the Technical Director at Data Domain, which was acquired by EMC. Anil found his passion in solving challenges related to systems' adaptability to changing workloads, which is the genesis of rapt AI’s fungible infrastructure for AI workloads. He has authored over 15 patents in areas of system software, schedulers, storage, and virtualization.