Serverless GPU Computing: A Technical Deep Dive into Cloud Run
At DevFest Montreal 2024, I presented a talk on scaling GPU workloads using Google Kubernetes Engine (GKE), focusing on the complexities of load-based scaling. While GKE provided robust solutions for managing GPU workloads, we still faced the challenge of ongoing infrastructure costs, especially during periods of low utilization. Google's recent launch of GPU support in Cloud Run marks an exciting development in serverless computing, potentially addressing these scaling and cost challenges by offering GPU capabilities in a true serverless environment.
Cloud Run GPU: The Offering
Cloud Run is Google Cloud's serverless compute platform that lets developers run containerized applications without managing the underlying infrastructure. The serverless model offers significant advantages:

- Automatic scaling, including scale to zero when there is no traffic
- Pay-per-use pricing, so you are billed only while serving requests
- No infrastructure to provision, patch, or maintain
However, it also comes with trade-offs, such as cold starts when scaling up from zero and maximum execution time limits.
The recent addition of GPU support to Cloud Run opens new possibilities for compute-intensive workloads in a serverless environment. This feature provides access to NVIDIA L4 GPUs, which are particularly well-suited for:

- ML model inference, including generative models that fit in 24GB of VRAM
- Graphics workloads such as 3D rendering and image processing
- Video transcoding and streaming
The L4 GPU, built on NVIDIA's Ada Lovelace architecture, offers 24GB of GPU memory (VRAM) and supports key AI frameworks and CUDA applications. These GPUs provide a sweet spot between performance and cost, especially for inference workloads and graphics processing.
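To make this concrete, here is a minimal sketch of what a GPU-probing endpoint on Cloud Run might look like, assuming a container image with Flask and CUDA-enabled PyTorch installed. The route name and response shape are my own illustration, not part of any Google sample:

```python
# probe_gpu.py - a minimal sketch of a Cloud Run service that reports GPU info.
# Assumes Flask and CUDA-enabled PyTorch are baked into the container image.
import os

import torch
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/gpu")
def gpu_info():
    # If the service was deployed without a GPU attached, fail loudly.
    if not torch.cuda.is_available():
        return jsonify({"gpu": None}), 503
    props = torch.cuda.get_device_properties(0)
    return jsonify({
        "name": props.name,  # e.g. "NVIDIA L4"
        "vram_gb": round(props.total_memory / 1024**3, 1),  # ~24 GB on an L4
    })

if __name__ == "__main__":
    # Cloud Run injects the port to listen on via the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```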
Understanding Cold Starts and Test Results
Having worked with serverless infrastructure for nearly a decade, I've encountered numerous challenges with cold starts across different platforms. With Cloud Run's new GPU feature, I was particularly interested in understanding the cold start behavior and its implications for real-world applications.
To investigate this, I designed an experiment to measure response times under different idle periods. The experiment consisted of running burst tests of 5 consecutive API calls to a GPU-enabled Cloud Run service at different intervals (5, 10, and 20 minutes). Each test was repeated multiple times to ensure consistency. The service performed a standardized 3D rendering workload, making it an ideal candidate for GPU acceleration.
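The harness itself was simple. Here is a simplified sketch of the approach, where the service URL is a placeholder and the timing logic is reduced to its essentials:

```python
# cold_start_test.py - a simplified sketch of the burst-test methodology.
# SERVICE_URL is a placeholder; the real endpoint and workload are assumptions.
import time

import requests

SERVICE_URL = "https://render-service-xxxx.a.run.app/render"  # hypothetical
IDLE_MINUTES = [5, 10, 20]  # idle periods between bursts
BURST_SIZE = 5              # consecutive API calls per burst

def run_burst(n: int) -> list[float]:
    """Fire n back-to-back requests and return each latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.monotonic()
        requests.get(SERVICE_URL, timeout=300)  # generous timeout for cold starts
        latencies.append((time.monotonic() - start) * 1000)
    return latencies

for idle in IDLE_MINUTES:
    time.sleep(idle * 60)  # let the instance idle, possibly scaling to zero
    latencies = run_burst(BURST_SIZE)
    # The first call reveals warm vs cold state; the rest should stay ~1.5 s.
    print(f"{idle} min idle -> first: {latencies[0]:.0f} ms, "
          f"rest: {[f'{l:.0f}' for l in latencies[1:]]}")
```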
Our testing revealed three distinct patterns:

- At 5-minute intervals, the instance was always retained: the first request took roughly 7 seconds, a warm start.
- At 10-minute intervals, behavior was mixed: sometimes a ~7-second warm start, sometimes a full ~105-second cold start.
- At 20-minute intervals, the instance had always scaled to zero, and every first request paid the full cold start of 105-120 seconds.
Here's a summary of our findings:
| Interval   | First Request (ms) | Subsequent Requests (ms) | Instance State  |
|------------|--------------------|--------------------------|-----------------|
| 5 minutes  | 6,800-7,000        | 1,400-1,800              | Warm start      |
| 10 minutes | 105,000-107,000    | 1,400-1,700              | Full cold start |
| 10 minutes | 6,800-7,200        | 1,400-1,700              | Warm start      |
| 20 minutes | 105,000-120,000    | 1,400-1,800              | Full cold start |
Cloud Run's GPU support introduces an exciting option for organizations looking to optimize their GPU workloads without maintaining constant infrastructure. Our testing revealed interesting behavior at the 10-minute interval mark, where the instance sometimes remained warm (~7 seconds startup) and sometimes required a full cold start (~105-107 seconds). This variability suggests that Cloud Run's instance retention behavior isn't strictly time-based and might depend on other factors such as system load and resource availability.
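In practice, the latency bands are distinct enough that a monitoring hook can classify each burst's first request. Here is a rough heuristic with thresholds taken from our measurements; they describe what we observed, not any documented platform guarantee:

```python
# classify_start.py - thresholds derived from our measured latency bands;
# they reflect observed behavior, not a documented platform guarantee.
def classify_first_request(latency_ms: float) -> str:
    """Bucket a burst's first-request latency into the patterns we observed."""
    if latency_ms < 2_000:
        return "hot"              # ~1,400-1,800 ms: instance already serving
    if latency_ms < 10_000:
        return "warm start"       # ~6,800-7,200 ms: instance retained and reused
    return "full cold start"      # ~105,000-120,000 ms: new instance, image + GPU init

# Examples taken from the table above:
assert classify_first_request(6_900) == "warm start"
assert classify_first_request(106_000) == "full cold start"
```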
While these cold start characteristics make it unsuitable for real-time applications requiring consistent sub-second response times, Cloud Run GPU excels in several scenarios:
Best suited for:

- Scheduled and batch jobs, such as periodic rendering pipelines
- ML model inference with bursty or unpredictable traffic, where always-on GPUs would sit idle
- Development and testing environments that need occasional GPU access
Not recommended for:

- Real-time applications requiring consistent sub-second response times
- Latency-sensitive user-facing services that cannot absorb a 105+ second cold start
- Long-running jobs that exceed Cloud Run's maximum execution time limits
For teams working with periodic GPU workloads - whether it's scheduled rendering jobs, ML model inference, or development testing - Cloud Run GPU offers a compelling balance of performance and cost-effectiveness, especially when compared to maintaining always-on GPU infrastructure. Understanding these warm/cold start patterns is crucial for architecting solutions that can effectively leverage this serverless GPU capability.
The key to success with Cloud Run GPU is matching your workload patterns to the platform's characteristics. For workloads that can tolerate occasional cold starts, the cost savings and zero-maintenance benefits make it an attractive option in the GPU computing landscape.
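For workloads near the edge of tolerability, one common mitigation is keeping an instance warm: either Cloud Run's minimum-instances setting, which trades idle GPU billing for zero cold starts, or a lightweight scheduled ping tuned to the retention window observed above. Here is a sketch of the latter, using a hypothetical health endpoint:

```python
# keep_warm.py - a lightweight pinger sketch. In a managed setup, Cloud Scheduler
# or a minimum-instances setting would replace this loop (at the cost of idle
# GPU billing in the latter case). The endpoint URL is hypothetical.
import time

import requests

SERVICE_URL = "https://render-service-xxxx.a.run.app/healthz"  # hypothetical
PING_EVERY_S = 4 * 60  # under the ~5-minute window where we always saw warm starts

while True:
    try:
        requests.get(SERVICE_URL, timeout=30)
    except requests.RequestException as exc:
        print(f"ping failed: {exc}")  # the next real request may hit a cold instance
    time.sleep(PING_EVERY_S)
```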