This is why Kubernetes isn't enough for scaling LLM inference workloads
How to autoscale LLM inference workloads properly?

Autoscaling is critical for online LLM inference workloads, as it helps handle the traffic without over-provisioning GPUs that you don't need. But implementing autoscaling for LLM inference is not as straightforward as you may think. Traditional container orchestration platforms (like Kubernetes) that only have access to resource utilization and simple request metrics don't cut it:

❌ GPU Memory (DCGM_FI_DEV_FB_USED): Amount of GPU memory used. Doesn't apply to workloads that preallocate GPU memory (e.g. vLLM KV cache).

❌ GPU Utilization (DCGM_FI_DEV_GPU_UTIL): Amount of time the GPU is active. Does not measure how much effective compute is being done (e.g. batch size).

❌ QPS (Queries Per Second): A simple request-based scaling metric. Not applicable to LLM workloads, because processing time per request varies with input and output token length and cache hits.

🆗 Queue Size: Number of requests pending in an external queue before they're processed. Easy to implement for workloads without batching; for LLM workloads with continuous batching, additional guardrails on concurrency control are required.

What's proven to work effectively is ✅ concurrency-based autoscaling, which scales on the number of active requests being processed (see image below, and the replica-count sketch at the end of this post). It accurately reflects system load, scales precisely, and is easy to configure based on batch size. The only downside: it requires a specialized infrastructure and serving stack, which can be complex and time-consuming to build and optimize:

• Workload-specific metrics are required to gain visibility into batch size, queue size, inference latency, and request concurrency. AI teams should ship AI-specific containers that include those metrics, and pair them with infrastructure that leverages those metrics for scaling (see the instrumentation sketch below).

• Cold start acceleration is necessary for efficient scaling. Pulling large container images and loading large models can drastically slow down the scale-up process, leading to failed requests or slow responses.

• Scaling to zero: reduce cost by scaling down to zero replicas for inactive models to free up compute resources, and spin the model up only when a request is received.

At BentoML, we've optimized every layer of the inference and serving stack to ensure efficient scaling of private LLM inference workloads, while letting developers easily fine-tune scaling behavior to their specific needs (see the service config sketch below).

Check out our team's learnings on scaling AI inference at BentoML, by Sean Sheng: https://lnkd.in/gZvNgE3i

BentoML documentation on Concurrency and autoscaling: https://lnkd.in/gmVpgW93

#LLM #Autoscaling #Inference #OpenLLM
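To make the concurrency-based approach concrete, here is a minimal sketch of the scaling rule it implies: pick a target concurrency per replica (roughly the engine's max batch size) and size the deployment by the number of in-flight requests. Setting the minimum to zero gives the scale-to-zero behavior mentioned above. This is an illustrative calculation with placeholder numbers, not BentoML's exact autoscaling algorithm.

```python
import math

def desired_replicas(
    inflight_requests: int,   # active requests currently being processed across all replicas
    target_concurrency: int,  # requests one replica should handle, e.g. its max batch size
    min_replicas: int = 0,    # 0 allows scaling to zero when a model is idle
    max_replicas: int = 10,
) -> int:
    """Concurrency-based scaling rule (illustrative, not an exact product policy)."""
    if inflight_requests == 0:
        return min_replicas
    wanted = math.ceil(inflight_requests / target_concurrency)
    return max(min_replicas, min(wanted, max_replicas))

# Example: 130 active requests, each replica comfortably batches 32 requests
# -> ceil(130 / 32) = 5 replicas (bounded by min/max).
print(desired_replicas(inflight_requests=130, target_concurrency=32))  # 5
```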
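And a rough sketch of the workload-specific instrumentation the first bullet calls for: exposing in-flight requests, queue depth, and request latency so an autoscaler can act on them. It assumes prometheus_client is installed; the metric names, the semaphore-based concurrency cap, and the sleep standing in for the model call are all hypothetical.

```python
import asyncio
import random

from prometheus_client import Gauge, Histogram, start_http_server

# Illustrative metric names, not a standard.
IN_FLIGHT = Gauge("llm_inflight_requests", "Requests currently being processed")
QUEUE_DEPTH = Gauge("llm_queue_depth", "Requests waiting for a batch slot")
LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")

# Cap concurrent work roughly at the engine's max batch size (assumed value).
MAX_BATCH_SIZE = 32
slots = asyncio.Semaphore(MAX_BATCH_SIZE)

async def handle_request(prompt: str) -> str:
    QUEUE_DEPTH.inc()              # request is waiting for a slot
    async with slots:
        QUEUE_DEPTH.dec()
        IN_FLIGHT.inc()            # request is actively being processed
        try:
            with LATENCY.time():
                await asyncio.sleep(random.uniform(0.1, 0.5))  # stand-in for the model call
                return f"echo: {prompt}"
        finally:
            IN_FLIGHT.dec()

async def main() -> None:
    start_http_server(9090)        # /metrics endpoint for the autoscaler to scrape
    await asyncio.gather(*(handle_request(f"req-{i}") for i in range(100)))

if __name__ == "__main__":
    asyncio.run(main())
```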
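Finally, a sketch of how a BentoML service can declare its concurrency target through the traffic settings covered in the docs linked above. The concurrency value, service name, and method body are placeholders; see the documentation for the exact options and defaults.

```python
import bentoml

@bentoml.service(
    traffic={
        "concurrency": 32,       # target number of in-flight requests per replica
        "external_queue": True,  # buffer excess requests on a platform that provides a queue (e.g. BentoCloud)
    },
    resources={"gpu": 1},
)
class LLMService:
    @bentoml.api
    def generate(self, prompt: str) -> str:
        # Replace with a call into your inference engine (e.g. a vLLM backend).
        return f"generated text for: {prompt}"
```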