AMD Instinct MI300X can achieve 2-5x higher throughput than NVIDIA H200 at the same latency with these optimizations!
For DeepSeek-R1 inference, MI300X delivers 2x to 5x higher throughput than the NVIDIA H200 at the same end-to-end latency, powered by the latest SGLang optimizations!

How does MI300X pull this off?

- Massive ROCm kernel upgrades via AITER (AI Tensor Engine): up to 2x faster GEMM ops and 3x faster MoE execution; up to 17x faster MLA decode and 14x faster MHA prefill
- Chunked prefill tuning: improves prefill efficiency by batching input sequences in chunks, leveraging MI300X's large VRAM
- Real-world impact: customers often require sub-50 ms inter-token latency (ITL). Within that budget, MI300X serves 8x more concurrent requests (128 vs. 16), as shown in Figure 2 of the blog.

What do you get (vs. H200)?

- 2x-5x higher throughput at the same latency
- Up to 75% higher throughput and 60% lower latency at the same concurrency

Link to blog: https://lnkd.in/gpRt5zNB

#AMD #MI300X #SGLang #AI #Inference #GPU #Performance #DeepSeek #ROCm
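As a back-of-the-envelope check on the concurrency numbers above: at a fixed inter-token latency budget, aggregate decode throughput scales roughly linearly with the number of concurrent requests. A minimal sketch, where the 50 ms budget and the 128 vs. 16 concurrency figures come from the post, and the one-token-per-ITL-interval model is a deliberate simplification:

```python
# Rough model: each concurrent request emits one token per ITL interval,
# so aggregate decode throughput ~= concurrency / ITL (a simplification
# that ignores prefill time and scheduling overhead).

def decode_throughput(concurrency: int, itl_s: float) -> float:
    """Aggregate tokens/sec across all concurrent requests at a fixed ITL."""
    return concurrency / itl_s

ITL_BUDGET_S = 0.050  # sub-50 ms inter-token latency target from the post

mi300x = decode_throughput(128, ITL_BUDGET_S)  # 128 concurrent requests
h200 = decode_throughput(16, ITL_BUDGET_S)     # 16 concurrent requests

print(f"MI300X: {mi300x:.0f} tok/s, H200: {h200:.0f} tok/s")
print(f"Ratio: {mi300x / h200:.0f}x")  # prints "Ratio: 8x"
```

Under this simple model, 8x more concurrent requests at the same ITL translates directly into 8x higher aggregate decode throughput, which is consistent with the 2x-5x end-to-end gains once prefill and batching effects are factored in.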