How to Estimate Your GPU's LLM Token Generation Speed
Ever wondered how many tokens per second (TPS) your AI model can generate on your GPU(s)? Let’s walk through a simple, step-by-step approach to estimate throughput using your GPU’s specs. While this won’t be perfect for every scenario, it’ll give you a ballpark number—enough to guide early decisions on hardware and scaling.
Note: This TPS formula is just a starting point to gauge rough throughput. Real-world performance requires benchmarking under conditions that mirror your actual workload. If your estimates seem too low, consider optimizations like lower-precision quantization (4-bit), better GPU interconnects (NVLink, InfiniBand), and efficient parallel strategies.
1 - Figure Out Your Model Size
Suppose your model has X billion parameters.
If the model is quantized to 8-bit (Q8) precision (i.e., 1 byte per parameter), its total size is roughly:
Model Size (GB) ≈ X (since X billion parameters × 1 byte each).
Example: a 13B-parameter model at 8 bits ≈ 13 GB.
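As a quick sanity check, here is a minimal Python sketch of the same arithmetic; the function name and example values are illustrative assumptions, not part of any library.

```python
def model_size_gb(params_billions: float, bits_per_param: int = 8) -> float:
    """Approximate weight size in GB for a given parameter count and quantization."""
    bytes_per_param = bits_per_param / 8
    # params_billions * 1e9 params * bytes-per-param / 1e9 bytes-per-GB
    return params_billions * bytes_per_param

print(model_size_gb(13, bits_per_param=8))  # ~13.0 GB at Q8
print(model_size_gb(13, bits_per_param=4))  # ~6.5 GB at 4-bit
```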
2 - Check Your GPU’s Memory Bandwidth
Memory Bandwidth (GB/s) is typically listed in your GPU’s specs (e.g., 600–900 GB/s for many high-end GPUs).
If you spread the model across N GPUs evenly, each GPU handles roughly Model Size / N gigabytes.
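Extending the sketch, the per-GPU share is a simple division; the GPU count and bandwidth values below are placeholder assumptions you would swap for your own hardware specs.

```python
model_size_gb_total = 13.0   # from step 1
num_gpus = 2                 # assumed GPU count
gpu_bandwidth_gbs = 672.0    # assumed per-GPU memory bandwidth from the spec sheet

model_size_per_gpu = model_size_gb_total / num_gpus
print(f"Each GPU holds ~{model_size_per_gpu:.1f} GB of weights")
```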
3 - A Rough Formula for Tokens per Second
In an extremely simplified scenario, you can approximate:
TPS ≈ Memory Bandwidth (GB/s) ÷ Model Size per GPU (GB)
where:
Memory Bandwidth is in GB/s (e.g. B = 672 GB/s)
Model Size per GPU is in GB (e.g. M = 2.3 GB)
Let's put all of this together:
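With the example numbers above (B = 672 GB/s, M = 2.3 GB per GPU), a minimal sketch of the calculation looks like this:

```python
bandwidth_gbs = 672.0         # B: memory bandwidth in GB/s
model_size_per_gpu_gb = 2.3   # M: model shard per GPU in GB

tps_upper_bound = bandwidth_gbs / model_size_per_gpu_gb
print(f"Rough upper bound: ~{tps_upper_bound:.0f} tokens/second")  # ~292
```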
Reality Check - This formula assumes you’re 100% memory-bandwidth bound, that all parameters must be read each time you generate a token, and that there’s no other overhead. In real deployments, you’ll see a lower TPS due to latency, partial compute-bound kernels, attention overhead, and other bottlenecks.
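One hedged way to fold this into a back-of-the-envelope estimate is to discount the ideal number by an efficiency factor; the 0.5–0.7 range below is purely an assumption to be replaced with your own benchmark data.

```python
ideal_tps = 292.0  # upper-bound estimate from the formula above

# Assumed efficiency range covering latency, attention overhead,
# and partially compute-bound kernels; calibrate against real benchmarks.
for efficiency in (0.5, 0.7):
    print(f"At {efficiency:.0%} efficiency: ~{ideal_tps * efficiency:.0f} tokens/s")
```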
4 - Scaling Across Multiple GPUs
If you have N identical GPUs and distribute your model and computation evenly, the ideal case is linear scaling: TPS ≈ N × Memory Bandwidth ÷ Total Model Size (equivalently, Memory Bandwidth ÷ Model Size per GPU).
In a real-world scenario, communication overhead, synchronization, and load balancing typically reduce the effective multiplier to something less than N.
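A sketch of the same idea: multiply the single-GPU view by N for the ideal case, then apply a parallel-efficiency factor (an assumed value here, not a measurement) to account for communication and synchronization.

```python
bandwidth_gbs = 672.0        # per-GPU memory bandwidth
total_model_size_gb = 13.0   # whole model, from step 1
num_gpus = 4                 # assumed GPU count
parallel_efficiency = 0.8    # assumed fraction of ideal linear scaling

ideal_tps = num_gpus * bandwidth_gbs / total_model_size_gb
realistic_tps = ideal_tps * parallel_efficiency
print(f"Ideal: ~{ideal_tps:.0f} tok/s, with overhead: ~{realistic_tps:.0f} tok/s")
```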
5 - Consider the Overhead