How to Estimate Your GPU's LLM Token Generation Speed

Ever wondered how many tokens per second (TPS) your AI model can generate on your GPU(s)? Let’s walk through a simple, step-by-step approach to estimate throughput using your GPU’s specs. While this won’t be perfect for every scenario, it’ll give you a ballpark number—enough to guide early decisions on hardware and scaling.

Note: This TPS formula is just a starting point to gauge rough throughput. Real-world performance requires benchmarking under conditions that mirror your actual workload. If your estimates seem too low, consider optimizations like lower-precision quantization (4-bit), better GPU interconnects (NVLink, InfiniBand), and efficient parallel strategies.


1 - Figure Out Your Model Size

Suppose your model has X billion parameters.

If the model is quantized to 8-bit (Q8) precision (i.e., 1 byte per parameter), your total model size is roughly:

Model Size (GB) ≈ X (since X billion parameters × 1 byte each ≈ X GB).

Example: A 13B-parameter model at 8 bits ≈ 13 GB.
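As a quick sanity check, this size estimate can be sketched in a few lines of Python (the helper name is illustrative, not from any library):

```python
def model_size_gb(params_billion: float, bits_per_param: int = 8) -> float:
    """Estimate model weight size in GB for a given quantization level."""
    bytes_per_param = bits_per_param / 8
    # ~1e9 parameters * bytes each ~= that many GB
    return params_billion * bytes_per_param

# A 13B-parameter model at 8-bit (1 byte/param):
print(model_size_gb(13, bits_per_param=8))  # 13.0 GB
# The same model at 4-bit quantization:
print(model_size_gb(13, bits_per_param=4))  # 6.5 GB
```

Dropping to 4-bit halves the footprint, which is why quantization is the first lever to pull when memory (or bandwidth, as we'll see next) is tight.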

2 - Check Your GPU's Memory Bandwidth

Memory Bandwidth (GB/s) is typically listed in your GPU’s specs (e.g., 600–900 GB/s for many high-end GPUs).

If you spread the model across N GPUs evenly, each GPU handles:

Model Size per GPU = Model Size / N

3 - A Rough Formula for Tokens per Second

In an extremely simplified scenario, you can approximate:

TPS ≈ Memory Bandwidth (GB/s) / Model Size per GPU (GB)

where:

Memory Bandwidth is in GB/s (e.g., B = 672 GB/s)

Model Size per GPU is in GB (e.g., M = 2.3 GB)

Let's put all of this together:

TPS ≈ B / M = 672 / 2.3 ≈ 292 tokens per second
Reality Check - This formula assumes you’re 100% memory-bandwidth bound, that all parameters must be read each time you generate a token, and that there’s no other overhead. In real deployments, you’ll see a lower TPS due to latency, partial compute-bound kernels, attention overhead, and other bottlenecks.
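The estimate above can be expressed as a one-line function (a minimal sketch of the bandwidth-bound upper limit, with the function name chosen for illustration):

```python
def estimate_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound tokens/sec, assuming every weight is read once per token
    and the GPU is purely memory-bandwidth bound."""
    return bandwidth_gb_s / model_size_gb

# e.g., B = 672 GB/s of bandwidth, M = 2.3 GB of weights on this GPU:
print(round(estimate_tps(672, 2.3)))  # ~292 tokens/sec
```

Treat this as a ceiling, not a prediction: per the reality check above, measured TPS will land somewhere below it.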

4 - Scaling Across Multiple GPUs

If you have N identical GPUs and distribute your model + computation evenly:

Total TPS ≈ N × TPS per GPU
In a real-world scenario, communication overhead, synchronization, and load balancing typically reduce the effective multiplier to something less than N.
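One common way to model that shortfall is a scaling-efficiency factor; the 0.8 below is a hypothetical placeholder, not a measured value — calibrate it from your own benchmarks:

```python
def multi_gpu_tps(single_gpu_tps: float, n_gpus: int,
                  efficiency: float = 0.8) -> float:
    """Scale single-GPU throughput across N GPUs with a derating factor.

    `efficiency` stands in for communication, synchronization, and
    load-balancing losses; real values depend on interconnect and workload.
    """
    return single_gpu_tps * n_gpus * efficiency

# 292 TPS per GPU, 4 GPUs, assumed 80% scaling efficiency:
print(multi_gpu_tps(292, 4))  # 934.4 TPS, versus an ideal 4x = 1168
```

Faster interconnects (NVLink, InfiniBand) push `efficiency` closer to 1.0; chatty parallelism strategies over PCIe push it down.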

5 - Consider the Overhead

  1. Attention/context overhead: each new token attends to the entire context, so the longer the generated sequence grows, the higher the cost per token.
  2. Compute-bound kernels: sometimes matrix multiplications (tensor cores) are the real bottleneck, not memory bandwidth.
  3. Batching: grouping multiple inference requests can boost overall throughput but may add latency to individual queries.
  4. Parallelism: pipeline, tensor, and other parallel strategies help scale across GPUs but add communication overhead.
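Putting all five steps together, a rough end-to-end estimator might look like the sketch below. The 0.5 overhead factor is a deliberately pessimistic assumption folding in the list above (attention cost, compute-bound kernels, communication); it is illustrative only and should be replaced with a benchmarked value:

```python
def rough_llm_tps(params_billion: float, bits: int,
                  bandwidth_gb_s: float, n_gpus: int,
                  overhead_factor: float = 0.5) -> float:
    """End-to-end ballpark: model size -> per-GPU share -> bandwidth bound
    -> multi-GPU scaling, then derated for real-world overhead."""
    size_gb = params_billion * bits / 8        # step 1: total weight size
    per_gpu_gb = size_gb / n_gpus              # step 2: even split across GPUs
    per_gpu_tps = bandwidth_gb_s / per_gpu_gb  # step 3: bandwidth-bound TPS
    return per_gpu_tps * n_gpus * overhead_factor  # steps 4-5: scale + derate

# 13B model, 8-bit, 672 GB/s GPUs, 4-way split, 50% overhead derating:
print(round(rough_llm_tps(13, 8, 672, 4)))  # ~414 TPS
```

Even with the heavy derating, the exercise is useful: if your measured TPS is far below even this pessimistic estimate, something in the stack (kernel choice, parallelism strategy, interconnect) is likely misconfigured.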

