How to Estimate Your GPU's LLM Token Generation Speed

Ever wondered how many tokens per second (TPS) your AI model can generate on your GPU(s)? Let’s walk through a simple, step-by-step approach to estimate throughput using your GPU’s specs. While this won’t be perfect for every scenario, it’ll give you a ballpark number—enough to guide early decisions on hardware and scaling.

Note: This TPS formula is just a starting point to gauge rough throughput. Real-world performance requires benchmarking under conditions that mirror your actual workload. If your estimates seem too low, consider optimizations like lower-precision quantization (4-bit), better GPU interconnects (NVLink, InfiniBand), and efficient parallel strategies.


1 - Figure Out Your Model Size

Suppose your model has X billion parameters.

If the model is quantized to 8-bit (Q8) precision (i.e., 1 byte per parameter), your total model size is roughly:

Model Size (GB) ≈ X (since X billion parameters × 1 byte each ≈ X GB).

Example: A 13B-parameter model at 8 bits ≈ 13 GB.
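As a quick sanity check, this size estimate can be sketched in a few lines of Python (the helper name is illustrative, not from any library):

```python
def model_size_gb(params_billion: float, bits_per_param: int = 8) -> float:
    """Estimate model weight size in GB for a given quantization level."""
    bytes_per_param = bits_per_param / 8
    # ~1e9 parameters * bytes each ~= that many GB
    return params_billion * bytes_per_param

# A 13B-parameter model at 8-bit (1 byte/param):
print(model_size_gb(13, bits_per_param=8))  # 13.0 GB
# The same model at 4-bit quantization:
print(model_size_gb(13, bits_per_param=4))  # 6.5 GB
```

Dropping to 4-bit halves the footprint, which is why quantization is the first lever to pull when memory (or bandwidth, as we'll see next) is tight.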

2 - Check Your GPU's Memory Bandwidth

Memory Bandwidth (GB/s) is typically listed in your GPU’s specs (e.g., 600–900 GB/s for many high-end GPUs).

If you spread the model across N GPUs evenly, each GPU handles:

Model Size per GPU = Model Size / N

3 - A Rough Formula for Tokens per Second

In an extremely simplified scenario, you can approximate:

TPS ≈ Memory Bandwidth (GB/s) / Model Size per GPU (GB)

where:

Memory Bandwidth is in GB/s (e.g., B = 672 GB/s)

Model Size per GPU is in GB (e.g., M = 2.3 GB)

Let's put all of this together:

TPS ≈ B / M = 672 / 2.3 ≈ 292 tokens per second
Reality Check - This formula assumes you’re 100% memory-bandwidth bound, that all parameters must be read each time you generate a token, and that there’s no other overhead. In real deployments, you’ll see a lower TPS due to latency, partial compute-bound kernels, attention overhead, and other bottlenecks.
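The estimate above can be expressed as a one-line function (a minimal sketch of the bandwidth-bound upper limit, with the function name chosen for illustration):

```python
def estimate_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound tokens/sec, assuming every weight is read once per token
    and the GPU is purely memory-bandwidth bound."""
    return bandwidth_gb_s / model_size_gb

# e.g., B = 672 GB/s of bandwidth, M = 2.3 GB of weights on this GPU:
print(round(estimate_tps(672, 2.3)))  # ~292 tokens/sec
```

Treat this as a ceiling, not a prediction: per the reality check above, measured TPS will land somewhere below it.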

4 - Scaling Across Multiple GPUs

If you have N identical GPUs and distribute your model + computation evenly:

Total TPS ≈ N × TPS per GPU
In a real-world scenario, communication overhead, synchronization, and load balancing typically reduce the effective multiplier to something less than N.
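One common way to model that shortfall is a scaling-efficiency factor; the 0.8 below is a hypothetical placeholder, not a measured value — calibrate it from your own benchmarks:

```python
def multi_gpu_tps(single_gpu_tps: float, n_gpus: int,
                  efficiency: float = 0.8) -> float:
    """Scale single-GPU throughput across N GPUs with a derating factor.

    `efficiency` stands in for communication, synchronization, and
    load-balancing losses; real values depend on interconnect and workload.
    """
    return single_gpu_tps * n_gpus * efficiency

# 292 TPS per GPU, 4 GPUs, assumed 80% scaling efficiency:
print(multi_gpu_tps(292, 4))  # 934.4 TPS, versus an ideal 4x = 1168
```

Faster interconnects (NVLink, InfiniBand) push `efficiency` closer to 1.0; chatty parallelism strategies over PCIe push it down.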

5 - Consider the Overhead

  1. Attention/context overhead: each new token attends to the entire context, so the longer the generated sequence grows, the higher the cost per token.
  2. Compute-bound kernels: sometimes matrix multiplications (tensor cores) are the real bottleneck, not memory bandwidth.
  3. Batching: grouping multiple inference requests can boost overall throughput but may add latency to individual queries.
  4. Parallelism: pipeline, tensor, and other parallel strategies help scale across GPUs but add communication overhead.
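Putting all five steps together, a rough end-to-end estimator might look like the sketch below. The 0.5 overhead factor is a deliberately pessimistic assumption folding in the list above (attention cost, compute-bound kernels, communication); it is illustrative only and should be replaced with a benchmarked value:

```python
def rough_llm_tps(params_billion: float, bits: int,
                  bandwidth_gb_s: float, n_gpus: int,
                  overhead_factor: float = 0.5) -> float:
    """End-to-end ballpark: model size -> per-GPU share -> bandwidth bound
    -> multi-GPU scaling, then derated for real-world overhead."""
    size_gb = params_billion * bits / 8        # step 1: total weight size
    per_gpu_gb = size_gb / n_gpus              # step 2: even split across GPUs
    per_gpu_tps = bandwidth_gb_s / per_gpu_gb  # step 3: bandwidth-bound TPS
    return per_gpu_tps * n_gpus * overhead_factor  # steps 4-5: scale + derate

# 13B model, 8-bit, 672 GB/s GPUs, 4-way split, 50% overhead derating:
print(round(rough_llm_tps(13, 8, 672, 4)))  # ~414 TPS
```

Even with the heavy derating, the exercise is useful: if your measured TPS is far below even this pessimistic estimate, something in the stack (kernel choice, parallelism strategy, interconnect) is likely misconfigured.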

