Tokens Per Second is Not All You Need

The growing battle around high tokens per second (t/s) continues to heat up on X. But for the enterprise, is t/s the only metric that matters? We believe that time to first token is equally, if not more, crucial for complex, document-heavy, and agentic use cases.

The Hype and Limitations of Tokens per Second

In pursuit of high t/s, some AI vendors cache model weights in SRAM and employ large numbers of chips in their inference systems. While this yields impressive decoding speeds, it brings significant drawbacks.

In real-world scenarios, these SRAM-based inference engines can take over 1.4 seconds to generate the first token for a 3K token input (see table), making first token latency 85% of total inference time. For enterprises dealing with document intelligence, search, retrieval, and retrieval-augmented generation (RAG), that's simply not fast enough.

In a recent Artificial Analysis report, a 900 t/s solution actually performs worse end to end than a 200-300 t/s solution, even for a medium-sized input of 1,000 tokens and an output of 100 tokens, simply because of its large time to first token (TTFT).
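The arithmetic behind this is simple: end-to-end latency is TTFT plus decode time, so a large TTFT can swamp a fast decoder on short outputs. A minimal sketch, using illustrative TTFT and throughput figures rather than measured benchmarks:

```python
def total_latency(ttft_s: float, tps: float, output_tokens: int) -> float:
    """End-to-end latency = time to first token + decode time for the output."""
    return ttft_s + output_tokens / tps

# Illustrative figures: a 900 t/s engine with a 1.4s TTFT vs.
# a 250 t/s engine with a 0.2s TTFT, both generating 100 tokens.
fast_decoder = total_latency(ttft_s=1.4, tps=900, output_tokens=100)  # ~1.51s
low_ttft = total_latency(ttft_s=0.2, tps=250, output_tokens=100)      # 0.60s
print(f"900 t/s engine: {fast_decoder:.2f}s; 250 t/s engine: {low_ttft:.2f}s")
```

With these numbers the "slower" 250 t/s engine finishes in well under half the time, because decode speed only dominates once outputs grow long relative to TTFT.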

The SambaNova Solution: Designed for Enterprise Needs

We've taken a different approach to address the unique challenges of enterprise LLM inference. Our goal is to deliver fast, flexible, and scalable inference. Our unique architecture delivers the best of both worlds:

  • A blazing-fast 0.2s time to first token
  • The fastest 8-socket generation speed at 450 t/s

The secret? The RDU's unique three-tier memory hierarchy with reconfigurable dataflow.

This setup also enables:

  • Stellar performance with a minimal number of sockets
  • Orchestration of expert models beyond 8B

Conclusion

At SambaNova, we believe first token time is equally - if not more - crucial for the complex, document-heavy, and agentic use cases that businesses rely on.

To explore why t/s doesn't paint the full picture of enterprise LLM inference performance, read our latest blog.

Terrific work!

Justin Kinsey

President at SBT | 19 years of advising leaders in the semiconductor industry and architecting teams from startups to F500 companies

10 months ago

Insightful comparison. It would be interesting to see how cost per token also factors into the bigger picture. With many layers to consider for business use-cases, model developers and AI leaders have a big task on their shoulders.

Kirk Compton

Enterprise AI Executive: Leadership | Business Development | Strategic Alliances | Channel Sales | Global Partnerships | Ex NTT-Dimension Data-The Revere Group

10 months ago

great read, thank you team!!

Nice analysis! For certain use cases, time to first token is very important!

Walid AlHarbi MBA, AMP

Transformational Executive

10 months ago

Very promising!
