Tokens Per Second is Not All You Need
SambaNova Systems
The battle around high tokens per second (t/s) continues to heat up on X. But for the enterprise, is t/s the only metric that matters? We believe that time to first token (TTFT) is equally crucial, if not more so, for complex, document-heavy, and agentic use cases.
The Hype and Limitations of Tokens per Second
In pursuit of high t/s, some AI vendors cache model weights in SRAM and employ numerous chips in their inference systems. While this yields impressive decoding speeds, it brings significant drawbacks.
In real-world scenarios, these SRAM-based inference engines can take over 1.4 seconds to generate the first token for a 3K token input (see table), making first token latency 85% of total inference time. For enterprises dealing with document intelligence, long inputs are the norm, so this delay dominates the end-to-end experience.
In a recent Artificial Analysis report, a 900 tokens/s solution actually performed worse than 200-300 tokens/s solutions, even for a medium input of 1,000 tokens and an output of 100 tokens, simply because of its large TTFT.
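The arithmetic behind this result is simple: total request latency is TTFT plus decode time, so a long TTFT can swamp even a very fast decode. The sketch below uses hypothetical TTFT values (1.0 s and 0.2 s) chosen for illustration, not figures from the report:

```python
def end_to_end_latency(ttft_s, tokens_per_s, output_tokens):
    """Total request latency = time to first token + decode time."""
    return ttft_s + output_tokens / tokens_per_s

# Hypothetical comparison: a high-t/s engine with a long TTFT versus a
# slower-decoding engine with a short TTFT, both emitting 100 tokens.
fast_decode = end_to_end_latency(ttft_s=1.0, tokens_per_s=900, output_tokens=100)
slow_decode = end_to_end_latency(ttft_s=0.2, tokens_per_s=250, output_tokens=100)

# fast_decode ≈ 1.11 s, slow_decode = 0.60 s: the "slower" engine wins.
```

For short outputs, the decode term (output_tokens / t/s) is small, so TTFT decides which system feels faster.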
The SambaNova Solution: Designed for Enterprise Needs
We've taken a different approach to address the unique challenges of enterprise LLM inference. Our goal is to deliver fast, flexible, and scalable inference. Our unique architecture delivers the best of both worlds:
The secret? The RDU's unique three-tier memory hierarchy with reconfigurable dataflow:
This setup also enables:
SambaNova's solution also:
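To make the three-tier idea concrete, here is an illustrative placement sketch. The tier names follow the common SRAM/HBM/DDR pattern, but the capacities and the policy are hypothetical, not SambaNova's actual scheduler:

```python
# Hypothetical per-tier capacities in GB, fastest first. A tiered design
# keeps hot weights in the fastest memory that can hold them and spills
# the rest to larger, slower tiers instead of sharding across many chips.
TIERS = [("SRAM", 0.5), ("HBM", 64.0), ("DDR", 1536.0)]

def place(weights_gb):
    """Return the fastest tier whose capacity fits the given weights."""
    for name, capacity_gb in TIERS:
        if weights_gb <= capacity_gb:
            return name
    return "offload"  # larger than every tier
```

The design point this illustrates: an SRAM-only engine must shard a large model across many chips, while a hierarchy lets one system hold models far larger than its fastest memory.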
Conclusion
At SambaNova, we believe time to first token is equally crucial, if not more so, for the complex, document-heavy, and agentic use cases that businesses rely on.
To explore why t/s doesn't paint the full picture of enterprise LLM inference performance, read our latest blog.