Calculating LLM Costs - Verbosity Index

Credit: AI Generated Image, DALL-E

Introduction: The Hidden Cost of Chain of Thought Reasoning in LLMs

The rise of Chain of Thought (CoT) reasoning in Large Language Models (LLMs) has introduced a significant shift in how models process and generate text. While CoT improves reasoning capabilities by breaking down complex problems into step-by-step explanations, it also drastically increases the number of tokens produced. This presents a major challenge for cost estimation when using API-based models, as the traditional pricing metric—cost per token—fails to account for varying verbosity across models.

Previously, estimating API usage costs was relatively straightforward since most LLMs produced comparable output token counts for a given input prompt. However, as CoT adoption grows, the difference in verbosity between models can lead to unexpected costs. A model with a lower per-token price may end up being more expensive overall due to excessive token generation.

This article explores the impact of CoT on cost estimation, compares verbosity across models like OpenAI’s o1, o3, and DeepSeek R1, and introduces a Verbosity Index—a new metric to improve cost predictability when working with LLM APIs. Finally, we discuss the implications for sustainability, as verbosity is directly linked to computational efficiency and energy consumption.


The Problem: Chain of Thought and Cost Estimation

With increasing competition in the Chain of Thought (CoT) space, models such as OpenAI's o1, o3, and o3-mini, as well as DeepSeek R1, are challenging the cost structures of traditional LLMs. Unlike previous models where output token usage was relatively predictable, CoT reasoning introduces significant variability in token generation.

Historically, most LLMs would produce a similar number of output tokens for a given input, with only minor variations due to differences in model architecture and response style. However, CoT changes this landscape entirely by dramatically increasing verbosity, making API cost calculations much more complex.

For example, consider the following prompt:

"Write a Python function that solves the traveling salesman problem (TSP) using dynamic programming with memorization. The function should accept an adjacency matrix and return the shortest path. Explain your approach step by step."

When tested across multiple LLM APIs, response token counts were historically fairly uniform across models. With the introduction of CoT-based reasoning, however, token generation now varies significantly from model to model.
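
As a rough illustration, the sketch below sends this prompt to an OpenAI-compatible chat endpoint and records the token counts the API reports back. The model names are placeholders, not the exact models benchmarked here; other providers that expose an OpenAI-compatible API (DeepSeek, for example) can typically be queried the same way by pointing the client at their base URL.

```python
# Minimal sketch: send the TSP prompt to an OpenAI-compatible chat API and
# log the token counts the API reports. Model names are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TSP_PROMPT = (
    "Write a Python function that solves the traveling salesman problem (TSP) "
    "using dynamic programming with memoization. The function should accept an "
    "adjacency matrix and return the shortest path. Explain your approach step by step."
)

def run_once(model: str) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for one completion, as reported by the API."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TSP_PROMPT}],
    )
    usage = response.usage
    return usage.prompt_tokens, usage.completion_tokens

if __name__ == "__main__":
    for model in ["gpt-4o-mini", "o3-mini"]:  # hypothetical model selection
        tokens_in, tokens_out = run_once(model)
        print(f"{model}: {tokens_in} input tokens, {tokens_out} output tokens")
```

Note that for reasoning models, the completion token count reported by the API generally includes the hidden reasoning tokens as well, which is exactly the verbosity this article is concerned with.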


Empirical Findings: Token Usage Across LLMs

Baseline Token Consumption

For standard API-based chat responses (without CoT reasoning), output token counts remained relatively stable across major LLMs, as illustrated below:

Image 1 - Output Tokens for TSP Prompt (200 runs)

Impact of Chain of Thought on Token Usage

However, once Chain of Thought reasoning is introduced, token consumption skyrockets. Certain models—particularly DeepSeek R1—demonstrate significantly higher verbosity than others.


Image 2 - Output Tokens for TSP Prompt (200 runs) with CoT models.


This poses a serious challenge for businesses relying on API pricing based on cost per token, as verbosity can outweigh the per-token savings offered by cheaper models.

For instance, despite DeepSeek R1’s lower per-token price, its excessive verbosity results in significantly higher costs compared to OpenAI’s o3 models.


Image 3 - Cost to Finish Run


Introducing the Verbosity Index

Since CoT significantly alters cost calculations, a new metric is needed to better predict actual API expenses. This is where the Verbosity Index comes into play.

Definition of the Verbosity Index

The verbosity of a model can be quantified as the ratio of output tokens to input tokens:

V = Tokens Output / Tokens Input

Image 5 - Verbosity Index Formula

where:

  • Tokens Output: Total number of tokens generated by the model (including reasoning).
  • Tokens Input: Number of tokens in the prompt.

A model with a higher verbosity ratio generates significantly more tokens per input token. For example, if a model returns 300 tokens for a 100-token prompt, its verbosity score is 3.0.

Empirical Verbosity Measurement

To systematically calculate verbosity across models, we propose the following methodology:

  1. Use a Standardised Prompt Set: Create a dataset of prompts, varying in length and complexity. Prompts should be categorized into different types (Summary, Creative Writing, Coding, Problem Solving, Research, etc.).
  2. Measure Response Length: Query different LLMs and log both input and output token counts.
  3. Compute the Verbosity Ratio (V).
  4. Compare Across Models: Rank models from least to most verbose. (A minimal harness covering these steps is sketched below.)
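
A harness for steps 1-4 might look like the following sketch. The prompt set, model list, runs_per_prompt default, and query_fn callable are all assumptions to be filled in; query_fn can be the benchmarking function shown earlier, or any function that returns the (input_tokens, output_tokens) pair for one call.

```python
# Sketch of the verbosity measurement loop (steps 1-4 above).
# query_fn(model, prompt) must return the (input_tokens, output_tokens)
# pair reported by the provider's API for a single completion.
from statistics import mean
from typing import Callable

def verbosity_index(
    models: list[str],
    prompts: list[str],
    query_fn: Callable[[str, str], tuple[int, int]],
    runs_per_prompt: int = 5,
) -> dict[str, float]:
    """Return the mean output/input token ratio (V) per model, least verbose first."""
    scores: dict[str, float] = {}
    for model in models:
        ratios = []
        for prompt in prompts:
            for _ in range(runs_per_prompt):
                tokens_in, tokens_out = query_fn(model, prompt)
                ratios.append(tokens_out / tokens_in)
        scores[model] = mean(ratios)
    # Rank models from least to most verbose
    return dict(sorted(scores.items(), key=lambda item: item[1]))
```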

Once the verbosity index is available, cost calculations can be adjusted accordingly:

Estimated Cost = (Input Tokens × Price per Input Token) + (Input Tokens × V × Price per Output Token)
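
In code, the adjusted estimate could look like this sketch, where the expected output length is approximated as input_tokens × V. The prices and verbosity value in the usage example are purely illustrative, and prices are assumed to be quoted per token (divide per-million-token prices by 1,000,000).

```python
# Hedged sketch of the adjusted cost estimate above: expected output tokens
# are approximated as input_tokens * V, so the verbosity index stands in for
# the unknown response length.
def estimated_cost(
    input_tokens: int,
    price_per_input_token: float,
    price_per_output_token: float,
    verbosity: float,
) -> float:
    """Estimated cost = input cost + (input_tokens * V) * output token price."""
    input_cost = input_tokens * price_per_input_token
    output_cost = input_tokens * verbosity * price_per_output_token
    return input_cost + output_cost

# Illustrative example: a 500-token prompt, made-up per-token prices,
# and a verbosity index of 8.0
print(estimated_cost(500, 1.10e-6, 4.40e-6, 8.0))
```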


Image 4 - Cost Per Output Token

Sustainability Implications

Beyond cost modelling, verbosity also plays a crucial role in energy efficiency. The extreme hype surrounding DeepSeek R1 largely stems from its claimed efficiency improvements over other models. However, the verbosity index allows us to measure actual efficiency not just in FLOPs or energy usage per token, but in terms of how efficiently a model reaches a solution.

A model that generates excessive tokens—even if it is cheaper per token—will ultimately consume more computational resources. Therefore, by factoring in verbosity, we gain a better understanding of the true cost of computation and energy consumption.

Models from the same generation with lower verbosity scores will generally be more energy efficient, as they produce fewer tokens while maintaining accuracy. This makes the verbosity index not just a tool for cost estimation, but also a key metric for sustainability evaluations.

Conclusion

The introduction of Chain of Thought (CoT) reasoning has fundamentally changed how LLMs consume and generate tokens, making traditional cost estimation methods ineffective. Our analysis shows that CoT can dramatically increase verbosity, leading to higher-than-expected costs, even for models with lower per-token pricing.

The Verbosity Index provides a new, empirical way to measure and predict LLM costs, helping businesses make informed decisions about model selection based on actual output efficiency rather than just token pricing. Additionally, this metric contributes to sustainability discussions, as more concise models will generally be more energy-efficient.

With LLMs evolving rapidly, understanding verbosity is crucial for API cost management, model selection, and environmental impact assessments.


Eystein Thanisch

Senior Technologist @ ADL Catalyst


Pertinent and practical analysis at this juncture - the try-it-yourself guide is particularly appreciated. I'm wondering if some attempt should be made at comparing CoT models with others in terms of tokens per task accomplished. CoT models might be verbose but if they solve in one response what would have taken a non-CoT model several iterations, then they might work out as more comparable. My impression from actual examples of chains of thought, however, is that they meander considerably more than would interactions with a user. So your conclusions seem still sound.

Godwin Josh

Co-Founder of Altrosyn and Director at CDTECH | Inventor | Manufacturer


While verbosity is undeniably a factor in CoT model costs, attributing increased expenses solely to token count overlooks the potential for efficiency gains through specialized hardware and algorithmic optimizations. The recent surge in open-source CoT models suggests a trend towards democratizing access despite cost concerns. How might your Verbosity Index account for the evolving landscape of open-source contributions and community-driven cost reductions?
