Calculating LLM Costs - Verbosity Index

Credit: AI Generated Image, DALL-E

Introduction: The Hidden Cost of Chain of Thought Reasoning in LLMs

The rise of Chain of Thought (CoT) reasoning in Large Language Models (LLMs) has introduced a significant shift in how models process and generate text. While CoT improves reasoning capabilities by breaking down complex problems into step-by-step explanations, it also drastically increases the number of tokens produced. This presents a major challenge for cost estimation when using API-based models, as the traditional pricing metric—cost per token—fails to account for varying verbosity across models.

Previously, estimating API usage costs was relatively straightforward since most LLMs produced comparable output token counts for a given input prompt. However, as CoT adoption grows, the difference in verbosity between models can lead to unexpected costs. A model with a lower per-token price may end up being more expensive overall due to excessive token generation.

This article explores the impact of CoT on cost estimation, compares verbosity across models like OpenAI’s o1, o3, and DeepSeek R1, and introduces a Verbosity Index—a new metric to improve cost predictability when working with LLM APIs. Finally, we discuss the implications for sustainability, as verbosity is directly linked to computational efficiency and energy consumption.


The Problem: Chain of Thought and Cost Estimation

With increasing competition in the Chain of Thought (CoT) space, models such as OpenAI's o1, o3, and o3-mini, as well as DeepSeek R1, are challenging the cost structures of traditional LLMs. Unlike previous models where output token usage was relatively predictable, CoT reasoning introduces significant variability in token generation.

Historically, most LLMs would produce a similar number of output tokens for a given input, with only minor variations due to differences in model architecture and response style. However, CoT changes this landscape entirely by dramatically increasing verbosity, making API cost calculations much more complex.

For example, consider the following prompt:

"Write a Python function that solves the traveling salesman problem (TSP) using dynamic programming with memorization. The function should accept an adjacency matrix and return the shortest path. Explain your approach step by step."

When tested across multiple LLM APIs, response token counts were historically fairly uniform across models. With the introduction of CoT-based reasoning, however, token generation now varies significantly from model to model.
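
As a rough illustration, the sketch below sends this prompt to an OpenAI-compatible chat endpoint and records the token counts the API reports back. The model names are placeholders, not the exact models benchmarked here; other providers that expose an OpenAI-compatible API (DeepSeek, for example) can typically be queried the same way by pointing the client at their base URL.

```python
# Minimal sketch: send the TSP prompt to an OpenAI-compatible chat API and
# log the token counts the API reports. Model names are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TSP_PROMPT = (
    "Write a Python function that solves the traveling salesman problem (TSP) "
    "using dynamic programming with memoization. The function should accept an "
    "adjacency matrix and return the shortest path. Explain your approach step by step."
)

def run_once(model: str) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for one completion, as reported by the API."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TSP_PROMPT}],
    )
    usage = response.usage
    return usage.prompt_tokens, usage.completion_tokens

if __name__ == "__main__":
    for model in ["gpt-4o-mini", "o3-mini"]:  # hypothetical model selection
        tokens_in, tokens_out = run_once(model)
        print(f"{model}: {tokens_in} input tokens, {tokens_out} output tokens")
```

Note that for reasoning models, the completion token count reported by the API generally includes the hidden reasoning tokens as well, which is exactly the verbosity this article is concerned with.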


Empirical Findings: Token Usage Across LLMs

Baseline Token Consumption

For standard API-based chat responses (without CoT reasoning), output token counts remained relatively stable across major LLMs, as illustrated below:

Image 1 - Output Tokens for TSP Prompt (200 runs)

Impact of Chain of Thought on Token Usage

However, once Chain of Thought reasoning is introduced, token consumption skyrockets. Certain models—particularly DeepSeek R1—demonstrate significantly higher verbosity than others.


Image 2 - Output Tokens for TSP Prompt (200 runs) with CoT models.


This poses a serious challenge for businesses relying on API pricing based on cost per token, as verbosity can outweigh the per-token savings offered by cheaper models.

For instance, despite DeepSeek R1’s lower per-token price, its excessive verbosity results in significantly higher costs compared to OpenAI’s o3 models.


Image 3 - Cost to Finish Run


Introducing the Verbosity Index

Since CoT significantly alters cost calculations, a new metric is needed to better predict actual API expenses. This is where the Verbosity Index comes into play.

Definition of the Verbosity Index

The verbosity of a model can be quantified as the ratio of output tokens to input tokens:

V = Tokens Output / Tokens Input

Image 5 - Verbosity Index Formula

where:

  • Tokens Output: Total number of tokens generated by the model (including reasoning).
  • Tokens Input: Number of tokens in the prompt.

A model with a higher verbosity ratio generates significantly more tokens per input token. For example, if a model returns 300 tokens for a 100-token prompt, its verbosity score is 3.0.

Empirical Verbosity Measurement

To systematically calculate verbosity across models, we propose the following methodology:

  1. Use a Standardised Prompt Set: Create a dataset of prompts, varying in length and complexity. Prompts should be categorized into different types (Summary, Creative Writing, Coding, Problem Solving, Research, etc.).
  2. Measure Response Length: Query different LLMs and log both input and output token counts.
  3. Compute the Verbosity Ratio (V).
  4. Compare Across Models: Rank models from least to most verbose. (A minimal harness covering these steps is sketched below.)
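
A harness for steps 1-4 might look like the following sketch. The prompt set, model list, runs_per_prompt default, and query_fn callable are all assumptions to be filled in; query_fn can be the benchmarking function shown earlier, or any function that returns the (input_tokens, output_tokens) pair for one call.

```python
# Sketch of the verbosity measurement loop (steps 1-4 above).
# query_fn(model, prompt) must return the (input_tokens, output_tokens)
# pair reported by the provider's API for a single completion.
from statistics import mean
from typing import Callable

def verbosity_index(
    models: list[str],
    prompts: list[str],
    query_fn: Callable[[str, str], tuple[int, int]],
    runs_per_prompt: int = 5,
) -> dict[str, float]:
    """Return the mean output/input token ratio (V) per model, least verbose first."""
    scores: dict[str, float] = {}
    for model in models:
        ratios = []
        for prompt in prompts:
            for _ in range(runs_per_prompt):
                tokens_in, tokens_out = query_fn(model, prompt)
                ratios.append(tokens_out / tokens_in)
        scores[model] = mean(ratios)
    # Rank models from least to most verbose
    return dict(sorted(scores.items(), key=lambda item: item[1]))
```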

Once the verbosity index is available, cost calculations can be adjusted accordingly:

Estimated Cost = (Input Tokens × Price per Input Token) + (Input Tokens × V × Price per Output Token)
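
In code, the adjusted estimate could look like this sketch, where the expected output length is approximated as input_tokens × V. The prices and verbosity value in the usage example are purely illustrative, and prices are assumed to be quoted per token (divide per-million-token prices by 1,000,000).

```python
# Hedged sketch of the adjusted cost estimate above: expected output tokens
# are approximated as input_tokens * V, so the verbosity index stands in for
# the unknown response length.
def estimated_cost(
    input_tokens: int,
    price_per_input_token: float,
    price_per_output_token: float,
    verbosity: float,
) -> float:
    """Estimated cost = input cost + (input_tokens * V) * output token price."""
    input_cost = input_tokens * price_per_input_token
    output_cost = input_tokens * verbosity * price_per_output_token
    return input_cost + output_cost

# Illustrative example: a 500-token prompt, made-up per-token prices,
# and a verbosity index of 8.0
print(estimated_cost(500, 1.10e-6, 4.40e-6, 8.0))
```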


Image 4 - Cost Per Output Token

Sustainability Implications

Beyond cost modelling, verbosity also plays a crucial role in energy efficiency. The extreme hype surrounding DeepSeek R1 largely stems from its claimed efficiency improvements over other models. However, the verbosity index allows us to measure actual efficiency not just in FLOPs or energy usage per token, but in terms of how efficiently a model reaches a solution.

A model that generates excessive tokens—even if it is cheaper per token—will ultimately consume more computational resources. Therefore, by factoring in verbosity, we gain a better understanding of the true cost of computation and energy consumption.

Models from the same generation with lower verbosity scores will generally be more energy efficient, as they produce fewer tokens while maintaining accuracy. This makes the verbosity index not just a tool for cost estimation, but also a key metric for sustainability evaluations.

Conclusion

The introduction of Chain of Thought (CoT) reasoning has fundamentally changed how LLMs consume and generate tokens, making traditional cost estimation methods ineffective. Our analysis shows that CoT can dramatically increase verbosity, leading to higher-than-expected costs, even for models with lower per-token pricing.

The Verbosity Index provides a new, empirical way to measure and predict LLM costs, helping businesses make informed decisions about model selection based on actual output efficiency rather than just token pricing. Additionally, this metric contributes to sustainability discussions, as more concise models will generally be more energy-efficient.

With LLMs evolving rapidly, understanding verbosity is crucial for API cost management, model selection, and environmental impact assessments.


Eystein Thanisch

Senior Technologist @ ADL Catalyst


Pertinent and practical analysis at this juncture - the try-it-yourself guide is particularly appreciated. I'm wondering if some attempt should be made at comparing CoT models with others in terms of tokens per task accomplished. CoT models might be verbose but if they solve in one response what would have taken a non-CoT model several iterations, then they might work out as more comparable. My impression from actual examples of chains of thought, however, is that they meander considerably more than would interactions with a user. So your conclusions seem still sound.

Godwin Josh

Co-Founder of Altrosyn and Director at CDTECH | Inventor | Manufacturer


While verbosity is undeniably a factor in CoT model costs, attributing increased expenses solely to token count overlooks the potential for efficiency gains through specialized hardware and algorithmic optimizations. The recent surge in open-source CoT models suggests a trend towards democratizing access despite cost concerns. How might your Verbosity Index account for the evolving landscape of open-source contributions and community-driven cost reductions?
