Next Stage in Gen AI for LLMs, Emerging Cloud vs. On-premise Economics, and Implications for the Enterprise
Srinivasa (Chuck) Chakravarthy
Managing Director, West Lead for HiTech XaaS Practice at Accenture
Over the last 18+ months, I have been an active retail user of ChatGPT, Claude, and Gemini. Earlier this week I also started using DeepSeek and Grok. I use these tools extensively to research specific topics, value stocks, summarize long documents, draft first versions of proposals, and more. I also use them in what some would call a “multi-agentic” mode: I ask the same question of all three LLMs and have the answer from one reviewed by another until I get enough of a “convergence.” Regardless, this has led me to question the economics of these models, as I very quickly exhaust the number of prompts that can be maintained in a context window, which leads me to believe we may be hitting a saturation point. During this time, some of my clients have also adopted the technology in several forms, including Copilot, and are now being bombarded by agentic solutions from ISVs. Clearly, an enterprise cannot afford to buy copilots and/or agentic solutions from multiple ISVs, as the costs add up quickly.
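For readers who want to picture that workflow, here is a minimal sketch of the multi-LLM “convergence” loop; the ask() function is a hypothetical wrapper around each vendor's chat API, not a real library call.

```python
# Minimal sketch of the multi-LLM "convergence" loop described above.
# ask(model, prompt) is a hypothetical wrapper around each vendor's chat API;
# wire it to the providers of your choice.

MODELS = ["chatgpt", "claude", "gemini"]

def ask(model: str, prompt: str) -> str:
    raise NotImplementedError("call the provider's API here")

def converge(question: str, max_rounds: int = 3) -> dict[str, str]:
    answers = {m: ask(m, question) for m in MODELS}
    for _ in range(max_rounds):
        revised = {}
        for i, model in enumerate(MODELS):
            reviewer = MODELS[(i + 1) % len(MODELS)]      # each answer is reviewed by the next model
            critique = ask(reviewer, f"Review this answer to '{question}':\n{answers[model]}")
            revised[model] = ask(model, f"Revise your answer given this critique:\n{critique}")
        if revised == answers:                            # crude convergence check
            break
        answers = revised
    return answers
```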
All of this led me to start thinking about (a) the flattening of incremental gains from newer versions of the various LLMs, (b) what this means for future LLM development, and (c) the economics of cloud versus on-premise LLM inference, fine-tuning, prompt engineering, and RAG, and the implications for the enterprise. In this blog, I attempt to address all of the above, including economics and cost calculations that may not be precise; however, I hope they provide a view that is directionally correct.
Are Large Language Models (LLMs) Reaching a Point of Diminishing Returns?
Large Language Models (LLMs) have revolutionized artificial intelligence, but recent trends suggest we may be approaching a plateau in returns from making models larger and training them on ever-expanding datasets. The following analysis, supported by data and trends in the industry, sheds light on this phenomenon and what it means for the future of LLMs and their enterprise applications.
The Case for Diminishing Returns
A detailed analysis of various LLMs, including GPT and Claude models, reveals a clear trend: as model size and training costs increase, performance improvements become incrementally smaller. Below is a summary from the attached data; note that some of these numbers are guesstimates, while others are sourced from publicly available data.
Performance Metrics and Costs
As shown in the accompanying performance-cost graph, gains in performance (e.g., MMLU and LAMBADA scores) from larger models and larger datasets are not proportional to the exponential increase in cost. The marginal cost per unit of improvement skyrockets as models grow.
MMLU (Massive Multitask Language Understanding): Think of this as a giant standardized test covering everything from math to history to science to law. Scores show how well the AI can handle questions across many subjects, like a student's SAT score but spread across 57 different subjects. The scale is 0-100, like a typical test score.
LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects): Think of this as a reading comprehension test. The AI reads a passage and has to predict the last word or complete the thought, which tests whether it really understands context and meaning. Also scored 0-100.
Are We Running Out of Training Data?
A critical challenge in scaling LLMs is the availability of high-quality training data. Many models have already consumed most of the freely available public datasets, including web crawls such as Common Crawl, Wikipedia, public-domain books, open-source code repositories, and large Q&A forums.
Remaining Data Challenges
Validation with Data
Below is a table with some rough estimates of data sources and costs:
Alternative Approaches: Grok and DeepSeek
The Future of LLM Development
Given the diminishing returns and limited data, the industry must pivot from brute force scaling to smarter strategies. Here are potential pathways:
New Architectures
1) Mixture of Experts (MoE): Uses sparse activation, enabling only a fraction of the model’s parameters to be active for a given input. This dramatically reduces computational overhead.
2) Sparsity at the Token Level: Focuses on sparse attention mechanisms to prioritize only the most relevant tokens.
3) Structured Sparsity: Introduces sparsity directly into neural network layers to reduce the number of parameters and FLOPs.
4) Low-Rank Adaptation (LoRA): Fine-tunes LLMs by freezing most parameters and training small, low-rank matrices, drastically reducing resource requirements (see the sketch after this list).
5) Parameter-Efficient Fine-Tuning (PEFT): Methods like adapters and prefix-tuning focus on modifying only a small number of additional parameters while leveraging the frozen pre-trained model.
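To make the LoRA/PEFT idea concrete, here is a minimal fine-tuning setup sketch using the Hugging Face transformers and peft libraries; the base model name, target modules, and hyperparameters are illustrative assumptions, not recommendations.

```python
# Minimal LoRA fine-tuning setup sketch using Hugging Face transformers + peft.
# Model name, target modules, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-350m"                  # any causal LM checkpoint you are licensed to use
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                    # rank of the low-rank update matrices
    lora_alpha=16,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections typically adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of parameters are trainable
# From here, train with the usual transformers Trainer on a domain-specific dataset.
```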
Domain-Specific Small Language Models (SLMs)
a) Enterprises are increasingly fine-tuning smaller models for specific domains like healthcare, legal, and customer support.
b) These models are cheaper, faster, and easier to deploy while retaining high task-specific accuracy.
Emerging Focus Areas
i) Prompt Engineering: Optimizing inputs to obtain desired outputs without altering model parameters.
ii) Fine-Tuning: Adapting general-purpose LLMs to domain-specific tasks using smaller datasets.
iii) RAG Systems: Combining pre-trained LLMs with proprietary knowledge bases to dynamically retrieve relevant information (a minimal retrieval sketch follows this list).
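As an illustration of the RAG pattern, here is a minimal retrieval sketch using the sentence-transformers library; the embedding model and the documents are illustrative assumptions, and a production system would use a vector database rather than in-memory arrays.

```python
# Minimal RAG retrieval sketch: embed a small document store, retrieve the
# top-k most similar passages, and prepend them to the prompt.
# Embedding model and documents are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our standard enterprise support SLA is 4 business hours.",
    "Fine-tuning jobs are billed per million training tokens.",
    "On-prem GPU clusters are amortized over 36 months.",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity (vectors are normalized)
    return [documents[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How is fine-tuning billed?"))
```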
Cloud vs. On-premise Economics of Fine-Tuning, Prompt Engineering, and RAG
Using an earlier breakeven analysis, here are the economic implications:
Fine-Tuning Scenario:
1 million tokens is often used as an illustrative fine-tuning data size because:
Cloud Cost per Fine-Tuning Job
Annual Cloud Cost
Annual On-Prem Cost
Breakeven:
$220,000 = $100 × X ⇒ X = 2,200 fine-tuning jobs in the first year
If you do fewer than ~2,200 fine-tuning runs (each at 1M tokens) in that first year, cloud is cheaper. That works out to roughly 180-200 jobs per month, which is extremely high for typical enterprise usage; on-prem HPC for fine-tuning generally only pays off if you are constantly training or have multi-billion-token training runs. A workload of that size would be rare outside industries like pharmaceuticals or legal services with high-volume, specialized needs. For most enterprises, cloud-based fine-tuning makes economic sense due to its flexibility and pay-as-you-go model.
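A quick sanity check on that breakeven arithmetic, using the rough figures from this post ($100 per ~1M-token cloud fine-tuning job and ~$220,000 in first-year on-prem cost), not vendor pricing:

```python
# Fine-tuning breakeven check: cloud pay-per-job vs. first-year on-prem cost.
# Both figures are the rough estimates used in this post, not vendor pricing.
CLOUD_COST_PER_JOB = 100        # USD per ~1M-token fine-tuning job
ONPREM_FIRST_YEAR = 220_000     # USD: hardware, setup, and operations in year one

breakeven_jobs = ONPREM_FIRST_YEAR / CLOUD_COST_PER_JOB
print(f"Breakeven: {breakeven_jobs:,.0f} jobs/year (~{breakeven_jobs / 12:,.0f} per month)")
# -> Breakeven: 2,200 jobs/year (~183 per month)
```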
Inference Scenario:
Month 1:
Amortize Setup Over 12 Months:
Breakeven:
$0.045 per 1K tokens × Token_Volume = $18,300 per month ⇒ Token_Volume = 18,300 / 0.045 ≈ 406,667K tokens ≈ 407 million tokens/month
So the monthly breakeven is ~400M tokens if you spread the capital cost over a year. If you exceed 400M tokens every month, on-prem can be cheaper in the long run.
Note: If you actually max out your 1B-token capacity, your monthly on-prem cost is $18.3K vs. $45K in the cloud, so on-prem looks favorable in that scenario. But it takes time to recoup the initial $160K; you would usually see an overall ROI around month 6-10 if you are consistently near full usage.
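The same breakeven arithmetic as code, using the post's rough assumptions of $0.045 per 1K cloud tokens, ~$18.3K per month on-prem (with the ~$160K setup amortized over 12 months), and a 1B-token monthly on-prem capacity:

```python
# Inference breakeven: cloud per-token pricing vs. amortized on-prem cost.
# All figures are the rough assumptions used in this post.
CLOUD_PER_1K_TOKENS = 0.045          # USD per 1,000 tokens
ONPREM_MONTHLY = 18_300              # USD/month, incl. ~$160K setup amortized over 12 months
CAPACITY_TOKENS = 1_000_000_000      # assumed on-prem monthly token capacity

breakeven_tokens = ONPREM_MONTHLY / CLOUD_PER_1K_TOKENS * 1_000
print(f"Breakeven: {breakeven_tokens / 1e6:,.0f}M tokens/month")            # ~407M

cloud_at_capacity = CAPACITY_TOKENS / 1_000 * CLOUD_PER_1K_TOKENS
print(f"At 1B tokens/month: cloud ${cloud_at_capacity:,.0f} vs. on-prem ${ONPREM_MONTHLY:,}")
```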
Is 400M tokens per month feasible for a typical enterprise? The short answer: it depends on the scale and nature of the workload, as the illustrative examples below suggest.
Illustrative Examples
Smaller Enterprises
For an organization with only a few hundred daily users or simpler tasks (like short text classification), reaching 400M tokens/month is less likely. They might hover in the range of a few million tokens monthly, making on-premises infrastructure less cost-effective in the near term.
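As a rough way to gauge where an organization falls relative to the ~400M-token breakeven, here is a back-of-the-envelope volume estimator; the user counts, query rates, and per-query token figures are purely illustrative assumptions.

```python
# Back-of-the-envelope monthly token volume estimator.
# All inputs are illustrative assumptions; substitute your own usage profile.
def monthly_tokens(daily_users: int, queries_per_user: int,
                   tokens_per_query: int, workdays: int = 22) -> int:
    return daily_users * queries_per_user * tokens_per_query * workdays

small_org = monthly_tokens(200, 5, 200)       # short classification-style tasks
large_org = monthly_tokens(2_000, 10, 1_500)  # broad assistant-style usage

print(f"Small org: {small_org / 1e6:,.1f}M tokens/month")   # ~4.4M, well below breakeven
print(f"Large org: {large_org / 1e6:,.0f}M tokens/month")   # ~660M, above breakeven
```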
RAG Systems
Assumptions:
Cloud Costs (Example using Pinecone):
On-Prem Costs (Example):
Clearly, cloud RAG is cheaper than on-premise RAG at this scale; on-prem is justified only for enterprises handling millions of documents or queries.
Prompt Engineering
Cloud Cost: Experimenting with 1,000 prompts (which fits well with enterprise use cases) on GPT-4, at 2,000 tokens each (input ~1,000 tokens + output ~1,000 tokens), costs:
Total Cloud Cost: $180/day for heavy experimentation
On-Prem Cost: Assuming local inference with 8 GPUs and full utilization for 1 day:
Total On-prem Cost: ~$660/day for heavy experimentation.
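Reconstructing the arithmetic behind those two figures, assuming legacy GPT-4 (32K-context) list pricing of roughly $0.06 per 1K input tokens and $0.12 per 1K output tokens, and an assumed blended on-prem rate of about $3.44 per GPU-hour; all rates are assumptions for illustration, not current prices:

```python
# Reconstructing the prompt-engineering cost comparison above.
# Pricing figures are assumptions for illustration, not current list prices.
PROMPTS = 1_000
INPUT_TOKENS, OUTPUT_TOKENS = 1_000, 1_000            # per prompt
CLOUD_IN_PER_1K, CLOUD_OUT_PER_1K = 0.06, 0.12        # assumed legacy GPT-4 (32K) pricing, USD
GPUS, HOURS, RATE_PER_GPU_HOUR = 8, 24, 3.44          # assumed blended on-prem cost

cloud = PROMPTS * (INPUT_TOKENS / 1_000 * CLOUD_IN_PER_1K
                   + OUTPUT_TOKENS / 1_000 * CLOUD_OUT_PER_1K)
onprem = GPUS * HOURS * RATE_PER_GPU_HOUR

print(f"Cloud:   ${cloud:,.0f}/day")     # ~$180
print(f"On-prem: ${onprem:,.0f}/day")    # ~$660
```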
The Enterprise Pivot: Fine-Tuning, Prompt Engineering, and RAG
For enterprises, the future of LLMs lies not in training ever-larger models but in three levers: fine-tuning general-purpose models on domain data, engineering prompts to extract more value from existing models, and grounding responses in proprietary knowledge through RAG.
Use Case Examples
Suggested Cloud Vs. On-premise Gen AI Framework
Key Decision Factors (a simple decision-helper sketch follows this list):
1. Volume Threshold
   - <100M tokens/month: Cloud
   - 100M-500M tokens/month: Hybrid
   - >500M tokens/month: Consider on-prem
2. Data Sensitivity
   - Regulated data: On-prem or private cloud
   - Public data: Any deployment model
3. Latency Requirements
   - <100ms: On-prem or edge
   - 100-500ms: Any deployment model
   - >500ms: Cloud optimal
4. Cost Structure Preference
   - CapEx heavy: On-prem
   - OpEx heavy: Cloud
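Expressed as code, the framework might look like the following helper; the thresholds mirror the list above and are starting points rather than hard rules.

```python
# Simple deployment-decision helper mirroring the framework above.
# Thresholds are the illustrative ones from this post, not hard rules.
def recommend_deployment(tokens_per_month_m: float, regulated_data: bool,
                         latency_ms: float, prefers_capex: bool) -> str:
    if regulated_data:
        return "On-prem or private cloud"    # data sensitivity dominates other factors
    if latency_ms < 100:
        return "On-prem or edge"             # strict latency requirement
    if tokens_per_month_m > 500 or prefers_capex:
        return "Consider on-prem"
    if tokens_per_month_m >= 100:
        return "Hybrid"
    return "Cloud"

print(recommend_deployment(tokens_per_month_m=50, regulated_data=False,
                           latency_ms=300, prefers_capex=False))   # -> Cloud
```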
Conclusion
As LLMs hit scaling limits, the future will be defined by efficiency, specialization, and enterprise value extraction. The shift from building larger general-purpose models to deploying cost-effective, task-specific solutions is not just a necessity—it’s an opportunity for innovation.
By strategically combining fine-tuning, prompt engineering, and RAG approaches, enterprises can unlock tremendous value while optimizing costs. Whether in the cloud or on-premise, the choice will depend on volume, latency needs, and data sensitivity. However, the days of purely scaling LLMs to ever-larger sizes may well be behind us.