Next Stage in Gen AI for LLMs, Emerging Cloud vs. On-premise Economics, and Implications for the Enterprise

Over the last 18+ months, I have been an active retail user of ChatGPT, Claude, and Gemini. Earlier this week I also started using DeepSeek and Grok. I use these tools extensively to research specific topics, value stocks, summarize long documents, draft first versions of proposals, and so on. I also use them in what some would call a “multi-agentic” mode, where I pose the same question to the three LLMs and have one LLM’s answer reviewed by another until I get enough of a “convergence.” This has led me to question the economics of these models, as I very quickly exhaust the number of prompts that can be maintained in a context window, leading me to believe we may be hitting a saturation point. During this time, some of my clients have also adopted the technology in several forms, including Copilot, and are now being bombarded by agentic solutions from ISVs. Clearly, an enterprise cannot afford to buy Copilots and/or agentic solutions from multiple ISVs, as the costs add up very quickly.

This led me to start thinking about (a) the flattening of incremental gains from newer versions of the various LLMs, (b) what this means for future LLM development, and (c) the economics of cloud versus on-premise LLM inference, fine-tuning, prompt engineering, and RAG, and the implications for the enterprise. In this blog, I attempt to cover all of the above, including economics and cost calculations that may not be precise; however, I hope they provide a view that is directionally correct.

Are Large Language Models (LLMs) Reaching a Point of Diminishing Returns?

Large Language Models (LLMs) have revolutionized artificial intelligence, but recent trends suggest we may be approaching a plateau in returns from making models larger and training them on ever-expanding datasets. The following analysis, supported by data and trends in the industry, sheds light on this phenomenon and what it means for the future of LLMs and their enterprise applications.


The Case for Diminishing Returns

A detailed analysis of various LLMs, including GPT and Claude models, reveals a clear trend: as model size and training costs increase, performance improvements become incrementally smaller. Below is a summary from the attached data - note some of these numbers are guesstimates while others are sourced from publicly available data.

Performance Metrics and Costs

As shown in the accompanying performance-cost graph, gains in performance (e.g., MMLU and LAMBADA scores) from larger models and larger datasets are not proportional to the exponential increase in cost. The marginal cost per unit of improvement skyrockets as models grow.

MMLU (Massive Multitask Language Understanding): Think of this as a giant standardized test covering everything from math to history to science to law. Scores show how well the AI can handle questions across 57 different subjects, much like a student's SAT score. The scale is 0-100, like a typical test score.

LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects): Think of this as a reading comprehension test. The AI reads a passage and has to predict the final word, which requires understanding the broader context rather than just the last sentence. It tests whether the AI really understands context and meaning. Also scored 0-100.


Are We Running Out of Training Data?

A critical challenge in scaling LLMs is the availability of high-quality training data. Many models have already consumed freely available public datasets, including:

  • Common Crawl, Wikipedia, and large-scale text datasets like BooksCorpus.

Remaining Data Challenges

  1. Proprietary Data: Much of the remaining high-quality data is proprietary, locked behind enterprise firewalls or paywalls.
  2. Expensive and Fragmented: Accessing and cleaning proprietary data requires significant investment in licensing, curation, and compliance with privacy laws like GDPR and CCPA.

Validation with Data

  • According to studies from OpenAI and Google DeepMind, high-quality, unused public text data is projected to be fully utilized by 2025, leaving proprietary data as the primary frontier for future models.
  • The cost of high-quality proprietary datasets ranges from $1M to $10M per 100M tokens, making future training exponentially more expensive.

Below is a table with some rough estimates of data sources and costs:

Alternative Approaches: Grok and DeepSeek

  • Companies like xAI (with Grok) and DeepSeek have taken a different, more cost-effective approach to building LLMs. Instead of training monolithic, general-purpose models, they focus on: (i) smaller, specialized architectures tailored to domain-specific tasks, (ii) leveraging retrieval-augmented techniques and reusing publicly available smaller datasets combined with lightweight fine-tuning, and (iii) emphasizing modularity, where smaller components collaborate on large-scale tasks.
  • These approaches suggest that future LLM development could emphasize efficiency over brute force scaling, opening opportunities for smaller players in the field.


The Future of LLM Development

Given the diminishing returns and limited data, the industry must pivot from brute force scaling to smarter strategies. Here are potential pathways:

New Architectures

1) Mixture of Experts (MoE): Uses sparse activation, enabling only a fraction of the model’s parameters to be active for a given input. This dramatically reduces computational overhead.

2) Sparsity at the Token Level: Focuses on sparse attention mechanisms to prioritize only the most relevant tokens.

3) Structured Sparsity: Introducing sparsity directly into neural network layers to reduce the number of parameters and FLOPs.

4) Low-Rank Adaptation (LoRA): Fine-tunes LLMs by freezing most parameters and training small, low-rank matrices, drastically reducing resource requirements.

5) Parameter-Efficient Fine-Tuning (PEFT): Methods like adapters and prefix-tuning focus on modifying only a small number of additional parameters while leveraging the frozen pre-trained model.
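
To make items 4 and 5 concrete, below is a minimal LoRA fine-tuning sketch using the Hugging Face peft library. The base model name, target modules, and hyperparameters are illustrative assumptions, not recommendations.

```python
# Minimal LoRA sketch (assumes transformers + peft are installed;
# the model name and hyperparameters are illustrative only)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "mistralai/Mistral-7B-v0.1"  # hypothetical base model; any causal LM with q_proj/v_proj works
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: freeze the base weights and train small low-rank adapter matrices
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of the 7B parameters are trainable
```

Because only the adapter matrices are trained, a single mid-range GPU can often handle fine-tuning jobs that would otherwise require a full multi-GPU rig, which is exactly why these techniques matter for the cost discussion below.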

Domain-Specific Small Language Models (SLMs)

a) Enterprises are increasingly fine-tuning smaller models for specific domains like healthcare, legal, and customer support.

b) These models are cheaper, faster, and easier to deploy while retaining high task-specific accuracy.

Emerging Focus Areas

i) Prompt Engineering: Optimizing inputs to obtain desired outputs without altering model parameters.

ii) Fine-Tuning: Adapting general-purpose LLMs to domain-specific tasks using smaller datasets.

iii) RAG Systems: Combining pre-trained LLMs with proprietary knowledge bases to dynamically retrieve relevant information.
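
As a quick illustration of how prompt engineering and RAG come together in practice, here is a minimal sketch using the OpenAI Python client. The system prompt, retrieved snippets, and model name are placeholder assumptions, not a prescribed pattern.

```python
# Minimal sketch: prompt engineering + RAG-style context injection
# (assumes the openai package >= 1.0; model name and prompts are placeholders)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_context(question: str, retrieved_snippets: list[str]) -> str:
    """Assemble an engineered prompt that grounds the model in retrieved text."""
    context = "\n\n".join(retrieved_snippets)
    system_prompt = (
        "You are an enterprise support assistant. Answer ONLY from the "
        "provided context. If the answer is not in the context, say so."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; pick per your cost/latency needs
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.2,  # lower temperature for consistent, grounded answers
    )
    return response.choices[0].message.content

# retrieved_snippets would come from the vector search step covered later in this post
```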


Cloud vs. On-premise Economics of Fine-Tuning, Prompt Engineering, and RAG

Using a simple breakeven analysis for each workload, here are the economic implications:

Fine-Tuning Scenario:

  • Cloud: GPT-3.5 fine-tuning at $0.008 / 1K tokens for the training run, plus some overhead for data prep, iteration, etc. Let’s assume an effective cost of $0.01 / 1K tokens to simplify it.
  • Data Size: 1 million tokens per fine-tuning job (1M tokens = 1,000K tokens)

1 million tokens is often used as an illustrative fine-tuning data size because:

  • It represents a practical midpoint: large enough to show meaningful adaptation to a domain while being small enough to manage costs and computational time.
  • Many real-world datasets, such as domain-specific customer interactions or curated internal documents, naturally fall into this range.
  • Scaling calculations are easier—you can scale up or down costs linearly for smaller or larger datasets.

  • On-Prem: HPC server with 8× A100 at $160K total + $5K/month overhead.

Cloud Cost per Fine-Tuning Job

  • $0.01 × 1,000 = $10 in raw training usage
  • Let’s add overhead (multiple runs, data cleaning, engineering time) → call it $100 total per job to be more realistic.

Annual Cloud Cost

  • If you do X fine-tuning jobs per year, total cloud cost = $100×X

Annual On-Prem Cost

  • Year 1 capital: $160K
  • Monthly overhead: $5K → $60K/year
  • Total Year 1: $220K

Breakeven:

$220,000 = $100 × X → X = 2,200 fine-tuning jobs in the first year

If you do fewer than ~2,200 fine-tuning runs (each at 1M tokens) in that first year, cloud is cheaper. That is roughly 180-200 jobs per month, which is extremely high for typical enterprise usage; on-prem HPC for fine-tuning generally only pays off if you’re constantly training or have multi-billion-token training runs. A workload of that size would be rare outside industries like pharmaceuticals or legal services with high-volume, specialized needs. For most enterprises, cloud-based fine-tuning makes economic sense due to its flexibility and pay-as-you-go model.
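
The breakeven arithmetic above can be captured in a few lines; the inputs below simply mirror the assumptions in this section and can be swapped for your own numbers.

```python
# Fine-tuning breakeven sketch, using the assumptions above
CLOUD_COST_PER_JOB = 100.0              # $ per 1M-token fine-tuning job, incl. overhead
ONPREM_CAPEX = 160_000.0                # 8x A100 server
ONPREM_OVERHEAD_PER_YEAR = 5_000.0 * 12

def breakeven_jobs_year_one() -> float:
    """Number of fine-tuning jobs at which year-1 on-prem cost equals cloud cost."""
    return (ONPREM_CAPEX + ONPREM_OVERHEAD_PER_YEAR) / CLOUD_COST_PER_JOB

jobs = breakeven_jobs_year_one()
print(f"Breakeven: {jobs:,.0f} jobs/year (~{jobs / 12:,.0f} per month)")
# -> Breakeven: 2,200 jobs/year (~183 per month)
```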

Inference Scenario:

  • Cloud: GPT-4 at $0.03 / 1K tokens (input) + $0.06 / 1K tokens (output) = average $0.045 / 1K tokens.
  • On-Prem: 8× A100 HPC rig at $160K + $5K/month overhead.
  • GPU Throughput: 60 tokens/sec/GPU × 8 GPUs = 480 tokens/sec peak, with 80% utilization → ~384 tokens/sec.

Month 1:

  • On-prem cost = $160,000 + $5,000 = $165,000

Amortize Setup Over 12 Months:

  • Effective monthly cost ≈ $160K/12 + $5K ≈ $18.3K
  • Let Token_Volume = monthly tokens in thousands.
  • Cloud cost = Token_Volume × $0.045
  • On-Prem cost = $18.3K.

Breakeven:

$0.045 × Token_Volume = $18,300 → Token_Volume = 18,300 / 0.045 ≈ 406,667K tokens ≈ 407 million tokens/month

So the monthly break even is ~400M tokens if you spread the capital cost over a year. If you exceed 400M tokens every month, on-prem can be cheaper in the long run.

Note: If you actually max out the rig's ~1B-token monthly capacity (≈384 tokens/sec running 24×7), your monthly on-prem cost is $18.3K vs. $45K in the cloud, so on-prem looks favorable in that scenario. But it takes time to recoup the initial $160K. Usually, you’d see an overall ROI around month 6-10 if you are consistently near full usage.
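
Here is the same inference breakeven expressed as a short calculation, including the throughput-derived monthly capacity of the 8× A100 rig; all inputs are the assumptions stated above.

```python
# Inference breakeven sketch, using the assumptions above
CLOUD_PRICE_PER_1K = 0.045             # blended GPT-4 input/output $ per 1K tokens
ONPREM_MONTHLY = 160_000 / 12 + 5_000  # capex amortized over 12 months + overhead

TOKENS_PER_SEC = 60 * 8 * 0.80         # 8 GPUs at 60 tok/s each, 80% utilization
SECONDS_PER_MONTH = 3600 * 24 * 30

breakeven_tokens = ONPREM_MONTHLY / CLOUD_PRICE_PER_1K * 1_000
capacity_tokens = TOKENS_PER_SEC * SECONDS_PER_MONTH

print(f"On-prem monthly cost : ${ONPREM_MONTHLY:,.0f}")
print(f"Breakeven volume     : {breakeven_tokens / 1e6:,.0f}M tokens/month")
print(f"Rig capacity (24x7)  : {capacity_tokens / 1e6:,.0f}M tokens/month")
# -> roughly $18,333; ~407M tokens/month breakeven; ~995M tokens/month capacity
```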

Is 400M tokens per month feasible for a typical enterprise? The short answer:

  • Yes, it can be feasible, but it depends heavily on the number of users, how often they interact with the model, and how many tokens each interaction consumes. The illustrative examples below show how quickly volume adds up.

Illustrative Examples

  • Internal Knowledge Base Chatbot: An enterprise with 10,000 employees, each making ~20 queries per month (roughly one per working day), with each query averaging 1,000 tokens (input + output), reaches 200M tokens/month. Doubling that usage pushes you to 400M.
  • Customer Support Chatbot: A global consumer-facing chatbot with tens of thousands of daily users could easily surpass 400M tokens/month, especially if each session involves multiple back-and-forth interactions.

Smaller Enterprises

For an organization with only a few hundred daily users or simpler tasks (like short text classification), reaching 400M tokens/month is less likely. They might hover in the range of a few million tokens monthly, making on-premises infrastructure less cost-effective in the near term.
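
A quick demand-side sanity check helps here: the sketch below estimates monthly token volume from user counts and usage patterns. The defaults mirror the internal-chatbot example above; the small-organization numbers are an illustrative assumption.

```python
# Demand-side token volume estimator (defaults mirror the internal-chatbot example)
def monthly_tokens(users: int, queries_per_user_per_month: float,
                   tokens_per_query: int) -> int:
    """Rough monthly token volume for a chat-style workload."""
    return int(users * queries_per_user_per_month * tokens_per_query)

internal = monthly_tokens(users=10_000, queries_per_user_per_month=20,
                          tokens_per_query=1_000)   # ~200M tokens/month
small_org = monthly_tokens(users=300, queries_per_user_per_month=40,
                           tokens_per_query=500)    # ~6M tokens/month (hypothetical)

for name, vol in [("Internal KB chatbot", internal), ("Small organization", small_org)]:
    verdict = "on-prem worth evaluating" if vol >= 400e6 else "cloud likely cheaper"
    print(f"{name}: {vol / 1e6:,.0f}M tokens/month -> {verdict}")
```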

RAG Systems

  • Cloud-based vector databases (e.g., Pinecone) are ideal for lower volumes (<10M queries/month).
  • On-prem systems with Milvus/FAISS are better suited for enterprises with large-scale document retrieval needs and longer horizons.
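
For the on-prem path, a minimal retrieval layer can be assembled from open-source pieces. The sketch below uses sentence-transformers and FAISS; the embedding model and the tiny corpus are placeholder assumptions standing in for a real document collection.

```python
# Minimal on-prem RAG retrieval sketch (assumes: pip install sentence-transformers faiss-cpu)
import faiss
from sentence_transformers import SentenceTransformer

# Embedding model is an illustrative choice; any reasonable local model works
embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Expense reports must be filed within 30 days of travel.",
    "VPN access requires a hardware token issued by IT.",
    "Quarterly compliance training is mandatory for all staff.",
]  # stand-in for the 1M-document corpus in the cost model below

# Build an in-memory index; normalized embeddings + inner product = cosine similarity
doc_vecs = embedder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k most similar documents to feed into the LLM prompt."""
    q_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(q_vec, k)
    return [documents[i] for i in ids[0]]

print(retrieve("How do I get VPN access?"))
```

In production, the in-memory index would be replaced by Milvus or a sharded FAISS deployment, but the retrieval contract stays the same.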

Assumptions:

  • Document Collection: 1 million documents
  • Average Document Size: Equivalent to 1,000 tokens
  • Queries per Month: 100,000
  • Embedding Model: OpenAI's text-embedding-ada-002 (or a comparable model)

Cloud Costs (Example using Pinecone):

  • Vectorization (One-Time Cost):
      • 1 million documents × 1,000 tokens/document = 1 billion tokens
      • text-embedding-ada-002 cost: $0.0001 / 1K tokens (check for updates)
      • Vectorization Cost: 1,000,000,000 tokens × ($0.0001 / 1,000 tokens) = $100 one-time (≈ $8/month if amortized over the first year)
  • Storage:
      • Let's assume each vector is stored as a 768-dimensional vector using 4-byte floats (typical for many embedding models).
      • Storage per vector: 768 dimensions × 4 bytes/dimension = 3,072 bytes/vector
      • Total storage: 1,000,000 vectors × 3,072 bytes/vector ≈ 2.86 GB
      • Let's assume we need two indexes, one for semantic search and one for keyword search.
      • Let's assume we choose the s1 pod, which is optimized for storage and is the cheapest of the pod types.
      • Based on current Pinecone pricing, an s1.x1 pod (the smallest pod size) costs $0.138/hr.
      • Assume we will need an s1.x4 pod to meet our storage and indexing needs.
      • The monthly cost of an s1.x4 pod is $0.138/hr × 4 × 730 hrs/month = $402.96/month
  • Query Costs:
      • Assume we have enough capacity with our current pod to not incur additional query costs.
  • API Interaction (LLM - e.g., GPT-3.5-turbo):
      • Assume each query involves an average of 500 input tokens and generates 500 output tokens (adjust based on your prompts).
      • Total tokens per query: 1,000 tokens
      • GPT-3.5-turbo input cost: $0.0005 / 1K tokens (check for updates)
      • GPT-3.5-turbo output cost: $0.0015 / 1K tokens (check for updates)
      • Cost per query: 1,000 tokens × ($0.0005 + $0.0015) / 1,000 tokens = $0.002
      • Monthly query cost: 100,000 queries × $0.002/query = $200
  • Total Monthly Cloud Cost: $402.96 + $200 ≈ $602.96/month (plus ≈ $8/month if amortizing the one-time vectorization)
  • Total First-Year Cloud Cost (including vectorization): ($602.96 × 12) + $100 ≈ $7,335.52

On-Prem Costs (Example):

  • Hardware:
      • Server with sufficient RAM (e.g., 256GB+), CPU, and SSD storage: $5,000 - $15,000+ (depending on configuration)
      • Optional: GPU for faster embedding/inference (e.g., NVIDIA T4: $2,000 - $3,000)
      • Total Hardware: $7,000 - $18,000 (estimated range)
  • Vectorization:
      • If using a local embedding model, the cost is primarily the energy cost of running the model. This will vary by hardware and efficiency. Let's estimate $20 - $50.
  • Maintenance (Annual):
      • Power: $500 - $1,500 (depending on hardware and usage)
      • Cooling: $200 - $500
      • IT Support: $1,000 - $5,000 (can be much higher depending on expertise needed)
      • Software Licenses (if needed): $0 - $1,000+
      • Total Annual Maintenance: $1,700 - $8,000 (estimated range)
  • Setup and Deployment (One-Time):
      • Labor costs for setup, configuration, integration: $2,000 - $10,000+ (depending on complexity)
  • Total First-Year On-Prem Cost: $7,000 (low-end hardware) + $20 (low-end vectorization) + $1,700 (low-end maintenance) + $2,000 (low-end setup) = $10,720
  • Monthly Cost (Amortized over 3 years): taking the average of the hardware range ($12,500) + average maintenance ($4,850 × 3 years) + average setup ($6,000) = $33,050 / 36 months ≈ $918.06/month

At this scale, cloud RAG (≈ $600/month) is clearly cheaper than on-premise RAG (≈ $918/month amortized). On-prem is justified only for enterprises handling many millions of documents or queries.
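
The cloud-versus-on-prem RAG comparison boils down to a handful of inputs; this sketch reproduces the monthly figures above so you can substitute your own document counts, query volumes, and hardware quotes.

```python
# RAG cost comparison sketch, mirroring the assumptions in this section
DOCS, TOKENS_PER_DOC, QUERIES_PER_MONTH = 1_000_000, 1_000, 100_000

# --- Cloud ---
embed_cost = DOCS * TOKENS_PER_DOC / 1_000 * 0.0001     # one-time vectorization (~$100)
pinecone_pod = 0.138 * 4 * 730                          # s1.x4 pod, per month (~$403)
llm_per_query = 1_000 / 1_000 * (0.0005 + 0.0015)       # GPT-3.5-turbo in+out ($0.002)
cloud_monthly = pinecone_pod + QUERIES_PER_MONTH * llm_per_query
cloud_year_one = cloud_monthly * 12 + embed_cost

# --- On-prem (averages of the ranges above, amortized over 3 years) ---
hardware, setup = 12_500, 6_000
maintenance_per_year = 4_850
onprem_monthly = (hardware + setup + maintenance_per_year * 3) / 36

print(f"Cloud   : ${cloud_monthly:,.2f}/month, ${cloud_year_one:,.2f} first year")
print(f"On-prem : ${onprem_monthly:,.2f}/month amortized")
# -> Cloud ~$603/month (~$7,336 first year) vs. on-prem ~$918/month
```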

Prompt Engineering

  • Cloud is typically better for prompt engineering due to its short bursts of experimentation and rapid iteration.
  • On-prem may be justified for extremely high-volume or latency-sensitive applications.

Cloud Cost: Experimenting with 1,000 prompts per day (a scale that fits enterprise use cases) on GPT-4, with ~2,000 tokens each (input ~1,000 tokens + output ~1,000 tokens), costs:

  • Input: 1,000 prompts × 1,000 input tokens / 1,000 × $0.03 = $30.
  • Output: 1,000 prompts × 1,000 output tokens / 1,000 × $0.06 = $60.

Total Cloud Cost: ~$90/day for heavy experimentation.

On-Prem Cost: Assuming local inference with 8 GPUs and full utilization for 1 day:

  • Power, cooling, and staff: $20–$50.
  • Depreciation: $18,300 monthly / 30 days = ~$610/day.

Total On-prem Cost: ~$660/day for heavy experimentation.

  • Prompt Engineering Comparison: Cloud is significantly cheaper for occasional prompt testing. On-prem is viable only for enterprises conducting continuous large-scale experimentation.
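
For completeness, the daily prompt-experimentation comparison can be written the same way; the token counts and prices are the assumptions used above.

```python
# Daily prompt-experimentation cost sketch, using the assumptions above
PROMPTS_PER_DAY = 1_000
IN_TOKENS, OUT_TOKENS = 1_000, 1_000    # per prompt
GPT4_IN, GPT4_OUT = 0.03, 0.06          # $ per 1K tokens

cloud_daily = PROMPTS_PER_DAY * (IN_TOKENS / 1_000 * GPT4_IN + OUT_TOKENS / 1_000 * GPT4_OUT)
onprem_daily = 18_300 / 30 + 50         # amortized rig + high-end power/cooling/staff estimate

print(f"Cloud   : ${cloud_daily:,.0f}/day")   # ~$90
print(f"On-prem : ${onprem_daily:,.0f}/day")  # ~$660
```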


The Enterprise Pivot: Fine-Tuning, Prompt Engineering, and RAG

For enterprises, the future of LLMs lies not in training ever-larger models but in:

  1. Leveraging pre-trained LLMs through fine-tuning, prompt engineering, and domain-specific small models.
  2. Integrating RAG to unlock proprietary data behind firewalls.

Use Case Examples

  • Healthcare: Fine-tuning smaller models for diagnosis and compliance tasks.
  • Legal: RAG systems for real-time case law retrieval.
  • Customer Support: Prompt engineering on cloud-hosted LLMs for consistent answers.


Suggested Cloud Vs. On-premise Gen AI Framework

Key Decision Factors:

1. Volume Threshold

  • <100M tokens/month: Cloud
  • 100M-500M tokens/month: Hybrid
  • >500M tokens/month: Consider on-prem

2. Data Sensitivity

  • Regulated data: On-prem or private cloud
  • Public data: Any deployment model

3. Latency Requirements

  • <100ms: On-prem or edge
  • 100-500ms: Any deployment model
  • >500ms: Cloud optimal

4. Cost Structure Preference

  • CapEx heavy: On-prem
  • OpEx heavy: Cloud
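
The decision factors above can be expressed as a simple screening function; the thresholds are the ones proposed in this framework, and the output is a starting-point recommendation rather than a verdict.

```python
# Deployment screening sketch based on the framework above
def recommend_deployment(tokens_per_month: float, regulated_data: bool,
                         latency_ms: float, prefers_capex: bool) -> str:
    """Return a starting-point recommendation: 'cloud', 'hybrid', or 'on-prem'."""
    if regulated_data or latency_ms < 100:
        return "on-prem"                          # or private cloud / edge
    if tokens_per_month > 500e6 and prefers_capex:
        return "on-prem"
    if tokens_per_month >= 100e6:
        return "hybrid"
    return "cloud"

# Example: 250M tokens/month, public data, 300ms latency target, OpEx preference
print(recommend_deployment(250e6, regulated_data=False, latency_ms=300,
                           prefers_capex=False))  # -> "hybrid"
```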


Conclusion

As LLMs hit scaling limits, the future will be defined by efficiency, specialization, and enterprise value extraction. The shift from building larger general-purpose models to deploying cost-effective, task-specific solutions is not just a necessity—it’s an opportunity for innovation.

By strategically combining fine-tuning, prompt engineering, and RAG approaches, enterprises can unlock tremendous value while optimizing costs. Whether in the cloud or on-premise, the choice will depend on volume, latency needs, and data sensitivity. However, the days of purely scaling LLMs to ever-larger sizes may well be behind us.
