LLM-Powered Applications with Prompt Caching

You’ve probably used large language models (LLMs) like ChatGPT, Grok, Gemini, LLaMA, or Anthropic’s Claude, and maybe you're even thinking about integrating one into your own applications to add more power to what you offer. Understanding how these models work, especially how to estimate their usage and manage costs, is key to getting the most out of them without breaking the bank. Here’s a simple breakdown of the important terms and how you can calculate your usage and costs when using LLMs in your projects.

I have also included usage calculations that factor in prompt caching.

Prompt caching is a game-changer for managing the costs of using large language models (LLMs). By storing and reusing work the model has already done for frequently repeated prompts, you can dramatically cut token usage, reducing costs by up to 90% and latency by up to 85% for long prompts. Instead of processing the same query thousands of times, you process it once and serve the cached result to subsequent users. This not only slashes your monthly bill but also improves response times and reduces the load on your LLM, making it a smart strategy for any high-traffic application.


Let’s start with some basic terms:

  • Token: Think of this as a piece of text. It can be a word or even just a character. For example, "Hello, world!" is split into roughly 4 tokens (see the tokenizer sketch after this list).
  • Prompt Tokens: These are the tokens you send to the model when asking it something. For instance, the phrase "What’s the weather today?" consists of the prompt tokens.
  • Completion Tokens: These are the tokens the model uses to give you an answer. If the model replies with "It’s sunny today," those are the completion tokens.
  • Context: This is the amount of text the model can consider at one time, including both what you ask and how it responds.
  • Context Window: The total number of tokens the model can handle at once. If you exceed this, the model might "forget" earlier parts of the conversation.
  • Pricing per 1,000 Tokens: The cost to process 1,000 tokens, with different rates for prompts and completions.
  • Interaction: A full exchange with the model, from your question to its answer.
  • Model Version: Different versions of the model offer different capabilities, context sizes, and pricing.
  • Usage Dashboard: A tool to track how many tokens you’ve used and what it’s costing you.
  • API Rate Limits: The maximum number of requests you can make to the model in a given time period.
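
To make token counting concrete, here is a minimal sketch using OpenAI's open-source tiktoken library as an example tokenizer. Each provider has its own tokenizer, so exact counts differ slightly between models:

    # pip install tiktoken
    import tiktoken

    # cl100k_base is the encoding used by several OpenAI chat models;
    # other providers tokenize the same text differently, so counts vary.
    encoding = tiktoken.get_encoding("cl100k_base")

    for text in ["Hello, world!", "What's the weather today?"]:
        tokens = encoding.encode(text)
        print(f"{text!r} -> {len(tokens)} tokens")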

How to Estimate AI Model Costs:

  1. Determine Token Usage: Estimate how many prompt and completion tokens your project will use. For example, if you’re building a chatbot, think about the average number of tokens per interaction and multiply that by the number of expected interactions.
  2. Apply the Pricing: Use the pricing model (e.g., $0.03 per 1,000 prompt tokens) to calculate costs based on your token usage (see the worked sketch after this list).
  3. Monitor and Adjust: Keep an eye on your spending using the Usage Dashboard and make adjustments as needed to stay within your budget.
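
As a rough illustration of steps 1 and 2, here is a minimal sketch of the arithmetic. The prices and the 60-token average completion are placeholder assumptions; substitute your provider's actual rates and your own measurements:

    # Placeholder prices in dollars per 1,000 tokens -- use your provider's real rates.
    PROMPT_PRICE_PER_1K = 0.03
    COMPLETION_PRICE_PER_1K = 0.06

    def interaction_cost(prompt_tokens: int, completion_tokens: int) -> float:
        """Dollar cost of a single prompt/completion exchange."""
        return ((prompt_tokens / 1000) * PROMPT_PRICE_PER_1K
                + (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K)

    # Example: a chatbot averaging 28 prompt tokens and 60 completion tokens
    # per interaction, handling 10,000 interactions per month.
    per_interaction = interaction_cost(prompt_tokens=28, completion_tokens=60)
    monthly_estimate = per_interaction * 10_000

    print(f"Per interaction:  ${per_interaction:.5f}")
    print(f"Monthly estimate: ${monthly_estimate:.2f}")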

The Impact of Prompt Caching

Prompt caching is a real game-changer when it comes to managing the costs of using LLMs. Here’s the idea: instead of having the model reprocess the same content over and over, you store (or “cache”) work it has already done and reuse it. Provider-side prompt caching reuses the already-processed portion of a prompt (such as long system instructions or reference documents) and bills those cached tokens at a steep discount; application-side response caching goes a step further and serves a stored answer whenever another user asks an identical question, so no new tokens are processed at all. Together, these techniques can reduce token costs by up to 90% and latency by up to 85% for long prompts.
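
As a minimal sketch of the response-caching idea described above, here is a simple in-memory cache wrapped around a hypothetical call_llm function (the function name and body are placeholders, not any specific provider's API):

    from functools import lru_cache

    def call_llm(prompt: str) -> str:
        """Placeholder for a real LLM API call -- this is where tokens are billed."""
        return f"(model response to: {prompt})"

    @lru_cache(maxsize=10_000)
    def cached_call_llm(prompt: str) -> str:
        """Identical prompts hit the LLM once; repeats are served from memory."""
        return call_llm(prompt)

    # The first call pays for prompt and completion tokens...
    print(cached_call_llm("How much data have I used this month?"))
    # ...the second identical call is answered from the cache, with no new tokens.
    print(cached_call_llm("How much data have I used this month?"))

In production you would typically normalize the query and use a shared cache such as Redis instead of an in-process lru_cache, but the principle is the same.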

For example, let’s say 10,000 users ask the same question about their data usage, and the query is about 28 tokens long. Without caching, you’d process around 280,000 prompt tokens, which at $0.03 per 1,000 tokens comes to roughly $8.40 for the prompts alone, before counting the completion tokens for every response. With caching, only the first request is processed in full (28 prompt tokens plus one completion); every subsequent identical request is served from the cache for a negligible retrieval cost, turning dollars into a few cents. Not only does caching slash your monthly bill, it also speeds up response times and reduces the load on your LLM, making everything run more smoothly.
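
Here is that worked example as a short script, using the figures above ($0.03 per 1,000 prompt tokens, 28-token queries, 10,000 users) and the simplifying assumption that serving a cached response consumes no new tokens:

    PROMPT_PRICE_PER_1K = 0.03   # dollars per 1,000 prompt tokens
    TOKENS_PER_QUERY = 28
    USERS = 10_000

    # Without caching: every identical query is processed from scratch.
    tokens_without_cache = TOKENS_PER_QUERY * USERS
    cost_without_cache = tokens_without_cache / 1000 * PROMPT_PRICE_PER_1K

    # With response caching: only the first query is processed; the remaining
    # 9,999 are served from the cache (assumed free here -- real lookups cost
    # fractions of a cent and no LLM tokens).
    tokens_with_cache = TOKENS_PER_QUERY
    cost_with_cache = tokens_with_cache / 1000 * PROMPT_PRICE_PER_1K

    print(f"Without caching: {tokens_without_cache:,} prompt tokens -> ${cost_without_cache:.2f}")
    print(f"With caching:    {tokens_with_cache:,} prompt tokens -> ${cost_with_cache:.5f}")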

By understanding these concepts and leveraging strategies like prompt caching, you can optimize your LLM usage and keep your costs in check. That way, you can focus on delivering great experiences to your users without worrying about your budget.


The table below summarizes other use cases where the context is larger than in the example above.

Table 1: Comparison of LLM API usage with and without prompt caching


Manoj Kumar Sharma

Engineering manager AIML at Virtusa

6 months ago

The majority of the cost goes to the knowledge base. Optimise your vector DB/graph DB. Look for a better chunk size. Break down your user query. Turn vague into generic and generic into somewhat specific. Index your knowledge base. Use open-source models pre-trained for your use case. You can also use quantized embeddings. Last but not least, identify your use case and go for the model that fits you best.

Braj Singh

Director, Software Engineering @ Mastercard

6 months ago

Awesome!!!
