How to cache LLM calls using LangChain
Before we get into caching LLM calls, we need to understand why we would want to cache them in the first place.
LLM providers such as OpenAI charge users based on the number of tokens processed. These tokens are counted from both the input prompt the user provides and the output the model generates.
There are scenarios where users ask essentially the same question in different ways, for example “what's GitHub?”, “Can you explain what GitHub is?” or “Can you tell me more about GitHub?”. We know ChatGPT will produce essentially the same answer for all three prompts, yet tokens are counted, and the associated cost billed, for each of them.
If we cache these similar LLM requests, subsequent requests can be served from the cache rather than from an LLM such as ChatGPT. No tokens are processed for cached requests and responses, so there is no cost associated with them.
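To make the potential saving concrete, here is a rough back-of-the-envelope calculation. The token counts, request volume and per-token price below are hypothetical placeholders, not actual OpenAI pricing:
# Hypothetical numbers: 50 input + 200 output tokens per request,
# $0.002 per 1K tokens, and 10,000 near-duplicate requests
tokens_per_request = 50 + 200
price_per_1k_tokens = 0.002   # placeholder, check your provider's pricing page
duplicate_requests = 10_000
cost_without_cache = duplicate_requests * tokens_per_request * price_per_1k_tokens / 1000
cost_with_cache = tokens_per_request * price_per_1k_tokens / 1000  # only the first call reaches the LLM
print(f"Without cache: ${cost_without_cache:.2f}")   # $5.00
print(f"With cache:    ${cost_with_cache:.5f}")      # $0.00050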
Another reason for caching is speed and efficiency. As the demand for Large Language Models increases, there is a growing need to optimize these systems for latency, and caching is one of the key ways to achieve this.
In this article, we will discuss how to cache LLM calls in LangChain. LangChain is a framework designed to simplify the creation of applications using large language models, and it provides a caching layer for chat models.
In-Memory Cache
Following is a simple code snippet that uses an in-memory cache.
import os
import timeit
import openai
openai.api_key = os.environ['OPENAI_API_KEY']
import langchain
from langchain.chat_models import ChatOpenAI
from langchain.cache import InMemoryCache
llm = ChatOpenAI()
# Cache every LLM call made through LangChain in process memory
langchain.llm_cache = InMemoryCache()
# The first time, the prompt is not yet in the cache, so the call should take longer
# Let's measure the time it takes to execute this line
start_time = timeit.default_timer()
llm.predict("Tell me a joke")
elapsed_time = timeit.default_timer() - start_time
print(f"Execution time: {elapsed_time} seconds")
Running a joke request the first time will take longer as the result isn’t cached yet. The second time you run it, it should go faster as the result will be retrieved from the cache.
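To see the cache hit, time the exact same call a second time. This is simply a continuation of the snippet above; the exact numbers you get will vary:
# The second call with the identical prompt is served from the in-memory cache
start_time = timeit.default_timer()
llm.predict("Tell me a joke")
elapsed_time = timeit.default_timer() - start_time
print(f"Execution time: {elapsed_time} seconds")  # should be close to zero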
SQLite Cache
We can also use a SQLite cache, a lightweight disk-based store that we can interact with using SQL syntax.
import os
import timeit
import openai
openai.api_key = os.environ['OPENAI_API_KEY']
import langchain
from langchain.chat_models import ChatOpenAI
from langchain.cache import SQLiteCache
llm = ChatOpenAI()
# We can do the same thing with a SQLite cache, persisted to a local file
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")
# The first time, the prompt is not yet in the cache, so the call should take longer
start_time = timeit.default_timer()
llm.predict("Tell me a joke on software engineer which has 3 paragraphs")
elapsed_time = timeit.default_timer() - start_time
print(f"Execution time: {elapsed_time} seconds")
Again, the first request will take longer as it isn’t cached. The subsequent calls will be faster as they retrieve the results from the cache.
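Because the SQLite cache lives on disk in .langchain.db, it also survives process restarts, unlike the in-memory cache. As a quick sketch using the same APIs as above, you can run a fresh script that re-creates the model and cache and times the same prompt; it should return almost immediately even though the new process has never called the model:
import timeit
import langchain
from langchain.chat_models import ChatOpenAI
from langchain.cache import SQLiteCache
# Point at the same database file created by the previous run
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")
llm = ChatOpenAI()
start_time = timeit.default_timer()
llm.predict("Tell me a joke on software engineer which has 3 paragraphs")
print(f"Execution time: {timeit.default_timer() - start_time} seconds")  # served from the disk cache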
GPTCache
GPTCache is another useful caching method that allows for exact match caching or caching results based on semantic similarity.
First, let's look at exact match caching using GPTCache:
import time

def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

print("Cache loading.....")
# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai
cache.init()
cache.set_openai_key()
# -------------------------------------------------
question = "what's Linkedin"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')
Cache loading....
Question: what's Linkedin
Time consuming: 11.57s
Answer: LinkedIn is a social networking platform designed
specifically for professionals and businesses.
It allows users to create a profile highlighting
their work experience and skills, connect with other professionals,
join industry-specific groups, and share content related to
their profession. It is widely used for job searching,
networking, and business development.
Question: what's Linkedin
Time consuming: 0.02s
Answer: LinkedIn is a social networking platform designed
specifically for professionals and businesses.
It allows users to create a profile highlighting their work
experience and skills, connect with other professionals,
join industry-specific groups,
and share content related to their profession.
It is widely used for job searching, networking,
and business development.
Following is an example of semantic caching, where results are cached and matched based on semantic similarity. In other words, the cache can return results even for requests that are not exactly the same but have similar meanings.
import time

def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

print("Cache loading.....")
# Use an ONNX embedding model with a SQLite + FAISS data manager so that
# semantically similar questions map to the same cache entry
onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()
questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub"
]
for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')
Cache loading....
Question: what's github
Time consuming: 5.36s
Answer: GitHub is a web-based platform that provides hosting
for software development projects.
It offers tools for managing code repositories,
tracking changes to code, and collaborating with other developers.
It is popular with open-source projects,
but can also be used privately for team collaboration on
proprietary software projects. Users can access GitHub
through either a free or paid account and can interact with code
both through the web interface and with Git, a command-line
tool for managing code.
Question: can you explain what GitHub is
Time consuming: 0.15s
Answer: GitHub is a web-based platform that provides hosting for
software development projects. It offers tools for managing
code repositories, tracking changes to code, and collaborating
with other developers. It is popular with open-source projects,
but can also be used privately for team collaboration on
proprietary software projects. Users can access GitHub through
either a free or paid account and can interact with code both
through the web interface and with Git, a command-line tool
for managing code.
Question: can you tell me more about GitHub
Time consuming: 0.18s
Answer: GitHub is a web-based platform that provides
hosting for software development projects. It offers tools for
managing code repositories, tracking changes to code, and
collaborating with other developers. It is popular with
open-source projects, but can also be used privately for team
collaboration on proprietary software projects. Users can
access GitHub through either a free or paid account and
can interact with code both through the web interface and with
Git, a command-line tool for managing code.
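GPTCache can also be plugged into LangChain directly, so the same llm.predict() calls from the earlier examples go through it. The snippet below is a sketch modelled on LangChain's GPTCache wrapper; the exact init_func signature has changed between releases, so treat the two-argument form here as an assumption to verify against your installed version:
import hashlib
import langchain
from langchain.cache import GPTCache
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt

def init_gptcache(cache_obj: Cache, llm: str):
    # Keep a separate exact-match ("map") cache per LLM configuration
    hashed_llm = hashlib.sha256(llm.encode()).hexdigest()
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),
    )

langchain.llm_cache = GPTCache(init_gptcache)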
Choosing the right caching strategy depends on your specific needs and constraints. In-memory caches are fast but limited by available memory and lost when the process exits. Disk-based caches like SQLite can handle larger data sets and persist across restarts, but are slower. GPTCache adds an extra layer of versatility by allowing exact match caching as well as caching of semantically similar requests.
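As a rough illustration of how that choice could be wired up, here is a small helper; the function name and the backend flag are made up for this example, but the two cache classes are the same ones used earlier:
import langchain
from langchain.cache import InMemoryCache, SQLiteCache

def configure_llm_cache(backend: str = "memory", db_path: str = ".langchain.db"):
    """Hypothetical helper that picks a LangChain cache backend from a config value."""
    if backend == "memory":
        # fast, but capped by RAM and lost when the process exits
        langchain.llm_cache = InMemoryCache()
    elif backend == "sqlite":
        # slower lookups, but larger capacity and persists across restarts
        langchain.llm_cache = SQLiteCache(database_path=db_path)
    else:
        raise ValueError(f"Unknown cache backend: {backend}")

configure_llm_cache(backend="sqlite")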
By using these caching strategies effectively in our LangChain pipelines, we can greatly enhance the performance, efficiency and cost profile of LLM-based applications.