How to Cache LLM Calls Using LangChain

Before we get into caching LLM calls, let's understand why caching them is worthwhile in the first place.

LLM providers like OpenAI charge users based on the number of tokens processed. These tokens are counted from both the input prompt the user provides and the output the model generates.
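For example, you can estimate the billable token count of a prompt locally with the tiktoken library. This is a minimal sketch; the model name and prompt are just illustrative.

import tiktoken

# Encoding used by gpt-3.5-turbo style models
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = "Can you explain what GitHub is?"
print(len(encoding.encode(prompt)))  # number of billable tokens in the prompt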

There are scenarios where users ask essentially the same question in different ways, for example "What's GitHub?", "Can you explain what GitHub is?", or "Can you tell me more about GitHub?". ChatGPT will produce essentially the same answer for all three prompts, yet tokens are counted for each one and the associated cost is added to the application.

To address this, we can cache similar LLM requests: subsequent requests are then served from the cache instead of an LLM like ChatGPT, so no tokens are consumed for the cached request/response and no cost is incurred.

Another reason for caching is speed. As demand for Large Language Model applications grows, so does the need to optimize them for latency and efficiency, and caching is one of the key ways to achieve this.

In this article, we will discuss how to cache LLM calls in LangChain. LangChain is a framework designed to simplify building applications with large language models, and it provides a caching layer for chat models.

In-Memory Cache

The following is a simple code snippet for using the in-memory cache.

import os
import timeit

import openai

openai.api_key = os.environ['OPENAI_API_KEY']

import langchain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

from langchain.cache import InMemoryCache
langchain.llm_cache = InMemoryCache()


# The first time, the response is not yet in the cache, so the call should take longer
# Measure how long this line takes to execute
start_time = timeit.default_timer()
llm.predict("Tell me a joke")
elapsed_time = timeit.default_timer() - start_time

print(f"Execution time: {elapsed_time} seconds")

Running a joke request the first time will take longer as the result isn’t cached yet. The second time you run it, it should go faster as the result will be retrieved from the cache.
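To see the cache hit in action, you can time a second identical call. This is a minimal sketch assuming the setup above has already run in the same process; the exact speed-up will vary with your network and model.

# Second call with the same prompt: served from the in-memory cache,
# so it should return almost instantly and consume no tokens.
start_time = timeit.default_timer()
llm.predict("Tell me a joke")
elapsed_time = timeit.default_timer() - start_time

print(f"Execution time (cached): {elapsed_time} seconds")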

SQLite Cache

We can also use a SQLite cache, a lightweight disk-based store that we can query with SQL syntax.

import os
import timeit

import openai

openai.api_key = os.environ['OPENAI_API_KEY']

import langchain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

# We can do the same thing with a SQLite cache
from langchain.cache import SQLiteCache
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")


# The first time, the response is not yet in the cache, so the call should take longer
start_time = timeit.default_timer()
llm.predict("Tell me a joke about a software engineer in 3 paragraphs")
elapsed_time = timeit.default_timer() - start_time

print(f"Execution time: {elapsed_time} seconds")

Again, the first request will take longer as it isn’t cached. The subsequent calls will be faster as they retrieve the results from the cache.
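Because the SQLite cache lives on disk (here in .langchain.db), it also survives process restarts. The snippet below is a hypothetical second run that assumes the same database path and that the script above has already populated the cache, so this call should be fast even in a fresh Python process.

import timeit

import langchain
from langchain.cache import SQLiteCache
from langchain.chat_models import ChatOpenAI

# Point at the same database file the previous run populated.
# Assumes OPENAI_API_KEY is set in the environment.
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")
llm = ChatOpenAI()

# Same prompt as before: answered from the on-disk cache,
# even though this is a brand-new process.
start_time = timeit.default_timer()
llm.predict("Tell me a joke about a software engineer in 3 paragraphs")
elapsed_time = timeit.default_timer() - start_time

print(f"Execution time (cached across runs): {elapsed_time} seconds")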

GPTCache

GPTCache is another useful caching option that supports both exact-match caching and caching based on semantic similarity.

First, let's look at exact-match caching with GPTCache:

import time

def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


print("Cache loading.....")


# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai


cache.init()
cache.set_openai_key()
# -------------------------------------------------


question = "what's Linkedin"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')

Cache loading....
Question: what's Linkedin
Time consuming: 11.57s
Answer: LinkedIn is a social networking platform designed 
specifically for professionals and businesses. 
It allows users to create a profile highlighting 
their work experience and skills, connect with other professionals, 
join industry-specific groups, and share content related to 
their profession. It is widely used for job searching, 
networking, and business development.

Question: what's Linkedin
Time consuming: 0.02s
Answer: LinkedIn is a social networking platform designed 
specifically for professionals and businesses. 
It allows users to create a profile highlighting their work 
experience and skills, connect with other professionals, 
join industry-specific groups, 
and share content related to their profession. 
It is widely used for job searching, networking, 
and business development.        

The following is an example of semantic caching, where results are cached and retrieved based on semantic similarity. In other words, the cache can return results even for requests that are not exactly the same but have similar meanings.

import time

def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation


print("Cache loading.....")


onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()


questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub"
]


for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')

Cache loading....
Question: what's github
Time consuming: 5.36s
Answer: GitHub is a web-based platform that provides hosting 
for software development projects. 
It offers tools for managing code repositories, 
tracking changes to code, and collaborating with other developers. 
It is popular with open-source projects, 
but can also be used privately for team collaboration on 
proprietary software projects. Users can access GitHub 
through either a free or paid account and can interact with code 
both through the web interface and with Git, a command-line 
tool for managing code.

Question: can you explain what GitHub is
Time consuming: 0.15s
Answer: GitHub is a web-based platform that provides hosting for 
software development projects. It offers tools for managing 
code repositories, tracking changes to code, and collaborating 
with other developers. It is popular with open-source projects, 
but can also be used privately for team collaboration on 
proprietary software projects. Users can access GitHub through 
either a free or paid account and can interact with code both 
through the web interface and with Git, a command-line tool 
for managing code.

Question: can you tell me more about GitHubwhat is the purpose of GitHub
Time consuming: 0.18s
Answer: GitHub is a web-based platform that provides 
hosting for software development projects. It offers tools for 
managing code repositories, tracking changes to code, and 
collaborating with other developers. It is popular with 
open-source projects, but can also be used privately for team 
collaboration on proprietary software projects. Users can 
access GitHub through either a free or paid account and 
can interact with code both through the web interface and with 
Git, a command-line tool for managing code.
        

Choosing the right caching strategy depends on your specific needs and constraints. In-memory caches are fast but limited by available memory and lost when the process exits. Disk-based caches like SQLite can handle larger data sets and persist across runs, but are slower. GPTCache adds an extra layer of versatility by also caching semantically similar requests.
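GPTCache can also be plugged directly into LangChain as its llm_cache, so the ChatOpenAI examples above keep working unchanged. The following is a minimal sketch based on LangChain's GPTCache wrapper; the init callback, the "map" (exact-match) data manager, and the cache directory name are illustrative choices, so verify the APIs against the versions you have installed.

import hashlib

import langchain
from langchain.cache import GPTCache
from langchain.chat_models import ChatOpenAI

from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt


def init_gptcache(cache_obj: Cache, llm: str):
    # Keep a separate exact-match ("map") cache per model name.
    hashed_llm = hashlib.sha256(llm.encode()).hexdigest()
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),
    )


langchain.llm_cache = GPTCache(init_gptcache)

llm = ChatOpenAI()
llm.predict("Tell me a joke")   # first call goes to the model
llm.predict("Tell me a joke")   # second call is served by GPTCache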

By using these caching strategies effectively in our LangChain pipelines, we can greatly improve the performance, cost, and efficiency of LLM-based applications.
