Prompt Caching
In generative AI applications built on LLMs, API calls are made to LLM providers to generate text. Providers charge for the number of tokens sent and received in each request and response, and from a latency perspective the application takes a hit as the number of requests grows. Prompt caching therefore plays an important role in improving efficiency, lowering costs, and enhancing the responsiveness of language model applications.
Prompt caching is a strategy that involves storing responses to prompts that have been previously queried. When a prompt is repeated, instead of sending a new API request and incurring extra computational cost and time, the cached response is retrieved and used. For applications where repeated queries occur frequently, prompt caching can provide substantial benefits, such as decreasing response latency, saving computational resources, and reducing API costs.
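Conceptually, the logic is just a lookup keyed by the prompt. The minimal sketch below uses an in-memory dictionary purely to illustrate the idea (the cache variable and the call_llm callable are illustrative placeholders, not the implementation used later):

cache = {}  # in-memory cache: prompt -> response

def generate_with_cache(prompt, call_llm):
    # Return the cached response if this prompt was seen before.
    if prompt in cache:
        return cache[prompt]
    # Otherwise call the LLM, store the response, and return it.
    response = call_llm(prompt)
    cache[prompt] = response
    return response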
Prompt caching can be implemented in a variety of ways using a range of caching solutions. In this case we will use a Redis server as the caching layer.
The Redis server can be installed on macOS using the following commands:
brew install redis
brew services start redis
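Once the service is running, you can verify the installation by pinging the server; it should reply with PONG:

redis-cli ping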
import redis
import hashlib
import time
import os
from langchain_openai import ChatOpenAI
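Using the imports above, a minimal sketch of the caching layer might look like the following. The key prefix, TTL, model name, and function name are illustrative assumptions rather than anything prescribed by Redis or LangChain. The prompt is hashed to produce a fixed-length cache key; a cache hit returns the stored response immediately, while a cache miss calls the LLM and stores the result with an expiry.

# Connect to the local Redis server started above.
cache = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

# Model name is illustrative; assumes OPENAI_API_KEY is set in the environment.
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ.get("OPENAI_API_KEY"))

def cached_generate(prompt: str, ttl_seconds: int = 3600) -> str:
    # Hash the prompt to get a deterministic, fixed-length cache key.
    key = "prompt_cache:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    # Cache hit: return the stored response without an API call.
    cached = cache.get(key)
    if cached is not None:
        return cached

    # Cache miss: call the LLM, then store the response with an expiry.
    response = llm.invoke(prompt).content
    cache.setex(key, ttl_seconds, response)
    return response

Since time is already imported, a quick timing check makes the effect visible: the second call with the same prompt should return almost instantly from the cache instead of hitting the provider again.

start = time.time()
print(cached_generate("Explain prompt caching in one sentence."))
print(f"First call (cache miss) took {time.time() - start:.2f}s")

start = time.time()
print(cached_generate("Explain prompt caching in one sentence."))
print(f"Second call (cache hit) took {time.time() - start:.2f}s")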