GenAI - The Cost of Context.
As I write this article, OpenAI has just released its latest model, GPT-4 Turbo, its most powerful AI yet. Numerous other organizations such as Google, Hugging Face, AWS, Meta, and Anthropic have their own progressive views and differentiation on LLMs, catering to various needs of today and the future. However, there is one thing common to all leading LLMs: the context. OpenAI's GPT-4 Turbo boasts an impressive context window of 128 thousand tokens, which is an excellent feature and a bait at the same time. For the uninitiated, let's understand what context means.
The term "context" in relation to LLMs such as OpenAI's GPT-4 Turbo refers to the relevant data provided to the AI to facilitate specific responses. Since LLMs lack persistent memory, they require context to be reintroduced with each interaction. GPT-4 Turbo's impressive context window can process up to 128 thousand tokens, enhancing its ability to understand and generate detailed and relevant responses.
From the OpenAI website:
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
Let's take an example.
If one prompts ChatGPT to summarize a book, the model analyzes the text (say, 10,000 characters) and produces a summary. Should there be subsequent questions, the text needs to be presented to the model again for it to continue generating pertinent answers. This necessity is not readily apparent when using the ChatGPT web interface, where context management is seamless, but it becomes very evident when interacting directly with OpenAI's APIs, where it directly impacts usage costs.
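To make that concrete, here is a minimal sketch of what "resending the context" looks like against the API. It assumes the official openai Python client (v1 style); the model name and file name are illustrative, not prescriptive.

```python
# Minimal sketch: the prior messages, including the book text, must travel
# with every follow-up call, because the model keeps no memory between calls.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

book_text = open("book.txt").read()  # the ~10,000-character source text

# First request: the full text travels with the prompt.
messages = [
    {"role": "user", "content": f"Summarize the following text:\n\n{book_text}"},
]
first = client.chat.completions.create(model="gpt-4-turbo-preview", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up question: the whole history, book text included, is sent again.
messages.append({"role": "user", "content": "What is the main argument of chapter 2?"})
second = client.chat.completions.create(model="gpt-4-turbo-preview", messages=messages)
print(second.choices[0].message.content)
```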
So what's the big deal?
The cost implication is significant: every piece of data sent to or received from an LLM is counted in tokens, which are priced accordingly. Using OpenAI's tokenizer, one can estimate the token count for any block of text; the book-summary example above works out to approximately 3 cents per API request under the current GPT-4 Turbo pricing. This expense escalates with each additional prompt that has to include the requisite context.
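A rough back-of-the-envelope version of that estimate, using OpenAI's tiktoken tokenizer, might look like the sketch below. The per-token price is an assumption for illustration only; check the current pricing page before relying on it.

```python
# Estimate prompt tokens and input cost with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-class models

book_text = open("book.txt").read()         # ~10,000 characters
prompt = f"Summarize the following text:\n\n{book_text}"

num_tokens = len(enc.encode(prompt))        # ~4 chars per token => ~2,500 tokens
price_per_1k_input_tokens = 0.01            # assumed USD rate for input tokens

print(f"Prompt tokens: {num_tokens}")
print(f"Estimated input cost: ${num_tokens / 1000 * price_per_1k_input_tokens:.4f}")
# ~2,500 tokens at the assumed rate comes to roughly $0.025, i.e. about 3 cents.
```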
Moreover, certain applications might demand more context than the maximum these models currently provide. Below are the sample context lengths of popular LLMs.
Enter Retrieval Augmented Generation (RAG)
Simply put, techniques like RAG optimize the contextual information supplied to LLMs, making responses more accurate while reducing token costs and working around context limits. There are a few approaches to implementing RAG, and this is an evolving space where new tools and providers keep emerging to offer the most comprehensive solution. One RAG approach I personally like is combining GenAI with vector databases. It involves taking large context data and feeding it into a vector database as embeddings. Embeddings are a topic of discussion in their own right, but for this article, think of them as converting each word into a number and clustering similar or related numbers together to create a relationship between them. Purpose-built vector databases like Pinecone, Weaviate, and Redis are some of the emerging platforms. This can be optimized further by using approaches like map reduce to make the embeddings more granular.
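Here is a minimal sketch of the indexing step under these assumptions: OpenAI's embeddings endpoint supplies the vectors, and a plain numpy array stands in for a real vector database such as Pinecone, Weaviate, or Redis. The chunk size and model name are illustrative.

```python
# Indexing step: chunk the context, embed each chunk, and keep the vectors.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Return one embedding vector per input string."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

book_text = open("book.txt").read()
chunk_size = 1000                                   # characters per chunk (tunable)
chunks = [book_text[i:i + chunk_size] for i in range(0, len(book_text), chunk_size)]

index = embed(chunks)                               # one vector per chunk
```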
With this approach, a user first loads contextual data into a vector database, then creates a question (query) that is itself converted to embeddings and run against the vector database; the result is a relevant chunk of data containing the textual context, which can then be fed to the LLM for refined results. This may sound complex, but it's surprisingly easy to try out.
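Continuing the sketch above (reusing the assumed embed helper, chunks, and index), the query step could look like this: embed the question, retrieve the closest chunks by cosine similarity, and send only those chunks to the LLM as context.

```python
# Query step: retrieve the most relevant chunks and pass only them to the LLM.
question = "What is the main argument of chapter 2?"
q_vec = embed([question])[0]

# Cosine similarity of the question against every stored chunk vector.
scores = index @ q_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(q_vec))
top_chunks = [chunks[i] for i in np.argsort(scores)[-3:][::-1]]   # best 3 chunks

answer = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{
        "role": "user",
        "content": "Answer using only this context:\n\n"
                   + "\n---\n".join(top_chunks)
                   + f"\n\nQuestion: {question}",
    }],
)
print(answer.choices[0].message.content)
```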
In the book example above, with the RAG approach the follow-up prompt need not carry the full 10K of context; it carries only the much smaller subset the vector database returns, yielding more focused answers and lower token costs.
It remains to be seen whether these innovations will stand the test of time as LLMs continue to evolve and potentially integrate these capabilities natively, but for now a lot of advancement is happening in this space. Take MemGPT as an example, which takes the approach of treating LLMs as operating systems to give them long-lasting memory. For now, it's clear that the AI playground is getting some serious upgrades; our silicon-based friends might just end up with better memories than us. Interesting times, indeed!