When do LLM-RAG applications fail to scale for production-grade AI systems?
Think about the difference between referring to a book to solve a problem versus knowing the subject in that book well enough to devise a solution. In the former, you need to find the pages related to the topic, read those few pages, and then try to solve the problem. That may work for simpler problems and subjects, but it does not scale to devising a solution. The latter implies that you know the entire subject well and can therefore devise a solution drawing on that full knowledge. LLM-RAG is like the former, and it does not scale when the application requires deep "thinking".
In real life, to reason about and respond to a query you need the full context of the customer's situation and enterprise-specific knowledge beyond common industry know-how. This information is spread across customer data, product definitions, standardized processes, compliance requirements, data-access APIs, and company-specific nomenclature and jargon. In an LLM-RAG application, we must wisely choose the knowledge relevant to a given enquiry and send it to the LLM, every single time. If we consider information in documents only, vector similarity search can fetch a particular paragraph where a topic is discussed or certain keywords appear, but it lacks the context of that paragraph's chapter and book, and any implied reasoning outside the immediate subject.
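To make that retrieval step concrete, here is a minimal sketch of naive vector-similarity retrieval. The embedding model name and the paragraph chunks are illustrative assumptions, not a recommended stack; the point is that each chunk is embedded and retrieved in isolation, stripped of its surrounding chapter and document.

```python
# Minimal sketch of naive RAG retrieval: each paragraph is embedded in
# isolation, so the retrieved chunk carries no chapter- or document-level
# context. Model name and chunk contents are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Paragraph-level chunks, detached from their surrounding document.
chunks = [
    "Claims on an expired policy are not payable unless a grace period applies.",
    "Premium payments can be made monthly, quarterly, or annually.",
    "A grace period of 30 days applies to annual premium payments.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks by cosine similarity to the query."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector          # cosine similarity on normalized vectors
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

# The retrieved paragraphs are simply pasted into the prompt; the model never
# sees the rest of the policy document or the customer's full situation.
print(retrieve("Can I claim on a policy that expired last week?"))
```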
Certain prompt-engineering methods have been proposed to overcome this. They try to narrow the LLM's focus to the specific information related to the query at hand; some are listed below for completeness. However, even when the full information is provided using a 1M-token context, retrieval of information passed in context is still not 100% (closer to 90%), so the model can miss a crucial piece of the context and give a wrong answer. You therefore need to provide grounding by listing all the scenarios and supplying a chain-of-thought or SOP for each one, as sketched after the list below, and even that may not be exhaustive enough for all the conversation scenarios that are possible.
Advanced methods proposed to make LLM-RAG work:
Etc.
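To picture what the "list every scenario and attach an SOP" workaround looks like, here is a hedged sketch of per-scenario grounding. The scenario names, steps, and helper function are hypothetical; in practice this table keeps growing and is never exhaustive.

```python
# Hypothetical scenario -> SOP table used to ground the model. Keeping this
# exhaustive for every possible conversation is exactly what does not scale.
SCENARIO_SOPS = {
    "expired_policy_claim": [
        "1. Check the policy end date against the claim date.",
        "2. If expired, check whether a grace period applies.",
        "3. If no grace period applies, decline the claim and explain why.",
    ],
    "premium_due_query": [
        "1. Look up the next due date from the policy record.",
        "2. Quote the amount and the accepted payment modes.",
    ],
}

def build_prompt(scenario: str, customer_context: str, query: str) -> str:
    """Prepend the scenario-specific SOP as a chain of thought for the LLM."""
    sop = "\n".join(SCENARIO_SOPS.get(scenario, ["1. Escalate to a human agent."]))
    return (
        "Follow these steps exactly before answering:\n"
        f"{sop}\n\n"
        f"Customer context:\n{customer_context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("expired_policy_claim",
                   "Policy ended on 2024-01-31, annual premium.",
                   "Will my claim from last week be paid?"))
```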
Note that for a business, even a single incorrect response can mean legal issues. For example, an insurance chatbot agrees to pay a claim on an expired policy because it could not reason deeply about the customer's situation.
So LLM-RAG works quite well as long as the information, its reasoning, and reference examples can be provided accurately. Adding more information and thinking steps begins to increase the response time, or the resources required to generate reasonable responses. In fact, to optimize response time and response quality, one can end up creating a rule engine that serves only the right kind of information essential to answer a given query. This complex web of scenarios, each with a large number of reference information chunks, amounts to building a pseudo rule-engine around the LLM (see the sketch below). And it is common knowledge that rule engines become restricted in dimensionality, brittle, and eventually impossible to update.
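For illustration, that pseudo rule-engine tends to look something like the sketch below: a chain of routing rules that decides which chunks and instructions the LLM is allowed to see. The rule names, keyword triggers, and knowledge IDs are hypothetical placeholders, and every new edge case adds another branch, which is where the brittleness creeps in.

```python
# Sketch of the pseudo rule-engine that accretes around an LLM-RAG system:
# each rule picks the knowledge and instructions for one class of query.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    matches: Callable[[str], bool]   # crude keyword trigger
    knowledge_ids: list[str]         # which reference chunks to fetch
    instructions: str                # scenario-specific SOP / guardrails

RULES = [
    Rule("expired_policy",
         lambda q: "expired" in q.lower() and "claim" in q.lower(),
         ["policy_terms_sec_4", "grace_period_table"],
         "Never approve a claim on an expired policy without a valid grace period."),
    Rule("premium_due",
         lambda q: "premium" in q.lower(),
         ["payment_schedule"],
         "Quote only the amounts present in the payment schedule."),
]

def route(query: str) -> Rule | None:
    """Return the first matching rule, or None to fall back to a human."""
    return next((r for r in RULES if r.matches(query)), None)

rule = route("Will you pay a claim on my expired policy?")
if rule is None:
    print("Escalate to a human agent.")
else:
    # The LLM call itself is omitted; only the routing logic is shown.
    print(f"Fetch {rule.knowledge_ids}, prepend: {rule.instructions}")
```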
This sounds very counterintuitive, even unbelievable, because we have been shown that LLMs can "think" and "reason" when asked to do so "step by step". But that ability is derived statistically from the data they are trained on: the thinking ability is a byproduct of a very high incidence of certain linguistic reasoning steps in the training data. An LLM's thinking ability is not a high-specificity numerical formula, as in physics or chemistry. We have been persuaded to believe in the reasoning abilities of LLMs far beyond their real skill level. In fact, their thinking is malleable, which is why LLM-RAG works in the first place; LLM-RAG can be described as leveraging the models' recency bias.
The vector representation of text obtained from an LLM is static in nature; it does not include this thinking ability at all. Therefore RAG is simply text-augmentation, not thinking-augmentation. Returning to our original analogy of textbook versus teacher: LLM-RAG is the textbook handed to the LLM in an open-book exam. You might be able to solve a few problems whose solutions require only limited information from a few pages. LLM training or fine-tuning, on the other hand, is like learning the subject itself, the way a teacher has: the model statistically derives the ability to "think" from the linguistic reasoning in the training data. If the fine-tuning dataset represents the real-life distribution of queries, the model would ideally learn some of those reasoning abilities as well.
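To put the "text-augmentation, not thinking-augmentation" point schematically: in RAG the only thing that changes between calls is the prompt string, while fine-tuning changes the weights themselves. The snippet below uses placeholder callables (retrieve, generate, model, optimizer) rather than any real model API, and follows common gradient-descent conventions only as an assumption.

```python
# Schematic contrast between RAG and fine-tuning, using placeholders.

def rag_answer(query: str, retrieve, generate) -> str:
    """RAG: the weights stay frozen; retrieved text is merely pasted into the prompt."""
    context = "\n".join(retrieve(query))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

def fine_tune(model, optimizer, domain_examples) -> None:
    """Fine-tuning: the weights themselves absorb the domain's reasoning patterns."""
    for prompt, target in domain_examples:
        loss = model.loss(prompt, target)   # placeholder forward pass
        loss.backward()                     # gradients flow into the weights
        optimizer.step()
        optimizer.zero_grad()
```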
So, what next when we hit the limits of LLM-RAG?