9 Methods to Enhance the Performance of an LLM RAG Application
It is easy to prototype your first LLM RAG (Retrieval-Augmented Generation) application, e.g. using this chat-langchain template with the architecture below.
But it is hard to make it work well. In this article, I gather and share some approaches to enhance the performance of the LLM RAG application.
For more details, please refer to references I mention in each section.
1. Store message histories and user feedback
Chat histories and user feedback are important for application analytics. We will make use of them in later sections.
In the schema above, one collection can have multiple embeddings. One user can have many chat sessions, and in each chat session we store the messages (between human and AI) along with their analytical information such as generated questions (condensed questions or questions after query transformations), retrieved chunks with their corresponding distance scores, user feedback, etc.
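For illustration, here is a minimal sketch of such a schema using SQLAlchemy. The table and column names are my own assumptions, not a prescribed layout:

```python
# A minimal sketch of tables for chat sessions, messages and feedback,
# assuming a relational store (table/column names are illustrative only).
from datetime import datetime

from sqlalchemy import JSON, Column, DateTime, Float, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class ChatSession(Base):
    __tablename__ = "chat_sessions"
    id = Column(Integer, primary_key=True)
    user_id = Column(String, index=True)            # one user -> many sessions
    created_at = Column(DateTime, default=datetime.utcnow)
    messages = relationship("ChatMessage", back_populates="session")

class ChatMessage(Base):
    __tablename__ = "chat_messages"
    id = Column(Integer, primary_key=True)
    session_id = Column(Integer, ForeignKey("chat_sessions.id"))
    role = Column(String)                            # "human" or "ai"
    content = Column(Text)
    condensed_question = Column(Text, nullable=True) # question after query transformation
    retrieved_chunks = Column(JSON, nullable=True)   # chunk ids + distance scores
    feedback_score = Column(Float, nullable=True)    # e.g. thumbs up/down mapped to 1.0/0.0
    session = relationship("ChatSession", back_populates="messages")
```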
2. Start evaluating your application
A naive RAG app can have some challenges, e.g. bad retrieval (low precision, low recall, outdated information), bad response generation (hallucination, irrelevance, toxicity/bias), etc.
Before improving it, we need a way to measure its performance. We can use the following ragas metrics for the evaluation.
Because some metrics need ground_truth, there are two ways to create the labeled evaluation dataset:
Eventually we can run this labeled dataset through the pre-defined metrics above, using GPT-4 Turbo as an evaluator/judge, as sketched below.
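A minimal sketch of such an evaluation run with ragas (the column names follow ragas conventions and may differ slightly between versions; the sample rows are only placeholders):

```python
# A minimal sketch of evaluating a labeled dataset with ragas.
# The judge LLM can be configured separately (e.g. GPT-4 Turbo) via ragas settings.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_dataset = Dataset.from_dict({
    "question": ["What is a multi-vector retriever?"],
    "answer": ["It embeds small chunks but returns the larger parent documents."],
    "contexts": [["Multi-vector retrievers split documents into smaller chunks for embedding ..."]],
    "ground_truth": ["A retriever that embeds small chunks and retrieves larger context."],
})

result = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores for the dataset
```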
For better observability, we should also use an LLMOps platform such as LangSmith or MLflow, or integrate DeepEval into CI/CD pipelines.
3. Multi-vector retriever
When splitting documents for retrieval, there are often conflicting desires:
The multi-vector retriever approaches below allow us to balance precise embeddings with context retention: documents are split into smaller chunks for embedding, but larger passages, or even the whole original document, are retrieved for the prompt context. This works because many LLMs nowadays support a long context window, e.g. GPT-4 Turbo supports 128,000 tokens. A sketch of the Parent Document Retriever variant follows the list.
Parent Document Retriever
Hypothetical Questions
Summaries
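Here is a minimal sketch of the Parent Document Retriever approach with LangChain; the vector store and embedding model are illustrative choices:

```python
# Small chunks are embedded for precise search, but the larger parent
# documents are returned as context.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

vectorstore = Chroma(collection_name="child_chunks", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # holds the full parent documents

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,                # small chunks are embedded and searched here
    docstore=docstore,                      # the larger parent documents are returned from here
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),
)

# retriever.add_documents(docs) indexes the child chunks and stores the parents;
# retriever.invoke("my question") then returns the parent documents.
```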
4. Query transformations
The original query is not always optimal for retrieval, especially in the real world: the user often doesn't provide the full context and may think about the question from a specific angle.
Query transformation deals with transforming the user's question before it is passed to the embedding model. Below are a few variations of query transformation methods and their sample prompt implementations. They all use an LLM to generate one or more new queries.
We can also combine multiple query transformation techniques to get the best result, e.g.:
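As one concrete example, a minimal sketch of the multi-query variant using LangChain's MultiQueryRetriever (the vector store and models are illustrative choices):

```python
# The retriever uses an LLM to rewrite the user question into several queries
# and merges the retrieved documents from all of them.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)
docs = retriever.invoke("How do I tune chunk size for retrieval?")
```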
5. Query construction for retrieval optimization
Self-querying
Regarding self-querying, remember the metadata column in the embedding table above? We can include additional information such as the author, genre, rating, the date it was written, and any other information about the document beyond the text itself. We can define a schema and store this metadata in a structured way alongside the vector representation.
With the metadata schema, we use an LLM to construct a structured query from the question to filter the document chunks. At the same time, the question is also converted into its vector representation for the similarity search. This kind of hybrid retrieval approach is likely to become more and more common as RAG becomes a more widely adopted strategy.
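A minimal sketch of self-querying with LangChain's SelfQueryRetriever; the metadata fields mirror the example above, and the vector store and models are illustrative choices:

```python
# The LLM turns the natural-language question into a metadata filter plus a
# semantic query, which are then applied to the vector store together.
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

metadata_field_info = [
    AttributeInfo(name="author", description="The author of the document", type="string"),
    AttributeInfo(name="genre", description="The genre of the document", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating of the document", type="integer"),
    AttributeInfo(name="date", description="The date the document was written", type="string"),
]

vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())

retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4-turbo-preview", temperature=0),
    vectorstore=vectorstore,
    document_contents="Articles about machine learning",
    metadata_field_info=metadata_field_info,
)
# "Highly rated articles written after 2022" -> metadata filter + similarity search
docs = retriever.invoke("Highly rated articles written after 2022")
```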
Time-weighted retriever
In some cases, the information contained in the documents is only relevant if it is recent enough. With a time-weighted retriever, data is retrieved based on a hybrid score combining semantic similarity and the age of the document. The scoring algorithm can be:
semantic_similarity + (1.0 - decay_rate) ^ hours_passed
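LangChain's TimeWeightedVectorStoreRetriever applies this formula; a minimal sketch, with FAISS and OpenAI embeddings as illustrative choices:

```python
import faiss
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
index = faiss.IndexFlatL2(1536)  # dimension of the OpenAI embedding model
vectorstore = FAISS(embeddings, index, InMemoryDocstore({}), {})

retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore,
    decay_rate=0.01,  # higher decay_rate -> older documents are penalized more
    k=4,
)
retriever.add_documents([Document(page_content="Q4 revenue report")])
docs = retriever.invoke("latest revenue figures")
```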
Other query constructions
There are some other query construction methods for unstructured, semi-structured, and structured data:
6. Document selection optimization
Re-ranking
After first-stage retrieval (lexical/keyword-based search or semantic/embedding-based search), re-ranking is done as a second stage to re-order the retrieved documents using relevance scores.
Maximal Marginal Relevance (MMR)
Sometimes we retrieve more than we actually need, and there can be similar documents capturing the same information. The MMR metric penalizes redundant information.
The re-ranking is an iterative process: for each candidate we measure both its similarity to the query and its similarity to the documents we have already selected, ending up with results that are similar to the query but dissimilar to each other.
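A minimal sketch of MMR-based selection with a LangChain vector store (the store and embedding model are illustrative choices):

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())

docs = vectorstore.max_marginal_relevance_search(
    "How do I tune chunk size?",
    k=4,              # number of documents to return
    fetch_k=20,       # number of candidates fetched before MMR re-ranking
    lambda_mult=0.5,  # 1.0 favors relevance, 0.0 favors diversity
)

# The same behavior is available through the retriever interface:
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 4, "fetch_k": 20})
```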
7. Context Optimization
Now that we have selected the right documents to answer the question, we need to figure out how to pass them as optimized context to the LLM.
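As one possible approach (an illustrative choice, not necessarily the only option), LangChain's contextual compression trims retrieved documents down to the parts relevant to the question before they go into the prompt:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())
# The compressor uses an LLM to extract only the relevant passages from each document.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(model="gpt-4-turbo-preview", temperature=0))

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(),
)
docs = retriever.invoke("What were the key findings of the report?")
```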
8. Multimodal RAG
When dealing with semi-structured or unstructured data, e.g. tables, text, and images, we might need a multimodal LLM and/or multimodal embeddings. Below are some options:
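One common option is to summarize each image with a multimodal LLM, embed the text summary for retrieval, and keep a reference back to the raw image. A minimal sketch of the summarization step (the model name and flow are illustrative assumptions):

```python
import base64

from openai import OpenAI

client = OpenAI()

def summarize_image(path: str) -> str:
    """Produce a text summary of an image so it can be embedded like any text chunk."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this image for retrieval."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The summary is embedded and indexed; at answer time the original image can be
# passed back to the multimodal LLM alongside the retrieved text.
```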
9. Agents
Last but not least, you may not only build the RAG app to answer questions from documents; we can also add multiple tools to augment the LLM app, or route questions between multiple datastores. An agent uses the LLM to choose a sequence of actions to take to solve a problem.
An agent consists of some key components:
There are some types of agents we should start with first:
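A minimal sketch of an agent with a retriever tool, using LangChain's OpenAI tools agent (the tool name, models and prompt hub id are illustrative assumptions):

```python
from langchain import hub
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.tools.retriever import create_retriever_tool
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())
docs_tool = create_retriever_tool(
    vectorstore.as_retriever(),
    name="search_docs",
    description="Search the internal documentation for relevant passages.",
)

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
prompt = hub.pull("hwchase17/openai-tools-agent")  # a standard agent prompt

# The agent lets the LLM decide when to call the retriever tool and when to answer directly.
agent = create_openai_tools_agent(llm, [docs_tool], prompt)
executor = AgentExecutor(agent=agent, tools=[docs_tool], verbose=True)
result = executor.invoke({"input": "What does the documentation say about chunk size?"})
```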
Conclusion
I suggest reading all the methods above along with other RAG strategies from OpenAI, then picking the ones that are most relevant to your use case. You can also combine multiple approaches to get the best result. For example, the first architecture can be turned into the one below:
If you find this article useful, please give it a like and share it with your friends. Also, kindly check out the generative_ai repo, which contains some generative AI techniques, and follow me on Medium.
Thanks for reading!