Recent developments in Retrieval Augmented Generation (RAG) Systems
With the advent of LLMs, both the capability for and the demand for better search have given rise to new methods. Historical approaches such as keyword matching and BM25 were effective to some extent, but LLMs have taken search capability up a notch. Though this topic is relatively recent, much research has already been done and published. In this blog, I will try to capture the popular techniques being used to improve the performance of RAG systems and the relevance of their outputs. Though an effort is made to capture many of the good ideas, there are many more available to improve the quality of RAG systems.
The concept of RAG was introduced in 2020 by Lewis et al., who suggested that for knowledge-intensive tasks it would be beneficial to retrieve the relevant information from source documents and use the inherent capability of an LLM to understand that source material and generate the answer. At a fundamental level, a RAG system involves the following steps: (i) a user queries for information, (ii) the system searches for documents that may contain the answer(s) to the query and retrieves them, (iii) the retrieved documents are fed into an LLM as context along with the search query in a prompt, and (iv) the LLM understands the prompt, assimilates the information in the documents, and generates an answer to the query.
This concept addresses issues that otherwise bog down an LLM trained on time-limited data. For example, if the query refers to information the LLM was never exposed to, the prompt results in an incorrect answer. The LLM may not have the detailed information the query seeks. Also, an LLM sometimes manufactures incorrect information, a behavior called hallucination. In addition, a query seeking an enterprise-specific answer will result in an incorrect answer if the LLM being used was not trained on the internal information of the enterprise.
In the initial phases of research into enhancing search results, a solution was proposed where the LLM was trained on additional documentation (say, organization-specific, topic-specific, or simply the latest information). Such a technique is called fine-tuning. This idea quickly lost momentum for a couple of reasons. Training an LLM on additional documentation is costly, and not every organization has the resources and time to manage the additional training. Even if an organization could train on such documentation, the need to re-train on the latest documentation would be a repetitive process, leaving the organization in a continuous LLM training cycle that is not economical.
RAG research has been focused on answering questions such as: (i) what to retrieve, (ii) how to retrieve, (iii) how much to retrieve, (iv) how to store such data before retrieval, (v) how to feed the retrieved data to the LLM, including the sequence in which the data is fed, (vi) how to generate the output, and (vii) how to sequence the display of the answers (if there are multiple) before a final selection is made to answer the prompt.
The most fundamental and earliest RAG frameworks focused on indexing, retrieval, and generation. Indexing covers how the data is collected, cleaned, stored, and made consumption-ready before it is input to the LLM. The data could be in various formats including doc, pdf, jpg, png, mp4, etc. The data is first converted to plaintext. Then the text is broken into small pieces (say a few words, sentences, or paragraphs) called chunks, hence the term chunking. These chunks are converted to vector representations using standard embedding models. Finally, the text chunks and their vector embeddings are indexed as key-value pairs in a vector database.
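To make the indexing step concrete, here is a minimal sketch in Python. The hash-based toy_embed function and the plain dictionary stand in for a real embedding model and a real vector database; they are illustrative assumptions, not part of any particular framework.

```python
import hashlib
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a real embedding model: hash each token into a fixed-size bag-of-words vector.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def chunk(text: str, chunk_size: int = 100) -> list[str]:
    # Fixed-size chunking: split the plaintext into pieces of roughly chunk_size tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

# "Vector database" stand-in: chunk id -> (embedding, original chunk text).
index: dict[int, tuple[np.ndarray, str]] = {}

def build_index(documents: list[str]) -> None:
    chunk_id = 0
    for doc in documents:
        for piece in chunk(doc):
            index[chunk_id] = (toy_embed(piece), piece)
            chunk_id += 1
```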
The retrieval process begins when a user enters a query as a prompt; the query is converted to a vector representation. Using a similarity measure, the document vectors that are most similar to the query vector are identified, and the system retrieves the top K documents.
The query and the top K documents are then fed to the LLM to generate an appropriate answer. Depending on the training of the model, the RAG system either taps into the model's parametric memory to generate the answer or assimilates the supplied documents to generate it.
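Continuing the toy index above, a minimal retrieve-and-generate sketch might look like this; the llm argument is a placeholder for any callable that maps a prompt string to a completion.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query and rank chunks by cosine similarity (vectors are already normalised).
    q = toy_embed(query)
    scored = [(float(np.dot(q, emb)), text) for emb, text in index.values()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

def answer(query: str, llm) -> str:
    # Feed the query plus the top-K chunks to the LLM as context.
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)  # `llm` is any prompt-in, completion-out callable.
```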
However, the above simplistic approach has certain weaknesses. The retrieval process may have low precision, meaning some of the retrieved documents may be the wrong ones. The system may also miss relevant documents, resulting in low recall. In addition, the RAG system needs access to the latest documentation; if the document database is not up to date, the system will not give the best answer. The generation side poses further issues. Hallucination may occur, and if enough guardrails are not in place, the system may generate irrelevant, toxic, biased, or objectionable content that defeats the purpose. Additionally, when irrelevant information is retrieved and fed to the LLM, the system may generate disjointed, repetitive, or irrelevant answers. Differences in tone, tense, and first/third-person writing style across the retrieved documents can also confuse the LLM and lead to suboptimal answers.
In some simplistic RAG systems, the output may depend too heavily on the augmented information and amount to mere extraction of the input context rather than a synthesis of it. Also, the top K documents selected should not exceed the size of the context window; otherwise they may introduce noise and dilute focus on the relevant information. In an interesting paper, Walking Down the Memory Maze, the authors propose a method of reducing the context fed to the model while keeping generation effective. In their method, called MemWalker, the long context is converted into a tree of connected summary nodes over the documents. When the LLM receives a query, it traverses the tree to identify the summaries that are useful for answering the query. Once it reaches the relevant content, the model executes the query to generate an answer.
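A loose sketch of the MemWalker idea is shown below. It is not the authors' implementation: the llm argument is a placeholder callable assumed to return a summary, a numeric choice, or an answer depending on the prompt.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                      # LLM-written summary of everything below this node
    text: str = ""                    # raw chunk text (leaves only)
    children: list["Node"] = field(default_factory=list)

def build_tree(chunks: list[str], llm, fanout: int = 4) -> Node:
    # Bottom-up: summarise each chunk, group the nodes, then summarise the summaries.
    nodes = [Node(summary=llm(f"Summarize:\n{c}"), text=c) for c in chunks]
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [Node(summary=llm("Summarize:\n" + "\n".join(n.summary for n in g)), children=g)
                 for g in groups]
    return nodes[0]

def walk(root: Node, query: str, llm) -> str:
    # Top-down: at each level, let the LLM pick the child whose summary best matches the query.
    node = root
    while node.children:
        menu = "\n".join(f"{i}: {c.summary}" for i, c in enumerate(node.children))
        choice = int(llm(f"Question: {query}\nWhich summary is most relevant? Reply with a number.\n{menu}"))
        node = node.children[choice]
    return llm(f"Context:\n{node.text}\n\nAnswer the question: {query}")
```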
To design the best RAG system and to get the best results, the following best practices would help.
Improving Indexing: The quality of indexing improves with better chunking of documents. Breaking the documents into the right sizes helps produce the right embeddings and the right indexes. Many chunking strategies exist, for example: (i) all chunks are the same size, say 100 tokens; (ii) each page, paragraph, or sentence becomes a chunk; (iii) mix and match, i.e., include a variety of sizes. Though the precision and recall of retrieval may increase with smaller chunk sizes, the cost of creating and storing embeddings quickly escalates. Optimizing chunking to match the capabilities of the LLM also helps, since different LLMs have longer or shorter context length limitations. The nature of the RAG task influences the chunking strategy as well: the appropriate chunk size differs depending on whether the objective is question answering or document search. Finally, add metadata to the data. Metadata such as dates, purpose, author, document type (report, book, blog, etc.), and original language helps with accuracy vis-a-vis the query.
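Below is a small sketch of two of the chunking strategies mentioned above (fixed-size and sentence-based) along with metadata attachment; the metadata field names are illustrative assumptions.

```python
import re

def fixed_size_chunks(text: str, size: int = 100) -> list[str]:
    # Strategy (i): every chunk has roughly the same number of tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def sentence_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    # Strategy (ii): chunk along natural sentence boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

def with_metadata(chunks: list[str], source: str, author: str, date: str) -> list[dict]:
    # Attach metadata so retrieval can later filter or boost by date, author, document type, etc.
    return [{"text": c, "source": source, "author": author, "date": date} for c in chunks]
```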
Search & Retrieval: In addition to retrieving similar vectors, we can also retrieve documents using keywords and other metadata, enriching the context that is fed to the LLM. The age-old BM25 method of retrieval can also be employed, and semantic search can be used to strengthen search effectiveness further. When multiple documents are retrieved, some ranking is needed so that only the top K documents are fed to the system before generation. In some instances, practitioners also use Google (or similar) search systems to retrieve the latest information for the RAG system.
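A minimal hybrid-retrieval sketch is shown below; the keyword_score function is a crude stand-in for BM25, and embed is assumed to be any embedding function (such as the toy one from the indexing sketch).

```python
import numpy as np

def keyword_score(query: str, doc: str) -> float:
    # Crude stand-in for BM25: fraction of query terms that appear in the document.
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def hybrid_retrieve(query: str, docs: list[str], embed, k: int = 3, alpha: float = 0.5) -> list[str]:
    # Blend lexical and semantic relevance; alpha controls the weight of the vector score.
    q_vec = embed(query)
    scored = []
    for doc in docs:
        vec_score = float(np.dot(q_vec, embed(doc)))
        scored.append((alpha * vec_score + (1 - alpha) * keyword_score(query, doc), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```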
Graph Databases: To retrieve the right documents, not all practitioners prefer vector searches. Some use graph databases to find correlations with the relevant documents. In graph databases, the documents and their relationships are represented as nodes and edges. Such an approach makes retrieval faster and improves the relevance of the retrieved documents.
Finetuned LLM: There are instances where practitioners have improved effectiveness by finetuning the LLM on relevant information. In a healthcare application, for example, an LLM finetuned on healthcare information can better comprehend the supplied context and generate better answers, precisely because it has been finetuned on relevant material.
Prompt Cleanup: Many users type prompts that contain a lot of noise and are inefficient for the model to process. A small LLM can be used to clean up and re-write the prompt to enhance effectiveness. This clean-up highlights the main point of the query, eliminates redundancy, and compresses the size of the prompt. Some practitioners accomplish the prompt clean-up with prompt summarization as well.
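A minimal sketch of such a clean-up step, assuming small_llm is any cheap model exposed as a prompt-in, string-out callable:

```python
REWRITE_INSTRUCTION = (
    "Rewrite the user's request so it is short, unambiguous, and free of filler. "
    "Keep every fact and constraint; remove greetings and repetition.\n\nRequest: "
)

def clean_prompt(raw_prompt: str, small_llm) -> str:
    # `small_llm` is a placeholder for any inexpensive model callable.
    return small_llm(REWRITE_INSTRUCTION + raw_prompt).strip()
```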
RAG Fusion: In this excellent blog, the author describes RAG Fusion, where multiple queries are generated from the original prompt. These queries highlight different perspectives on the focus point and broaden the coverage of the primary query, enriching the prompt to get a better answer. With multiple queries retrieving in parallel, a re-ranker can be applied to pick the better result(s).
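Here is a minimal sketch of the RAG Fusion idea with reciprocal rank fusion; llm and retrieve are placeholder callables, and the fusion constant k = 60 is a commonly used default rather than anything prescribed by the blog.

```python
from collections import defaultdict

def rag_fusion(query: str, llm, retrieve, n_variants: int = 4, k: int = 60) -> list[str]:
    # 1. Ask the LLM for paraphrases that cover different angles of the original query.
    variants_text = llm(f"Write {n_variants} different search queries for: {query}")
    queries = [query] + [q.strip() for q in variants_text.split("\n") if q.strip()]

    # 2. Retrieve a ranked list per query, then fuse with Reciprocal Rank Fusion:
    #    score(doc) = sum over lists of 1 / (k + rank).
    fused = defaultdict(float)
    for q in queries:
        for rank, doc in enumerate(retrieve(q)):
            fused[doc] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
```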
Query Routing: Organizations store their data in multiple databases. For example, a vector database holds vectorized data; graph databases can also be searched; a relational database holds structured data; the latest and streaming data may be collected in data lakes; and different file types such as docs, PDFs, JPEGs, and MP4s live in different systems. When RAG Fusion creates multiple queries, routing each query to the right set of data sources for efficient retrieval is another step practitioners take.
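A toy router might look like the sketch below; the store names and the llm callable are illustrative assumptions.

```python
def route_query(query: str, llm) -> str:
    # Ask a model to pick the store most likely to hold the answer; fall back to the vector store.
    stores = ["vector_store", "graph_db", "sql_warehouse", "data_lake"]
    choice = llm(
        "Which data source best answers the query? "
        f"Reply with exactly one of {stores}.\nQuery: {query}"
    ).strip()
    return choice if choice in stores else "vector_store"
```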
Query Rewrite: Users are not always good at writing a well-optimized query. Why not take the help of an LLM to rewrite the query and improve its quality, thereby enhancing the quality of retrieval?
RAG and GAR (Retrieval-augmented Generation and Generation-augmented Retrieval): In an interesting paper, the authors propose an innovative approach of iteratively enhancing the query and feeding it into the RAG system to generate a better answer, which in turn is used to create a better query, which again produces a better answer, and so on.
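A rough sketch of such an iterative loop, with llm and retrieve again as placeholder callables, could look like this:

```python
def rag_gar_loop(query: str, llm, retrieve, rounds: int = 3) -> str:
    # Alternate between retrieval-augmented generation and generation-augmented retrieval:
    # each draft answer is used to sharpen the next query, which should retrieve better context.
    draft = ""
    for _ in range(rounds):
        context = "\n\n".join(retrieve(query))
        draft = llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
        query = llm(
            f"Original question: {query}\nDraft answer: {draft}\n"
            "Rewrite the question so a search engine would find better evidence."
        )
    return draft
```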
Finetuned Embedding Model & Dynamic Embedding: The effectiveness of the retrieval system improves when the embeddings of the prompt and the stored data are generated with an embedding model that is finetuned on relevant data. For example, between a barebones model trained on Wikipedia data alone and the same model finetuned on healthcare information, the latter would create better embeddings for a healthcare application, and better embeddings enhance the search results by identifying a set of more similar documents. Dynamic embedding is a concept where the embedding model generates an embedding for a word based on the surrounding words. BERT, a common embedding model, generates such dynamic (contextual) embeddings. OpenAI also has several embedding models available.
Retrieve-then-read vs generate-then-read: In another interesting paper, the authors suggest an improvement over the traditional sequence of retrieving documents and having an LLM read them to generate an answer. In their GenRead method, the authors suggest first having the LLM generate a context document for the query and then reading that generated context to produce the answer, which reduces redundancy and noise. A similar-sounding recite-then-read approach is suggested in another paper, where the authors propose reciting memorized information from the weights of the model and then generating an answer. (Note: I am not sure how this is dramatically different from regular prompt engineering.)
Hypothetical Document (hypo doc) Embedding: In yet another eye-catching paper, the authors advocate creating a hypothetical document that purports to contain the answer to the query. This hypothetical document may contain information that is only in the neighborhood of the answer, and sometimes it may even contain falsehoods. The document is then encoded into an embedding vector, and the vectors nearest to this hypo doc vector are searched for in the vectorized document database. The retrieved documents are fed into an LLM to get an accurate answer. This approach is called Hypothetical Document Embeddings (HyDE). The authors found it to give surprisingly accurate results across various tasks and languages.
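A minimal HyDE-style sketch, reusing the index shape from the indexing sketch above, with llm and embed as placeholder callables:

```python
import numpy as np

def hyde_retrieve(query: str, llm, embed, index: dict[int, tuple[np.ndarray, str]], k: int = 3) -> list[str]:
    # 1. Ask the LLM to write a hypothetical passage that *would* answer the query
    #    (it may contain inaccuracies; only its position in embedding space matters).
    hypo_doc = llm(f"Write a short passage that answers: {query}")
    # 2. Embed the hypothetical passage and rank the real chunks by similarity to it.
    hypo_vec = embed(hypo_doc)
    scored = [(float(np.dot(hypo_vec, emb)), text) for emb, text in index.values()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # 3. The real neighbours, not the hypothetical text, are passed on to the generator.
    return [text for _, text in scored[:k]]
```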
Abstract Embedding Approach: In this approach, abstracts are generated for all the documents in the database and vectorized. During retrieval, the query vector retrieves the abstract vectors that are similar or in the neighborhood.
Parent Document Retriever: In an interesting blog, the author advocates creating child documents (smaller chunks of a bigger document) and vectorizing them. When the child vectors are retrieved, the corresponding parent documents are accessed and fed into the LLM for a better query response.
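A small sketch of the parent-document idea, with embed as a placeholder embedding function and parents as a simple dict mapping a parent-document id to its text:

```python
import numpy as np

def build_child_index(parents: dict[str, str], embed, child_size: int = 50):
    # Split each parent document into small child chunks and embed only the children.
    child_index = []  # list of (child embedding, parent id)
    for parent_id, text in parents.items():
        tokens = text.split()
        for i in range(0, len(tokens), child_size):
            child = " ".join(tokens[i:i + child_size])
            child_index.append((embed(child), parent_id))
    return child_index

def retrieve_parents(query: str, embed, child_index, parents: dict[str, str], k: int = 2) -> list[str]:
    # Match on fine-grained children, but hand the LLM the full parent documents for context.
    q = embed(query)
    ranked = sorted(child_index, key=lambda pair: float(np.dot(q, pair[0])), reverse=True)
    seen, out = set(), []
    for _, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parents[parent_id])
        if len(out) == k:
            break
    return out
```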
Summarization: Some practitioners also suggest summarizing the documents before feeding them to the system. For example, if retrieval returns multiple documents, a summarized version of them can be fed to the system to save cost, provide better context, and reduce the possibility of hallucination.
“Lost in the Middle” (LIM) Syndrome: When LLMs are fed a larger context window, they exhibit a behavior called “lost in the middle,” where they focus more attention on information at the start and at the end and miss large portions in the middle. To avoid LIM syndrome, practitioners can re-arrange the retrieved documents before they are fed to the system so that the most important content sits near the start and the end of the context.
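One simple re-arrangement, assuming the documents arrive ranked best-first, is to interleave them so the strongest documents end up at the two ends of the context:

```python
def lim_reorder(docs_best_first: list[str]) -> list[str]:
    # Place the strongest documents at the two ends of the context and the weakest in the middle,
    # since models tend to attend most to the beginning and end of a long prompt.
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: ["d1", "d2", "d3", "d4", "d5"] (best first) -> ["d1", "d3", "d5", "d4", "d2"]
```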
Diversity Ranker: The documents are fed in an order determined by the diversity of their content: the more dissimilar two documents are, the closer together they are placed, so the LLM receives a spread of information. Such an approach hopes that the LLM generates a more balanced and diversified answer that is closer to the ask of the query. Here is an implementation of the “Lost in the Middle” (LIM) Ranker and Diversity Ranker. Check here for the Cohere Re-ranker. Also, check here for a nice blog on retrievers and re-rankers.
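A greedy diversity ordering might look like the sketch below; embed is again a placeholder embedding function, and the greedy average-similarity criterion is one simple choice among many.

```python
import numpy as np

def diversity_order(docs: list[str], embed) -> list[str]:
    # Greedily order documents so each next document is the least similar (on average)
    # to the ones already placed, giving the LLM a spread of information.
    vecs = [embed(d) for d in docs]
    remaining = list(range(len(docs)))
    order = [remaining.pop(0)]                  # start from the first (e.g. most relevant) doc
    while remaining:
        def avg_sim(i: int) -> float:
            return float(np.mean([np.dot(vecs[i], vecs[j]) for j in order]))
        nxt = min(remaining, key=avg_sim)       # pick the most dissimilar to what's already placed
        order.append(nxt)
        remaining.remove(nxt)
    return [docs[i] for i in order]
```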
Check here for some additional blogs.