Retrieval Augmented Generation is becoming a key framework for industries and GenAI practitioners to build LLM-powered applications. It makes it possible to leverage LLMs optimally and efficiently for end-to-end GenAI applications such as multi-functional chatbots, search engines, and many more.
In this article, I explain in detail what Retrieval Augmented Generation ( RAG ) is and how it works. Going further, I also explain how basic Vanilla RAG (Naive RAG) is optimized and extended into the Advanced RAG framework, as well as how RAG systems are evaluated.
RAG opens up what one can creatively build with LLMs. Let's jump into it.
Need for RAG
- Pre-trained foundational LLMs are trained on general-purpose data, so they can give accurate and relevant output for general queries.
- But for domain-specific, external, or recently updated data, they may not work well. In such cases LLMs can hallucinate and give wrong or misleading output.
- To overcome this, there are approaches such as fine-tuning the LLM and RAG.
- I have already written an article on fine-tuning LLMs. In this blog we will understand RAG.
Basics of RAG
- Retrieval Augmented Generation ( RAG ) is a framework designed to supply additional, relevant context to the LLM along with the original query in the form of a prompt. This helps the LLM understand the user's query and its context properly, and hence produce a good, relevant output.
- It's just like an open-book exam: looking at the question, the system first retrieves the relevant context and then generates the answer from it.
- Before going into the working of RAG, let's look at some of the basic concepts and tools which play an important role in the RAG framework.
Vector Database or Vector Store :
- A vector database is a new form of database in the era of AI, designed to efficiently store and retrieve vector embeddings. It uses semantic search to retrieve similar embeddings. This combination of efficient storage and semantic search of embeddings makes vector databases an integral part of Generative AI projects.
- Vectors are nothing but numerical representations of data; in AI we call them vector embeddings. Text or images, for example, can be represented as vectors. Data with similar context has similar embeddings in n-dimensional space, which places them close to each other. Searching for such similar vectors is called semantic search, and it typically uses cosine similarity between vectors in that n-dimensional space.
- A vector database is used to store such vector representations of data and uses semantic search to retrieve vectors similar to an input vector. This makes it different from SQL or NoSQL databases.
- In the RAG framework, the vector database comes into play in the context-retrieval part.
Prompt Engineering :
- Prompt engineering is the art of communicating with LLMs. It is nothing but the way of formatting or structuring the prompt or input context in order to get the most efficient and desired results from the LLM.
- For basic RAG, the prompt template can be formatted like below:
QUESTION:
[Actual question]
DOCUMENTS:
[Text of the actual document chunks retrieved from the vector database for the given question]
INSTRUCTION:
For the given QUESTION, find the answer using the context information provided in DOCUMENTS.
ANSWER:
- The end goal of RAG is to get a relevant output by passing this type of prompt to the LLM.
- In Advanced RAG, the structure of the prompt can be optimized, which we will see further below.
- When we use frameworks like LangChain or LlamaIndex, this prompt formatting is handled in the backend for us.
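- For illustration, the template above can also be defined explicitly. Below is a minimal sketch using LangChain's PromptTemplate; the variable names and example question are just placeholders:
```python
# A minimal sketch of the RAG prompt template above using LangChain's
# PromptTemplate; variable names are illustrative.
from langchain.prompts import PromptTemplate

rag_template = PromptTemplate.from_template(
    "QUESTION:\n{question}\n\n"
    "DOCUMENTS:\n{documents}\n\n"
    "INSTRUCTION:\n"
    "For the given QUESTION, find the answer using the context information "
    "provided in DOCUMENTS.\n\n"
    "ANSWER:"
)

prompt = rag_template.format(
    question="What is semantic search?",
    documents="<retrieved document chunks go here>",
)
print(prompt)
```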
Architecture and Working of Retrieval Augmented Generation ( RAG ):
- The first step in RAG is to store vector embeddings of the data/documents in the vector database. This is also called indexing.
- Initially, the document(s) or domain data is split into chunks of a specific size. Chunks are nothing but sub-parts of the whole document; this splitting can be done with libraries like LangChain or LlamaIndex.
- These chunks are then converted into vector embeddings using an embedding model. Almost all LLM providers offer such embedding models.
- After that, these vectors are stored in a vector database such as Chroma, Pinecone, FAISS, etc. There are many vector stores available; which one to choose depends on factors like scalability, latency, security, cost, and more (a minimal indexing sketch in code is shown below).
- Once this is done, the actual retrieval, augmentation, and generation parts of the RAG framework come into play. Let's take a look at these three parts step by step.
- Below is the basic flow diagram for RAG.
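- As a concrete starting point, here is a minimal indexing sketch with LangChain, assuming OpenAI embeddings and a local Chroma store; the file path and chunk sizes are only examples:
```python
# A minimal indexing sketch: load -> chunk -> embed -> store.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load the domain document(s).
docs = TextLoader("domain_docs.txt").load()

# 2. Split into chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3. Embed the chunks and store them in the vector database (indexing).
vectordb = Chroma.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_index",
)
```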
Retrieval Part :
- When the user enters a query, it gets converted into vector embeddings using the embedding model. This embedding model must be the same one we used to store the document-chunk embeddings in the vector DB.
- After converting the user query text into embeddings, similar context chunks are retrieved from the vector DB using semantic search. This works on the principle that vectors with similar context are close to each other in n-dimensional space.
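- A minimal retrieval sketch, reusing the vectordb built in the indexing sketch above (the query is just an example):
```python
# Retrieval: the query is embedded with the same embedding model and the k
# most similar chunks are returned via semantic search.
query = "How does semantic search work?"
retrieved_chunks = vectordb.similarity_search(query, k=4)

for chunk in retrieved_chunks:
    print(chunk.page_content[:200])
```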
Augmentation Part :
- In augmentation, we simply add the retrieved chunks as additional context to the user query.
- So, user query + retrieved context becomes our final prompt, which we pass to the LLM for output generation.
- Example of Augmented Prompt:
QUESTION:
[Actual question]
DOCUMENTS:
[Text of the actual document chunks retrieved from the vector database for the given question]
INSTRUCTION:
For the given QUESTION, find the answer using the context information provided in DOCUMENTS.
ANSWER:
Generation Part :
- Here we pass our augmented prompt to the LLM, which then generates a relevant output by understanding the user query together with the context provided alongside it (an end-to-end sketch of this Naive RAG flow is shown at the end of this section).
- This is the core idea of RAG.
- This basic form is called Vanilla RAG or Naive RAG.
- But Naive RAG is not the most effective or optimized RAG framework for complex use cases and production-ready applications.
- In order to improve and optimize the quality of the RAG framework, some advancements can be made on top of this basic architecture.
- Let's see why advancements are required on Naive RAG.
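- Before that, here is a minimal end-to-end sketch of the Naive RAG flow, reusing rag_template, query, and retrieved_chunks from the earlier sketches and assuming an OpenAI chat model (the model name is only an example):
```python
# End-to-end Naive RAG: augment the prompt with retrieved chunks and generate.
from langchain_openai import ChatOpenAI

# Augmentation: combine the retrieved chunks with the user query.
context = "\n\n".join(chunk.page_content for chunk in retrieved_chunks)
augmented_prompt = rag_template.format(question=query, documents=context)

# Generation: the LLM answers the query using the provided context.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
answer = llm.invoke(augmented_prompt)
print(answer.content)
```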
Problems with Naive RAG or Vanilla RAG :
- A single chunk retrieved from the vector DB may not always contain all the needed context, since a chunk is just a sub-unit of the whole document. It may contain half of the important context and miss the rest.
- Even if a chunk retrieved from the vector DB has a good similarity score, it does not necessarily have good relevancy.
- Sometimes the query passed by the user may not be properly phrased or structured, so the LLM may not give a good output on the first attempt. The user may also pack multiple questions into a single query, in which case a single chunk may not provide context for all of them.
- The order or rank in which retrieved chunks are placed in the prompt has a significant impact on output quality. There is no re-ranking in Naive RAG.
To overcome these issues, some advancements and modifications can be made on top of the Naive RAG architecture.
Advanced RAG :
- Advanced RAG adds several optimized features on top of Naive RAG.
- It includes modified indexing/embedding methods for document chunks, advancements before, during, and after the retrieval process (also known as pre-retrieval and post-retrieval processes), and response synthesis for the augmentation and generation parts.
Architecture and Working of Advanced RAG :
- Let's first see the high-level architecture of Advanced RAG, and then we will jump into each additional component it adds over Naive RAG.
- In Advanced RAG, the optimizations done on top of the basic Vanilla RAG skeleton are mainly divided into three parts: pre-retrieval, retrieval, and post-retrieval processes.
- Let's deep dive into these three sections one by one.
Pre-Retrieval Processes :
- As the name suggests, these are optimizations done before the retrieval step to enhance the quality of the retrieved context.
- They include optimized chunking and indexing, query transformation, and query routing.
Chunking & Indexing of Document :
- As we have already seen in Naive RAG, chunks are nothing but small parts of the whole document, and indexing is the vector representation of these chunks that we store in the vector DB.
- How we do the chunking and indexing/embedding has an impact on accurate retrieval, which in turn improves generation quality and contextual confidence.
- Fixed-size chunking, such as a simple character or word splitter, is the most common and simplest method, but it is not very effective because a chunk may not hold the full context of a specific subject. This is also known as context fragmentation.
- To overcome this, we can use advanced techniques that keep the context and coherence of the text, such as recursive structure-aware chunking and content-based chunking.
- Sentence window parsing splits the document into sentence-level chunks that get embedded, but also keeps a surrounding window of sentences; for example, with window size = 3 it takes the sentences before and after the sentence being embedded and stores that window as metadata (a code sketch of this is shown after this list).
- Another option is parent-child chunking, in which the document is partitioned into larger parent chunks that are then split into smaller child chunks.
- These advanced methods can be computationally costlier but give better results.
- Indexing means converting chunks into embeddings or vectors. The quality of indexing can be improved with advanced methods or better embedding models.
- We can use specialized embedding models for vectorization/indexing. To choose the best model for our requirements and constraints, we can refer to the Massive Text Embedding Benchmark (MTEB).
- For more domain-specific applications, we can also use a custom fine-tuned embedding model. It can understand the domain context more efficiently and create higher-quality embeddings, which improves overall RAG performance. Fine-tuned embeddings can improve performance by around 5-10%.
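- As an illustration of sentence window parsing, here is a minimal sketch using LlamaIndex's SentenceWindowNodeParser; the sample text and window size are just examples:
```python
# Sentence window parsing: each sentence is embedded on its own, but its
# surrounding window of sentences is kept in the node metadata.
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,                                # sentences kept around each sentence
    window_metadata_key="window",                 # metadata key holding the window text
    original_text_metadata_key="original_text",
)

doc = Document(
    text="RAG retrieves context. The context is embedded. "
         "Embeddings live in a vector store. The LLM generates the answer."
)
nodes = parser.get_nodes_from_documents([doc])

# Each node embeds one sentence but carries its window as metadata.
print(nodes[0].metadata["window"])
```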
Query Transformation :
- Query transformation is a method of improving the user query by restructuring it so that retrieval quality improves.
- It includes techniques like decomposing the main query into multiple sub-queries, step-back prompting, and query rewriting.
Multi Query Retrieval / Sub Query Decomposition :
- If the query is complex and contains multiple contexts, retrieval with the single query may not be a good approach, as it may fail to fetch the proper context.
- In sub-query decomposition, the user query is first decomposed into multiple sub-queries using an LLM, retrieval for these sub-queries is done in parallel, and then the retrieved contexts are combined into a single prompt for final answer generation.
- In LangChain we can use MultiQueryRetriever to implement this technique; in LlamaIndex, SubQuestionQueryEngine can be used.
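- A minimal sketch with LangChain's MultiQueryRetriever, reusing the vectordb from the indexing sketch and assuming an OpenAI chat model:
```python
# The LLM generates several alternative phrasings / sub-queries of the user
# query, retrieves chunks for each, and returns the deduplicated union.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)
docs = retriever.invoke("Compare fixed-size chunking with sentence window parsing")
```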
Step-Back Prompting :
- Step-back prompting is a method which uses an LLM to generate a more general, high-level query from the original complex query. This generated query is also called the step-back query.
- At retrieval time, context for both the step-back query and the original query is retrieved and used in the final prompt to generate the output.
- In LangChain we can use a predefined prompt for step-back questions from the LangChain hub ("langchain-ai/stepback-answer").
Query Rewriting :
- In the real world, the user query may not be properly phrased or optimized for quality retrieval, which affects the end output.
- To overcome this issue, we can rewrite or rephrase the query so that it retrieves the relevant context more effectively.
- In LangChain we can use a predefined rewrite prompt from the LangChain hub ("langchain-ai/rewrite").
Query Construction / Translation :
- When we want to connect to and retrieve relevant data from a database such as a SQL database based on the user query, we need to convert the query text into a corresponding SQL query.
- To do this we can use an LLM. Frameworks like LangChain provide predefined functions to implement this type of translation.
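- A minimal sketch of text-to-SQL query construction with LangChain's create_sql_query_chain; the SQLite URI and question are just examples:
```python
# The chain inspects the database schema and turns the natural-language
# question into a SQL query string.
from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

db = SQLDatabase.from_uri("sqlite:///sales.db")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = create_sql_query_chain(llm, db)
sql_query = chain.invoke({"question": "How many orders were placed last month?"})
print(sql_query)
```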
Query Routing :
- When we have multiple vector stores/databases, or various actions to perform on a user query based on its context, routing the query in the right direction is very important for relevant retrieval and further generation.
- Using a specific prompt and an output parser, we can make an LLM call to decide which action to perform or where to route the user query.
- We can use prompt chaining or custom agents to implement query routing in LangChain or LlamaIndex.
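- A minimal sketch of LLM-based routing: the LLM picks one of the available routes and an output parser returns it as plain text. The route names and the retrievers hr_retriever and finance_retriever are hypothetical:
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

router_prompt = ChatPromptTemplate.from_template(
    "Classify the user question into exactly one route: 'hr' or 'finance'.\n"
    "Answer with only the route name.\n\nQuestion: {question}"
)
router = router_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()

question = "What is the reimbursement limit for travel expenses?"
route = router.invoke({"question": question}).strip().lower()

# Route the query to the matching vector store / retriever (hypothetical names).
retriever = hr_retriever if route == "hr" else finance_retriever
docs = retriever.invoke(question)
```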
Retrieval Processes :
- This is the process of retrieving relevant context for a given query from the vector store or another database.
- Instead of plain document-chunk retrieval, we can use some modified methods which are more efficient and give more contextual retrieval.
- Some of the advanced retrieval processes are given below.
Parent - Child Index Retrieval :
- As discussed above, this method first breaks the documents into larger parts, the parent chunks, and then splits these parent chunks into smaller child chunks.
- Indexing is done on the child chunks. When the user passes a query, the most similar child chunk(s) are retrieved first, and then the parent chunk(s) corresponding to those child chunk(s) are augmented into the prompt and passed to the LLM for generation.
- This gives the LLM more context, which directly helps in getting more contextual results.
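- A minimal sketch with LangChain's ParentDocumentRetriever, assuming OpenAI embeddings, an in-memory docstore, and the docs loaded in the indexing sketch; the chunk sizes are just examples:
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(
    collection_name="child_chunks", embedding_function=OpenAIEmbeddings()
)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,                              # child chunks are indexed here
    docstore=InMemoryStore(),                             # parent chunks are stored here
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1000),
)

retriever.add_documents(docs)  # `docs` as loaded in the indexing sketch

# Search matches the small child chunks, but the larger parent chunks are returned.
parents = retriever.invoke("What is context fragmentation?")
```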
Hierarchical Summary Index Retrieval :
- If we have multiple documents, retrieving the relevant chunk is challenging; in such cases, hierarchical index retrieval is useful.
- Here two indices are created: in the first, document summaries are indexed, and in the second, chunks of the documents are indexed.
- During retrieval, the most similar summary (and hence document) is retrieved first based on the user query, and then the relevant chunks are retrieved from that document.
- RAPTOR is one hierarchical approach of this kind, introduced by Stanford researchers.
Hypothetical Questions Retrieval :
- Here, hypothetical questions are generated for each document chunk using an LLM, and these questions are then indexed.
- At retrieval time, the hypothetical questions most similar to the user query are retrieved first, and then the original chunks behind those questions are fetched for further use.
- This improves retrieval quality because chunks are retrieved based on the semantic similarity between the user query and the hypothetical questions.
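- A minimal sketch of this idea, reusing the chunks from the indexing sketch; the prompt wording and model name are just examples:
```python
# Generate hypothetical questions per chunk, index the questions, and map a
# matched question back to its source chunk at retrieval time.
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

question_docs = []
for i, chunk in enumerate(chunks):
    questions = llm.invoke(
        "Generate 3 questions this text can answer, one per line:\n\n"
        + chunk.page_content
    ).content.splitlines()
    # Index the questions, keeping a pointer back to the source chunk.
    question_docs += [
        Document(page_content=q, metadata={"chunk_id": i})
        for q in questions if q.strip()
    ]

question_index = Chroma.from_documents(question_docs, embedding=OpenAIEmbeddings())

# Match the user query against the hypothetical questions, then return the
# original chunks they point to.
matches = question_index.similarity_search("What is prompt compression?", k=3)
retrieved_chunks = [chunks[m.metadata["chunk_id"]] for m in matches]
```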
Post - Retrieval Processes :
- Once we have efficiently retrieved the context for a given query, we can further refine and optimize it to improve its relevancy for more optimal answer generation.
- This includes methods like filtering, re-ranking, and prompt compression.
Filtering :
- This is the process of filtering out chunks based on their similarity score. We can set a threshold for filtering.
- Filtering removes noise and redundant information and passes only relevant context to the LLM, which improves generation quality.
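- A minimal sketch of score-based filtering, reusing the vectordb from the indexing sketch; the threshold value is just an example:
```python
# Only chunks scoring above the threshold are passed on to the LLM.
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.75, "k": 5},
)
docs = retriever.invoke("What is query routing?")
```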
Re-Ranking of Retrieved chunks/context :
- Re-ranking is the process of ordering the retrieved context chunks in the final prompt based on their score and relevancy. This is important because researchers have found better performance when the most relevant context is positioned at the start of the prompt.
- To do this, we can use LlamaIndex, which offers various reranking methods such as LongContextReorder, CohereRerank, SentenceTransformerRerank, LLMRerank, JinaRerank, ColbertRerank, RankGPT, etc.
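- A minimal re-ranking sketch with LlamaIndex's SentenceTransformerRerank, assuming documents is a list of LlamaIndex Documents and a default embedding/LLM setup; the cross-encoder model name is just an example:
```python
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank

index = VectorStoreIndex.from_documents(documents)

# Retrieve a larger candidate set, then keep only the top 3 after re-ranking.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)
query_engine = index.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)
print(query_engine.query("How does parent-child retrieval work?"))
```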
Prompt Compression :
- Prompt compression is a method of compressing or shrinking the retrieved context or the final prompt by removing irrelevant information.
- Its aim is to reduce the length of the input prompt in order to cut cost, improve latency, and improve the efficiency of output generation by letting the LLM focus on a more concise context.
- Methods like LLMLingua and LongLLMLingua can be used for prompt compression. They use a language model to generate a compressed version of the input prompt.
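- A minimal sketch assuming the llmlingua package's PromptCompressor API; the question, token budget, and reuse of retrieved_chunks from the retrieval sketch are just examples:
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small language model by default

retrieved_texts = [chunk.page_content for chunk in retrieved_chunks]
result = compressor.compress_prompt(
    retrieved_texts,
    question="What is prompt compression?",
    target_token=300,  # rough budget for the compressed context
)
print(result["compressed_prompt"])
```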
Evaluation of RAG :
- Evaluation of RAG systems is essential to benchmark the overall performance of the RAG output.
- To evaluate RAG we can use metrics like answer relevancy and faithfulness for generation, and context recall and context precision for retrieval.
RAGAs ( RAG Assessment ) :
- RAGAs is one of the frameworks used to evaluate RAG systems. It is essentially a one-shot prompting technique which uses four prompt templates for four different metrics.
- For the generation part it uses answer relevancy and answer faithfulness as metrics, and for the retrieval part it uses context precision and context recall.
- It uses a prompt and an LLM to generate a score for each of these metrics.
For example, for the answer relevancy score it formats a prompt template like:
"""
Please provide a score between 1 and 10 indicating whether the provided retrieved results are relevant to the query
Query:
[Actual Query]
Search Results:
[Actual retrieved results/contexts]
Score:
"""
- Similarly, it uses a suitable prompt template for each metric.
- LangChain provides a framework called LangSmith, with which we can efficiently deploy, evaluate, and monitor the performance of a RAG application.
- In LlamaIndex we can use the rag_evaluator_pack for evaluation of RAG.
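- A minimal RAGAs evaluation sketch; the sample data is made up and the exact column names can vary between ragas versions:
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Each row pairs a question with the generated answer, the retrieved contexts,
# and a reference (ground truth) answer.
eval_data = Dataset.from_dict({
    "question": ["What does RAG stand for?"],
    "answer": ["Retrieval Augmented Generation."],
    "contexts": [["RAG stands for Retrieval Augmented Generation..."]],
    "ground_truth": ["Retrieval Augmented Generation."],
})

scores = evaluate(
    eval_data,
    metrics=[answer_relevancy, faithfulness, context_precision, context_recall],
)
print(scores)
```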
This was an overview of RAG. I tried to cover many of its components and advancements, but LLMs, RAG, and tools like LangChain and LlamaIndex are active areas of research, and researchers are frequently making great advancements in this field of Generative AI.