RAG Techniques Every AI/ML/Data Engineer Should Know!
Image by Author - Pavan Belagatti


New to the world of Retrieval Augmented Generation (RAG)? We've got you covered with this in-depth guide.

Large language models (LLMs) are becoming the backbone of many organizations as the world transitions toward AI. But for all their strengths, LLMs have a well-known weakness if not used properly: they can produce responses that are unexpected, fabricated, or biased. This can happen for various reasons, and we call this generation of misinformation by LLMs hallucination.

There are several notable approaches to mitigating LLM hallucinations, such as fine-tuning, prompt engineering, and retrieval augmented generation (RAG). RAG has been the most talked-about of these approaches. In this guide, we will cover the RAG approach end to end: what it is, how it works, its components, and its workflow from basic to advanced.

What is RAG?

Retrieval-Augmented Generation (RAG) is a natural language processing framework that enhances large language models (LLMs) by combining external data retrieval with text generation. It retrieves relevant information from external sources, such as databases or custom document collections, to improve the accuracy and relevance of responses, mitigating issues like misinformation and outdated knowledge in generated content. In short, RAG reduces LLM hallucinations by grounding responses in contextually relevant data from the sources you attach.

RAG Components

The RAG pipeline involves three critical components: the retrieval component, the augmentation component, and the generation component.

  • Retrieval: This component fetches the relevant information from an external knowledge base, such as a vector database, for any given user query. It is crucial because it is the first step in producing meaningful and contextually correct responses.

  • Augmentation: This part involves enriching the user query with the retrieved context, combining the retrieved passages, the query, and any system instructions into a single prompt for the model.

  • Generation: Finally, the output is produced with the help of a large language model (LLM). The LLM combines its own knowledge with the provided context to come up with an apt response to the user's query.
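
To make the augmentation step concrete, here is a minimal sketch of how retrieved chunks are typically stitched into the prompt the LLM finally sees. The template wording and the `retrieved_chunks` variable are illustrative assumptions, not a fixed standard:

```python
def build_augmented_prompt(user_query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context with the user query (the augmentation step)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )
```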

Advantages of Retrieval Augmented Generation

RAG has some incredible advantages. Let me share the notable ones:

  • Scalability. RAG helps you scale by simply updating or adding external/custom data in your external knowledge store (such as a vector database), rather than retraining the model.

  • Memory efficiency. Traditional models like GPT are limited to what they learned during training and cannot hold everything in their parameters. RAG leverages external databases such as a vector database, allowing it to pull in fresh, updated, or detailed information quickly when needed.

  • Flexibility. By updating or expanding the external knowledge source, you can adapt RAG to build a wide range of AI applications.

Systematic RAG Workflow

RAG consists of three modules that you need to understand: the retrieval module, the augmentation module, and the generation module (as discussed above).

First, the documents that form the source knowledge base are divided into chunks. These chunks are transformed into vectors using an embedding model (for example, OpenAI's embedding models or open-source models from the Hugging Face community) and then stored in a vector database (e.g., SingleStore or Chroma).

When the user inputs a query, the query is embedded into a vector using the same embedding model. Then, the chunks whose vectors are closest to the query vector, based on a similarity metric (e.g., cosine similarity), are retrieved. This process makes up the retrieval module. After that, the retrieved chunks are appended to the user's query and the system prompt in the augmentation module.

This step is critical for making sure that the content of the retrieved documents is effectively incorporated with the query. The output of the augmentation module is then fed to the generation module, which is responsible for generating an accurate answer to the query by passing the retrieved chunks and the prompt through an LLM (such as ChatGPT from OpenAI, open models on Hugging Face, or Gemini from Google).
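
As a minimal sketch of the retrieval step described above, the following uses the sentence-transformers library and NumPy to embed chunks, embed a query with the same model, and rank chunks by cosine similarity. The model name and chunk list are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative chunks; in practice these come from your chunked documents.
chunks = [
    "RAG retrieves relevant context from a vector database.",
    "Cosine similarity measures the angle between two embedding vectors.",
    "SingleStore supports vector storage and hybrid search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query = "How does RAG find relevant context?"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalized embeddings, the dot product equals cosine similarity.
scores = chunk_vecs @ query_vec
top_k = np.argsort(scores)[::-1][:2]
retrieved = [chunks[i] for i in top_k]
print(retrieved)
```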

But to make RAG work perfectly, here are some key points to consider:

1. Quality of External Knowledge Source: The quality and relevance of the external knowledge source used for retrieval are crucial.

2. Embedding Model: The choice of the embedding model used for retrieving relevant documents or passages from the knowledge source is important.

3. Chunk Size and Retrieval Strategy: Experiment with different chunk sizes to find the optimal length for context retrieval. Larger chunks may provide more context but could also introduce irrelevant information. Smaller chunks may focus on specific details but might lack broader context.

4. Integration with Language Model: The way the retrieved information is integrated with the language model's generation process is crucial. Techniques like cross-attention or memory-augmented architectures can be used to effectively incorporate the retrieved information into the model's output.

5. Evaluation and Fine-tuning: Evaluating the performance of the RAG model on relevant datasets and tasks is important to identify areas for improvement. Fine-tuning the RAG model on domain-specific or task-specific data can further enhance its performance.

6. Ethical Considerations: Ensure that the external knowledge source is unbiased and does not contain offensive or misleading information.

7. Handling Out-of-Date or Incorrect Information: It's important to have strategies in place for handling situations where the retrieved information is out-of-date or incorrect.

Use the SingleStore database as your vector store; try it for free: https://bit.ly/SingleStoreDB

RAG Tutorial

Let's build a simple AI application that can fetch contextually relevant information from our own data for any given user query.

Follow the complete hands-on tutorial from my Medium article.
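
The full tutorial in the Medium article uses SingleStore as the vector store; as a self-contained stand-in, here is a minimal sketch that keeps everything in memory and calls the OpenAI chat API for generation. The model names, documents, and helper names are illustrative assumptions:

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Premium support is available 24/7 for enterprise customers.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def answer(query: str, k: int = 2) -> str:
    """Retrieve the top-k documents and generate an answer grounded in them."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]
    context = "\n".join(documents[i] for i in top)

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; use any chat model you have access to
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What is the return window?"))
```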

Evolution of RAG Over Time

Let's talk about how RAG has evolved over time.

1. Naive RAG:

The Naive RAG research paradigm represents the earliest methodology, which gained prominence shortly after the widespread adoption of ChatGPT. The Naive RAG follows a traditional process that includes indexing, retrieval, and generation. It is also characterized as a “Retrieve-Read” framework [Ma et al., 2023a].

2. Advanced RAG:

Advanced RAG has been developed with targeted enhancements to address the shortcomings of Naive RAG. In terms of retrieval quality, Advanced RAG implements pre-retrieval and post-retrieval strategies. To address the indexing challenges experienced by Naive RAG, Advanced RAG has refined its indexing approach using techniques such as sliding window, fine-grained segmentation, and metadata. It has also introduced various methods to optimize the retrieval process [ILIN, 2023].

3. Modular RAG:

The modular RAG structure diverges from the traditional Naive RAG framework, providing greater versatility and flexibility. It integrates various methods to enhance functional modules, such as incorporating a search module for similarity retrieval and applying a fine-tuning approach in the retriever [Lin et al., 2023].

Restructured RAG modules [Yu et al., 2022] and iterative methodologies like [Shao et al., 2023] have been developed to address specific issues. The modular RAG paradigm is increasingly becoming the norm in the RAG domain, allowing for either a serialized pipeline or an end-to-end training approach across multiple modules.

This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing Naive RAG, Advanced RAG, and Modular RAG.

Access the paper here: https://arxiv.org/abs/2312.10997

Chunking Strategies in RAG

Improving the efficiency of LLM applications via RAG is all great.

BUT the question is, what should be the right chunking strategy?

Chunking is the method of breaking down large files into more manageable segments (chunks) so that LLM applications get proper context and retrieval becomes easier.

In a video on YouTube, Greg Kamradt provides an overview of different chunking strategies. Let's understand them one by one.

They have been classified into five levels based on complexity and effectiveness.

Level 1: Fixed Size Chunking

This is the crudest and simplest method of segmenting text. It breaks the text into chunks of a specified number of characters, regardless of content or structure. The LangChain and LlamaIndex frameworks offer the CharacterTextSplitter and SentenceSplitter (which defaults to splitting on sentences) classes for this chunking technique.
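
A minimal sketch of fixed-size chunking with LangChain's CharacterTextSplitter; the chunk size, overlap, and source file are illustrative assumptions:

```python
from langchain.text_splitter import CharacterTextSplitter

text = open("document.txt").read()  # assumed source file

splitter = CharacterTextSplitter(
    separator="\n\n",   # split on paragraph breaks where possible
    chunk_size=500,     # target characters per chunk (illustrative)
    chunk_overlap=50,   # overlap to preserve context across boundaries
)
chunks = splitter.split_text(text)
print(len(chunks), chunks[0][:100])
```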

Level 2: Recursive Chunking

While fixed-size chunking is easier to implement, it doesn't consider the structure of the text. Recursive chunking offers an alternative. In this method, we divide the text into smaller chunks in a hierarchical and iterative manner using a set of separators. The LangChain framework offers the RecursiveCharacterTextSplitter class, which splits text using default separators ("\n\n", "\n", " ", "").
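
A minimal sketch using RecursiveCharacterTextSplitter; the splitter falls back from paragraph to line to word boundaries until chunks fit the size limit (values and source file are illustrative assumptions):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = open("document.txt").read()  # assumed source file

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],  # the default separator hierarchy
)
chunks = splitter.split_text(text)
```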

Level 3: Document Based Chunking

In this chunking method, we split a document based on its inherent structure. This approach respects the flow and structure of the content, but it may not be as effective for documents lacking clear structure.
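
For structured formats, LangChain provides format-aware splitters. A minimal sketch for Markdown, splitting on heading levels (the sample document and header mapping are illustrative assumptions):

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_doc = "# Intro\nRAG basics...\n\n## Retrieval\nHow retrieval works..."

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
# One chunk per heading section, with the headers attached as metadata.
sections = splitter.split_text(markdown_doc)
```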

Level 4: Semantic Chunking

The three levels above deal with the content and structure of documents and require maintaining a constant chunk size. Semantic chunking instead uses embeddings to extract semantic meaning and assess the semantic relationship between chunks. The core idea is to keep together chunks that are semantically similar. LlamaIndex has the SemanticSplitterNodeParser class, which splits a document into chunks using the contextual relationships between them.
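
A minimal sketch of semantic chunking with LlamaIndex's SemanticSplitterNodeParser; exact import paths and defaults vary across llama-index versions, and the embedding model and source file are assumptions:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()  # requires OPENAI_API_KEY; any embedding model works

parser = SemanticSplitterNodeParser(
    buffer_size=1,                        # sentences grouped before comparing embeddings
    breakpoint_percentile_threshold=95,   # split where semantic distance spikes
    embed_model=embed_model,
)
text = open("document.txt").read()  # assumed source file
nodes = parser.get_nodes_from_documents([Document(text=text)])
```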

Level 5: Agentic Chunking

This chunking strategy explores the possibility of using an LLM to determine how much and which text should be included in a chunk based on the context.

Know more about these chunking strategies in this article.

Advanced RAG

Let's use a simple example query, "What's the latest breakthrough in renewable energy?", to better understand these advanced techniques.

Pre-retrieval optimizations: Before the system begins to search, it optimizes the query for better outcomes. For our example, query transformation and routing might break down the query into sub-queries like "latest renewable energy breakthroughs" and "new technology in renewable energy."

This ensures the search mechanism is fine-tuned to retrieve the most accurate and relevant information.


Enhanced retrieval techniques: During the retrieval phase, hybrid search combines keyword and semantic searches, ensuring a comprehensive scan for information related to our query. Moreover, through chunking and vectorization, the system breaks extensive documents into digestible pieces, which are then vectorized.

This means our query doesn’t just pull up general information but seeks out the precise segments of texts discussing recent innovations in renewable energy.


Post-retrieval refinements: After retrieval, reranking and filtering processes evaluate the gathered information chunks. Instead of simply using the top 'k' matches, these techniques rigorously assess the relevance of each piece of retrieved data. For our query, this could mean prioritizing a segment discussing a groundbreaking solar panel efficiency breakthrough over a more generic update on solar energy.

This step ensures that the information used in generating the response directly answers the query with the most relevant and recent breakthroughs in renewable energy.

Know more in the original article: https://datasciencedojo.com/blog/rag-vs-finetuning-llm-debate/

Reranking in RAG

Traditional semantic search consists of a two-part process.

First, an initial retrieval mechanism does an approximate sweep over a collection of documents and creates a candidate document list.

Then, a re-ranker mechanism takes this candidate document list and re-ranks its elements. With reranking, we can improve the results by re-organizing them based on their relevance to the query.

Why is Re-Ranking Required?

  • The recall performance of LLMs decreases as we add more context, so simply stuffing more documents into an ever-larger context window (context stuffing) hurts quality.

  • The basic idea behind reranking is to filter the total number of documents down to a fixed, smaller number.

  • The re-ranker re-orders the records so that the most relevant items are at the top, and those can then be sent to the LLM.

  • Reranking offers a solution by finding records that may not be within the top 3 results of the initial pass and putting them into a smaller set of results that can then be fed to the LLM.

Reranking basically enhances the relevance and precision of retrieved results.
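
The linked article uses Cohere's re-ranker; as a library-agnostic sketch, a cross-encoder from sentence-transformers can play the same role. The model name and candidate list here are illustrative assumptions:

```python
from sentence_transformers import CrossEncoder

query = "latest breakthrough in renewable energy"
candidates = [
    "A new perovskite solar cell hit a record 26% efficiency this year.",
    "Solar energy has been used for decades in residential settings.",
    "Offshore wind capacity continues to grow steadily.",
]

# Cross-encoders score each (query, document) pair jointly, which is slower
# than embedding similarity but much more precise.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
top_for_llm = reranked[:3]  # only the best few chunks go into the prompt
```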

Know more in this article: https://medium.aiplanet.com/advanced-rag-cohere-re-ranker-99acc941601c

RAG Enhancement Techniques

The road to building RAG applications is not a smooth one.

You need to know some techniques to overcome the different challenges that RAG throws at you while building LLM-powered applications.

1. Transformation from Single Query to Multi Query:

Multi-Query is an advanced approach in the query transformation stage of retrieval. Unlike traditional methods where only one query is used, Multi-Query generates multiple queries and retrieves similar documents for each one. Builders use Multi-Query primarily for two reasons: enhancing suboptimal queries and expanding result sets. It compensates for users' imperfect queries by filling in gaps and retrieves more diverse results, leading to an expanded result set that can provide better answers than a single query would.
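
LangChain ships a MultiQueryRetriever that uses an LLM to rewrite the user query into several variants and merges the retrieved documents. A minimal sketch, assuming an existing LangChain vector store (`vectordb`) and an OpenAI chat model; both are assumptions, and older LangChain versions expose get_relevant_documents instead of invoke:

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# `vectordb` is assumed to be an existing LangChain vector store (e.g., Chroma).
retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

# The LLM generates several reformulations of the question; results are de-duplicated.
docs = retriever.invoke("What's the latest breakthrough in renewable energy?")
```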

2. Improving Indexed Data Quality:

Unfortunately, data cleaning is often overlooked during the development of RAG systems, with a tendency to ingest all available documents without verifying their quality. We need to ensure that the data fed into the RAG system is of high quality in order to obtain accurate answers. The principle of "garbage in, garbage out" is especially relevant here.

3. Chunking strategy and size matter when optimizing the index structure:

When setting up your Retrieval Augmented Generation (RAG) system, the size of the chunks and the chunking technique play a crucial role. They determine how much information is retrieved from the document store for processing. Choosing a small chunk size may lead to missing important details, while opting for a larger size could introduce irrelevant information.

4. Incorporation of metadata with indexed vectors:

Adding metadata alongside indexed vectors in the vector database offers significant benefits in organizing and enhancing search relevance.

5. Improving search relevance with question-based indexing:

LLMs and RAG offer incredible power by allowing users to express queries in natural language, simplifying data exploration and complex tasks. However, a common challenge arises when there's a disconnect between the concise queries users type in and the longer, more detailed documents stored in the system.

6. Improving Search Precision with Mixed Retrieval — Hybrid Search

While vector search excels at retrieving semantically relevant chunks for queries, it sometimes lacks precision in matching specific keywords. To get the best of both worlds (vector search plus full-text search), you need hybrid search.
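
A minimal sketch of hybrid search that fuses BM25 keyword scores (via the rank_bm25 package) with vector similarity using reciprocal rank fusion; the documents, model, and fusion constant are illustrative assumptions. In practice, a database with native hybrid search (such as SingleStore) does this server-side:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Perovskite solar cells reached record efficiency in 2024.",
    "Wind turbines convert kinetic energy into electricity.",
    "Battery storage costs have fallen sharply.",
]
query = "record solar cell efficiency"

# Keyword side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_rank = np.argsort(bm25.get_scores(query.lower().split()))[::-1]

# Semantic side: cosine similarity over normalized embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
q_vec = model.encode([query], normalize_embeddings=True)[0]
vec_rank = np.argsort(doc_vecs @ q_vec)[::-1]

# Reciprocal rank fusion: combine the two rankings (k=60 is a common default).
k = 60
fused = {i: 0.0 for i in range(len(docs))}
for rank_list in (bm25_rank, vec_rank):
    for position, doc_idx in enumerate(rank_list):
        fused[int(doc_idx)] += 1.0 / (k + position + 1)

best = sorted(fused, key=fused.get, reverse=True)
print([docs[i] for i in best])
```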

Know some more techniques in this article: https://blog.stackademic.com/rag-understanding-the-concept-and-various-enhancement-techniques-608b643bf2e5

No matter which RAG technique you choose, you will always need a robust database to store your vector data; make sure to use SingleStore as your vector database.

Try SingleStore database for free: https://bit.ly/SingleStoreDB

RAG Best Practices

Depending on your use case, the requirements change: selecting a capable model, a chunking strategy, an embedding method and model, a vector database, evaluation techniques, AI frameworks, and so on.

To make RAG work perfectly, here are some key points to consider:

1. Quality of External Knowledge Source

2. Data Indexing Optimizations: Techniques such as using sliding windows for text chunking and effective metadata utilization to create a more searchable and organized index.

3. Query Enhancement: Modifying or expanding the initial user query with synonyms or broader terms to improve the retrieval of relevant documents.

4. Embedding Model: The choice of the embedding model used for retrieving relevant documents.

5. Chunk Size & Retrieval Strategy: Experiment with different chunk sizes to find the optimal length for context retrieval.

6. Integration with Language Model: The way the retrieved information is integrated with the language model's generation process is crucial.

7. Evaluation & Fine-tuning: Evaluating the performance of the RAG model on relevant datasets and tasks is important to identify areas for improvement.

8. Ethical Considerations: Ensure that the external knowledge source is unbiased and does not contain offensive or misleading information.

9. Vector database: Having a vector database that supports fast ingestion, strong retrieval performance and hybrid search is of utmost importance.

10. Response Summarization: Condensing retrieved text to provide concise and relevant summaries before final response generation.

11. Re-ranking and Filtering: Adjusting the order of retrieved documents based on relevance and filtering out less pertinent results to refine the final output.

12. LLM models: Consider LLM models that are robust and fast enough to build your RAG application.

13. Hybrid Search: Combining traditional keyword-based search with semantic search using embedding vectors to handle a variety of query complexities.

No matter which RAG technique you choose, you will always need a robust vector database to store your vector data; make sure to use SingleStore as your vector database.

Semantic Cache to Improve RAG

Fast retrieval is a must in RAG for today's AI/ML applications.

Latency and computational cost are the two major challenges while deploying these applications in production.

While RAG enhances retrieval to a certain extent, adding a semantic cache layer in between (one that stores previous user queries and decides whether to build a prompt enriched with information from the vector database or to answer from the cache) becomes a must at scale.

A semantic caching system aims to identify similar or identical user requests. When a matching request is found, the system retrieves the corresponding information from the cache, reducing the need to fetch it from the original source.
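
A minimal in-memory sketch of a semantic cache: embed each incoming query, compare it against cached query embeddings, and return the cached answer when similarity exceeds a threshold. The threshold, model, and `answer_with_rag` helper are illustrative assumptions; in production the cache would live in a database such as SingleStore:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def answer_with_rag(query: str) -> str:
    """Placeholder for the full RAG pipeline (retrieve + generate)."""
    return f"[fresh RAG answer for: {query}]"

def cached_answer(query: str, threshold: float = 0.9) -> str:
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # Cache hit: a previous query is close enough in embedding space.
    for vec, answer in cache:
        if float(vec @ q_vec) >= threshold:
            return answer
    # Cache miss: run the full pipeline and store the result.
    answer = answer_with_rag(query)
    cache.append((q_vec, answer))
    return answer

print(cached_answer("What is the refund policy?"))   # miss -> full RAG
print(cached_answer("What's your refund policy?"))   # likely a cache hit
```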

There are many solutions that can help you with semantic caching, but I recommend using the SingleStore database.

Why use SingleStore Database as the semantic cache layer?

SingleStoreDB is a real-time, distributed database designed for blazing-fast queries, with an architecture that supports a hybrid model for transactional and analytical workloads.

This pairs nicely with generative AI use cases, as it allows reading and writing data for both training and real-time tasks without the added complexity and data movement of juggling multiple products for the same job.

SingleStoreDB also has a built-in plan cache to speed up subsequent queries with the same plan.

Know more about semantic caching with SingleStore.

Advanced RAG Techniques

Building a simple RAG pipeline is easy. But a basic pipeline on its own rarely yields good results.

You need some advanced RAG techniques for your AI application.

The following is a list of enhancement points for your RAG pipeline.

  • Data Indexing Optimizations: Techniques such as using sliding windows for text chunking and effective metadata utilization to create a more searchable and organized index.

  • Query Enhancement: Modifying or expanding the initial user query with synonyms or broader terms to improve the retrieval of relevant documents.

  • Hybrid Search: Combining traditional keyword-based search with semantic search using embedding vectors to handle a variety of query complexities.

  • Fine Tuning Embedding Model: Adjusting a pre-trained model to better understand specific domain nuances, enhancing the accuracy and relevance of retrieved documents (see the sketch after this list).

  • Response Summarization: Condensing retrieved text to provide concise and relevant summaries before final response generation.

  • Re-ranking and Filtering: Adjusting the order of retrieved documents based on relevance and filtering out less pertinent results to refine the final output.
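
A minimal sketch of fine-tuning an embedding model with sentence-transformers on (query, relevant passage) pairs; the training pairs, model, and hyperparameters are illustrative assumptions, and the fit-style API shown here may differ in newer library versions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Domain-specific (query, relevant passage) pairs -- illustrative examples.
train_examples = [
    InputExample(texts=["perovskite cell efficiency", "Perovskite solar cells reached 26% efficiency."]),
    InputExample(texts=["offshore wind growth", "Offshore wind capacity grew 15% last year."]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Multiple-negatives ranking loss treats other in-batch passages as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

model.save("fine-tuned-domain-embedder")
```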

Adopting a robust database that can do hybrid search, has great integrations with AI frameworks, and can help you with fast ingestion and vector storage is very important.

This is where the SingleStore database comes in handy. Sign up and use it for free: https://bit.ly/SingleStoreDB

The complete article on advanced RAG techniques by Necati Demir is here: https://blog.demir.io/advanced-rag-implementing-advanced-techniques-to-enhance-retrieval-augmented-generation-systems-0e07301e46f4


Production Ready RAG Pipelines

Vectorize helps you build AI apps faster and with less hassle. It automates data extraction, finds the best vectorization strategy using RAG evaluation, and lets you quickly deploy real-time RAG pipelines for your unstructured data. Your vector search indexes stay up-to-date, and it integrates with your existing vector database, so you maintain full control of your data. Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.

Know more about how Vectorize can help you build AI apps faster.


Verifying the Correctness of RAG Responses

How do we verify the correctness of RAG responses?

My complete video on evaluating RAG workflow: https://youtu.be/MP6hHpy213o

Attached is a small clip of my video that talks about the different steps involved in a RAG workflow.

RAG evaluation is important because it helps ensure the effectiveness of our RAG systems. Basically, it ensures the RAG pipeline generates coherent responses and meets end-user needs.

RAG Evaluation Strategies

The field of RAG evaluation continues to evolve & it is very important for AI/ML/Data engineers to know these concepts thoroughly.

RAG evaluation includes evaluating both the retrieval and the generation components for a given input.

At a high level, RAG evaluation algorithms can be bifurcated into two categories: 1) where the ground truth (the ideal answer) is provided by the evaluator/user, and 2) where the ground truth (the ideal answer) is also generated by another LLM.

For ease of understanding, these categories can be further classified into five sub-categories.

1. Character based evaluation

2. Word based evaluation

3. Embedding based evaluation

4. Mathematical Framework

5. Experimental based framework

Let’s take a look at each of these evaluation categories:

1. Where the ground truth is provided by the evaluator.

→ Character based evaluation algorithm:

As the name indicates, this algorithm computes a score based on the character-by-character difference between the reference (ground truth) and the RAG-generated output.

→ Word based evaluation algorithm:

As the name indicates, this algorithm computes a score based on the word-by-word difference between the reference (ground truth) and the RAG output.

→ Embedding based evaluation algorithms:

Embedding-based algorithms work in two steps.

Step 1: Create embeddings for both the generated text and the reference text using a particular embedding technique

Step 2: Use a distance measure (like cosine similarity) to evaluate the distance between the embeddings of the generated text and the reference text.
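
A minimal sketch of such an embedding-based evaluation score; the model and the example texts are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "Refunds are available within 30 days of purchase."
reference = "Customers can get a refund up to 30 days after buying."

# Step 1: embed both texts; Step 2: score them with cosine similarity.
gen_vec, ref_vec = model.encode([generated, reference], normalize_embeddings=True)
similarity = float(gen_vec @ ref_vec)   # ~1.0 = same meaning, ~0 = unrelated
print(f"embedding similarity: {similarity:.3f}")
```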

2. Where the ground truth is also generated by LLM (LLM assisted evaluation)

→ Mathematical Framework — RAGAS Score

RAGAS is one of the most common and comprehensive frameworks for assessing RAG accuracy and relevance. RAGAS bifurcates the evaluation into retrieval and generation perspectives.
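
A sketch of how a RAGAS evaluation typically looks; the dataset columns, metric names, and import paths follow the ragas documentation but can differ between library versions, so treat this as an assumption-laden outline rather than copy-paste code:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question":     ["What is the refund window?"],
    "answer":       ["Refunds are available within 30 days of purchase."],
    "contexts":     [["Our return policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Customers can request a refund within 30 days."],
})

# Each metric is scored by an LLM judge (an OpenAI key is expected by default).
result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```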

→ Experimental Based Framework — GPT score

The effectiveness of this approach in achieving desired text evaluations through natural language instructions has been demonstrated by experimental results on four text generation tasks, 22 evaluation aspects, and 37 corresponding datasets. Know more about RAG evaluation in the original article.


Must Attend AI Conference in San Francisco

Attend the most awaited AI conference in San Francisco, happening on the 3rd of October 2024.

You'll have the opportunity to hear from leaders in the field, including Jerry Liu, CEO of LlamaIndex, among others, and dive into some hands-on AI sessions.

If you are really interested in attending this conference, where you will get to meet some great AI minds in the industry, let me know. I have some huge discount coupons I can share with you. Contact me through my email [[email protected]] and I'll share the discount coupon code with you.


BTW, if you are a RAG fan just like me, I have compiled the most extensive guide/e-book on RAG.

You can download it for FREE!


And hey, don’t forget to subscribe to my YouTube channel!

Thank You!!!

