Retrieval Augmented Generation: Exploring Architectural Patterns and Implementation Strategies
Retrieval-Augmented Generation (RAG) integrates retrieval-based and generative methods to strengthen natural language processing systems. Emerging after the surge of large-language-model-based generative approaches, it aims to address inherent limitations in those methods. By examining its unique contributions and underlying methodology, we will uncover the different architectural patterns of RAG.
LLM Evolution and Need for RAG
With the evolution of generative AI, Large Language Models (LLMs) have become the key technology powering many NLP applications. To understand them, it is worth looking at how their era started.
Over time, language models have evolved from basic rule-based systems (ELIZA in 1966) to advanced neural networks (RNNs and LSTMs in the 2010s), showcasing both remarkable innovations and persistent limitations. Early rule-based models relied on fixed rules, like grammar checkers. While effective to some extent, they lacked the flexibility to capture the nuances of human language.
Machine learning brought a shift to statistical models such as N-gram models, which learn word-sequence probabilities from large text corpora. The real breakthrough came with neural networks, and deep learning in particular, which revolutionized the field by generating human-like text; for instance, the GPT models behind OpenAI's ChatGPT were built with deep learning frameworks such as PyTorch. Yet traditional Large Language Models (LLMs) have drawbacks.
LLMs are deep learning models trained on massive datasets to understand and generate content. Because they are trained on huge amounts of public data, one model can respond to many types of questions. Once trained, however, an LLM cannot access data beyond its training cutoff. This makes LLMs static: they may respond incorrectly, give out-of-date answers, or hallucinate when asked about data they were never trained on.
To make LLMs applicable to specific domains, organizations need models that understand those domains and answer from their own data rather than giving generalized answers. For example, a customer-service bot may need to be grounded in organization-specific data.
To make LLMs more accurate and more up-to-date, a new framework was introduced: Retrieval-Augmented Generation, or RAG.
The ‘Generation’ part of RAG refers to the LLM, which generates text in response to a query known as a prompt.
The ‘Retrieval-Augmented’ part addresses the problem of out-of-date information. Instead of relying only on what the LLM already knows, the system consults a content store as a reference. That store can be open, like the Internet, or closed, like a collection of documents and policies. Thanks to this additional content store, a user asking the LLM a question gets an answer grounded in current, relevant material.
Retrieval Augmented Generation (RAG) is a notable hybrid approach. By integrating retrieval mechanisms, RAG models access external knowledge bases, ensuring both fluency and factual accuracy in generated responses.
How does RAG work?
To explain how the retrieval-augmented generation framework works, let's consider a use case. Assume a data science team is tasked with building a chatbot to support the legal advisors at a firm and has various options for developing such an app.
They could build an LLM from scratch and then adapt it to the task by fine-tuning on company data for different cases and laws, but this could get very expensive. Simply using ChatGPT or another popular LLM-powered chatbot would not help much either, given context-window limits, the lack of domain-specific knowledge (the firm's proprietary data), stale information, and prohibitive operational costs. In this context, the most sensible approach is the RAG framework.
To produce a response, our chatbot would go through the following process (sketched in code below):
1. Retrieval: encode the advisor's question and search the firm's document store for the most relevant passages.
2. Augmentation: combine those passages with the original question and any instructions into a single prompt.
3. Generation: pass the augmented prompt to the LLM, which produces an answer grounded in the retrieved material.
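As a concrete illustration, here is a minimal sketch of that retrieve-augment-generate loop in Python. It assumes the sentence-transformers library for embeddings; the `generate()` function is a hypothetical stand-in for whichever LLM client the team actually uses, and the prompt wording is illustrative.

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model

def generate(prompt: str) -> str:
    # Placeholder for a real LLM call (OpenAI, a local model, etc.);
    # echoed here so the sketch runs end to end without credentials.
    return f"[LLM would answer from this prompt]\n{prompt}"

def answer(query: str, documents: list[str], top_k: int = 3) -> str:
    # 1. Retrieve: rank documents by cosine similarity to the query embedding.
    doc_vecs = model.encode(documents, normalize_embeddings=True)
    q_vec = model.encode(query, normalize_embeddings=True)
    scores = doc_vecs @ q_vec  # cosine similarity, since vectors are unit length
    context = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
    # 2. Augment: fold the retrieved passages into the prompt.
    prompt = ("Answer using only the context below.\n\nContext:\n"
              + "\n---\n".join(context) + f"\n\nQuestion: {query}")
    # 3. Generate: the LLM produces the grounded response.
    return generate(prompt)
```

The same three stages reappear, in more elaborate form, in every RAG architecture discussed below.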
Types of RAG architecture
RAG systems can be split into three categories: Naive, Advanced, and Modular.
Naive RAG is the simplest form: the model retrieves relevant information from a dataset and then generates a response based on what it retrieved. However, it lacks the sophistication needed for complex queries. Naive RAG takes a monolithic model like GPT-3 and simply conditions it on retrieved evidence passages, appending them to the input context. This approach is simple but has efficiency and coherence issues. The Naive RAG technique operates through a systematic process encompassing indexing, retrieval, augmentation, and response generation, with each step playing a crucial role in producing a useful answer.
Indexing involves extracting and cleansing data from diverse sources, such as files and URLs, before converting them into plain text. This preparation often involves breaking down extensive content into manageable chunks and transforming them into high-dimensional vectors using embedding models.
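Continuing the sketch above, indexing might look like the following. The chunk size, overlap, and the `policies.txt` source file are illustrative assumptions; `model` is the embedding model loaded earlier.

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Overlapping character windows: the overlap keeps sentences that
    # straddle a boundary visible in both neighbouring chunks.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Build the index once, offline: every chunk becomes a unit-length vector.
corpus = chunk(open("policies.txt", encoding="utf-8").read())
index = model.encode(corpus, normalize_embeddings=True)  # shape: (n_chunks, dim)
```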
Retrieval hinges on encoding user queries into vectors and employing similarity search methods like cosine similarity to locate relevant data chunks.
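With unit-normalized embeddings, cosine similarity reduces to a dot product, so retrieval over the index built above takes only a few lines (reusing `model`, `corpus`, and `index` from the previous sketches):

```python
import numpy as np

def retrieve(query: str, k: int = 4) -> list[str]:
    # Encode the query into the same vector space as the chunks, then
    # take the k chunks whose vectors point most nearly the same way.
    q = model.encode(query, normalize_embeddings=True)
    top = np.argsort(index @ q)[::-1][:k]
    return [corpus[i] for i in top]
```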
In the generation phase, retrieved chunks, user queries, and additional instructions are amalgamated into prompts for the LLM to generate responses. However, despite its utility, the Naive RAG approach suffers from several drawbacks, including susceptibility to hallucinations, low precision in chunk retrieval, reliance on outdated data, and potential constraints on response quality.
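A sketch of that amalgamation step, building on `retrieve()` and `generate()` above; the instruction text is an illustrative choice, not a canonical template:

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    # Number the chunks so the model (and a human reviewer) can see
    # which passage supports which part of the answer.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return ("Answer from the numbered context only; say 'I don't know' "
            "if the context is insufficient.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

question = "What is the notice period in our standard NDA?"
response = generate(build_prompt(question, retrieve(question)))
```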
Advanced RAG methodology represents a significant advancement over Naive RAG, addressing its limitations through a comprehensive approach. It encompasses Pre-Retrieval and Post-Retrieval processes, aiming to enhance response quality.
Pre-Retrieval efforts focus on refining indexed content by eliminating irrelevant data, resolving ambiguities, updating outdated information, and ensuring contextual relevance. Incorporating metadata enriches the quality of retrieved documents, while query rewriting techniques optimize prompts for the LLM, tailoring them to its characteristics.
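Query rewriting is often delegated to the LLM itself. A minimal sketch, reusing `generate()` and `retrieve()` from the earlier code; the rewrite prompt is an assumption for illustration:

```python
def rewrite_query(raw_query: str) -> str:
    # Let the LLM resolve ambiguity and expand shorthand before retrieval,
    # so the similarity search sees a clean, self-contained query.
    return generate(
        "Rewrite the user question as a clear, self-contained search query. "
        "Expand abbreviations and drop conversational filler.\n"
        f"Question: {raw_query}\nRewritten query:"
    )

docs = retrieve(rewrite_query("hey, what abt the NDA thing we discussed?"))
```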
Post-Retrieval strategies involve merging retrieved chunks with user queries adeptly to avoid exceeding context window limits and minimizing noise. Techniques like re-ranking prioritize contextual relevance over mere vector similarity, while prompt compression reduces noise by condensing irrelevant information, emphasizing key passages, and trimming excessive context length. These enhancements collectively contribute to more precise and informative response generation.
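Re-ranking is commonly implemented with a cross-encoder, which scores each (query, chunk) pair jointly rather than comparing pre-computed vectors. One possible sketch, using a publicly available sentence-transformers model:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 3) -> list[str]:
    # The cross-encoder reads query and chunk together, so it judges
    # contextual relevance, not just geometric closeness of embeddings.
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    # Keeping only the best few chunks also protects the context window.
    return [c for _, c in ranked[:keep]]
```

A typical pattern is to over-retrieve (say, ten chunks) with the cheap vector search, then let the slower cross-encoder pick the best three.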
Modular RAG is the most sophisticated, offering customizable modules for different types of data and queries, making it highly adaptable to specific needs. Modular RAG breaks the system into explicit retriever, re-ranker, and generator modules. This provides more flexibility and specialization.
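One way to express that decomposition is as swappable interfaces, so each module can be fine-tuned or replaced without touching the others. A sketch under those assumptions, reusing `build_prompt()` and the module functions from the earlier snippets:

```python
from typing import Protocol

class Retriever(Protocol):
    def __call__(self, query: str, k: int) -> list[str]: ...

class Reranker(Protocol):
    def __call__(self, query: str, chunks: list[str], keep: int) -> list[str]: ...

class Generator(Protocol):
    def __call__(self, prompt: str) -> str: ...

class ModularRAG:
    """Serialized pipeline; any module can be swapped independently."""
    def __init__(self, retriever: Retriever, reranker: Reranker, generator: Generator):
        self.retriever, self.reranker, self.generator = retriever, reranker, generator

    def run(self, query: str) -> str:
        candidates = self.retriever(query, k=10)           # recall-oriented first pass
        best = self.reranker(query, candidates, keep=3)    # precision-oriented second pass
        return self.generator(build_prompt(query, best))   # grounded generation

# pipeline = ModularRAG(retrieve, rerank, generate)  # functions from earlier sketches
```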
The Modular RAG methodology introduces a departure from traditional Naive RAG techniques, incorporating advanced functionalities to enhance performance. Its framework includes a search module for similarity retrieval and adopts a fine-tuning approach in the retriever, offering greater adaptability through restructured modules and iterative methodologies.
This approach allows for both serialized pipelines and end-to-end training, enabling more effective resolution of specific challenges. The relationship between Naive RAG, Advanced RAG, and Modular RAG demonstrates an evolutionary progression, with the latter incorporating techniques like Hybrid Search, Recursive Retrieval and Querying, StepBack approach, Sub-Queries, and Hypothetical Document Embeddings.
Hybrid Search optimizes performance by combining various search techniques, ensuring consistent retrieval of context-rich information. Recursive Retrieval employs a two-step method to balance efficiency and contextual richness, while the StepBack approach enhances reasoning around broader concepts.
Sub-queries offer flexibility in query strategies, and Hypothetical Document Embeddings, while effective, may have limitations in unfamiliar subjects. Overall, Modular RAG presents a comprehensive approach to optimizing response generation in the RAG domain.
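As an example of one such technique, hybrid search can be sketched as a weighted fusion of lexical BM25 scores with the embedding scores from earlier. This assumes the rank-bm25 package; the `alpha` weight, whitespace tokenizer, and min-max normalization are illustrative choices:

```python
# pip install rank-bm25
import numpy as np
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in corpus])  # lexical index over the same chunks

def hybrid_retrieve(query: str, k: int = 4, alpha: float = 0.5) -> list[str]:
    lexical = bm25.get_scores(query.lower().split())
    semantic = index @ model.encode(query, normalize_embeddings=True)
    # Min-max normalize each signal so the two score scales are comparable,
    # then blend: alpha = 1 is pure semantic, alpha = 0 is pure keyword search.
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    fused = alpha * norm(semantic) + (1 - alpha) * norm(lexical)
    return [corpus[i] for i in np.argsort(fused)[::-1][:k]]
```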
Future of RAG
The future of RAG lies in both vertical optimization (improving specific aspects of the model) and horizontal expansion (applying the model to a wide range of tasks). This versatility is vital in the AI ecosystem, as it allows RAG to adapt to various industries and use cases, from customer service to content creation.
The future of Retrieval-Augmented Generation (RAG) appears to be a landscape marked by extensive innovation and integration, poised to significantly enhance the capabilities of natural language processing systems. One of the primary directions for RAG’s evolution is its deeper integration with increasingly diverse and dynamic data sources.
This integration will enable RAG systems to offer even more accurate and contextually relevant responses, particularly in rapidly changing fields like news, scientific research, and social media trends. Another promising avenue is the development of more sophisticated retrieval mechanisms that can understand and process complex queries more effectively.
Additionally, there’s potential for RAG to be tailored for specific industries, like healthcare or law, where accuracy and up-to-date information are crucial. In terms of technology, advancements in machine learning algorithms and computational power will allow RAG systems to become more efficient and scalable, handling larger datasets and more complex models with ease. Furthermore, as AI ethics and transparency become increasingly important, RAG systems will likely incorporate mechanisms to explain their retrieval and generation processes, enhancing trust and reliability. The integration of RAG with other AI technologies, such as predictive analytics and automated decision-making systems, could open new avenues for applications, making RAG a cornerstone technology in the next generation of AI solutions.
The future of RAG is not just about technological advancements but also about creating AI systems that are more aligned with human needs, understanding, and ethical considerations.
References
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [https://arxiv.org/abs/2005.11401]
Retrieval-Augmented Generation for Large Language Models: A Survey [https://arxiv.org/abs/2312.10997]
What Is Retrieval-Augmented Generation, aka RAG? [https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/]
RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [https://arxiv.org/abs/2403.09040]
Optimizing Retrieval-augmented Reader Models via Token Elimination [https://arxiv.org/abs/2310.13682]