RAG Architecture Deep Dive
Frank Denneman
Chief Technologist for AI | AI and Advanced Services | VMware Cloud Foundation Division | Broadcom
Retrieval Augmented Generation (RAG) is a technique for augmenting Large Language Model (LLM) knowledge with additional data. In a standard Gen-AI application that uses an LLM as its sole knowledge source, the model generates responses based only on the user query and the knowledge it was trained on. It does not actively retrieve information beyond what is encoded in its parameters during training or fine-tuning. In contrast, a RAG architecture integrates both retrieval-based and generative components. It includes a retriever component that obtains relevant information from a large collection of text (typically referred to as a corpus). This corpus is stored in an embedding database, most commonly referred to as a vector database. The retrieved information is then used by the generative component to produce responses. The LLM is at the heart of the generative component.
Most conversations about RAG architecture focus on the retrieval process, the vector database, and the LLM. However, to understand the architecture and the workflows well enough to design and right-size the infrastructure, it is important to understand the indexing workflow and take the shared components into account when allocating resources. In particular, the tokenizer and embedding model play a critical role in ensuring the smooth operation of both retrieval and generation.
This is the first part of a series in which I slowly move closer to Private AI foundation components to run this application in your virtual datacenter on-prem. Let's take a closer look at the indexing and retrieval process and explore the roles of the various components in the key stages of retrieval augmented generation.
This is a diagram of the components, their inputs, and their outputs (elements with dotted lines). Most data scientists design such a system with the help of frameworks such as Langchain or Llama Index. These frameworks provide the functionality depicted below, but sometimes without distinct, separate components for each individual task. The diagram shows the conceptual tasks and components used in the indexing and retrieval processes. Its primary goal is to highlight the different phases the data goes through and the shared components used by both processes.
Building the Foundation for Retrieval: The Indexing Process in RAG Architectures
The indexing stage in a RAG architecture lays the groundwork for efficient information retrieval. It involves transforming a vast collection of data sources, regardless of structure (unstructured documents like PDFs, semi-structured data like JSON, or structured data from databases), into a format readily usable by LLMs. This process can be broken down into a Load-Transform-Embed-Store workflow.
Loading Diverse Data Sources:
The indexing process begins with data loaders, which act as information gatherers. They retrieve data from various sources, including unstructured documents (e.g., PDFs, docs), semi-structured data (e.g., XML, JSON, CSV), and even structured data residing in SQL databases. These loaders then convert the retrieved data into a standardized document format for further processing.
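To make the load step concrete, here is a minimal, framework-free Python sketch that normalizes a couple of file types into a common document format. The Document class and loader functions are illustrative assumptions (PDF loading is omitted because it needs an extra parsing library); in practice, frameworks such as Langchain and Llama Index ship ready-made loaders for exactly this job.

```python
# A minimal sketch of the load step, assuming local JSON and CSV files.
# The Document class and loader functions are illustrative, not a framework API.
import csv
import json
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    text: str                                     # raw textual content
    metadata: dict = field(default_factory=dict)  # e.g., source path and type

def load_json(path: Path) -> Document:
    data = json.loads(path.read_text())
    return Document(json.dumps(data, indent=2), {"source": str(path), "type": "json"})

def load_csv(path: Path) -> Document:
    with path.open(newline="") as fh:
        rows = [" | ".join(row) for row in csv.reader(fh)]
    return Document("\n".join(rows), {"source": str(path), "type": "csv"})

def load_all(folder: str) -> list[Document]:
    loaders = {".json": load_json, ".csv": load_csv}
    docs = []
    for path in Path(folder).iterdir():
        loader = loaders.get(path.suffix.lower())
        if loader:
            docs.append(loader(path))
    return docs
```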
Transforming Data for Efficient Processing:
Document splitters take the stage next. Their role is crucial in organizing the data and preparing it for efficient processing by the embedding model. They achieve this by segmenting the documents into logical units – sentences or paragraphs – based on predefined rules. This segmentation ensures that information remains semantically intact while preparing it for further processing. Imagine a large research paper being fed into the system. The document splitter receives the PDF from the loader and meticulously splits it into individual paragraphs for further processing.
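As an illustration of this step, the sketch below splits raw text on blank lines so paragraphs stay intact, and only hard-wraps paragraphs that exceed a maximum size. The chunk-size rule is an assumption for the example; production splitters (such as the recursive character splitters in Langchain or Llama Index) apply more nuanced logic and often add overlap between chunks.

```python
# A minimal sketch of a document splitter: keep paragraphs whole where possible,
# hard-wrap only those that exceed max_chars. The chunk size is an illustrative choice.
def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    chunks: list[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        chunks.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
    return chunks

print(split_into_chunks("First paragraph.\n\nA second, much longer paragraph..."))
```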
Tokenization: The Building Blocks of Meaning:
Following segmentation, the tokenizer steps in. It takes each logical unit (e.g., paragraph) from the document splitter and breaks it into its fundamental building blocks: tokens. These tokens can be individual words, sub-words, or even characters, depending on the chosen embedding model and the desired level of granularity. Accurate tokenization is critical for tasks that rely on understanding the meaning of the text, as it forms the basis for how the LLM interprets the information. Since the tokenizer essentially defines the vocabulary understood by the entire RAG architecture, utilizing a single shared tokenizer process across all components dealing with text processing and encoding is recommended. Using a single tokenizer ensures consistency throughout the system.
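The snippet below shows what tokenization looks like in practice, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary purely as an example; the actual tokenizer is dictated by the embedding model you deploy. The key point is that the same tokenizer should serve both the indexing and retrieval paths.

```python
# A minimal tokenization sketch, assuming the Hugging Face `transformers` package.
# The model name is only an example; use the tokenizer that matches your embedding model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Retrieval Augmented Generation combines search with generation."
tokens = tokenizer.tokenize(text)   # sub-word strings, e.g. ['retrieval', 'augmented', ...]
token_ids = tokenizer.encode(text)  # integer IDs that the embedding model consumes
print(tokens[:6], token_ids[:6])
```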
Embedding: Capturing Semantic Meaning:
Once tokenization is complete, the embedding model takes center stage. Its role is to convert each token into a numerical vector representation, capturing its semantic meaning within the context of the surrounding text. Pre-trained embedding models, either word embeddings or contextual embeddings, achieve this by mapping the tokens into these vector representations.
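As a concrete illustration, the sketch below uses the sentence-transformers package with the all-MiniLM-L6-v2 model, which is just one common public choice rather than anything this architecture prescribes; any embedding model that is shared between indexing and retrieval fills this role.

```python
# A minimal embedding sketch, assuming the `sentence-transformers` package.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

chunks = ["First paragraph of the research paper...",
          "Second paragraph of the research paper..."]
vectors = embedder.encode(chunks, normalize_embeddings=True)
print(vectors.shape)  # (2, 384): one dense vector per chunk
```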
Finally, an indexing component takes over. It packages the generated embedding vectors along with any associated metadata (e.g., document source information) and sends them to a specialized embedding database – the vector database (vector DB) – for efficient storage. This database becomes the foundation for the retrieval stage, where the RAG architecture searches for relevant information based on user queries.
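A minimal sketch of that store step is shown below, assuming FAISS as the vector index and a plain Python list as the metadata store; a real vector database additionally handles persistence, updates, filtering, and replication.

```python
# A minimal sketch of the store step: vectors go into a FAISS index, metadata
# into a parallel list so each hit can be traced back to its source document.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["First paragraph of the research paper...",
          "Second paragraph of the research paper..."]
vectors = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))

metadata_store = [{"source": "research-paper.pdf", "chunk_id": i, "text": c}
                  for i, c in enumerate(chunks)]
```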
The Stored Foundation:
The vector database plays a crucial role in efficient retrieval. It stores the embedding vectors in a high-dimensional vector space, allowing for fast and effective search operations based on vector similarity. This stored foundation paves the way for the retrieval process, where the RAG architecture efficiently locates relevant information from the indexed data based on user queries, ultimately enabling the LLM to generate informative and relevant responses.
Retrieval: Efficiently Finding Relevant Information
The retrieval stage in a RAG architecture is where the magic happens. Here, the system efficiently locates relevant information from the indexed data to fuel the LLM's generation capabilities. This process ensures that the user's query (often called a prompt in NLP) is processed in the same 'language' used for creating and storing the embeddings during indexing.
Understanding User Queries:
The process begins with the user submitting a query, often phrased as a natural language prompt (question, instruction, etc.). But before the user can submit a query, the organization must ensure that only authorized users can use the RAG system. When the vector database is filled with confidential data, it becomes a one-stop shop for company secrets. For private AI purposes, the API gateway acts as a central entry point for 'external' requests and queries and handles tasks such as authentication, rate limiting, and request logging, ensuring that interactions with the RAG system are secure and well-documented. In addition, the API gateway plays a crucial role in managing and orchestrating communication between the different components of the system.
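For illustration only, here is what such an entry point could look like, assuming FastAPI and a static API-key check; a production gateway would plug into the organization's identity provider and add rate limiting and request logging.

```python
# A minimal API gateway sketch, assuming FastAPI and a hypothetical key store.
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
VALID_API_KEYS = {"demo-key"}  # hypothetical; use a real identity provider in production

@app.post("/query")
def query_endpoint(prompt: str, x_api_key: str = Header(...)):
    if x_api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Unauthorized")
    # Hand the validated prompt to the retrieval pipeline (not shown here).
    return {"status": "accepted", "prompt": prompt}
```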
Once the user is validated and submits the query (called a prompt), the prompt must be translated into the same 'language' used to create and store the embeddings during indexing. But before processing the prompt further, it's essential to apply guardrails to ensure the prompt meets certain safety, ethical, and quality standards. Guardrails play a crucial role in preventing the system's misuse for malicious purposes and ensuring that the generated responses align with the organization's ethical guidelines and expectations. To achieve this, the system leverages the same tokenizer and embedding model employed in the indexing stage. The tokenizer breaks the prompt into tokens (words or subwords), and the embedding model then converts it into a vector representation. This vector representation captures the semantic meaning of the prompt in the same vector space used during indexing.
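The sketch below ties these two ideas together: a deliberately simple blocked-phrase guardrail followed by query embedding with the same model used during indexing. The blocked-phrase list is a stand-in assumption; real deployments typically use dedicated guardrail or moderation services.

```python
# A minimal sketch of prompt pre-processing: a toy guardrail check followed by
# embedding the prompt with the SAME model that was used during indexing.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model
BLOCKED_PHRASES = {"ignore previous instructions"}   # hypothetical policy list

def preprocess_prompt(prompt: str):
    if any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES):
        raise ValueError("Prompt rejected by guardrail policy")
    return embedder.encode([prompt], normalize_embeddings=True)

query_vector = preprocess_prompt("What does the paper say about tokenization?")
```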
Matching Queries with Encoded Information:
With the query transformed into a vector, the retrieval process can efficiently search through the collection of embeddings stored in the vector database. This search hinges on the principle of vector similarity: the system seeks embeddings within the database that closely resemble the prompt's vector representation. The text chunks behind these retrieved embeddings, typically relevant sections from the indexed data, are referred to as passages.
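Stripped of any particular database, the core of that search is a nearest-neighbor lookup. The self-contained sketch below uses cosine similarity over a toy set of random vectors just to show the mechanics; a vector database performs the same comparison at much larger scale with approximate-nearest-neighbor indexes.

```python
# A minimal, self-contained sketch of similarity search: cosine similarity between
# a query vector and stored passage embeddings, returning the top-k matches.
import numpy as np

def top_k_passages(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 3):
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q                       # cosine similarity per stored passage
    best = np.argsort(scores)[::-1][:k]  # indices of the k most similar passages
    return [(int(i), float(scores[i])) for i in best]

rng = np.random.default_rng(0)  # toy data: 10 random 384-dimensional embeddings
print(top_k_passages(rng.normal(size=384), rng.normal(size=(10, 384))))
```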
Prioritizing Relevant Passages:
Not all retrieved passages hold equal weight. A ranking service steps in to prioritize the most relevant ones. This service applies a ranking algorithm, considering factors like the degree of similarity between the passage's embedding and the prompt's vector, to assign a score to each retrieved passage. This scoring helps identify the passages most likely to contain information that addresses the user's query.
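One common way to implement such a ranking service is with a cross-encoder that scores each (query, passage) pair directly. The sketch below assumes the sentence-transformers CrossEncoder with the public ms-marco-MiniLM-L-6-v2 model; the architecture does not prescribe a specific algorithm, so treat this as one option among many.

```python
# A minimal re-ranking sketch, assuming the `sentence-transformers` CrossEncoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does the tokenizer affect retrieval?"
passages = [
    "The tokenizer defines the vocabulary shared by indexing and retrieval.",
    "The vector database stores embeddings for fast similarity search.",
]

scores = reranker.predict([(query, p) for p in passages])  # one relevance score per passage
ranked = [p for _, p in sorted(zip(scores, passages), reverse=True)]
```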
Preparing Information for the LLM:
The integration module acts as the bridge between the retrieved information and the LLM. It receives the ranked passages and performs crucial formatting tasks. Depending on the specific task and the system design, the integration module might employ summarization techniques to condense lengthy passages or utilize answer extraction methods to pinpoint the most relevant information within a passage. In some scenarios, the module might select a single top-ranked passage for processing (single-passage processing), while others might leverage multiple high-ranking passages (multi-passage processing). The integration module then prepares these passages, potentially concatenating them or processing them individually to ensure they align with the input format expected by the LLM.
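In its simplest form, this integration step boils down to assembling a prompt from the top-ranked passages and the user's question, as in the sketch below. The template and the number of passages are illustrative assumptions; summarization or answer extraction would slot in before this formatting.

```python
# A minimal sketch of the integration step: concatenate the top-ranked passages
# with the user prompt into the input format the LLM expects. The template is
# an illustrative assumption, not a fixed standard.
def build_llm_input(user_prompt: str, ranked_passages: list[str], max_passages: int = 3) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(ranked_passages[:max_passages]))
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {user_prompt}\nAnswer:")

llm_input = build_llm_input(
    "How does the tokenizer affect retrieval?",
    ["The tokenizer defines the vocabulary shared by indexing and retrieval."],
)
print(llm_input)
```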
Feeding the LLM:
Finally, the integration module presents the prepared passages alongside the embedded prompt to the LLM. This empowers the LLM to process the information, drawing upon its knowledge and understanding of language to generate a comprehensive and informative response that aligns with the user's query.
From Generation to User Experience: The Final Steps in RAG
The journey of a RAG response does not end when the LLM generates its initial output. Several crucial steps ensure the user receives a refined, informative, and well-presented response. This stage encompasses post-processing, formatting, user interface integration, and, ultimately, user presentation.
Polishing the Response: Post-Processing
The raw output from the LLM might undergo post-processing steps to enhance its quality. This could involve tasks like correcting grammar, removing repetition, and improving coherence and readability.
These post-processing steps ensure the generated response is informative, grammatically sound, and easy for the user to understand. Additionally, guardrails can evaluate and filter the generated outputs to ensure they meet predefined criteria. This may involve automated checks for compliance with safety, ethical, or quality standards, as well as human review processes to verify the suitability of the generated content.
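As a small illustration of an automated output check, the sketch below enforces a hypothetical blocked-term and length policy; production systems would rely on dedicated guardrail or moderation tooling and, where required, human review.

```python
# A minimal output-guardrail sketch with a hypothetical policy; real deployments
# use dedicated guardrail/moderation services rather than a hard-coded list.
def check_response(response: str, max_chars: int = 4000) -> str:
    blocked_terms = {"internal use only"}  # hypothetical policy terms
    if any(term in response.lower() for term in blocked_terms):
        raise ValueError("Response rejected by output guardrail")
    return response[:max_chars].strip()
```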
Tailoring the Response for Presentation
The generated response might need formatting adjustments before presentation, depending on the application and user interface requirements. This formatting could involve structuring the content and adding visual elements. When structuring the response, the content is organized into well-defined paragraphs for improved readability. Adding visual elements such as bullet points, headers, or even multimedia content (where applicable) enhances clarity and user engagement.
Seamless Integration: User Interface and Presentation
Once processed and formatted, the response is seamlessly integrated into the application or platform's user interface. This user interface could be a web page, mobile app, chat interface, or any other medium through which users interact with the system. This integration ensures a smooth flow of information from the RAG architecture to the user experience layer.
User Presentation and Interaction:
Finally, the polished and formatted response reaches the user through the chosen interface. The user can then review the information, provide necessary feedback, or take further actions based on the response. Depending on the application, users can interact further with the system by asking follow-up questions or initiating new tasks.
Maintaining a Positive User Experience
Throughout this final stage, a critical focus remains on the user experience. The generated response should meet the user's accuracy, relevance, and readability expectations. Additionally, error-handling mechanisms should be in place to address any potential issues that arise during response generation or presentation. User feedback loops can also be implemented to continuously improve the performance of the RAG model and ensure it delivers consistently valuable experiences.
By effectively managing these final steps, RAG architectures can not only generate high-quality responses but also ensure they are presented in a way that maximizes user satisfaction and understanding.
The Unsung Heroes of RAG: Shared Components and Resource Management
In the Retrieval-Augmented Generation (RAG) architecture, the tokenizer, embedding model, and vector database operate harmoniously in indexing and retrieval processes. Together, they establish a common language that spreads throughout the entire RAG architecture, facilitating seamless communication and interaction between its components. Consequently, when crafting the infrastructure for RAG deployment, it's important to recognize the resource requirements of these foundational components.
While it may be tempting to focus solely on provisioning resources for the vector database and the large language model, overlooking the tokenizer's and embedding model's resource consumption can lead to inefficiencies and bottlenecks in overall system performance. Therefore, a holistic approach to resource sizing is essential, considering the demands of all components involved in the indexing and retrieval pipelines.
Dealing with multiple volatile and complex data sources further complicates the resource-sizing tasks. Scalability becomes a very important consideration, as the architecture must handle the potential growth in data volume and sources over time. In successful implementations of RAG architectures, organizations often find themselves persuading other teams and business units to expose their data to the Gen-AI application. However, this influx of data brings with it challenges related to resource management and allocation.
Many data scraping processes in RAG architectures operate as batch operations, which can introduce spiky, hard-to-predict infrastructure resource consumption patterns. Understanding how these batch processes affect resource utilization and devising strategies for on-demand provisioning and de-provisioning of indexing resources in response to workflow fluctuations is crucial. By adopting flexible resource allocation models and leveraging scalable infrastructure solutions, organizations can ensure efficient and effective data processing in RAG architectures, regardless of volume or complexity.