Tired of unreliable, generic AI solutions? Here's how to build your own powerful local RAG agent with LLaMA3!
Credit: Microsoft Designer


I have recently been fascinated by the world of RAG agents and open-source generative AI models. Their ability to combine information retrieval with large language models (LLMs) to answer questions sparked my curiosity, so I dug deeper, reading articles and watching tutorials to grasp the inner workings of these systems.

Large language models have become a game-changer, offering incredible potential across various applications. But how do we ensure these powerful models leverage our own private knowledge base? We typically have two options:

1) Fine-tuning or Training from Scratch: This involves adapting an existing LLM to our specific needs by retraining it on our data. While this offers a high degree of customization, there are several drawbacks:

  • Computational Cost: Training LLMs requires significant computational resources, making it inaccessible for many users.
  • Data Requirements: Large amounts of high-quality data are essential for effective fine-tuning, which can be a challenge to obtain.

2) Retrieval-Augmented Generation (RAG): This approach sidesteps the challenges of fine-tuning by leveraging the existing capabilities of the LLM. Instead of altering the model itself, we strategically guide it with our private knowledge.


Think of RAG as a knowledge whisperer for LLMs. Here's how it works:

Retrieval-augmented generation application | Source: SAP Community


  • Private Database: We curate a collection of documents, articles, or any relevant information specific to our domain.
  • Retrieval: When a user asks a question, the system first retrieves the most relevant information from our private database based on the query.
  • Prompt Engineering: This retrieved information is then incorporated into the prompt for the LLM. Think of the prompt as detailed instructions for the LLM, enriched with context drawn from our private knowledge base.
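Putting these three pieces together, here is a minimal sketch of the flow in Python (my own illustration, not the article's final code; it assumes an Ollama-served Llama3 model and a populated vector store like the one built later in this article):

from langchain_community.chat_models import ChatOllama

def answer_with_rag(question, vectorstore):
    # 1) Retrieval: pull the chunks most similar to the question from the private database
    context_docs = vectorstore.as_retriever().invoke(question)
    context = "\n\n".join(doc.page_content for doc in context_docs)

    # 2) Prompt engineering: hand the LLM the retrieved context as explicit instructions
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3) Generation: the LLM answers, guided by our private knowledge
    llm = ChatOllama(model="llama3", temperature=0)
    return llm.invoke(prompt).content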


Benefits of RAG:

  • Improved Accuracy and Reliability: By guiding the LLM with relevant information, RAG ensures its responses are more accurate and aligned with your specific domain expertise.
  • Privacy Preservation: Sensitive data remains secure within your private database, eliminating concerns about data sharing with external servers.
  • Flexibility: You have complete control over the knowledge base, allowing you to continuously update it with new information and tailor it to your evolving needs.


The Challenges of Real-World RAG

While building a proof-of-concept RAG application might seem straightforward, scaling it for real-world business use presents a unique set of challenges:

  • Messy Real-World Data: Business documents come in all shapes and sizes. Unlike clean, paragraph-based text, they often mix images, diagrams, charts, and tables. Traditional parsers designed for simple formats such as plain text often struggle with this complexity, leading to incomplete or inaccurate extraction. This messy data can cripple a RAG application early on by causing retrieval failures.
  • The Retrieval Conundrum: Accurate retrieval is crucial for a successful RAG application, yet different data types and documents call for different retrieval methods. Imagine a user asking a question that requires information from both text paragraphs and a referenced table within the same document; the RAG system needs to be sophisticated enough to handle these nuances. Some user queries also look simple but are complex to fulfill. For instance, a question like "How is our sales trend from 2020 to 2024?" might require data from multiple sources and involve pre-calculations before anything is fed to the LLM.


Overcoming the Data Challenge

As discussed, accurately parsing real-world data is essential for robust RAG applications.


1) Accurate data parser

Amongst various data parsers, LlamaParse stands out as a powerful tool specifically designed for the complexities of RAG applications. Its understanding of RAG requirements allows it to excel at converting diverse document formats, including PDFs, into a structured, LLM-friendly markdown format. This focus on accuracy translates to a significant reduction in retrieval errors, preventing the LLM from generating misleading statements based on faulty data.


Comparison of LlamaParse vs. PyPDF | Source: LlamaIndex official website

To show LlamaParse's effectiveness, consider its performance compared to a traditional parser like PyPDF when handling documents like the Apple 10K filing. A comparison highlighting retrieval accuracy (green cells indicate correct answers, red cells indicate errors) showcases LlamaParse's clear advantage. This translates to a more reliable RAG system that delivers accurate information to your business.

In a production-grade RAG system, it's wise to leverage the strengths of different tools. While LlamaParse excels at parsing local documents, Firecrawl emerges as a valuable option for online data: it extracts website content and converts it into LLM-friendly markdown, making it a powerful complement to LlamaParse.
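As a rough sketch of how these two parsers slot in (the file path, API keys, and URL below are placeholders, and the exact client APIs may differ between versions), the goal is simply to get everything into LLM-friendly markdown before chunking:

from llama_parse import LlamaParse
from firecrawl import FirecrawlApp

# Parse a local PDF into markdown with LlamaParse (requires a LlamaCloud API key).
pdf_parser = LlamaParse(result_type="markdown")
pdf_docs = pdf_parser.load_data("./apple_10k.pdf")  # placeholder path

# Pull an online page into markdown with Firecrawl (requires a Firecrawl API key).
crawler = FirecrawlApp(api_key="fc-...")  # placeholder key
page = crawler.scrape_url("https://www.oulu.fi/en/apply/international-programmes")

# Both outputs are markdown that can be chunked and embedded exactly like the
# web pages loaded in the indexing step later in this article.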


2) Chunk size

Once we have our data parsed into a usable format, we need to consider how it's stored and retrieved for efficient RAG operation. This is where the concept of "chunking" comes into play.

Instead of feeding the entire document to the LLM, we break it down into smaller, more manageable pieces called chunks. This serves several purposes:

  • Limited Context Window: Large language models have a finite capacity for context. Feeding them an entire document can overwhelm this window, leading to the "lost in the middle" problem. In simpler terms, the LLM focuses heavily on the beginning and ending of the document, overlooking the crucial middle sections.
  • Retrieval Efficiency: Chunking facilitates effective retrieval within the RAG system. When a user asks a question, the system can efficiently search the vector database for relevant chunks instead of filtering through entire documents. This enhances response speed and accuracy.

However, determining the ideal chunk size is a delicate balancing act. Chunks that are too small lack sufficient context for the LLM to understand the information fully. Conversely, overly large chunks can still fall into the "lost in the middle" problem.

The ideal chunk size can vary depending on several factors:

  • Document Type: Technical documents with complex structures might benefit from smaller chunks compared to narrative text.
  • LLM Capabilities: The context window size of the LLM you're using can influence the optimal chunk size.
  • Retrieval Strategy: The chosen retrieval method might favor smaller chunks for higher precision or larger chunks for broader context retrieval.

Finding the optimal chunk size often involves experimentation. Consider exploring tools like LlamaIndex, which can help you evaluate different chunk sizes and their impact on your specific RAG application.
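A simple way to start that experimentation is to split the same documents with a few candidate sizes and inspect the results (a sketch using the same splitter as the indexing code below; the chunk sizes here are arbitrary starting points, and docs_list refers to the document list built in that indexing step):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Compare candidate chunk sizes by splitting the same documents and checking
# how many chunks each setting produces and how long they are on average.
for chunk_size in (128, 256, 512, 1024):
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size, chunk_overlap=int(chunk_size * 0.1)
    )
    chunks = splitter.split_documents(docs_list)
    avg_len = sum(len(c.page_content) for c in chunks) / len(chunks)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, avg {avg_len:.0f} characters")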


3) Agentic RAG

Traditional RAG systems act as information retrieval assistants, retrieving relevant chunks from the database based on the user's query. Agentic RAG takes this a step further by incorporating a layer of "agency."

Here's how it works:

  • Contextual Understanding: Agentic RAG goes beyond simple keyword matching. It analyzes the conversation history and user intent to understand the broader context of the user's query.
  • Intelligent Retrieval: This deeper understanding allows Agentic RAG to employ more intelligent retrieval strategies. It can combine information from multiple chunks, prioritize relevant sections, and even dynamically adjust the chunk size based on the context.
  • Multi-Agent Orchestration: For complex queries spanning multiple knowledge domains, Agentic RAG can orchestrate the collaboration of specialized agents, each equipped with expertise in a specific domain. This allows for a comprehensive and unified response.

The benefits of Agentic RAG are numerous:

  • Improved Accuracy: By considering context and user intent, Agentic RAG can retrieve highly relevant information, leading to more accurate and informative responses.
  • Enhanced Coherence: Combining information from different chunks fosters more coherent and cohesive responses, mimicking a natural conversation flow.
  • Scalability: Agentic RAG can handle increasingly complex queries and knowledge domains efficiently.


Corrective RAG: Building Trust Through Accuracy

Another exciting advancement is Corrective RAG (CRAG). While retrieval is crucial, ensuring the accuracy of retrieved information is important. CRAG addresses this concern by incorporating a "corrective" mechanism.

Here's how CRAG functions:

  • Fact-Checking Integration: CRAG integrates fact-checking mechanisms to verify the retrieved information's accuracy. This is particularly valuable when dealing with user-generated content or dynamic data sources.
  • Reasoning and Evaluation: CRAG goes beyond simple verification. It employs reasoning capabilities to assess the retrieved information's logical consistency and relevance to the user's query.

The advantages of CRAG are obvious:

  • Increased Trust: Users can rely on CRAG-powered RAG systems for accurate and trustworthy information.
  • Reduced Errors: Fact-checking and reasoning minimize the risk of misinformation being presented as facts.
  • Improved Credibility: By ensuring accuracy, CRAG enhances the overall credibility and reliability of the RAG system.

Building a RAG agent involves a number of technical considerations that I can't cover in depth here. But don't worry! If you're curious to learn more about the technical details, plenty of resources are available online. Today, though, let's focus on the hands-on process of creating your own local RAG agent.


Let's create our RAG agent

RAG agent flow

The flowchart outlines the steps involved in processing a user's query through a RAG agent system. It starts with a user query and ends with the generation of an answer by a large language model.

LangChain provides a more detailed tutorial on how to build such an agent, but here I have modified the model.


Breakdown of the Flowchart

1) User Question: The process begins with a user posing a question to the RAG agent system.

2) Routing: The system then routes the user question to a component responsible for understanding the query's intent and potentially reformulating it for better retrieval.

3) Document Retrieval: This stage focuses on retrieving relevant documents from the system's knowledge base based on the refined query. Two paths are displayed here:

  • Related to Index: If the query is determined to be related to the system's index (presumably where documents are stored and retrieved from), the process proceeds to the answer-generation step.
  • Unrelated to Index: If the query is unrelated to the system's index, the system triggers a web search using a search query generator. This indicates the RAG agent might also be able to access and process external information from the web.

4) Generate Answer: Here, the retrieved documents and potentially additional information (unclear from the flowchart) are used to generate a response for the user's query. This is likely achieved by prompting the LLM with the relevant information.

5) Hallucination?: This check determines whether the generated answer contains hallucinations. If it does, the system goes back and regenerates the answer.

6) Answer question?: This decision point determines whether the retrieved information is sufficient to answer the user's question. If it is, the process moves on to returning the answer; if not, the system may need to refine the search or trigger the web search path mentioned earlier.

7) Update Model (Optional): This is an optional step where the generated answer might be used to update the underlying model, potentially improving the system's performance over time.
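To make the routing step concrete, here is a minimal sketch of a question router in the same ChatOllama-plus-JSON style used by the graders later in this article (the prompt wording and the 'datasource' key are my own assumptions, not part of the original code):

from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

# Route a user question either to the local vector index or to web search,
# mirroring steps 2-3 of the flowchart.
router_llm = ChatOllama(model="llama3", format="json", temperature=0)

router_prompt = PromptTemplate(
    template="""You are an expert at routing a user question to a vectorstore or web search.
    The vectorstore contains documents about studying and applying to the University of Oulu.
    Use the vectorstore for questions on those topics; otherwise use web search.
    Return a JSON with a single key 'datasource' set to 'vectorstore' or 'web_search'.
    Question to route: {question}""",
    input_variables=["question"],
)

question_router = router_prompt | router_llm | JsonOutputParser()
print(question_router.invoke({"question": "How do I apply to a master's programme in Oulu?"}))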


In this implementation, we leverage Ollama to run the powerful Llama3 model.
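Before running the snippets below, Ollama must be installed and the model pulled locally; the code refers to it through a local_llm variable (the "llama3" tag below is the default name Ollama assigns to the model):

# Assumes Ollama is running and the model has already been pulled,
# e.g. with `ollama pull llama3` in a terminal.
local_llm = "llama3"  # model name used by the ChatOllama calls below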


What does this code do?

This code builds an index for your local RAG agent. It fetches documents from webpages (defined by URLs), splits them into smaller chunks, and creates a searchable database using those chunks. This database allows the RAG system to efficiently find relevant information when processing user queries.


You can see the code on GitHub here: https://github.com/kravishan/LLaMA3-RAG-agent

### Index

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

urls = [
    "https://www.oulu.fi/en/apply/international-programmes",
    "https://www.oulu.fi/en/apply/how-apply/applying-bachelors-programmes",
    "https://www.oulu.fi/en/apply/how-apply/applying-masters-programmes",
]

docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
doc_splits = text_splitter.split_documents(docs_list)

# Add to vectorDB
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local"),
)
retriever = vectorstore.as_retriever()        
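As a quick sanity check (the query below is just an example), you can ask the retriever for the chunks closest to a question before wiring it into the rest of the agent:

# Preview the chunks most similar to a sample query.
sample_docs = retriever.invoke("Which master's programmes are taught in English?")
for doc in sample_docs[:2]:
    print(doc.metadata.get("source"), "->", doc.page_content[:120], "...")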


This code defines a tool to check if retrieved documents are relevant to the user's question. It uses your local LLM to analyze the document and question, then outputs a simple "yes" or "no" answer through a JSON format. This helps filter out irrelevant documents before feeding them to other parts of the RAG agent.

### Retrieval Grader

from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

# LLM
llm = ChatOllama(model=local_llm, format="json", temperature=0)

prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing relevance 
    of a retrieved document to a user question. If the document contains keywords related to the user question, 
    grade it as relevant. It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    Give a binary 'yes' or 'no' score to indicate whether the document is relevant to the question. \n
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["question", "document"],
)

retrieval_grader = prompt | llm | JsonOutputParser()

# Get user input for the question
question = input("Please enter your question: ")

# Retrieve documents based on the user question
docs = retriever.invoke(question)

# Grade the first retrieved document as an example
doc_txt = docs[0].page_content

# Assess the relevance of the retrieved document
result = retrieval_grader.invoke({"question": question, "document": doc_txt})
print(result)        


This code snippet demonstrates how our local LLM can enhance retrieved information by venturing beyond the initial document set. It combines the retrieved documents into a broader context, then uses the LLM to generate a specific web search query based on the user's question and that context. This allows your RAG agent to uncover additional relevant information from the web, enriching its responses.

While a simple web search can provide general information on a user's question, it often lacks focus and specificity. For instance, searching for "How many master's programs are taught in English?" might generate generic results.


Without a Search Query Generator

This is where our approach goes beyond basic web search. By incorporating context, we can refine the search query. Look at the difference between the initial query and the improved one:


Initial: "How many master's programs are taught in English?"

Improved: "How many master's programs are taught in English at University of Oulu?"


With Search Query Generator

By adding 'at University of Oulu?' to the search query, we leverage the user's specific interest and instruct the web search to retrieve data focused on that university. As you have seen before, we had data about the University of Oulu in our private database. This targeted approach allows our local RAG agent to deliver more relevant and informative answers.

This is just a basic example, and the prompt itself can be further improved. However, it demonstrates the power of contextual search in retrieving specific and valuable information.

from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_community.tools.tavily_search import TavilySearchResults

# Initialize the Language Model
llm = ChatOllama(model=local_llm, format="json", temperature=0)

# Define Prompt Template for Search Query Generation
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an internet search query generator. \n
    Your task is to understand and interpret the user's question, using the retrieved documents as context, 
    and generate a specific search query for Google. \n
    Provide the search query as a JSON with a single key 'query' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the combined text of all documents: {combined_text}
    Here is the user question: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["combined_text", "question"],
)

# Retrieve and Combine Document Content
combined_text = "\n".join([doc.page_content for doc in docs])

# Define Retrieval Analysis Pipeline
retrieval_analysis = prompt | llm | JsonOutputParser()

# Invoke Retrieval Analysis with Combined Text and User Question
analysis_result = retrieval_analysis.invoke({"combined_text": combined_text, "question": question})

# Print or Process the Analysis Result
print("Search query:", analysis_result)

# Extract the generated search query
query = analysis_result['query']

# Initialize the web search tool (requires a TAVILY_API_KEY environment variable)
web_search_tool = TavilySearchResults(k=3)

# Perform a web search using the generated query
search_results = web_search_tool.invoke({"query": query})

# Print or Use the Search Results
print("Web search results:", search_results)
        


This code crafts an answer for the user's question by leveraging your local LLM. It combines retrieved documents for context, then prompts the LLM to generate a concise answer using that context and the question itself. Finally, it displays the generated answer.

### Generate

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Prompt
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. 
    Use three sentences maximum and keep the answer concise. <|eot_id|><|start_header_id|>user<|end_header_id|>
    Question: {question} 
    Context: {context} 
    Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question", "context"],
)

llm = ChatOllama(model=local_llm, temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = prompt | llm | StrOutputParser()

# Generate answer
generation = rag_chain.invoke({"context": format_docs(docs), "question": question})
print("Generated Answer:")
print(generation)        


This code acts as a fact-checker for your RAG agent. It presents the LLM with retrieved documents (as facts) and the generated answer. The LLM then analyzes both and outputs a simple "yes" or "no" verdict to indicate if the answer aligns with the facts, helping to prevent hallucinations (unsupported answers).

### Hallucination Grader

# LLM
llm = ChatOllama(model=local_llm, format="json", temperature=0)

# Prompt
prompt = PromptTemplate(
    template=""" <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether 
    an answer is grounded in / supported by a set of facts. Give a binary 'yes' or 'no' score to indicate 
    whether the answer is grounded in / supported by a set of facts. Provide the binary score as a JSON with a 
    single key 'score' and no preamble or explanation. <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here are the facts:
    \n ------- \n
    {documents} 
    \n ------- \n
    Here is the answer: {generation}  <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "documents"],
)

hallucination_grader = prompt | llm | JsonOutputParser()
# Check whether the generated answer is grounded in the retrieved documents
print(hallucination_grader.invoke({"documents": docs, "generation": generation}))


This code acts as a quality check for your RAG agent's answers. It presents the LLM with both the generated answer and the user's original question. The LLM then analyzes them and outputs a simple "yes" or "no" verdict through a JSON format. This verdict indicates whether the LLM considers the answer useful and relevant to resolving the question, helping to ensure your agent provides focused and helpful responses.

### Answer Grader

# LLM
llm = ChatOllama(model=local_llm, format="json", temperature=0)

# Prompt
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether an 
    answer is useful to resolve a question. Give a binary score 'yes' or 'no' to indicate whether the answer is 
    useful to resolve a question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
     <|eot_id|><|start_header_id|>user<|end_header_id|> Here is the answer:
    \n ------- \n
    {generation} 
    \n ------- \n
    Here is the question: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "question"],
)

answer_grader = prompt | llm | JsonOutputParser()
# Check whether the generated answer actually addresses the question
print(answer_grader.invoke({"question": question, "generation": generation}))
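To tie everything together, here is a simplified orchestration loop that wires the components above into the flowchart's control flow. This plain-Python loop is my own condensed sketch rather than the repository's exact implementation, and the retry limit is an arbitrary safeguard:

def run_agent(question, max_retries=2):
    # Retrieve and keep only the chunks graded as relevant
    retrieved = retriever.invoke(question)
    relevant = [
        d for d in retrieved
        if retrieval_grader.invoke({"question": question, "document": d.page_content})["score"] == "yes"
    ]

    # If nothing relevant was found locally, fall back to a generated web search query
    if not relevant:
        combined = "\n".join(d.page_content for d in retrieved)
        search_query = retrieval_analysis.invoke(
            {"combined_text": combined, "question": question}
        )["query"]
        web_results = web_search_tool.invoke({"query": search_query})
        context = "\n".join(r["content"] for r in web_results)
    else:
        context = format_docs(relevant)

    # Generate, then regenerate if the answer is ungrounded or unhelpful
    for _ in range(max_retries):
        answer = rag_chain.invoke({"context": context, "question": question})
        grounded = hallucination_grader.invoke({"documents": context, "generation": answer})["score"] == "yes"
        useful = answer_grader.invoke({"question": question, "generation": answer})["score"] == "yes"
        if grounded and useful:
            return answer
    return answer  # best effort after the retry limit

print(run_agent(question))

A graph framework such as LangGraph can express the same flow more declaratively, which is worth exploring as the agent grows.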

This code walkthrough has explored various components that contribute to the functionality of your local RAG agent. We have seen how it retrieves documents, assesses their relevance, and utilizes a large language model (LLM) to craft answers and evaluate their quality.

By employing these steps, your local RAG agent strives to provide accurate, relevant, and informative answers to user queries, even venturing beyond the initial document set through web search capabilities (optional).

This layered approach not only improves the quality of the information presented but also helps prevent the inclusion of irrelevant or unsubstantiated information in the final answer. As you continue developing your local RAG agent, consider fine-tuning the prompts, exploring different LLMs, and potentially incorporating additional quality checks to further enhance its effectiveness.


