Tired of unreliable, generic AI solutions? Here's how to build your own powerful local RAG agent with LLaMA3!
I have recently been fascinated by the world of RAG agents and open-source generative AI models. Their ability to combine information retrieval with large language models (LLMs) to answer questions has sparked my curiosity. To dig deeper, I worked through articles and tutorials to grasp the inner workings of these systems.
Large language models have become a game-changer, offering incredible potential across various applications. But how do we ensure these powerful models leverage our own private knowledge base? We typically have two options:
1) Fine-tuning or Training from Scratch: This involves adapting an existing LLM to our specific needs by retraining it on our data. While this offers a high degree of customization, there are several drawbacks:
- Computational Cost: Training LLMs requires significant computational resources, making it inaccessible for many users.
- Data Requirements: Large amounts of high-quality data are essential for effective fine-tuning, which can be a challenge to obtain.
2) Retrieval-Augmented Generation (RAG): This approach sidesteps the challenges of fine-tuning by leveraging the existing capabilities of the LLM. Instead of altering the model itself, we strategically guide it with our private knowledge.
Think of RAG as a knowledge whisperer for LLMs. Here's how it works: when a user asks a question, the system first retrieves the most relevant pieces of our private knowledge base and then passes them to the LLM as context, so the model answers from our data rather than from its memory alone.
Benefits of RAG: it avoids the computational cost and data requirements of fine-tuning, leaves the underlying model untouched, and keeps our private knowledge base easy to update.
The Challenges of Real-World RAG
While building a proof-of-concept RAG application might seem straightforward, scaling it for real-world business use presents a unique set of challenges:
Overcoming the Data Challenge
As discussed, accurately parsing real-world data is essential for robust RAG applications.
1) Accurate data parser
Amongst various data parsers, LlamaParse stands out as a powerful tool specifically designed for the complexities of RAG applications. Its understanding of RAG requirements allows it to excel at converting diverse document formats, including PDFs, into a structured, LLM-friendly markdown format. This focus on accuracy translates to a significant reduction in retrieval errors, preventing the LLM from generating misleading statements based on faulty data.
To show LlamaParse's effectiveness, consider its performance compared to a traditional parser like PyPDF when handling documents like the Apple 10K filing. A comparison highlighting retrieval accuracy (green cells indicate correct answers, red cells indicate errors) showcases LlamaParse's clear advantage. This translates to a more reliable RAG system that delivers accurate information to your business.
In a production-grade RAG system, it's wise to leverage the strengths of different tools. While LlamaParse excels at local data retrieval, firecrawl emerges as a valuable option for online data retrieval. Its ability to extract website data and convert it into a markdown format compatible with LLMs makes it a powerful complement to LlamaParse.
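As a rough illustration of what parsing with LlamaParse looks like, here is a minimal sketch. It assumes a LlamaCloud API key is set in the environment and uses a hypothetical local file name ("apple_10k.pdf"); treat it as a starting point rather than a complete pipeline.
from llama_parse import LlamaParse
# A minimal sketch, assuming LLAMA_CLOUD_API_KEY is set in the environment
# and "apple_10k.pdf" (a hypothetical example file) exists locally.
parser = LlamaParse(result_type="markdown")  # return LLM-friendly markdown
parsed_docs = parser.load_data("apple_10k.pdf")
print(parsed_docs[0].text[:500])  # peek at the parsed markdown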
2) Chunk size
Once we have our data parsed into a usable format, we need to consider how it's stored and retrieved for efficient RAG operation. This is where the concept of "chunking" comes into play.
Instead of feeding the entire document to the LLM, we break it down into smaller, more manageable pieces called chunks. This serves several purposes: the vector store can match the user's query against focused passages rather than whole documents, and the context passed to the LLM stays short enough to fit its context window and remain on topic.
However, determining the ideal chunk size is a delicate balancing act. Chunks that are too small lack sufficient context for the LLM to understand the information fully. Conversely, overly large chunks can still fall into the "lost in the middle" problem.
The ideal chunk size can vary depending on several factors: the nature and structure of your documents, the embedding model you use, and the context window of your LLM.
Finding the optimal chunk size often involves experimentation. Consider exploring tools like LlamaIndex, which can help you evaluate different chunk sizes and their impact on your specific RAG application.
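As a small illustration of that experimentation, the sketch below reuses the same splitter as the index code later in this article and simply compares how many chunks different sizes produce for the same documents (it assumes docs_list already holds the loaded documents); in practice you would also evaluate retrieval quality at each size.
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Compare a few candidate chunk sizes on already-loaded documents (docs_list).
# This only reports chunk counts; retrieval-quality evaluation is the real test.
for chunk_size in (128, 250, 512):
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size, chunk_overlap=0
    )
    splits = splitter.split_documents(docs_list)
    print(f"chunk_size={chunk_size}: {len(splits)} chunks")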
3) Agentic RAG
Traditional RAG systems act as information retrieval assistants, retrieving relevant chunks from the database based on the user's query. Agentic RAG takes this a step further by incorporating a layer of "agency."
Here's how it works: an agent, typically the LLM itself, reasons about the incoming query and decides how to handle it: retrieve from the local index, reformulate the question, call an external tool such as web search, or answer directly.
The benefits of Agentic RAG are numerous: queries get routed to the most suitable knowledge source, irrelevant retrievals can be filtered out before they reach the LLM, and the system can recover gracefully when the local index does not contain the answer. A minimal routing sketch is shown below.
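The following sketch shows one way such a routing decision could look, written in the same style as the graders used later in this article. It is an illustrative example, not the article's pipeline: it assumes a local "llama3" model served by Ollama and simply asks the LLM to choose between the vector store and web search.
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
# Illustrative router sketch; assumes the "llama3" model has been pulled via Ollama.
router_llm = ChatOllama(model="llama3", format="json", temperature=0)
router_prompt = PromptTemplate(
    template="""You are a router. Decide whether a user question should be answered
    from the local vector store (university admission pages) or from a web search.
    Return a JSON with a single key 'datasource' set to 'vectorstore' or 'web_search'.
    Question: {question}""",
    input_variables=["question"],
)
question_router = router_prompt | router_llm | JsonOutputParser()
print(question_router.invoke({"question": "How do I apply to a master's programme?"}))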
Corrective RAG: Building Trust Through Accuracy
Another exciting advancement is Corrective RAG (CRAG). While retrieval is crucial, ensuring the accuracy of the retrieved information is just as important. CRAG addresses this concern by incorporating a "corrective" mechanism.
Here's how CRAG functions: retrieved documents are graded before they are used; if they are judged irrelevant or insufficient, the system corrects course, for example by reformulating the query or falling back to a web search, and only then generates the answer.
The advantages of CRAG are clear: answers stay grounded in vetted context, hallucinations become less likely, and users can place more trust in the system. The snippet below sketches the corrective step.
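Conceptually, the corrective step can be as simple as grading each retrieved document and falling back to another source when nothing useful survives. The sketch below uses grade_document() and web_search() as placeholders for the retrieval grader and web search tool built later in this article.
# Conceptual sketch of CRAG's corrective loop. grade_document() and web_search()
# are placeholders for the grader and search tool implemented further below.
def corrective_retrieve(question, retrieved_docs):
    relevant = [d for d in retrieved_docs if grade_document(question, d) == "yes"]
    if relevant:
        return relevant              # use the vetted local documents
    return web_search(question)      # otherwise correct course via web search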
Building a RAG agent involves further technical considerations that are beyond the scope of this article. But don't worry! If you're curious to learn more about the details, plenty of resources are available online. Today, let's focus on the hands-on process of creating your own local RAG agent.
Let's create our RAG agent
The flowchart outlines the steps involved in processing a user's query through a RAG agent system. It starts with a user query and ends with the generation of an answer by a large language model.
LangChain provides a more detailed tutorial on how to build such an agent; here I have modified that setup for my own use case.
Breakdown of the Flowchart
1) User Question: The process begins with a user posing a question to the RAG agent system.
2) Routing: The system then routes the user question to a component responsible for understanding the query's intent and potentially reformulating it for better retrieval.
3) Document Retrieval: This stage focuses on retrieving relevant documents from the system's knowledge base based on the refined query. Two paths are displayed here:
- Related to Index: If the query is determined to be related to the system's index (presumably where documents are stored and retrieved from), the process proceeds to the answer-generation step.
- Unrelated to Index: If the query is unrelated to the system's index, the system triggers a web search using a search query generator. This means the RAG agent can also access and process external information from the web.
4) Generate Answer: Here, the retrieved documents (and any additional information gathered along the way, such as web search results) are used to generate a response to the user's query, typically by prompting the LLM with the relevant information.
5) Hallucination?: The system checks whether the generated answer contains hallucinations. If it does, the process loops back and regenerates the answer.
6) Answer question?: This decision point determines whether the retrieved documents from the knowledge base are sufficient to answer the user's question. If they are, the process moves on to answer generation; if not, the system may refine the search or trigger the web search path mentioned earlier.
7) Update Model (Optional): This is an optional step where the generated answer might be used to update the underlying model, potentially improving the system's performance over time.
In this implementation, we leverage Ollama to run the powerful Llama3 model.
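The code below refers to a local_llm variable that names the Ollama model. A minimal setup, assuming you have already pulled the model with `ollama pull llama3`, might look like this:
# Assumes Ollama is installed and the model has been pulled with: ollama pull llama3
local_llm = "llama3"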
What does this code do?
This code builds an index for your local RAG agent. It fetches documents from webpages (defined by URLs), splits them into smaller chunks, and creates a searchable database using those chunks. This database allows the RAG system to efficiently find relevant information when processing user queries.
You can see the code on GitHub here: https://github.com/kravishan/LLaMA3-RAG-agent
### Index
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
urls = [
"https://www.oulu.fi/en/apply/international-programmes",
"https://www.oulu.fi/en/apply/how-apply/applying-bachelors-programmes",
"https://www.oulu.fi/en/apply/how-apply/applying-masters-programmes",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
doc_splits = text_splitter.split_documents(docs_list)
# Add to vectorDB
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local"),
)
retriever = vectorstore.as_retriever()
This code defines a tool to check if retrieved documents are relevant to the user's question. It uses your local LLM to analyze the document and question, then outputs a simple "yes" or "no" answer through a JSON format. This helps filter out irrelevant documents before feeding them to other parts of the RAG agent.
### Retrieval Grader
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
# LLM
llm = ChatOllama(model=local_llm, format="json", temperature=0)
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing relevance
    of a retrieved document to a user question. If the document contains keywords related to the user question,
    grade it as relevant. It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    Give a binary 'yes' or 'no' score to indicate whether the document is relevant to the question. \n
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["question", "document"],
)
retrieval_grader = prompt | llm | JsonOutputParser()
# Get user input for the question
question = input("Please enter your question: ")
# Retrieve documents based on the user question
docs = retriever.invoke(question)
doc_txt = docs[1].page_content
# Assess the relevance of the retrieved document
result = retrieval_grader.invoke({"question": question, "document": doc_txt})
print(result)
This code snippet demonstrates how our local LLM can enhance retrieved information by venturing beyond the initial document set. It combines retrieved documents for broader context, then uses the LLM to generate a specific web search query based on the user's question and this context. This allows your RAG agent to potentially uncover additional relevant information from the web, enriching its responses.
While a simple web search can provide general information on a user's question, it often lacks focus and specificity. For instance, searching for "How many master's programs are taught in English?" might generate generic results.
This is where our approach goes beyond basic web search. By incorporating context, we can refine the search query. Look at the difference between the initial query and the improved one:
Initial: "How many master's programs are taught in English?"
Improved: "How many master's programs are taught in English at University of Oulu?"
By adding 'at University of Oulu?' to the search query, we leverage the user's specific interest and instruct the web search to retrieve data focused on that university. As you have seen before, we had data about the University of Oulu in our private database. This targeted approach allows our local RAG agent to deliver more relevant and informative answers.
This is just a basic example, and the prompt itself can be further improved. However, it demonstrates the power of contextual search in retrieving specific and valuable information.
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_community.tools.tavily_search import TavilySearchResults
# Initialize the Language Model
llm = ChatOllama(model=local_llm, format="json", temperature=0)
# Define Prompt Template for Analysis
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an internet search query generator. \n
    Your task is to understand and interpret the user's question, using the provided context, and generate a specific search query for Google. \n
    Provide the search query as a JSON with a single key 'query' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the combined text of all documents: {combined_text} \n
    Here is the user question: {question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["combined_text", "question"],
)
# Retrieve and Combine Document Content
combined_text = "\n".join([doc.page_content for doc in docs])
# Define Retrieval Analysis Pipeline
retrieval_analysis = prompt | llm | JsonOutputParser()
# Invoke Retrieval Analysis with Combined Text and User Question
analysis_result = retrieval_analysis.invoke({"combined_text": combined_text, "question": question})
# Print or Process the Analysis Result
print("Search query:", analysis_result)
# Perform Web Search Based on Generated Query
query = analysis_result['query']
# Initialize the web search tool
web_search_tool = TavilySearchResults(k=3)
# Perform a web search using the generated query
search_results = web_search_tool.invoke({"query": query})
# Print or Use the Search Results
print("Web search results:", search_results)
This code crafts an answer for the user's question by leveraging your local LLM. It combines retrieved documents for context, then prompts the LLM to generate a concise answer using that context and the question itself. Finally, it displays the generated answer.
### Generate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
# Prompt
prompt = PromptTemplate(
template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise <|eot_id|><|start_header_id|>user<|end_header_id|>
Question: {question}
Context: {context}
Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question", "context"],
)
llm = ChatOllama(model=local_llm, temperature=0)
# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Chain
rag_chain = prompt | llm | StrOutputParser()
# Generate answer
generation = rag_chain.invoke({"context": format_docs(docs), "question": question})
print("Generated Answer:")
print(generation)
This code acts as a fact-checker for your RAG agent. It presents the LLM with retrieved documents (as facts) and the generated answer. The LLM then analyzes both and outputs a simple "yes" or "no" verdict to indicate if the answer aligns with the facts, helping to prevent hallucinations (unsupported answers).
### Hallucination Grader
# LLM
llm = ChatOllama(model=local_llm, format="json", temperature=0)
# Prompt
prompt = PromptTemplate(
template=""" <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether
an answer is grounded in / supported by a set of facts. Give a binary 'yes' or 'no' score to indicate
whether the answer is grounded in / supported by a set of facts. Provide the binary score as a JSON with a
single key 'score' and no preamble or explanation. <|eot_id|><|start_header_id|>user<|end_header_id|>
Here are the facts:
\n ------- \n
{documents}
\n ------- \n
Here is the answer: {generation} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
input_variables=["generation", "documents"],
)
hallucination_grader = prompt | llm | JsonOutputParser()
# Reuse format_docs() so the grader sees clean text rather than raw Document objects
hallucination_score = hallucination_grader.invoke({"documents": format_docs(docs), "generation": generation})
print("Hallucination grade:", hallucination_score)
This code acts as a quality check for your RAG agent's answers. It presents the LLM with both the generated answer and the user's original question. The LLM then analyzes them and outputs a simple "yes" or "no" verdict through a JSON format. This verdict indicates whether the LLM considers the answer useful and relevant to resolving the question, helping to ensure your agent provides focused and helpful responses.
### Answer Grader
# LLM
llm = ChatOllama(model=local_llm, format="json", temperature=0)
# Prompt
prompt = PromptTemplate(
template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether an
answer is useful to resolve a question. Give a binary score 'yes' or 'no' to indicate whether the answer is
useful to resolve a question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
<|eot_id|><|start_header_id|>user<|end_header_id|> Here is the answer:
\n ------- \n
{generation}
\n ------- \n
Here is the question: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
input_variables=["generation", "question"],
)
answer_grader = prompt | llm | JsonOutputParser()
answer_score = answer_grader.invoke({"question": question, "generation": generation})
print("Answer grade:", answer_score)
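To close the loop from the flowchart, here is one rough way the pieces above could be wired together into a single helper. It is a simplified sketch (no retry limits, no model updates) rather than the full workflow from LangChain's tutorial, and it reuses the retriever, graders, query generator, and web search tool defined in the sections above.
# Simplified sketch tying the components together, following the flowchart:
# retrieve -> grade documents -> (web search if needed) -> generate -> grade answer.
def answer_question(question):
    docs = retriever.invoke(question)
    # Keep only documents the retrieval grader marks as relevant
    relevant_docs = [
        d for d in docs
        if retrieval_grader.invoke({"question": question, "document": d.page_content})["score"] == "yes"
    ]
    if not relevant_docs:
        # Nothing useful locally: generate a focused query and fall back to web search
        query = retrieval_analysis.invoke(
            {"combined_text": format_docs(docs), "question": question}
        )["query"]
        results = web_search_tool.invoke({"query": query})
        context = "\n".join(r["content"] for r in results)
    else:
        context = format_docs(relevant_docs)
    # Generate an answer, then check it for hallucinations and usefulness
    generation = rag_chain.invoke({"context": context, "question": question})
    grounded = hallucination_grader.invoke({"documents": context, "generation": generation})["score"]
    useful = answer_grader.invoke({"question": question, "generation": generation})["score"]
    if grounded == "yes" and useful == "yes":
        return generation
    # A fuller implementation would loop back here to regenerate or re-search
    return generation
print(answer_question("How many master's programs are taught in English?"))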
This code walkthrough has explored various components that contribute to the functionality of your local RAG agent. We have seen how it retrieves documents, assesses their relevance, and utilizes a large language model (LLM) to craft answers and evaluate their quality.
By employing these steps, your local RAG agent strives to provide accurate, relevant, and informative answers to user queries, even venturing beyond the initial document set through web search capabilities (optional).
This layered approach not only improves the quality of the information presented but also helps prevent the inclusion of irrelevant or unsubstantiated information in the final answer. As you continue developing your local RAG agent, consider fine-tuning the prompts, exploring different LLMs, and potentially incorporating additional quality checks to further enhance its effectiveness.