Retrieval-Augmented Generation (RAG) with LangChain: Refining the Future of AI Conversations
Rany ElHousieny, PhD
Senior Software Engineering Manager (EX-Microsoft) | Generative AI Leader @ Clearwater Analytics | Generative AI, Conversational AI Solutions Architect
In the ever-evolving landscape of artificial intelligence, the quest for more intelligent and contextually aware conversational agents has led to the development of innovative approaches. Among these, Retrieval-Augmented Generation (RAG) stands out as a groundbreaking technique that significantly enhances the capabilities of AI models. By combining the strengths of retrieval-based methods and generative models, RAG offers a powerful solution for creating more informative, relevant, and coherent AI-driven interactions.
LangChain, a versatile and robust framework, is at the forefront of this transformation. It enables the seamless integration of retrieval and generation mechanisms, providing developers with the tools to build sophisticated AI applications. This article delves into the intricacies of RAG, explores how LangChain facilitates its implementation, and highlights the profound impact it has on the future of AI conversations. Whether you're an AI enthusiast, developer, or researcher, join us as we uncover how RAG with LangChain is refining the future of AI interactions, making them more dynamic, accurate, and user-centric.
Understanding RAG and Its Architecture
The RAG system addresses a key limitation of LLMs: their responses are only as current and accurate as the data they were trained on. RAG introduces an "open book" strategy, where LLMs can access and incorporate information beyond their training data to answer questions more accurately, much like consulting a reference book during an open-book exam.
RAG achieves this by implementing a two-step process (a conceptual sketch in code follows the list):
1. Retrieval Phase: When prompted with a question, RAG employs a retrieval mechanism to fetch relevant documents or data snippets from a vast external source, such as a knowledge base or the internet.
2. Generation Phase: The retrieved information is then presented to the LLM, augmenting its original prompt. Thus informed, the LLM synthesizes this additional context to generate a more precise and relevant response.
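Conceptually, the two phases can be sketched in a few lines of Python. The helper below is a hypothetical illustration (the names and interfaces are assumed); the concrete LangChain implementation follows later in this article:
# Conceptual sketch only: `retriever` and `llm` are assumed to expose
# LangChain-style interfaces like the ones built later in this article.
def answer_with_rag(question, retriever, llm):
    # Retrieval phase: fetch the most relevant snippets from the external source
    docs = retriever.get_relevant_documents(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Generation phase: augment the prompt with the retrieved context
    prompt = (
        "Answer the question only from the following context:\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt)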
Vector Databases and Embeddings
At the heart of RAG lies the vector database, a specialized repository that stores text in the form of mathematical vectors. This is crucial for the retrieval phase, where RAG converts user queries into vectors and matches them against this database to find the most relevant documents or data points. Embeddings, which are representations of text as vectors, encapsulate the semantic meaning and context, thus enabling RAG to discern relevance with a high degree of accuracy.
The Art of Similarity Matching
Similarity in RAG is computed using vector space models. When a query is converted into a vector, it's compared against the database of embeddings using similarity metrics like cosine similarity. The most similar vectors are retrieved as they represent the documents most relevant to the query.
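As a small, self-contained illustration of this matching step, the snippet below embeds a query and a few made-up candidate sentences with the same sentence-transformers model used later in this article and ranks them by cosine similarity (it assumes the sentence-transformers and numpy packages are installed):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I build GPU-accelerated applications?"
documents = [
    "The NVIDIA CUDA Toolkit provides a development environment for GPU-accelerated applications.",
    "TensorRT is an SDK for high-performance deep learning inference.",
    "GameWorks offers tools and samples for real-time graphics development.",
]

# Embed the query and the candidate documents
query_vec = model.encode(query)
doc_vecs = model.encode(documents)

# Cosine similarity = dot product of the L2-normalized vectors
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
best = int(np.argmax(scores))
print(f"Most relevant: {documents[best]} (score={scores[best]:.3f})")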
RAG Python Implementation with ChromaDB, Ollama, Llama3, and LangChain
To practically implement RAG using Python, ChromaDB can serve as the vector database where embeddings are stored and retrieved. Llama3, a powerful LLM available on Ollama, can be the model of choice for response generation. Finally, LangChain, a library that aids in constructing RAG systems, can be used to manage the interaction between the retrieval and generation phases, providing a full-fledged RAG implementation.
Step 1: Load Documents
I will retrieve a few pages from the NVIDIA documentation site.
from pprint import pprint
from langchain_community.document_loaders import WebBaseLoader

# List of URLs you want to load (We will crawl the entire site later)
urls = [
    "https://docs.nvidia.com",
    "https://docs.nvidia.com/cuda",
    "https://docs.nvidia.com/deeplearning",
    "https://docs.nvidia.com/gameworks",
    "https://docs.nvidia.com/cudnn",
    "https://docs.nvidia.com/tensorrt",
]

data = []

# Loop through each URL and load the page content
for url in urls:
    loader = WebBaseLoader(url)
    page = loader.load()
    # Remove newlines so each page's content is easier to inspect
    page[0].page_content = page[0].page_content.replace('\n', '')
    data.extend(page)

pprint(data)
Step 2: Split Documents into Chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_docs(data):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(data)
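The helper is then applied to the loaded pages to produce the all_splits list that the vector store consumes in Step 4:
# Split the crawled pages into 1000-character chunks for embedding
all_splits = split_docs(data)
print(f"Created {len(all_splits)} chunks")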
Step 3: Embedding
from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
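As a quick sanity check, you can embed a sample string; all-MiniLM-L6-v2 should produce a 384-dimensional vector:
# Embed a sample query and inspect the vector size (expected: 384 dimensions)
sample_vector = embeddings.embed_query("What is the NVIDIA CUDA Toolkit?")
print(len(sample_vector))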
Step 4: Setting up ChromaDB as the VectorDB
# Add to ChromaDB vector store
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag-chroma",
    embedding=embeddings,
)
retriever = vectorstore.as_retriever()
Alternatively, Weaviate can serve as the vector database. The snippet below uses an embedded Weaviate instance with the same chunks and embeddings prepared above:
from langchain.vectorstores import Weaviate
import weaviate
from weaviate.embedded import EmbeddedOptions

# Setup an embedded (local) Weaviate vector database
client = weaviate.Client(
    embedded_options=EmbeddedOptions()
)

# Populate the vector database with the document chunks
vectorstore = Weaviate.from_documents(
    client=client,
    documents=all_splits,
    embedding=embeddings,
    by_text=False,
)

# Define vectorstore as retriever to enable semantic search
retriever = vectorstore.as_retriever()
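Before wiring the retriever into a chain, it is worth a quick check that semantic search returns sensible chunks. A minimal sketch against the retriever created above:
# Retrieve the chunks most similar to a sample question
docs = retriever.get_relevant_documents("What is the NVIDIA CUDA Toolkit?")
for doc in docs:
    print(doc.metadata.get("source"), "->", doc.page_content[:120])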
Step 5: Initializing the LLM Model
We will be using a local Ollama instance to run Llama3, as explained in the following article:
from langchain.llms import Ollama

llm = Ollama(
    model="llama3:latest",
    verbose=True,
    temperature=0,
)
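A quick way to confirm that the Ollama server is running and the model is available is to send a simple prompt before building the chain:
# Smoke test of the local Llama3 model (assumes the Ollama server is running)
print(llm.invoke("In one sentence, what is GPU acceleration?"))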
Step 6: RAG Prompt Template
To understand Prompt Templates and In-Context Learning, please review the following two articles:
from langchain.prompts import ChatPromptTemplate
# Prompt
template = """Answer the question only from the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
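To see what the model will actually receive, you can format the template with sample values (the strings below are purely illustrative):
# Preview the prompt produced for a sample context and question
print(prompt.format(
    context="CUDA is NVIDIA's parallel computing platform and programming model.",
    question="What is CUDA?",
))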
Step 7: RAG Pipeline
The following chain represents the RAG pipeline, where data is passed from one component to the next and transformed step by step: the retriever extracts the relevant context while the original question is passed through unchanged; the prompt template formats both into a well-structured prompt for the model; the language model generates a response; and the output parser turns that response into a plain string. Each component is linked with the pipe (|) operator, which here denotes the flow of data through successive transformations, keeping the process streamlined and maintainable.
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

# Chain
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
The code above defines the RAG pipeline, where data is passed through a sequence of processing steps. Each component in the pipeline transforms the data in some way before passing it on to the next component. Let's break down each part; a short sketch for inspecting the intermediate stages follows the breakdown:
1. Chain Definition:
- `rag_chain = (...)`: This sets up a variable named `rag_chain` which is assigned the result of a series of operations connected by the pipe (`|`) operator. The pipe here serves as a way to pass the output of one component as the input to the next.
2. Components:
- `{"context": retriever, "question": RunnablePassthrough()}`: Here, I am creating a dictionary object with two keys. `context` this is the context we get from the retriever object that is responsible for retrieving the context from the Vector DB accoring to the similarity with the question asked, and `question` is set to an instance of `RunnablePassthrough()`. This `RunnablePassthrough` is a class from LangChain_Core designed to pass through data without changing it,serving as a placeholder for the actual processing of the question. It is used here to pass the question through the pipeline without modifying it. This is useful when the question does not need processing or transformation before being used.
- `prompt`: This refers to an instance of ChatPromptTemplate initialized with a predefined template. The purpose of this component in the chain is to take the dictionary (with context and question) and format it according to the template specified earlier. The formatted string will typically combine the retrieved context and the unmodified question into a structured prompt ready for the model.
- `llm`: This component is the Ollama instance configured earlier with the llama3 model. This step feeds the formatted prompt from the previous step to the local language model, which generates a response based on the input prompt, incorporating both the context and the question.
- `StrOutputParser()`: This final step in the chain uses the StrOutputParser class to parse the raw string output from the local language model. The parser is responsible for cleaning up the model's response, extracting specific parts of the output, or converting it into a more usable format.
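If you want to inspect the intermediate prompt rather than only the final answer, you can run just the first two stages on their own. This is a small debugging sketch, not part of the pipeline itself:
# Debugging sketch: run only the retrieval + prompt-formatting stages
partial_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt
print(partial_chain.invoke("What is the NVIDIA CUDA Toolkit?"))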
Step 8: Test
# Question
rag_chain.invoke("What is the NVIDIA CUDA Toolkit?")
'According to the context, the NVIDIA CUDA Toolkit is a comprehensive development environment for C and C++ developers building GPU-accelerated applications. It provides tools for developing, optimizing, and deploying applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers.'
Step 9: Creating a UI ChatPOD
from langchain_community.llms.ollama import Ollama
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML

# Initialize the Ollama model
llm = Ollama(
    model="llama3:latest",
)

# Function to handle the input and display the response
def handle_query(sender):
    with output:
        clear_output(wait=True)  # Ensure the output is cleared only once ready to display new output
        print("Processing...")
        try:
            response = rag_chain.invoke(input_box.value)
            display(HTML(f"<div style='word-wrap: break-word; white-space: pre-wrap;'>Response: {response}</div>"))
        except Exception as e:
            print("An error occurred:", str(e))

# Create widgets for input and output
input_box = widgets.Text(description="Enter a query:")
button = widgets.Button(description="Submit Query")
output = widgets.Output()

# Set up the button's event to handle the query
button.on_click(handle_query)

# Display the widgets
display(input_box, button, output)
Conclusion
RAG stands at the frontier of AI’s conversational prowess, enabling systems to respond with accuracy and currency that was previously unattainable. By leveraging the latest information through external knowledge bases, RAG systems promise a future where chatbots and virtual assistants are not just helpers but knowledgeable consultants capable of providing verified and precise information. As the technology continues to mature, we can expect even more innovative applications across different domains, transforming how we interact with AI.