Advanced Retrieval Augmented Generation (RAG) with Reranking
Full tutorial - https://www.youtube.com/watch?v=9f5rLt3Kc4I&t=8s
Welcome to a tutorial on building a question-and-answer application using Advanced RAG! In this application, users can ask questions on a wide range of topics, and the system provides detailed, human-like responses by leveraging the Cohere model.
Application Overview
The core components of this application are:
- Indexing: splitting a source document into chunks and embedding them into a simple vector store.
- Retrieval: embedding the user query and fetching the most similar chunks by cosine similarity.
- Reranking: reordering the retrieved chunks with Cohere Rerank so the most relevant ones come first.
- Generation: passing the top-ranked chunks to Cohere's Command R model to produce the final answer.
The Importance of Reranking
Reranking plays a crucial role in the Retrieval Augmented Generation (RAG) process. In a naive RAG approach, a large number of contexts may be retrieved, but not all of them are necessarily relevant to the question. Reranking allows for the reordering and filtering of documents, placing the relevant ones at the forefront, thereby enhancing the effectiveness of RAG.
As shown in Figure 1, the task of reranking is like an intelligent filter. When the retriever retrieves multiple contexts from the indexed collection, these contexts may have different relevance to the user's query. Some contexts may be very relevant (highlighted in red boxes in Figure 1), while others may only be slightly related or even unrelated (highlighted in green and blue boxes in Figure 1).
The task of reranking is to evaluate the relevance of these contexts and prioritize the ones that are most likely to provide accurate and relevant answers. This allows the LLM to prioritize these top-ranked contexts when generating answers, thereby improving the accuracy and quality of the response.
In simpler terms, reranking is like helping you choose the most relevant references from a pile of study materials during an open-book exam, so that you can answer the questions more efficiently and accurately.
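To make this concrete, here is a minimal, self-contained sketch of what a reranker conceptually does: score every retrieved context against the query and keep only the highest-scoring ones. The contexts and the simple word-overlap scoring below are made up for illustration; the actual pipeline later in this tutorial uses Cohere's trained reranking model instead.

def toy_rerank(query, contexts, top_n=3):
    # Hypothetical relevance score: count how many query words appear in each context.
    # A real reranker (such as Cohere Rerank, used below) relies on a trained model instead.
    query_words = set(query.lower().split())
    scored = [(sum(word in context.lower() for word in query_words), context) for context in contexts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [context for _, context in scored[:top_n]]

contexts = [
    "Decision trees are a supervised machine learning method.",
    "The weather in Paris is mild in spring.",
    "Neural networks are machine learning models inspired by the brain.",
]
print(toy_rerank("machine learning models", contexts, top_n=2))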
import cohere, os
import wikipedia
from langchain_text_splitters import RecursiveCharacterTextSplitter
co = cohere.Client(os.environ['COHERE_API_KEY'])
# let's get the wikipedia article about Machine learning
article = wikipedia.page('Machine learning')
text = article.content
print(f"The text has roughly {len(text.split())} words.")
# 1. Index the document and, given a user query, retrieve the relevant chunks from the index
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
chunks_ = text_splitter.create_documents([text])
chunks = [c.page_content for c in chunks_]
print(f"The text has been broken down in {len(chunks)} chunks.")
# Because the texts being embedded are the chunks we are searching over, we set the input type as search_doc
model="embed-english-v3.0"
response = co.embed(
texts= chunks,
model=model,
input_type="search_document",
embedding_types=['float']
)
embeddings = response.embeddings.float
print(f"We just computed {len(embeddings)} embeddings.")
# Store the embeddings in a vector database
# We use the simplest vector database ever: a python dictionary using np.array()
import numpy as np
vector_database = {i: np.array(embedding) for i, embedding in enumerate(embeddings)}
query = "Give me a list of machine learning models mentioned in the page. Also give a brief description of each model"
# Embed the user question
# Because the text being embedded is the search query, we set the input type to search_query
response = co.embed(
    texts=[query],
    model=model,
    input_type="search_query",
    embedding_types=['float']
)
query_embedding = response.embeddings.float[0]
print("query_embedding: ", query_embedding)
# Retrieve the most relevant chunks from the vector database (cosine similarity)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Calculate similarity between the user question & each chunk
similarities = [cosine_similarity(query_embedding, emb) for emb in embeddings]
print("similarity scores: ", similarities)
# Get indices of the top 10 most similar chunks
sorted_indices = np.argsort(similarities)[::-1]
# Keep only the top 10 indices
top_indices = sorted_indices[:10]
print("Here are the indices of the top 10 chunks after retrieval: ", top_indices)
# Retrieve the top 10 most similar chunks
top_chunks_after_retrieval = [chunks[i] for i in top_indices]
print("Here are the top 10 chunks after retrieval: ")
for t in top_chunks_after_retrieval:
    print("== " + t)
# 2. Rerank the chunks retrieved from the vector database
# We rerank the 10 chunks retrieved from the vector database.
# Reranking boosts retrieval accuracy.
# Reranking lets us go from 10 chunks retrieved from the vector database, to the 3 most relevant chunks.
response = co.rerank(
    query=query,
    documents=top_chunks_after_retrieval,
    top_n=3,
    model="rerank-english-v2.0",
)
top_chunks_after_rerank = [result.document['text'] for result in response]
print("Here are the top 3 chunks after rerank: ")
for t in top_chunks_after_rerank:
    print("== " + t)
# 3. Generate the model's final answer, given the retrieved and reranked chunks
# preamble containing instructions about the task and the desired style for the output.
preamble = """
## Task & Context
You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
## Style Guide
Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.
"""
# retrieved documents
documents = [
    {"title": "chunk 0", "snippet": top_chunks_after_rerank[0]},
    {"title": "chunk 1", "snippet": top_chunks_after_rerank[1]},
    {"title": "chunk 2", "snippet": top_chunks_after_rerank[2]},
]
# get model response
response = co.chat(
    message=query,
    documents=documents,
    preamble=preamble,
    model="command-r",
    temperature=0.3
)
print("Final answer:")
print(response.text)
Code Walkthrough
1. Import Required Libraries:
- import cohere, os: Import the Cohere API client and the OS module to access environment variables.
- import wikipedia: Import the Wikipedia library to retrieve the content of the "Machine Learning" Wikipedia article.
- from langchain_text_splitters import RecursiveCharacterTextSplitter: Import the RecursiveCharacterTextSplitter from the langchain_text_splitters library to split the text into smaller chunks.
2. Retrieve the Wikipedia Article:
- Create a Cohere API client using the cohere.Client() function and the COHERE_API_KEY environment variable.
- Retrieve the Wikipedia article on "Machine Learning" using the wikipedia.page() function and store the article content in the text variable.
- Print the approximate number of words in the article.
3. Split the Text into Chunks:
- Create a RecursiveCharacterTextSplitter instance with the following parameters:
- chunk_size=512: Set the maximum chunk size to 512 characters.
- chunk_overlap=50: Set the overlap between chunks to 50 characters.
- length_function=len: Use the len() function to determine the length of the text.
- is_separator_regex=False: Disable the use of a regular expression for splitting the text.
- Split the article text into chunks using the create_documents() method and store the chunk contents in the chunks list.
- Print the number of chunks created.
4. Compute Embeddings for the Chunks:
- Set the Cohere model to "embed-english-v3.0" and the input type to "search_document".
- Use the co.embed() function to compute the embeddings for the chunks and store them in the embeddings variable.
- Print the number of computed embeddings.
5. Store the Embeddings in a Vector Database:
- Create a Python dictionary vector_database to store the embeddings, where the keys are the indices of the chunks, and the values are the corresponding embeddings.
6. Embed the User Query:
- Set the user's query to "Give me a list of machine learning models mentioned in the page. Also give a brief description of each model".
- Use the co.embed() function to compute the embedding for the user's query, setting the input type to "search_query".
- Print the computed query embedding.
7. Retrieve the Most Relevant Chunks:
- Implement a cosine_similarity() function to calculate the cosine similarity between the query embedding and each chunk embedding.
- Calculate the similarity scores between the query and each chunk, and store the scores in the similarities list.
- Sort the similarities list in descending order and get the indices of the top 10 most similar chunks.
- Retrieve the top 10 chunks based on the sorted indices and store them in the top_chunks_after_retrieval list.
- Print the indices and the contents of the top 10 chunks.
8. Rerank the Retrieved Chunks:
- Use the Cohere co.rerank() function to rerank the top 10 chunks, setting the top_n parameter to 3.
- Store the top 3 reranked chunks in the top_chunks_after_rerank list.
- Print the contents of the top 3 reranked chunks.
9. Generate the Final Answer:
- Prepare a preamble containing instructions about the task and the desired style for the output.
- Create a list documents containing the titles and snippets of the top 3 reranked chunks.
- Use the Cohere co.chat() function to generate the final answer, passing the user's query, the documents list, and the preamble as input.
- Print the generated final answer.
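As an optional extra, you can usually inspect which retrieved chunks ground each part of the generated answer. The snippet below is a minimal sketch, assuming the chat response object exposes a citations attribute (as recent versions of the Cohere Python client do); the exact field names may differ between SDK versions.

# Inspect which documents support each span of the final answer.
# Assumes `response` is the co.chat() result from above and exposes a `citations` attribute (an assumption about the SDK version).
if getattr(response, "citations", None):
    print("Citations that support the final answer:")
    for citation in response.citations:
        print(citation)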