Exploring GenAlverse - II
Retrieval Augmented Generation
In part I, we performed basic inference with LLM models through LangChain and took an overview of the various prompt engineering techniques. In this part we’ll dive into Retrieval Augmented Generation (RAG). While RAG started out as one of the prompt engineering techniques, it has grown into a framework for enriching the context of LLM models to get more coherent responses.
All large language models (LLMs) have a knowledge cut-off date (gpt-4o is pre-trained with data as of October 2023), so they have no knowledge or context of events that happened after that date, which leads to inaccurate and hallucinated responses. Ask a model who won the men's 100m at the 2024 Paris Olympics and it will not be able to provide an answer. Llama 3 gave the response below.
>>> who won the gold medal in mens 100 meters race in Paris olympics in 2024
>>>I'm afraid you're asking a question that is not yet answered! The 2024 Summer Olympics are scheduled to take place from July 26 to August 11, 2024, in Paris, France. The men's 100 meters race will likely be held on the second day of the Games, but the winner has not been determined yet. In fact, the athletes who will compete in this event have not even qualified for the Olympics yet! The qualification process and the actual competition will take place before the Games begin. So, I can't give you an answer to your question just yet. But I'll be happy to provide the winner once the event has taken place!
Additionally, enterprise GenAI use cases often require leveraging LLM models within the organization's network due to data security requirements.
The RAG framework helps solve these problems by combining the capabilities of generative large language models with additional “knowledge bases”. A knowledge base can be the content of an organization’s intranet, content from the internet, documents, PDF files, audio files, etc. These knowledge bases enhance and enrich the context of the LLM models, thereby producing more relevant and accurate output.
RAG steps
Retrieval Augmented Generation typically involves the following steps: load the source content, transform it (split it into chunks and create embeddings), store the embeddings in a vector database, and retrieve the relevant chunks at query time. The diagram below, from the LangChain documentation, outlines these steps very well.
As depicted in the above diagram, data can be loaded from multiple sources to create knowledge bases. Although cloud platforms like AWS Bedrock have abstracted the whole process and made it very easy to implement RAG, we’ll take a closer look at the process by picking a use case and implementing the RAG framework step by step.
RAG Implementation:
We’ll implement RAG for an example use case and go through all the steps of the process in detail, using the scenario below:
Apple Inc. has just released its earnings report, and a video of the analysis has been posted on news sites. Since the video contains a lot of financial information, the requirement is to create a chat application, leveraging LLM models, through which a financial analyst will be able to “converse” with the content of the video and get the relevant details of the earnings report.
We’ll develop the application by implementing the RAG framework following the required steps.
RAG Step 1: Load
In this step of the RAG process, the required documents/content are loaded. In our use case, we have to load the transcribed text of the video, so we’ll first extract the text from the video using OpenAI’s Whisper model via the OpenAIWhisperParser. We’ll pass the YouTube URL of the video to the loader to download the audio file and then transcribe it to text.
Prerequisites: The following LangChain modules are required. Also, since we’re using the OpenAI APIs, an API key has to be created to access them.
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
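These modules are not part of a default Python installation. A likely set of install commands is shown below; the exact package names are assumed from the imports used in this article, and yt_dlp plus pydub are assumed dependencies that the YouTube audio loading and parsing typically rely on.
> pip install langchain langchain-community langchain-openai langchain-text-splitters faiss-cpu python-dotenv
> pip install yt_dlp pydub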
langchain_community.document_loaders has wrappers for working with different formats, sources, parsers, etc. The text from the video can be extracted as below:
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
import os
from dotenv import load_dotenv
#Load the environment variables that have been defined in .env file
load_dotenv()
# Get the OpenAI API key that is defined in the .env file with the key "OPEN_AI_KEY"
OPENAI_API_KEY = os.environ["OPEN_AI_KEY"]
# YouTube URL from which the audio has to be extracted and converted to text.
urls = ["https://www.youtube.com/watch?v=FjdKJjF6d5A"]
# Local directory where the downloaded audio file will be saved (raw string keeps the backslash literal).
save_dir = r"C:\source"
# Transcribe the video to text: YoutubeAudioLoader downloads the audio, OpenAIWhisperParser transcribes it.
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser(api_key=OPENAI_API_KEY))
# First step of the RAG process - load the documents. In this case, we're loading the text
# transcribed from the YouTube video.
docs = loader.load()
Once the audio file is extracted and transcribed, we’ll be able to see the extracted text in the docs variable.
docs[0].page_content[0:200]
We’ll get the transcribed text from the video file as below:
"All right. Let's get some more immediate reaction on this. Let's get to Tom Forte, Maxim Group Managing Director, Senior Consumer Internet Analyst. Tom, Apple's results are out. Let's get your quick take on the report. What do you make of it, Tom? So for the June quarter, I would say it was a net positive with upside on both top and bottom line and better than expected results in the iPhone. So if there was one negative in the June quarter, it was, as you pointed out, that the revenue performanc"
It’s amazing to see the above implementation in action: with a few lines of code, we’re able to extract the text from the video. Now that we have the extracted text, we’ll perform the next step of the RAG process: Transform.
RAG Step 2: Transform
Once we’ve extracted the text from the video, we’ll split it into “chunks”. This enables the model to focus on specific meaning and content, thereby improving performance. Additionally, while chunking, we include an overlap between chunks to retain the reference and context across chunks.
The chunk size also has to be tuned: with very small chunks the model loses context, and with very large chunks the model cannot be specific in its responses.
LangChain provides splitter libraries to chunk the text based on the type of content. There are splitters for HTML, code, recursively splitting JSON documents, etc.
The below code splits the loaded document into chunks. We’ll use the RecursiveCharacterTextSplitter from LangChain.
from langchain_text_splitters import RecursiveCharacterTextSplitter
combined_docs = [doc.page_content for doc in docs]
text = " ".join(combined_docs)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
splits = text_splitter.split_text(text)
In the above code, the page contents of the docs are first combined into the text variable and then passed to the text_splitter. Note that the chunk size is 200 and the overlap is 50. We could go for a larger chunk size (around 1000) and a larger chunk overlap (around 200); I’ve kept the chunks small to demonstrate the splitting process.
If we look at the first two chunks, splits[0] and splits[1], we can clearly see the overlapping text.
splits[0]
"All right. Let's get some more immediate reaction on this. Let's get to Tom Forte, Maxim Group Managing Director, Senior Consumer Internet Analyst. Tom, Apple's results are out. Let's get your quick"
splits[1]
"Apple's results are out. Let's get your quick take on the report. What do you make of it, Tom? So for the June quarter, I would say it was a net positive with upside on both top and bottom line and"
We can see that the text "Apple's results are out. Let's get your quick" is repeated across the two chunks. This helps keep the reference between chunks and thus retains the context.
As part of the “Transform” step in the RAG process, we now have to create “embeddings” for the chunks and store them in a “vector database”. Before the implementation, let’s first understand embeddings and vectors.
Text Embeddings:
Embedding is the process of creating a vector (a list of floating point numbers) from text. The distance between two such vectors measures the “likeness” of the two pieces of text: the smaller the distance, the more closely related they are. As with LLMs, there are various embedding models available. Some are open source (like embedding models from Meta and Hugging Face), while others are licensed (like the embedding models from OpenAI).
Since we can measure the “relatedness” of terms, embeddings are also used for semantic search.
Semantic search: In keyword search, the exact word is matched; in semantic search, results are returned based on the similarity of meaning to the search string (although an exact keyword match will still rank higher). So, in semantic search, if we search for “DevOps”, the results will include content related to “DevOps” as well as content related to “Continuous Deployment”, as the two terms are semantically close.
Additional use cases for text embeddings include recommendation engines, clustering, classification, etc.
In this implementation, we’ll use Ollama embeddings (Ollama is already installed in the local environment). There are similar embedding models from Hugging Face, OpenAI, AWS and other providers.
Ollama provides three embedding models: mxbai-embed-large, nomic-embed-text and all-minilm.
After installing Ollama, an embedding model can be downloaded from the command prompt. To download the “mxbai-embed-large” embedding model, run:
> ollama pull mxbai-embed-large
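As a quick sanity check of the semantic “relatedness” idea discussed above, we can compare embeddings directly. The below snippet is a minimal sketch (it assumes Ollama is running locally with the mxbai-embed-large model pulled, and that numpy is installed); the similarity between "DevOps" and "Continuous Deployment" should come out noticeably higher than the similarity to an unrelated phrase.
from langchain_community.embeddings import OllamaEmbeddings
import numpy as np

# Embedding model served by the local Ollama instance.
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

def cosine_similarity(a, b):
    # Cosine similarity of two embedding vectors: closer to 1.0 means more semantically related.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_devops = embeddings.embed_query("DevOps")
v_cd = embeddings.embed_query("Continuous Deployment")
v_cake = embeddings.embed_query("chocolate cake recipe")

# "DevOps" vs "Continuous Deployment" should score noticeably higher than "DevOps" vs the recipe text.
print(cosine_similarity(v_devops, v_cd))
print(cosine_similarity(v_devops, v_cake))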
Vector Databases
Before creating the embeddings, let’s also discuss vector databases.
Once the text is converted into vectors, it has to be stored in a database. This database is what the application queries to “retrieve” the relevant content for a user’s question. The vectors can be stored in a local file as well as in the cloud; distributed document databases like Cassandra or AWS OpenSearch can also be used as vector databases. In this example we’ll use the FAISS library (Facebook AI Similarity Search) for storing the vectors and performing the similarity search.
Since enterprise implementations typically need to embed huge amounts of content, a distributed document store like Cassandra or AWS OpenSearch is the better choice for a production vector database.
The below implementation combines the two steps of embedding the text and storing the vectors in the vector database. We’ll store the vector database in the local environment; FAISS’s save_local writes the index along with a pickled (.pkl) docstore file.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Create the embedding model and build the FAISS index from the text chunks.
embeddings = OllamaEmbeddings(model="mxbai-embed-large")
vector_db = FAISS.from_texts(splits, embeddings)
# Persist the index to the local folder "faiss_index_transcript".
vector_db.save_local("faiss_index_transcript")
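The saved index can later be reloaded from disk without re-embedding the transcript. A short sketch (note that recent versions of langchain_community require an explicit allow_dangerous_deserialization flag because the docstore is stored as a pickle file):
# Reload the persisted FAISS index using the same embedding model.
vector_db = FAISS.load_local(
    "faiss_index_transcript",
    embeddings,
    allow_dangerous_deserialization=True,  # needed in newer versions since the docstore is pickled
)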
When a user sends a question, the question is also converted into a vector using the same embedding model, and a similarity search is run against the vector database to retrieve the top-k results. These results are then combined into a prompt and fed to the LLM for final summarization and content generation.
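We can try this retrieval step in isolation by running a similarity search directly against the store we just built (a sketch using the vector_db object created above and a question similar to the ones asked later):
# Embed the question with the same embedding model and fetch the 3 most similar chunks.
question = "What was the gain in sales in Q3?"
top_chunks = vector_db.similarity_search(question, k=3)
for chunk in top_chunks:
    print(chunk.page_content)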
RAG Step 3: Retrieve
Now that we have the vector database ready, we will move on to the last step in the RAG process - Retrieve. In this step, we’ll create a retrieval chain using RetrievalQA from langchain.chains. Once the chain is created, the query can be passed to the chain’s invoke function to get the response. We’ll use the “gpt-4o” model from OpenAI for this implementation. Since this is a licensed model, we’ll have to pass the API key to access it.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import os
OPENAI_API_KEY = os.environ["OPEN_AI_KEY"]
# Chat model used to generate the final answer from the retrieved chunks.
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=OPENAI_API_KEY,
)
# The "stuff" chain type stuffs all the retrieved chunks into a single prompt for the LLM.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_db.as_retriever(),
)
Now we’re all set up, and the user - the financial analyst - will be able to converse with the video content. Let’s ask a few questions and check the responses:
Question 1:
query = "What is the bottom gross margin for Q3?"
qa_chain.invoke(query)
Answer from the model:
{'query': 'What is the bottom gross margin for Q3?',
 'result': 'The bottom gross margin for Q3 is 46.3%.'}
Question 2:
query = "What was the gain in sales in Q3?"
qa_chain.invoke(query)
Answer from the model:
{'query': 'What was the gain in sales in Q3?',
 'result': 'The gain in sales in Q3 was 5%.'}
Question 3:
query = "Give a summary of the progress that has been made for Apple intelligence?"
qa_chain.invoke(query)
Answer from the model:
{'query': 'Give a summary of the progress that has been made for Apple intelligence?',
 'result': 'Apple Intelligence is generating excitement as a new reason to upgrade iPhones. There are indications that new hardware for consumers may be released, possibly in September, but a software update will be necessary for Apple Intelligence to function. The company is working on this upgrade cycle and may provide more details after their earnings report.'}
Notice that in the response to the third question, the model did not restrict itself to the content of the video file; it drew on knowledge beyond the transcript to provide a more relevant response.
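If we want to verify which transcript chunks an answer was actually grounded in, the chain can be rebuilt to return its source documents. A sketch using standard RetrievalQA options (return_source_documents and the retriever's k); the values chosen here are just examples:
# Rebuild the chain so that it also returns the chunks that were retrieved for each answer.
qa_chain_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_db.as_retriever(search_kwargs={"k": 4}),  # retrieve the top 4 chunks
    return_source_documents=True,
)
result = qa_chain_with_sources.invoke("Give a summary of the progress that has been made for Apple intelligence?")
print(result["result"])
# Inspect the transcript chunks the answer was based on.
for doc in result["source_documents"]:
    print(doc.page_content[:120])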
Summary
In this part we’ve discussed and implemented the RAG framework. A similar solution can be deployed at large scale, for example to analyze the conversations in a contact center and derive relevant metrics. The implementation demonstrated here was run on a local machine with a CPU, and some of the operations took a lot of time (creating the vector embeddings with the local embedding model took more than 40 minutes!). New-generation chip architectures have significantly improved the inferencing process, which brings us to the alphabet soup of CPUs, GPUs, LPUs, TPUs... We’ll discuss them in the next part.