Build a Q&A Bot over private data with OpenAI and LangChain - Part 1
The ability to search over your company's private data with an LLM and semantic search is a hot topic right now. There are already many products on the market, free and paid, built to do exactly that.
Today let's build one from scratch using OpenAI and LangChain. You can easily scale it out and up, or swap in free alternatives to deliver the same or even better results.
Check out my YouTube video if you prefer to follow along.
In this tutorial, we will cover:
The concept of Embeddings and Semantic Search
It will be easier to explain if we start with the architectural drawing.
Most LLMs' training data has a cut-off date. For example, the OpenAI models only have information up to September 2021. We are going to use the LLM's natural language processing (NLP) capabilities, connect it to your own private data, and perform search over that data.
It's important to note that the data you provide is generally not used to train the model. Your documents stay in your own vector database rather than being handed over to OpenAI for training.
In this diagram, we start by ingesting the data and splitting it into chunks. The reason is that most models have a token limit, and you simply can't feed a 50-page report to the LLM in one go.
For example, the context limit for gpt-3.5-turbo is 4k tokens, while gpt-4 comes in 8k and 32k variants.
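If you want to check whether a document fits, you can count its tokens with tiktoken. Below is a minimal sketch; the file name is a placeholder for one of your own reports.
import tiktoken

# Count how many tokens a document would consume before sending it to the model
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
with open("report.txt") as f:    # hypothetical file name
    text = f.read()
print(len(enc.encode(text)))     # a 50-page report will easily exceed a 4k context window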
After that, we turn the chunks of text into embeddings. An embedding is a vector (a list) of floating-point numbers. The distance between two vectors measures how related they are. It might not mean much to a human, but a machine understands it and can perform search using a technique called cosine similarity, which compares two vectors to see how similar they are.
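To make the idea concrete, here is a minimal sketch of cosine similarity using made-up three-dimensional vectors (real OpenAI embeddings have 1,536 dimensions):
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction (closely related), 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_cat = np.array([0.8, 0.1, 0.3])
v_kitten = np.array([0.75, 0.15, 0.35])
v_invoice = np.array([0.05, 0.9, 0.2])

print(cosine_similarity(v_cat, v_kitten))   # high score - related concepts
print(cosine_similarity(v_cat, v_invoice))  # low score - unrelated concepts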
If you plot the vectors in space, it looks something like the image below.
This is a vector space of 10,000 words with 200 dimensions. For comparison, OpenAI's embedding model produces 1,536 dimensions.
Once we have turned the data into embeddings, they are stored in a vector database such as Pinecone, Chroma or FAISS, just to name a few.
At this point, the data ingestion phase (shown in green) is complete.
Next, we take the user's input and convert it to an embedding too. By comparing the query against our data in the vector database, we select the chunks most relevant to the user's query based on similarity.
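Under the hood this is a similarity search against the vectorstore. Here is a rough sketch using the Chroma vectorstore we build later in this post; the question is just an example.
# Turn the user's question into an embedding and fetch the k most similar chunks
relevant_chunks = vectorstore.similarity_search("What were the Q3 sales figures?", k=4)
for chunk in relevant_chunks:
    print(chunk.page_content[:100])   # preview each retrieved chunk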
The key here is to pass the user's query and only the relevant data to the GPT model. There are a few benefits to doing so: you stay within the token limit, you keep API costs down, and the model's answer is grounded in your own data rather than its general knowledge.
After that, you get the response from the model and add it back to the conversation history.
Next let's dive into the code.
The Code
The code is quite self-explanatory. It is mainly adapted from the examples in the official docs with slight tweaks. I won't go through the full code here, but will highlight a few key points.
Data Ingestion:
There are many different ways to ingest data, and LangChain provides a vast selection of loaders for it.
To be as realistic as possible, we assume there is a folder containing multiple documents of different formats. We use three directory loaders to ingest all the pdf, txt and docx files.
LangChain also supports URL loaders. You could, for instance, connect to a SharePoint library and load the documents with the help of the unstructured loader, but you would also need to build modern authentication into the code.
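As a rough sketch, a URL-based load could look like the snippet below (the URL is a placeholder, and a real SharePoint library would additionally need authentication):
from langchain.document_loaders import UnstructuredURLLoader

# Load one or more web pages with the unstructured URL loader
url_loader = UnstructuredURLLoader(urls=["https://example.com/annual-report"])
web_documents = url_loader.load()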
The code below scans all three file types stored in the Reports folder.
from langchain.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader('./Reports/', glob="**/*.pdf")
txt_loader = DirectoryLoader('./Reports/', glob="**/*.txt")
word_loader = DirectoryLoader('./Reports/', glob="**/*.docx")

loaders = [pdf_loader, txt_loader, word_loader]
documents = []
for loader in loaders:
    documents.extend(loader.load())
print(f"Total number of documents: {len(documents)}")
Text Splitter:
Once the data is ingested, it needs to be split into smaller chunks. By default, tiktoken is used to count tokens for OpenAI LLMs, and you can also use it to count tokens when splitting documents (see the sketch after the code below).
Here we split the text into chunks of 1,000 characters with no overlap.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(documents)
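If you prefer to split by token count rather than characters, LangChain also offers a tiktoken-based constructor. A sketch, with an illustrative token budget:
from langchain.text_splitter import CharacterTextSplitter

# Split by token count (measured with tiktoken) instead of raw characters
token_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
documents = token_splitter.split_documents(documents)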
Embeddings:
We use OpenAI embeddings and store them in a Chroma vectorstore.
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
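By default this Chroma collection lives in memory and disappears when the runtime stops. If you want to reuse it across runs, you can persist it to disk; here is a sketch, where the directory name is an arbitrary choice (behaviour can vary by chromadb version):
# Persist the Chroma collection so the documents don't have to be re-embedded every run
vectorstore = Chroma.from_documents(documents, embeddings, persist_directory="./chroma_db")
vectorstore.persist()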
The Conversational Retrieval Chain:
LangChain's chains are reusable components that can be linked together. A chain is simply a pre-built (pre-defined) sequence of actions wrapped into a single line of code, so you don't have to call the GPT model yourself or define the prompt properties by hand.
This particular chain gives you the ability to chat over the documents and also remembers the history.
qa = ConversationalRetrievalChain.from_llm(ChatOpenAI(temperature=0), vectorstore.as_retriever())
Here is the logic. It is literally three lines of code; it is only wrapped in a function because of the front end.
chat_history = []

def user(user_message, history):
    # Get response from QA chain
    response = qa({"question": user_message, "chat_history": history})
    # Append user message and response to chat history
    history.append((user_message, response["answer"]))
Front end Gradio:
Gradio is a quick and easy way to spin up a web application for your AI/ML models. If you are interested, check out my previous blog for a deep dive. Here we use the chat UI and combine it with the LangChain code that calls the ChatGPT model.
There is a variation that supports multimodal input. It will be a lot of fun and very powerful once we can use GPT-4's multimodal features.
Complete Code
Here is the full Python code for Google Colab.
# Requirements
!pip install openai -q
!pip install langchain -q
!pip install chromadb -q
!pip install tiktoken -q
!pip install pypdf -q
!pip install unstructured[local-inference] -q
!pip install gradio -q

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
import os

os.environ["OPENAI_API_KEY"] = ""

llm = ChatOpenAI(temperature=0, model_name="gpt-4")

# Data Ingestion
from langchain.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader('./Reports/', glob="**/*.pdf")
txt_loader = DirectoryLoader('./Reports/', glob="**/*.txt")
word_loader = DirectoryLoader('./Reports/', glob="**/*.docx")

loaders = [pdf_loader, txt_loader, word_loader]
documents = []
for loader in loaders:
    documents.extend(loader.load())

# Chunk and Embeddings
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# Initialise LangChain - Conversational Retrieval Chain (using the gpt-4 chat model defined above)
qa = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever())
# Front end web app
import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")
    chat_history = []

    def user(user_message, history):
        # Get response from QA chain
        response = qa({"question": user_message, "chat_history": history})
        # Append user message and response to chat history
        history.append((user_message, response["answer"]))
        return gr.update(value=""), history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

if __name__ == "__main__":
    demo.launch(debug=True)
Python Imaging Library (PIL) error:
You might encounter a PIL error in Google Colab. The default runtime loads PIL 8.4.0, which does not work with the unstructured document loader. Running an upgrade directly does not work either, so make sure you uninstall it first.
import PIL
!pip uninstall -y Pillow
!pip install --upgrade Pillow
print(PIL.__version__)