Build a Q&A Bot over private data with OpenAI and LangChain - Part 1
The ability to search over your company's private data with an LLM and semantic search is a hot topic right now. There are already many products on the market, free and paid, built to do exactly that.
Today let's build one from scratch using OpenAI and LangChain. You can easily scale it out and up, or swap in free alternatives to deliver the same or even better results.
Check out my YouTube video if you prefer to follow along.
In this tutorial, we will cover:
The concept of Embeddings and Semantic Search
It will be easier to explain if we start with the architectural drawing.
Most LLMs' training data has a cut-off date. For example, the OpenAI models only have information up to September 2021. We are going to use the LLM's natural language processing (NLP) capabilities, connect it to your own private data, and perform search over that data.
It's important to note that the data you provide is generally not used to train the model. Your documents stay in your own vector database rather than being handed over to OpenAI for training.
In this diagram, we start by ingesting the data and splitting it into chunks. The reason is that most models have a token limit, and you simply can't feed a 50-page report to the LLM in one go.
For example, the context limit for gpt-3.5-turbo is 4k tokens, while gpt-4 comes in 8k and 32k variants.
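If you want to check whether a document fits, you can count its tokens with tiktoken. Below is a minimal sketch; the file name is a placeholder for one of your own reports.
import tiktoken

# Count how many tokens a document would consume before sending it to the model
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
with open("report.txt") as f:    # hypothetical file name
    text = f.read()
print(len(enc.encode(text)))     # a 50-page report will easily exceed a 4k context window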
After that, we turn the chunks of text into embeddings. An embedding is a vector (a list) of floating-point numbers. The distance between two vectors measures how related they are. It might not mean much to a human, but a machine understands it and can perform search using a technique called cosine similarity, which compares two vectors to see how similar they are.
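To make the idea concrete, here is a minimal sketch of cosine similarity using made-up three-dimensional vectors (real OpenAI embeddings have 1,536 dimensions):
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction (closely related), 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_cat = np.array([0.8, 0.1, 0.3])
v_kitten = np.array([0.75, 0.15, 0.35])
v_invoice = np.array([0.05, 0.9, 0.2])

print(cosine_similarity(v_cat, v_kitten))   # high score - related concepts
print(cosine_similarity(v_cat, v_invoice))  # low score - unrelated concepts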
If you plot the vectors in space, it looks something like the image below.
This is a vector space of 10,000 words with 200 dimensions. For comparison, OpenAI's embedding model produces 1,536 dimensions.
Once we have turned the data into embeddings, they are stored in a vector database such as Pinecone, Chroma or FAISS, just to name a few.
At this point, the data ingestion phase (shown in green) is complete.
Next, we take the user's input and convert it to an embedding too. By comparing the query against our data in the vector database, we select the chunks most relevant to the user's query based on similarity.
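Under the hood this is a similarity search against the vectorstore. Here is a rough sketch using the Chroma vectorstore we build later in this post; the question is just an example.
# Turn the user's question into an embedding and fetch the k most similar chunks
relevant_chunks = vectorstore.similarity_search("What were the Q3 sales figures?", k=4)
for chunk in relevant_chunks:
    print(chunk.page_content[:100])   # preview each retrieved chunk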
The key here is to pass the user's query and only the relevant data to the GPT model. There are a few benefits to doing so: you stay within the token limit, you keep API costs down, and the model's answer is grounded in your own data rather than its general knowledge.
After that, you get the response from the model and add it back to the conversation history.
Next let's dive into the code.
The Code
The code is quite self-explanatory. It is mainly adapted from the examples in the official docs with slight tweaks. I won't go through the full code here, but will highlight a few key points.
Data Ingestion:
There are many different ways to ingest data, and LangChain provides a vast selection of loaders for it.
To be as realistic as possible, we assume there is a folder containing multiple documents of different formats. We use three directory loaders to ingest all the pdf, txt and docx files.
LangChain also supports URL loaders. You could, for instance, connect to a SharePoint library and load the documents with the help of the unstructured loader, but you would also need to build modern authentication into the code.
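As a rough sketch, a URL-based load could look like the snippet below (the URL is a placeholder, and a real SharePoint library would additionally need authentication):
from langchain.document_loaders import UnstructuredURLLoader

# Load one or more web pages with the unstructured URL loader
url_loader = UnstructuredURLLoader(urls=["https://example.com/annual-report"])
web_documents = url_loader.load()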
The code below scans all three file types stored in the Reports folder.
from langchain.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader('./Reports/', glob="**/*.pdf")
txt_loader = DirectoryLoader('./Reports/', glob="**/*.txt")
word_loader = DirectoryLoader('./Reports/', glob="**/*.docx")

loaders = [pdf_loader, txt_loader, word_loader]
documents = []
for loader in loaders:
    documents.extend(loader.load())
print(f"Total number of documents: {len(documents)}")
Text Splitter:
Once the data is ingested, it needs to be split into smaller chunks. By default, tiktoken is used to count tokens for OpenAI LLMs, and you can also use it to count tokens when splitting documents (see the sketch after the code below).
Here we split the text into chunks of 1,000 characters with no overlap.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(documents)
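If you prefer to split by token count rather than characters, LangChain also offers a tiktoken-based constructor. A sketch, with an illustrative token budget:
from langchain.text_splitter import CharacterTextSplitter

# Split by token count (measured with tiktoken) instead of raw characters
token_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
documents = token_splitter.split_documents(documents)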
Embeddings:
We use OpenAI embeddings and store them in a Chroma vectorstore.
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
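By default this Chroma collection lives in memory and disappears when the runtime stops. If you want to reuse it across runs, you can persist it to disk; here is a sketch, where the directory name is an arbitrary choice (behaviour can vary by chromadb version):
# Persist the Chroma collection so the documents don't have to be re-embedded every run
vectorstore = Chroma.from_documents(documents, embeddings, persist_directory="./chroma_db")
vectorstore.persist()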
The Conversational Retrieval Chain:
LangChain's chains are reusable components that can be linked together. A chain is simply a pre-built (pre-defined) sequence of actions wrapped into a single line of code, so you don't have to call the GPT model yourself or define the prompt properties by hand.
This particular chain gives you the ability to chat over the documents and also remembers the history.
qa = ConversationalRetrievalChain.from_llm(ChatOpenAI(temperature=0), vectorstore.as_retriever())
Here is the logic. It is literally three lines of code; it is only wrapped in a function because of the front end.
chat_history = []

def user(user_message, history):
    # Get response from QA chain
    response = qa({"question": user_message, "chat_history": history})
    # Append user message and response to chat history
    history.append((user_message, response["answer"]))
Front end Gradio:
Gradio is a quick and easy way to spin up a web application for your AI/ML models. If you are interested, check out my previous blog for a deep dive. Here we use the chat UI and combine it with the LangChain code that calls the ChatGPT model.
There is a variation that supports multimodal input. It will be a lot of fun and very powerful once we can use GPT-4's multimodal features.
Complete Code
Here is the full Python code for Google Colab.
# Requirements
!pip install openai -q
!pip install langchain -q
!pip install chromadb -q
!pip install tiktoken -q
!pip install pypdf -q
!pip install unstructured[local-inference] -q
!pip install gradio -q

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
import os

os.environ["OPENAI_API_KEY"] = ""

llm = ChatOpenAI(temperature=0, model_name="gpt-4")

# Data Ingestion
from langchain.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader('./Reports/', glob="**/*.pdf")
txt_loader = DirectoryLoader('./Reports/', glob="**/*.txt")
word_loader = DirectoryLoader('./Reports/', glob="**/*.docx")

loaders = [pdf_loader, txt_loader, word_loader]
documents = []
for loader in loaders:
    documents.extend(loader.load())

# Chunk and Embeddings
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# Initialise LangChain - Conversational Retrieval Chain (using the gpt-4 chat model defined above)
qa = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever())
# Front end web app
import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")
    chat_history = []

    def user(user_message, history):
        # Get response from QA chain
        response = qa({"question": user_message, "chat_history": history})
        # Append user message and response to chat history
        history.append((user_message, response["answer"]))
        return gr.update(value=""), history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

if __name__ == "__main__":
    demo.launch(debug=True)
Python Imaging Library (PIL) error:
You might encounter a PIL error in Google Colab. The default runtime loads PIL 8.4.0, which does not work with the unstructured document loader. Running an upgrade directly does not work either, so make sure you uninstall it first.
import PIL
!pip uninstall -y Pillow
!pip install --upgrade Pillow
print(PIL.__version__)