Querying documents with Langchain
Large Language Models are in the technology spotlight – and for good reason. In this set of posts, I’m going to show how LLMs and Python can change the way we approach a lot of problems.
Now, let’s have some fun.
A New Approach...
I'm taking a new approach with these articles: shorter pieces that concentrate on just one or two topics. This article focuses on loading a document into a model and then querying it, like a chatbot.
Just like before, I'll be using OpenAI. If you want to follow along in code, you'll need an OpenAI API key. See my previous article about using OpenAI keys - OpenAI is very cheap to use, but it's not free.
You'll also need to install some libraries:
#! pip install openai langchain unstructured chromadb Cython tiktoken pypdf
Tip: If you're using Windows, you may also need to install the PyPDF2 library.
To start things off, we'll import those libraries, set our API key, and set up the Davinci model:
import os
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader
os.environ["OPENAI_API_KEY"] = 'YOUR-KEY-HERE'
davinci = OpenAI(model_name='text-davinci-003')
For my sample document, I've used a very nice sample statement of work ('sample-statement-of-work.pdf') from https://www.stakeholdermap.com:
This document has about 2000 words, so it is comfortably below the roughly 4,000-token limit of our model. Langchain wraps pypdf to provide a consistent interface for loading the PDF into memory:
loader = PyPDFLoader(r"C:\Users\Mark\OneDrive\langchain\sample-statement-of-work.pdf")
sample_doc = loader.load()
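If you want to sanity-check a document's size against the model's limit before loading it, a common rule of thumb is that one token is roughly four characters of English text. (The tiktoken library we installed above gives exact counts; the helper below is just a hypothetical estimator for illustration.)

```python
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English text.
    # For exact counts, use tiktoken's tokenizer for your model instead.
    return max(1, len(text) // 4)

sample = "This Statement of Work outlines deliverables, payment terms and conditions."
print(estimate_tokens(sample))
```

If the estimate comes back near the model's limit, you'd want to split the document before querying it; we'll look at larger documents in a later post.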
Let's quickly brush up on some vocabulary so we can better understand the code.
Embeddings & Vectors: From our perspective, Embeddings are sets of numbers assigned to words. Words with similar meanings will have similar numbers assigned to them. Vectors use these embeddings to represent words and phrases.
Use of Embeddings and Vectors are critical to LLM natural language processing.
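The intuition can be shown with toy vectors. These three-dimensional "embeddings" are invented purely for illustration (real OpenAI embeddings have around 1,500 dimensions), but the comparison works the same way: similar words end up with similar vectors, measured here by cosine similarity.

```python
import math

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", invented for illustration only
king = [0.9, 0.8, 0.1]
queen = [0.88, 0.82, 0.15]
banana = [0.1, 0.05, 0.95]

print(cosine_similarity(king, queen))   # close to 1.0
print(cosine_similarity(king, banana))  # much lower
```

This is how the database we build next can find the passages of the PDF most relevant to a question: it compares the question's vector to the vectors of the stored text chunks.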
We'll need a few more libraries to work with embeddings and vectors; Langchain nicely wraps these up to simplify our code. We'll use embeddings from OpenAI and store the vectors we generate in a Chroma database:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
#Use OpenAI embeddings
embeddings = OpenAIEmbeddings()
# create a vector database using the sample document
# and the OpenAI embeddings
vectordb = Chroma.from_documents(
    documents=sample_doc, embedding=embeddings, persist_directory="chromadb"
)
# persist the database to disk for later use
vectordb.persist()
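Conceptually, what the vector database stores is just (text, vector) pairs, and a query returns the stored texts whose vectors are closest to the query's vector (Chroma compares vectors with L2 distance by default). Here's a hypothetical, stdlib-only sketch of the idea; Chroma adds persistence and fast indexing on top of this:

```python
class ToyVectorStore:
    """A minimal in-memory stand-in for what a vector database does."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=1):
        # Rank stored texts by squared L2 distance to the query vector
        def sq_dist(v):
            return sum((x - y) ** 2 for x, y in zip(v, query_vector))
        ranked = sorted(self.items, key=lambda item: sq_dist(item[1]))
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add("payment terms", [0.9, 0.1])
store.add("project deliverables", [0.2, 0.95])
print(store.search([0.85, 0.2]))  # ['payment terms']
```

When we ask the chatbot a question below, this is essentially the retrieval step that happens before the LLM ever sees the text.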
Now the hard work is done, and we can create a Langchain conversation as we did in a previous article.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
pdf_chat = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.8), vectordb.as_retriever(), memory=memory)
while True:
    line = input("Enter a question: ")
    if line == "":
        break
    result = pdf_chat({"question": line})
    print(f"Answer: {result['answer']}")
Let's run it and look at the output:
Enter a question: Summarise the statement of work
Answer: This Statement of Work outlines the project for a software solution ticket linking completion for ACME Company provided by Globex, including project deliverables, detailed roles and responsibilities, onsite requirements, payment terms, and standard terms and conditions.
Enter a question: What are the deliverables?
Answer: Project Plan, Migration of customisations, Custom report, Regional Interface and Language Requirements, Functional Scope.
Enter a question: What is the value of the project?
Answer: The total Fees Summary of the project is €33,600.
Enter a question: Is Globex responsible for DNS issues?
Answer: No, Globex is not responsible for DNS issues, as this is explicitly excluded from the Statement of Work.
Enter a question: What is the estimated value of the project?
Answer: €33,600
Enter a question: How long will the project run?
Answer: The length of the project is not specified in this document.
Pretty nice, eh?
Here's the entire program:
##############################################################
# LangPDF.py
# written by Mark Killmer, 24/06/2023
# This program loads a pdf into a Langchain conversation
# and acts like a simple chatbot.
##############################################################
import os
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
# Set the API key before creating any OpenAI objects
os.environ["OPENAI_API_KEY"] = '<YOUR OPENAI KEY>'
# Read the PDF file
loader = PyPDFLoader(r"C:\Users\Mark\OneDrive\langchain\sample-statement-of-work.pdf")
doc = loader.load()
# Get OpenAI embeddings and create the vector database
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(
    documents=doc, embedding=embeddings
)
# create the conversation and loop
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
pdf_chat = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.8), vectordb.as_retriever(), memory=memory)
while True:
    line = input("Enter a question: ")
    if line == "":
        break
    result = pdf_chat({"question": line})
    print(f"Answer: {result['answer']}")
Try your own PDF and see what types of queries work and which don't.
That wraps up this post - I hope to see you next time where we will load up larger documents!