Querying documents with Langchain
Large Language Models are in the technology spotlight – and for good reason. In this set of posts, I’m going to show how LLMs and Python can change the way we approach a lot of problems.
Now, let’s have some fun.
A New Approach...
I'm taking a new approach with these articles: shorter pieces that concentrate on just one or two topics. This article focuses on loading a document into a model and then querying it, like a chatbot.
Just like before, I'll be using OpenAI. If you want to follow along in code, you'll need an OpenAI API key. See my previous article about using OpenAI keys - OpenAI is very cheap to use, but it's not free.
You'll also need to install some libraries:
#! pip install openai langchain unstructured chromadb Cython tiktoken pypdf
Tip: If you're using Windows, you may also need to install the PyPDF2 library.
To start things off, we'll import those libraries, set our API key, and set up the Davinci model:
import os
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader
os.environ["OPENAI_API_KEY"] = 'YOUR-KEY-HERE'
davinci = OpenAI(model_name='text-davinci-003')
For my sample document, I've used a very nice sample statement of work ('sample-statement-of-work.pdf') from https://www.stakeholdermap.com:
This document has about 2000 words, so it is comfortably below the roughly 4,000-token limit of our model. Langchain wraps pypdf to provide a consistent interface for loading the PDF into memory:
loader = PyPDFLoader(r"C:\Users\Mark\OneDrive\langchain\sample-statement-of-work.pdf")
sample_doc = loader.load()
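If you want to sanity-check a document's size against the model's limit before loading it, a common rule of thumb is that one token is roughly four characters of English text. (The tiktoken library we installed above gives exact counts; the helper below is just a hypothetical estimator for illustration.)

```python
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English text.
    # For exact counts, use tiktoken's tokenizer for your model instead.
    return max(1, len(text) // 4)

sample = "This Statement of Work outlines deliverables, payment terms and conditions."
print(estimate_tokens(sample))
```

If the estimate comes back near the model's limit, you'd want to split the document before querying it; we'll look at larger documents in a later post.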
Let's quickly brush up on some vocabulary so we can better understand the code.
Embeddings & Vectors: From our perspective, Embeddings are sets of numbers assigned to words. Words with similar meanings will have similar numbers assigned to them. Vectors use these embeddings to represent words and phrases.
Use of Embeddings and Vectors are critical to LLM natural language processing.
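The intuition can be shown with toy vectors. These three-dimensional "embeddings" are invented purely for illustration (real OpenAI embeddings have around 1,500 dimensions), but the comparison works the same way: similar words end up with similar vectors, measured here by cosine similarity.

```python
import math

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", invented for illustration only
king = [0.9, 0.8, 0.1]
queen = [0.88, 0.82, 0.15]
banana = [0.1, 0.05, 0.95]

print(cosine_similarity(king, queen))   # close to 1.0
print(cosine_similarity(king, banana))  # much lower
```

This is how the database we build next can find the passages of the PDF most relevant to a question: it compares the question's vector to the vectors of the stored text chunks.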
We'll need a few more libraries to work with embeddings and vectors; Langchain nicely wraps these up to simplify our code. We'll use embeddings from OpenAI and store the vectors we generate in a Chroma database:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
#Use OpenAI embeddings
embeddings = OpenAIEmbeddings()
# create a vector database using the sample document
# and the OpenAI embeddings
vectordb = Chroma.from_documents(
    documents=sample_doc, embedding=embeddings, persist_directory="chromadb"
)
# persist the database to disk for later use
vectordb.persist()
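Conceptually, what the vector database stores is just (text, vector) pairs, and a query returns the stored texts whose vectors are closest to the query's vector (Chroma compares vectors with L2 distance by default). Here's a hypothetical, stdlib-only sketch of the idea; Chroma adds persistence and fast indexing on top of this:

```python
class ToyVectorStore:
    """A minimal in-memory stand-in for what a vector database does."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=1):
        # Rank stored texts by squared L2 distance to the query vector
        def sq_dist(v):
            return sum((x - y) ** 2 for x, y in zip(v, query_vector))
        ranked = sorted(self.items, key=lambda item: sq_dist(item[1]))
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add("payment terms", [0.9, 0.1])
store.add("project deliverables", [0.2, 0.95])
print(store.search([0.85, 0.2]))  # ['payment terms']
```

When we ask the chatbot a question below, this is essentially the retrieval step that happens before the LLM ever sees the text.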
Now the hard work is done, and we can create a Langchain conversation as we did in a previous article.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
pdf_chat = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.8), vectordb.as_retriever(), memory=memory)
while True:
    line = input("Enter a question: ")
    if line == "":
        break
    result = pdf_chat({"question": line})
    print(f"Answer: {result['answer']}")
Let's run it and look at the output:
Enter a question: Summarise the statement of work
Answer: This Statement of Work outlines the project for a software solution ticket linking completion for ACME Company provided by Globex, including project deliverables, detailed roles and responsibilities, onsite requirements, payment terms, and standard terms and conditions.
Enter a question: What are the deliverables?
Answer: Project Plan, Migration of customisations, Custom report, Regional Interface and Language Requirements, Functional Scope.
Enter a question: What is the value of the project?
Answer: The total Fees Summary of the project is €33,600.
Enter a question: Is Globex responsible for DNS issues?
Answer: No, Globex is not responsible for DNS issues, as this is explicitly excluded from the Statement of Work.
Enter a question: What is the estimated value of the project?
Answer: €33,600
Enter a question: How long will the project run?
Answer: The length of the project is not specified in this document.
Pretty nice, eh?
Here's the entire program:
##############################################################
# LangPDF.py
# written by Mark Killmer, 24/06/2023
# This program loads a pdf into a Langchain conversation
# and acts like a simple chatbot.
##############################################################
import os
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
# Set the API key before creating any OpenAI objects
os.environ["OPENAI_API_KEY"] = '<YOUR OPENAI KEY>'
# Read the PDF file
loader = PyPDFLoader(r"C:\Users\Mark\OneDrive\langchain\sample-statement-of-work.pdf")
doc = loader.load()
# Get OpenAI embeddings and create the vector database
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(
    documents=doc, embedding=embeddings
)
# create the conversation and loop
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
pdf_chat = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.8), vectordb.as_retriever(), memory=memory)
while True:
    line = input("Enter a question: ")
    if line == "":
        break
    result = pdf_chat({"question": line})
    print(f"Answer: {result['answer']}")
Try your own PDF and see what types of queries work and which don't.
That wraps up this post - I hope to see you next time where we will load up larger documents!