Talk to your PDF Documents
Created with MidJourney - Credit Lina Haidar

Have you ever wished you could simply ask your PDFs questions? I developed ChatPDF to do just that. Below, I walk through the code and explain it in detail.

import os
import openai
from dotenv import load_dotenv
from PyPDF2 import PdfReader

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

import streamlit as st

# Load environment variables from .env file
load_dotenv()

# Initialize the OpenAI API key from the environment
openai.api_key = os.getenv("OPENAI_API_KEY")

# Streamlit Code for UI - Upload PDF(s)
st.title('ChatPDF :microphone:')

uploaded_files = st.file_uploader("Choose PDF files", type="pdf", accept_multiple_files=True)
if uploaded_files:
    raw_text = ''
    # Loop through each uploaded file
    for uploaded_file in uploaded_files:

        # Read the PDF
        pdf_reader = PdfReader(uploaded_file)

        # Loop through each page in the PDF
        for page in pdf_reader.pages:

            # Extract the text from the page
            text = page.extract_text()

            # If there is text, add it to the raw text
            if text:
                raw_text += text
              
    # Split text into smaller chunks to index them
    text_splitter = CharacterTextSplitter(
        separator="\n",       # split on line breaks
        chunk_size=1000,      # maximum number of characters per chunk
        chunk_overlap=200,    # characters shared between adjacent chunks
        length_function=len,
    )
    
    texts = text_splitter.split_text(raw_text)
    
    # Set up the OpenAI embeddings client (chunks are embedded through the API)
    embeddings = OpenAIEmbeddings() # Default model "text-embedding-ada-002"
    
    # Create a FAISS vector store with all the documents and their embeddings
    docsearch = FAISS.from_texts(texts, embeddings)
    
    # Load the question answering chain and stuff it with the documents
    chain = load_qa_chain(OpenAI(), chain_type="stuff", verbose=True) 

    query = st.text_input("Ask a question or give an instruction")
    
    if query:
        # Perform a similarity search to find the 6 chunks of text in the vector store most similar to the query
        docs = docsearch.similarity_search(query, k=6)

        # Run the question answering chain on those 6 chunks and the user's query
        answer = chain.run(input_documents=docs, question=query)

        # Display the answer, followed by the 6 retrieved chunks of text
        st.write(answer, docs)

Python and Machine Learning: Powering Interactive PDFs

ChatPDF leverages powerful libraries from the Python ecosystem, including OpenAI, LangChain, PyPDF2, and Streamlit. The combination of these libraries allows users to interact with PDFs as never before, posing queries and receiving immediate, accurate responses. Python provides the simplicity and versatility needed to integrate these different technologies, making it the perfect choice for developing this tool.

ChatPDF relies on the OpenAI API to understand text and generate meaningful responses. It employs OpenAI embeddings to transform text into high-dimensional vectors.

These vectors represent the "meanings" of different pieces of text and allow the model to determine the similarity between texts or text queries.

Reading the text from PDFs is achieved using PyPDF2, a library designed for PDF manipulation in Python. This library scans through each page of the uploaded PDFs and extracts the text, which is then prepared for analysis.
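The extraction loop can be sketched on its own, without a real PDF. This is a minimal sketch, not the app's exact code: collect_text and FakePage are hypothetical names I made up for illustration, and the only assumption about a page is that it has an extract_text() method, as PyPDF2 page objects do.

```python
from typing import Iterable, Optional, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, such as a PyPDF2 page object."""

    def extract_text(self) -> Optional[str]: ...


def collect_text(pages: Iterable[PageLike]) -> str:
    """Concatenate the text of every page, skipping pages that yield nothing."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # scanned or image-only pages may yield "" or None
            parts.append(text)
    # Joining with a newline keeps the last word of one page from
    # fusing with the first word of the next page.
    return "\n".join(parts)


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [FakePage("Page one."), FakePage(""), FakePage(None), FakePage("Page three.")]
demo_text = collect_text(demo_pages)
print(demo_text)
```

Unlike the app's plain string concatenation, this sketch puts a newline between pages. That is a design choice of the sketch: since the splitter later breaks on newlines, page boundaries become natural split points.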

LangChain is an open-source framework designed to develop applications powered by language models. ChatPDF leverages the following modules from LangChain:

  • langchain.embeddings.openai.OpenAIEmbeddings: This module is used to create embeddings of the text using the OpenAI API. An embedding is a numerical representation of text, where similar text (or context) results in similar numerical representations. This transformation is crucial for machine learning algorithms because they primarily work with numerical data.
  • langchain.text_splitter.CharacterTextSplitter: This module is used to split the text into smaller chunks. The splitting is based on a specified character (in this case, a newline character) and a certain chunk size and overlap. The chunk size determines the number of characters in each chunk, and the overlap specifies how many characters are shared between adjacent chunks. The splitting process allows the application to handle large documents more efficiently and facilitates the generation of embeddings.
  • langchain.vectorstores.FAISS: This module wraps the FAISS library, which is used to create an index of the embeddings. An index is a data structure that allows for quick lookup of data. In this case, the index helps in finding the most similar documents to a user query.
  • langchain.llms.OpenAI: This module integrates the OpenAI API, facilitating the interaction between the application and the OpenAI services.
  • langchain.chains.question_answering.load_qa_chain: This module is used to load a question answering chain. A QA chain is essentially a sequence of operations that, when run on an input (in this case, a set of documents and a user query), generate an output (in this case, an answer to the user's question).
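To make the chunk size and overlap described for the text splitter concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: split_with_overlap is a hypothetical helper that only mimics the idea of packing newline-separated pieces into chunks and carrying a little trailing text into the next chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap characters of
    trailing pieces into the next chunk. A single piece longer than
    chunk_size is kept whole rather than cut mid-line."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def joined_len(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only as many trailing pieces as fit in the overlap
            # budget while still leaving room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample_text = "\n".join(f"line {i:02d}" for i in range(10))
sample_chunks = split_with_overlap(sample_text, chunk_size=25, chunk_overlap=10)
for chunk in sample_chunks:
    print(repr(chunk))
```

With these toy settings, each chunk begins with the last line of the previous one. That overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.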

The interactivity of ChatPDF is powered by Streamlit, a fast and user-friendly way to create apps and interactive data dashboards. The Streamlit interface lets users upload PDFs and input queries in a simple, intuitive fashion.

Breaking Down the Magic of ChatPDF

The first step in ChatPDF's process involves uploading the PDF files through the Streamlit interface. The PyPDF2 library then reads these files, looping through each page to extract the text. The extracted text is split into manageable chunks, and each chunk is sent to the OpenAI embeddings model, which converts it into a high-dimensional vector, essentially turning sentences or paragraphs into points in a multi-dimensional space. These vectors are then indexed, which prepares the text for effective similarity searches.

The core of ChatPDF is its ability to run a similarity search for a given query. This is powered by a vector store built on FAISS (Facebook AI Similarity Search), a library developed by Facebook's AI Research team for efficient similarity search and clustering of high-dimensional vectors. Using a FAISS vector store that contains all the text chunks and their corresponding embeddings, ChatPDF can find the chunks that most closely align with the user's query. FAISS is very efficient at searching through these points to find the closest vectors (i.e., the most similar texts) to a given query vector. The result is a fast way to identify the relevant passages within the vector space of the input PDFs.

Embeddings: Turning Text into Meaningful Mathematics

So, how do we turn text into these high-dimensional vectors that FAISS can search through? The answer lies in embeddings.

Embeddings are a way of representing text, or more broadly, categorical data, in a mathematical format that machine learning algorithms can work with.

In the case of ChatPDF, we're using OpenAI's text embeddings model.

The goal of an embedding model is to create vectors for pieces of text in such a way that the spatial relationships between the vectors reflect the semantic relationships between the texts. In simpler terms, texts that have similar meanings should be represented by vectors that are close to each other in the multi-dimensional vector space.

For example, the vectors for the phrases "dog" and "puppy" should be closer together than the vectors for "dog" and "internet" because the concepts of dog and puppy are more similar to each other than the concepts of dog and internet.
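The dog/puppy/internet intuition can be checked with a few lines of Python. The vectors below are made up by hand purely for illustration (real embeddings from text-embedding-ada-002 have 1,536 dimensions and come from the API), and cosine similarity is one common way to measure closeness, assumed here for the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings", NOT output from a real model.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "internet": [0.1, 0.2, 0.95],
}

sim_dog_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
sim_dog_internet = cosine_similarity(toy_vectors["dog"], toy_vectors["internet"])

print(f"dog vs puppy:    {sim_dog_puppy:.3f}")
print(f"dog vs internet: {sim_dog_internet:.3f}")
```

The first score comes out far higher than the second, which is the property an embedding model is trained to produce for related concepts.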

Similarity Search and Indexing: Making Sense of Data

Once we have these high-dimensional vectors, FAISS performs a similarity search. In essence, the similarity search compares the query vector (representing the user's question) to the vectors of all the documents. The search aims to find the vectors that are closest to the query vector, as these represent the documents most likely to contain a relevant answer.
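In its simplest form, this comparison is an exhaustive nearest-neighbor search. The sketch below shows the idea in plain Python with Euclidean (L2) distance; top_k and the tiny 2-dimensional vectors are hypothetical and stand in for what a vector store does at much larger scale.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=2):
    """Compare the query to every document vector and return the k
    closest as (document index, distance) pairs, nearest first."""
    scored = ((l2_distance(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs))
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]


# Toy 2-dimensional document vectors, invented for this demo.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
nearest = top_k([1.0, 1.0], doc_vecs, k=2)
print(nearest)
```

In the app, the same role is played by docsearch.similarity_search(query, k=6), which returns the six chunks whose vectors sit closest to the query vector.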

Indexing, on the other hand, is the process of organizing data in an efficient manner to speed up the similarity searches. When dealing with high-dimensional data (like text embeddings), simple search methods can be slow because they require comparing the query vector to every document vector.

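FAISS can tackle this problem by creating an index of the document vectors that lets it quickly narrow down potential matches before performing a more detailed comparison. This two-step approach significantly reduces the computational requirements and makes the similarity search faster. It applies to FAISS's approximate index types; as far as I know, the index that LangChain's FAISS.from_texts builds by default is a flat one that compares the query against every stored vector, which is still fast for the few hundred or few thousand chunks that a handful of PDFs produce.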
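The narrow-down-then-compare idea can be illustrated with a toy inverted-file layout in plain Python. This is a conceptual sketch only, not FAISS code: build_buckets, search, and the hand-picked centroids are hypothetical, whereas FAISS learns its centroids by clustering the data.

```python
import math
from collections import defaultdict


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_buckets(doc_vecs, centroids):
    """Assign every document vector to the bucket of its nearest centroid."""
    buckets = defaultdict(list)
    for doc_id, vec in enumerate(doc_vecs):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(doc_id)
    return buckets


def search(query, doc_vecs, centroids, buckets, k=2, nprobe=1):
    """Two-step search: pick the nprobe closest buckets, then rank only
    the vectors stored in those buckets."""
    # Step 1 (coarse): compare the query to the few centroids.
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [doc_id for c in order[:nprobe] for doc_id in buckets.get(c, [])]
    # Step 2 (fine): exact distances, but only for the candidates.
    candidates.sort(key=lambda doc_id: l2(query, doc_vecs[doc_id]))
    return candidates[:k]


# Toy data: three vectors near the origin, two near (10, 10).
ivf_docs = [[0.1, 0.2], [0.6, 0.1], [9.8, 10.1], [10.2, 9.9], [0.2, 0.3]]
ivf_centroids = [[0.0, 0.0], [10.0, 10.0]]  # hand-picked for the demo
ivf_buckets = build_buckets(ivf_docs, ivf_centroids)
ivf_hits = search([0.25, 0.25], ivf_docs, ivf_centroids, ivf_buckets, k=2)
print(ivf_hits)
```

The trade-off is that a true neighbor sitting in a bucket that was not probed gets missed, which is why this family of indexes is called approximate. Raising nprobe searches more buckets and recovers accuracy at the cost of speed.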

Once the most similar documents are identified, the application employs a question-answering chain, enabled by the OpenAI model, to generate a response based on the relevant documents. This response is then displayed on the Streamlit interface, providing the user with an accurate answer to their query.
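The "stuff" chain type used in the code gets its name from stuffing all retrieved chunks into a single prompt. The sketch below shows that idea in plain Python. build_stuff_prompt is a hypothetical helper and the instruction wording is my own; LangChain's actual prompt template may differ.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one context block and
    append the user's question, producing a single prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Invented chunks standing in for the results of a similarity search.
retrieved_chunks = [
    "The warranty covers parts for two years.",
    "Labor is covered for the first six months.",
]
stuff_prompt = build_stuff_prompt(retrieved_chunks, "How long are parts covered?")
print(stuff_prompt)
```

Because everything goes into one prompt, the number of chunks (k=6 in the app) times the chunk size has to fit within the model's context window. That is the main constraint to keep in mind when tuning either value.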

What about other types of documents?

Luckily, LangChain has many document loaders besides PDF, such as SQL and CSV. It leverages the Unstructured Python package to transform many types of files (text, PowerPoint, images, HTML) into text data.
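Whatever the source format, the goal is the same: end up with plain text that can be chunked and embedded. As a rough illustration using only Python's standard library (this is not LangChain's CSV loader, and csv_rows_to_text is a hypothetical helper), each CSV row can be turned into a small text document:

```python
import csv
import io


def csv_rows_to_text(csv_source):
    """Turn each CSV row into a 'column: value' text block, one per row."""
    reader = csv.DictReader(csv_source)
    docs = []
    for row in reader:
        docs.append("\n".join(f"{col}: {val}" for col, val in row.items()))
    return docs


# In-memory sample data, invented for this demo.
sample_csv = io.StringIO("name,role\nAda,engineer\nGrace,admiral\n")
csv_docs = csv_rows_to_text(sample_csv)
for doc in csv_docs:
    print(doc)
    print("---")
```

Each resulting string could then go through the same split, embed, and index steps that the PDF text goes through in ChatPDF.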

Conclusion

By combining the power of Python, LangChain, OpenAI, and Streamlit, we can change the way we interact with PDFs and many other document types. ChatPDF provides a new level of interactivity, allowing for instant question answering across multiple documents. This tool is a valuable asset for anyone looking to gain insights from a large corpus of documents without the manual labor.

ChatPDF demonstrates the potential of AI and machine learning in making sense of vast amounts of information, and it's just the tip of the iceberg. Stay tuned for more exciting developments in this space!


ChatPDF FAQ

What is ChatPDF?

ChatPDF is a tool developed to enable users to interact with PDFs by asking questions and receiving immediate, accurate responses. It leverages the power of Python libraries such as OpenAI, LangChain, PyPDF2, and Streamlit.

How does ChatPDF work?

When a user uploads PDF files through the Streamlit interface, ChatPDF uses the PyPDF2 library to read the files and extract the text from each page. The extracted text is then split into smaller chunks and indexed using the LangChain framework. The text is transformed into high-dimensional vectors using OpenAI embeddings. The FAISS library is utilized for similarity search and indexing to find the most similar documents to a user's query. Finally, a question-answering chain powered by OpenAI generates a response based on the relevant documents, which is displayed on the Streamlit interface.

What libraries and technologies are used in ChatPDF?

ChatPDF utilizes various libraries and technologies, including OpenAI, LangChain, PyPDF2, and Streamlit. It relies on OpenAI embeddings for transforming text into high-dimensional vectors. PyPDF2 is used to read and extract text from PDF files. LangChain provides modules such as OpenAIEmbeddings, CharacterTextSplitter, FAISS, and load_qa_chain for text processing, splitting, indexing, and question answering. Streamlit powers the user interface, allowing for file uploading and query input.

What is the role of OpenAI in ChatPDF?

OpenAI plays a crucial role in ChatPDF by providing the intelligence behind the tool. It employs the OpenAI API, which can understand text and generate meaningful responses. OpenAI embeddings are used to transform text into high-dimensional vectors, enabling similarity search and question answering.

How does ChatPDF handle large documents efficiently?

To handle large documents efficiently, ChatPDF uses the CharacterTextSplitter module from LangChain. This module splits the text into smaller chunks based on a specified character (line break) and defines a chunk size and overlap. By dividing the text into manageable chunks, the tool can process and generate embeddings more efficiently.

What is FAISS, and how does it contribute to ChatPDF?

FAISS (Facebook AI Similarity Search) is a library developed by Facebook's AI Research team. In ChatPDF, FAISS is utilized to create an index of the document embeddings, allowing for quick similarity searches. FAISS efficiently searches through high-dimensional vectors to find the closest matches to a user's query, improving the speed and accuracy of the tool.

What are embeddings, and why are they important in ChatPDF?

Embeddings are numerical representations of text that capture the meaning or semantics of the text. In ChatPDF, OpenAI embeddings are used to transform text into high-dimensional vectors. These embeddings enable similarity search and help determine the relevance of documents to a user's query. Embeddings play a vital role in ChatPDF's ability to find and present accurate answers.

How does ChatPDF provide question-answering functionality?

ChatPDF incorporates a question-answering chain powered by the OpenAI model. After performing a similarity search and identifying the most relevant documents, the question-answering chain runs on those documents based on the user's query. The chain generates a response that is then displayed to the user, providing an accurate answer to their question.

Can ChatPDF handle document types other than PDF?

The code shown here accepts only PDF uploads, but the same approach extends to other document types. LangChain offers document loaders for various types, including SQL and CSV. It utilizes the Unstructured Python package to transform different file formats, such as text, PowerPoint, images, and HTML, into text data for analysis.

What are the advantages of using ChatPDF?

  • Instant question-answering functionality across multiple documents.
  • Efficient handling of large documents through text splitting and indexing.
  • Accurate and meaningful responses powered by OpenAI embeddings and question-answering chains.
  • User-friendly interface provided by Streamlit for easy PDF uploading and query input.
  • Automation of document analysis, saving manual labor and enabling insights from large corpora of documents.
