Building a Conversational Web Application for PDF Documents using Mistral-7B-v0.1
Kshitij Sharma
IEEE Member | CSI Member | AI & ML Engineer | Generative AI, LLMs, NLP, RAG, Computer Vision | Researcher & Developer | Conference Presenter | Open-Source Contributor | Building Intelligent Systems for Healthcare
In this article, we'll walk through the development of a web application that allows users to interact with PDF documents via a conversational AI interface. The application leverages modern AI tools and frameworks to process and query text extracted from multiple PDF files. We'll cover the design choices, key components, and code implementation.
Introduction
Our application gives users a simple way to interact with the information contained in PDF documents. Users upload PDFs and ask questions about the material, and the application uses a conversational AI model to deliver relevant answers. This approach benefits anyone who needs to extract information quickly from large or complex documents. We will build on a model from Mistral AI, a team developing large language models that are capable yet small enough to run on a desktop computer. In contrast to the restrictive practices of many large organizations toward NLP developers, Mistral AI makes its model weights freely available for download.
Technology Stack
- Python with Streamlit for the web interface
- PyPDF2 for extracting text from PDFs
- LangChain for text splitting, retrieval, and conversation chaining
- HuggingFace sentence-transformers (all-MiniLM-L6-v2) for embeddings
- FAISS for vector similarity search
- Mistral-7B-v0.1, accessed through the HuggingFace Hub, as the language model
Application Components
The application has five components, covered in the code overview below: PDF text extraction, text chunking, vector store creation, the conversational retrieval chain, and the Streamlit interface.
Code Overview
1. PDF Processing
The get_pdf_text function handles the extraction of text from PDF documents.
from PyPDF2 import PdfReader

def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # extract_text() can return None for image-only pages
            text += page.extract_text() or ""
    return text
2. Text Chunking
The get_text_chunks function splits the extracted text into smaller chunks for easier processing.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_text_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks
Using the RecursiveCharacterTextSplitter, we break the text into 1000-character chunks with a 200-character overlap. The overlap preserves context across chunk boundaries, which helps when managing large documents and improves retrieval accuracy.
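To make the chunk_size and chunk_overlap parameters concrete, here is a deliberately naive stand-in splitter (not the LangChain implementation, which also prefers to break on separators such as paragraph and sentence boundaries):

```python
def naive_split(text, chunk_size=1000, chunk_overlap=200):
    """Toy character splitter: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters."""
    chunks = []
    step = chunk_size - chunk_overlap  # 800 with the defaults above
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = naive_split("a" * 2500, chunk_size=1000, chunk_overlap=200)
# 4 chunks; the last 200 characters of each chunk repeat at the
# start of the next one.
```

The same overlap idea is what lets a sentence that straddles a chunk boundary still appear whole in at least one chunk.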
3. Vector Store Creation
The get_vectorstore function creates a vector store for efficient similarity search.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

def get_vectorstore(text_chunks):
    model_id = "sentence-transformers/all-MiniLM-L6-v2"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_id,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore
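Conceptually, the vector store embeds every chunk and, at query time, ranks chunks by similarity to the embedded question. FAISS does this with an optimized index over 384-dimensional MiniLM embeddings; the sketch below shows the underlying idea with tiny hypothetical 2-dimensional vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical chunk embeddings; a real store would hold one per text chunk.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top_k([1.0, 0.1], chunk_vecs, k=2)  # → [0, 2]
```

This is the same retrieval step that `vectorstore.as_retriever(search_kwargs={"k": 3})` performs later, returning the three best-matching chunks to the language model.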
4. Conversational AI
The get_conversation_chain function sets up the conversational AI using HuggingFace models.
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import HuggingFaceHub

def get_conversation_chain(vectorstore):
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    llm = HuggingFaceHub(
        repo_id="mistralai/Mistral-7B-v0.1",
        model_kwargs={"temperature": 0.5, "max_length": 512}
    )
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
        memory=memory
    )
    return conversation_chain
We use HuggingFaceHub to load the conversational AI model and set up the ConversationalRetrievalChain to handle user queries based on the indexed text.
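The ConversationBufferMemory is what makes the chain conversational rather than single-shot: it keeps an append-only history of human and AI turns that is fed back to the chain on every query, so follow-up questions can refer to earlier ones. A toy sketch of that behavior (not the LangChain class itself):

```python
class BufferMemory:
    """Minimal stand-in for ConversationBufferMemory: stores alternating
    human/AI turns and exposes them as chat_history."""
    def __init__(self):
        self.chat_history = []

    def save_context(self, question, answer):
        self.chat_history.append(("human", question))
        self.chat_history.append(("ai", answer))

mem = BufferMemory()
mem.save_context("What is the report about?", "Quarterly sales.")
mem.save_context("Who wrote it?", "The finance team.")
# mem.chat_history now holds 4 turns that a follow-up query can draw on.
```

With memory_key='chat_history' and return_messages=True, the real class exposes this history as message objects under st.session_state once the chain is stored there.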
5. Web Interface
The main function defines the web interface using Streamlit.
import streamlit as st
from htmlTemplates import css

def main():
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on 'Process'",
            accept_multiple_files=True
        )
        if st.button("Process"):
            with st.spinner("Processing"):
                st.write("Processing your documents...")
                raw_text = get_pdf_text(pdf_docs)
                text_chunks = get_text_chunks(raw_text)
                vectorstore = get_vectorstore(text_chunks)
                st.session_state.conversation = get_conversation_chain(vectorstore)
            st.success("Documents processed successfully!")

if __name__ == '__main__':
    main()
The Streamlit interface allows users to upload PDF files and ask questions. The uploaded PDFs are processed, and the conversation chain is set up to handle user interactions.
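main() calls a handle_userinput function that is not shown above. A minimal sketch of what it might look like, assuming the templates from htmlTemplates.py and the session state set up in main() (the render_message helper is our own addition, introduced here so the placeholder substitution is easy to see):

```python
def render_message(template: str, msg: str) -> str:
    """Fill the {{MSG}} placeholder in an HTML chat template."""
    return template.replace("{{MSG}}", msg)

def handle_userinput(user_question):
    # Imported lazily here so the sketch stays self-contained.
    import streamlit as st
    from htmlTemplates import bot_template, user_template

    # Run the query through the conversational retrieval chain.
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']

    # Even-indexed messages are the user's turns, odd-indexed are the bot's.
    for i, message in enumerate(st.session_state.chat_history):
        template = user_template if i % 2 == 0 else bot_template
        st.write(render_message(template, message.content), unsafe_allow_html=True)
```

This renders each turn of the conversation with the matching avatar and background color defined in the CSS below.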
Finally, you need to create an htmlTemplates.py file containing the CSS and chat templates used by the web application.
css = '''
<style>
.chat-message {
padding: 1.5rem; border-radius: 0.5rem; margin-bottom: 1rem; display: flex
}
.chat-message.user {
    background-color: #2b313e
}
.chat-message.bot {
    background-color: #475063
}
.chat-message .avatar {
width: 20%;
}
.chat-message .avatar img {
max-width: 78px;
max-height: 78px;
border-radius: 50%;
object-fit: cover;
}
.chat-message .message {
width: 80%;
padding: 0 1.5rem;
color: #fff;
}
'''
bot_template = '''
<div class="chat-message bot">
<div class="avatar">
<img src="https://i.ibb.co/jfWFfnk/chatbot-2-logo.jpg">
</div>
<div class="message">{{MSG}}</div>
</div>
'''
user_template = '''
<div class="chat-message user">
<div class="avatar">
<img src="https://i.ibb.co/kKfsbY0/human-chat-head.jpg">
</div>
<div class="message">{{MSG}}</div>
</div>
'''
Conclusions
This application demonstrates how to build a web-based conversational interface for querying information from PDF documents. By integrating various tools and technologies, we can provide a seamless experience for users to interact with document content meaningfully.