Day 4: Building Multi-Document RAG and Packaging Using Streamlit

This is part of the series: 10 Days of Retrieval Augmented Generation

Before we start our fourth day, let us take a look at what we have discussed so far and what lies ahead in this 10-day series:

  1. Day 1: Introduction to Retrieval Augmented Generation
  2. Day 2: Understanding core components of RAG pipeline
  3. Day 3: Building our First RAG
  4. Day 4: Building Multi-Document RAG and Packaging Using Streamlit (this article)
  5. Day 5: Creating a RAG Assistant with Memory
  6. Day 6: Building complete RAG pipeline in Azure
  7. Day 7: Building complete RAG pipeline in AWS
  8. Day 8: Evaluating and benchmarking RAG systems
  9. Day 9: End to End Project 1 on RAG (Real World) with React JS frontend
  10. Day 10: End to End Project 2 on RAG (Real World) with React JS frontend


In the last article, we saw how to build a RAG pipeline using LangChain, OpenAI, and FAISS. In this article, we will package the application using Streamlit and build a UI for it. To make things more interesting, instead of indexing only one document, we will index multiple documents and ask questions across them.

Creating a Multi-Document RAG

In this section, we will create a RAG system that answers questions after analyzing multiple documents. Moving on from the cricket example of the last article, we will build a financial investing expert. Let us first download the required documents -

We will save these documents in a folder - fin_ed_docs. Now, let's load these documents using PyPDFLoader.

import os

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_process_pdfs(pdf_folder_path):
    documents = []
    for file in os.listdir(pdf_folder_path):
        if file.endswith('.pdf'):
            pdf_path = os.path.join(pdf_folder_path, file)
            loader = PyPDFLoader(pdf_path)
            documents.extend(loader.load())  # one Document per PDF page
    # Split the pages into overlapping chunks so they can be indexed later
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(documents)
    return splits

pdf_folder_path = "./fin_ed_docs"
splits = load_and_process_pdfs(pdf_folder_path)

The above code, once called, loads each document one by one, collects the pages in the documents list, and finally uses RecursiveCharacterTextSplitter to split them into chunks so that indexing can be done later. The process is the same as in the previous article; the only change is that we load multiple documents.
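
To sanity-check that every PDF was picked up, we can inspect the chunks and their source metadata. This is a small optional snippet; PyPDFLoader records the originating file in each chunk's metadata.

from collections import Counter

# Count how many chunks came from each source PDF
counts = Counter(doc.metadata["source"] for doc in splits)
print(f"Total chunks: {len(splits)}")
for source, count in counts.items():
    print(f"{source}: {count} chunks")

Next, we will create an index of these documents -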

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

def initialize_vectorstore(splits):
    # Embed each chunk with OpenAI embeddings and index them in FAISS
    return FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings(api_key="<YOUR API KEY>"))

vectorstore = initialize_vectorstore(splits)

The code above is exactly the same as in the previous article. Finally, the remaining steps for creating the prompt template and the chain also remain the same -

from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt_template = """You are a finance expert. You need to answer the question related to finance.
Given below is the context and question of the user.
context = {context}
question = {question}
"""

prompt = ChatPromptTemplate.from_template(prompt_template)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key="<YOUR API KEY>")

def format_docs(docs):
    # Concatenate the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieve relevant chunks, format them, fill the prompt, and parse the reply
rag_chain = (
    {"context": vectorstore.as_retriever() | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

As the last step, we will pass a query and get the result -

rag_chain.invoke("When should a person start investing?")        

We will get the following answer -

'A person should start investing as early as possible. The context mentioned that the sooner one starts investing, the better. By investing early, one allows their investments more time to grow and accumulate principal, interest, or dividends year after year. The three golden rules for all investors are to invest early, invest regularly, and invest for the long term. The recommended age group for starting to invest is 18 to 35, as this is the perfect time horizon for an investor. However, it is never too late to start investing, and individuals should start as soon as they are able to save money.'        
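
Since the answer is synthesized from several PDFs, it can be useful to verify which documents the retriever actually pulled from. Below is a small optional check; the retriever is a LangChain Runnable, so the exact call may vary slightly with your LangChain version.

# Inspect the chunks retrieved for the query and list their sources
retriever = vectorstore.as_retriever()
retrieved_docs = retriever.invoke("When should a person start investing?")
for doc in retrieved_docs:
    # 'source' and 'page' metadata are populated by PyPDFLoader
    print(doc.metadata.get("source"), "- page", doc.metadata.get("page"))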

Now, let's move on to packaging the application using Streamlit.

Packaging our Application using Streamlit

As Data Scientists or Machine Learning Engineers, we are responsible for building the backend that serves the responses of machine learning models. Displaying these responses, however, generally happens on the UI side. We expose our models as an API, and the frontend team connects to these APIs to show the results. Generally, React JS or Angular JS is used to build the frontend. In the last few articles of this series, we will look at how that is done.
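
To give a flavor of that workflow, here is a minimal sketch of how our chain could be wrapped in an API using FastAPI. The endpoint and model names are hypothetical and purely illustrative - it is not the approach we use in this article.

# Hypothetical FastAPI wrapper around rag_chain, for illustration only
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    # rag_chain is the chain we built in the previous section
    answer = rag_chain.invoke(query.question)
    return {"answer": answer}

A frontend built in React JS or Angular JS would then POST the user's question to the /ask endpoint and render the returned answer.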

Presenting the backend to the management team is generally a hard task for engineers, as they have to explain code that upper management may not understand. To make this kind of UI development fast, a framework named Streamlit was created. We would not use it for a production frontend, but for presentation purposes it works well. Let's package our application and build a quick UI to present our responses.

First let's give a title to our application.

import streamlit as st
st.title("Finance Expert")        

Next, a user needs to write the query. Let's create a text box for it.

user_input = st.text_input("Enter your question about finance:", "")        

Lastly, a button is required to submit the query, and once the response comes back, a section is required to display it.

if st.button("Submit"):
    try:
        response = rag_chain.invoke(user_input)
        st.write(response)
    except Exception as e:
        st.write(f"An error occurred: {e}")        

If we run this code now, we will see the UI. But before we do, we need to cover one important point about Streamlit.

The resources need to be loaded into Streamlit's cache; otherwise, everything gets reloaded every time we ask a query. In our case, the resources are the PDFs being loaded (4 in our case). Imagine asking a question and having all these PDFs read and indexed again each time - it would be very time consuming. To solve this issue we will use st.cache().

@st.cache(allow_output_mutation=True)
def load_and_process_pdfs(pdf_folder_path):
    ...
    return splits

@st.cache(allow_output_mutation=True)
def initialize_vectorstore(splits):
    ...        

Let's understand the parameter that we passed to st.cache():

  • allow_output_mutation=True: This parameter is passed when the output coming from a function may be mutated later. Even if it seems that the output will not change, it is better to keep this argument True when doing anything related to files.
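
One caveat: in recent Streamlit releases (1.18 and later), st.cache() is deprecated in favor of st.cache_data (for serializable results) and st.cache_resource (for heavyweight objects such as our FAISS vector store). If you are on a newer version, the equivalent would look roughly like this - a sketch assuming the same function bodies as above:

# Modern Streamlit caching (1.18+): st.cache_data for data,
# st.cache_resource for objects like the vector store
@st.cache_data
def load_and_process_pdfs(pdf_folder_path):
    ...
    return splits

@st.cache_resource
def initialize_vectorstore(_splits):
    # The leading underscore tells Streamlit not to hash this argument
    ...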

Now that we have written the complete code, let's run the app and see the output -

streamlit run test.py        

So, we have successfully created a multi-document RAG application with a UI built in Streamlit. In the next article, we will discuss creating a conversational bot with RAG built in. Given below is the complete code for your reference.

import streamlit as st
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
import os

# Cache the function to load and process PDF documents
@st.cache(allow_output_mutation=True)
def load_and_process_pdfs(pdf_folder_path):
    documents = []
    for file in os.listdir(pdf_folder_path):
        if file.endswith('.pdf'):
            pdf_path = os.path.join(pdf_folder_path, file)
            loader = PyPDFLoader(pdf_path)
            documents.extend(loader.load())
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(documents)
    return splits

# Cache the function to initialize the vector store with documents
@st.cache(allow_output_mutation=True)
def initialize_vectorstore(splits):
    return FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings(api_key="<YOUR API KEY>"))

pdf_folder_path = "./fin_ed_docs"
splits = load_and_process_pdfs(pdf_folder_path)
vectorstore = initialize_vectorstore(splits)

prompt_template = """You are a finance expert. You need to answer the question related to finance. 
Given below is the context and question of the user. Don't answer questions outside the context provided.
context = {context}
question = {question}
"""

prompt = ChatPromptTemplate.from_template(prompt_template)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key="<YOUR API KEY>")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": vectorstore.as_retriever() | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Streamlit app
st.title("Finance Expert")

user_input = st.text_input("Enter your question about finance:", "")

if st.button("Submit"):
    try:
        response = rag_chain.invoke(user_input)
        st.write(response)
    except Exception as e:
        st.write(f"An error occurred: {e}")        
