How to create your own Chatbot using RAG
Nikhil Kohli
Engineering Manager@Amdocs | DevOps | Python | C++ | Project Leader. Learning, building and delivering products and projects.
Gen AI presents a unique technological puzzle. Typically, technologies arise from specific use cases, or let's say needs. Usually there is a problem, and humankind invents tools to solve it: engines for faster travel, computers and software for automating data management. Gen AI doesn't follow this pattern. This time the technology has been invented first, and now everyone is scrambling to find the most suitable use cases. Despite this, numerous promising use cases are emerging, and one particularly resonant application is AI assistants that streamline tasks and save valuable time. A prime example is the desire to create chatbots that operate on proprietary documents without needing to share those documents with external AI providers.
In this blog post, I will guide you through the process of creating a unique RAG (Retrieval Augmented Generation) chatbot. Unlike typical chatbots, this one is specifically designed for handling queries related to very specific topics or articles.
Understand the basics - LLM
I believe anyone involved in the tech world has come across the term LLM. With the rise of generative AI, “LLM” has turned into a key term for many developers, particularly those who are interested in or are currently working in the AI field. But what exactly is LLM?
Large Language Models (LLMs) form a specific category within the broader field of Natural Language Processing (NLP). These models specialize in generating text by analyzing and processing vast datasets. Their notable strength lies in their capacity to comprehend and generate language in a broad and versatile manner. LLMs use something called the transformer model. The transformer model is a neural network that learns context and semantic meaning in sequential data like text.
A well-known example of a chatbot using LLM technology is ChatGPT, which incorporates the GPT-3.5 and GPT-4 models.
A simplistic way to understand an LLM is to think of it as a machine built by feeding it all the documents available on the internet. This machine has ingested those documents and has learnt to generate human-like text when prompted with questions. Keep in mind that this machine knows nothing about the documents you have kept for yourself, nor does it keep feeding itself documents newly added to the internet.
For this blog, I'll be using OpenAI's GPT-3.5 Turbo model. This model has been trained on data available on the internet up to January 2022.
'gpt-3.5-turbo'
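To make this concrete, here is a minimal sketch (not part of the original walkthrough) that asks the bare gpt-3.5-turbo model a question about a private document it has never seen. It assumes the legacy openai Python client (pre-1.0), which the LangChain version used later in this post also builds on; the question text is made up for illustration.
# minimal sketch: asking the bare model about private data (legacy openai<1.0 client)
import openai
openai.api_key = "this is your own OpenAI key"
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{"role": "user", "content": "What does our internal leave policy say about carry-over days?"}],
    temperature=0.0
)
# without access to your private documents, the model can only guess or refuse
print(response["choices"][0]["message"]["content"])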
So, what's the problem with LLMs?
When it comes to Large Language Models (LLMs), two things can happen when they are asked about topics they know little about.
Firstly, the model may straightforwardly admit that it lacks information on a particular subject because it hasn’t been trained on that specific data.
Secondly, there’s the potential for what’s known as “hallucination”, where the model generates responses that are inaccurate or misleading due to its lack of specialized knowledge. This is because generic LLMs are not trained on detailed information in certain areas, such as specific legal rules or medical data, which typically fall outside the scope of their training data.
To overcome these problems, one method is to fine-tune the model itself (I'll skip that one for some other day). In this blog, I will focus on a simpler approach called RAG, or Retrieval-Augmented Generation.
Meet the savior - RAG
Retrieval-augmented generation (RAG) is an NLP model architecture that combines retrieval-based and generation-based approaches so that a model can extract information from specified documents. The language model uses user-specific data to pull the relevant information. RAG overcomes the limitations in generating contextually relevant and accurate responses by leveraging the benefits of retrieval mechanisms. This results in more informed and contextually appropriate responses.
In simpler words, using this technique we take user-specific documents, split them into chunks, and store those chunks. When a question is asked, the relevant chunks are retrieved, and the LLM's generic human-like text-generation abilities are applied on top of the retrieved chunks to produce an answer.
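To make the idea concrete before we meet the real tools, below is a toy sketch of that flow, with a naive word-overlap score standing in for the embeddings and vector store introduced later. The document, question and helper function are made up for illustration; the actual LangChain implementation appears further down.
# a toy sketch of the RAG flow, not the real implementation
document = ("Our refund policy allows returns within 30 days of purchase. "
            "Shipping is free for orders above 50 dollars. "
            "Support is available Monday to Friday, 9am to 5pm.")
# 1. split the document into chunks (here: one sentence per chunk)
chunks = [sentence.strip() for sentence in document.split(". ") if sentence]
# 2. retrieve the chunk most relevant to the question (naive word-overlap score,
#    standing in for the embedding-based search described later)
question = "What does the refund policy say about returns?"
def overlap(chunk, query):
    return len(set(chunk.lower().split()) & set(query.lower().split()))
best_chunk = max(chunks, key=lambda chunk: overlap(chunk, question))
# 3. augment the prompt with the retrieved chunk before sending it to the LLM
prompt = ("Answer only in the context below.\n"
          "Context: " + best_chunk + "\n"
          "Question: " + question)
print(prompt)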
More on this later; now is the time to get to know your tools.
Know your tools - LangChain
LangChain is an open-source framework available in Python and JavaScript that facilitates the integration of LLMs to develop LLM-powered applications. It links the LLM and external data sources and services. This opens up new opportunities for developers to perform context-aware NLP tasks by connecting to any source of data or knowledge in real time without having to build everything from scratch.
In simpler words, this is the toolkit that allows developers to interact with an LLM programmatically. Even more simply, think of the LLM as a well of knowledge, and think of LangChain as the rope, pulley and bucket together. One needs the pulley, rope and bucket to fetch water from the well. In the same way, LangChain provides predefined prompt templates for common operations, such as summarization and question answering, to help developers streamline and standardize the input to the language model.
The interaction with LangChain is centered around the concept of chains. Chains provide a mechanism to execute a sequence of calls to LLMs and tools through prompt templates. The tools refer to the functionalities that allow the LLMs to interact with the world, e.g., through an API call. This sequence of calls allows developers to harness the power of language models and efficiently integrate them into their applications.
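As a small illustration of a chain, here is a sketch of a single-step summarization chain built from a prompt template, assuming the same legacy LangChain imports used in the rest of this post; the prompt text and sample input are my own.
# a minimal single-step chain: prompt template + LLM
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
llm = ChatOpenAI(
    openai_api_key="this is your own OpenAI key",
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
summarize_prompt = PromptTemplate(
    input_variables=["text"],
    template="Summarize the following text in one sentence:\n{text}"
)
summarize_chain = LLMChain(llm=llm, prompt=summarize_prompt)
# the chain fills the template with the input and sends it to the LLM
print(summarize_chain.run(text="LangChain links LLMs to external data sources and tools."))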
Let's get going, hmm, before that grab a key
As mentioned before, I'll use OpenAI's language model gpt-3.5-turbo. To follow the code, you need an OpenAI API key. Please note, if you have already experimented with OpenAI's API and used up your free quota, or if your quota has expired, you may need to spend around $5-$10 to get a key.
To obtain your key, sign in at platform.openai.com, open the API keys section, and create a new secret key.
Meet the savior, again - RAG.
So, we know by now that the LangChain toolkit will help us interact with the LLM. Let's quickly see how it enables us to implement the RAG technique. LangChain provides a retrieval system through its document loaders, document transformers, text embedding models, vector stores, and retrievers.
Loading the document is pretty straightforward, as will be evident in the code below. The next step, transformation of the document, may include several operations like splitting, filtering, combining, translating to another language, and otherwise manipulating the data. For the purpose of this blog, I'll focus only on splitting.
When working with large documents, splitting the document into smaller pieces is often necessary. Broadly, the text splitters follow the steps below:
1. Split the text into small, semantically meaningful pieces (often sentences or paragraphs).
2. Combine these small pieces into larger chunks until a target chunk size is reached.
3. Once that size is reached, close the chunk and start a new one, keeping some overlap with the previous chunk so context is not lost at the boundaries.
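To see those steps in action, here is a small sketch using LangChain's RecursiveCharacterTextSplitter on a plain string. The sizes are deliberately tiny so the chunking and overlap are visible; the sample text is made up, and the real code below uses larger values.
# splitting a plain string into overlapping chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,    # maximum characters per chunk
    chunk_overlap=10  # characters shared between neighbouring chunks
)
text = ("Retrieval Augmented Generation combines retrieval with generation. "
        "The document is split into chunks, embedded, stored and retrieved on demand.")
for i, chunk in enumerate(splitter.split_text(text)):
    print(i, repr(chunk))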
In the code below, I'll explain how this is done. But before going there, let's talk about embeddings and semantic search.
Text Embeddings
In the beginning of this article, I explained that LLMs use the transformer model, a neural network that learns context and semantic meaning in sequential data like text. By that, I meant embeddings. Complicated?
Of course it is, so let's think of it in an easy way. Imagine you have a huge book with lots of words in it. Each word in that book is like a puzzle piece, carrying its own meaning. Now, think of text embedding as a way to turn each of those puzzle pieces (words) into numbers. These numbers don't just represent the word itself; they also hold information about how that word relates to other words around it.
For example, let's take the words "king," "queen," and "royal." In a real embedding each word maps to a long list of numbers, but for simplicity imagine each word gets a single number: say "king" is represented by 0.6, "queen" by 0.8, and "royal" by 0.7. The interesting part is that these numbers aren't random; they're chosen so that similar words have similar numbers. So, since "king" and "queen" are both royal titles, their numbers end up close together, like 0.6 and 0.8.
Now, when we want to understand a sentence or a whole book, we can look at the numbers representing each word and see how they fit together. If two numbers are close together, it means the words they represent are related in meaning. For example, if the numbers for "king" and "queen" are close to each other, we know the text is likely talking about royalty.
So, text embedding is like turning words into special numbers that carry their meanings and how they relate to other words. This helps computers understand and work with language in a more meaningful way, even if they don't understand words like we do.
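In practice an embedding is not a single number but a long vector of numbers. The sketch below, which assumes LangChain's OpenAIEmbeddings wrapper plus a hand-rolled cosine-similarity helper of my own, shows that related words such as "king" and "queen" end up closer to each other than to an unrelated word.
# embedding a few words and comparing them with cosine similarity
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key="this is your own OpenAI key")
king = embeddings.embed_query("king")
queen = embeddings.embed_query("queen")
banana = embeddings.embed_query("banana")
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)
# expect king/queen to score noticeably higher than king/banana
print("king vs queen :", cosine(king, queen))
print("king vs banana:", cosine(king, banana))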
So, to create our own bot, we need to split the document into chunks and then embed those chunks into something called a vector store. Oh, now what is this vector store?
Vector Stores
Vector stores, also known as vector databases or vector indexes, are specialized databases designed to efficiently store and retrieve high-dimensional vectors, such as text embeddings, image features, or other numerical representations of data. These databases are optimized for similarity search and retrieval operations, where the goal is to find vectors that are similar to a given query vector. For the purpose of creating our own bot, think of it as a database where the embeddings are stored. In our case, the vector database I'll use is "Chroma".
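Here is a small sketch of that idea with Chroma: a few short texts are embedded, stored, and then queried with a semantic search. The texts and the query are made up for illustration, and the imports assume the same legacy LangChain packages used in the rest of this post.
# storing a few texts in Chroma and running a similarity search
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings(openai_api_key="this is your own OpenAI key")
store = Chroma.from_texts(
    texts=[
        "Refunds are accepted within 30 days of purchase.",
        "Shipping is free for orders above 50 dollars.",
        "Support is available Monday to Friday."
    ],
    embedding=embeddings
)
# the most semantically similar text is returned, even without exact keyword matches
for doc in store.similarity_search("Can I return a product?", k=1):
    print(doc.page_content)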
Enough talking, time for action
Ok, now we know all the elements needed to create our own bot. Below is the code to do so. Remember to replace the placeholder API key with your own OpenAI key, point the loader at your own PDF, and install the required packages (langchain, openai, chromadb and pypdf) before running it.
# importing the modules
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores import Chroma
# defining the model
llm = ChatOpenAI(
    openai_api_key="this is your own OpenAI key that you can get by following the steps above",
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
# loading the document
loader = PyPDFLoader("./Path_to_your_document.pdf")
mypdf = loader.load()
# Defining the splitter
document_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=70
)
# splitting the document
docs = document_splitter.split_documents(mypdf)
# embedding the chunks to vectorstores
embeddings = OpenAIEmbeddings(openai_api_key="this is your own OpenAI key that you can get by following the steps above")
persist_directory = 'db'
my_database = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory=persist_directory
)
# defining the conversational memory
retaining_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)
# defining the retriever
question_answering = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=my_database.as_retriever(),
    memory=retaining_memory
)
# defining the loop for a conversation with the AI
while True:
    question = input("Enter your query: ")
    if question == 'exit':
        break
    # getting the response from the retrieval chain
    result = question_answering({"question": "Answer only in the context of the document provided. " + question})
    print(result['answer'])
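One small addition worth knowing: because we passed persist_directory='db', the embeddings can be saved to disk and reloaded later without re-embedding the whole PDF. The sketch below shows how that reload might look with the same legacy LangChain API; treat it as an assumption to verify against your installed versions, since persistence behaviour changed across Chroma releases.
# persisting the vector store and reloading it in a later session
my_database.persist()
reloaded_database = Chroma(
    persist_directory='db',
    embedding_function=embeddings
)
question_answering = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=reloaded_database.as_retriever(),
    memory=retaining_memory
)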
Congratulations! If you made it this far, you now know how to create your own chatbot.