The Secret to Creating RAG Chatbots that Actually Work

From my experience building AI apps, there are really only two bottlenecks holding people back from creating outstanding RAG (Retrieval-Augmented Generation) chatbots.

So…

What are they and, most importantly, how do we fix them?

Let me save you the suspense and give you the answers right away — no need to read through 3 quick plugs of a paid newsletter and a NordVPN ad.

Although the best tips on ways to improve RAG models are near the end ;)

The two key factors, as you may already know, are:

  • Knowledgebase
  • Prompts

That’s it.

Getting the overarching flow of the app right isn’t too difficult. A simple workflow can do the job just fine. But the real magic happens with a carefully crafted prompt and a bulletproof knowledgebase.

How do we actually make sure they aren’t limiting our app?

We’ll dive into both, focusing especially on creating the best possible knowledgebase since resources on this specific topic are surprisingly scarce.

Stick with me here, and by the end of this article, you’ll have actionable insights to elevate your RAG chatbots to the next level. Let’s get started!


Choosing the right platform

First off, let’s get the basics out of the way.

Choosing the correct platform to host your database is crucial. So, what are our options?

  • Qdrant
  • Pinecone

If you’re feeling fancy, you might choose to use OpenAI’s Assistants API. I’ve been hearing some good things about it recently, although I haven’t personally tested it since the Assistants v1 API release.

The Assistants API is a bit of a black box: it handles some of these problems on its own, but that also means we can’t optimize those parts ourselves.

Personally, I prefer Qdrant, but feel free to experiment with all three options.
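If you go with Qdrant, getting a collection up and running takes only a few lines. Here is a minimal sketch with the qdrant-client library; the collection name, the 1536-dimension vector size (sized for an OpenAI embedding model, for example text-embedding-3-small) and the dummy embeddings are placeholders, not something prescribed by any particular setup:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# In-memory instance for quick experiments; point this at your server in production
client = QdrantClient(":memory:")

# Collection sized for a 1536-dimensional embedding model
client.create_collection(
    collection_name="products_kb",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert one chunk of knowledge with its embedding and payload
client.upsert(
    collection_name="products_kb",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 1536,  # replace with a real embedding of the chunk text
        payload={"text": "Sneakers ABC cost $19.99 and come in sizes 36-45."},
    )],
)

# Retrieve the chunks most similar to a query embedding
hits = client.search(
    collection_name="products_kb",
    query_vector=[0.0] * 1536,  # replace with the embedding of the search query
    limit=5,
)

The search call at the end is the retrieval step we’ll spend the rest of this article feeding with better queries and better vectors.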


The importance of the query

“You know what I like more than materialistic things? KNOWLEDGE.” - Tai Lopez, your favorite internet guru.

As you may know, a Knowledge Base (KB) works as follows:

We use a search query → to retrieve the ‘most similar’ vectors to said query.

So, vectors are retrieved based on their similarity to the query given.

And if the search query used is wrong, you will never retrieve the right documents, no matter how good the knowledgebase is.

This is especially important to take into account with conversational models since, oftentimes, the search query is not (at least fully) contained within the last user message.

For example:

  • User: What is the price of sneakers ABC?
  • Bot: The price of sneakers ABC is $19.99
  • User: and of sneakers XYZ?

Notice how the user doesn’t explicitly ask for the price? It’s a continuation of the conversation.

And the LLM behind the chatbot is usually fully aware of what the user is asking for.

But…

It may not be getting the context it needs.

You can’t (or at least shouldn’t) use the last user message as your KB search query, because it doesn’t contain the full context of the question.

Yet that’s exactly what lots of developers do.

But this way, it’s a matter of luck whether you retrieve the right context the LLM needs.

And we need to rely on luck as little as possible.

The most straightforward way of tackling this is to:

  • Set up a quick API call to GPT-3.5 Turbo (or an equivalent model), with few-shot prompting, to extract the right search query.

Here’s an example prompt for a shopping assistant:

(more on this later :)

# Role
You are the best {role} specialized in {topic} with a knack for {specialization}. The reputation of our company, to which you belong, rests entirely in your hands and your ability to correctly recognize user questions. Your role is essential in helping our users with {what it's doing for your users} something EXTREMELY important for them.

# Task
Generate a search phrase based on the user's last question in the conversation, considering the context provided by previous questions and answers. You must ensure that the search phrase is brief, precise, and suitable for obtaining relevant answers on the discussed topic. If the user's question does not require related information, respond with 'null'. Steps to follow:

1. Review the conversation history to understand the most recent question and the general context.
2. Identify keywords and specific details that are important for formulating an effective search phrase.
3. Create a search phrase to be used in a vectorized database.

# Specifics
- The search phrase must be precise and relevant to the last question asked by the user.
- You must consider all the context provided by the conversation to ensure the relevance of the phrase.

# Examples
### Ex. 1
Initial question: {second to last user question}
Initial answer: {last answer}
Most recent question: {last question}
Phrase: {enter desired}

### Ex. 2
Initial question: {second to last user question}
Initial answer: {last answer}
Most recent question: {last question}
Phrase: {enter desired}

### Ex. 3
Initial question: {second to last user question}
Initial answer: {last answer}
Most recent question: {last question}
Phrase: {enter desired}

# Notes
- You should NOT interact directly with the user or offer advice or direct answers to their questions. Your sole objective is to generate a precise search phrase.
——
Now, generate the correct search phrase using the context from the provided conversation history:

{conversation_context}
Phrase:        

A good example for our few-shot prompting would look like:

### Ex. 3
Initial question: what is the return policy for handbags?
Initial answer: the return policy for handbags allows returns within 30 days of purchase, provided the item is in its original condition...
Most recent question: what about for sneakers?
Phrase: return policy sneakers        
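Wiring this up in code is just one small API call before the retrieval step. Here is a minimal sketch with the openai Python library; the model choice and the condensed prompt below are my own placeholders standing in for the full few-shot prompt shown above:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Condensed version of the query-extraction prompt shown above
QUERY_EXTRACTION_PROMPT = (
    "Generate a brief, precise search phrase for a vectorized database based on the "
    "user's most recent question, using the conversation history for context. "
    "If no related information is needed, respond with 'null'.\n\n"
    "{conversation_context}\nPhrase:"
)

def extract_search_query(conversation_context: str) -> str | None:
    """Turn the recent conversation into a standalone KB search phrase."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": QUERY_EXTRACTION_PROMPT.format(conversation_context=conversation_context),
        }],
    )
    phrase = response.choices[0].message.content.strip()
    return None if phrase.lower() == "null" else phrase

# e.g. extract_search_query("User: What is the price of sneakers ABC?\n"
#                           "Bot: The price of sneakers ABC is $19.99\n"
#                           "User: and of sneakers XYZ?")
# should come back with something like "price sneakers XYZ"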

A whole bunch of vectors

Now let’s look into how to make sure we have the right knowledge stored on our database.

Because having the right KB isn’t just about getting a random PDF, splitting it up into chunks, and uploading it to your database.

There’s a bit more to it than that.

Chunking (breaking information down into smaller, more manageable pieces) is especially important. Vectors should be of similar size within a KB; otherwise, similarity scores will surface less relevant vectors.

Chunks shouldn’t be too big or too small. The ideal size can vary by use case, so experimenting is key here. Generally, 256 tokens per vector should get the job done.
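As a rough starting point, here is a sketch of naive token-based chunking using the tiktoken library. The 32-token overlap is my own assumption, and in practice you would usually prefer splitting on sentence or section boundaries:

import tiktoken

def chunk_text(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
    """Split text into similarly sized chunks of roughly max_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
        start += max_tokens - overlap  # small overlap so ideas aren't cut mid-thought
    return chunks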

Furthermore, since we already know that ‘Vectors are retrieved based on their similarity to the query given’, we need to consider two other things:

  • Overarching concepts (e.g., give me a list of all available dresses)
  • Specific concepts (e.g., what is dress XYZ made of?)

Users may ask for either of the two, and your knowledgebase needs to be prepared to handle both.

The issue that tends to pop up here is the following:

  1. User makes a generic query.
  2. App retrieves a limited number of vectors, each covering a specific topic.
  3. The topics covered by the retrieved vectors are limited.
  4. LLM does NOT have enough info to give a good answer.
  5. User gets a bad answer that doesn’t fully solve their question.

And it can happen the other way around too, with our KB being too generic.

An easy way to fix this is to think of the overarching concepts present in our KB (or go over past questions dealing with this issue), and then ask ChatGPT to create a document summarizing or giving an overarching view of each concept, which we then upload to our KB.

If you’ve got a bit more free time (or if you’ve got a lot of vectors to deal with), you can do what I did:

Make a custom-coded Python app that automatically identifies overarching topics, sends the relevant context and a good prompt to an LLM and, finally, watches it create general vectors on auto-pilot across thousands of different vectors.

Here is an overview of the workflow:

  1. Divide Vectors into Topics: Organize vectors into distinct topics based on their content, ensuring they are neither too broad nor too narrow.
  2. Extract Overarching Concepts: Send topic vectors to GPT-4o with a detailed prompt to identify overarching concepts. Retrieve and list these broader concepts.
  3. Display Concepts for Selection: Show the concepts in an indexed menu. Select the relevant concepts for creating new, general vectors.
  4. Generate Summaries for Selected Concepts: For each selected concept, prepare a prompt that includes the concept and related vectors from the topic, EXCLUDING all other selected concepts to avoid overlap. Send these prompts to GPT-4o to generate comprehensive summaries for each concept.
  5. Upload the new vectors to your KB: Upload the new summary vectors, tagging them appropriately for easy retrieval.

It’s a bit complex, but hopefully I explained myself well enough.
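To make steps 4 and 5 a bit more concrete, here is a minimal sketch of summarizing one overarching concept with GPT-4o and uploading the result as a new vector. The prompt wording, model names and collection name are assumptions, not the exact ones from my app:

import uuid
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

openai_client = OpenAI()
qdrant = QdrantClient(":memory:")  # point at your real KB in practice

def create_overview_vector(concept: str, related_chunks: list[str]) -> None:
    # Step 4: ask GPT-4o for a comprehensive summary of the concept
    prompt = (
        f"Write a concise overview of '{concept}' based ONLY on the excerpts below. "
        "It will be stored as a single knowledge-base entry for broad questions.\n\n"
        + "\n---\n".join(related_chunks)
    )
    summary = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Embed the summary so it can be retrieved like any other chunk
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=summary,
    ).data[0].embedding

    # Step 5: upload the new general vector, tagged for easy retrieval
    qdrant.upsert(
        collection_name="products_kb",
        points=[PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={"text": summary, "type": "overview", "concept": concept},
        )],
    )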


Testing, testing and more testing…

To refine your knowledgebase (KB) and ensure it delivers accurate responses, continuous testing is crucial.

Therefore, you should regularly analyze your bot’s message history to identify unsatisfactory responses.

When you find them, determine the root of the problem:

  • Is it the prompt?
  • Are the wrong vectors being retrieved?
  • Does the KB need more vectors?
  • Is the search query incorrect?

Once you find the issue, fix it and move on to the next.

Since it’s quite tedious, I developed a program specifically for this purpose. Here’s what it does:

  1. Add New Vectors: Easily integrate new data into your KB to keep it up-to-date.
  2. Search Sample Queries: Run test queries and see which vectors are retrieved first based on their similarity scores.
  3. View, Edit, or Remove Vectors Instantly: Access and modify the vectors returned by a search query on the spot. This feature is especially useful for fine-tuning your KB by directly addressing inaccuracies.

This program has been a game changer for me, and it’s relatively quick to code (around 250 lines).
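As a rough idea, the core of feature 2 can be as simple as embedding a test query and printing the top hits with their similarity scores. The collection and model names here are placeholders:

from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(":memory:")  # point at your real KB

def inspect_query(query: str, top_k: int = 5) -> None:
    """Print the vectors a sample query would retrieve, with their scores."""
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query,
    ).data[0].embedding

    hits = qdrant.search(collection_name="products_kb", query_vector=query_vector, limit=top_k)
    for rank, hit in enumerate(hits, start=1):
        snippet = (hit.payload or {}).get("text", "")[:120]
        print(f"{rank}. score={hit.score:.3f}  id={hit.id}  {snippet}")

inspect_query("return policy sneakers")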

Using this program, you can quickly identify and fix issues within your KB. If a query returns an unsatisfactory response, you can:

  • Identify the Problem: Understand why the wrong vectors were retrieved.
  • Make Adjustments: Edit or remove problematic vectors and add new ones as needed.
  • Improve Response Accuracy: Continuously refine your KB to improve the quality of responses.

For the sake of keeping this article as brief as possible, I won’t go much deeper into this tool.

But feel free to shoot me a DM on LinkedIn if you want me to send over the code, or if you need help setting it up :)


Yes mom, I’m an engineer, a prompt engineer.

Okay, but what about the prompt? Isn’t it SOO important as well?

Absolutely, but there are already a ton of resources out there on this topic. I’ll link some of my favorite ones here:

We’ll still do a brief overview though, so let’s dive in!

The Essentials

For a good prompt, we need to consider TWO main things:

  • Contents
  • Placement

People usually overlook the latter, and it can be incredibly important, especially for smaller and cheaper models such as GPT-3.5 Turbo.

Here’s an overview of what I’ve found works best for GPT models:

GPT-4o:

  • System → Instructions
  • Assistant → Context & instruction reminders (# Notes)

GPT-3.5 Turbo:

  • User → Instructions & Context

You can even mess around with it a bit. For example, write the Assistant prompt for GPT-4o in the first person:

“Here is some context that may be useful when crafting my response… I must be brief and concise…”

Or use the ‘Assistant’ role to give the instructions to GPT-3.5, to then send queries as the ‘User’.
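In code, that placement is just a matter of how you build the messages list. Here is a rough sketch where SYSTEM_PROMPT, context and user_query are placeholders you would fill in from your own app:

SYSTEM_PROMPT = "# Role\nYou are the best shopping assistant..."  # your full instructions
context = "Sneakers XYZ cost $24.99 and ship within 3 days."      # retrieved from the KB
user_query = "and of sneakers XYZ?"

# GPT-4o: instructions in 'system', context + reminders in an 'assistant' turn
gpt4o_messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant",
     "content": f"Here is some context that may be useful when crafting my response: [{context}]. "
                "I must be brief and concise."},
    {"role": "user", "content": user_query},
]

# GPT-3.5 Turbo: instructions and context together in the 'user' message
gpt35_messages = [
    {"role": "user",
     "content": f"{SYSTEM_PROMPT}\n\nContext: [{context}]\n\n{user_query}"},
]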

Key Elements for a Good Prompt

In most cases, it’s worth including:

  • Markdown formatting
  • Role-playing
  • Chain-of-thought
  • Few-shot prompting
  • Emotional prompting
  • Some final notes (# Notes)

Oh and remember to keep the prompt positive; negative instructions tend not to work as well.

If you wish, you can delve deeper into these elements on the resources provided.

Sample Prompt Structure for a RAG model with GPT-4o

System Prompt:

# Role
You are the best {role} specialized in {topic} with a knack for {specialization}. The reputation of our company, to which you belong, rests entirely in your hands and your ability to correctly answer user questions. Your role is essential in helping our users with {what it's doing for your users}, something EXTREMELY important for them.

# Task
{Enter task to achieve}. Steps to follow:

1. {First step}
2. {Second step}
3. {Third step...}

# Specifics
- Giving a correct answer is very important for our business because {reasons}
- Answers must be clear, concise and friendly.
{Enter further requirements}

# Examples
### Ex. 1
Q: {Sample query}
A: {Desired reply}

### Ex. 2
Q: {Sample query}
A: {Desired reply}

### Ex. 3
Q: {Sample query}
A: {Desired reply}        

Assistant Prompt:

Context that may be relevant for my answer: [{context}].

Notes to keep in mind:

- I must answer the user's question correctly, truthfully, precisely, and concisely.
- I include only relevant and precise information in the answers, focusing on the content WITHOUT EVER ADDING closing phrases or safety warnings.
{some more notes}        

Here’s what the overall structure would look like:

[Screenshot by author: the overall message structure]

As you can see, the conversation follows the structure of the examples given. We write ‘Q: {user query}’ and then just ‘A:’ to prompt the LLM to generate a response that mirrors the examples provided.

A small side note:

In this example, there’s just one ‘Q:’ and one ‘A:’, since I’ve found that chatbots often work best when we ‘normalize’ the user’s query and avoid including previous bot replies.

Here’s how it works:

  1. Collect Recent Messages: Take the last 2–3 messages from the user and the last 1–2 messages from the chatbot.
  2. Generate a Normalized Query: Use a cheaper LLM to create a ‘normalized’ user query that includes the actual question, considering the conversation history.
  3. Focus on the Current Context: Ensure this normalized query contains only the essential context without including past bot responses.

By normalizing the query this way, the chatbot sticks better to the provided format and avoids being influenced by previous responses. Plus, now it only needs to focus on the current user query, not the entire conversation history. This makes the output more predictable and consistent.
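Here is a minimal sketch of that normalization step. The model choice and the rewrite prompt are assumptions, and the result is what you would drop into the ‘Q: … A:’ slot shown above:

from openai import OpenAI

client = OpenAI()

def normalize_query(history: list[dict]) -> str:
    """Rewrite the user's last message as a standalone question using recent context."""
    recent = history[-5:]  # last 2-3 user messages and last 1-2 bot messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "Rewrite the user's last question as a single standalone question, "
                       "using the conversation only to fill in missing context:\n\n" + transcript,
        }],
    )
    return response.choices[0].message.content.strip()

history = [
    {"role": "user", "content": "What is the price of sneakers ABC?"},
    {"role": "assistant", "content": "The price of sneakers ABC is $19.99"},
    {"role": "user", "content": "and of sneakers XYZ?"},
]
final_user_turn = f"Q: {normalize_query(history)}\nA:"  # e.g. "Q: What is the price of sneakers XYZ?\nA:"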

Handling LLM Quirks

LLMs are smart, but also very dumb.

Sometimes it’s easier to hard-code something in, rather than getting an LLM to change something in its output.

For instance, in the sample prompt given, the LLM may include “A: ” in its response.

Yet convincing the model NOT to include “A: ” in the reply is unexpectedly difficult.

In this case, two lines of code would save us a lot of trouble:

# Remove "A:" from the beginning of the message if it exists
if response_message.startswith("A:"):
   response_message = response_message[2:].strip()        

The same goes for whenever you want to change how an LLM formats their replies.

For instance, all GPT models have it ingrained in them that they MUST use Markdown formatting.

And trying to change this is like swimming upstream.

Let’s say I’d like to implement my app on WhatsApp.

The messaging platform’s formatting is not exactly the same as Markdown.

For instance, they use a single asterisk (*) instead of two (**).

Yet another easy task for us with a bit of Python code, but almost impossible to achieve through prompting:

# Replace any double asterisks with a single asterisk in the response
try:
    response = response.replace('**', '*')
except Exception:
    print('Error while replacing ** with * in the response')

And… That’s a wrap!

I really appreciate you reading this far into the article.

Hope it was helpful!

If so, please let me know by connecting on LinkedIn.

As an Internet Connoisseur, I can always appreciate some cool internet points, so any likes are welcome.

Finally, if you are interested in learning how to properly implement AI into your business, I’m currently helping some businesses that qualify for free. Send me a DM on LinkedIn if you’re interested.

Hope you have an amazing rest of your day,

-Marcos
