Chatbot Basics - Concepts, Hallucinations, and RAG
The following is from a section of an article I started on my blog last week, currently titled Building Good Chatbots Part One, No-Code with Microsoft Copilot Studio and Azure AI Studio. The original article is already very long and I am working on more. By the time I am done it will be an e-book! Based on feedback from my good friend Jake Attis, I decided to publish smaller excerpts here on LinkedIn.
This part serves as an introduction to the subject of chatbots generally, starting with important background information and terminology about models, chatbots, prompts, hallucinations, fine-tuning, retrieval augmented generation, and context windows. By the end of it I hope you will understand these concepts and how they fit together.
Models versus copilots and chatbots
The terms model and chatbot are sometimes used interchangeably, but they aren’t the same thing. A chatbot is a system that usually includes one or more models to do its work. Aside from models, a typical chatbot system contains a user interface, services, and databases. Considering this, Copilot Studio is an appealing option for people who aren’t developers or data scientists to unlock the power of large language models for useful applications. In the rest of this article copilot and chatbot are synonymous, but model specifically and only means the model. Examples of chatbots include ChatGPT, Microsoft Copilot, and the copilots you build with Copilot Studio. Examples of models include GPT-3.5-Turbo, GPT-4, and Meta Llama 2.
This is an important distinction which will become clearer as we go.
Chatting with your data
You may recall that people were very impressed when OpenAI released ChatGPT on November 30, 2022. Collectively we spent the next several months trying to understand its uses, dangers, and limitations. One of these limitations is the tendency to tell convincing falsehoods, or what we call hallucinations. A hallucination is when a chatbot generates incorrect, irrelevant, or nonsensical responses. This phenomenon can be attributed to various factors: gaps or errors in the data the model was trained on, ambiguous or poorly worded prompts, and loss of context as the conversation grows longer than the chatbot tracks.
Note that some of these causes relate to the model, which is trained on a dataset, and others to the chatbot, which keeps track of the conversation.
Simply put, a model hallucinates when it doesn’t know the answer. We can improve things by including instructions in the prompt such as: “If you don’t know the answer to my question, say I don’t know instead of inventing an answer.”
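For illustration, here is a minimal sketch in Python of how that instruction might be included as a system message ahead of the user’s question. It assumes the openai Python package (version 1.x); the model name and the question are just examples.

# A minimal sketch: prepend an instruction that discourages invented answers.
# Assumes the openai Python package (v1.x); model name and question are examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system",
     "content": "If you don't know the answer to my question, "
                "say 'I don't know' instead of inventing an answer."},
    {"role": "user",
     "content": "What was our company's revenue in 2021?"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)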
Solving hallucinations with fine-tuning
Fine-tuning is a process where you take an existing model and train it on new data to make a new model. There is a large ecosystem around fine-tuning, especially for private AI applications using models like Meta Llama 2. People fine-tune models to add specialized knowledge or skills, but also to change a model’s personality and style. For some, the ultimate goal is to make models that are ever smaller and more capable, to enable good AI on commodity and even local hardware. To get an idea of the scale of these efforts, check out the Hugging Face page of Tom Jobbins, TheBloke. He provides a great service to the community by shrinking (quantizing) models to work on less expensive hardware. Currently there are over 2,500 language models on his page alone. They were created by large organizations and individual researchers and collectively have a few million downloads. Fine-tuning is a powerful approach to many problems, but it isn’t necessarily a good approach to “chat with your data” problems insofar as it is time consuming, expensive, and the result is a static model. If the facts or data change, you must repeat the process. An alternative and more common approach to solving hallucinations is prompt engineering combined with retrieval augmented generation.
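As a rough illustration of what running one of these quantized models locally looks like, here is a short sketch using the llama-cpp-python package. The GGUF file name is an assumption, standing in for whichever quantized Llama 2 build you download.

# A rough sketch of running a quantized Llama 2 model locally.
# Assumes the llama-cpp-python package and a GGUF file downloaded from
# Hugging Face; the file name below is an example, not a specific recommendation.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Q: What is retrieval augmented generation? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(output["choices"][0]["text"])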
Prompt engineering
A prompt is a message sent to the model to generate a response which completes the prompt. Models are static and unchanging, and they have no memory of previous questions, answers, or conversations. If you’ve heard the term prompt engineering in the context of ChatGPT, it refers to writing a good prompt in the ChatGPT UI, but in the context of chatbot systems it refers to everything the chatbot system does to build the real prompt sent to the model. When you enter your message and hit send, the chatbot system makes a new prompt that consists of your message, instructions to the system, the previous messages in the conversation, and whatever other facts or instructions the creator of the chatbot thinks are necessary to get a good response. In fact, the chatbot might even use the model to completely rewrite your question before sending the prompt to the model. Generally, all this work is hidden from you and all you see is the answer… which might be a hallucination. In this context, prompt engineering also involves managing the size of the prompts to fit the model’s context window.
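To make that concrete, here is a simplified Python sketch of the kind of assembly a chatbot system performs before it ever calls the model. The names (SYSTEM_INSTRUCTIONS, build_prompt, the Contoso example) are made up for illustration; real systems do far more.

# A simplified sketch of how a chatbot system might assemble the real prompt.
# Names and example content are illustrative only.
SYSTEM_INSTRUCTIONS = (
    "You are a helpful assistant for Contoso employees. "
    "If you don't know the answer, say 'I don't know'."
)

def build_prompt(history, user_message, max_history=6):
    # Keep only the most recent turns so the prompt fits the context window.
    recent = history[-max_history:]
    lines = [SYSTEM_INSTRUCTIONS, ""]
    for role, text in recent:
        lines.append(f"{role}: {text}")
    lines.append(f"user: {user_message}")
    lines.append("assistant:")
    return "\n".join(lines)

history = [("user", "What is our PTO policy?"),
           ("assistant", "Full-time employees accrue 15 days per year.")]
print(build_prompt(history, "Does that include sick leave?"))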
Solving hallucinations with retrieval augmented generation (RAG)
A RAG system is a type of chatbot that combines search, prompt engineering, and a model to ground the response in a set of facts provided on the fly. Simply put, RAG works by putting the facts required to answer the question into the prompt along with the question. Here is an example prompt from the most excellent Semantic Kernel project!
Answer questions only when you know the facts or the information is provided.
When you don't have sufficient information you reply with a list of commands to find the information needed.
When answering multiple questions, use a bullet point list.
Note: make sure single and double quotes are escaped using a backslash char.
[COMMANDS AVAILABLE]
- bing.search
[INFORMATION PROVIDED]
{{ $externalInformation }}
[EXAMPLE 1]
Question: what's the biggest lake in Italy?
Answer: Lake Garda, also known as Lago di Garda.
[EXAMPLE 2]
Question: what's the biggest lake in Italy? What's the smallest positive number?
Answer:
* Lake Garda, also known as Lago di Garda.
* The smallest positive number is 1.
[EXAMPLE 3]
Question: what's Ferrari stock price? Who is the current number one female tennis player in the world?
Answer:
{{ '{{' }} bing.search ""what\\'s Ferrari stock price?"" {{ '}}' }}.
{{ '{{' }} bing.search ""Who is the current number one female tennis player in the world?"" {{ '}}' }}.
[END OF EXAMPLES]
[TASK]
Question: {{ $input }}.
Answer:
The prompt has several placeholders which the chatbot replaces with appropriate content to try to answer the question. This one…
{{ $externalInformation }}
…is replaced with whatever content is retrieved from search to augment the generation of the answer. This part…
Answer questions only when you know the facts or the information is provided.
When you don't have sufficient information you reply with a list of commands to find the information needed.
…ensures that the answer is grounded in what the chatbot knows and the information provided, and (hopefully) prevents hallucinations.
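Putting the pieces together, a RAG chatbot’s request loop might look something like the following Python sketch. The search_index and call_model functions are hypothetical stand-ins for whatever search service and model API you actually use; only the shape of the flow matters here.

# A minimal RAG sketch. search_index() and call_model() are hypothetical
# stand-ins for your search service and your model API; they are not real
# library calls.
RAG_TEMPLATE = """Answer the question using only the information provided.
If the information is not sufficient, reply with "I don't know."

[INFORMATION PROVIDED]
{external_information}

[TASK]
Question: {question}
Answer:"""

def answer(question, search_index, call_model, top_k=3):
    # 1. Retrieve: find the passages most relevant to the question.
    passages = search_index(question, top_k=top_k)
    if not passages:
        return "Information not found."
    # 2. Augment: put the retrieved facts into the prompt.
    prompt = RAG_TEMPLATE.format(
        external_information="\n".join(passages),
        question=question,
    )
    # 3. Generate: let the model answer, grounded in the provided facts.
    return call_model(prompt)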
Grounding the answer in the retrieved data can almost completely remove some kinds of hallucinations from a chatbot because you can easily respond with “I don’t know” or “Information not found” if the retrieval doesn’t find any matches for the request when it does the search. This is an equally effective way to censor the chatbot because, as the chatbot creator, you can use this to prevent the chatbot from talking about any subject that isn’t in the search index. On the other hand, even with RAG, hallucinations can still occur for various reasons, including: the search returning irrelevant or incomplete results, source documents that are wrong or out of date, and relevant passages that don’t fit within the context window.
Simply put, a chatbot using RAG hallucinates when it doesn’t know the answer. The difference here is that the model doesn’t know the answer because the information provided by the chatbot wasn’t good enough, rather than because of the information on which the model was trained. The retrieval component of the chatbot is independent of the model and equally important! Congratulations on reading this far. We are almost ready to talk about Copilot Studio, but first a note on context management and the context window.
Context window
The context window is the amount of text, measured in tokens, that a model can handle in a single request: the prompt plus the response. When the limit is exceeded, you get errors. The context window size is perhaps the single most important constraint we face when building generative AI systems and is a key differentiator between models, driving both capability and cost. Consider the following from Microsoft:
For comparison purposes, the base GPT-3.5-Turbo model has a 4k context window, which can hold around six pages of text (question + answer). GPT-4 offers a version with a 32k context window, which can hold around forty-eight pages of text. Sounds great, except that the 32k context size costs forty times as much. What’s more, if you require capacity to run that model at scale, you must commit to spending five figures per month – it is possible to spend over $1 per query with GPT-4 32k! At the opposite end of the spectrum are small models you can run yourself. There has been great progress in expanding the context window in this area, but there are still many models with 2k context windows! Often, to get acceptable results from these small models, people will use a combination of fine-tuning and RAG.
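If you want to see how close a prompt is to a model’s limit, a token-counting check is a useful habit. Here is a small sketch using the tiktoken package, assuming the base 4k GPT-3.5-Turbo model; the amount reserved for the answer is an arbitrary example.

# Count the tokens in a prompt with tiktoken and check it against a limit.
# The 4,096-token limit matches the base GPT-3.5-Turbo model; the reserve
# for the answer is an arbitrary example value.
import tiktoken

CONTEXT_WINDOW = 4096
RESERVED_FOR_ANSWER = 500  # leave room for the model's response

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Summarize the attached meeting notes..."  # your assembled prompt
prompt_tokens = len(encoding.encode(prompt))

if prompt_tokens > CONTEXT_WINDOW - RESERVED_FOR_ANSWER:
    print(f"Prompt is too long ({prompt_tokens} tokens); trim history or retrieved text.")
else:
    print(f"Prompt uses {prompt_tokens} of {CONTEXT_WINDOW} tokens.")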
Thanks for reading. I hope you learned something, and feedback is always appreciated. In the next article I'll apply all of this to compare and contrast what you can do in Microsoft Copilot Studio and Azure AI Studio.
P.S. If you need help with AI, give me a shout on LinkedIn or send me an email!
--Doug Ware December 1, 2023