How to Improve Your LLM Application
The internet is full of advice on using LLM chat applications like ChatGPT, but most of the utility of large language models will come from leveraging them inside applications tailored to specific purposes. This is a straightforward guide to building a great application that uses an LLM, based on what I've learned helping to create Copilot for Microsoft 365.
System Prompt
The first thing to think about is the system prompt you will use. This is the text your application adds in front of the user's text when you send it to the LLM. The famous example is, "You are a helpful assistant..." This is where you explain to the model what you want it to do. But you don't really need to tell current-generation models to be helpful anymore; that is already fine-tuned into their behavior. Instead, focus on how you want your app to be different from every other LLM app. Want it to be concise? Only output well-formatted XML? Avoid overused LLM words like "delve"? A perfect example is Anthropic Claude's system prompt.
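As a minimal sketch of where the system prompt sits in an API call, assuming the OpenAI Python SDK (the model name and the instructions here are placeholders, not recommendations):

```python
# A minimal sketch: the system prompt is just the first message in the
# list sent to the model. Model name and instructions are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an assistant embedded in an invoicing app. "
    "Be concise. Respond only with well-formatted XML. "
    "Never use the word 'delve'."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model works here
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize invoice #1234."},
    ],
)
print(response.choices[0].message.content)
```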
When your application is designed to handle a specific task, adding few-shot examples is critical to get consistent output. These can be part of the system prompt, or, if you're using OpenAI's chat completions API, you can supply the examples as a made-up conversation history. My recent project to have a 7B-parameter model correctly format email addresses was terrible until I gave it three examples of responses that included only the email address and no extra explanation.
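Here's a sketch of the made-up-history approach, loosely modeled on that email project (all the example addresses below are invented for illustration):

```python
# Few-shot examples supplied as a fabricated conversation history.
# Each user/assistant pair demonstrates the exact output format we want.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "Extract the email address from the "
                                  "text. Reply with the address only."},
    # Fabricated history (the addresses are invented examples):
    {"role": "user", "content": "Reach me at bob@example.com any time."},
    {"role": "assistant", "content": "bob@example.com"},
    {"role": "user", "content": "Contact: Jane Doe <jane.doe@example.org>"},
    {"role": "assistant", "content": "jane.doe@example.org"},
    {"role": "user", "content": "my address is SAM (at) example (dot) net"},
    {"role": "assistant", "content": "sam@example.net"},
    # The real user input comes last.
    {"role": "user", "content": "You can email sales@widgets.example.io."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages,
)
print(response.choices[0].message.content)  # expected: sales@widgets.example.io
```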
The power of RAG: incorporate Retrieval Augmented Generation (RAG) into your application. In its most basic form, your application first runs the user's prompt through a search engine, then includes the results in the system prompt. This helps the model by giving it access to more, or more current, information than is encoded in its weights during training. If you hear people talking about vector databases, this is where they are used. Vector databases are useful because they can add relevant information based on what the user means, instead of the keywords they have typed. You can also add RAG as a tool within internal reasoning, but I'll talk about that later in this article.
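The basic form looks something like the sketch below; `search_documents` is a hypothetical stand-in for whatever search engine or vector database you use:

```python
# Basic RAG: search first, then stuff the results into the system prompt.
# search_documents() is a hypothetical placeholder for your retrieval layer.
from openai import OpenAI

client = OpenAI()

def search_documents(query: str, top_k: int = 3) -> list[str]:
    """Placeholder retrieval call; replace with your search engine
    or vector database client."""
    raise NotImplementedError

def answer_with_rag(user_prompt: str) -> str:
    snippets = search_documents(user_prompt)
    context = "\n\n".join(snippets)
    system_prompt = (
        "Answer using only the sources below. "
        "If the sources don't contain the answer, say so.\n\n"
        f"Sources:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```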
The User Prompt
If you're making a chat-based application, your user decides what happens, because they typed the prompt. But there's still quite a bit in this area that you can improve.
The first and most common technique is including the conversation history in each call you make to the LLM. Remember that language models are stateless - they have perfect amnesia as soon as they finish responding. Many APIs and libraries build in basic conversation history, including Semantic Kernel and OpenAI's assistants API. In my opinion, adding conversation history was the biggest innovation of the original ChatGPT product on top of the GPT-3.5 API.
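If you're not using a library that handles this for you, a hand-rolled version is just a growing list of messages that you resend on every call (a minimal sketch, again assuming the OpenAI Python SDK):

```python
# Hand-rolled conversation history: resend the full message list each turn.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat_turn(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=history,
    )
    reply = response.choices[0].message.content
    # Store the assistant's reply so the next call has full context.
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat_turn("My name is Ada."))
print(chat_turn("What's my name?"))  # works only because history was resent
```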
"Memory" can mean a few different things, all to get around the problem of the conversation history getting too long, and thus too slow and expensive. There are at least a few approaches:
- Shorten the conversation history each turn, keeping only the important points (see the sketch after this list)
- Store the entire conversation externally, and use RAG (described above) to add only the relevant parts
- Give the LLM a tool to store important info itself, and another tool to retrieve it. This is what ChatGPT added recently. I'll come back to tool use later in this article.
- Use a model with a long context window, and then just ignore the problem. But be prepared to pay for the tokens!
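As a sketch of the first approach, you can ask the model itself to compress the history once it grows too long. The turn threshold and the summarization prompt below are arbitrary placeholders:

```python
# Sketch of approach 1: compress the history with an extra LLM call once
# it gets long. The cutoff and prompt wording are arbitrary placeholders.
from openai import OpenAI

client = OpenAI()
MAX_TURNS = 20  # arbitrary cutoff before we summarize

def compact_history(history: list[dict]) -> list[dict]:
    if len(history) <= MAX_TURNS:
        return history
    transcript = "\n".join(f"{m['role']}: {m['content']}"
                           for m in history[1:])  # skip the system prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, keeping only the "
                       "facts needed to continue it:\n\n" + transcript,
        }],
    )
    summary = response.choices[0].message.content
    # Keep the original system prompt; replace the rest with the summary.
    return [history[0],
            {"role": "system", "content": f"Conversation so far: {summary}"}]
```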
Finally, consider your entire product experience. A good conversation depends heavily on what your users type, so what can you do to encourage them to use good prompts? Many applications now include conversation starters, and Microsoft Copilot takes this a step further with Copilot Lab. Another intriguing idea is Bing's Deep Search, which rewrites the user prompt in a few different ways, then asks the user which one they want to use. You can try something similar by using an LLM to rewrite the prompt based on examples of great prompts, as sketched below.
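A rough sketch of that rewrite step, in the spirit of Deep Search; the instruction wording and the example prompts are placeholders you'd replace with ones curated for your application:

```python
# Rewriting the user's prompt into a few stronger candidates, then letting
# the user pick one. The "great prompt" examples are placeholders.
from openai import OpenAI

client = OpenAI()

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's request as three clearer, more specific prompts, "
    "one per line. Imitate the style of these examples:\n"
    "- Compare X and Y on cost, speed, and reliability, in a table.\n"
    "- Draft a two-paragraph email declining Z, with a warm tone."
)

def suggest_rewrites(user_prompt: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": user_prompt},
        ],
    )
    # Show these to the user and let them choose one to send.
    return response.choices[0].message.content.splitlines()
```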
Internal Reasoning
Internal reasoning is the real power of LLM applications. ChatGPT and other similarly capable LLM products can "think" to themselves before responding to the user. The key to enabling the LLM to reason and make a plan is to create a "scratch pad" for the LLM to write to first: you instruct the LLM to make a plan to answer the user, and then, with a second call, instruct it to execute the plan it generated. It's hard to overstate how impactful this is! One research paper calls this overall approach "plan-and-solve"; it is also sometimes called prompt chaining.
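A minimal two-call version might look like this (the prompt wording is my own illustration, not taken from the paper):

```python
# Minimal plan-and-solve: one call to write a plan on a "scratch pad",
# a second call to execute it. The plan is never shown to the user.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def plan_and_solve(user_prompt: str) -> str:
    # Call 1: produce a plan only.
    plan = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Write a short numbered plan "
                                          "for answering the user. Do not "
                                          "answer yet."},
            {"role": "user", "content": user_prompt},
        ],
    ).choices[0].message.content

    # Call 2: execute the plan the model just wrote.
    return client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Follow this plan step by "
                                          "step:\n" + plan},
            {"role": "user", "content": user_prompt},
        ],
    ).choices[0].message.content
```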
Within prompt chaining, there are several ways to ask the LLM to plan. You can ask for a sequential plan (chain of thought), a plan outline and then sub-plans (skeleton of thought), or a branching plan (tree of thought).
This works so well because it makes each task the LLM has to solve quite a bit smaller. Beware, however: plan-and-solve comes at a cost in latency and money!
Using tools requires this plan-and-solve approach. Tools are functions that the LLM knows how to "use." But the model cannot call a function itself; it only returns text to your application. Here's how it works: you put the function's signature into the system prompt with instructions about when to use it. The LLM's plan will include structured output naming the function and its parameters. Your application intercepts that part of the output and calls the function, and the results are given back to the LLM as part of the prompt in the next call. Tools are important for other capabilities I've mentioned in this article: memory and retrieval augmented generation (RAG).
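Modern APIs offer native function-calling support, but a hand-rolled sketch shows the underlying mechanics. Everything below is illustrative: the JSON convention and the get_weather tool are made up for the example.

```python
# Hand-rolled tool use: describe the function in the system prompt, ask
# for a JSON "tool call", intercept it, run the function ourselves, and
# feed the result back for the final answer.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You can call one tool: get_weather(city: str) -> str. To use it, "
    'reply with exactly {"tool": "get_weather", "city": "..."} and '
    "nothing else. Otherwise, answer the user directly."
)

def get_weather(city: str) -> str:
    """Stub tool; a real app would call a weather API here."""
    return f"Sunny and 22C in {city}"

def run(user_prompt: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt}]
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages,
    ).choices[0].message.content
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        call = None
    if not (isinstance(call, dict) and call.get("tool") == "get_weather"):
        return reply  # the model answered directly; no tool requested
    # Intercept the structured output and call the function ourselves...
    result = get_weather(call["city"])
    # ...then hand the result back to the model in the next call.
    messages += [{"role": "assistant", "content": reply},
                 {"role": "user", "content": f"Tool result: {result}"}]
    return client.chat.completions.create(
        model="gpt-4o-mini", messages=messages,
    ).choices[0].message.content
```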
Improving the reasoning and tool-use capabilities of language models is a major focus of research. We can expect LLMs to get better at this, and so now is the right time to develop your application to take advantage of it!
Synthesis
If you have an excellent system prompt, user prompt, and internal reasoning with tools, your application is likely to have all the data it needs to respond to the user (or otherwise produce output). But it hasn't responded yet! Before we get to that, let me describe the consensus/ensemble approach.
If you absolutely need a high-quality answer, no matter the dollar cost, you can perform all the previous steps multiple times. You can also vary the models or the tones in the system prompt, creating an "ensemble" of assistants. Then make yet another LLM call to choose the best response, or to combine the parts of the responses that reach consensus by appearing multiple times. This is particularly helpful to reduce hallucinations and ensure the response is grounded in your data sources.
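A rough sketch of the ensemble idea; the model names and the judging prompt are placeholders, and in practice you might vary system prompts or temperatures instead of models:

```python
# Ensemble sketch: generate several candidate answers, then a final
# "judge" call keeps only the claims the candidates agree on.
from openai import OpenAI

client = OpenAI()
CANDIDATE_MODELS = ["gpt-4o-mini", "gpt-4o"]  # placeholder model names

def ensemble_answer(user_prompt: str) -> str:
    candidates = []
    for model in CANDIDATE_MODELS:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": user_prompt}],
        ).choices[0].message.content
        candidates.append(reply)

    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}"
                           for i, c in enumerate(candidates))
    # Judge call: combine the candidates, preferring consensus claims.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Combine the candidates below into one answer, "
                       "keeping only claims that appear in more than one "
                       f"of them:\n\n{numbered}\n\nQuestion: {user_prompt}",
        }],
    ).choices[0].message.content
```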
The other reason to make a final LLM call before displaying to the user is that you can use the LLM to improve how the text is displayed. The simplest approach, and the one best supported by current language models, is to have the LLM produce Markdown-formatted text and have your client render the Markdown for display. But feel free to get fancier! Copilot for Microsoft 365 can display entire Adaptive Cards, which you can think of as small web pages defined by a pre-defined JSON template plus JSON data, typically the result of calling a tool. This is a lot more engaging than the plain text most applications use!
An important use of display in many applications is to show the user where external data came from, as a citation or reference. This is an important design point of applications based on LLMs: you should tell the user that the content is AI generated and that they should check the sources. Users can't do that unless you display those sources to them!
Review
I've described one pattern of an application that is centered around the power of large language models. In short:
- Design your application to help the user write a good prompt, or if possible, do it for them
- Call the LLM several times: first to develop a plan, then to execute steps including using tools, then to evaluate, and then to produce the final results. Experiment with different models for different functions!
- Be sure to consider how the results will be displayed, and how the LLM can help you do that.
- In a chat application, keep the conversation history or implement memory for the next turn.
This is not the only pattern for an LLM application; I hope it serves as a starting point for your own research and testing. If you understand and implement techniques like these well, your app won't be just another OpenAI wrapper! But don't think you can just code them up and forget about them.
In fact, it's important to continuously evaluate your application as it's used - against test data, against historical user queries, and against ongoing queries. You shouldn't be making any changes at all until you have robust testing, but that's for a future article. Thank you for reading!
Helpful links