Leveraging JSON Mode for Enhanced LLM Output
For some time now, llama.cpp has allowed users to constrain its output with a grammar, and JSON is one of the supported formats. Meanwhile, OpenAI has introduced JSON mode for its chat completion API, providing users with a similarly versatile output option.
Implementing JSON formatting is straightforward. For llama.cpp, pass the contents of the json.gbnf file, which contains the JSON grammar, as the grammar field of the API call. Here is an example in JavaScript.
import fs from 'fs';

// LLAMA_API_URL points to the llama.cpp server completion endpoint,
// e.g. http://127.0.0.1:8080/completion.
const method = 'POST';
const headers = {
  'Content-Type': 'application/json'
};

// Load the JSON grammar and send it along with the prompt.
const grammar = fs.readFileSync('json.gbnf', 'utf-8');
const body = JSON.stringify({ prompt, grammar });

const request = { method, headers, body };
const response = await fetch(LLAMA_API_URL, request);
const data = await response.json();
Similarly, for OpenAI, specify the response_format as { type: 'json_object' } when calling the API directly, such as:
// url is the chat completion endpoint, https://api.openai.com/v1/chat/completions,
// and model must be one that supports JSON mode, e.g. gpt-3.5-turbo-1106.
const response_format = { type: 'json_object' };
const response = await fetch(url, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${OPENAI_API_KEY}`
  },
  body: JSON.stringify({ messages, model, response_format })
});
const data = await response.json();
Note that, as of this writing, only certain OpenAI models, gpt-4-1106-preview and gpt-3.5-turbo-1106, support JSON output. Additionally, for OpenAI, the system prompt in the messages array must explicitly mention JSON, otherwise the API call fails.
For a zero-shot prompt, here's an example of a system message:
const SYSTEM_PROMPT = `You are a helpful assistant.
You answer the question from the user, politely and concisely in 13 words or less.
Always output a valid JSON as follows:
{
"answer": a string with the concise answer
}`;
Since the response is now JSON, parse it first and then destructure the result to pull out the answer.
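For example, here is a minimal sketch of that step, assuming the OpenAI response shape from the call above (with llama.cpp, the generated text is in data.content instead):

const content = data.choices[0].message.content;
const { answer } = JSON.parse(content);
console.log(answer);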
Chain of Thought
The previous example might seem to underutilize JSON's capabilities, considering the resulting object has only one key, answer.
Though true, this is a foundational setup. As the prompt and the expected output grow more complex, JSON starts to pay off. Consider the Chain of Thought pattern, which guides the LLM to break down a problem systematically. The prompt might look like this:
const SYSTEM_PROMPT = `You are a research assistant with access to Google search.
Given the conversation history and the inquiry from the user, your task is to use Google to search for the answer.
Think step by step. Always output a valid JSON as follows:
{
"thought": describe your thoughts about the inquiry,
"tool": the search engine to use (must be Google),
"input": a string with the important key phrases to search for,
"observation": a string with the concise result of the search
}`;
For instance, for a user asking about the population of Jakarta, the internal thought process involves:
{
"thought": "This is a question about population, I will use Google",
"tool": "Google",
"input": "population of Jakarta",
"observation": "The population of Jakarta is approximately 10.6 million people."
}
Once the output is obtained, destructure it and extract the observation part, the actual answer. The other fields remain useful if properly logged, especially in cases of failures or user-provided negative feedback.
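As a rough sketch, assuming output holds the raw JSON text returned by the model (output and reply are hypothetical names):

const { thought, tool, input, observation } = JSON.parse(output);
console.log({ thought, tool, input }); // worth logging for failures or negative feedback
const reply = observation; // the part presented to the user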
The same technique applies to Reason-Act, useful for invoking tools or functions, or assisting with the subsequent stage of RAG (Retrieval-Augmented Generation). If this is your first encounter with Reason-Act, refer to my previous article on LLM-based Chatbot Demo, where the complexity of string parsing is eliminated by using JSON format.
Question Answering
Speaking of RAG, once the relevant passages are retrieved, the LLM's task is to use them as a reference document to answer the question. To see this in action, refer to my previous article on Semantic Search for RAG (and inspect the relevant code if necessary).
A significant advantage of the JSON output format is the flexibility to tweak prompts during development or troubleshooting, enabling the dumping of additional information.
Consider a scenario where the question is "When was the solar system formed?" Using vector similarity, the retrieval process fetches three relevant paragraphs from the document archive:
The Solar System developed 4.6 billion years ago when a dense region of a molecular cloud collapsed, forming the Sun and a protoplanetary disc. All four terrestrial planets belong to the inner Solar System and have solid surfaces. Inversely, all four giant planets belong to the outer Solar System and do not have a definite surface, as they are mainly composed of gases and liquids.
The closest star to the Solar System, Proxima Centauri, is 4.25 ly away. The Solar System orbits the Galactic Center of the Milky Way galaxy, as part of its Orion Spur, at a distance of 26,000 ly.
The Solar System formed 4.568 billion years ago from the gravitational collapse of a region within a large molecular cloud. This initial cloud was likely several light-years across and probably birthed several stars. As is typical of molecular clouds, this one consisted mostly of hydrogen, with some helium, and small amounts of heavier elements fused by previous generations of stars.
Given these passages and the right prompt, the LLM should be able to answer the question. A common prompt may look like this:
const SYSTEM_PROMPT = `You are an expert in retrieving information.
You are given a question from a human and you have to answer it concisely in 23 words or less.
Use only the following reference document to evaluate the question and provide the answer.
Avoid using any external information or recalling from memory.
Reference Document:
{{all the relevant passages}}`
But, wait! With JSON mode in mind, tweak the prompt to include this extra instruction:
Always output a valid JSON formatted as follows:
{
"answer": a string representing the concise answer,
"citation": a string referring to sentence for the source of the answer
}
Now, instead of only the answer, the LLM will also produce the citation:
{
"answer": "4.568 billion years ago",
"citation": "The Solar System formed 4.568 billion years ago from the gravitational collapse of a region within a large molecular cloud."
}
Once again, destructure the response and present the answer to the user. Meanwhile, the citation can be used for evaluation metrics, surfaced to the user if they wish to see the source for themselves, or perhaps both!
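A minimal sketch of that, assuming content holds the raw JSON text returned by the model:

const { answer, citation } = JSON.parse(content);
console.log('Source:', citation); // keep it around for evaluation or on-demand display
// present answer to the user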
How about Streaming Response?
When it comes to streaming responses, almost all prominent chat interfaces, from ChatGPT to Bard to Copilot, offer responses "on the fly," with each word appearing progressively, creating a responsive and engaging user experience.
However, with JSON mode for the output, this seamless streaming won't work out of the box. Partial JSON is not a valid JSON structure; if fed into a JSON parser, the incomplete form will rightfully explode.
The most straightforward workaround is to accept that partial JSON is invalid and attempt to complete it manually. For instance, if the initial parsing fails, another attempt can be made after appending a double quote and a closing bracket to the response. Since the objects in the previous examples aren't deep, this works well for those schemas, except in the rare moment when the stream stops right at a key-value boundary. If parsing still fails, wait until more chunks are streamed by the LLM. Rinse and repeat!
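Here is a rough sketch of that idea, assuming a flat, single-level object like the schemas above; parseStreaming is a hypothetical helper that receives the text accumulated so far:

const parseStreaming = (partial) => {
  try {
    return JSON.parse(partial);
  } catch {
    try {
      // close the dangling string and the object, then try again
      return JSON.parse(partial + '"}');
    } catch {
      return null; // not enough chunks yet, wait for more
    }
  }
};

Call it again every time a new chunk arrives; whenever it returns a non-null object, the answer field collected so far can already be rendered to the user.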
Debugging Hitchhike
The advantage of object destructuring is that the consumer can simply ignore everything else. In the previous Chain of Thought case, where only the observation field is crucial, additional key-value pairs can be tucked into the object without breaking compatibility, since the existing shape stays intact.
As an illustration, an extra value recording the processing time is tucked into the response below. While mostly ignored by subsequent stages, it proves immensely valuable for troubleshooting.
const respond = async (history) => {
  const start = Date.now();
  const init = { role: 'system', content: SYSTEM_PROMPT };
  const messages = [].concat(init).concat(history);
  // llm() is assumed to wrap the chat completion call and return the raw JSON text
  const response = JSON.parse(await llm(messages));
  const time = Date.now() - start;
  return { ...response, time }; // hitchhike the elapsed time onto the response
}
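On the consuming side, a minimal sketch (assuming the Chain of Thought schema from earlier):

const { observation, time } = await respond(history);
console.log(`LLM round trip: ${time} ms`); // handy while troubleshooting
// only observation is presented to the user; the extra field is simply ignored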
I hope this short article convinces you to work with the JSON format extensively when using LLMs. Feedback is warmly welcomed, and I look forward to hearing about your LLM adventures!