Measuring the Groundedness of LLM-Generated Responses Using LLMs
In traditional software development, applications are typically evaluated with measures such as time complexity, space complexity, and cyclomatic complexity, along with testing methods such as unit and integration tests. These measures fall short when applied to the natural language responses of large language models (LLMs). Because LLM outputs are free-form text, inherently unpredictable, and prone to 'hallucinated' content, they defy conventional measurement. This unpredictability makes it challenging to devise testing strategies that can precisely gauge the quality of generative AI applications.
To address this challenge and effectively evaluate the quality of generative AI applications, new metrics have been devised, one of which is 'Groundedness'. This metric evaluates the extent to which a model's generated response aligns with the information provided in the input source. It treats the response as a set of claims and verifies them against the context supplied by a user-defined ground-truth source, ensuring that the AI's output is not only coherent but also factually supported.
Using LLMs to Calculate 'Groundedness'
In this article, we explore a method of using a Large Language Model (LLM) itself to evaluate groundedness. With a well-designed 'Prompt', an LLM can independently score how well a response is grounded in the supplied context.
For this purpose, we use the Azure OpenAI 'GPT 3.5 Turbo' model with the prompt below to ascertain groundedness:
Instructions:
In the example below, a metric scale from 1 to 5 is used, with '1' indicating low groundedness and '5' indicating high groundedness.
Prompt:
"""
system:
You are a helpful assistant.
user:
Your task is to check and measure whether the information in the 'assistant' response is grounded in the retrieved documents.
You will be given a 'user' question, an 'assistant' response, and the 'context' used by the chatbot to derive the answer.
To rate the groundedness of the 'assistant' response, you need to consider the following:
1. Read the 'user' question and 'assistant' response.
2. Read the 'context' document.
3. Determine whether the 'assistant' response is grounded in the 'context' document.
4. Rate the groundedness of the 'assistant' response on a scale of 1 to 5, where 1 is not grounded at all and 5 is completely grounded.
If the 'assistant' response draws on outside sources or makes a claim that is not supported by the 'context' document, rate it as 1.
If the 'assistant' response is directly supported by the 'context' document, rate it as 5. Be very strict in your rating.
5. Your answer should follow the format:
<Score: [insert the score here]>
# Question
{question}
# Answer
{answer}
# Context
{context}
"""
Testing the prompt in Azure OpenAI playground
The approach outlined here offers several advantages for development teams working with generative AI applications. It significantly accelerates validation compared with manual review. Implemented as code in a compatible programming language, it lets developers and testing teams evaluate an application's behaviour against specific queries or datasets. Moreover, the overall groundedness score can act as a quality gate within the DevOps pipeline, as sketched below, helping ensure that production systems respond with high accuracy.
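As a rough illustration of such a gate, the snippet below reuses the hypothetical measure_groundedness helper from the earlier sketch, averages scores over a small evaluation set, and fails the run when the average drops below an assumed threshold; the evaluation data and threshold are placeholders, not recommended values.
Example (Python):
# Hypothetical evaluation set of (question, generated answer, retrieved context) triples.
EVAL_SET = [
    ("What is the refund window?",
     "Refunds are accepted within 30 days of purchase.",
     "Our policy allows returns within 30 days of purchase."),
    ("Do you ship internationally?",
     "Yes, we ship to over 100 countries.",
     "We currently ship only within the United States."),
]

MIN_AVERAGE_SCORE = 4.0  # assumed quality bar for the pipeline


def run_groundedness_gate() -> None:
    scores = [measure_groundedness(q, a, c) for q, a, c in EVAL_SET]
    average = sum(scores) / len(scores)
    print(f"Per-question scores: {scores}, average: {average:.2f}")
    if average < MIN_AVERAGE_SCORE:
        # A non-zero exit code fails the CI/CD stage that runs this script.
        raise SystemExit(f"Groundedness gate failed: {average:.2f} < {MIN_AVERAGE_SCORE}")


if __name__ == "__main__":
    run_groundedness_gate()

Running this script as a pipeline step turns the groundedness metric into a pass/fail signal without changing how the underlying application is built.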
Note: The example and prompt above are provided solely for the purposes of this article and are not production-grade. Exercise caution before adapting them for use in a production application.