Measuring the Groundedness of LLM-Generated Responses Using LLMs

In traditional software development, applications are evaluated with metrics such as time complexity, space complexity, and cyclomatic complexity, along with testing methods such as unit and integration tests. These measures fall short when applied to natural-language responses from large language models (LLMs). LLM outputs are textual, non-deterministic, and capable of producing 'hallucinated' content, so they defy conventional measurement. This unpredictability makes it difficult to devise testing strategies that precisely gauge the quality of generative AI applications.

To address this challenge, new metrics have been devised for evaluating the quality of generative AI applications, one of which is 'Groundedness'. This metric measures the extent to which a model's generated response aligns with the information in the input source. It treats the statements in the response as claims and verifies them against the context from a user-defined ground-truth source, ensuring that the AI's output is not only coherent but also factually supported.

Using LLMs to Calculate 'Groundedness'

In this article, we explore a method of using an LLM to measure groundedness automatically. With a well-designed prompt, an LLM can score how grounded another model's response is, without manual review.

For this purpose, we use the Azure OpenAI GPT-3.5 Turbo model, setting the parameters as follows to assess groundedness (a minimal client sketch follows the list):

  • Temperature: 0
  • Top P: 0.95
  • Max Tokens: 100
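As a concrete starting point, here is a minimal sketch of that configuration using the Azure OpenAI Python SDK (openai>=1.0). The endpoint, API key, API version, and deployment name below are placeholders, not values from this article.

from openai import AzureOpenAI

# Placeholder resource details -- substitute your own endpoint, key, and version.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

def evaluate(messages):
    # Deterministic, short completions: temperature 0, top_p 0.95, max 100 tokens.
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # your Azure deployment name may differ
        messages=messages,
        temperature=0,
        top_p=0.95,
        max_tokens=100,
    )
    return response.choices[0].message.content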

Instructions:

  1. Direct the model to act as an 'Assistant'.
  2. Create a prompt that instructs the model to calculate groundedness from the provided question, answer, and context.
  3. Instruct the model to apply specific rating criteria and return its score in a fixed format.

In the example below, a metric scale from 1 to 5 is used, with '1' indicating low groundedness and '5' indicating high groundedness.

Prompt:

"""
system:
You are a helpful assistant.
user:
Your task is to check and measure whether the information in the 'assistant' response is grounded in the retrieved documents.
You will be given a 'user' question, an 'assistant' response, and the 'context' used by the chatbot to derive the answer.

To rate the groundedness of the 'assistant' response, you need to consider the following:
1. Read the 'user' question and 'assistant' response.
2. Read the 'context' document.
3. Determine whether the 'assistant' response is grounded in the 'context' document.
4. Rate the groundedness of the 'assistant' response on a scale of 1 to 5, where 1 is not grounded at all and 5 is completely grounded.
If the 'assistant' response draws on outside sources or makes a claim that is not supported by the 'context' document, rate it as 1.
If the 'assistant' response is directly supported by the 'context' document, rate it as 5. Be very strict in your rating.
5. Your answer should follow the format:
    <Score: [insert the score here]>    

# Question
{question}

# Answer
{answer}

# Context
{context}
"""        


Testing the prompt in the Azure OpenAI playground

  • [Screenshot: testing with a low-grounded response]

  • [Screenshot: testing with a grounded response]


The approach outlined here offers several advantages for teams building generative AI applications. It is significantly faster than manual validation. Implemented as code, it lets developers and testers evaluate the application against specific queries or whole datasets. The aggregate groundedness score can also serve as a quality gate in a DevOps pipeline, helping ensure that production systems maintain high factual accuracy (a sketch of such a gate follows).
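As an illustration of that pipeline idea, the hypothetical gate below runs the groundedness_score helper from the previous sketch over a small evaluation set and fails the build when the average score drops below a chosen threshold. The test cases and the 4.0 threshold are invented for this example.

# Hypothetical CI gate: score a dataset and enforce a minimum average.
TEST_CASES = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "context": "Paris is the capital and most populous city of France.",
    },
    # ... add (question, answer, context) triples from your evaluation set
]

def run_groundedness_gate(threshold=4.0):
    scores = []
    for case in TEST_CASES:
        score = groundedness_score(case["question"], case["answer"], case["context"])
        if score is not None:
            scores.append(score)
    average = sum(scores) / len(scores)
    print(f"Average groundedness: {average:.2f} over {len(scores)} cases")
    if average < threshold:
        raise SystemExit(1)  # non-zero exit fails the pipeline step

if __name__ == "__main__":
    run_groundedness_gate()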


Note: The prompt and examples in this article are for illustration only and are not production-grade. Exercise caution before adapting them for a production application.


References:

Monitoring evaluation metrics descriptions and use cases (preview) - Azure Machine Learning | Microsoft Learn
