Measuring the Groundedness of LLM-Generated Responses Using LLMs

In traditional software development, applications are evaluated with metrics such as time complexity, space complexity, and cyclomatic complexity, along with testing methods such as unit and integration tests. These measures fall short when applied to natural-language responses from large language models (LLMs). LLM outputs are textual, non-deterministic, and capable of producing 'hallucinated' content, so they defy conventional measurement. This unpredictability makes it difficult to devise testing strategies that precisely gauge the quality of generative AI applications.

To address this challenge, new metrics have been devised for evaluating the quality of generative AI applications, one of which is 'Groundedness'. This metric measures the extent to which a model's generated response aligns with the information in the input source. It treats the statements in the response as claims and verifies them against the context from a user-defined ground-truth source, ensuring that the AI's output is not only coherent but also factually supported.

Using LLMs to Calculate 'Groundedness'

In this article, we explore a method of using an LLM to measure groundedness automatically. With a well-designed prompt, an LLM can score how grounded another model's response is, without manual review.

For this purpose, we use the Azure OpenAI GPT-3.5 Turbo model, setting the parameters as follows to assess groundedness (a minimal client sketch follows the list):

  • Temperature: 0
  • Top P: 0.95
  • Max Tokens: 100
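As a concrete starting point, here is a minimal sketch of that configuration using the Azure OpenAI Python SDK (openai>=1.0). The endpoint, API key, API version, and deployment name below are placeholders, not values from this article.

from openai import AzureOpenAI

# Placeholder resource details -- substitute your own endpoint, key, and version.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

def evaluate(messages):
    # Deterministic, short completions: temperature 0, top_p 0.95, max 100 tokens.
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # your Azure deployment name may differ
        messages=messages,
        temperature=0,
        top_p=0.95,
        max_tokens=100,
    )
    return response.choices[0].message.content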

Instructions:

  1. Direct the model to act as an 'Assistant'.
  2. Create a prompt that instructs the model to calculate groundedness from the provided question, answer, and context.
  3. Instruct the model to apply specific rating criteria and return its score in a fixed format.

In the example below, a metric scale from 1 to 5 is used, with '1' indicating low groundedness and '5' indicating high groundedness.

Prompt:

"""
system:
You are a helpful assistant.
user:
Your task is to check and measure whether the information in the 'assistant' response is grounded in the retrieved documents.
You will be given a 'user' question, an 'assistant' response, and the 'context' used by the chatbot to derive the answer.

To rate the groundedness of the 'assistant' response, you need to consider the following:
1. Read the 'user' question and 'assistant' response.
2. Read the 'context' document.
3. Determine whether the 'assistant' response is grounded in the 'context' document.
4. Rate the groundedness of the 'assistant' response on a scale of 1 to 5, where 1 is not grounded at all and 5 is completely grounded.
If the 'assistant' response draws on outside sources or makes a claim that is not supported by the 'context' document, rate it as 1.
If the 'assistant' response is directly supported by the 'context' document, rate it as 5. Be very strict in your rating.
5. Your answer should follow the format:
    <Score: [insert the score here]>    

# Question
{question}

# Answer
{answer}

# Context
{context}
"""        


Testing the prompt in the Azure OpenAI playground

  • [Screenshot: testing with a low-grounded response]

  • [Screenshot: testing with a grounded response]


The approach outlined here offers several advantages for teams building generative AI applications. It is significantly faster than manual validation. Implemented as code, it lets developers and testers evaluate the application against specific queries or whole datasets. The aggregate groundedness score can also serve as a quality gate in a DevOps pipeline, helping ensure that production systems maintain high factual accuracy (a sketch of such a gate follows).
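As an illustration of that pipeline idea, the hypothetical gate below runs the groundedness_score helper from the previous sketch over a small evaluation set and fails the build when the average score drops below a chosen threshold. The test cases and the 4.0 threshold are invented for this example.

# Hypothetical CI gate: score a dataset and enforce a minimum average.
TEST_CASES = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "context": "Paris is the capital and most populous city of France.",
    },
    # ... add (question, answer, context) triples from your evaluation set
]

def run_groundedness_gate(threshold=4.0):
    scores = []
    for case in TEST_CASES:
        score = groundedness_score(case["question"], case["answer"], case["context"])
        if score is not None:
            scores.append(score)
    average = sum(scores) / len(scores)
    print(f"Average groundedness: {average:.2f} over {len(scores)} cases")
    if average < threshold:
        raise SystemExit(1)  # non-zero exit fails the pipeline step

if __name__ == "__main__":
    run_groundedness_gate()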


Note: The prompt and examples in this article are for illustration only and are not production-grade. Exercise caution before adapting them for a production application.


References:

Monitoring evaluation metrics descriptions and use cases (preview) - Azure Machine Learning | Microsoft Learn
