Evaluation of Generative AI applications using Prompt Flow SDK

In my previous article, we saw a conventional way of evaluating generative AI applications by crafting custom prompts for individual evaluations.

This article walks through the process of evaluating responses from generative AI applications using the Prompt Flow SDK. The SDK includes an out-of-the-box evaluation module that assesses applications on metrics such as groundedness, relevance, and coherence.

To begin the development process, we need to follow the steps mentioned below.

1. Install the 'promptflow-evals' Python package (promptflow-evals · PyPI).

pip install promptflow-evals        

2. Before proceeding with the evaluation, we must define a dataset in .jsonl format on which the evaluation will be conducted.

{"question": "what is the capital city of France", "answer": "Paris is the capital city of France", "context": "Paris is the capital city of France", "ground_truth": "The capital and largest city of France is Paris. With an estimated population of over 2 million residents, Paris is not only a major financial and cultural hub but also renowned for its art and fashion"        

3. Import and initialize the following model configuration so that the package can use Azure OpenAI GPT models for the evaluations.

from promptflow.core import AzureOpenAIModelConfiguration

# initialize configuration
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="<azure-openai-endpoint>",        # azure openai service deployment URL
    api_key="<azure-openai-api-key>",                # azure openai service deployment key
    api_version="<azure-openai-api-version>",        # azure openai api version
    azure_deployment="<gpt-model-deployment-name>"   # azure openai gpt model deployment name
)
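
Hard-coding keys is best avoided; one option is to read the values from environment variables instead. A minimal sketch, assuming variable names such as AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_API_VERSION, and AZURE_OPENAI_DEPLOYMENT are set in the shell (the names are illustrative):

import os

from promptflow.core import AzureOpenAIModelConfiguration

# read the connection details from environment variables (names are illustrative)
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
)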

4. Next, define the evaluators and their input mappings. In the following example, we evaluate using the Groundedness, Relevance, and Coherence metrics.

# import required evaluators from the sdk
from promptflow.evals.evaluators import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator
)

evaluators = {}
evaluators_config = {}

# define groundedness evaluator and map its inputs to dataset columns
evaluators['groundedness'] = GroundednessEvaluator(model_config=model_config)
evaluators_config['groundedness'] = {
    "answer": "${data.answer}",
    "context": "${data.context}"
}

# define relevance evaluator and map its inputs to dataset columns
evaluators['relevance'] = RelevanceEvaluator(model_config=model_config)
evaluators_config['relevance'] = {
    "question": "${data.question}",
    "answer": "${data.answer}",
    "context": "${data.context}"
}

# define coherence evaluator and map its inputs to dataset columns
evaluators['coherence'] = CoherenceEvaluator(model_config=model_config)
evaluators_config['coherence'] = {
    "question": "${data.question}",
    "answer": "${data.answer}"
}
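
Each evaluator instance can also be invoked on its own, which is handy as a quick sanity check on a single record before running the full dataset. A minimal sketch using the sample record from step 2 (the printed result shape is indicative only):

# quick sanity check on a single record before running the full dataset
sample_score = evaluators['groundedness'](
    answer="Paris is the capital city of France",
    context="Paris is the capital city of France"
)
print(sample_score)  # e.g. {'gpt_groundedness': 5.0}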

When defining 'evaluators_config', make sure the column names in the .jsonl dataset match the mappings specified for each evaluator; for example, '${data.answer}' requires an 'answer' field in every record. A simple pre-flight check such as the one below catches mismatches early.
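
The sketch below assumes the 'input_source.jsonl' dataset and the mappings defined above:

import json

# fields referenced by the mappings above
required_fields = {"question", "answer", "context"}

# verify every record in the dataset contains the mapped fields
with open("input_source.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        missing = required_fields - record.keys()
        if missing:
            raise ValueError(f"Record {line_number} is missing fields: {missing}")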

5. After defining the evaluators and desired configurations, we continue with the evaluation as follows:

# import the evaluate function from the sdk
from promptflow.evals.evaluate import evaluate
from time import time

# path to the .jsonl dataset defined earlier
dataset_name = 'input_source.jsonl'

# pass the evaluators and corresponding configuration to the evaluate function
evaluation_results = evaluate(
    data=dataset_name,
    evaluation_name=f"{dataset_name}-{time()}",
    evaluator_config=evaluators_config,
    evaluators=evaluators
)

# retrieve the evaluation metrics
evaluated_metrics = evaluation_results['metrics']

# print the evaluation metrics
print(evaluated_metrics)

6. The output from the evaluators will look similar to the following:

"groundedness.gpt_groundedness": 5.0
"relevance.gpt_relevance": 5.0
"coherence.gpt_coherence": 3.0        

Evaluation metrics help development teams assess the quality of generative AI applications and, in turn, establish a quality benchmark for them.

Example:

Quality Criteria: Groundedness should not be less than 3.0

# quality threshold
threshold = 3.0

# compare the aggregate groundedness score against the threshold
if evaluated_metrics['groundedness.gpt_groundedness'] < threshold:
    raise Exception('Responses are not grounded')

As illustrated in the example, the aggregate groundedness score can be checked against a predetermined quality threshold, helping ensure that responses served by production systems remain well grounded.
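
The same idea extends to a gate across several metrics, for example as part of a CI pipeline. A sketch assuming the metric names shown in step 6 and illustrative threshold values:

# per-metric quality thresholds (values are illustrative)
thresholds = {
    "groundedness.gpt_groundedness": 3.0,
    "relevance.gpt_relevance": 3.0,
    "coherence.gpt_coherence": 3.0,
}

# collect any metric that falls below its threshold
failures = {
    metric: score
    for metric, score in evaluated_metrics.items()
    if metric in thresholds and score < thresholds[metric]
}

if failures:
    raise Exception(f"Quality gate failed: {failures}")

Raising the exception stops the pipeline, so quality regressions are caught before a release reaches production.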
