Evaluation of Generative AI applications using Prompt Flow SDK

In my previous article, we saw a conventional way of evaluating generative AI applications by crafting custom prompts for individual evaluations.

This article walks through the process of evaluating responses from generative AI applications using the Prompt Flow SDK. The SDK includes an out-of-the-box evaluation module that assesses applications on metrics such as groundedness, relevance, and coherence.

To begin the development process, we need to follow the steps mentioned below.

1. Install the 'promptflow-evals' Python package (promptflow-evals · PyPI).

pip install promptflow-evals        

2. Before proceeding with the evaluation, we must define a dataset in .jsonl format on which the evaluation will be conducted.

{"question": "what is the capital city of France", "answer": "Paris is the capital city of France", "context": "Paris is the capital city of France", "ground_truth": "The capital and largest city of France is Paris. With an estimated population of over 2 million residents, Paris is not only a major financial and cultural hub but also renowned for its art and fashion"        

3. Import and initialize the following model configuration so that the package can use Azure OpenAI GPT models for the evaluations.

from promptflow.core import AzureOpenAIModelConfiguration

# initialize configuration
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="<azure-openai-endpoint>",        # azure openai service deployment URL
    api_key="<azure-openai-api-key>",                # azure openai service deployment key
    api_version="<azure-openai-api-version>",        # azure openai api version
    azure_deployment="<gpt-model-deployment-name>"   # azure openai gpt model deployment name
)
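
Hard-coding keys is best avoided; one option is to read the values from environment variables instead. A minimal sketch, assuming variable names such as AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_API_VERSION, and AZURE_OPENAI_DEPLOYMENT are set in the shell (the names are illustrative):

import os

from promptflow.core import AzureOpenAIModelConfiguration

# read the connection details from environment variables (names are illustrative)
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
)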

4. Next, define the evaluators and their input mappings. In the following example, we evaluate using the Groundedness, Relevance, and Coherence metrics.

# import required evaluators from the sdk
from promptflow.evals.evaluators import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator
)

evaluators = {}
evaluators_config = {}

# define groundedness evaluator and map its inputs to dataset columns
evaluators['groundedness'] = GroundednessEvaluator(model_config=model_config)
evaluators_config['groundedness'] = {
    "answer": "${data.answer}",
    "context": "${data.context}"
}

# define relevance evaluator and map its inputs to dataset columns
evaluators['relevance'] = RelevanceEvaluator(model_config=model_config)
evaluators_config['relevance'] = {
    "question": "${data.question}",
    "answer": "${data.answer}",
    "context": "${data.context}"
}

# define coherence evaluator and map its inputs to dataset columns
evaluators['coherence'] = CoherenceEvaluator(model_config=model_config)
evaluators_config['coherence'] = {
    "question": "${data.question}",
    "answer": "${data.answer}"
}
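
Each evaluator instance can also be invoked on its own, which is handy as a quick sanity check on a single record before running the full dataset. A minimal sketch using the sample record from step 2 (the printed result shape is indicative only):

# quick sanity check on a single record before running the full dataset
sample_score = evaluators['groundedness'](
    answer="Paris is the capital city of France",
    context="Paris is the capital city of France"
)
print(sample_score)  # e.g. {'gpt_groundedness': 5.0}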

When defining 'evaluators_config', make sure the column names in the .jsonl dataset match the mappings specified for each evaluator; for example, '${data.answer}' requires an 'answer' field in every record. A simple pre-flight check such as the one below catches mismatches early.
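
The sketch below assumes the 'input_source.jsonl' dataset and the mappings defined above:

import json

# fields referenced by the mappings above
required_fields = {"question", "answer", "context"}

# verify every record in the dataset contains the mapped fields
with open("input_source.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        missing = required_fields - record.keys()
        if missing:
            raise ValueError(f"Record {line_number} is missing fields: {missing}")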

5. After defining the evaluators and desired configurations, we continue with the evaluation as follows:

# import the evaluate function from the sdk
from promptflow.evals.evaluate import evaluate
from time import time

# path to the .jsonl dataset defined earlier
dataset_name = 'input_source.jsonl'

# pass the evaluators and corresponding configuration to the evaluate function
evaluation_results = evaluate(
    data=dataset_name,
    evaluation_name=f"{dataset_name}-{time()}",
    evaluator_config=evaluators_config,
    evaluators=evaluators
)

# retrieve the evaluation metrics
evaluated_metrics = evaluation_results['metrics']

# print the evaluation metrics
print(evaluated_metrics)

6. The output from the evaluators will look similar to the following:

"groundedness.gpt_groundedness": 5.0
"relevance.gpt_relevance": 5.0
"coherence.gpt_coherence": 3.0        

Evaluation metrics help development teams assess the quality of generative AI applications and, in turn, establish a quality benchmark for them.

Example:

Quality Criteria: Groundedness should not be less than 3.0

# quality threshold
threshold = 3.0

# compare the aggregate groundedness score against the threshold
if evaluated_metrics['groundedness.gpt_groundedness'] < threshold:
    raise Exception('Responses are not grounded')

As illustrated in the example, the aggregate groundedness score can be checked against a predetermined quality threshold, helping ensure that responses served by production systems remain well grounded.
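
The same idea extends to a gate across several metrics, for example as part of a CI pipeline. A sketch assuming the metric names shown in step 6 and illustrative threshold values:

# per-metric quality thresholds (values are illustrative)
thresholds = {
    "groundedness.gpt_groundedness": 3.0,
    "relevance.gpt_relevance": 3.0,
    "coherence.gpt_coherence": 3.0,
}

# collect any metric that falls below its threshold
failures = {
    metric: score
    for metric, score in evaluated_metrics.items()
    if metric in thresholds and score < thresholds[metric]
}

if failures:
    raise Exception(f"Quality gate failed: {failures}")

Raising the exception stops the pipeline, so quality regressions are caught before a release reaches production.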
