Phased Approach | How do we Evaluate Generative AI?


ANNOUNCEMENTS

Some of you may have noticed an issue with our webinar signup. Apologies for the problem. If you are interested in learning about moving from POC to production, please sign up below.


Are you looking to move an experiment out of the POC zone?

Join us for a free web event on September 27th, aimed at companies that want to learn about the tools, techniques and frameworks needed to move Generative AI experiments safely into live, working applications. You will learn about:


  • Quality Assurance
  • Model Management
  • Data and regulatory governance


Places are limited, so reserve yours now!

Reserve your place


EVALUATION APPROACHES

Why and how do we change our approach to evaluating Generative AI Apps?

In this week's newsletter I wanted to keep the article short and answer a basic question: what sets the evaluation of Generative AI applications apart from the traditional automated testing methods used in software quality assurance?

Deterministic vs Non-Deterministic Output

Imagine for a second that you have a database containing two hundred customer names. You can write a basic SQL query to list the names, and that query will return the same names every time until a new name is added. This is deterministic output, and it is very easy to write an automated test for it.
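To make that concrete, here is a minimal sketch of a deterministic test in Python. The table, names and query are invented for illustration, but the point stands: the same query over the same data always returns the same rows, so an exact-equality assertion is all you need.

```python
# Minimal sketch of a deterministic test. The customers table and the names
# are illustrative assumptions, not taken from any real system.
import sqlite3

def list_customer_names(conn):
    # The same query over the same data returns the same rows every time.
    return [row[0] for row in conn.execute(
        "SELECT name FROM customers ORDER BY name")]

def test_list_customer_names():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT)")
    conn.executemany("INSERT INTO customers (name) VALUES (?)",
                     [("Ada Lovelace",), ("Alan Turing",)])
    # Exact equality is a perfectly good assertion for deterministic output.
    assert list_customer_names(conn) == ["Ada Lovelace", "Alan Turing"]

if __name__ == "__main__":
    test_list_customer_names()
    print("deterministic test passed")
```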

Now imagine you have a group of PDF documents and transcripts from help desk calls, and you prompt an LLM to extract all the names from these documents. Depending on how well you write the prompt and how capable the model is, you may get different results each time you run it.

In our example, a customer service representative might say "Joe Bloggs" as a generic example of a name when giving an instruction. Will an LLM understand that this is not an actual customer name?

This is non-deterministic output. It is really important to understand that the documents contain a set number of names, just like the database. There is a ground truth. But within that "ground truth" there is a subjective assessment to be made about the intent of the query: is Joe Bloggs a relevant name in the context of the conversation, or in the context of the query?

The LLM may interpret the query or the text in ways that produce different output, sometimes including Joe Bloggs as a name and sometimes excluding it as a generic example. There are also questions you might ask an LLM that have no ground truth at all, for instance the sentiment of a caller. A SQL query never has to grapple with what you meant by "a name". This is what makes generative AI so powerful, and yet so potentially challenging, in a business setting.
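One way to handle this, sketched below, is to stop asserting exact equality and instead score the LLM's output against the ground-truth list using precision and recall. The names and the extracted set here are invented for illustration, and the extraction step itself is assumed to happen elsewhere in your pipeline.

```python
# Sketch of scoring a non-deterministic extraction: rather than asserting
# exact equality, compare the LLM's output to a ground-truth list with
# precision and recall. `llm_extracted` stands in for whatever your
# extraction pipeline returns; all names are invented.
ground_truth = {"Jane Smith", "Raj Patel", "Maria Garcia"}

def score_extraction(extracted: set, truth: set) -> dict:
    true_positives = extracted & truth
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(truth) if truth else 0.0
    return {"precision": precision, "recall": recall}

# One run might include the generic "Joe Bloggs", another might not.
llm_extracted = {"Jane Smith", "Raj Patel", "Joe Bloggs"}
print(score_extraction(llm_extracted, ground_truth))
# -> {'precision': 0.666..., 'recall': 0.666...}
```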

This is a very simple example of why we need to completely re-think testing with Generative AI.

Complexity and Creativity

  1. Nature of Outputs: Generative AI models produce creative and often subjective outputs, such as text, images, or music, which are inherently different from the deterministic outputs of traditional software. Evaluating these outputs involves assessing qualities like creativity, coherence, and relevance, which are difficult to quantify using conventional software testing methods.
  2. Lack of Ground Truth: In many generative AI applications, there is no single correct answer or ground truth, making it challenging to apply traditional testing metrics. Instead, evaluations must consider multiple possible correct outputs and use qualitative assessments to gauge performance (a simple sketch of matching against several acceptable answers follows this list).
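Where several answers could all be acceptable, one simple approach is to check whether the model's output matches any of a list of reference answers after light normalisation. The reference answers below are invented examples; in practice you might also use semantic similarity or an LLM judge rather than string matching.

```python
# Minimal sketch, assuming you can enumerate several acceptable answers for a
# prompt: the evaluation passes if the output matches any of them after
# normalisation. The answers are invented examples.
def normalise(text: str) -> str:
    # Lowercase, collapse whitespace and drop a trailing full stop.
    return " ".join(text.lower().split()).rstrip(".")

def matches_any(output: str, acceptable: list) -> bool:
    return normalise(output) in {normalise(a) for a in acceptable}

acceptable_answers = [
    "The customer asked for a refund.",
    "The caller requested their money back.",
]
print(matches_any("the caller requested their money back",
                  acceptable_answers))  # -> True
```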


Evaluation Techniques

There are methods emerging that help us navigate these waters. Interestingly, part of the answer is to use other AI models to evaluate quality.

  1. AI-Assisted Evaluation: Advanced AI models, such as large language models (LLMs), can evaluate generative AI outputs by scoring them against predefined criteria. This approach leverages the capabilities of powerful models to provide consistent, detailed, and contextually aware evaluations, which are crucial for capturing the nuanced aspects of generative outputs. For example, metrics like "groundedness" or "faithfulness" can assess how well the AI-generated content aligns with the source data, ensuring factual accuracy in applications like information retrieval or content summarisation (a sketch of this approach follows the list below).
  2. Multi-Metric and Adaptive Approaches: Evaluating generative AI models often involves combining quantitative and qualitative metrics. This includes traditional metrics like accuracy and diversity alongside qualitative methods such as user surveys and expert reviews. Adaptive evaluation frameworks can adjust metrics based on the specific context and goals of each AI application.
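As an illustration of the "LLM as judge" idea in point 1, here is a minimal sketch of a groundedness check. The call_llm function is a hypothetical stand-in for whichever model client you use, and the prompt wording and 1-5 scale are illustrative choices rather than a standard.

```python
# Sketch of AI-assisted evaluation: an "LLM as judge" scores groundedness on
# a fixed scale so results can be aggregated. `call_llm(prompt) -> str` is a
# hypothetical stand-in for your model provider's client, not a real API.
JUDGE_PROMPT = """You are evaluating a summary against its source document.
Score how well the summary is grounded in the source on a scale of 1-5,
where 5 means every claim is supported by the source.
Return only the number.

Source:
{source}

Summary:
{summary}
"""

def groundedness_score(source: str, summary: str, call_llm) -> int:
    reply = call_llm(JUDGE_PROMPT.format(source=source, summary=summary))
    try:
        # Clamp to the expected 1-5 range.
        return max(1, min(5, int(reply.strip())))
    except ValueError:
        # Judges can misbehave too; surface unparseable replies explicitly.
        raise RuntimeError(f"Judge returned an unexpected reply: {reply!r}")

# Example wiring with a fake judge, for demonstration only:
if __name__ == "__main__":
    fake_judge = lambda prompt: "4"
    print(groundedness_score("The order shipped on Monday.",
                             "The order was shipped at the start of the week.",
                             fake_judge))  # -> 4
```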

Challenges and Considerations

  1. Ethical and Safety Concerns: Generative AI systems can perpetuate biases or generate inappropriate content. Evaluating these systems requires metrics that can assess ethical considerations and safety risks, ensuring that the models produce outputs that are fair and unbiased.
  2. Human Evaluation and Low-Code Testing: Despite advancements in automated evaluation methods, human judgment remains crucial for assessing aspects that are difficult to capture with automated metrics. Additionally, the rise of low-code testing tools democratises the testing process, enabling even non-programmers to contribute to the evaluation of AI models. This inclusive approach not only improves software quality but also fosters a culture of shared ownership across development teams.
  3. Continuous Monitoring and Adaptation: As generative AI models are deployed, continuous monitoring is essential to ensure they remain effective and unbiased. This involves regular evaluations to detect and mitigate biases and optimise performance, ensuring that the models adapt to changing data environments (a monitoring sketch follows this list).
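As a final illustration for point 3, here is a minimal sketch of continuous monitoring, assuming your application logs an evaluation score (such as the groundedness score above) for each request. The window size and threshold are illustrative assumptions, not recommendations.

```python
# Sketch of continuous monitoring: track a rolling average of per-request
# evaluation scores and flag when quality drifts below a threshold. The
# window size and threshold are illustrative choices only.
from collections import deque

class QualityMonitor:
    def __init__(self, window: int = 100, threshold: float = 4.0):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score: float) -> None:
        self.scores.append(score)

    def is_drifting(self) -> bool:
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alerting
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = QualityMonitor(window=50, threshold=4.0)
for score in [5, 4, 5, 3, 4]:
    monitor.record(score)
print(monitor.is_drifting())  # -> False: not enough data yet to judge drift
```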


The above is really just an introduction to this subject, but I think it is important to understand these concepts when using Generative AI output in business applications.

The interplay between creativity, subjective assessment, and the ethical considerations unique to AI systems demands innovative evaluation methods that go way beyond traditional software testing. By innovating with AI-assisted tools, fostering inclusive testing practices, and continuously refining our approaches, we can start to ensure that generative AI not only meets technical standards but also aligns with broader societal and business values. The future of AI testing lies in this balanced approach, where adaptability, human insight, and ethical integrity work together to shape reliable, fair, and effective AI solutions.


