Phased Approach | How do we Evaluate Generative AI?


ANNOUNCEMENTS

Some of you may have noticed an issue with our webinar signup. Apologies for the problem. If you are interested in learning about moving from POC to production, please sign up below.


Are you looking to move an experiment out of the POC zone?

Join us for a free web event on September 27th, aimed at companies that want to learn about the tools, techniques and frameworks needed to move Generative AI experiments safely into live, working applications. You will learn about:


  • Quality Assurance
  • Model Management
  • Data and regulatory governance


Places are limited, so reserve yours now!

Reserve your place


EVALUATION APPROACHES

Why and how do we change our approach to evaluating Generative AI Apps?

In this week's newsletter I wanted to keep the article short and answer a basic question: what sets the evaluation of Generative AI applications apart from the traditional automated testing methods used in software quality assurance?

Deterministic vs Non-Deterministic Output

Imagine for a second that you have a database containing two hundred customer names. You can write a basic SQL query to list the names, and that query will return the same names every time until a new name is added. This is deterministic output, and it is very easy to write an automated test for it.
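To make that concrete, here is a minimal sketch of a deterministic test in Python. The table, names and query are invented for illustration, but the point stands: the same query over the same data always returns the same rows, so an exact-equality assertion is all you need.

```python
# Minimal sketch of a deterministic test. The customers table and the names
# are illustrative assumptions, not taken from any real system.
import sqlite3

def list_customer_names(conn):
    # The same query over the same data returns the same rows every time.
    return [row[0] for row in conn.execute(
        "SELECT name FROM customers ORDER BY name")]

def test_list_customer_names():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT)")
    conn.executemany("INSERT INTO customers (name) VALUES (?)",
                     [("Ada Lovelace",), ("Alan Turing",)])
    # Exact equality is a perfectly good assertion for deterministic output.
    assert list_customer_names(conn) == ["Ada Lovelace", "Alan Turing"]

if __name__ == "__main__":
    test_list_customer_names()
    print("deterministic test passed")
```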

Now imagine you have a group of PDF documents and transcripts from help desk calls, and you prompt an LLM to extract all the names from these documents. Depending on how well you write the prompt and how capable the model is, you may get different results each time you run it.

In our example, a customer service representative might say "Joe Bloggs" as a generic example of a name when giving an instruction. Will an LLM understand that this is not an actual customer name?

This is non-deterministic output. It is really important to understand that the documents contain a set number of names, just like the database. There is a ground truth. But within that "ground truth" there is a subjective assessment to be made about the intent of the query: is Joe Bloggs a relevant name in the context of the conversation, or in the context of the query?

The LLM may interpret the query or the text in ways that produce different output, sometimes including Joe Bloggs as a name and sometimes excluding it as a generic example. There are also questions you might ask an LLM that have no ground truth at all, for instance the sentiment of a caller. A SQL query never has to grapple with what you meant by "a name". This is what makes generative AI so powerful, and yet so potentially challenging, in a business setting.
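One way to handle this, sketched below, is to stop asserting exact equality and instead score the LLM's output against the ground-truth list using precision and recall. The names and the extracted set here are invented for illustration, and the extraction step itself is assumed to happen elsewhere in your pipeline.

```python
# Sketch of scoring a non-deterministic extraction: rather than asserting
# exact equality, compare the LLM's output to a ground-truth list with
# precision and recall. `llm_extracted` stands in for whatever your
# extraction pipeline returns; all names are invented.
ground_truth = {"Jane Smith", "Raj Patel", "Maria Garcia"}

def score_extraction(extracted: set, truth: set) -> dict:
    true_positives = extracted & truth
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(truth) if truth else 0.0
    return {"precision": precision, "recall": recall}

# One run might include the generic "Joe Bloggs", another might not.
llm_extracted = {"Jane Smith", "Raj Patel", "Joe Bloggs"}
print(score_extraction(llm_extracted, ground_truth))
# -> {'precision': 0.666..., 'recall': 0.666...}
```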

This is a very simple example of why we need to completely re-think testing with Generative AI.

Complexity and Creativity

  1. Nature of Outputs: Generative AI models produce creative and often subjective outputs, such as text, images, or music, which are inherently different from the deterministic outputs of traditional software. Evaluating these outputs involves assessing qualities like creativity, coherence, and relevance, which are difficult to quantify using conventional software testing methods.
  2. Lack of Ground Truth: In many generative AI applications, there is no single correct answer or ground truth, making it challenging to apply traditional testing metrics. Instead, evaluations must consider multiple possible correct outputs and use qualitative assessments to gauge performance (a simple sketch of matching against several acceptable answers follows this list).
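Where several answers could all be acceptable, one simple approach is to check whether the model's output matches any of a list of reference answers after light normalisation. The reference answers below are invented examples; in practice you might also use semantic similarity or an LLM judge rather than string matching.

```python
# Minimal sketch, assuming you can enumerate several acceptable answers for a
# prompt: the evaluation passes if the output matches any of them after
# normalisation. The answers are invented examples.
def normalise(text: str) -> str:
    # Lowercase, collapse whitespace and drop a trailing full stop.
    return " ".join(text.lower().split()).rstrip(".")

def matches_any(output: str, acceptable: list) -> bool:
    return normalise(output) in {normalise(a) for a in acceptable}

acceptable_answers = [
    "The customer asked for a refund.",
    "The caller requested their money back.",
]
print(matches_any("the caller requested their money back",
                  acceptable_answers))  # -> True
```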


Evaluation Techniques

There are methods emerging that help us navigate these waters. Interestingly, part of the answer is to use other AI models to evaluate quality.

  1. AI-Assisted Evaluation: Advanced AI models, such as large language models (LLMs), can evaluate generative AI outputs by scoring them against predefined criteria. This approach leverages the capabilities of powerful models to provide consistent, detailed, and contextually aware evaluations, which are crucial for capturing the nuanced aspects of generative outputs. For example, metrics like "groundedness" or "faithfulness" can assess how well the AI-generated content aligns with the source data, ensuring factual accuracy in applications like information retrieval or content summarisation (a sketch of this approach follows the list below).
  2. Multi-Metric and Adaptive Approaches: Evaluating generative AI models often involves combining quantitative and qualitative metrics. This includes traditional metrics like accuracy and diversity alongside qualitative methods such as user surveys and expert reviews. Adaptive evaluation frameworks can adjust metrics based on the specific context and goals of each AI application.
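As an illustration of the "LLM as judge" idea in point 1, here is a minimal sketch of a groundedness check. The call_llm function is a hypothetical stand-in for whichever model client you use, and the prompt wording and 1-5 scale are illustrative choices rather than a standard.

```python
# Sketch of AI-assisted evaluation: an "LLM as judge" scores groundedness on
# a fixed scale so results can be aggregated. `call_llm(prompt) -> str` is a
# hypothetical stand-in for your model provider's client, not a real API.
JUDGE_PROMPT = """You are evaluating a summary against its source document.
Score how well the summary is grounded in the source on a scale of 1-5,
where 5 means every claim is supported by the source.
Return only the number.

Source:
{source}

Summary:
{summary}
"""

def groundedness_score(source: str, summary: str, call_llm) -> int:
    reply = call_llm(JUDGE_PROMPT.format(source=source, summary=summary))
    try:
        # Clamp to the expected 1-5 range.
        return max(1, min(5, int(reply.strip())))
    except ValueError:
        # Judges can misbehave too; surface unparseable replies explicitly.
        raise RuntimeError(f"Judge returned an unexpected reply: {reply!r}")

# Example wiring with a fake judge, for demonstration only:
if __name__ == "__main__":
    fake_judge = lambda prompt: "4"
    print(groundedness_score("The order shipped on Monday.",
                             "The order was shipped at the start of the week.",
                             fake_judge))  # -> 4
```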

Challenges and Considerations

  1. Ethical and Safety Concerns: Generative AI systems can perpetuate biases or generate inappropriate content. Evaluating these systems requires metrics that can assess ethical considerations and safety risks, ensuring that the models produce outputs that are fair and unbiased.
  2. Human Evaluation and Low-Code Testing: Despite advancements in automated evaluation methods, human judgment remains crucial for assessing aspects that are difficult to capture with automated metrics. Additionally, the rise of low-code testing tools democratises the testing process, enabling even non-programmers to contribute to the evaluation of AI models. This inclusive approach not only improves software quality but also fosters a culture of shared ownership across development teams.
  3. Continuous Monitoring and Adaptation: As generative AI models are deployed, continuous monitoring is essential to ensure they remain effective and unbiased. This involves regular evaluations to detect and mitigate biases and optimise performance, ensuring that the models adapt to changing data environments (a monitoring sketch follows this list).
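As a final illustration for point 3, here is a minimal sketch of continuous monitoring, assuming your application logs an evaluation score (such as the groundedness score above) for each request. The window size and threshold are illustrative assumptions, not recommendations.

```python
# Sketch of continuous monitoring: track a rolling average of per-request
# evaluation scores and flag when quality drifts below a threshold. The
# window size and threshold are illustrative choices only.
from collections import deque

class QualityMonitor:
    def __init__(self, window: int = 100, threshold: float = 4.0):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score: float) -> None:
        self.scores.append(score)

    def is_drifting(self) -> bool:
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alerting
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = QualityMonitor(window=50, threshold=4.0)
for score in [5, 4, 5, 3, 4]:
    monitor.record(score)
print(monitor.is_drifting())  # -> False: not enough data yet to judge drift
```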


The above is really just an introduction to this subject, but I think it is important to understand these concepts when using Generative AI output in business applications.

The interplay between creativity, subjective assessment, and the ethical considerations unique to AI systems demands innovative evaluation methods that go way beyond traditional software testing. By innovating with AI-assisted tools, fostering inclusive testing practices, and continuously refining our approaches, we can start to ensure that generative AI not only meets technical standards but also aligns with broader societal and business values. The future of AI testing lies in this balanced approach, where adaptability, human insight, and ethical integrity work together to shape reliable, fair, and effective AI solutions.


