Phased Approach | How do we Evaluate Generative AI?
Richard Skinner
CEO @ PhasedAI | Helping Enterprise Transform Operations with Generative AI
ANNOUNCEMENTS
Some of you may have noticed an issue with our Webinar signup. Apologies for the problem. If you are interested in learning about moving from POC to production please sign up below...
Are you looking to move an experiment out of the POC zone?
Join us for a free web event on September 27th for companies that want to learn about the tools, techniques and frameworks needed to move Generative AI experiments into working live applications safely.
Places are limited, reserve yours now!
EVALUATION APPROACHES
Why and how do we change our approach to evaluating Generative AI Apps?
In this week's newsletter I wanted to keep the article short and answer a basic question: what sets the evaluation of Generative AI applications apart from the traditional automated testing methods used in software quality assurance?
Deterministic vs. Non-Deterministic Output
Imagine for a second that you have a database with two hundred customer names. You can write a basic SQL query to list the names in the database. That query will return the same names every time until a new name is added. This is deterministic, and it is very easy to write an automated test for it.
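To make the deterministic case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and the names in it are invented for illustration; the point is simply that the same query over the same data always returns the same rows.

```python
import sqlite3

# Build an in-memory customer table with a few sample names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [("Alice Murphy",), ("Brian O'Neill",), ("Clara Walsh",)],
)

def list_names(conn):
    # A deterministic query: same data in, same rows out, every time.
    return [row[0] for row in conn.execute("SELECT name FROM customers ORDER BY name")]

first = list_names(conn)
second = list_names(conn)
assert first == second  # identical output until the underlying data changes
print(first)
```

An automated test for this is trivial: run the query twice and assert the results match a known expected list.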
Now imagine you have a group of PDF documents and transcripts from help desk calls, and you prompt an LLM to extract all the names from them. Depending on how well you write the prompt and how capable the model is, you may get different results each time you run it.
In our example, a customer service representative might say "Joe Bloggs" as a generic example of a name when giving an instruction. Will an LLM understand that this is not an actual name?
This is non-deterministic output. It is important to understand that the documents, just like the database, contain a set number of names. There is a ground truth. But within that ground truth there is a subjective assessment to be made about the intent of the query: is Joe Bloggs a relevant name in the context of the conversation, or in the context of the query?
The LLM may interpret the query or the text in ways that give different output, sometimes including Joe Bloggs as a name and sometimes excluding it as a generic example. There may be other questions you ask an LLM that have no ground truth at all, for instance the sentiment of a user on a call. A SQL query will never grapple with what you meant by "a name". This is what makes generative AI so powerful, and yet so potentially challenging, in a business setting.
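One practical way to evaluate a non-deterministic extraction like this is to score each run against a labelled ground truth using precision and recall. The sketch below assumes a hypothetical labelling exercise in which "Joe Bloggs" has been marked as a generic example and excluded from the true answer set; the other names are invented.

```python
def score_names(predicted, ground_truth):
    """Precision and recall of extracted names against a labelled ground truth."""
    pred, truth = set(predicted), set(ground_truth)
    tp = len(pred & truth)  # names the model got right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Labelled ground truth: the annotator excluded "Joe Bloggs" as a generic example.
truth = {"Mary Byrne", "Tom Nolan"}

# Two hypothetical LLM runs over the same transcript: one includes the
# generic name, one correctly leaves it out.
run_a = ["Mary Byrne", "Tom Nolan", "Joe Bloggs"]
run_b = ["Mary Byrne", "Tom Nolan"]

print(score_names(run_a, truth))  # precision drops below 1.0, recall stays at 1.0
print(score_names(run_b, truth))  # perfect precision and recall
```

Notice that the test no longer asks "is the output identical?" but "how close is the output to the labelled truth?", which tolerates run-to-run variation while still catching real errors.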
This is a very simple example of why we need to completely re-think testing with Generative AI.
Complexity and Creativity
Evaluation Techniques
There are methods emerging that help us navigate these waters. Interestingly, part of the answer is to use other AI models to evaluate quality.
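This "LLM as judge" pattern typically means sending a grading rubric, the source material, and the candidate output to a second model, then parsing a score out of its reply. The sketch below shows only the harness around such a judge; the rubric wording, the "SCORE:" reply format, and the canned reply standing in for a real model call are all assumptions for illustration, not any particular vendor's API.

```python
import re

RUBRIC = """You are grading a name-extraction result.
Did the model return exactly the relevant names, ignoring
generic examples like "Joe Bloggs"?
Reply with a line "SCORE: <1-5>" and a short justification."""

def build_judge_prompt(source_text, extracted_names):
    # Hypothetical prompt layout; adapt to whichever judge model you use.
    return f"{RUBRIC}\n\nSource:\n{source_text}\n\nExtracted: {extracted_names}"

def parse_judge_score(reply):
    """Pull the numeric score out of the judge model's free-text reply."""
    match = re.search(r"SCORE:\s*([1-5])", reply)
    return int(match.group(1)) if match else None

# A canned judge reply stands in for a real model call here.
reply = "SCORE: 4\nFound both real names but also included a generic example."
print(parse_judge_score(reply))  # 4
```

The judge itself is non-deterministic too, so in practice teams run it multiple times, spot-check its verdicts against human graders, and track score distributions rather than single numbers.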
Challenges and Considerations
The above is really just an introduction to this subject. I think it is really important to understand these concepts when using Generative AI output in business applications.
The interplay between creativity, subjective assessment, and the ethical considerations unique to AI systems demands innovative evaluation methods that go well beyond traditional software testing. By adopting AI-assisted tools, fostering inclusive testing practices, and continuously refining our approaches, we can start to ensure that generative AI not only meets technical standards but also aligns with broader societal and business values. The future of AI testing lies in this balanced approach, where adaptability, human insight, and ethical integrity work together to shape reliable, fair, and effective AI solutions.