Phased Approach | How to Evaluate AI Output
Richard Skinner
CEO @ PhasedAI | Helping Enterprise Transform Operations with Generative AI
Welcome to Phased Approach, my weekly musings on using and implementing Generative AI in Business.
This week we will be discussing the difficult topic of evaluating Generative AI output in your business.
Many businesses have now got to the stage where they have run a proof of concept project using an LLM.
When I work with customers whose POC has shown some promising results, many reach a crucial juncture at this point: they start to ask themselves how it becomes a production-grade application. Generative AI has introduced a complex challenge that many companies find themselves struggling with: how do we test and verify AI output? Let's discuss.
How is this different from software testing?
At first glance, you might think that because this is just code, we can simply build tests as we would for a normal application.
Evaluating LLM outputs can be challenging because of their non-deterministic behaviour. Like any other machine learning model, LLMs can be biased and make mistakes, which is why it is important to evaluate them regularly and systematically.
Unlike traditional software, where the same inputs lead to predictable outputs, generative AI is non-deterministic. It's creative, it's unpredictable, and it's constantly pushing boundaries. This is a feature, not a bug: we want models that can be creative and can understand, but we need to keep them bounded in reality. This means our tried-and-true methods of software quality assurance are no longer enough on their own.
Think about it. In traditional software testing, we're looking for bugs, errors, and deviations from expected behaviour. We can write test cases, run automated scripts, and tick off boxes on a QA checklist. But with generative AI, we're dealing with a shape-shifter. The output isn't just right or wrong – it exists in a spectrum of appropriateness, creativity, and usefulness.
Suddenly, we're not just testing for functionality; we're evaluating nuance, context, and even creativity. It's like trying to grade a piece of art using a multiple-choice quiz. You can check if the paint is on the canvas, but how do you measure its impact, its relevance, its context?
This is the puzzle that's keeping development teams on their toes and sparking heated debates in boardrooms across the globe. How do we bring the rigorous standards of software QA to a technology that's designed to think outside the box?
What are we not testing?
To make this easier, let's first talk about what we are NOT testing...
Evaluation Pipeline
So what are we testing?
We are evaluating how the LLM interacts with your data, and whether it generates clear, concise, correct and understandable output when given your business data as input.
It is worth stating from the outset that this field is in its infancy. Even the experts working at the top Fortune 500 companies are only beginning to grapple with how this works. As Generative AI becomes more and more integral to our tech stacks, it is crucial for us to be able to measure and test quality, and to integrate this into both Continuous Integration / Continuous Deployment (CI/CD) pipelines and live audit.
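One simple way to wire an evaluation into a CI/CD pipeline is a gate that fails the build when the eval score falls below a minimum or drops too far from the last accepted baseline. The sketch below assumes the evaluation has already produced a single score between 0 and 1. The function name `check_eval_gate`, the 0.85 minimum and the 0.02 allowed drop are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reason: str


def check_eval_gate(
    score: float,
    baseline: float,
    min_score: float = 0.85,  # assumed absolute quality floor
    max_drop: float = 0.02,   # assumed tolerated regression vs. baseline
) -> GateResult:
    """Decide whether a candidate prompt or model change may ship."""
    if score < min_score:
        return GateResult(False, f"score {score:.2f} is below minimum {min_score:.2f}")
    if baseline - score > max_drop:
        return GateResult(
            False,
            f"score dropped {baseline - score:.2f} from baseline {baseline:.2f}",
        )
    return GateResult(True, "ok")


# Example: the new version scores above the floor but regresses too far.
result = check_eval_gate(score=0.88, baseline=0.93)
print(result.passed, "-", result.reason)

# In a CI job, a failed gate would typically end the step with a non-zero exit:
# raise SystemExit(0 if result.passed else 1)
```

The same function could be run on a schedule against sampled production traffic, which is one way to approach the live audit side.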
As this is a new field it is worth diving into some of the concepts and how these evaluations take place.
Gold Standard
To evaluate your model effectively, it helps to create a dataset of prompts that represent your key use cases, together with the expected output for each. This Gold Standard output is output that was either created by a human or manually reviewed by a human. The LLM is then run on the exact same inputs and its output is measured against the Gold Standard (the comparison itself can be done by an LLM). From this we can see how close the LLM gets to the Gold Standard with a specific prompt and a specific model.
Once the model generates responses to these prompts, you can compare them to previous evaluations or seek human review and rating. This can be a standard dataset that is used over and over again.
Example use: a prompt that uses an LLM to pull key financial metrics out of company annual report PDFs. You would have a sample of 100 documents with all the metrics extracted by humans. Imagine a scenario where you have this application working in production and a new version changes the model from GPT-4 to GPT-4o mini to save money. The eval dataset would hold the outputs for the 100 documents produced by humans, and then those produced by GPT-4. The test would run the same prompt over the same documents using the new model, automatically compare the new output against the old, and give a score.
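The regression check in this example can be sketched in a few lines of Python. This is a minimal illustration rather than a production harness: the field names (`revenue`, `net_income`), the 1% numeric tolerance and the `fake_new_model` stand-in are all assumptions made for the example, and a real pipeline would call the new model where the stand-in returns canned values.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def field_matches(expected: float, actual: float, rel_tol: float = 0.01) -> bool:
    """True when the extracted value is within rel_tol of the human value."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / abs(expected) <= rel_tol


def score_against_gold(
    documents: List[str],
    gold: List[Metrics],
    extract_metrics: Callable[[str], Metrics],
) -> float:
    """Run the extractor over every document and return the share of
    gold-standard fields it reproduced (0.0 to 1.0)."""
    total = 0
    correct = 0
    for doc, expected in zip(documents, gold):
        predicted = extract_metrics(doc)
        for field, expected_value in expected.items():
            total += 1
            # A field the model failed to return counts as a miss.
            if field in predicted and field_matches(expected_value, predicted[field]):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical stand-in for the new model; a real pipeline would call the LLM here.
def fake_new_model(doc: str) -> Metrics:
    canned = {
        "report_a": {"revenue": 1200.0, "net_income": 150.0},
        "report_b": {"revenue": 980.0},  # net_income was not extracted
    }
    return canned[doc]


documents = ["report_a", "report_b"]
gold = [
    {"revenue": 1200.0, "net_income": 150.0},  # extracted by a human
    {"revenue": 975.0, "net_income": 80.0},
]

score = score_against_gold(documents, gold, fake_new_model)
print(f"Gold-standard agreement: {score:.0%}")  # prints "Gold-standard agreement: 75%"
```

Scoring per field rather than per document makes it easier to see which metrics a cheaper model starts to miss, although a stricter per-document pass/fail is an equally reasonable choice.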
Metrics
Let's review some of the metrics that can be used in your Eval Pipeline. Please note that the names are taken from a framework called DeepEval, but the concepts are more universal.
G-Eval: This is a framework that uses an LLM to evaluate LLM output, using data like the above example. It uses chain-of-thought reasoning so that the LLM can score output against any criteria you define. This is a very versatile way of applying criteria to test outputs.
Summarisation: This again uses a separate LLM to determine whether your LLM application is generating factually correct summaries that include the necessary details from the original text. This is basically getting another LLM to check your application LLM's homework.
Answer Relevancy: When you are using a RAG pipeline in your application, the LLM can fail to give a fully relevant answer. In DeepEval, this metric checks the output against the user's input, the question that was actually asked, and gives a relevance score.
Faithfulness: This metric measures the quality of your RAG pipeline's generator by evaluating whether the actual output factually aligns with the contents of your retrieval context. In other words, it looks for hallucinations in Retrieval Augmented Generation. DeepEval's faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.
Context: This looks at the retrieval side of a RAG pipeline, for example inconsistency in how retrieved data is ranked. It can surface context issues with precision, relevance and recall.
Bias: This looks for output that might contain gender, racial or political bias. This is again one LLM checking another, with flags sent to humans in the loop. This would be particularly useful for chatbots dealing with customer support.
Toxicity: This checks for toxic responses in the output and acts as a guardrail against letting such responses through.
Compliance: This metric could refer to internal company AI guidelines, such as rules on the use of certain types of data, or it could check that the application output complies with legislation such as the EU AI Act.
User Feedback: This is when you allow the user to interact with the output and measure what they change or their negative reaction to the output.
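Most of the metrics above share one pattern: a second model is given the output (and, for RAG, the retrieved context) together with a criterion, and returns a score with a reason. The sketch below shows that pattern in plain Python. It does not use DeepEval's API: `Verdict`, `run_metric` and `stub_judge` are hypothetical names for illustration, the 0.7 threshold is an assumed value, and the stub uses a naive verbatim-sentence check where a real judge would prompt an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    score: float  # 0.0 (fails the criterion) to 1.0 (fully meets it)
    reason: str   # explanation returned alongside the score


# A judge takes (criterion, output, context) and returns a Verdict.
JudgeFn = Callable[[str, str, List[str]], Verdict]


@dataclass
class MetricResult:
    name: str
    verdict: Verdict
    passed: bool


def run_metric(
    name: str,
    criterion: str,
    output: str,
    context: List[str],
    judge: JudgeFn,
    threshold: float = 0.7,  # assumed pass mark
) -> MetricResult:
    """Ask the judge for a verdict and apply a pass/fail threshold."""
    verdict = judge(criterion, output, context)
    return MetricResult(name, verdict, verdict.score >= threshold)


def stub_judge(criterion: str, output: str, context: List[str]) -> Verdict:
    """Stand-in for an LLM judge: counts output sentences found verbatim in
    the context. A real judge would prompt a model with the criterion."""
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    joined = " ".join(context).lower()
    supported = [s for s in sentences if s.lower() in joined]
    score = len(supported) / len(sentences) if sentences else 0.0
    missing = len(sentences) - len(supported)
    return Verdict(score, f"{missing} of {len(sentences)} sentences not found in context")


context = ["Revenue grew 12% in 2023. Headcount was flat."]
output = "Revenue grew 12% in 2023. Profit doubled."

result = run_metric(
    name="faithfulness",
    criterion="Every claim in the output must be supported by the context.",
    output=output,
    context=context,
    judge=stub_judge,
)
print(result.name, result.passed, result.verdict.reason)
```

Because the judge is just a function passed in, the same `run_metric` call could serve a bias, toxicity or compliance check by swapping the criterion and the judge, and failed results could be routed to a human reviewer.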
How to Implement Eval Pipelines
As mentioned, this is all still quite new and will no doubt consolidate into more defined standards. Companies can get started using frameworks like DeepEval, MLflow and Deepchecks.
The drawback with all of these frameworks is that they are open-source libraries that need a lot of work and a deep understanding of how these kinds of evaluations work. Some of them have great tools that let you test different prompts against different models using the same data, but they are still quite difficult to use and do not cover things like compliance, LLM costs or other operational metrics.
To get started, your developers can download one of the frameworks for free and start learning. At Phased AI we work with many companies to advise on and implement these kinds of frameworks.