As work on this continued, I simplified this article so I wouldn't have to keep maintaining it as the code changed. Here's the repo. The code is provided without any guarantees; use it at your own risk.
I used the DeepEval documentation, ChatGPT 4o, and Claude 3.5 Sonnet for help. The model used for semantic similarity is the sentence-transformers model paraphrase-MiniLM-L6-v2; the target LLM is gpt-3.5-turbo, accessed via the OpenAI API.
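For reference, here is a minimal setup sketch of the pieces named above. It assumes the 1.x openai Python SDK and the sentence-transformers package, with OPENAI_API_KEY set in the environment; the generate_response helper is my own illustration, not the exact code from the repo.

```python
# Minimal setup sketch (assumes OPENAI_API_KEY is set in the environment).
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Local embedding model used for semantic similarity scoring.
similarity_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Client for the target LLM, reached over the OpenAI API.
client = OpenAI()

def generate_response(prompt: str) -> str:
    """Send a single prompt to gpt-3.5-turbo and return its text reply."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```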
Iterative completion testing is a method used to evaluate the performance of large language models (LLMs) by repeatedly refining their responses through a series of iterations. This process helps to identify and address potential issues in the model's training.
How Iterative Completion Testing Works
- Define Test Cases: Create a set of test cases that represent the scenarios or prompts the model is expected to handle effectively. I've included a Gherkin-based template for writing test cases; however, it is not executable and has no step definitions.
- Generate Initial Responses: Configure the test and run it against the LLM to generate responses for each test case.
- Evaluate Responses: Analyze the generated responses against either predefined criteria or expected outputs dynamically created by the target LLM; which mode is used can be toggled in the test.
- Iterate and Refine: If the responses are unsatisfactory, adjust the model's parameters, training data, or architecture and repeat the process (a sketch of this loop follows the list).
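A rough sketch of that loop, under the same assumptions as the setup above; it reuses the generate_response helper from the earlier sketch, and the sample test case, threshold, and refinement step are illustrative rather than the repo's exact logic.

```python
# Illustrative iterative completion loop; test cases, threshold, and the
# refinement step are placeholders for the real configuration in the repo.
# Reuses generate_response from the setup sketch above.
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

test_cases = [
    {"input": "Explain overfitting in one sentence.",
     "expected": "Overfitting means a model memorizes training data and generalizes poorly."},
]

def is_similar(actual: str, expected: str, threshold: float = 0.7) -> bool:
    """Check whether two texts are semantically close via cosine similarity."""
    embeddings = similarity_model.encode([actual, expected], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold

for iteration in range(3):                              # Iterate and Refine
    failures = []
    for case in test_cases:                             # Generate Initial Responses
        actual = generate_response(case["input"])
        if not is_similar(actual, case["expected"]):    # Evaluate Responses
            failures.append((case["input"], actual))
    if not failures:
        break
    # In practice, refinement means adjusting prompts, parameters, or training
    # data before re-running; here the sketch simply reports what failed.
    for prompt, actual in failures:
        print(f"Iteration {iteration}: '{prompt}' failed -> {actual[:80]}")
```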
Use Cases for Iterative Completion Testing
- Identifying Biases: Iterative testing can help uncover biases in the model's training data or architecture.
- Ensuring Accuracy: By repeatedly refining responses, the model's accuracy can be improved over time.
- Evaluating Coherence: Iterative testing can help assess the model's ability to generate coherent and relevant responses.
- Measuring Creativity: The process can be used to evaluate the model's creativity and ability to generate novel ideas.
Combining LLMTestCase and SentenceTransformer
I wanted to combine the strengths of LLMTestCase and SentenceTransformer for a more comprehensive evaluation. LLMTestCase provides a structured framework for testing LLM responses against predefined criteria, while SentenceTransformer assesses the semantic similarity between the generated responses and the expected outputs. Each test reports the fields below; a sketch after the list shows one way to wire the two together.
- Context: Provides the scenario or subject area guiding the model’s response, offering insight into the perspective or background used.
- Dynamic Responses Enabled: Indicates whether the expected outputs were dynamically generated during the test (True) or predefined (False).
- Similarity Threshold: Defines the level of precision required for the model’s output to match the expected responses. A higher threshold demands closer alignment in meaning, while a lower one allows for more flexibility.
- Input: The exact prompt provided to the model, serving as the starting point of the test.
- Expected Responses: The set of responses against which the model’s actual output is compared, either dynamically generated or fixed.
- Actual Output: The model’s response to the input prompt, compared to the expected responses in terms of meaning and relevance.
- Result: Indicates whether the model’s output met the required criteria, with ✅ Pass meaning it did and ❌ Fail meaning it didn’t.
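Putting those fields together, here is a hedged sketch of scoring an LLMTestCase with SentenceTransformer similarity. The evaluate_case helper, the threshold, and the sample content are my own illustration, and the LLMTestCase construction follows a recent DeepEval version rather than the exact code in the repo.

```python
# Sketch of combining DeepEval's LLMTestCase with a SentenceTransformer check.
# evaluate_case is an illustrative helper, not a built-in DeepEval metric.
from deepeval.test_case import LLMTestCase
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def evaluate_case(case: LLMTestCase, expected_responses: list[str],
                  similarity_threshold: float = 0.7) -> bool:
    """Pass if the actual output is semantically close to any expected response."""
    actual_emb = similarity_model.encode(case.actual_output, convert_to_tensor=True)
    expected_emb = similarity_model.encode(expected_responses, convert_to_tensor=True)
    best_score = util.cos_sim(actual_emb, expected_emb).max().item()
    return best_score >= similarity_threshold

# Example built from the fields listed above, using predefined expected
# responses (i.e. Dynamic Responses Enabled = False).
case = LLMTestCase(
    input="Summarize the water cycle in two sentences.",
    actual_output="Water evaporates from oceans, condenses into clouds, and falls back as rain or snow.",
    context=["Primary-school science explanation"],
)
expected_responses = [
    "Water evaporates, condenses into clouds, and returns to the surface as precipitation."
]
result = "✅ Pass" if evaluate_case(case, expected_responses) else "❌ Fail"
print(result)
```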
Evaluation of the Solution
(Thanks to Paul Hope for the eval criteria)
- Accuracy: The solution is accurate in that it offers both semantic similarity and exact-match options, along with other ways to configure the script. DeepEval is extensive, and this script can be improved to increase accuracy, but I included features that should allow accuracy to be measured. Perhaps I could include something more quantifiable.
- Performance: The solution can be slow due to the resource-intensive nature of semantic similarity calculations, especially with larger models.
- Ease of Adoption: The script is relatively straightforward, with toggles and functions, making it accessible for testers with basic Python knowledge. Again, it can be iterated on, but the overall ease of using Python with DeepEval was really great.
- Integration: The solution integrates well with existing tools like OpenAI's API and sentence-transformers, though it requires managing dependencies.
- Extensibility: It's extensible, allowing for the addition of new contexts, prompts, and testing criteria.
- Resource Requirements: It demands significant resources, particularly for model loading and similarity calculations, which could be a limitation on lower-end systems.