As work on this continued, I simplified this article so I wouldn't have to keep maintaining it as the code changed. Here's the repo. The code is provided without any guarantees; use it at your own risk.
I used the DeepEval documentation, ChatGPT 4o, and Claude 3.5 Sonnet for help. The model used for semantic similarity is the sentence-transformers model paraphrase-MiniLM-L6-v2; the target LLM is gpt-3.5-turbo, accessed via the OpenAI API.
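For reference, here is a minimal setup sketch of the pieces named above. It assumes the 1.x openai Python SDK and the sentence-transformers package, with OPENAI_API_KEY set in the environment; the generate_response helper is my own illustration, not the exact code from the repo.

```python
# Minimal setup sketch (assumes OPENAI_API_KEY is set in the environment).
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Local embedding model used for semantic similarity scoring.
similarity_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Client for the target LLM, reached over the OpenAI API.
client = OpenAI()

def generate_response(prompt: str) -> str:
    """Send a single prompt to gpt-3.5-turbo and return its text reply."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```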
Iterative completion testing is a method used to evaluate the performance of large language models (LLMs) by repeatedly refining their responses through a series of iterations. This process helps to identify and address potential issues in the model's training.
How Iterative Completion Testing Works
- Define Test Cases: Create a set of test cases that represent the scenarios or prompts the model is expected to handle effectively. I've included a Gherkin-based template for writing test cases; however, it is not executable and has no step definitions.
- Generate Initial Responses: Configure the test and run it against the LLM to generate responses for each test case.
- Evaluate Responses: Analyze the generated responses against either predefined criteria or expected outputs dynamically created by the target LLM; which mode is used can be toggled in the test.
- Iterate and Refine: If the responses are unsatisfactory, adjust the model's parameters, training data, or architecture and repeat the process (a sketch of this loop follows the list).
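A rough sketch of that loop, under the same assumptions as the setup above; it reuses the generate_response helper from the earlier sketch, and the sample test case, threshold, and refinement step are illustrative rather than the repo's exact logic.

```python
# Illustrative iterative completion loop; test cases, threshold, and the
# refinement step are placeholders for the real configuration in the repo.
# Reuses generate_response from the setup sketch above.
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

test_cases = [
    {"input": "Explain overfitting in one sentence.",
     "expected": "Overfitting means a model memorizes training data and generalizes poorly."},
]

def is_similar(actual: str, expected: str, threshold: float = 0.7) -> bool:
    """Check whether two texts are semantically close via cosine similarity."""
    embeddings = similarity_model.encode([actual, expected], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold

for iteration in range(3):                              # Iterate and Refine
    failures = []
    for case in test_cases:                             # Generate Initial Responses
        actual = generate_response(case["input"])
        if not is_similar(actual, case["expected"]):    # Evaluate Responses
            failures.append((case["input"], actual))
    if not failures:
        break
    # In practice, refinement means adjusting prompts, parameters, or training
    # data before re-running; here the sketch simply reports what failed.
    for prompt, actual in failures:
        print(f"Iteration {iteration}: '{prompt}' failed -> {actual[:80]}")
```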
Use Cases for Iterative Completion Testing
- Identifying Biases: Iterative testing can help uncover biases in the model's training data or architecture.
- Ensuring Accuracy: By repeatedly refining responses, the model's accuracy can be improved over time.
- Evaluating Coherence: Iterative testing can help assess the model's ability to generate coherent and relevant responses.
- Measuring Creativity: The process can be used to evaluate the model's creativity and ability to generate novel ideas.
Combining LLMTestCase and SentenceTransformer
I wanted to combine the strengths of LLMTestCase and SentenceTransformer for a more comprehensive evaluation. LLMTestCase provides a structured framework for testing LLM responses against predefined criteria, while SentenceTransformer assesses the semantic similarity between the generated responses and the expected outputs. Each test reports the fields below; a sketch after the list shows one way to wire the two together.
- Context: Provides the scenario or subject area guiding the model’s response, offering insight into the perspective or background used.
- Dynamic Responses Enabled: Indicates whether the expected outputs were dynamically generated during the test (True) or predefined (False).
- Similarity Threshold: Defines the level of precision required for the model’s output to match the expected responses. A higher threshold demands closer alignment in meaning, while a lower one allows for more flexibility.
- Input: The exact prompt provided to the model, serving as the starting point of the test.
- Expected Responses: The set of responses against which the model’s actual output is compared, either dynamically generated or fixed.
- Actual Output: The model’s response to the input prompt, compared to the expected responses in terms of meaning and relevance.
- Result: Indicates whether the model’s output met the required criteria, with ✅ Pass meaning it did and ❌ Fail meaning it didn’t.
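Putting those fields together, here is a hedged sketch of scoring an LLMTestCase with SentenceTransformer similarity. The evaluate_case helper, the threshold, and the sample content are my own illustration, and the LLMTestCase construction follows a recent DeepEval version rather than the exact code in the repo.

```python
# Sketch of combining DeepEval's LLMTestCase with a SentenceTransformer check.
# evaluate_case is an illustrative helper, not a built-in DeepEval metric.
from deepeval.test_case import LLMTestCase
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def evaluate_case(case: LLMTestCase, expected_responses: list[str],
                  similarity_threshold: float = 0.7) -> bool:
    """Pass if the actual output is semantically close to any expected response."""
    actual_emb = similarity_model.encode(case.actual_output, convert_to_tensor=True)
    expected_emb = similarity_model.encode(expected_responses, convert_to_tensor=True)
    best_score = util.cos_sim(actual_emb, expected_emb).max().item()
    return best_score >= similarity_threshold

# Example built from the fields listed above, using predefined expected
# responses (i.e. Dynamic Responses Enabled = False).
case = LLMTestCase(
    input="Summarize the water cycle in two sentences.",
    actual_output="Water evaporates from oceans, condenses into clouds, and falls back as rain or snow.",
    context=["Primary-school science explanation"],
)
expected_responses = [
    "Water evaporates, condenses into clouds, and returns to the surface as precipitation."
]
result = "✅ Pass" if evaluate_case(case, expected_responses) else "❌ Fail"
print(result)
```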
Evaluation of the Solution
(Thanks to Paul Hope for the eval criteria)
- Accuracy: The solution is accurate in that it offers both semantic similarity and exact-match options, along with other ways to configure the script. DeepEval is extensive, and this script can be improved to increase accuracy, but I included features that should allow accuracy to be measured. Perhaps I could include something more quantifiable.
- Performance: The solution can be slow due to the resource-intensive nature of semantic similarity calculations, especially with larger models.
- Ease of Adoption: The script is relatively straightforward, with toggles and functions, making it accessible for testers with basic Python knowledge. Again, it can be iterated on, but the overall ease of using Python with DeepEval was really great.
- Integration: The solution integrates well with existing tools like OpenAI's API and sentence-transformers, though it requires managing dependencies.
- Extensibility: It's extensible, allowing for the addition of new contexts, prompts, and testing criteria.
- Resource Requirements: It demands significant resources, particularly for model loading and similarity calculations, which could be a limitation on lower-end systems.