Building automated testing for AI chatbots
Leonardo Lanni
Expert in Software Quality Engineering · Founder of QA Roots · International Speaker · Coach · Trainer
Co-written article with Stefano La Cesa
Introduction
In the rapidly evolving landscape of artificial intelligence (AI) and natural language processing, the advent of powerful language models such as ChatGPT has revolutionised human-computer interactions. These models, characterised by their ability to generate coherent and contextually relevant responses, play a pivotal role in various applications, ranging from virtual assistants to content generation.
However, as the deployment of these models becomes increasingly widespread, the challenge of quality assurance (QA), that is, verifying the quality, correctness, and relevance of the responses to the questions posed in prompts, poses a significant hurdle for software development teams.
At the heart of the QA challenge lies the unique nature of Large Language Models (LLMs) such as GPT-4, Gemini, Llama 2 and others, which operate on a fundamentally probabilistic basis. Unlike traditional software systems that yield deterministic outputs for a given input, language models like ChatGPT generate responses that are contingent on the input prompt and the inherent uncertainty embedded within their training data. As a result, testing these models becomes a complex endeavour, as the same prompt may elicit varied responses across attempts, showcasing the inherent randomness of their output.
This article delves into the intricacies of testing Large Language Models, shedding light on the complexities that arise from the probabilistic nature of their responses. More importantly, it explores solutions to address these challenges, particularly within the context of automating test cases in continuous integration and continuous deployment (CI/CD) pipelines.
As companies increasingly integrate LLM functionality into their software, a robust testing framework becomes imperative to ensure the reliability, performance, and user satisfaction of applications leveraging such AI technologies.
The challenge
Let's imagine an application containing a feature that allows users to interact via text with a bot whose goal is to provide the user with the requested information as accurately as possible.
The challenge is to make sure that such information is always precise, accurate, and fulfils the user's needs; in other words, that for a given request in a prompt, the AI-powered bot provides the right response (correct, valid, up to date, complete, precise, relevant to the user's request, safe, useful, etc.).
When such a bot is powered by a third-party LLM, such as ChatGPT, the challenge is that the behaviour of the bot can change over time due to changes in the third-party AI, which are typically outside the company's control and not always predictable.
It therefore becomes clear that a mechanism is needed to provide fast feedback and guarantee that the bot keeps returning the expected responses to users' prompts and requests, both as the company deploys new versions of the application and as the third-party LLM itself changes.
A classic Quality Assurance approach would be to verify the quality of the bot's responses on a regular basis: define a set of requests and, for each, what constitutes a valid response, then submit those requests to the bot as prompts and manually verify the responses.
Assuming the set of requests provides sufficient coverage of the range of expected and probable user requests, this activity is a sound way to make sure the bot keeps behaving correctly and providing good-quality responses to users. However, it requires manual effort to execute the test cases, as well as a subject-matter expert able to formulate appropriate questions based on the data feeding the AI behind the Large Language Model.
This is certainly a good first step and a valid initial QE process to track and keep the quality of the bot under control. However, because of its intrinsically manual, human-based nature, it may struggle or fail to scale as the number of possible prompts grows (as the functionality expands) or as new application versions ship more frequently, and it has a single point of failure in its dependency on a manual tester acting as the subject-matter expert.
The need then becomes obvious for an automated solution that can scale with the growth of the bot's domain and the frequency of new releases of the application.
In this regard, it is possible to automate the testing of the bot functionality by writing an automated test case (at the UI or API level, using a common testing tool such as Cypress.io) that sends a series of prompts to the bot and, for each, checks the quality, correctness, and pertinence of the response.
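As a sketch of what such a test might look like, here is a minimal Python/pytest example that posts prompts to a hypothetical chat endpoint; the URL, payload shape, and response field names are illustrative assumptions, not any specific product's API:

```python
# Minimal pytest sketch of an API-level bot test.
# The /api/chat endpoint and the "message"/"answer" fields are hypothetical.
import pytest
import requests

BOT_URL = "https://example.com/api/chat"  # hypothetical endpoint

PROMPTS = [
    ("What is the capital of England?", "London"),
    ("Who wrote 'Romeo and Juliet'?", "Shakespeare"),
]

@pytest.mark.parametrize("prompt,expected", PROMPTS)
def test_bot_answers_prompt(prompt, expected):
    # Send the prompt exactly as a user would.
    reply = requests.post(BOT_URL, json={"message": prompt}, timeout=30)
    reply.raise_for_status()
    answer = reply.json()["answer"]
    # Placeholder check; the rest of the article discusses why simple
    # string comparisons are not good enough for LLM output.
    assert expected in answer
```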
The challenge lies in the fact that, by their nature, LLMs will typically provide different responses for a given request, due to the probabilistic approach of the model: responses that may be semantically equivalent but differ in wording and phrasing.
Therefore, the classic approach in automated testing, verifying that for a given input (the prompt) the system (the AI bot) produces an output (the actual response) that the tool validates as equal to the expected output, clearly will not work, as the actual response can vary from one attempt to the next.
So a plain string comparison (actual response equals expected) will not work, leading to false negatives (the test fails even though the result is correct) and a formally invalid automated test case.
A quick win is to replace the equality check with a more relaxed "contains" check applied only to the key term expected in the response (example: for "What is the capital of England?", the expected response is anything containing "London").
This approach is for sure better than the previous one, but it can also generate false positives: if for some strange reason (e.g. a code bug) the LLM is "hallucinating" and returns a response like "For sure it is not London", that answer will pass the test case even though it is clearly wrong.
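The two failure modes can be illustrated with a few lines of Python on made-up responses (the strings are illustrative, not real model output):

```python
# Why the two naive checks are fragile.
expected = "London"

# Attempt 1: exact string equality -> false negative on a correct answer.
actual = "The capital of England is London."
print(actual == expected)        # False, yet the answer is right

# Attempt 2: relaxed "contains" check -> false positive on a wrong answer.
hallucinated = "For sure it is not London."
print(expected in hallucinated)  # True, yet the answer is wrong
```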
In both cases it is also possible that, from time to time, the test passes because the LLM happens to return exactly the expected value, producing a classic "flaky" test case (intermittently green or red) that here depends on the probabilistic, typically varying response of the model. A flaky test case is one of the most unwanted outcomes in automated testing: it is unreliable and, if not fixed, typically ends up being quarantined by the development team and then forgotten, leaving the feature potentially uncovered (at least by automation).
What is needed, then, is an approach to test the responses of an LLM-based bot in an automated way that is reliable and trustworthy and that can deal with ever-changing responses.
Simple approach: semantic similarity
Ideally, the purpose of a test set should be to evaluate how closely the meaning of the provided answer aligns with the predetermined correct answer. To do that, let’s introduce the concepts of word embeddings and semantic similarity.
Word embeddings are representations of words as real-valued vectors of n dimensions, where each dimension corresponds to a particular feature of the word. This way, words that are closer in the vector space are expected to be similar in at least one of their meanings.
As an example, every word will be encoded as a real-valued vector [a, b, c, d, ..., n] whose length is typically between 100 and 1000 and where each value is a real number. The word "puppy" will have some values similar to the word "dog" (because a puppy will eventually grow up to be a dog) and other values similar to the word "kitten" (because they are both young animals).
Embeddings of words composing a sentence can be aggregated into sentence embeddings that are expected to grasp the semantic meaning of a sentence.
A quick way to get started is to use word embeddings that have been pre-trained on large corpora. A few popular alternatives are GloVe and Word2Vec.
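As an illustration, the snippet below loads a set of pre-trained GloVe vectors through gensim's downloader and compares a few words; the model name is one of the publicly available GloVe packages, and the first call downloads it (roughly 130 MB):

```python
# Sketch: loading pre-trained GloVe vectors via gensim's downloader API.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors

print(vectors["puppy"][:5])                  # first 5 of the 100 dimensions
print(vectors.similarity("puppy", "dog"))    # high similarity
print(vectors.similarity("puppy", "kitten")) # also high: both are young animals
print(vectors.similarity("puppy", "table"))  # noticeably lower
```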
Semantic similarity provides a measure of semantic likeness between vectors, facilitating the evaluation of how closely the meaning of two pieces of text aligns. As an example, cosine similarity is based on the angle between two vectors representing the sentence.
Cosine similarity based on embeddings has the advantage of being very fast to compute; it takes synonyms into account and ignores sentence length, so the sentences "a puppy" and "a young black dog" will turn out to be very similar.
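A crude way to obtain a sentence embedding is to average the word vectors of the sentence. The sketch below does exactly that with the same pre-trained GloVe vectors used above and compares sentences with cosine similarity; the tokenisation and averaging are deliberately simplistic:

```python
# Sketch: sentence embeddings by averaging pre-trained GloVe word vectors,
# compared with cosine similarity.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

def sentence_embedding(sentence: str) -> np.ndarray:
    # Average the vectors of the known words; a real pipeline would also
    # handle punctuation, casing and out-of-vocabulary words more carefully.
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(sentence_embedding("a puppy"),
                        sentence_embedding("a young black dog")))  # high

# The same measure is largely blind to negation: averaging washes out "not".
print(cosine_similarity(sentence_embedding("it is London"),
                        sentence_embedding("it is not London")))   # also high
```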
Cosine similarity on averaged word embeddings has a few drawbacks: it largely ignores word order, so it struggles with linguistic nuances such as negation ("it is London" and "it is not London" look almost identical), and it tends to lose precision as sentences grow longer and more complex.
In many scenarios these drawbacks will not pose a serious obstacle, but if the bot’s answers tend to be more complex, there may be a need to ensure an even better semantic match. In order to achieve this, a technique for evaluating semantic similarity using a BERT language model will be described in the following paragraphs.
More advanced approach: the BERT approach
Having seen where the previous approach falls short, it is time to explore the capabilities of BERT (Bidirectional Encoder Representations from Transformers) in addressing the negation problem. Let's first understand why cosine similarity fails here: averaging the word vectors of a sentence discards word order and context, so a negation word like "not" barely shifts the resulting vector, and a sentence and its negation end up looking almost identical.
BERT models can be used to extract a vector representing a sentence embedding that does not depend only on the embedding of each word, but also takes several other factors into account, including the position of each word.
A relatively simple way to get started with this approach is through the popular Sentence Transformers library (https://www.sbert.net). The sentences need to be tokenised so that the BERT model can process them and return an embedding for each sentence.
As with the previous approach, the final step consists in comparing the sentence embeddings using the same cosine similarity function as before.
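A minimal sketch with Sentence Transformers might look like the following; the model name is just one publicly available checkpoint from the library (substitute a BERT-based checkpoint if preferred), and tokenisation is handled internally by encode():

```python
# Sketch: sentence embeddings and cosine similarity with Sentence Transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example public checkpoint

expected = "The capital of England is London."
candidates = [
    "London is the capital city of England.",  # correct, rephrased
    "For sure it is not London.",              # negated / wrong
]

expected_emb = model.encode(expected, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)

# util.cos_sim returns a matrix of pairwise cosine similarities.
scores = util.cos_sim(expected_emb, candidate_embs)
for sentence, score in zip(candidates, scores[0]):
    print(f"{float(score):.3f}  {sentence}")
```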
While BERT offers substantial advantages in understanding contextual nuances and semantic relationships, there are downsides and challenges associated with its use compared to simple cosine similarity, particularly in terms of resource consumption and interpretability:
First of all, BERT models demand significant resources, both in terms of storage space and processing power, making them less suitable for deployment in resource-constrained environments such as mobile or edge devices. Usage from a laptop is possible, but a recent GPU with dedicated memory is recommended.
Regarding interpretability, BERT models, like all deep neural network models, are often considered "black boxes": understanding their inner workings and interpreting the specific reasons behind their predictions can be challenging, which may be a drawback in scenarios where interpretability is crucial for decision-making or compliance.
In many cases, when results are unsatisfactory, the first thing to do is to tweak the model's parameters and vary its inputs, following heuristics that have proved to work in similar settings.
Conclusions
In conclusion, the concept of cosine similarity has been explored using pre-trained word and sentence embeddings, as well as embeddings extracted with BERT, in the context of chatbot testing and semantic understanding.
To choose between the two approaches, consider that simple cosine similarity may suffice for most text similarity tasks or scenarios with limited computational resources, whereas BERT embeddings are preferable when dealing with longer and more complex chatbot responses, especially when handling linguistic nuances like negation.
For those eager to begin experimenting, Python libraries such as scikit-learn or gensim offer cosine similarity implementations. These libraries provide user-friendly and computationally efficient functions for calculating cosine similarity between vectors.
Integrating BERT may require more effort and better hardware, but a quick start can be made using the Sentence Transformers library. Exploring frameworks like TensorFlow or PyTorch allows for customisation of existing BERT models to fit specific domain requirements identified in the scenario.
By doing so, a fully automated test can be introduced to check and evaluate the quality of AI-generated responses to prompts. When these tests pass, new software versions can be released with confidence that anything relying on AI-generated content functions correctly, reducing time to market compared to manual verification thanks to the automated AI testing pipeline.
However, a failure in these automated tests should prompt developers to focus on the code handling the AI component, identify and fix any issues (including third-party-related ones), and re-execute the automated tests until the AI-generated responses closely match the expected ones.
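As a closing illustration, here is a minimal pytest-style sketch that ties the pieces together; the bot endpoint, response fields, and the 0.7 threshold are illustrative assumptions to be tuned against real data:

```python
# End-to-end sketch of a threshold-based semantic check in a pytest test.
import pytest
import requests
from sentence_transformers import SentenceTransformer, util

BOT_URL = "https://example.com/api/chat"  # hypothetical endpoint
THRESHOLD = 0.7                           # tune empirically per domain

model = SentenceTransformer("all-MiniLM-L6-v2")

CASES = [
    ("What is the capital of England?",
     "The capital of England is London."),
    ("Who wrote 'Romeo and Juliet'?",
     "Romeo and Juliet was written by William Shakespeare."),
]

@pytest.mark.parametrize("prompt,expected", CASES)
def test_bot_answer_is_semantically_close(prompt, expected):
    reply = requests.post(BOT_URL, json={"message": prompt}, timeout=30)
    reply.raise_for_status()
    answer = reply.json()["answer"]

    # Compare the actual answer with the expected one semantically.
    score = float(util.cos_sim(model.encode(answer, convert_to_tensor=True),
                               model.encode(expected, convert_to_tensor=True)))
    assert score >= THRESHOLD, (
        f"Semantic similarity {score:.2f} below {THRESHOLD}: {answer!r}"
    )
```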