Pragmatic Mutation Testing and Formal Verification - Two for the Price of One
Prahladavaradan Sampath
Development Manager and Product Lead at The MathWorks
I have been trying to make sense of mutation testing for some time now. There is considerable interest in this approach, evidenced by the large amount of literature. At the same time, it is very challenging to apply in practice - mainly because of the open-ended cost of generating mutants and executing tests. This schism between "promise and practice" has interesting parallels with formal verification, and I wanted to explore this aspect further.
Bottom line on top: I have come to the conclusion that mutation testing is a pragmatic approach for deriving (and simultaneously verifying) the specification of a program. If we consider test-cases as program specifications, and test-execution as evaluating whether a program meets a specification, then mutation testing helps to evaluate the correctness and completeness of the specification - using program execution instead of symbolic reasoning. A subsequent insight is that program-contracts, which are a pragmatic approach for formal specification and reasoning about programs, are also applicable to mutation-testing - contracts play the role of built-in tests that verify detailed algorithmic and data-structure design decisions in the program.
In the rest of this article, I will start with a brief introduction to mutation-testing and formal-verification, followed by an experiment to evaluate a few different settings for applying mutation-testing, culminating in a method that combines mutation-testing and formal-specification and verification in the form of program-contracts.
Mutation-testing
A good starting point into the literature and tools on mutation testing is the Mutation Testing Repository. The survey papers [1,2] are a great summary of the nearly 900 publications on the topic since 1970! Mutation testing is based on the observation that the quality of a test-suite is correlated with its ability to detect mutations in the program. A mutation that is not "killed" by a test-suite (a kill is indicated by the test-suite failing) needs to be investigated to assess whether:
- the mutation is benign - for example, an equivalent mutant that does not change the observable behaviour of the program; or
- the mutation indicates a gap in the test-suite that needs to be closed with additional tests.
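To make this concrete, here is a minimal sketch of a typical mutant (the function and the mutation operator are my own illustration, not taken from the experiments below):

    #include <cstddef>

    // Original: a word is stored inline only if it is shorter than 16 characters.
    bool fitsInline(std::size_t length) { return length < 16; }

    // Mutant: the relational operator '<' is replaced by '<='.
    // A test-suite kills this mutant only if some test exercises the
    // boundary value length == 16 and checks the result.
    bool fitsInline_mutant(std::size_t length) { return length <= 16; }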
Mutation testing encourages an iterative workflow:
- generate mutants of the program and run the test-suite against each of them;
- inspect every mutation that survives and classify it as benign or harmful;
- for each harmful mutation, add or strengthen tests until the mutation is killed;
- repeat until the remaining undetected mutations are all benign.
By following this iterative process, we end up with a program and a test-suite that capture the functionality of the program to a very high degree. The following is a brief analysis of the strengths and weaknesses of mutation testing.
Strengths
- It is based on executing the actual program against its actual test-suite - no modelling or symbolic reasoning is required.
- It gives a concrete, quantifiable measure of the quality of a test-suite.
Weaknesses
- The cost of generating mutants and executing the test-suite against each of them is open-ended.
- Undetected mutants must be classified manually as benign or harmful, which is laborious.
Formal Verification
Formal verification refers to the use of mathematical techniques to prove properties of a program. The main distinguishing characteristic of formal verification is its ability to provide guarantees for the absence of defects in a program - which is something that software testing can never achieve! There is a vast body of literature on formal verification (far greater than that for mutation testing), and it spans a very wide range of topics, from verification of communication protocols [7], to automated detection of numerical defects in C programs, to proving that an implementation of a program meets its formal specification [8, 9, 10, 11, 12, 13].
For the purpose of comparison with mutation-testing, the most relevant approach is formal specification and proof of a program. This in turn depends on having a correct and complete specification of the program - we need a mathematically precise statement of what it means for a program to be correct. This is typically done using Hoare triples [14], which use logical statements (Boolean-valued expressions in programs) to state pre- and post-conditions for a program.
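As a minimal sketch of the idea (my own illustration, not taken from the cited papers), the Hoare triple {x >= 0} y := x + 1 {y > 0} can be approximated in code by checking the pre-condition before the statement and the post-condition after it:

    #include <cassert>

    // Hoare triple {x >= 0} y := x + 1 {y > 0}, approximated with asserts
    // (ignoring integer overflow for the purpose of the illustration).
    int increment(int x) {
        assert(x >= 0);  // pre-condition P
        int y = x + 1;   // program C
        assert(y > 0);   // post-condition Q
        return y;
    }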
Strengths
- It can guarantee the absence of defects, rather than merely demonstrate their presence.
- It forces a mathematically precise statement of what the program is supposed to do.
Weaknesses
- It depends on having a correct and complete formal specification, which is hard to write.
- Symbolic reasoning about realistic programs is expensive and requires specialist expertise.
At first sight, mutation-testing and formal verification seem to live at two extremes of a spectrum. Mutation testing places emphasis on program execution and testing, while formal verification places emphasis on precise specification and symbolic analysis. However, I hope to show by the end of this article that they can be meaningfully and effectively combined in a pragmatic development methodology.
Experimental evaluation
The experimental setup evaluates three different development settings, with the aim of developing insights into different ways of integrating mutation-testing into a development workflow:
- mutation testing against an ad-hoc unit-test suite;
- mutation testing with additional requirements-based tests;
- mutation testing with requirements-based tests combined with design-by-contract (DBC) and additional top-up tests.
Coming to tools, I initially wanted to do experiments using publicly available tools. The repository references nearly 80 tools, a few of which operate on C/C++ programs. A Google search identified a few more, and after some preliminary investigation I narrowed my attention to the following tools:
- Mutate-cpp: a python-based tool with a simple interface for selecting mutations and running tests;
- Mull: a clang/LLVM-based mutation testing tool for C and C++.
I finally chose Mutate-cpp for this article as I felt it was easy to set up, has a very easy-to-use interface, and fits in with a unit-testing setup I am comfortable with. I also preferred its "opt-in" approach to selecting files to mutate over the "exclude" approach of Mull! The python implementation of Mutate-cpp also fits in with the quick-and-dirty experimentation approach I had in mind. I do believe the clang-based approach of Mull is intrinsically more powerful and can potentially avoid generating a large number of syntactically invalid mutants!
For formal specification, I decided to use a pragmatic approach and used a simple library for expressing contracts in C++ [4]. This allows writing class-invariants, function pre/post-conditions, and even loop-invariant and loop-variant contracts. It uses the C++ assert statement to halt program execution if a contract is violated. In the exercise below I mainly used class-invariants and method pre/post-condition forms of contracts. Contracts can be used as the basis for formal verification [5,6].
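I will not reproduce the library here; the following is a minimal sketch of the general idea, assuming assert-based macros (the names PRE, POST and INVARIANT are placeholders of my own, not the library's actual API):

    #include <cassert>

    // Minimal assert-based contract helpers: halt execution with a
    // diagnostic when a contract is violated.
    #define PRE(cond)       assert((cond) && "pre-condition violated")
    #define POST(cond)      assert((cond) && "post-condition violated")
    #define INVARIANT(cond) assert((cond) && "class invariant violated")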
The basis for the experiment is a dictionary class that maintains a collection of words in an efficient trie structure. It has methods to add words, check whether a word exists in the dictionary, and query the size of the dictionary. The complexity of the data-structure is largely contained in the method to add a word. The code for this class is available at: practical-mutation-testing (github.com). The repository has three branches, one for each of the three experiments described below.
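As a rough sketch of the shape of the class (the method and member names here are assumptions based on the description above, not the repository's exact API):

    #include <cstddef>
    #include <map>
    #include <string>

    // A node in a prefix-based trie that stores a collection of words.
    class Dictionary {
    public:
        void addWord(const std::string& word);        // insert a word
        bool hasWord(const std::string& word) const;  // membership query
        std::size_t size() const;                     // number of words stored

    private:
        std::string prefix_;                   // prefix held by this node
        std::map<char, Dictionary> children_;  // child nodes, keyed by next character
        bool isWord_ = false;                  // does a word end at this node?
    };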
Experiment 1 - Mutation testing with ad-hoc unit-test suite
This is the first implementation of the dictionary class, which was developed along with a handful of unit-tests: testDictionary.cpp. These unit-tests were not methodically developed - they were written based on a few of the design decisions taken in the dictionary class. I was able to develop a fair bit of confidence in the implementation based on these unit-tests. However, just looking at the tests at a later point, it is very difficult (even for me) to fathom exactly what is being tested by each test-case, and what motivated that particular test-case to be written!
At this point, I launched the mutate-cpp tool and performed mutation testing on the main source file (Dictionary.cpp). The mutation testing process implemented by mutate-cpp modifies (mutates) the source-file, builds the project, and runs the tests. Failures in the build or in the tests are taken as evidence of the mutation being detected ("killed"). The remaining mutations are given an "unknown" classification, and the expectation is that each such mutation is inspected to decide whether it is "benign" or whether it indicates a gap in the test setup.
Experimental results
The basic statistics of this session of mutation-testing are as follows:
Manual investigation of the mutations revealed that 104 of them would actually introduce faults and yet were not detected by the existing unit-tests; only 88 of the undetected mutations were benign. This manual investigation was very laborious - it took approximately 8-10 hours of effort to individually classify all the mutations.
I was surprised by the large number of mutations that escaped detection. However, I did not go ahead and simply add unit-tests, because the quality of the existing unit-tests was already very poor: the test-cases were unstructured, and it was difficult to figure out the intention of a test-case from knowledge of the expected functionality of the dictionary class alone. Adding test-cases just to catch mutations in the source would only introduce more difficult-to-understand test-cases!
Insights from Experiment 1
I did gain some insights from this laborious exercise of analyzing all the undetected mutants. One insight was a more fine-grained classification of mutants in the program:
- mutants that are syntactically invalid or can be rejected by a sufficiently strict compiler;
- mutants that trigger undefined behaviour, which can sometimes be caught by dynamic analysis tools such as sanitizers;
- benign (equivalent) mutants that do not change the observable behaviour of the program;
- genuinely harmful mutants that must be caught by tests or contracts.
Another insight from this experiment was that executing test-cases is not the only way to detect mutations! Many mutations resulted in code that would trigger warnings in compilers. Based on this insight, I changed the build settings to increase the strictness of compilation - since I was using gcc [15], I added the flags "-Wall -Wextra -Werror". In later experiments, this resulted in a noticeable reduction in the number of undetected mutants.
A related observation was that enabling dynamic instrumentation for detecting memory defects also helped in a few cases. In later experiments I used the gcc flags "-fsanitize=undefined -fsanitize=address" to enable runtime defect checking. A few mutants were detected by these dynamic analysis tools, but as I only enabled them fairly late in the experiment, I do not have much data on their potential efficacy.
Experiment 2 - Mutation testing with additional requirements based tests
The second stage of the experiment aimed to be more methodical in identifying and writing unit-tests. I identified a set of 13 requirements for the dictionary class, and wrote tests that targeted each identified requirement. The file testDictionaryRequirements.cpp includes the requirements as comments, and the test-cases carry comments that trace them to specific requirements. These test-cases are much more structured and easy to understand. Overall, the quality of the requirements-based test-suite is certainly better than the ad-hoc test-suite of the first experiment. For mutation-testing, I enabled both the requirements-based tests and the ad-hoc unit-tests from the previous experiment.
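As an illustration of the style (the requirement text, numbering, and test name are invented for this sketch, not copied from testDictionaryRequirements.cpp; it also assumes the Dictionary sketch above):

    #include <cassert>

    // REQ-3: A word that has been added to the dictionary shall be
    //        reported as present by the membership query.
    void test_REQ_3_added_word_is_found() {
        Dictionary d;
        d.addWord("trie");
        assert(d.hasWord("trie"));  // traces to REQ-3
    }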
Experimental Results
The basic statistics of this session of mutation-testing are as follows:
(Note: the differences in mutant-generation numbers from the first experiment are because I had made some minor changes to the source code and added a utility function. The main body of the class was untouched.)
Insights from Experiment 2
As we can see, adding requirements-based tests did not significantly improve the ability of the test-suite to detect mutations. The percentage of undetected harmful mutations dropped from 14.4% to 13.8% - not a very significant change!
Experiment 3 - Mutation testing with requirements based tests, DBC, and additional top-up tests
The final stage of the experiment was to incorporate formal specifications in the form of contracts. I started the experiment with an initial set of contracts - a few illustrative ones are shown below.
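As a sketch of their flavour (using the placeholder PRE/POST macros from earlier; the repository's actual contracts may differ, and addWordChecked is a placeholder of my own - the repository wires the contracts directly into the methods):

    #include <cstddef>
    #include <string>

    // Contract-checked insertion: pre-condition on the input, weak
    // post-conditions on membership and size.
    void Dictionary::addWordChecked(const std::string& word) {
        PRE(!word.empty());                   // only non-empty words are stored
        const std::size_t oldSize = size();
        addWord(word);                        // the actual trie insertion
        POST(hasWord(word));                  // the added word must now be found
        POST(size() == oldSize || size() == oldSize + 1);  // size grows by at most one
    }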
A Digression
For those of you who have not yet used formal specification or contracts, an interesting phenomenon occurs when trying to figure out the contracts for a piece of software - writer's block! This is similar to, but much more severe than, the equivalent phenomenon that all of us are familiar with when writing requirements.
If you are familiar with contracts, you may have come across statements such as "contracts define the obligations of a piece of software" - in my experience, these statements make sense only after you have used contracts in a few scenarios. I have personally found that a good way to get started with contracts is to think of "class-invariants" - these are Boolean expressions relating different variables/attributes of a class, which are always true for an external observer of the class. For example, the dictionary class acts as a node in a prefix-based trie, and a simple invariant is that the dictionary is non-empty if and only if the node holds a non-empty prefix!
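Expressed with the placeholder macro from earlier, that invariant is a one-liner (checkInvariant is a hypothetical helper, called at the end of every public method; prefix_ is the node's stored prefix):

    // Class invariant of the trie node: the dictionary is non-empty
    // exactly when the node holds a non-empty prefix.
    void Dictionary::checkInvariant() const {
        INVARIANT((size() > 0) == !prefix_.empty());
    }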
Another useful way of thinking of contracts - and one which is relevant to mutation testing - is as "built-in tests". These built-in tests take the form of specific conditions that are checked on every execution of the program. The connection with formal verification becomes clear when we realize that a single execution (unit-test) can be thought of as a single "case" in a proof of the contracts. Unit-tests can be thought of as case-splits in a proof of the contracts, and executing a test without violating any contract is a lemma proving that particular case-split!
Experimental Results
I used the results of mutation testing to evaluate, for each harmful mutant, whether I could strengthen contracts or, in some cases, add additional test-cases to ensure that specific contracts were exercised. I added 10 test-cases targeting specific mutations (testDictionary_cov.cpp); I named the test-cases with the numeric id of the mutation as generated by the mutate-cpp tool. Note that in most cases the newly added test-cases actually violated contracts that were already present - the corresponding mutation went undetected simply because the contracts were not exercised completely! There were also situations where I had to strengthen contracts; an example of this kind of strengthening is sketched below.
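The repository contains the exact contracts; the following reconstruction (not the repository's code) illustrates what strengthening means here - a weak post-condition on insertion is tightened so that mutants corrupting the size bookkeeping are caught at runtime:

    // Before strengthening, the post-condition only checked membership:
    //     POST(hasWord(word));
    // After strengthening, the exact size change is pinned down as well.
    void Dictionary::addWordChecked(const std::string& word) {
        PRE(!word.empty());
        const bool wasPresent = hasWord(word);
        const std::size_t oldSize = size();
        addWord(word);
        POST(hasWord(word));
        POST(size() == (wasPresent ? oldSize : oldSize + 1));
    }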
The basic statistics for the first run of mutation testing (before strengthening contracts):
The basic statistics of mutation-testing after a round of strengthening contracts and adding unit-tests targeting specific mutations are as follows:
Insights from Experiment 3
As we can see, there is a significant improvement in the mutation score with the use of contracts. Of the 4 remaining harmful mutations identified in the final round of mutation testing, three trigger "undefined behaviour" of the language - for example, an out-of-range array access. These are very difficult to detect, even using tools like address sanitizers. The final harmful mutation in the code is a genuine mutation, which needs to be addressed with an additional unit-test!
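A sketch of the kind of undefined-behaviour mutant meant here (illustrative, not the actual mutant from the run):

    int counts[16] = {0};

    // Original loop: indices stay within the bounds of counts.
    for (int i = 0; i < 16; ++i) { counts[i] = 0; }

    // Mutant: '<' changed to '<=' - the access counts[16] is out of range,
    // which is undefined behaviour; neither an assertion nor a sanitizer
    // is guaranteed to flag it.
    for (int i = 0; i <= 16; ++i) { counts[i] = 0; }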
An interesting observation is that the additional test-cases added in this step are similar to the ones I added as part of experiment 1 above - they are very difficult to understand from a functional standpoint. The difference here is that each test-case can be clearly traced to the specific set of contracts it exercises, and to a corresponding mutation that is detected!
Another insight from this exercise is that the process of investigating whether specific mutations are harmful or benign is very helpful in identifying the contracts that need to be strengthened - in many cases the mutation I was analyzing directly suggested the contract I should strengthen or add! In hindsight this is almost obvious, but it was unexpected at the beginning. As a result of this exercise, I am confident I have identified a set of contracts that characterize the design of the dictionary class with a very high degree of completeness and precision!
Conclusion
This exercise of evaluating mutation testing and formal verification has led to some very interesting insights. The design problem of a trie-based dictionary is at a sweet-spot of complexity - it is not a trivial problem, and at the same time it is easy to explain the design decisions with a handful of requirements and contracts. The exercise clearly indicated the need for tool-support for managing mutation-testing and automating much of the drudgery in analyzing mutations:
- avoiding the generation of syntactically invalid or compiler-detectable mutants in the first place;
- using strict compiler settings and dynamic analysis tools (sanitizers) as additional mutant-killing mechanisms;
- tracking the benign/harmful classification of mutants across sessions, so that the laborious manual analysis is not repeated.
The most important take-away of this exercise, however, was the interplay between mutation testing and formal specification of contracts - both these approaches are expensive, but they support and strengthen each other in a surprisingly nice manner. I feel this combination of mutation testing and formal verification can be a very effective methodology for software that requires the additional rigor of verification that these methods promise to provide individually - two for the price of one!
References
[1] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, Mark Harman, Chapter Six - Mutation Testing Advances: An Analysis and Survey, Editor(s): Atif M. Memon, Advances in Computers, Elsevier, Volume 112, 2019, Pages 275-378.
[2] Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Development of Mutation Testing. IEEE Trans. Softw. Eng. 37, 5 (September 2011), 649-678.
[3] Pedro Delgado-Pérez, Inmaculada Medina-Bulo, Francisco Palomo-Lozano, Antonio García-Domínguez, Juan José Domínguez-Jiménez, Assessment of class mutation operators for C++ with the MuCPP mutation system, Information and Software Technology, Volume 81, 2017, Pages 169-184.
[7] https://spinroot.com/spin/whatispin.html