Pragmatic Mutation Testing and Formal Verification - Two for the Price of One

I have been trying to make sense of mutation testing for some time now. There is considerable interest in this approach, as evidenced by the large body of literature. At the same time, it is very challenging to apply in practice - mainly because of the open-ended cost of generating mutants and executing tests. This schism between "promise and practice" has interesting parallels with formal verification, and I wanted to explore this aspect further.

Bottom line on top: I have come to the conclusion that mutation testing is a pragmatic approach for deriving (and simultaneously verifying) the specification of a program. If we consider test-cases as program specifications, and test-execution as evaluating whether a program meets a specification, then mutation testing helps to evaluate the correctness and completeness of the specification - using program execution instead of symbolic reasoning. A subsequent insight is that program-contracts, which are a pragmatic approach to formal specification and reasoning about programs, are also applicable to mutation-testing - contracts play the role of built-in tests that verify detailed algorithmic and data-structure design decisions in the program.

In the rest of this article, I will start with a brief introduction to mutation-testing and formal-verification, followed by an experiment to evaluate a few different settings for applying mutation-testing, culminating in a method that combines mutation-testing with formal specification and verification in the form of program-contracts.

Mutation-testing

A good starting point into the literature and tools on mutation testing is the Mutation Testing Repository. The survey papers [1,2] are a great summary of the nearly 900 publications on the topic since 1970! Mutation testing is based on the observation that the quality of a test-suite is correlated with its ability to detect mutations in the program. A mutation that is not "killed" by a test-suite (a mutation is "killed" when it causes the test-suite to fail) needs to be investigated to assess whether:

  1. The mutation is semantically benign - for example, a mutation that deletes a redundant initialization statement.
  2. The mutation is not benign and can lead to incorrect behaviour, but this is not detected by the test-suite. This requires either the design to be changed or the test-suite to be enhanced to detect this potential faulty behaviour. A sketch of a typical mutant follows this list.
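To make the "kill" notion concrete, here is a minimal sketch of a typical mutant - the code and names are hypothetical, purely for illustration:

```cpp
// Original: withdrawal is allowed when funds cover the amount exactly
// or with a surplus. (Hypothetical code, purely for illustration.)
bool canWithdraw(int balance, int amount) {
    return balance >= amount;
}

// Mutant: ">=" replaced by ">". A withdrawal of the exact balance is
// now wrongly rejected. A test-case that withdraws the full balance
// fails on this mutant and thereby "kills" it; without such a test
// the mutant survives and must be inspected manually.
bool canWithdraw_mutant(int balance, int amount) {
    return balance > amount;
}
```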

Mutation testing encourages an iterative workflow:

  • Author a test-suite for the program
  • Mutate the program and evaluate whether all mutations are "killed"
  • Modify the design to eliminate surviving families of mutations, or add test-cases to "kill" the surviving mutants

By following this iterative process, we end up with a program and a test-suite that capture the functionality of the program with a very high degree of correlation. The following is a brief analysis of the strengths and weaknesses of mutation testing.

Strengths

  • The bar for getting started is very low - very light-weight automation based on executing test-cases suffices. Any project with an automated build process can be adapted to perform mutation-testing.
  • Can scale very easily - execution of the collection of mutants is embarrassingly parallel!
  • Can potentially consider non-functional/performance requirements - using tools like google benchmark to write performance tests (see the sketch after this list).
  • The manual inspection of surviving mutations and the process of adding test-cases also often identifies and clarifies design decisions that would otherwise have been obscured.
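As an illustration of the performance-testing point above, here is a minimal sketch using Google Benchmark; the workload is hypothetical. The idea is that a mutant which degrades an algorithm's complexity would show up as a timing regression even when all functional tests still pass.

```cpp
#include <benchmark/benchmark.h>
#include <algorithm>
#include <string>
#include <vector>

// Time a lookup over a fixed workload; a mutant that silently turns
// a fast path into a slow one shows up here as a timing regression.
static void BM_Lookup(benchmark::State& state) {
    std::vector<std::string> words(1000, "filler");
    words.back() = "needle";
    for (auto _ : state) {
        auto it = std::find(words.begin(), words.end(), "needle");
        benchmark::DoNotOptimize(it);  // prevent the compiler eliding the search
    }
}
BENCHMARK(BM_Lookup);
BENCHMARK_MAIN();
```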

Weaknesses

  • Manual inspection of each mutation is very time-consuming
  • It is difficult to expose the internal details of the design to enable testing
  • A simple mutation process can generate a lot of noise - the number of mutations can be large, and can create a false sense of security through sheer numbers. The fact that thousands of mutants are detected by the test-suite does not mean much; the real measure of quality is the number of mutants that are not detected by the test-suite. Tool support with appropriate/intuitive affordances becomes very important to overcome this drudgery.

Formal Verification

Formal verification refers to the use of mathematical techniques to prove properties of a program. The main distinguishing characteristic of formal verification is its ability to provide guarantees for the absence of defects in a program - which is something that software testing can never achieve! There is a vast body of literature on formal verification (far greater than that for mutation testing), and it spans a very wide range of topics, from verification of communication protocols [7], to automated detection of numerical defects in C programs, to proving that an implementation of a program meets its formal specification [8, 9, 10, 11, 12, 13].

For the purpose of comparison with mutation-testing, the most relevant approach is formal specification and proof of a program. This in turn depends on having a correct and complete specification of the program - we need a mathematically precise statement of what it means for a program to be correct. This is typically done using Hoare triples [14], which use logical statements (Boolean-valued expressions over program variables) to state pre- and post-conditions for a program.
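As a concrete sketch (my own example, purely to illustrate the notation): the Hoare triple {n >= 0} r = isqrt(n) {r*r <= n && (r+1)*(r+1) > n} can be written directly in code as runtime-checked assertions.

```cpp
#include <cassert>

// Hoare triple {n >= 0} r = isqrt(n) {r*r <= n && (r+1)*(r+1) > n}
// written as runtime-checked pre- and post-conditions. Each execution
// checks the triple for one input; formal verification would prove it
// symbolically for all inputs.
int isqrt(int n) {
    assert(n >= 0);                               // pre-condition P
    int r = 0;
    while ((r + 1) * (r + 1) <= n) ++r;           // maintains r*r <= n
    assert(r * r <= n && (r + 1) * (r + 1) > n);  // post-condition Q
    return r;
}
```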

Strengths

  • Analysis results have a very high degree of assurance - the results are mathematically sound, and there is a clear "audit trail" for any conclusion that is reached.
  • The formalism of Hoare logic is quite simple, and the reasoning principles and rules for Hoare-triples are well established. Many of the reasoning patterns for Hoare-triples can be easily expressed within a program in the form of assertions on Boolean expressions (contracts).

Weaknesses

  • Depends on having a good specification. In general, it is very difficult to produce a high-quality formal specification. It is very easy to overlook some aspect of the design and end up with a weak specification that masks defects in the software.
  • Difficult to scale - formal verification tends to be computationally expensive, and is often applied at the level of small units.
  • Formal verification tools are not easily available - such tools depend on a precise encoding of programming-language semantics, which makes them difficult to build.

At first sight, mutation-testing and formal verification seem to live at two extremes of a spectrum. Mutation testing places emphasis on program execution and testing, while formal verification places emphasis on precise specification and symbolic analysis. However, I hope to show by the end of this article that they can be meaningfully and effectively combined in a pragmatic development methodology.

Experimental evaluation

The experimental setup evaluates three different development settings, with the aim of developing insights into different ways of integrating mutation-testing into a development workflow:

  • An ad-hoc unit-test driven development approach
  • A systematic requirements driven approach with requirements based tests
  • Using formal specifications of the design

Coming to tools, I initially wanted to do experiments using publicly available tools. The repository references nearly 80 tools, a few of which operate on C/C++ programs. A Google search identified a few more, and after some preliminary investigation I narrowed my attention to the following:

  • Mutate-cpp - A Python-based implementation with a web-based UI for reviewing results
  • Mull - A Clang-based mutation tool
  • MuCPP - Described in publications [3], but no source available
  • Mut-tools - No source code available, and I was unable to download the binary

I finally chose Mutate-cpp for this article as I felt it was easy to set up, has a very easy-to-use interface, and fits in with a unit-testing setup I am comfortable with. I also preferred its approach of "opting-in" the files to mutate over Mull's approach of "excluding" files! The Python implementation of Mutate-cpp also fits in with the quick-and-dirty experimentation approach I had in mind. I do believe the Clang-based approach of Mull is intrinsically more powerful and can potentially avoid generating a large number of syntactically invalid mutants!

For formal specification, I decided to take a pragmatic approach and used a simple library for expressing contracts in C++ [4]. This allows writing class-invariants, function pre/post-conditions, and even loop-invariant and loop-variant contracts. It uses the C++ assert statement to halt program execution if a contract is violated. In the exercise below I mainly used class-invariants and method pre/post-conditions. Contracts can be used as the basis for formal verification [5,6].
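A minimal sketch of this contract style, written with plain `assert` rather than the library's actual macros (which I have not reproduced here):

```cpp
#include <cassert>

// Contract style sketched with plain assert (the dbc library [4]
// wraps the same idea in macros). The class invariant is checked on
// entry to and exit from every public method; pre/post-conditions
// bracket the method body.
class Counter {
    int count_ = 0;
    void invariant() const { assert(count_ >= 0); }  // class invariant
public:
    void increment() {
        invariant();                // invariant holds on entry
        const int old = count_;     // snapshot for the post-condition
        ++count_;
        assert(count_ == old + 1);  // post-condition of increment
        invariant();                // invariant holds on exit
    }
    int count() const { return count_; }
};
```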

The basis for the experiment is a dictionary class that maintains a collection of words in an efficient trie structure. It has methods to add words, check whether a word exists in the dictionary, and query the size of the dictionary. The complexity of the data-structure is largely contained in the method to add a word. The code for this class is available at practical-mutation-testing (github.com). The repository has three branches corresponding to the three approaches above.
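For orientation, the interface looks roughly like the sketch below - this is my paraphrase to fix ideas; the actual code, including the trie internals, is in the repository:

```cpp
#include <cstddef>
#include <string>

// Interface sketch of the dictionary class (names are my paraphrase
// and may differ from the repository code).
class Dictionary {
public:
    void add(const std::string& word);             // insert a word
    bool contains(const std::string& word) const;  // membership test
    std::size_t size() const;                      // number of words stored
};
```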

Experiment 1 - Mutation testing with ad-hoc unit-test suite

This is the first implementation of the dictionary class, which was developed along with a handful of unit-tests: testDictionary.cpp. These unit-tests were not methodically developed - they were written based on a few of the design decisions taken in the dictionary class. I was able to develop a fair bit of confidence in the implementation based on these unit-tests. However, just looking at the tests at a later point, it is very difficult (even for me) to fathom exactly what is being tested by each test-case, and what motivated that particular test-case to be written!

At this point, I launched the mutate-cpp tool and performed mutation testing on the main source file (Dictionary.cpp). The mutation testing process implemented by mutate-cpp basically modifies (mutates) the source file, builds the project, and runs the tests. Failures in the build or in the tests are taken as evidence of the mutation being detected ("killed"). The remaining mutations are given an "unknown" classification, and the expectation is that each such mutation is inspected to decide whether it is "benign" or whether it indicates a gap in the test-setup.

Experimental results

The basic statistics of this session of mutation-testing are as follows:

  • Mutations generated - 724
  • Mutations detected by the build failing - 346 (47.8%)
  • Mutations detected by tests failing - 186 (25.7%)
  • Undetected but benign mutations - 88 (12.2%)
  • Undetected but harmful mutations - 104 (14.4%)

Manual investigation of the mutations revealed that 104 of them would actually introduce faults that were not detected by the existing unit-tests; only 88 of the undetected mutations were benign. This manual investigation was very laborious - it took approximately 8-10 hours of effort to individually classify all the mutations.

I was actually surprised by the large number of mutations that escaped detection. However, I did not go ahead and just add unit-tests, because the quality of the existing unit-tests was already very poor: the test-cases were unstructured, and it was difficult to figure out the intention of a test-case with just a knowledge of the expected functionality of the dictionary class. Adding test-cases for catching mutations in the source would only introduce more difficult-to-understand test-cases!

Insights from Experiment 1

I did gain some insights from this laborious exercise of analyzing all the undetected mutants. One insight was a more fine-grained classification of mutants in the program:

  • Mutations that are benign because they are semantically equivalent to the original code - for example, replacing an assignment `x = false` with the statement `x &= false`. Both of these statements have the same effect in C++.
  • Mutations that are benign because of the design of the software - for example, a mutation deleting an assignment `x = false` when the design guarantees that `false` is the only possible value of `x` at that point in the code.
  • Mutations that are actually bugs, but are nearly impossible to detect with test-cases because they trigger "undefined behaviour" in C++ - for example, accessing beyond the valid range of an array or string. In many cases, these mutations result in all tests passing; but they persist as insidious bugs in the code which could have serious consequences (many security vulnerabilities in software are the result of exactly this kind of insidious bug!). A sketch of this category follows the list.
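A hedged sketch of the third category (hypothetical code, not from the dictionary class):

```cpp
#include <cstddef>
#include <vector>

// Sketch of a mutant that introduces undefined behaviour. The
// original guard is "i < children.size()"; the mutant relaxes it to
// "<=", so i == children.size() reads one element past the end of
// the vector. All tests may still pass - undefined behaviour need
// not crash - which is why such mutants are so hard to kill with
// test-cases alone.
int childAt(const std::vector<int>& children, std::size_t i) {
    if (i <= children.size()) {   // mutant: was "i < children.size()"
        return children[i];       // out-of-bounds read when i == size()
    }
    return -1;
}
```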

Another insight from this experiment was that executing test-cases is not the only way to detect mutations! Many mutations result in code that triggers compiler warnings. Based on this insight, I changed the build settings to increase the strictness of compilation - since I was using gcc [15], I added the flags "-Wall -Wextra -Werror". In later experiments, this actually resulted in a noticeable reduction in the number of undetected mutants.
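For example (a sketch with hypothetical code), a common mutation replaces `==` with `=`; with "-Wall -Wextra -Werror", gcc rejects the mutated code at build time, so the mutant is killed without running a single test:

```cpp
// Original: if (parent == -1) return true;
// Mutant below: "==" became "=". With -Wall, gcc warns about an
// assignment used as a truth value; with -Werror the build fails,
// so the mutant counts as detected by the build.
bool isRoot(int parent) {
    if (parent = -1) return true;   // mutant - rejected under -Werror
    return false;
}
```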

A related observation was that enabling dynamic instrumentation for detecting memory defects also helped in a few cases. In later experiments I used the gcc flags "-fsanitize=undefined -fsanitize=address" to enable runtime defect checking. There were a few mutants that were detected by these dynamic analysis tools, but as I only enabled them fairly late in the experiment, I do not have much data on their potential efficacy.
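A sketch of the kind of mutant these flags catch (again hypothetical code): the mutated read below is silent in a plain build, but aborts the run under AddressSanitizer, killing the mutant.

```cpp
#include <vector>

// Original: return children[children.size() - 1];
// Mutant: the "- 1" is deleted, producing an out-of-bounds read.
// Compiled with "-fsanitize=address", the read aborts the process
// at runtime, so any test that calls this function kills the mutant.
int lastChild(const std::vector<int>& children) {
    return children[children.size()];   // mutant: was size() - 1
}
```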

Experiment 2 - Mutation testing with additional requirements based tests

The second stage of the experiment aimed to be more methodical in identifying and writing unit-tests. I identified a set of 13 requirements for the dictionary class, and wrote tests targeting each of the identified requirements. The file testDictionaryRequirements.cpp includes the requirements as comments, and the test-cases carry comments tracing them to specific requirements. These test-cases are much more structured and easy to understand. Overall, the quality of the requirements-based test-suite is certainly better than the ad-hoc test-suite from the first experiment. For mutation-testing, I enabled both the requirements-based tests and the ad-hoc unit-tests from the last experiment.

Experimental Results

The basic statistics of this session of mutation-testing are as follows:

  • Mutations generated - 768
  • Mutations detected by the build failing - 386 (50.3%)
  • Mutations detected by tests failing - 211 (27.5%)
  • Undetected but benign mutations - 69 (8.9%)
  • Undetected but harmful mutations - 106 (13.8%)

(Note: The differences in the number of generated mutants compared to the first experiment are because I had made some minor changes to the source code and added an additional utility function. However, the main body of the class was untouched.)

Insights from Experiment 2

As we can see, adding requirements-based tests has not significantly improved the ability of the test-suite to detect mutations. The percentage of undetected harmful mutations dropped from 14.4% to 13.8% - not a very significant change!

Experiment 3 - Mutation testing with requirements-based tests, DBC, and additional top-up tests

The final stage of the experiment was to incorporate formal specifications in the form of contracts. I started the experiment with an initial set of contracts - a few illustrative ones are listed below:

  • Class invariant: Each dictionary node can have at most 26 child nodes (corresponding to the characters a-z)
  • Class invariant: If a dictionary node has a child node, the dictionary must be non-empty
  • Pre/post for the "add" operation: If a word is not in the dictionary when "add" is invoked, the size of the dictionary has to increase by exactly 1 as a post-condition of "add" (sketched in code after this list)

A Digression

For those of you who have not yet used formal specification or contracts, an interesting phenomenon occurs when trying to figure out the contracts for a piece of software - writer's block! This is similar to, but much more severe than, the equivalent phenomenon that all of us are familiar with when writing requirements.

If you are familiar with contracts, you may have come across statements such as "contracts define the obligations of a piece of software" - in my experience, these statements make sense only after you have used contracts in a few scenarios. I have personally found that a good way to get started with contracts is to think of "class-invariants" - these are Boolean expressions relating different variables/attributes in a class, which are always true for an external observer of a class. For example, the dictionary class acts as a node in a prefix-based trie, and a simple invariant is that the dictionary is non-empty if and only if the node holds a non-empty prefix!

Another useful way of thinking of contracts - and one which is relevant to mutation testing - is to think of contracts as "built-in tests". These built-in tests take the form of specific conditions that are checked on every execution of the program. The connection with formal verification becomes clear when we realize that a single execution (unit-test) can be thought of as a single "case" in a proof of the contracts. Unit-tests can be thought of as case-splits in a proof of contracts, and executing a test without violating any contract is a lemma proving that particular case-split!

Experimental Results

I used the results of mutation testing to evaluate, for each harmful mutant, whether I could strengthen the contracts or, in some cases, add test-cases to ensure that specific contracts are exercised. I added 10 test-cases targeting specific mutations (testDictionary_cov.cpp); I named the test-cases with the numeric id of the mutation as generated by the mutate-cpp tool. Note that in most cases, the newly added test-cases actually violated contracts that were already present - the corresponding mutation went undetected simply because the contracts were not exercised completely! There were also situations where I had to strengthen contracts; some examples of the stronger contracts introduced are:

  • Class invariant: For each child node of a dictionary, the first character of the prefix stored in the child has to match the index in the array of the parent node where the child is stored (sketched in code after this list).
  • Pre/post for the "add" operation: If a word is in the dictionary before the add operation, then the add operation should not remove it.

The basic statistics for the first run of mutation testing (before strengthening contracts):

  • Mutations generated - 847
  • Mutations detected by the build failing - 397 (46.9%)
  • Mutations detected by tests failing - 244 (28.8%)
  • Undetected but benign mutations - 138 (16.3%)
  • Undetected but harmful mutations - 70 (8.3%)

The basic statistics of mutation-testing after a round of strengthening contracts and adding unit-tests targeting specific mutations are as follows:

  • Mutations generated - 847
  • Mutations detected by the build failing - 475 (56.1%)
  • Mutations detected by tests failing - 292 (34.5%)
  • Mutations undetected - 80 (9.4%)
  • Undetected but benign mutations - 76 (9.0%)
  • Undetected but harmful mutations - 4 (0.5%)

Insights from Experiment 3

As we can see, there is a significant improvement in the mutation score with the use of contracts. Of the 4 remaining harmful mutations identified in the final round of mutation testing, three trigger "undefined behaviour" in the language - for example, an out-of-range array access. These are very difficult to detect, even using tools like address sanitizers. The final harmful mutation is a genuine one, which needs to be addressed with an additional unit-test!

An interesting observation is that the additional test-cases added in this step are similar to the ones I added as part of experiment 1 above - they are very difficult to understand from a functional standpoint. The difference here is that each test-case can be clearly traced to the specific set of contracts it exercises, and to a corresponding mutation that is detected!

Another insight from this exercise is that the process of investigating whether specific mutations are harmful or benign is actually very helpful in identifying the contracts that need to be strengthened - in many cases, the mutation I was analyzing actually suggested the contract I should strengthen or add! In hindsight this is almost obvious; but it was unexpected at the beginning. As a result of this exercise, I am confident I have identified a set of contracts that characterize the design of the dictionary class with a very high degree of completeness and precision!

Conclusion

This exercise of evaluating mutation testing and formal verification has led to some very interesting insights. The design problem of a trie-based dictionary is at a sweet spot of complexity - it is not a trivial problem, and at the same time it is easy to explain the design decisions with a handful of requirements and contracts. The exercise clearly indicated the need for tool-support for managing mutation-testing and automating much of the drudgery of analyzing mutations:

  • Simple affordances like mass-justification of mutations for a line of code add value.
  • At a deeper level, the ability to identify semantically invariant mutations (which do not change the semantics of an expression/statement) would considerably reduce the effort of analysis.
  • Another important capability is support for changes in the source code - being able to make fixes in the source, map the analysis results of mutation testing performed on the original source onto the modified source, and highlight changes in mutation status would speed up analysis considerably!

The most important take-away of this exercise, however, was the interplay between mutation testing and formal specification of contracts - both of these approaches are expensive, but they support and strengthen each other in a surprisingly nice manner. I feel this combination of mutation testing and formal verification can be a very effective methodology for software that requires the additional rigor of verification that these methods promise to provide individually - two for the price of one!

References

[1] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, Mark Harman, Chapter Six - Mutation Testing Advances: An Analysis and Survey, Editor(s): Atif M. Memon, Advances in Computers, Elsevier, Volume 112, 2019, Pages 275-378.

[2] Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Development of Mutation Testing. IEEE Trans. Softw. Eng. 37, 5 (September 2011), 649-678.

[3] Pedro Delgado-Pérez, Inmaculada Medina-Bulo, Francisco Palomo-Lozano, Antonio García-Domínguez, Juan José Domínguez-Jiménez, Assessment of class mutation operators for C++ with the MuCPP mutation system, Information and Software Technology, Volume 81, 2017, Pages 169-184.

[4] BrightProgress/dbc: Design by Contract C++ library (github.com)

[5] Design by contract - Wikipedia

[6] Java Modeling Language - Wikipedia

[7] https://spinroot.com/spin/whatispin.html

[8] Frama-C - Framework for Modular Analysis of C programs

[9] Polyspace - MATLAB & Simulink (mathworks.com)

[10] AbsInt: Cutting-Edge Tools for Static Analysis of Safety-Critical Software

[11] Astrée Static Analyzer for C and C++ (absint.com)

[12] The Coq Proof Assistant (inria.fr)

[13] Isabelle (proof assistant) - Wikipedia

[14] Hoare logic - Wikipedia

[15] GCC, the GNU Compiler Collection - GNU Project
