The Structure of Tests - The Model
Prahladavaradan Sampath
Development Manager and Product Lead at The MathWorks
In my last post, I identified a few attributes of a test suite that indicate the quality of the test suite: fault detection, fault localization, and efficiency. In this post, I will construct a simple mathematical model for these attributes and explore the behavior of the model with a few examples.
A set-based model
Let's start off by identifying two distinguished sets, S and T, modeling software and the test suite, respectively. The set S represents the collection of software components, and T represents the collection of test components.
By using sets to model software and test components, we are making some assumptions—that there is no structure other than identity that we care about for software and test components. For example, we do not model any notion of dependency between software components. This will help keep the model simple.
We additionally model a "covering relation" between test and software components: \(s \; C \; t\) represents the fact that test \(t\) is a test for (covers) the software component \(s\).
With this basic setup, we are now in a position to mathematically model some of the attributes of a test suite.
Recall that the cost of executing a test suite is a measure of its efficiency. This can be calculated as the sum, over all tests, of the number of software components each test covers:
\[ \sum_{t \in T} |C^{-1}(t)| \]
where \(C^{-1}(t)\) is the set of software components covered by test \(t\).
Viewing the relation \(C\) as a Boolean matrix, this is the number of true entries in the matrix.
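As a minimal sketch of the Boolean-matrix view, the cost can be computed by counting true entries. The matrix below is a made-up toy example (3 software components, 2 tests), and the name `suite_cost` is hypothetical, introduced only for illustration.

```python
# Hypothetical toy example: 3 software components, 2 tests.
# covers[t][s] is True when test t covers software component s.
covers = [
    [True, True, False],   # test 0 covers components 0 and 1
    [False, True, True],   # test 1 covers components 1 and 2
]


def suite_cost(covers):
    """Return the number of True entries in the covering matrix."""
    return sum(sum(row) for row in covers)


print(suite_cost(covers))  # 4
```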
Defect Detection
Suppose a component \(s\) has a defect. A reasonable measure of whether it will be detected is the number of tests that exercise the component. Similarly, if a defect is in the interaction between a set of components (an assembly \(A\)), a reasonable measure of whether it will be detected is the number of tests that exercise the assembly \(A\). Summing this count over all non-empty assemblies gives a defect-detection metric for the whole test suite.
The value of this metric ranges from 0 to \(|T| \times (2^{|S|} - 1)\). Each assembly can be tested by up to \(|T|\) tests, and there are \(2^{|S|} - 1\) assemblies (ignoring the empty assembly). The higher the value of this metric for a test suite, the better the defect-detection capability of the test suite.
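The count described above can be sketched in a few lines of Python. This is only one reading of the metric: it assumes that a test "exercises" an assembly when it covers every component in that assembly. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def nonempty_assemblies(components):
    """Yield every non-empty subset (assembly) of the software components."""
    items = sorted(components)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def defect_detection(components, covered_by_test):
    """Sum, over all non-empty assemblies, the number of tests exercising them.

    covered_by_test maps each test to the set of components it covers.
    Assumption: a test exercises an assembly if it covers all of its components.
    """
    total = 0
    for assembly in nonempty_assemblies(components):
        total += sum(1 for covered in covered_by_test.values() if assembly <= covered)
    return total


# Hypothetical toy data: 3 components, 2 tests.
components = {"a", "b", "c"}
covered_by_test = {"t1": {"a", "b"}, "t2": {"b", "c"}}
print(defect_detection(components, covered_by_test))  # 6
```

With 2 tests and 3 components the upper bound is 2 × (2³ − 1) = 14, reached only when both tests cover all three components.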
Defect Localization
Defect localization is about how well test failures can be used to triangulate defects. If a test \(t\) fails, this indicates a defect in the assembly \(C^{-1}(t)\), the "inverse image" of the test \(t\) in the relation \(C\). And if a collection of tests, say \(M\), fails, the defect should be in the assembly consisting of the intersection of all assemblies that are inverse images, with respect to the relation \(C\), of the tests in \(M\).
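The triangulation step can be illustrated with a small sketch that intersects the inverse images of the failing tests. The function name `suspect_assembly` and the toy data are hypothetical, and the sketch relies on the same simplification as the text: a failing test always points at a defect somewhere in the components it covers.

```python
def suspect_assembly(failed_tests, covered_by_test):
    """Intersect the inverse images of all failed tests.

    covered_by_test maps each test to the set of components it covers.
    """
    failed = list(failed_tests)
    if not failed:
        raise ValueError("at least one failing test is needed to localize a defect")
    suspects = set(covered_by_test[failed[0]])  # copy, so the input is not mutated
    for test in failed[1:]:
        suspects &= covered_by_test[test]
    return suspects


# Hypothetical toy data: 3 components, 3 tests.
covered_by_test = {
    "t1": {"a", "b"},
    "t2": {"b", "c"},
    "t3": {"a", "b", "c"},
}
print(suspect_assembly(["t1", "t2"], covered_by_test))    # {'b'}
print(sorted(suspect_assembly(["t3"], covered_by_test)))  # ['a', 'b', 'c']
```

When only the broad test `t3` fails, the suspect assembly is the whole system; adding failures from narrower tests shrinks it.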
For the moment, let us assume that there is only a single defect in the software, and additionally that tests fail only because of defects in the software! The situation becomes more complex if we have to deal with multiple defects or defective tests. For an assembly \(A\), we define the smallest localizable assembly larger than \(A\) as:
Now, we can measure the capacity of a test suite to localize a defect by considering each assembly in turn (ignoring the empty assembly), and measuring how close the test suite can get to identifying this assembly using test failures:
This metric has a value in the interval \([0, 2^{|S|} - 1]\): each assembly contributes a value in the interval \([0,1]\), and there are \(2^{|S|} - 1\) assemblies. A value of 0 indicates perfect localization, while a large value indicates poor localization. (I have been lazy and given an asymptotic value as the upper bound of the metric; it is not precise.)
Some Examples
Let us evaluate the metrics defined above against a few scenarios to check that they match our intuition. Consider \(m\) source components and a test suite of size \(n\).
System Tests Scenario
Let us consider the situation where all the tests are system tests: the matrix representing the covering relation is full, that is, every test covers every component. In this case:
Unit Tests Scenario
Let us now consider the situation where every test is a unit test: a test for just a single source component. In this case:
Comprehensive Tests Scenario
Finally, let us consider a third case, where we have one test for every possible assembly of software components, i.e. \(n = 2^{m} - 1\). In this case:
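To experiment with the three scenarios, their covering matrices can be built explicitly for small numbers of components. The sketch below is not a substitute for the closed-form results; it only constructs the matrices and counts true entries to get the execution cost. All function names are hypothetical, and the assignment of unit tests to components (test `t` covers component `t % m`) is an assumption made purely for illustration.

```python
from itertools import product


def system_tests(m, n):
    """n tests, each covering all m components (a full matrix)."""
    return [[True] * m for _ in range(n)]


def unit_tests(m, n):
    """n tests, each covering exactly one component.

    Assumption: test t covers component t % m.
    """
    return [[s == t % m for s in range(m)] for t in range(n)]


def comprehensive_tests(m):
    """One test per non-empty assembly, so 2**m - 1 tests in total."""
    return [list(row) for row in product([False, True], repeat=m) if any(row)]


def suite_cost(covers):
    """Return the number of True entries in the covering matrix."""
    return sum(sum(row) for row in covers)


m, n = 3, 4
print(suite_cost(system_tests(m, n)))      # 12
print(suite_cost(unit_tests(m, n)))        # 4
print(suite_cost(comprehensive_tests(m)))  # 12
```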
Next steps
Based on the definitions above, we can now study different kinds of test suites, trying to gain insights into the effectiveness of testing—in particular, I am really curious whether we will be able to justify the popular recommendation of structuring tests as a "test pyramid".
I plan to now run a simulation study based on these definitions and will report on this in my next post.
I'd love to hear your comments on this post. Is there some insight I am missing? Or maybe I have made an error in the calculations? I was quite surprised initially by the poor defect-localization metric calculated for unit tests, but I feel the definition is right, and my initial intuition was wrong! What do you think?