Beyond Quality: Testing for Confidence

A Conceptual and Practical Framework for Quality Measurement

Abstract

Software quality may be viewed differently by various stakeholders, but one thing is common to all: the goal of quality assurance is to minimize the gap between the actual perceived result and the expected result. This article proposes a conceptual and practical risk-analysis and risk-management framework for quality measurement that overcomes limitations inherent to the process of software testing and lifecycle management. The framework, which was implemented successfully at IBM XIV back in 2010, offers the following potential benefits. First, it suggests a simple method for differential test planning by means of assigning weights to tests. Second, it indicates how to calculate the risk, and its coverage by testing activities, for each component, module or feature in a simple and generic fashion. Third, the framework can be an effective alternative to the traditional Risk Based Testing (RBT) method, which is arguably lacking in accuracy, internal consistency and informativeness. Finally, it can enable Management to make informed decisions more efficiently, and hence better cope with the challenges of the Application Lifecycle.

Introduction

The question of how to measure the quality of a product is raised quite often, but there is a more fundamental issue that must be clarified first: What is Quality? Is it the same in the eyes of different stakeholders of the development or manufacturing process? What is Quality for the consumers or the end-users?

Goldsmith (2015) suggests that “the trick to measuring software quality is focusing on real business requirements and established engineering standards”. He describes the problem of defining Quality, with different stakeholders and participants focusing on separate aspects of the process. Software developers are more inclined to stress the more technical facets of the process, such as fitting specifications, efficiency, robustness, compliance with standards, using state-of-the-art technology and scalability.

On the other hand, users and managers are more prone to think software is of high quality when it “does what needs to be done correctly”, performs well, is reliable and consistent, easy to use, has quick and effective support, and is delivered on time and within budget. Goldsmith (ibid) concludes that because of the divergent views among the groups, it may seem to members of both that the others do not care about Quality. However, he asserts that “both views are part of software quality”, but that “the user and manager view is more important” (ibid), because if it is not satisfied then the developers’ view is irrelevant.

One may disagree with this claim, because these views are orthogonal in the sense that the developers’ view refers to the quality of the process, while the users and managers, who cannot automatically be put in the same category, refer to the product manufactured by that process. For instance, managers may give more weight to schedule and budget than to ease of use and performance, factors that surely are of great concern to the users. On the other hand, developers and technical managers may be more concerned about compliance with requirements and specifications, but also with well-established design and coding practices and standards. All groups, of course, are highly concerned with functionality; after all, the system must do what it was designed to do. In this case, however, orthogonality does not imply independence; obviously, process quality does affect product and service quality. For example, a flawed development process may result, besides a defective system, in inefficient defect resolution processes which might affect responsiveness to customer tickets, whether these are defects or change requests.

Despite these variances in the views of different stakeholders about Quality, I think that one thing is clear: Quality Assurance is concerned with minimizing the gap between the actual perceived result and the expected result on any of the relevant dimensions (process, product, budget, service, security etc.) based on some independent criterion, such as compatibility with requirements, specifications, standards, budget and schedule. The smaller this gap is, the more confident we (and the customer) can be that the system will perform as prescribed by the requirements. Hence, Quality Assurance is about measuring this gap to increase our confidence level. In other words, when we perform testing (Quality Control) we actually measure, indirectly, our confidence in the product.

Both the actual and the expected result on any given dimension must be measurable on a standard scale in order that we may have an estimate of the gap between them. In fact, a common practice is to have different metrics which are considered to reflect the state of affairs in a software development project. For instance, Lowe (2017b) lists nine metrics divided into three categories: Agile process metrics (lead time, cycle time, team velocity, open/close rates); production analytics (mean time between failures (MTBF), mean time to recover/repair (MTTR), application crash rate); and security metrics (endpoint incidents, mean time to repair (MTTR)).

Quality Assurance is concerned with minimizing the gap between the actual perceived result and the expected result

However indicative these metrics may be of the state of affairs in the course of the application life cycle, they do not provide a predictive estimate of the Quality of the system as a whole or of any of its components as tested under controlled conditions. That is, they are after-the-fact, ad hoc metrics which are more concerned with the efficiency and effectiveness of the process, but do not tackle the specific risks which the system faces, or the impact associated with them. For example, if the application crash rate is low and the mean time to repair is short (or within the SLA maximum limit), this tells us nothing about the losses associated with such crashes. A one-time crash that results in a customer losing vital data, or that denies service to thousands of users for a few hours, can be devastating for the customer. The fact that the problem was solved in a relatively short time would not add to the Quality side of the balance. It is the mere occurrence of the crash that makes the value function sink into the loss area. Managing the process correctly, hence, does not necessarily result in a high level of Quality. As Lowe (2017a) rightly points out, “Software development is analogous to manufacturing, except that we don't make the same identical widgets over and over. We can't just measure for defects, reject some products, and ship the others. In software development, everything we build is a snowflake: unique, valuable, and incomparable.” That is, the specifics of a system are important in assessing and mitigating the risks associated with design, development and deployment. There is no one-size-fits-all formula.

"In software development, everything we build is a snowflake: unique, valuable, and incomparable" (Lowe, 2017a)


The Challenge of Quality Measurement

The challenge of Quality measurement becomes, consequently, more complex than one may have thought at first. For instance, just keeping track of functional coverage has its limitations; after all, with systems becoming more complex and time-to-market shorter, and with an increasing demand for continuous delivery, there will never be enough resources to assimilate the changes in existing features and the new ones in full. In the end, it all boils down to risk management and prioritization. Following the snowflake analogy proposed by Lowe (ibid.), the question arises: How can we define a uniform method of Quality measurement that would be agnostic to the unique features of any system? That is, a method that can provide Management with the information necessary to assess how confident they can be about the system behavior and, consequently, make an informed decision about an upcoming release, without having to get into the specifics of the system or the details of the testing efforts. In what follows I will propose a new approach and method to implement what is well known as Risk Based Testing (RBT). The new approach introduces a new operational definition of the concept (or hypothetical construct) of Quality which relates to the conceptual definition given above: “Quality Assurance is concerned with minimizing the gap between the actual perceived result and the one that was expected”. Such a gap, if found, is what we call a defect, or by the popular term – a bug.


Risk Management and Defect Probability

Given that such gaps are found in practice anyway, due to undetected defects in a system, a purely risk-based approach would attempt to assess the a priori probability of such faults. Probability is the expected relative frequency of an event, standardized to a scale between zero and one, where zero means there is no chance that the event will occur, 0.5 means that it is expected to occur half of the time, and one means that it will occur every time. To have such an estimate, however, would require actual sampling measurements of the frequency of defects at customer sites. This, of course, is impractical, because this is the very thing we set out to prevent or minimize prior to a release. If so, then, how can we have an estimate of the probability of failure? In what follows I shall argue that even if such a method were found (which I think would not be cost effective), it is not strictly necessary.

Hence the obvious question arises: How can we manage the risk if we cannot have an estimate of its severity? The method proposed here entails a completely different perspective on risk. We know that risk does exist. If we don’t do anything to mitigate it, then we take the whole risk. If we take steps to prevent the dangerous event from occurring, or to reduce its impact in case it becomes an actuality, then we feel it was reduced. However, the risk didn’t necessarily change; it is just that its eventuality and impact were handled. In the terminology of software testing we say that we have covered the functionality, the requirements. But if measures are not taken at the development stage, the risk remains the same. Nevertheless, it is risk that we cover, or mitigate, by defining testing procedures to answer the question “is this faulty?” for every aspect of the system. Because, as mentioned above, it is “the gap between the actual perceived result and the expected result” that defines a defect in the system.


Testing Metrics: Lies, Damn Lies, and (simple) Statistics

Traditionally, the result of a test is binary: it may either pass or fail. Testing metrics thus usually report the percentage of tests performed and the percentages of tests that passed and failed. Grizzaffi and Kono (2019) rightly point out that executives expect straightforward answers that reflect simple statistics (such as the number or percentage of test cases remaining, the expected average number of tests per day, etc.), but “As those with experience have learned, however, these types of answers don't always provide the appropriate information.” They go on to explain that these numbers may act as pitfalls, because they are easy to misconstrue, don't tell the whole story, are one-dimensional and reflect outdated information.


Balancing the Wheels: Taking a "Go-No Go" Decision

To illustrate the problem with simple statistics, imagine checking a car’s tires. You check them one by one and report, accordingly, that 0%, 25%, 50%, 75% or 100% of them pass or fail. Obviously, all four tires must be in pass status for the car to be driven safely. Suppose that one tire is faulty. With simple statistics, you would state that the quality level is 100% - 25% = 75%. However, this is absurd, since you cannot really drive the car in such a situation. The method proposed here is different: a test that fails cancels out a test that passed, such that in the case of the car, if a single tire is found defective, the overall quality would be 50%, not 75%. Why is the car example relevant? Because a car’s wheels must be balanced, so a defective tire really does cancel out the fact that its twin was found to be in good condition. This is also true of software systems: the fact that passed tests accumulate to give a simple number or percentage of “good” quality does not compensate for the negative impact of the tests that failed.
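Below is a minimal sketch of this “fail cancels pass” scoring rule, assuming the simplest case of equally weighted, binary test results (the function name and values are illustrative, not part of any existing tool):

```python
def net_quality(results):
    """Return the net quality level as a fraction in [-1.0, 1.0].

    Each passed test contributes +1 and each failed test -1, so a failure
    cancels out one success instead of merely lowering the pass rate.
    """
    if not results:
        return 0.0
    score = sum(1 if passed else -1 for passed in results)
    return score / len(results)

# The four-tire example: one faulty tire.
tires = [True, True, True, False]
print(f"Simple pass rate:  {sum(tires) / len(tires):.0%}")   # 75%
print(f"Net quality level: {net_quality(tires):.0%}")        # 50%
```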

The fact that passed tests accumulate to give a simple number or percentage of “good” quality does not compensate for the negative impact of the failed tests

Moreover, as studies such as those of Nobel laureate Daniel Kahneman and his late partner Amos Tversky demonstrated, the weight of a negative experience (the subjective loss) is greater than the weight of a positive experience (the subjective gain). That is, the value function is steeper in the negative area than in the positive area. Furthermore, satisfaction tends to reach an asymptotic level. Hence you cannot tell the customer that “you are aware” that such and such features are broken and “you are making every effort” to provide a solution as soon as possible, and expect the customer to feel happy because 95%, or even 99%, of the system works as expected. Of course, this depends on the criticality of the business process, the frequency of usage, and the severity of the defect. These three parameters, of course, relate to the risk factors mentioned below.
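For reference, this asymmetry can be written as the Tversky-Kahneman value function; the sketch below uses their commonly cited parameter estimates, which are an assumption on my part and not taken from this article:

```latex
v(x) =
\begin{cases}
  x^{\alpha},              & x \ge 0 \\
  -\lambda\,(-x)^{\alpha}, & x < 0
\end{cases}
\qquad \alpha \approx 0.88,\ \lambda \approx 2.25
```

Since the loss-aversion coefficient $\lambda$ is greater than 1, a failure of a given magnitude weighs more than a success of the same magnitude, which is exactly why failed tests should not simply be averaged away by passed ones.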

The weight of the negative experience (the subjective loss) is greater than the weight of the positive experience (the subjective gain)

Within this framework, we seek to supply Management with business-oriented information about the system quality. That is, quantitative data which will enable the executive team to assess the situation at one glance, without the need to delve into the raw data of the testing program, and hence to make a go-no go decision with respect to the upcoming system version release, or take the appropriate steps in order to amend the situation. Since testing is a procedure to mitigate the risk, the data that is important to Management should indicate the risk level to which the company is exposed, if the system is released as is. 

The data that is important to Management should indicate the risk level to which the company is exposed if the system is released as is


Calculating the Risk

Now, the whole risk, or the Total Risk (TR), is basically an abstract concept, just as Quality is. We cannot easily quantify it a priori for a software system, and it is virtually impossible (or at least impractical) to derive it by comparison to other systems. In order to make it workable, we need to estimate the contribution of each system module, feature or area (or any relevant aspect of the system, such as security) to the Total Risk. Two questions arise, then:

1. How do we calculate the Total Risk?

2. How do we calculate the contribution of each relevant aspect of the System to the Total Risk?

The key to solving this seemingly chicken-and-egg problem is a surprisingly simple method that enables us to calculate the risk for each component and then its percentage of the Total Risk. The method goes as follows:

The following formula defines the basic calculus for the Total Risk:

$$TR = \sum_{i=1}^{n} R_i = 100\%$$

                                                          Formula 1. Total Risk

Where:

  • $TR$ is the Total Risk, which is a constant of the value of 100%.
  • $R_i$ is the percentage of the Total Risk attributed to the i-th System Component [1].
  • $n$ is the number of components, areas or topics which contribute to the Total Risk.

Note [1]: By Component is meant a module, feature or area which is pertinent to the system functionality.

That is, the total risk is always equal to 100%, and it is the sum of all the components’ risk percentage of the Total Risk. You may ask at this stage – well, but how do we know how much each component contributes to the Total Risk?

The following formula defines the basic calculus for the Component Raw Risk (CRR) of a component:

$$CRR_i = \prod_{j=1}^{k} F_{ij}$$

                                            Formula 2. Feature/Component Risk

Where:

  • $CRR_i$ is the i-th component’s raw risk level, and
  • $F_{ij}$ are the estimates of that component’s k risk factors (for example, on a scale of 1 to 10, where 1 stands for “not relevant” or “not applicable”).

That is, a component’s risk level is the product of all its relevant risk factors.

Factors are, for example, the estimated complexity, the expected frequency of use by end users, the expected number of users, and the estimated impact of a defect in that component, module or feature. For example, in a storage system data integrity is crucial, so its estimated impact should be high. You can have your own, custom risk factors, according to the characteristics of your system (of course, if it’s a back end component, the number of users may be replaced by number of transactions, data volume, etc.).

After providing estimates for the risk factors (if a factor is not relevant to any of the components/features/areas, then simply assign it the value of 1) and calculating the raw risk level for each Component or Feature, the next step is to take the sum total over all components, which gives us the Total Raw Risk (TRR). Then, since we have defined the Total Risk as 100%, we simply divide each Component Raw Risk (CRR) level by the Total Raw Risk and multiply by 100. As a result of this simple procedure, we obtain a measure of relative, or weighted, risk levels for all the components. Though we will not be able to ascertain the probability of a defect being found in each component or area, we will be able to grasp which components or areas need more attention, namely those whose contribution to the Total Risk is significantly higher than that of others. A short sketch of this calculation follows below.
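The following is a minimal sketch of Formulas 1 and 2 and the standardization step, assuming three hypothetical components and three risk factors (complexity, frequency of use, impact) estimated on a 1 to 10 scale; the module names and factor values are illustrative and not taken from the tables below:

```python
# Hypothetical risk factor estimates per component (1-10 scale).
risk_factors = {
    "Module A": {"complexity": 8, "frequency": 9, "impact": 10},
    "Module B": {"complexity": 5, "frequency": 7, "impact": 6},
    "Module C": {"complexity": 3, "frequency": 4, "impact": 2},
}

def component_raw_risk(factors):
    """Formula 2: the Component Raw Risk (CRR) is the product of its risk factors."""
    crr = 1
    for value in factors.values():
        crr *= value
    return crr

crr = {name: component_raw_risk(f) for name, f in risk_factors.items()}

# The Total Raw Risk (TRR) is the sum of all component raw risks.
trr = sum(crr.values())

# Formula 1: standardized (weighted) risk per component, always summing to 100%.
standard_risk = {name: 100.0 * value / trr for name, value in crr.items()}

for name in risk_factors:
    print(f"{name}: CRR = {crr[name]:>4}, Standard Risk = {standard_risk[name]:5.1f}%")
print(f"Total Risk: {sum(standard_risk.values()):.0f}%")
```

With these illustrative numbers, Module A would carry roughly 75% of the Total Risk, Module B about 22% and Module C under 3%, so testing attention would naturally be drawn to Module A first.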

After establishing the a priori weighted risk levels for each component, we shall move on to the next step: how to estimate the actual level of risk that remains uncovered. This must be an ongoing, daily procedure that will assist Management in their decision-making process. In previous sections, some hidden truths about testing metrics and their limitations were already discussed. Now we shall delve into how Management can assess the exposure to risk based upon the test results.


Test Coverage and Risk Management

In a previous section I described how to calculate the estimated raw risk and the contribution of each component, feature or aspect of the system to the total risk. In this section we shall describe the method to calculate the proportion of the risk that was covered by the test results, both for each component and for the whole system. Please notice that coverage, in this framework, refers to the risk, not the functionality as such.

Coverage, in this framework, refers to the risk, not the functionality

At the outset, it must be stressed that the method assumes that the tests cover all system aspects in full. That is, ideally, the planned tests did not leave out any feature of the system that contributes to the total risk, and they address all the relevant questions about all the features. Basically, any test is a procedure designed to provide an answer to a specific question regarding some feature. The questions would be, of course, whether the feature behaves correctly, is displayed correctly, etc., and the answer is always yes if the test passes, and no if it fails. The question remains, how to digest these raw results to get an accurate picture of how well the system functions.


Test Results and Test Weights

In order to provide a practical answer to the question above, within this framework a test that passes is given a grade of 1, while a test that fails is given -1, as suggested above. Assuming all tests have equal importance (or weight), each failed test then cancels out one that passed. Nevertheless, we may wish to assign tests different weights to stress their importance in our decision-making process. For instance, a test that seeks to ensure that no data loss is caused by some action on the GUI might well have a crucial role in our assessment of the system quality, as compared to some other, trivial feature. Hence, such a test would be assigned a greater weight than other tests.

The Total Risk Coverage Grade (TRCG) for a component is the net percentage of passed tests, multiplied by the component’s contribution to the risk. That is, if our net result for a component is 50% and its contribution to the total risk is 10%, then the remaining Uncovered Risk is 50% for that component and 5% for the whole system. Again, assuming equal weights, a net result of 50% for a component would derive from a 75% pass rate minus a 25% fail rate. Another peculiar consequence of this method of calculation is that, in theory, the Net Confidence Level can fall below zero! For example, if the pass rate is 25% and the fail rate is 75%, this means that we should be much more inclined not to be confident about the system than the opposite. So the Net Confidence Level in such a case would fall to -50%. A sketch of this calculation is given below.
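Below is a minimal sketch of this calculation under the assumptions above: each test result counts as +1 (pass) or -1 (fail), optionally weighted, and the component’s net result scales its share of the Total Risk. The figures are the ones from the example in the text, while the function and parameter names are illustrative:

```python
def net_result(results, weights=None):
    """Net pass percentage for a component: failed tests cancel out passed ones."""
    if weights is None:
        weights = [1.0] * len(results)
    total_weight = sum(weights)
    score = sum(w if passed else -w for passed, w in zip(results, weights))
    return 100.0 * score / total_weight

def risk_coverage(results, standard_risk, weights=None):
    """Return (covered, uncovered) risk for a component, in % of the Total Risk."""
    net = net_result(results, weights)          # e.g. 50% for a 75% pass / 25% fail split
    covered = standard_risk * net / 100.0
    return covered, standard_risk - covered

# A component contributing 10% to the Total Risk, with 3 of 4 equally weighted tests passing:
covered, uncovered = risk_coverage([True, True, True, False], standard_risk=10.0)
print(f"Covered: {covered:.1f}% of Total Risk, Uncovered: {uncovered:.1f}% of Total Risk")  # 5.0% / 5.0%
```

Note that with a 25% pass rate and a 75% fail rate, `net_result` indeed returns -50%, so the covered risk turns negative, reflecting the below-zero Net Confidence Level mentioned above.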

As a result of such calculations, which can be carried out online with the help of a dashboard, Management can get updated data about the risk level of releasing the system at any time. Now, by setting a risk threshold for each aspect and for the whole of the system, executives can make informed decisions with regard to the system quality. For instance, if a threshold is set at net 90% risk coverage, it means that Management is prepared to face a risk of 5% that defects may exist (recall that failed tests cancel out passed tests, so a 95% success rate is equal to a net 90% success rate).

You may rightly ask at this point: How does this prevent crucial defects from leaking to the customer? Well, this should be taken care of by setting the test weights properly. If specific tests must pass in any case, then assigning them heavy weights would drop the risk coverage and hence not allow it to meet the threshold (but see the note below about the constraints on setting weights). Provided that the tests are well designed (which I think is not too much to ask...), this might be an effective strategy; a brief sketch of the effect follows below.
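As an illustration of this weighting strategy, and continuing the assumptions of the previous sketch, suppose a single heavily weighted “must pass” test (say, a no-data-loss check) fails while 19 of 20 tests pass; the numbers and the 10x weight are purely hypothetical:

```python
# (weight, passed) pairs: 19 ordinary tests pass, one critical test with 10x weight fails.
weighted_results = [(1.0, True)] * 19 + [(10.0, False)]

total_weight = sum(w for w, _ in weighted_results)
score = sum(w if passed else -w for w, passed in weighted_results)
net_coverage = 100.0 * score / total_weight

print(f"Equal weights would give a net coverage of {100.0 * (19 - 1) / 20:.0f}%")  # 90%
print(f"Weighted net coverage: {net_coverage:.1f}%")                               # ~31%
print("Release decision:", "GO" if net_coverage >= 90.0 else "NO GO")
```

With equal weights the release would just meet a 90% threshold despite the critical failure; with the heavier weight, the same results pull the net coverage down to about 31% and block the release.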

A practical example follows below to illustrate how to implement the method and what such a dashboard could look like, conceptually. Table 1 shows three features/modules with their corresponding raw risk coefficients, calculated per Formula 2, and the standardized risk for each, per Formula 1. The risk factors taken into account in this example were complexity, frequency of use and impact of a functional failure. The table also shows the Uncovered Risk for each component/module at each testing cycle, which is precisely what we seek to disclose to the Management Team for their decision-making process. The Net Confidence Level is the difference between the Standard Risk and the Uncovered Risk at the current testing cycle, which basically expresses how much of the risk was actually covered. In the example below, the Uncovered Risk is 80% for the first two cycles and is then gradually reduced to 40% (cycle 3), 27% (cycle 4) and 0% (cycle 5).

                            Table 1. Equally Weighted Test Total Results Sheet (Dashboard)

Table 2 simply shows the corresponding detailed test results data sheet. It also shows that a readiness parameter can be used to cancel the influence of specific tests under extremely rare conditions, by setting its value to zero (0).

                            Table 2. Equally Weighted Test Detailed Results Data Sheet

The above example shows the simple case of equally weighted tests. In Tables 3 and 4 you can see the effect of assigning unequal weights to different tests.

Table 3 shows the same three features/modules with their corresponding risk coefficients and standardized risks, the risk factors being the same as in Table 1 above. Notice that the Net Confidence Level here differs from Table 1 merely because the weight of a single test (Test A2), which covers Module A, was changed. In the example below, the Uncovered Risk is 88% for the first two cycles (all values here are rounded) and is then gradually reduced to 37% (cycle 3), 25% (cycle 4) and 0% (cycle 5).


                            Table 3. Unequally Weighted Test Total Results Sheet (Dashboard)

                            Table 4. Unequally Weighted Test Detailed Results Data Sheet

Note [2]: It is worth noting that the weights should meet two constraints:

  1. The Uncovered Risk for each component/feature must be less than or equal to the Standard Risk calculated prior to the execution of the tests.
  2. The Total Uncovered Risk for the whole system must be less than or equal to the Total Standard Risk calculated prior to the execution of the tests.

This is simply because within this risk management framework the Uncovered Risk cannot, by definition, exceed 100% which is our reference point for the Total Standard Risk (TSR).
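A minimal sketch of checking these two constraints is shown below, assuming the per-component figures are expressed as percentages of the Total Risk; the function and field names are illustrative and not taken from an existing tool:

```python
def weights_are_valid(components):
    """components: list of dicts with 'standard_risk' and 'uncovered_risk' (both in % of Total Risk)."""
    # Constraint 1: per component, the Uncovered Risk may not exceed the Standard Risk.
    per_component_ok = all(c["uncovered_risk"] <= c["standard_risk"] for c in components)
    # Constraint 2: the Total Uncovered Risk may not exceed the Total Standard Risk.
    total_ok = sum(c["uncovered_risk"] for c in components) <= sum(c["standard_risk"] for c in components)
    return per_component_ok and total_ok

example = [
    {"name": "Module A", "standard_risk": 75.5, "uncovered_risk": 20.0},
    {"name": "Module B", "standard_risk": 22.0, "uncovered_risk": 22.0},
    {"name": "Module C", "standard_risk": 2.5,  "uncovered_risk": 0.0},
]
print("Weights valid:", weights_are_valid(example))   # True
```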


Summary

This article was intended to prescribe a general framework for implementing a true risk-based approach to quality measurement and control of software products. The introduction provided an analysis of the concept of Quality from the perspective of different stakeholders. From this analysis, a distinction was drawn between the quality of the process and the quality of the product of that process, which has an effect on the metrics used to measure Quality. The limitations of the metrics typically used in the industry were reviewed, drawing on other sources (Goldsmith, 2015; Grizzaffi & Kono, 2019; Lowe, 2017a), and the case was made for the need for a new approach to enhance Management's capacity to get a clear and concise picture of the software product at hand, and hence make informed decisions about the Application Lifecycle without having to delve into the technicalities of the testing process or the particular defects found in the product.

The reader may rightly ask whether the method has been tested in practice. Indeed it has, though not extensively. On one occasion I had the opportunity to implement the method successfully for a customer (IBM XIV in Tel Aviv) back in 2010. It allowed the team to cut the estimated time for a release by half, from six months to merely three. For lack of support for the method in the test management tool, and due to the shortage of time, I ran the whole operation with... a single Excel file, carefully designed to meet the framework requirements.

Nevertheless, I acknowledge that the proposed method does have its shortcomings. These include:

  1. It is founded on several basic underlying assumptions that must be met, namely: all components, modules and features must be covered, and the tests must cover all their functions, in the sense that all the risk factors are addressed.
  2. At the practical level, it requires the development of new features in existing ALM/Test Management tools, as well as training staff in the skills required to provide good, reliable estimates of the different risk factors and of the differential weights for tests.

Despite these shortcomings, I believe they can be overcome, as shown in the IBM XIV "case study" mentioned above. The new approach proposed here brings along the following advantages:

  1. It suggests a quite simple method for differential test planning by means of assigning weights to tests according to their estimated impact on customer satisfaction.
  2. The proposed method also indicates how to calculate risk for each component, module or feature in a simple fashion by redefining the concept of risk from probability theory to a more workable and practical conceptual framework.
  3. The proposed framework is suggested as an alternative to the traditional RBT method, which yields a way to prioritize tasks but lacks accuracy and internal consistency, and does not provide an easy way to compile the results into a concise report. Such a report, with real, relevant and focused information rather than shallow data, can enable Management to make informed decisions in a more efficient way, and hence better cope with the challenges of the Application Lifecycle.

In summary, the benefits of the proposed method and underlying framework seem to overcome its shortcomings. Nevertheless, further research and development of tools and actual implementation in real ALM contexts are required to ensure its robustness and effectiveness.

On a final note, I'd appreciate it if you shared your own thoughts on this important topic in the comments below, and I thank you for bearing with me through this long article.


References

Goldsmith, R.F. (2015), An expert suggests how to measure software quality

Grizzaffi, P. & Kono, M. (2019), Need a testing metric? Put points on your test cases

Lowe, S.A. (2017a), Why Metrics don't Matter in Software Development (unless you pair them with business goals)

Lowe, S.A. (2017b), 9 metrics that can make a difference to today’s software development teams


Acronyms

  • ALM: Application Lifecycle Management
  • CRR: Component Raw Risk
  • MTBF: Mean time between failures
  • MTTR: Mean time to recover/repair
  • SLA: Service Level Agreement
  • TR: Total Risk
  • TRCG: Total Risk Coverage Grade
  • TRR: Total Raw Risk
  • TSR: Total Standard Risk