Green Light or Red Alert? How do your performance test targets look?
Writing test requirements can be a fairly easy process. We examine the business requirements and write out some form of criteria to check whether each one has been met. The test itself is usually a black-or-white decision: it either meets the requirement or it does not. Pass or fail. Simple! Well, maybe not quite that simple (okay, it's a lot more complicated than that), but in general, that's the process.
Performance testing is no different. When things are done correctly, there should be specific, measurable objectives for each of the performance requirements. One of these might start off looking something like:
- Response time for transaction x must not exceed two seconds.
That might look perfectly reasonable to an end user asked to write some performance requirements. End users generally think in terms of single-user response times; as a user, that's what they would expect from the system. From a performance perspective, though, there are some pretty major gaps here. Performance testers think in terms of multi-user response times under differing loads. So, with a bit of coaching, we might get them to rework their requirement into something like:
- At 100% user and transaction load, transaction x response time must not exceed 2.00 seconds.
That's better, but there's still room for improvement. What does that two-second response actually mean? Does it apply to all responses? Is it an average response time, or the worst case out of ten? So (let's not get into debating averages versus percentiles here), let's take this to the next level and add some more performance-related criteria:
- At 100% user and transaction load, transaction x response times must not exceed 2.00 seconds at the 90th percentile, with not more than a 1% failure rate.
Now we're finally getting somewhere. This is specific and measurable, and it replaces ambiguity with quantifiable performance test criteria. Admittedly there are a few more conditions we could throw in, but by now you should be able to see that if we tested the system under load, we could tell whether it passes the old binary pass/fail criteria, and the requirement has far more relevance to a performance test than the original did.
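In practice, that criterion boils down to two numbers computed over the raw test results: the 90th-percentile response time and the failure rate. Here's a minimal sketch in Python of how it might be checked; the function name, sample data, and default thresholds are illustrative assumptions, not output from any particular load testing tool.

```python
from statistics import quantiles

def meets_criterion(response_times_s, error_count, total_requests,
                    target_s=2.00, percentile=90, max_failure_rate=0.01):
    """Return True if both the percentile response time and the failure
    rate satisfy the stated performance requirement."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points,
    # so index [percentile - 1] is the 90th percentile by default.
    p = quantiles(response_times_s, n=100)[percentile - 1]
    failure_rate = error_count / total_requests
    return p <= target_s and failure_rate <= max_failure_rate

# Hypothetical sample: timings in seconds, 4 errors out of 1,000 requests.
timings = [0.8, 1.1, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.4]
print(meets_criterion(timings, error_count=4, total_requests=1000))
# -> False here: the 90th percentile works out at roughly 2.35 seconds.
```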
Unfortunately, though, in the world of performance testing, things are rarely as simple as black or white, pass or fail. Just like in old-school photography, it's the myriad shades of grey that complete the picture and tell the whole story. A picture that is pure black and white shows a lot less detail than one with shades of grey. It's how we capture and interpret those in-between shades that makes the difference.
I find the best and easiest way to interpret these in-between conditions is a traffic light system (and this is where my black-and-white analogy fails), where green is good, amber means proceed with caution, and red is an obvious failure. You can use however many shades you like, but the concept stays the same: define simple boundary conditions around where the colour changes.
Using the previous example, where the target transaction response time was 2.00 seconds, I might ask the business to rework their requirement along these lines: if 9 out of 10 responses meet the two-second target, that's good (Green); if only 8 out of 10 achieve it, that's okay but not great (Amber); and anything less than 8 out of 10 meeting the target, or an error rate greater than 1%, is definitely bad (Red).
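As a sketch, those boundary conditions fit in a few lines of Python. The thresholds below simply restate the 9-out-of-10 / 8-out-of-10 / 1% boundaries described above, and the function name is my own invention.

```python
def rag_status(within_target_pct, error_rate_pct,
               green_at=90.0, amber_at=80.0, max_error_pct=1.0):
    """Map a transaction's measured results onto a Green/Amber/Red status."""
    # Too many errors, or fewer than 8 out of 10 within target: definitely bad.
    if error_rate_pct > max_error_pct or within_target_pct < amber_at:
        return "Red"
    # 9 out of 10 or better within target: good.
    if within_target_pct >= green_at:
        return "Green"
    # Otherwise at least 8 out of 10, but short of 9: okay, not great.
    return "Amber"

print(rag_status(within_target_pct=92.0, error_rate_pct=0.4))  # Green
print(rag_status(within_target_pct=85.0, error_rate_pct=0.4))  # Amber
print(rag_status(within_target_pct=85.0, error_rate_pct=2.5))  # Red
```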
So the original requirement hasn't changed:
- At 100% user and transaction load, transaction x response times must not exceed 2.00 seconds at the 90th percentile, with not more than a 1% failure rate.
The requirement (90th percentile not exceeding 2.00 seconds) is still the same, but how it is reported and interpreted will differ. I usually map this (and all the other performance test criteria) in my performance test plan, something like this:
- Green: at least 9 out of 10 responses for transaction x within 2.00 seconds, and an error rate of 1% or less.
- Amber: at least 8 out of 10 (but fewer than 9 out of 10) responses within 2.00 seconds, and an error rate of 1% or less.
- Red: fewer than 8 out of 10 responses within 2.00 seconds, or an error rate greater than 1%.
I find that using this kind of Green/Amber/Red reporting against measurable performance test criteria helps enormously when results are presented back. Simply reporting a list of passed/failed criteria often adds little value on projects where performance test results are subject to interpretation.
For example, a list of results presented as just pass or fail is a lot harder to interpret than the same data presented in traffic light form, even though both views report exactly the same result set.
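To make that concrete, here's a purely illustrative sketch: the transaction names and figures are invented, but it shows how the same result set reads in the two formats.

```python
def rag_status(within_target_pct, error_rate_pct):
    """Same Green/Amber/Red boundaries as the earlier sketch."""
    if error_rate_pct > 1.0 or within_target_pct < 80.0:
        return "Red"
    return "Green" if within_target_pct >= 90.0 else "Amber"

# Invented results: (transaction, % of responses within target, error rate %).
results = [
    ("Login",       91.5, 0.2),
    ("Search",      88.0, 0.6),
    ("Add to cart", 84.0, 0.9),
    ("Checkout",    73.0, 2.1),
]

print("Pass/fail view:")
for name, within, errors in results:
    verdict = "Pass" if within >= 90.0 and errors <= 1.0 else "Fail"
    print(f"  {name:<12} {verdict}")

print("\nTraffic light view:")
for name, within, errors in results:
    print(f"  {name:<12} {rag_status(within, errors)}")
```

In the pass/fail view, three of the four transactions simply fail; in the traffic light view, two of those show up as Amber and only Checkout is genuinely Red.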
Using the traffic light approach really helps focus attention on results that are definitely bad and cannot be interpreted as anything but. Amber areas with occasional red highlights paint a very different picture from what might otherwise appear to be a sea of red. I find it much easier to home in on true performance issues without the distraction of things that are borderline and (usually argued to be) acceptable.
Let me know what you think. Do you use the traffic light system or something else? What works best for you?