Load Testing: A Walk on the Plank
Paul McLean - based on a drawing by Howard Pyle, 1921


One of the most difficult aspects of load test design is determining the amount of time that a Load Test needs to run.  What is the difference between a two-hour and a 12-hour test run in terms of value to the project?

Some managers get a little nervous as 'go-live' approaches, and they dream of a successful load test result.  If the Non Functional Requirements state that the system should process 'peak hourly load' for one hour, then there is a tendency to limit the duration of the load test to that single hour.  The rationale is simple but flawed.  The defective logic generally includes an irrational fear that a failure will occur in the extra time, over and above the minimal 'one hour requirement'.  Such a failure would cast a dark shadow on what could otherwise have been a successful result, so the motivation for minimal testing generally comes from a fear that the system will fail.  I tend to think of this as a 'Walking the Plank' mentality.  Wikipedia has the following quote about 'Walking the Plank':

Given the occasions on which it was known to have been employed, it appears more likely to have been an elaborate and unusual form of sadistic entertainment rather than a regular method of murdering unwanted captives.

Most Load Testing professionals have, at some point in their career, been viewed by key project stakeholders as using their skills as an 'elaborate and unusual form of sadistic entertainment' against an innocent Application System that would work just fine if no one stressed it out.

There is no doubt that, from the perspective of the unwanted captive, being forced to 'walk the plank' would make each step terrifying.  The victim's whole mental capacity would be focused on each successive step, fearing that it would be the final step into Davy Jones' Locker.  It seems that some key project staff take on a type of 'Vicarious Trauma' through their empathic engagement with the Application Under Test, and fear each additional hour of testing as if it were another step along the plank towards the ultimate abyss of an untimely Sev-1 defect.

Each additional hour that a load test runs is an hour that could lead to the demise of the application.  So how many hours should the test be configured to run?  What is the technical basis for justifying the selection of an appropriate run duration?

The graph in the header image shows analysis from a test processing transactions at a rate of 100 transactions per second for six hours, before abruptly failing to zero transactions per second.  If the test duration had been just five hours, the failure would not have been observed.  What is the difference between a five hour test run with no catastrophic failure and a six hour test run with a failure?  I propose that projects where ongoing reliability needs to be well understood should include multiple long running tests.  Where possible, each night and most weekend opportunities should be used to establish the number of hours that the system under test can run with no failure.

A case study example:
Let's consider the situation of testing a major upgrade to an application that has been in production for a year; the same principles apply to any system.

The non-functional requirements for the upgraded system have not changed since the original launch 12 months earlier.  Full Peak Hourly Load is considered to be 360,000 transactions in an hour, which equates to 100 TPS.  For the purpose of testing, Peak Hourly Load should average between 95 and 105 TPS for the entire hour, but can vary between 90 TPS and 110 TPS over the hour at 1 minute granularity.  Over the first 12 months of operation, there have been 100 'Full Peak Hours'.  The number of Full Peak Hour periods is expected to double over the coming months, so Management want to be confident that the upgraded application is no less resilient than the current version, but hope testing can show that it is more resilient.
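As a concrete illustration of that requirement, here is a minimal Python sketch that checks whether a recorded hour of load qualifies as a 'Full Peak Hour'.  The function name and structure simply restate the definition above; they are not part of any particular tool.

```python
# Minimal sketch: does a recorded hour of per-minute throughput qualify as a
# 'Full Peak Hour' under the definition above? Names are illustrative only.

def is_full_peak_hour(tps_per_minute,
                      avg_low=95, avg_high=105,
                      minute_low=90, minute_high=110):
    """tps_per_minute: 60 samples, the average TPS achieved in each minute."""
    if len(tps_per_minute) != 60:
        return False
    hourly_avg = sum(tps_per_minute) / len(tps_per_minute)
    # The hourly average must sit in the 95 - 105 TPS band...
    if not (avg_low <= hourly_avg <= avg_high):
        return False
    # ...and every 1 minute sample must stay within 90 - 110 TPS.
    return all(minute_low <= tps <= minute_high for tps in tps_per_minute)

# 360,000 transactions in an hour equates to 100 TPS on average.
assert 360_000 / 3600 == 100.0
```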

Over the past 12 months there were four major application failures.  Three occurred while the application was processing Full Peak Hourly Load and one occurred when processing 70% - 90% of Peak Hourly Load.  Significant Root Cause Analysis work was undertaken and the project team are confident that they have solved the main problem, but acknowledge that it may still occur.  The problem was difficult to solve because of the interplay between real time transaction processing, the background workflow tasks and associated processing arising from recently processed workload, the Enterprise Service Bus and the Database that supports the ESB, the associated authentication and authorization, and the audit logging process.  The RCA/Diagnostics team identified and resolved a race condition that only occurred when the system was under heavy load following a period of significant load.

The breakdown of load by hour for the previous year was as follows.

  • 100 hours at 90% - 110% of Peak Hour Load
  • 400 hours at 70% - 90%  of Peak Hour Load
  • 1,000 hours at 50% - 70% of Peak Hour Load 
  • 2,000 hours at 30% - 50% of Peak Hour Load 


This suggests that each hour at 90%-110% of peak load had a 3% chance of encountering an application failure, but less than a tenth of that chance of failure at the lower level of 70% - 90% of Peak Load.  However, almost every hour with 90%-110% of Peak Hourly Load was preceded by an hour of 70%-90% of peak load.  This means that we must run our high rate test (90%-110% of Peak Hour Load) immediately after an hour with load at 70%-90%.
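The quoted rates are simple arithmetic on the case-study figures above; a short sketch makes the calculation explicit (the numbers are just those from the breakdown, nothing new).

```python
# Empirical per-hour failure rate for each load band, using the case-study
# figures above: 4 failures over 3,500 hours of operation.

hours_per_band = {"90-110%": 100, "70-90%": 400, "50-70%": 1000, "30-50%": 2000}
failures_per_band = {"90-110%": 3, "70-90%": 1, "50-70%": 0, "30-50%": 0}

for band, hours in hours_per_band.items():
    rate = failures_per_band[band] / hours
    print(f"{band} of Peak Hour Load: {rate:.2%} chance of failure per hour")

# 90-110%: 3.00% per hour; 70-90%: 0.25% per hour (under a tenth of 3%).
```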

It was agreed that while a Load Test should be constructed to run one hour at 70%-90% of Peak Load immediately followed by an hour at 90%-110% of Peak Load, the worst case scenario is really multiple consecutive hours at 90%-110% of Peak Hourly Load, especially given the expectation that the number of peak hours the system needs to support will double.  So the load test was configured to 'warm up' with one hour at full load, and each hour after that warm-up period was counted as an 'hour' of continuous operation.

The big question, however, is: how long should the test run in order to be confident that the incidence of failures will be the same as, or less than, the current system?

Running a few long multi-cycle tests gives far better information than running a large number of single cycle tests.  This is because single independent test runs would each require two hours of test execution time, and because each run would then need to be considered against the 3% probability that that particular run could have been an hour with a failure.  A very large number of test runs would be required to give a reasonable level of confidence that the problem had been resolved.  By running a long test instead, we get the results for several hours in less time, and, more importantly, we get to see 'how many good hours' we get in a sequence.

If each hour has a 3% chance of encountering a catastrophic failure, then the chance of surviving 10 consecutive hours works out to be 73.7% (0.97 ^ 10 = 0.737).  The table below shows the probability that a test will survive a given number of hours based on a designated failure rate for each hour.  Note that the 50% chance of a complete system failure during a multi hour run is highlighted in dark green and the 25% chance of a failure is highlighted in light green.
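The table itself is not reproduced here, but it falls out of the same formula: the chance of surviving n consecutive hours at a per-hour failure rate p is (1 - p)^n.  The sketch below regenerates a table of that kind; the particular failure-rate columns and hour counts are illustrative choices, not necessarily the ones used in the original table.

```python
# Probability of surviving n consecutive hours with no catastrophic failure,
# i.e. (1 - p) ** n. The rates and hour counts below are example values only.

failure_rates = [0.01, 0.03, 0.05, 0.10]
hour_counts = [2, 6, 10, 16, 24, 48, 60]

print("hours" + "".join(f"{rate:>8.0%}" for rate in failure_rates))
for n in hour_counts:
    cells = "".join(f"{(1 - rate) ** n:>8.1%}" for rate in failure_rates)
    print(f"{n:>5}{cells}")

# At a 3% per-hour rate, 0.97 ** 10 = 0.737, matching the 73.7% quoted above.
```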

As is clear from the above table, running a test for a couple of hours fails to give any level of confidence that the system will perform as it has over the past 12 months, let alone whether it can actually perform more reliably.

Repeatability is critical in any Load Testing approach, so it is important that the system can be subjected to a very similar set of tests in the future to determine whether stability has improved or degraded.  With this in mind, the project decided to run the following sequence of tests, with an expectation that they would probably see one major failure if the race condition still existed with a 3% likelihood of a failure occurring in any Full Peak Hour (the sketch after the list shows how the quoted chances are derived).  If necessary, the same (or similar) sequence of tests could be planned and executed for subsequent releases.

  • 16 Hours: Wednesday @ 16:00 (with 37% chance of observing defect)
  • 12 Hours: Thursday @ 18:00 (with 29% chance of observing defect)
  • 60 Hours: Friday @ 18:00 (with 84% chance of observing defect)
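The quoted detection chances follow from the same arithmetic as the survival table: the probability of observing at least one failure over n at-risk Full Peak Hours is 1 - 0.97^n.  A minimal sketch is below; whether the one-hour warm-up is counted as an at-risk hour shifts each figure by a point or so, so the printed ranges bracket the percentages quoted in the list rather than reproducing them exactly.

```python
# Chance of observing at least one failure during a run of the given length,
# at a 3% per-hour failure rate. Computed both with and without the one-hour
# warm-up counted as an at-risk hour, since that choice moves each figure
# by roughly one percentage point.

P_FAIL_PER_HOUR = 0.03

for run_hours in (16, 12, 60):
    with_warmup = 1 - (1 - P_FAIL_PER_HOUR) ** run_hours
    without_warmup = 1 - (1 - P_FAIL_PER_HOUR) ** (run_hours - 1)
    print(f"{run_hours:>2}h run: {without_warmup:.0%} - {with_warmup:.0%} "
          f"chance of observing the defect")

# 16h run: 37% - 39%, 12h run: 28% - 31%, 60h run: 83% - 84%
```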

After workshopping this approach with the relevant stakeholders, they were more concerned with determining whether the race condition problem still existed than with simply hoping for a quick and clean test run, and made the necessary provisions for a long weekend test run.  They now understood that executing long running tests was not an "elaborate and unusual form of sadistic entertainment" but a reasonable means of validating the reliability of a complex system.

It takes time to establish best practice execution of long running load tests.  Storage issues, for the Application Under Test as well as the Test Infrastructure, are common problems, and the tests need to be 'optimised' for long running operation by limiting logging and unnecessary transaction detail.

Running multi-day tests for the first time in an environment frequently highlights environment issues that need to be resolved before the results can be properly utilised.

Analysing multi-day test runs can be difficult due to granularity issues with such large numbers of transactions.  For this reason, it is often better to run multiple separate tests back to back rather than a single test run.  From a SUT perspective this is effectively the same, but post test analysis is much simpler.  We did this for a client over a Christmas break, and were able to run 14 whole days of tests using a repeating test that took just under 24 hours to execute.  Running this 14 day sequence gave the project a very high level of confidence that the entire application stack was stable.

The concepts in this Post should help you negotiate a reasonable duration for your long running Load Test so that your experience can be more like 'a walk in the park' than 'walking the plank'.

Miljan Grujic

Senior Performance Analyst at Department of Human Services

9y

Mathematical and Statistical principles go hand in hand. They play an important role in analysing and interpreting test results and collected measurements around underlying hardware utilisation. This article shows a good application of statistics and the theory of probability in the test preparation and test run design phase.

Scott Hysmith

Consultant at Foulk Consulting

9y

First of all, if you're NOT designing your test with knocking it down in mind, you're not trying hard enough. This highlights the fact that for most stakeholders, "performance test" is a catch-all term. They typically make no distinction between load, capacity, stress and--in the case of this article--endurance. We should be keeping all these types in mind when designing our load model. One thing the article did not highlight was that not only are you going to uncover the hard to find cracks in the system under test, but the first time you run an endurance test you're also going to shine a light on your own load infrastructure. Your load generators should be robust enough to spin up a test lasting days, you should have enough drive space for all those raw results (and woe unto you if you forgot to turn extreme logging off that one vuser you were doing forensics on), and make sure your analysis tool is pointed to a data source that won't choke and die on data sets larger than 2GB.

