Death Throes - Signs of a failing system, and the value of dynamic monitoring - for both IT Systems and Disease.
One of my favorite memories of Load Testing was back at a bank around the year 2000. I was part of a team that was Load and Stress testing one of the largest Online Banking systems in Australia. The moment I remember most vividly was when one of the guys in the team would yell out, "Death Throes!". We had all grown accustomed to seeing this behavior. As we gradually increased the workload demand on the system, some aspects of the system would start to falter, causing little rises in response time, lock wait time, queue depth, or various other markers.
Sometimes the system would resolve this mini crisis, and everything would return to normal. But we would often not give the system much time to 'settle', because we knew that in the real world, hundreds of thousands of users would not 'back off' after sensing that a systemic problem had manifested. End users can be brutal, driving a workload onslaught just when the system most needs a break, so we continued to press on.
When workload increased, the various components and sub-systems supporting the entire Internet Banking application did their best to cope, but little blips soon turned into longer and more extreme spikes. After a few minutes of degradation, we would see the entire system begin to oscillate, with wild variations of extreme activity that eventually went completely 'out of control'. As each cycle got worse, the words "Death Throes" would be heard, first quietly, under people's breath, and then, as it became more obvious, someone would shout it out.
Even after the proclamation had been shouted out, the test was not over. Stakeholders in the room were keen to see a successful test run, and would cross their fingers and hope that the system would recover. A few moments later, someone would call out "response time is improving", and another would call out "queue depth is reducing", giving the stakeholders the cruel and false hope that a positive outcome was imminent. In reality this was just a short 'trough' in the problems, as the system desperately tried to resolve its issues.
Once the 'Death Throes' started, the outcome was always bad. Sometimes it would take 5 minutes, other times 10, but the system would eventually seize up and completely stop processing customer requests.
One of the main things I learned from this was that you could tell a system was about to fail catastrophically if you could observe a 'wave form' of various performance markers progressively becoming more extreme, providing a very limited intervention window. This, however, required properly configured instrumentation. If you could only see CPU utilization as an average of the past minute, you might not have enough data points to 'resolve' a cyclic pattern. It is critical to be able to see each metric on a second-by-second basis, even if the periodicity of the 'death throes' is much longer, and measurable in minutes.
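To make the granularity point concrete, here is a minimal Python sketch using entirely synthetic data (the 200 ms baseline, the 40-second oscillation period, and the growing amplitude are all invented for illustration). The per-minute averages of an oscillating response time look deceptively calm, while the per-second samples reveal a peak-to-trough swing that widens minute over minute, which is the 'death throes' signature:

```python
# A minimal sketch (synthetic data, hypothetical metric) showing why
# per-second samples are needed to resolve a cyclic pattern that
# one-minute averages smear away.
import math
import statistics

SECONDS = 600  # ten minutes of per-second samples

# Synthetic response-time metric: a baseline plus an oscillation whose
# amplitude grows over time -- the failure signature described above.
samples = [
    200 + (1 + t / 120) * 50 * math.sin(2 * math.pi * t / 40)
    for t in range(SECONDS)
]

# Per-minute averages: the oscillation largely cancels within each
# minute, so the averaged series stays near the baseline.
minute_averages = [
    statistics.mean(samples[m * 60:(m + 1) * 60])
    for m in range(SECONDS // 60)
]

# Per-second view: track the peak-to-trough swing within each minute.
# A swing that widens minute over minute is the early-warning signal.
minute_swings = [
    max(samples[m * 60:(m + 1) * 60]) - min(samples[m * 60:(m + 1) * 60])
    for m in range(SECONDS // 60)
]

for m, (avg, swing) in enumerate(zip(minute_averages, minute_swings)):
    print(f"minute {m}: avg={avg:6.1f} ms  swing={swing:6.1f} ms")
```

Run it and the averages barely move while the swing grows roughly six-fold: averaged monitoring would report a healthy system right up until it seized.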
This is why I was thrilled to discover a paper from 2014, by Roxana S. Dronca, Alexey A. Leontovich, Wendy K. Nevala, and Svetomir N. Markovic, entitled "Personalized therapy for metastatic melanoma: could timing be everything?". I have been very interested in health and our immune system for some time, and recently had some skin cancer identified and removed, so I am naturally interested in what is going on from a systemic perspective. The usual paradigm for medical treatment is to take various 'Point In Time' measures, and use that data to inform clinical decision making. But this paper is different, because it shows that various immune system markers, if measured frequently, can expose their own distinct 'period' of oscillation, especially if the patient is in the final stages of a very serious disease, such as metastatic melanoma. One day, CRP (a commonly used marker of inflammation) may be down in the normal range, while some other cytokines are extremely high. Three days (or three weeks, or three months) later, these values may be reversed. This means that the traditional 'point in time' approach is completely inadequate to 'comprehend' the dynamic nature of the disease, and any decisions flowing from Point In Time measures could be counterproductive. If measures are taken before and after a tumor is removed (or a course of treatment is delivered), the assessment of the intervention could be rendered meaningless if the patient's markers are varying wildly over time.
When I adopted a Low Carbohydrate diet a year ago, I lost a lot of excess weight (40 kg over 13 months) and dramatically improved my health, but my cholesterol rose, raising concerns for my doctor. I (like the doctor) assumed that cholesterol values change very slowly over time, but I was wrong. As shown in the graph, when I measured my Blood Glucose, Ketone and Lipid levels multiple times over a few days, I could see significant variation in response to eating and subsequent periods of Intermittent Fasting.
I have run more than 250 Lipid Tests on myself over the past year, and one thing I have learned is that cholesterol values, even within the recommended 'fasting hours' range, can vary by a factor of 2 for an individual. What you have eaten and how you have behaved over the previous three days make a big difference to the Triglyceride, HDL and LDL concentrations in your blood. The pseudo-random cholesterol results you get at each test may be misleading and result in very poor clinical advice. Imagine you get a randomly 'bad' result in December, make poor food and exercise choices for 6 months, and then get a randomly 'good' result in June. You may decide that, based on the blood test evidence, you do not need to look after your health.
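As a small illustration of why one reading can mislead, here is a sketch with invented LDL numbers (the values and units are assumptions for illustration, not my actual results). Computing the within-person spread across repeated tests shows how much trust a single point-in-time reading deserves:

```python
# A minimal sketch (invented numbers) of why repeated measurement matters:
# the within-person spread of a marker tells you how much trust to place
# in any single point-in-time reading.
import statistics

# Hypothetical LDL results (mmol/L) from repeated fasting tests over weeks.
ldl_results = [2.1, 3.4, 2.8, 4.0, 2.3, 3.7, 2.9]

mean_ldl = statistics.mean(ldl_results)
cv = statistics.stdev(ldl_results) / mean_ldl  # coefficient of variation

print(f"mean={mean_ldl:.2f} mmol/L, CV={cv:.0%}, "
      f"range={min(ldl_results):.1f}-{max(ldl_results):.1f}")
# A high CV (or a near 2x min-to-max range, as here) means one reading
# could land almost anywhere in that range by chance of timing alone.
```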
This is why it is critical, both to the assurance of high-quality system performance and to high-quality health outcomes, that some measures are collected and analysed with very fine granularity. For example, in modern Cloud Deployed IT Systems, auto-scaling events (that add additional capacity) should probably fire, based on very granular monitoring, when certain errors are observed, or when response time variation exceeds a given threshold, rather than waiting for a simplistic average CPU figure to exceed an arbitrary threshold (see the sketch below). Likewise, in some situations in health, it may be important to take multiple measurements over the course of a day, or even over a week, to give an accurate picture of a person's dynamic response to a crisis.
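As a sketch of what such a trigger might look like (the window size, thresholds, and function names are all hypothetical, not any particular cloud provider's API), a scale-out decision can key off per-second response-time variation and error counts rather than an averaged utilization figure:

```python
# A minimal sketch (hypothetical thresholds and window sizes) of a scaling
# trigger driven by fine-grained response-time variation and error counts,
# rather than a coarse average-CPU threshold.
from collections import deque
import statistics

WINDOW = 30          # seconds of per-second samples to consider
MAX_STDEV_MS = 150   # response-time variation threshold (assumed)
MAX_ERRORS = 5       # errors tolerated within the window (assumed)

response_times = deque(maxlen=WINDOW)  # per-second response times, ms
error_counts = deque(maxlen=WINDOW)    # per-second error counts

def record_second(rt_ms: float, errors: int) -> None:
    """Record one second's worth of monitoring samples."""
    response_times.append(rt_ms)
    error_counts.append(errors)

def should_scale_out() -> bool:
    """Fire a scale-out when variation or errors spike, not merely load."""
    if len(response_times) < WINDOW:
        return False
    return (statistics.stdev(response_times) > MAX_STDEV_MS
            or sum(error_counts) > MAX_ERRORS)
```

The design point is that variance and errors move earlier than averaged utilization, so a trigger like this fires while there is still an intervention window.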
This is why I really liked the paper by Dronca et al., as it provides a means of visualizing the entire dynamic immune system in response to the many assaults of metastatic cancer. The paper then explains that various treatments are much more effective at certain points within the cycles of this dynamic response, and advocates predicting and targeting treatment for the optimum point in time, based on the expected immune system state.
It seems to me that running a blood test for various inflammatory markers every day, for a week, would be a very prudent course of action if one had cancer and was trying to figure out whether one's immune system was in a violent struggle to overcome the cancer, or whether it was peacefully stable. The resulting diagnostic information would then make a much more nuanced contribution to the process of selecting the most effective cancer treatment.
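A naive sketch of that idea follows (the markers, the daily values, and the swing-versus-mean heuristic are all invented for illustration; a real classification would of course need clinical input):

```python
# A minimal sketch (invented values and thresholds) of classifying a week
# of daily inflammatory-marker readings as 'oscillating' (violent struggle)
# versus 'stable', by comparing each marker's swing to its mean level.
import statistics

# Hypothetical daily readings over one week.
week = {
    "CRP (mg/L)": [2.0, 14.0, 5.0, 18.0, 3.0, 16.0, 4.0],   # wild swings
    "IL-6 (pg/mL)": [3.1, 3.3, 3.0, 3.2, 3.1, 3.4, 3.2],    # quiet
}

for marker, values in week.items():
    swing = max(values) - min(values)
    mean = statistics.mean(values)
    status = "oscillating" if swing > mean else "stable"
    print(f"{marker}: mean={mean:.1f}, swing={swing:.1f} -> {status}")
```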
Most systems are dynamic, and if one wants to really understand how such systems are performing, then one needs to be able to capture as many data points as possible over time. Prompt analysis of this data widens the window of opportunity to make an appropriate intervention and resolve a problem that would otherwise lead to a very bad outcome.
Spot on Paul! I feel strongly that medical science is not well equipped to handle dynamic systems. It's an irony. Perhaps the ICU is the closest it gets to 'dynamic system' monitoring, however even this is unlikely to be over long periods of time. I think in-home testing has thrown a lot of doubt into current protocols for testing and treatment. There is a whole world of dynamic systems that needs to be mastered before modern medicine is equipped to play in the space of prevention. Add to that the personalisation of medicine. You are making a great contribution to this field Paul!
Customer Success | Solutions Engineering | Professional Services | Observability | Technology Consulting | Baseline & NV1 Clearance
6y: Very good post. I really like the comparison. It really summed up the value of 'dynamic'.
IT Director of Portfolio Architecture
6y: Completely agree here. Early-prediction oscillator algorithms like Fisher Transforms, monitoring small platform behavioral changes with fine-grained markers, can really make a difference. Great article that is spot on about what is needed. Too often I see people pursue historical capacity models on combined markers, based on averages over history, that are just not sensitive enough. Fisher and other short-term trend algorithms alert before things go awry.
Cloud Security and AI Wrangler
6y: Great post Paul!