Death Throes - Signs of a failing system, and the value of dynamic monitoring - for both IT Systems and Disease.
One of my favorite memories of Load Testing was back at a bank around the year 2000. I was part of a team that was Load and Stress testing one of the largest Online Banking systems in Australia. The moment I remember most vividly was when one of the guys in the team would yell out, "Death Throes!". We had all grown accustomed to seeing this behavior. As we gradually increased the workload demand on the system, some aspects of the system would start to falter, causing little rises in response time, lock wait time, queue depth, or various other markers.
Sometimes the system would resolve this mini crisis, and everything would return to normal. But we would often not give the system much time to 'settle', because we knew that in the real world, hundreds of thousands of users would not 'back off' after sensing that a systemic problem had manifested. End users can be brutal, driving a workload onslaught just when the system most needs a break, so we continued to press on.
When workload increased, the various components and sub-systems supporting the entire Internet Banking application did their best to cope, but little blips soon turned into longer and more extreme spikes. After a few minutes of degradation, we would see the entire system begin to oscillate, with wild variations of extreme activity that eventually went completely 'out of control'. As each cycle got worse, the words "Death Throes" would be heard, first quietly, under people's breath, and then, as it became more obvious, someone would shout it out.
Even after the proclamation had been shouted out, the test was not over. Stakeholders in the room were keen to see a successful test run, and would cross their fingers and hope that the system would recover. A few moments later, someone would call out "response time is improving", and another would call out "queue depth is reducing", giving the stakeholders the cruel and false hope that a positive outcome was imminent. In reality this was just a short 'trough' in the problems, as the system desperately tried to resolve its issues.
Once the 'Death Throes' started, the outcome was always bad. Sometimes it would take 5 minutes, other times 10, but the system would eventually seize up and completely stop processing customer requests.
One of the main things I learned from this was that you could tell a system was about to fail catastrophically if you could observe a 'wave form' of various performance markers progressively becoming more extreme, providing a very limited intervention window. This, however, required properly configured instrumentation. If you could only see CPU utilization as an average of the past minute, you might not have enough data points to 'resolve' a cyclic pattern. It is critical to be able to see each metric on a second-by-second basis, even if the periodicity of the 'death throes' is much longer, and measurable in minutes.
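To make the granularity point concrete, here is a minimal Python sketch using entirely synthetic data (the 200 ms baseline, the 40-second oscillation period, and the growing amplitude are all invented for illustration). The per-minute averages of an oscillating response time look deceptively calm, while the per-second samples reveal a peak-to-trough swing that widens minute over minute, which is the 'death throes' signature:

```python
# A minimal sketch (synthetic data, hypothetical metric) showing why
# per-second samples are needed to resolve a cyclic pattern that
# one-minute averages smear away.
import math
import statistics

SECONDS = 600  # ten minutes of per-second samples

# Synthetic response-time metric: a baseline plus an oscillation whose
# amplitude grows over time -- the failure signature described above.
samples = [
    200 + (1 + t / 120) * 50 * math.sin(2 * math.pi * t / 40)
    for t in range(SECONDS)
]

# Per-minute averages: the oscillation largely cancels within each
# minute, so the averaged series stays near the baseline.
minute_averages = [
    statistics.mean(samples[m * 60:(m + 1) * 60])
    for m in range(SECONDS // 60)
]

# Per-second view: track the peak-to-trough swing within each minute.
# A swing that widens minute over minute is the early-warning signal.
minute_swings = [
    max(samples[m * 60:(m + 1) * 60]) - min(samples[m * 60:(m + 1) * 60])
    for m in range(SECONDS // 60)
]

for m, (avg, swing) in enumerate(zip(minute_averages, minute_swings)):
    print(f"minute {m}: avg={avg:6.1f} ms  swing={swing:6.1f} ms")
```

Run it and the averages barely move while the swing grows roughly six-fold: averaged monitoring would report a healthy system right up until it seized.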
This is why I was thrilled to discover a paper from 2014, by Roxana S. Dronca, Alexey A. Leontovich, Wendy K. Nevala, and Svetomir N. Markovic, entitled "Personalized therapy for metastatic melanoma: could timing be everything?". I have been very interested in health and our immune system for some time, and recently had some skin cancer identified and removed, so I am naturally interested in what is going on from a systemic perspective. The usual paradigm for medical treatment is to take various 'Point In Time' measures, and use that data to inform clinical decision making. But this paper is different, because it shows that various immune system markers, if measured frequently, can expose their own distinct 'period' of oscillation, especially if the patient is in the final stages of a very serious disease, such as metastatic melanoma. One day, CRP (a commonly used marker of inflammation) may be down in the normal range, while some other cytokines are extremely high. Three days (or three weeks, or three months) later, these values may be reversed. This means that the traditional 'point in time' approach is completely inadequate to 'comprehend' the dynamic nature of the disease, and any decisions flowing from Point In Time measures could be counterproductive. If measures are taken before and after a tumor is removed (or a course of treatment is delivered), the assessment of the intervention could be rendered meaningless if the patient's markers are varying wildly over time.
When I adopted a Low Carbohydrate diet a year ago, I lost a lot of excess weight (40 kg over 13 months) and dramatically improved my health, but my cholesterol rose, raising concerns for my doctor. I (like the doctor) assumed that cholesterol values change very slowly over time, but I was wrong. As shown in the graph, when I measured my Blood Glucose, Ketone and Lipid levels multiple times over a few days, I could see significant variation in response to eating and subsequent periods of Intermittent Fasting.
I have run more than 250 Lipid Tests on myself over the past year, and one thing I have learned is that cholesterol values, even within the recommended 'fasting hours' range, can vary by a factor of 2 for an individual. What you have eaten and how you have behaved over the previous three days make a big difference to the Triglyceride, HDL and LDL concentrations in your blood. The pseudo-random cholesterol results you get at each test may be misleading and result in very poor clinical advice. Imagine you get a randomly 'bad' result in December, make poor food and exercise choices for 6 months, and then get a randomly 'good' result in June. You may decide that, based on the blood test evidence, you do not need to look after your health.
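As a small illustration of why one reading can mislead, here is a sketch with invented LDL numbers (the values and units are assumptions for illustration, not my actual results). Computing the within-person spread across repeated tests shows how much trust a single point-in-time reading deserves:

```python
# A minimal sketch (invented numbers) of why repeated measurement matters:
# the within-person spread of a marker tells you how much trust to place
# in any single point-in-time reading.
import statistics

# Hypothetical LDL results (mmol/L) from repeated fasting tests over weeks.
ldl_results = [2.1, 3.4, 2.8, 4.0, 2.3, 3.7, 2.9]

mean_ldl = statistics.mean(ldl_results)
cv = statistics.stdev(ldl_results) / mean_ldl  # coefficient of variation

print(f"mean={mean_ldl:.2f} mmol/L, CV={cv:.0%}, "
      f"range={min(ldl_results):.1f}-{max(ldl_results):.1f}")
# A high CV (or a near 2x min-to-max range, as here) means one reading
# could land almost anywhere in that range by chance of timing alone.
```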
This is why it is critical, both to the assurance of high-quality system performance and to high-quality health outcomes, that some measures are collected and analysed with very fine granularity. For example, in modern Cloud Deployed IT Systems, auto-scaling events (that add additional capacity) should probably fire, based on very granular monitoring, when certain errors are observed, or when response time variation exceeds a given threshold, rather than waiting for a simplistic average CPU figure to exceed an arbitrary threshold (see the sketch below). Likewise, in some situations in health, it may be important to take multiple measurements over the course of a day, or even over a week, to give an accurate picture of a person's dynamic response to a crisis.
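As a sketch of what such a trigger might look like (the window size, thresholds, and function names are all hypothetical, not any particular cloud provider's API), a scale-out decision can key off per-second response-time variation and error counts rather than an averaged utilization figure:

```python
# A minimal sketch (hypothetical thresholds and window sizes) of a scaling
# trigger driven by fine-grained response-time variation and error counts,
# rather than a coarse average-CPU threshold.
from collections import deque
import statistics

WINDOW = 30          # seconds of per-second samples to consider
MAX_STDEV_MS = 150   # response-time variation threshold (assumed)
MAX_ERRORS = 5       # errors tolerated within the window (assumed)

response_times = deque(maxlen=WINDOW)  # per-second response times, ms
error_counts = deque(maxlen=WINDOW)    # per-second error counts

def record_second(rt_ms: float, errors: int) -> None:
    """Record one second's worth of monitoring samples."""
    response_times.append(rt_ms)
    error_counts.append(errors)

def should_scale_out() -> bool:
    """Fire a scale-out when variation or errors spike, not merely load."""
    if len(response_times) < WINDOW:
        return False
    return (statistics.stdev(response_times) > MAX_STDEV_MS
            or sum(error_counts) > MAX_ERRORS)
```

The design point is that variance and errors move earlier than averaged utilization, so a trigger like this fires while there is still an intervention window.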
This is why I really liked the paper by Dronca et al., as it provides a means of visualizing the entire dynamic immune system in response to the many assaults of metastatic cancer. The paper then explains that various treatments are much more effective at certain points within the cycles of this dynamic response, and advocates predicting and targeting treatment for the optimum point in time, based on the expected immune system state.
It seems to me that running a blood test for various inflammatory markers every day, for a week, would be a very prudent course of action if one had cancer and was trying to figure out whether one's immune system was in a violent struggle to overcome the cancer, or whether it was peacefully stable. The resulting diagnostic information would then make a much more nuanced contribution to the process of selecting the most effective cancer treatment.
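A naive sketch of that idea follows (the markers, the daily values, and the swing-versus-mean heuristic are all invented for illustration; a real classification would of course need clinical input):

```python
# A minimal sketch (invented values and thresholds) of classifying a week
# of daily inflammatory-marker readings as 'oscillating' (violent struggle)
# versus 'stable', by comparing each marker's swing to its mean level.
import statistics

# Hypothetical daily readings over one week.
week = {
    "CRP (mg/L)": [2.0, 14.0, 5.0, 18.0, 3.0, 16.0, 4.0],   # wild swings
    "IL-6 (pg/mL)": [3.1, 3.3, 3.0, 3.2, 3.1, 3.4, 3.2],    # quiet
}

for marker, values in week.items():
    swing = max(values) - min(values)
    mean = statistics.mean(values)
    status = "oscillating" if swing > mean else "stable"
    print(f"{marker}: mean={mean:.1f}, swing={swing:.1f} -> {status}")
```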
Most systems are dynamic, and if one wants to really understand how such systems are performing, then one needs to be able to capture as many data points as possible over time. Prompt analysis of this data widens the window of opportunity to make an appropriate intervention and resolve a problem that would otherwise lead to a very bad outcome.
Spot on Paul! I feel strongly that medical science is not well equipped to handle dynamic systems. It's an irony. Perhaps the ICU is the closest it gets to 'dynamic system' monitoring, however even this is unlikely to be over long periods of time. I think in-home testing has thrown a lot of doubt into current protocols for testing and treatment. There is a whole world of dynamic systems that needs to be mastered before modern medicine is equipped to play in the space of prevention. Add to that the personalisation of medicine. You are making a great contribution to this field Paul!
Customer Success | Solutions Engineering | Professional Services | Observability | Technology Consulting | Baseline & NV1 Clearance
6y: Very good post. I really like the comparison. It really summed up the value of 'dynamic'.
IT Director of Portfolio Architecture
6y: Completely agree here. Early-prediction oscillator algorithms like Fisher Transforms, monitoring small platform behavioral changes with fine-grained markers, can really make a difference. Great article that is spot on about what is needed. Too often I see people pursue historical capacity models on combined markers, based on averages over history, that are just not sensitive enough. Fisher and other short-term trend algorithms alert before things go awry.
Cloud Security and AI Wrangler
6y: Great post Paul!