Do you know the difference between reliability and resilience?
If you want to know the difference between reliability and resilience, look to the Moon. Specifically, look at the two best-known Moon missions, Apollo 11 and Apollo 13.
Although Apollo 11 was famously successful, this was almost not the case. In the final minutes of the descent, the guidance computer crashed repeatedly, throwing error after error at the two astronauts, who waited tensely on instructions from Mission Control, and wondered whether to abort the mission.
Later, it was found that the computer was receiving unexpectedly large amounts of data from one of the instruments, overloading its memory and processing capacity. Yet, despite the nerve-racking series of errors, the computer behaved exactly as designed. When overwhelmed, it displayed an error message and restarted itself, giving priority to the most important programmes.
Moreover, the astronauts and their counterparts in Mission Control behaved in line with their training, skills and character. Listening to the famous recording of the descent, you would never guess that this was a moment of historical uncertainty, made all the more uncertain by unexpected errors in critical equipment.
The whole system, including the guidance computer, the astronauts, and their support back on Earth, was designed to be reliable. Failures were expected: that’s why the computer had error codes, and that’s why the astronauts had been trained to deal with them. The system was built out of unreliable components (all individual components are unreliable) but was reliable in aggregate.
Unlike Apollo 11, Apollo 13 is famous for what went wrong. An explosion damaged many systems and blew part of the oxygen supply into space. There was no question of landing on the Moon, and the mission switched to getting the crew home safely through a triumph of planning, endurance and improvisation. (Although not perfectly historically accurate, Apollo 13 remains high on my list of films I can always watch again: if I turn on the television and it’s showing, it’s hard to look away, no matter where it is in the story.) However, it was not all improvisation. The big decision to shut down the command module and preserve its power for landing, using the lunar module as a lifeboat, had been considered prior to launch. It was not thought to be a likely scenario, but it was thought about, so the crew did not have to invent the approach from scratch.
The whole system in Apollo 13 turned out to be resilient. When an unexpected catastrophe occurred, the crew and Mission Control used existing plans to respond. They preserved what was most important (the lives of the crew) while accepting that they could no longer achieve everything they planned (landing on the Moon).
I think that these two examples help us think through the difference between reliability and resilience in enterprise computing. Given the degree to which we rely on computers to run our lives, our companies and our society, it is imperative that we think about both. This wasn’t always the case: in the early days of my career, having failover equipment was an optional extravagance, and disaster recovery was, for many companies, a new discipline. Now we attempt to design our systems so that they keep going when things go wrong.
However, I think that sometimes, when we attempt to design our systems to keep going, we fail to distinguish sufficiently between reliability and resilience. This is particularly apparent in the current wave of adoption of public cloud. Because cloud is new to many companies (or, for companies that have been using cloud for a while, because they are reaching new levels of usage and dependence), they are asking, ‘What happens when things go wrong?’
This is almost, but not quite, the right question to ask. If we only ask what happens when things go wrong in a broad, undefined way, then we don’t know what type of failure we are talking about. Are we talking about the failure of a server? A zone? A region? Or the entire global platform? Are we talking about incidents that are short-lived? Persistent? Irrecoverable? Are we talking about technical failures? Or commercial failures? Failures to deliver change? Or failures to deliver service?
Each of these failure modes has a different likelihood and a different impact. Some are guaranteed to occur: hardware will always fail at some point. Some may never occur (where ‘never’ means beyond a reasonable planning horizon): a global platform outage with no prospect of recovery may happen so infrequently that it doesn’t happen in our lifetimes (the results are not in yet).
More importantly, as we traverse these failure modes with their different impacts and likelihoods, we cross the threshold from reliability to resilience. We move from asking how we keep services running well despite expected failures (how we get Apollo 11 to land on the Moon) to asking how we survive despite unexpected catastrophes (how we get the Apollo 13 crew home safely). If, when planning the adoption of new technologies such as public cloud, we mix up reliability and resilience, we run a high risk of either over-engineering, to attempt to preserve service in the event of all imaginable scenarios, or under-preparing, and relying on our reliability measures in the face of events which overwhelm them.
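One way to make this threshold concrete is a rough sketch like the one below. It takes the failure scopes mentioned earlier (server, zone, region, global platform) and asks whether each is expected to occur within a planning horizon. Everything in it is an assumption for illustration: the names `FailureMode`, `Posture`, `posture_for` and `classify_all` are hypothetical, the rates are made-up placeholders rather than measurements, and the five-year horizon and the "at least one expected event" cut-off are arbitrary choices you would replace with your own.

```python
from dataclasses import dataclass
from enum import Enum


class Posture(Enum):
    # Expected failures: design the service to keep running through them.
    RELIABILITY = "keep the service running"
    # Unexpected catastrophes: plan to survive and protect what matters most.
    RESILIENCE = "survive and protect what matters most"


@dataclass(frozen=True)
class FailureMode:
    name: str
    # Placeholder estimate of occurrences per year (illustrative, not real data).
    events_per_year: float


def posture_for(mode: FailureMode, planning_horizon_years: float = 5.0) -> Posture:
    """Classify a failure mode by whether we expect it within the horizon."""
    expected_events = mode.events_per_year * planning_horizon_years
    # Assumed cut-off: one or more expected events means "plan for it as routine".
    if expected_events >= 1.0:
        return Posture.RELIABILITY
    return Posture.RESILIENCE


def classify_all(modes: list[FailureMode]) -> dict[str, Posture]:
    """Return a mapping from failure mode name to the posture it calls for."""
    return {mode.name: posture_for(mode) for mode in modes}


# Illustrative placeholders only; substitute your own estimates.
FAILURE_MODES = [
    FailureMode("server failure", 12.0),
    FailureMode("zone outage", 0.5),
    FailureMode("region outage", 0.05),
    FailureMode("global platform loss", 0.001),
]

if __name__ == "__main__":
    for name, posture in classify_all(FAILURE_MODES).items():
        print(f"{name}: {posture.name} ({posture.value})")
```

The value of a sketch like this is less in the numbers than in forcing the question per failure mode: changing the horizon or the estimates moves the line between the two postures, which is exactly the conversation worth having before deciding what to engineer for.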
There’s a lot more to say about this distinction: I’ll explore further next week.
(Views in this article are my own.)
Amazing Metaphor!
Good one David Knott ! Reliability and Resilience - excellent choice with explanation in simple terms. The questions you posed towards the end of the article are interesting. Hope orgs that deliver critical services take them up in their journey to cloud and have answers!
Author of 'Enterprise Architecture Fundamentals', Founder & Owner of Caminao
2 years ago: Reliability looks inward, resilience looks outward https://caminao.blog/book-pick-evolution-resilience/
EPAM Delivery Project Manager, PMP, SSM, CLP, CSM
2 years ago: Good clarification, thanks David!
Delivery Head, 5x AWS, GCP Cloud Architect, Sun Certified Enterprise Architect, PMI-ACP, OCP.
2 years ago: that was an excellent article David !!!