Do you know the difference between reliability and resilience?

David Knott

CTO for UK Government

Published: 14 July 2022

If you want to know the difference between reliability and resilience, look to the Moon. Specifically, look at the two best known Moon missions, Apollo 11 and Apollo 13.

Although Apollo 11 was famously successful, this was almost not the case. In the final minutes of the descent, the guidance computer crashed repeatedly, throwing error after error at the two astronauts, who waited tensely on instructions from Mission Control, and wondered whether to abort the mission.

Later, it was found that the computer was receiving unexpectedly large amounts of data from one of the instruments, overloading its memory and processing capacity. Yet, despite the nerve-racking series of errors, the computer behaved exactly as designed. When overwhelmed, it displayed an error message and restarted itself, giving priority to the most important programmes.

Moreover, the astronauts and their counterparts in Mission Control behaved in line with their training, skills and character. If you listen to the famous recording of the descent, you would never guess that this was a moment of historical uncertainty, made all the more uncertain by unexpected errors in critical equipment.

The whole system, including the guidance computer, the astronauts, and their support back on Earth, was designed to be reliable. Failures were expected: that’s why the computer had error codes, and that’s why the astronauts had been trained to deal with them. The system was built out of unreliable components (all individual components are unreliable) but was reliable in aggregate.

Unlike Apollo 11, Apollo 13 is famous for what went wrong. An explosion damaged many systems and blew part of the oxygen supply into space. There was no question of landing on the Moon, and the mission switched to getting the crew home safely through a triumph of planning, endurance and improvisation. (Although not perfectly historically accurate, Apollo 13 remains high on my list of films I can always watch again: if I turn on the television and it’s showing, it’s hard to look away, no matter where it is in the story.) However, it was not all improvisation. The big decision to shut down the command module and preserve its power for landing, using the lunar module as a lifeboat, had been considered prior to launch. It was not thought to be a likely scenario, but it was thought about, so the crew did not have to invent the approach from scratch.

The whole system in Apollo 13 turned out to be resilient. When an unexpected catastrophe occurred, the crew and Mission Control used existing plans to respond. They preserved what was most important (the lives of the crew) while accepting that they could no longer achieve everything they planned (landing on the Moon).

I think that these two examples help us think through the difference between reliability and resilience in enterprise computing. Given the degree to which we rely on computers to run our lives, our companies and our society, it is imperative that we think about both. This wasn’t always the case: in the early days of my career, having failover equipment was an optional extravagance, and disaster recovery was, for many companies, a new discipline. Now we attempt to design our systems so that they keep going when things go wrong.

However, I think that sometimes, when we attempt to design our systems to keep going, we fail to distinguish sufficiently between reliability and resilience. This is particularly apparent in the current wave of adoption of public cloud. Because cloud is new to many companies (or because companies that have been using cloud for a while are reaching new levels of usage and dependence), they are asking, ‘What happens when things go wrong?’

This is almost, but not quite, the right question to ask. If we only ask what happens when things go wrong in a broad, undefined way, then we don’t know what type of failure we are talking about. Are we talking about the failure of a server? A zone? A region? Or the entire global platform? Are we talking about incidents that are short-lived? Persistent? Irrecoverable? Are we talking about technical failures? Or commercial failures? Failures to deliver change? Or failures to deliver service?

Each of these failure modes has a different likelihood and a different impact. Some are guaranteed to occur: hardware will always fail at some point. Some may never occur (where ‘never’ means beyond a reasonable planning horizon): a global platform outage with no prospect of recovery may happen so infrequently that it doesn’t happen in our lifetimes (the results are not in yet).

More importantly, as we traverse these failure modes with their different impacts and likelihoods, we cross the threshold from reliability to resilience. We move from asking how we keep services running well despite expected failures (how we get Apollo 11 to land on the Moon) to asking how we survive despite unexpected catastrophes (how we get the Apollo 13 crew home safely). If, when planning the adoption of new technologies such as public cloud, we mix up reliability and resilience, we run a high risk of either over-engineering, to attempt to preserve service in the event of all imaginable scenarios, or under-preparing, and relying on our reliability measures in the face of events which overwhelm them.
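
One way to make this threshold concrete is a minimal sketch in Python. Everything in it is an illustrative assumption: the likelihood and impact figures are made-up placeholders rather than measurements, and the `RARE` and `SEVERE` thresholds, like the names `FailureMode` and `posture_for`, are hypothetical. It simply sorts failure modes into those we plan to ride through (reliability) and those we plan to survive (resilience).

```python
from dataclasses import dataclass
from enum import Enum


class Posture(Enum):
    RELIABILITY = "reliability"  # keep the service running through expected failures
    RESILIENCE = "resilience"    # survive a catastrophe, preserving what matters most


@dataclass(frozen=True)
class FailureMode:
    name: str
    annual_likelihood: float  # placeholder estimate, 0.0 to 1.0
    impact: int               # placeholder scale, 1 (minor) to 5 (catastrophic)


# Hypothetical thresholds; a real organisation would set its own.
RARE = 0.01
SEVERE = 4


def posture_for(mode: FailureMode) -> Posture:
    """Rare, severe events call for resilience; everything else for reliability."""
    if mode.annual_likelihood < RARE and mode.impact >= SEVERE:
        return Posture.RESILIENCE
    return Posture.RELIABILITY


# Made-up figures, for illustration only.
MODES = [
    FailureMode("server failure", 0.99, 1),
    FailureMode("zone outage", 0.2, 2),
    FailureMode("region outage", 0.02, 3),
    FailureMode("global platform loss", 0.001, 5),
]

for mode in MODES:
    print(f"{mode.name}: plan for {posture_for(mode).value}")
```

The value of writing it down this way is less the code than the conversation it forces: each failure mode has to be named, and someone has to decide on which side of the threshold it falls.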

There’s a lot more to say about this distinction: I’ll explore further next week.

(Views in this article are my own.)

Mamta Byakod

2 years ago

Amazing Metaphor!

Suresh Packiam

2 years ago

Good one David Knott ! Reliability and Resilience - excellent choice with explanation in simple terms. The questions you posed towards the end of the article are interesting. Hope orgs that deliver critical services take them up in their journey to cloud and have answers!

Rémy Fannader

Author of 'Enterprise Architecture Fundamentals', Founder & Owner of Caminao

2 years ago

Reliability looks inward, resilience looks outward https://caminao.blog/book-pick-evolution-resilience/

刘悠舒

Delivery Project Manager at EPAM, PMP, SSM, CLP, CSM

2 years ago

Good clarification, thanks David!

Vijaya Raghava Vuligundam

Delivery Head, 5x AWS, GCP Cloud Architect, Sun Certified Enterprise Architect, PMI-ACP, OCP.

2 years ago

that was an excellent article David !!!
