Technologists are always crying wolf (because of all the wolves)
The computer had failed. Unfortunately, it was the Apollo Guidance Computer (AGC), the machine that controlled the flight of a small, fragile spacecraft to the Moon and back. Fortunately, it wasn’t in space: it was on the ground, in a simulator.
Margaret Hamilton, the leader of the MIT team programming the AGC, often had to work weekends to meet the urgent schedule of the Apollo programme, and sometimes brought her daughter, Lauren, to work with her. Lauren liked to play in the simulator.
Somehow, while in simulated spaceflight, Lauren had caused the AGC to jettison all of its navigational data. When Hamilton investigated, she found that Lauren had told the computer to load program 01: the program that prepared the craft for launch. The computer did what it was told: it forgot all of the data about the simulated flight in progress, and reset as if it was sitting on the launchpad.
Hamilton realised that if the mission had been real, rather than a simulation, the Command Module would have been lost, drifting through space with no idea of where it was. She tried to persuade NASA to build safeguards and controls into the system, but they told her that they didn’t have time - and, besides, astronauts don’t make mistakes. All she could do was add a note to the manual: ‘Do not select program 01 during spaceflight.’
On the very next flight, an astronaut made a mistake. Jim Lovell was part of the crew of Apollo 8, the first mission to orbit the Moon. On the way back to Earth, after several days in cramped conditions with little sleep, he was entering star positions into the computer. He was supposed to enter the program number, 23, and then the number of the star whose position he wanted to record. On one of these cycles, though, instead of selecting program 23, he entered the number of the star first. It was number 01.
The computer behaved just as it had on the ground. It forgot all of its navigational data, and reset itself as if ready for launch. It took a tense half hour of manual observation, communication with Mission Control and careful data entry to reconstruct the data and bring the craft back under control - an experience Lovell would have again when he commanded Apollo 13.
NASA agreed to let Hamilton and her team build more error handling into the AGC. That error handling helped save the Moon landing when the computer became overloaded in the last minutes of Apollo 11’s descent.
This might seem like a cautionary tale from the early days of computing. Back then, it may have seemed reasonable that trained experts would not make mistakes and that computers would not go wrong. Today, surely, we know better.
And yet . . .
I believe that Hamilton’s experience is replicated today, in thousands - perhaps millions - of routine decisions about computer systems. Some of these decisions are deliberate and overt, but many more are passive and silent.
The deliberate decisions typically appear in the design and build phases of development. The architect asks the sponsor what level of availability they would like to have, and the sponsor naturally replies that they would like 100% availability. Then the architect shows them the cost, and they change their mind. Do we really need that level of redundancy? Do we really need to back up the data to a different location? And, as the system approaches launch and time runs short, they start to ask different questions. Do we really need to spend that much effort on testing? If it is coded properly, won’t it just work? The architect and the product manager try to explain everything that could go wrong, but it doesn’t seem real - unlike the time, money and resources which are leaking away.
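One way to make that availability conversation concrete is a rough back-of-the-envelope calculation like the sketch below. The targets shown are my own illustrative assumptions, not figures from any real project; the point is simply how quickly each extra 'nine' shrinks the downtime budget that the design has to meet.

```python
# Illustrative only: translate an availability target into the downtime it permits per year.
# The list of targets is an assumption chosen for the example.

HOURS_PER_YEAR = 365 * 24

def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a system may be down while still meeting the target."""
    return HOURS_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} availability -> {allowed_downtime_hours(target):7.2f} hours of downtime per year")
```

Even this toy calculation makes the trade-off visible: each additional 'nine' cuts the permitted downtime by a factor of ten, while the redundancy, testing and operational effort needed to achieve it tend to grow far faster than that.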
However, the most dangerous choices are those which are never spoken out loud. They are the implicit choices not to maintain currency, not to apply upgrades and patches, and not to sustain a team that can continuously improve a product. They are the choices which manifest in risk registers that slowly turn red but are never used to drive action. Why spend time, effort and resources on something which does not appear to be broken?
Our challenge is that the business sponsor’s reasonable instincts often appear to be right - for a time. Systems run for remarkably long periods without failing. Attacks and breaches - and their consequences - may not immediately be apparent, or may never come to light at all. Disasters rarely strike - and when they do, they most frequently take the form of unspectacular power and network failures rather than floods and fires. It is easy to see why many business sponsors come to believe that the technologists are crying wolf.
But the wolves are real. Jim Lovell and the crew of Apollo 8 were unlucky, but their bad fortune was good for the Apollo programme. If they had not shown that there really was a wolf in the cockpit, then the problems on Apollo 11 may not have been anticipated - and the first Moon landing would have ended very differently.
As technologists, it is our job to point out the wolves that other people can’t see: the errors and vulnerabilities in the code; the inevitability of hardware failure; and the consequences of disasters. To help our business sponsors see the wolves, we need to speak two languages.
First, we must speak the objective, quantitative language of risk management. Such language enables us to take rational decisions and make sensible compromises. It enables us to see that risk is a resource, just like time, money and people - and to figure out how to balance each of them.
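As a minimal sketch of what that quantitative language can look like, the classic expected-loss comparison puts a risk and its mitigation on the same footing as any other line in the budget. The risks, probabilities, impacts and mitigation costs below are invented purely for illustration:

```python
# A minimal sketch of expected-loss reasoning, with invented numbers.
# annual_probability: estimated chance the event occurs in a given year
# impact: estimated cost if it does
# mitigation_cost: yearly cost of the control that would address it

risks = [
    {"name": "unpatched library exploited", "annual_probability": 0.10, "impact": 2_000_000, "mitigation_cost": 150_000},
    {"name": "primary site power failure",  "annual_probability": 0.05, "impact":   800_000, "mitigation_cost": 120_000},
]

for r in risks:
    expected_loss = r["annual_probability"] * r["impact"]  # annualised loss expectancy
    decision = "mitigate" if expected_loss > r["mitigation_cost"] else "accept or revisit"
    print(f"{r['name']}: expected loss £{expected_loss:,.0f}/yr, "
          f"mitigation £{r['mitigation_cost']:,.0f}/yr -> {decision}")
```

The value is not in the precision of the numbers - they are estimates - but in making the compromise explicit, so that risk can be weighed against time, money and people rather than argued about in the abstract.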
Second, though, we must speak the language of stories. Numbers are powerful, but systems failures have real impacts on real lives. Explaining these impacts helps sponsors understand the consequences of their choices. There are many stories to tell - the story of how we once went to the Moon, and what we learnt on the way, is just one of them.
(Views in this article are my own.)