Technologists are always crying wolf (because of all the wolves)
The computer had failed. Unfortunately, it was the Apollo Guidance Computer (AGC), the machine that controlled the flight of a small, fragile spacecraft to the Moon and back. Fortunately, it wasn’t in space: it was on the ground, in a simulator.
Margaret Hamilton, the leader of the MIT team programming the AGC, often had to work weekends to meet the urgent schedule of the Apollo programme, and sometimes brought her daughter, Lauren, to work with her. Lauren liked to play in the simulator.
Somehow, while in simulated spaceflight, Lauren had caused the AGC to jettison all of its navigational data. When Hamilton investigated, she found that Lauren had told the computer to load program 01: the program that prepared the craft for launch. The computer did what it was told: it forgot all of the data about the simulated flight in progress, and reset as if it was sitting on the launchpad.
Hamilton realised that if the mission had been real, rather than a simulation, the Command Module would have been lost, drifting through space with no idea of where it was. She tried to persuade NASA to build safeguards and controls into the system, but they told her that they didn’t have time - and, besides, astronauts don’t make mistakes. All she could do was add a note to the manual: ‘Do not select program 01 during spaceflight.’
On the very next flight, an astronaut made a mistake. Jim Lovell was part of the crew of Apollo 8, the first mission to orbit the Moon. On the way back to Earth, after several days in cramped conditions with little sleep, he was entering star positions into the computer. He was supposed to enter the program number, 23, and then the number of the star whose position he wanted to record. On one of these cycles, though, instead of selecting program 23, he entered the number of the star first. It was number 01.
The computer behaved just as it had on the ground. It forgot all of its navigational data, and reset itself as if ready for launch. It took a tense half hour of manual observation, communication with Mission Control and careful data entry to reconstruct the data and bring the craft back under control - an experience Lovell would have again when he commanded Apollo 13.
NASA agreed to let Hamilton and her team build more error handling into the AGC. That error handling helped save the Moon landing when the computer became overloaded in the last minutes of Apollo 11’s descent.
This might seem like a cautionary tale from the early days of computing. Back then, it may have seemed reasonable that trained experts would not make mistakes and that computers would not go wrong. Today, surely, we know better.
And yet . . .
I believe that Hamilton’s experience is replicated today, in thousands - perhaps millions - of routine decisions about computer systems. Some of these decisions are deliberate and overt, but many more are passive and silent.
The deliberate decisions typically appear in the design and build phases of development. The architect asks the sponsor what level of availability they would like to have, and the sponsor naturally replies that they would like 100% availability. Then the architect shows them the cost, and they change their mind. Do we really need that level of redundancy? Do we really need to back up the data to a different location? And, as the system approaches launch and time runs short, they start to ask different questions. Do we really need to spend that much effort on testing? If it is coded properly, won’t it just work? The architect and the product manager try to explain everything that could go wrong, but it doesn’t seem real - unlike the time, money and resources which are leaking away.
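One way to make that availability conversation concrete is a rough back-of-the-envelope calculation like the sketch below. The targets shown are my own illustrative assumptions, not figures from any real project; the point is simply how quickly each extra 'nine' shrinks the downtime budget that the design has to meet.

```python
# Illustrative only: translate an availability target into the downtime it permits per year.
# The list of targets is an assumption chosen for the example.

HOURS_PER_YEAR = 365 * 24

def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a system may be down while still meeting the target."""
    return HOURS_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} availability -> {allowed_downtime_hours(target):7.2f} hours of downtime per year")
```

Even this toy calculation makes the trade-off visible: each additional 'nine' cuts the permitted downtime by a factor of ten, while the redundancy, testing and operational effort needed to achieve it tend to grow far faster than that.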
However, the most dangerous choices are those which are never spoken out loud. They are the implicit choices not to maintain currency, not to apply upgrades and patches, and not to sustain a team that can continuously improve a product. They are the choices which manifest in risk registers that slowly turn red but are never used to drive action. Why spend time, effort and resources on something which does not appear to be broken?
Our challenge is that the business sponsor’s reasonable instincts often appear to be right - for a time. Systems run for remarkably long periods without failing. Attacks and breaches - and their consequences - may not immediately be apparent, or may never come to light at all. Disasters rarely strike - and when they do, they most frequently take the form of unspectacular power and network failures rather than floods and fires. It is easy to see why many business sponsors come to believe that the technologists are crying wolf.
But the wolves are real. Jim Lovell and the crew of Apollo 8 were unlucky, but their bad fortune was good for the Apollo programme. If they had not shown that there really was a wolf in the cockpit, then the problems on Apollo 11 may not have been anticipated - and the first Moon landing would have ended very differently.
As technologists, it is our job to point out the wolves that other people can’t see: the errors and vulnerabilities in the code; the inevitability of hardware failure; and the consequences of disasters. To help our business sponsors see the wolves, we need to speak two languages.
First, we must speak the objective, quantitative language of risk management. Such language enables us to take rational decisions and make sensible compromises. It enables us to see that risk is a resource, just like time, money and people - and to figure out how to balance each of them.
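As a minimal sketch of what that quantitative language can look like, the classic expected-loss comparison puts a risk and its mitigation on the same footing as any other line in the budget. The risks, probabilities, impacts and mitigation costs below are invented purely for illustration:

```python
# A minimal sketch of expected-loss reasoning, with invented numbers.
# annual_probability: estimated chance the event occurs in a given year
# impact: estimated cost if it does
# mitigation_cost: yearly cost of the control that would address it

risks = [
    {"name": "unpatched library exploited", "annual_probability": 0.10, "impact": 2_000_000, "mitigation_cost": 150_000},
    {"name": "primary site power failure",  "annual_probability": 0.05, "impact":   800_000, "mitigation_cost": 120_000},
]

for r in risks:
    expected_loss = r["annual_probability"] * r["impact"]  # annualised loss expectancy
    decision = "mitigate" if expected_loss > r["mitigation_cost"] else "accept or revisit"
    print(f"{r['name']}: expected loss £{expected_loss:,.0f}/yr, "
          f"mitigation £{r['mitigation_cost']:,.0f}/yr -> {decision}")
```

The value is not in the precision of the numbers - they are estimates - but in making the compromise explicit, so that risk can be weighed against time, money and people rather than argued about in the abstract.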
Second, though, we must speak the language of stories. Numbers are powerful, but systems failures have real impacts on real lives. Explaining these impacts helps sponsors understand the consequences of their choices. There are many stories to tell - the story of how we once went to the Moon, and what we learnt on the way, is just one of them.
(Views in this article are my own.)