What could possibly go wrong?

Are technology people particularly short-sighted? The story of the millennium bug seems to say so. In case you’re not familiar with it (and, although it loomed large in my life, I have to remember that it was over twenty years ago now), the millennium bug was caused by people like me building computer systems which only used two digits to store the year. This seemed like a great way of saving storage and memory in the 1970s and 1980s, but less so when the millennium loomed, and we realised that we were going to need a bigger date. It took millions of people, hours and dollars to fix things so that systems carried on working on 1st January 2000 (and if anyone tries to tell you that the whole thing was a hoax, try asking someone who worked on a millennium project or ran tests that night).
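
To make the two-digit trap concrete, here is a tiny sketch of my own (in Python, not code from any real system) showing why a simple 'is this date in the past?' check breaks at the century boundary:

    # Toy illustration: a two-digit year makes a simple "is this in the past?"
    # check fail as soon as the century rolls over.

    def is_in_the_past(stored_yy: int, current_yy: int) -> bool:
        """Compare two-digit years the way many old systems did."""
        return stored_yy < current_yy

    # In 1999, a record stamped 97 is correctly seen as older than 99.
    print(is_in_the_past(97, 99))   # True

    # On 1st January 2000, a record stamped 99 is compared with 00 and
    # suddenly looks like it is in the future.
    print(is_in_the_past(99, 0))    # False, even though 1999 has passed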

I think that the story of the millennium bug is not a story of short-sightedness: rather it is a reminder that we are still at the early stages of integrating computers into our society. From the perspective of 2022, it seems obvious that systems will run for decades and that they need to be capable of handling all future dates. From the perspective of the 1970s and 1980s, those systems were brand new, and it seemed certain that they would have limited lifespans. Surely nobody would still be running that code twenty years later!

This article is a short epilogue to a series I recently wrote to address the round trip question: what happens when you press ‘send’ on the mobile banking app on your phone? That series told the story of how things are supposed to work. This article considers what happens when things go wrong, even when we have no idea how things will go wrong in the future.

I believe that responsible technologists must plan for failure, and, furthermore, that they must plan for failure in two distinct ways.

They must plan for the failures that they know may happen. There are well-documented and well-understood ways in which computer systems may fail, ranging from hardware failure, to software faults, to malicious cyber attacks, to catastrophic environmental failures which destroy entire facilities. And there are well-established ways to deal with these types of failure. We deploy systems on multiple machines. We test software and have rollback mechanisms and incident processes. We erect network defences, monitoring regimes and many other security controls. And we place systems in physically separated sites, so that even floods, fires and explosions cannot stop them working. A major part of the cost of building a computer system is not just getting it to work: it’s ensuring that it keeps on working, even when bad things happen.
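
As one small, concrete example of designing for a known failure, here is an illustrative sketch of my own (the service and site names are invented) of a caller that tries each replica of a service in turn, rather than depending on a single machine:

    # My own illustrative sketch: one common defence against known failures is
    # to run a service in several places and try each replica in turn. The
    # site names are hypothetical.

    REPLICAS = ["site-a", "site-b", "site-c"]
    DOWN = {"site-a"}          # pretend this site has suffered an outage

    def call_service(replica: str, payload: dict) -> dict:
        """Stand-in for a real network call to one copy of the service."""
        if replica in DOWN:
            raise ConnectionError(f"{replica} is unreachable")
        return {"handled_by": replica, "ok": True}

    def call_with_failover(payload: dict) -> dict:
        last_error = None
        for replica in REPLICAS:
            try:
                return call_service(replica, payload)
            except ConnectionError as err:
                last_error = err   # note the failure and move to the next site
        raise RuntimeError("all replicas failed") from last_error

    print(call_with_failover({"action": "transfer", "amount": 10}))
    # {'handled_by': 'site-b', 'ok': True}

The same idea generalises to rollback, health checks and site failover: the system carries on even when one of its parts does not.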

And yet, as the story of the millennium bug shows, this is not enough. We do not just need to protect ourselves against the types of failure we know about: we need to protect ourselves against the failures we don’t know about. These come from building technology today which will run in the future, when the environment will be very different.

Let me illustrate this with another example. A common form of attack today is known as ‘crypto-jacking’. The motivation for this attack is the increasing price of bitcoin, and the increasing quantity of computational power required to ‘mine’ bitcoin. I won’t attempt to explain bitcoin mining here (that might be the subject of another article). For now it’s enough to say that the phenomenon of cryptocurrency has made it very lucrative for attackers to take control of computing resources, particularly when somebody else is paying the bill. So, these attackers find various ways to penetrate computer networks and steal processing power - a bit like someone stealing electricity from the grid (another thing that bitcoin miners sometimes do).

The point of this example is not just that technologists should defend themselves against crypto-jacking. Rather, the point is that it is hard to see how, in the early days of computing and networking, anyone could have predicted this particular way of things going wrong. Crypto-jackers don’t typically take over mainframes, but their theft of resources may disrupt mainframe processing. Back in the 1980s, when some of that mainframe code was being written, the developers couldn’t have predicted that, one day, their system would be part of a global public network, connected to billions of devices all over the world, some of them operated by people who wanted to steal resources in order to generate units of something that they regarded as a whole new currency, based on exotic mathematics.

When we built the systems that suffered from the millennium bug, we were naive about how long our systems would last. Today, we still can’t predict the future, but we cannot deny that our code must learn to survive in that future.

So, what do we do? How do we build systems that can cope with types of failure we can’t even imagine today? First, of course, we must accept that things will go wrong, and that designing for when things go wrong is at least as important as designing for when things go right. Second, we can make our systems much more independent and resilient. The trend towards smaller, more self-reliant systems has been going on for a long time, and is generally regarded as good software design and development practice.

Back when I started my programming career, we wrote ‘suites’ of programmes: programmes that knew their place in a strict sequence and relied on the programmes earlier in the sequence to do their jobs perfectly. Such suites could be efficient, but they were fragile: small errors could bring everything down. Furthermore, they typically ran on a single machine in a single location.

Well-designed modern systems are much more self-contained and encapsulated: they present a tightly constrained interface to the world (whether that interface is used by humans or machines), and they check that everyone and everything that talks to them is allowed to do so. They trust nothing and no-one. They have no ties to the machines that they happen to be running on, and they run across many machines in many locations at the same time. They assume that any piece of infrastructure could fail, that any interaction could be malicious, and that any other system they talk to might not be available. This approach to building systems may seem paranoid, but paranoia is warranted when the environment is always changing.
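
To make that posture concrete, here is an illustrative sketch of my own, built around a hypothetical payments handler (the caller list, token check and ledger call are assumptions, not any real system’s design). It authenticates the caller, validates the input, and assumes the downstream system may not be available:

    # My own illustrative sketch of the "trust nothing" posture described
    # above, built around a hypothetical payments handler. The caller list,
    # token check and ledger call are all assumptions, not a real design.

    from dataclasses import dataclass

    ALLOWED_CALLERS = {"mobile-app", "branch-terminal"}

    @dataclass
    class Request:
        caller: str
        token: str
        amount: int

    def token_is_valid(token: str) -> bool:
        """Stand-in for a real credential check (e.g. verifying a signed token)."""
        return token.startswith("signed:")

    def post_to_ledger(req: Request) -> str:
        """Stand-in for a call to a downstream system that might time out."""
        return f"ledger-entry-{req.caller}-{req.amount}"

    def handle_transfer(req: Request) -> dict:
        # 1. Is this caller allowed to talk to us at all?
        if req.caller not in ALLOWED_CALLERS:
            return {"status": "rejected", "reason": "unknown caller"}
        # 2. Has the caller proved who they are?
        if not token_is_valid(req.token):
            return {"status": "rejected", "reason": "bad credentials"}
        # 3. Is the input itself sensible?
        if req.amount <= 0:
            return {"status": "rejected", "reason": "invalid amount"}
        # 4. Assume the downstream system may be unavailable; fail safely.
        try:
            return {"status": "accepted", "ledger_ref": post_to_ledger(req)}
        except TimeoutError:
            return {"status": "deferred", "reason": "ledger unavailable, queued for retry"}

    print(handle_transfer(Request("mobile-app", "signed:abc", 100)))

The details are unimportant; what matters is that every step assumes the caller, the input or the downstream system might be wrong.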

The unpredictable nature of the future in the computing age may seem disconcerting, but it is a good thing. It comes from our creativity and ingenuity, coupled with the power that computing gives us: we don’t know what we will invent tomorrow. But it also brings new ways that things can go wrong. This places even more responsibility on technologists: we don’t just have a duty to explain, we have a duty to design and build things that will survive failures we can’t predict.

(Views in this article are my own.)

James Linsell-Fraser

Principal Industry Architect

2y

The best bit about the year 2000 bug was when they paid me triple time to sit outside the computer room for many hours, ready to do 'something' if there were an issue at midnight!

Barry O'Reilly

Founder at Black Tulip Technology

2y

You might find my research on residuality theory interesting: a complexity-science-based approach to designing software for unknown futures.

Phil Starrett

Technology Executive, and Transformational Business Leader; Chief Technology Officer | Chief Architect | Chief Digital Officer

2y

Maybe it’s not about predicting the future (past lessons have taught us that doesn’t always work, none more so than the last two years with the pandemic); creating the future is maybe a better way of looking at it. The data landscape is changing, whereby leveraging augmented and connected intelligence is leading the way (IMHO), as organisations evolve through the stages from reactive, to predictive, to prescriptive analytics.

Anton J. Coetzee - ABCP, BEM, CBA, DTM, EEng, FA

Board and Senior Leadership Member, Strategic and Tactical Resiliency and BCM Consultant and Public Speaker. A Crisis Leader - saving Time and Money... Mastering Chaos with Strategy, Actions and Deeds... NOT just Words.

2y

Great article... You looked at and "spoke" about all that could go wrong... But in my mind you missed one crucial element: that of human error, unintentional or deliberate. Thoughts?

David Martin

More Work Done, Same Staff – Automate Boring Work – RPA & AI - Productivity by Automation - Software Robots

2y

Some good points David Knott. I think one aspect not covered is the rate of technology change. I believe that as technology change has accelerated, the useful lifetime of the code being developed is getting shorter. I am not suggesting quality should be ignored, but for me the lack of investment in change was the problem with the millennium bug: all of the expenditure came in 1999, to address the need for minimal data sizes in early programme developments. Similarly, the big mainframe systems that still exist in some banks, air traffic control, etc. are historic challenges which newer financial institutions, for example, have avoided by using newer technological approaches while still delivering the same services.
