Tech without us: Why there wasn’t an outage today

What if everyone left?

In “The World Without Us”, author Alan Weisman explored what would happen to the planet if humans were to suddenly disappear.

Within a few short days the power grid and infrastructure would start to fail. New York’s Subway would flood following the failure of pumps controlling groundwater. Within weeks, the planet’s 400+ nuclear power stations would start to melt down, creating lakes of radioactive lava rendering the surrounding areas uninhabitable to most remaining species for centuries.

Over millennia, the planet would recover, eventually thriving, but it’s fair to say that it’d be a bumpy ride.

Meanwhile, back at the office…

How long might your technology systems continue to run if supporting staff left?

There’s no need for a cataclysm, rapture or lava lake. A badly managed org restructure, or an unfortunate breakdown in vacation planning would do it.

How long?

A day? A week? A month?

Anyone taking this question seriously might glance at their recent MTBF (mean time between failures) figures, perhaps gaining confidence from the historical rarity of serious incidents. Infrequent downtime is great news: it tells us that the socio-technical system, comprising the people, their relationships, the technology, and the people’s relationships to the technology, is, at the very least, keeping the lights on under prevailing market conditions.

But how “hands-off” exactly are those lights? Are they confidently shining day and night, week by week, month by month? Or do they require the constant supervision of a team of skilled people, attentive to every flicker: prodding, probing and administering treatment as if the lights were on life support?

Keeping the lights on

It’s likely that your systems are more similar to the latter than the former. That’s not to say you’re doing badly; it’s just to recognise that, more often than not, complex systems run in degraded mode. This makes intuitive sense. We’re all familiar with the bugs in production, the ticking time bombs, the “explodey bits” and the systems that need close supervision and frequent intervention to stop them going bang.

However, these continual efforts are easy to miss. David Woods put it best in his Law of Fluency, which states:

“Well-adapted cognitive work occurs with a facility that belies the difficulty of resolving demands and balancing dilemmas. The adaptation process hides the factors and constraints that are being adapted to or around.”

In other words, your people are good. Your people are so good that their critical activities to keep systems running frequently aren’t noticed, and even if you did notice, they’d look like nothing! And what do you do when your systems stay up and you didn’t notice that anything was wrong? That’s right, nothing! And therefore, what do you learn…?

Every once in a while, in spite of best efforts, systems will fail, customers will be impacted, perhaps a root cause analysis will take place and learning will commence. The commitment to learning is likely proportional to the impact of the incident, with the greatest commitment reserved for the gravest of impacts. This is totally understandable but ultimately undesirable if the goal is learning, reliable systems and uptime.

Resilience as a verb

Erik Hollnagel, one of the founders of resilience engineering, stated that “Resilience is something you do, not something you have”. David Woods refined this idea further, reframing the word “resilience” as a verb rather than a noun.

We often colloquially think of ‘resilience’ as a synonym for ‘robustness’ or ‘reliability’, but whereas these nouns represent static properties of a system, resilience is more dynamic, concerning how a system adapts in the face of strain or adversity. If you think of resilience in this way, you can start asking the question, “What resilience are we doing?” Simply asking this question gets one thinking about all the activities, visible and hidden, that are occurring every day to nurture the adaptability required to flex around the inevitable organisational challenges that are happening all the time.

If you can surface what’s actually going on, rather than keeping it hidden, you have half a chance of improving your capacity to adapt, and that will serve you well when systems become stretched.

So what kind of things might you do to encourage this way of thinking?

Here are some ideas:

  • Study near misses
  • Talk to practitioners rather than just managers; practitioners really know what’s going on
  • Approach low impact incidents with the same commitment to learning as high impact ones
  • Run periodic resilience retrospectives, especially when serious incidents haven’t occurred
  • Use on-call handovers to surface interesting and uninteresting things that happened
  • Make friends with your customer services people. They’ll tell you about issues that never appeared in your observability systems
  • Give folks with an interest in resilience engineering the opportunity to rotate around different teams, sharing their knowledge and learning from others
  • Document incidents as compelling stories that people will want to read or listen to and share
  • Implement processes that facilitate fast, safe change such as continuous delivery
  • Create measurements/KPIs that encourage the reporting of issues rather than the suppression of issues (e.g. don’t have a ‘number of incidents’ KPI where low=good and high=bad as you’ll end up with more incidents but fewer reports)
  • Practise incident response

What else? We’re interested to hear what you do.

Here’s to the Humans

Regardless of where such activities occur in your organisation, these activities ARE your resilience in action. What’s more, they’re mostly human in nature. Yes, you’ve doubtless got technological redundancy, failover and fault tolerance, but resilience is in your people and it’s probably hidden.

In these times when cost is under scrutiny and faith is being placed in automation and AI, it’s more important than ever to discover and recognise the vital role your staff play in your organisation’s resilience.

So with that in mind, how long might your technology systems continue to run if supporting staff left?

Maybe not as long as you’d think.
