Tech without us: Why there wasn’t an outage today

Uptime Labs

The world’s first realistic incident drill platform

Published: June 12, 2024

What if everyone left?

In “The World Without Us”, author Alan Weisman explored what would happen to the planet if humans were to suddenly disappear.

Within a few short days the power grid and infrastructure would start to fail. New York’s Subway would flood following the failure of pumps controlling groundwater. Within weeks, the planet’s 400+ nuclear power stations would start to melt down, creating lakes of radioactive lava rendering the surrounding areas uninhabitable to most remaining species for centuries.

Over millennia, the planet would recover, eventually thriving, but it’s fair to say that it’d be a bumpy ride.

Meanwhile, back at the office…

How long might your technology systems continue to run if supporting staff left?

There’s no need for a cataclysm, rapture or lava lake. A badly managed org restructure or an unfortunate breakdown in vacation planning would do it.

How long?

A day? A week? A month?

Anyone taking this question seriously might glance at their recent MTBF (mean time between failures) figures, perhaps gaining confidence from the historical infrequency of serious incidents. Infrequent downtime is great news: it tells us that the socio-technical system, comprising the people, their relationships, the technology, and the people’s relationships to the technology, is, at the very least, keeping the lights on under prevailing market conditions.
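For readers who want to see how little goes into such a figure, here is a minimal sketch in Python. It assumes a simplified reading of MTBF as the average gap between consecutive incident start times and ignores repair time; the function name and the incident dates are hypothetical, invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(incident_starts: list[datetime]) -> timedelta:
    """Average gap between consecutive incident start times.

    Simplifying assumption: repair time is ignored, so this measures
    start-to-start gaps rather than pure uptime.
    """
    if len(incident_starts) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    ordered = sorted(incident_starts)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incident log: three serious incidents in 2024.
incidents = [
    datetime(2024, 1, 10),
    datetime(2024, 3, 10),
    datetime(2024, 5, 9),
]
mtbf = mean_time_between_failures(incidents)
print(f"MTBF: {mtbf.days} days")
```

Note what the input contains: only the incidents that were recorded. Every intervention that prevented an incident is absent from the list, so it cannot move the number.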

But how “hands-off” exactly are those lights? Are they confidently shining day and night, week by week, month by month? Or do they require the constant supervision of a team of skilled people, attentive to every flicker: prodding, probing and administering treatment as if the lights were on life support?

It’s likely that your systems are more similar to the latter than the former. That’s not to say you’re doing badly; it’s just to recognise that, more often than not, complex systems run in degraded mode. This makes intuitive sense. We’re all familiar with the bugs in production, the ticking time bombs, the “explodey bits” and the systems that need close supervision and frequent intervention to stop them going bang.

However, these continual efforts are easy to miss. David Woods put it best in his Law of Fluency, which states:

“Well-adapted cognitive work occurs with a facility that belies the difficulty of resolving demands and balancing dilemmas. The adaptation process hides the factors and constraints that are being adapted to or around.”

In other words, your people are good. Your people are so good that their critical activities to keep systems running frequently aren’t noticed, and even if you did notice, they’d look like nothing! And what do you do when your systems stay up and you didn’t notice that anything was wrong? That’s right, nothing! And therefore, what do you learn…?

Every once in a while, in spite of best efforts, systems will fail, customers will be impacted, perhaps a root cause analysis will take place and learning will commence. The commitment to learning is likely proportional to the impact of the incident, with the greatest commitment reserved for the gravest of impacts. This is totally understandable but ultimately undesirable if the goal is learning, reliable systems and uptime.

Resilience as a verb

Erik Hollnagel, one of the founders of resilience engineering, stated that “Resilience is something you do, not something you have”. David Woods refined this idea further, reframing the word “resilience” as a verb rather than a noun.

We often colloquially think of ‘resilience’ as a synonym for ‘robustness’ or ‘reliability’, but whereas these nouns represent static properties of a system, resilience is more dynamic, concerning how a system adapts in the face of strain or adversity. If you think of resilience in this way, you can start asking the question, “What resilience are we doing?” Simply asking this question gets one thinking about all the activities, visible and hidden, that are occurring every day to nurture the adaptability required to flex around the inevitable organisational challenges that are happening all the time.

If you can surface what’s actually going on, rather than keeping it hidden, you have half a chance of improving your capacity to adapt, and that will serve you well when systems become stretched.

So what kind of things might you do to encourage this way of thinking?

Here are some ideas:

Study near misses
Talk to practitioners rather than just managers; practitioners really know what’s going on
Approach low impact incidents with the same commitment to learning as high impact ones
Run periodic resilience retrospectives, especially when serious incidents haven’t occurred
Use on-call handovers to surface interesting and uninteresting things that happened
Make friends with your customer services people. They’ll tell you about issues that never appeared in your observability systems
Give folks with an interest in resilience engineering the opportunity to rotate around different teams, sharing their knowledge and learning from others
Document incidents as compelling stories that people will want to read or listen to and share
Implement processes that facilitate fast, safe change such as continuous delivery
Create measurements/KPIs that encourage the reporting of issues rather than the suppression of issues (e.g. don’t have a ‘number of incidents’ KPI where low=good and high=bad as you’ll end up with more incidents but fewer reports)
Practice incident response

What else? We’re interested to hear what you do.

Here’s to the Humans

Regardless of where such activities occur in your organisation, these activities ARE your resilience in action. What’s more, they’re mostly human in nature. Yes, you’ve doubtless got technological redundancy, failover and fault tolerance, but resilience is in your people and it’s probably hidden.

In these times when cost is under scrutiny and faith is being placed in automation and AI, it’s more important than ever to discover and recognise the vital role your staff play in your organisation’s resilience.

So with that in mind, how long might your technology systems continue to run if supporting staff left?

Maybe not as long as you’d think.
