Discover the ONE Thing You Can Do to Avoid Future Incidents

Uptime Labs

The world’s first realistic incident drill platform

Published: 24 July 2024

Peer into the post-incident toolkit of many a self-respecting Incident Manager and you’ll undoubtedly find a collection of Root Cause Analysis (RCA) tools. You’ll likely find fishbone diagrams, fault trees and an assortment of “whys” (which by convention tend to come in sets of 5). Such tools are wielded in pursuit of the root cause, which lurks hidden beneath more proximate causes, offering the tantalising promise of long-term, systemic fixes rather than shallow “sticking plaster” remedies.

Root cause analysis is hungry business, however, and there’s nothing better for one’s post-RCA recovery than a cheese sandwich. I personally favour Emmental, but any cheese with holes in it will do. Such moments of dairy- and carb-fuelled downtime offer a satisfying opportunity to ponder the Swiss Cheese Model of Accident Causation.

The Swiss Cheese Model, coined by Psychology Professor James T Reason, imagines systems and their defences as multiple slices of Swiss cheese, stacked on top of each other. Each slice contains holes, representing risks, vulnerabilities or flaws, but for a threat (accidental or intentional) to pass through all the layers and endanger the system as a whole, the holes in each layer need to line up. Most holes in one layer will lie on top of a solid, cheesy barrier, protecting the system from failure.

Each layer has holes, but it’s unlikely anything’s going to get through this stack…
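To put a rough number on that intuition, here is a minimal sketch in Python. It is not part of Reason’s model as described above: it assumes each layer lets through a fixed fraction of threats and that the layers fail independently of one another, which is a simplification. The function name `breach_probability` and the 10% figures are hypothetical, chosen only for illustration.

```python
from math import prod


def breach_probability(hole_fractions):
    """Chance that one threat passes every layer.

    ASSUMPTION: layers are independent, so the chance of the holes
    lining up is the product of each layer's hole fraction.
    """
    for fraction in hole_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each hole fraction must be between 0 and 1")
    return prod(hole_fractions)


# Hypothetical stack: four slices, each letting 10% of threats through.
layers = [0.1, 0.1, 0.1, 0.1]

single_slice = breach_probability(layers[:1])  # one slice on its own
full_stack = breach_probability(layers)        # all four slices stacked

print(f"one slice:   {single_slice:.4f}")
print(f"four slices: {full_stack:.4f}")
```

Under these assumptions a single slice lets through one threat in ten, while the stack of four lets through roughly one in ten thousand. Real layers are rarely independent, since shared conditions can open holes in several slices at once, so the product is likely to flatter the stack.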

The model serves as an appetising reminder that catastrophe requires multiple failures – single-point failures are not enough. Even if an incident appears on the surface to be caused by a single-point failure, e.g. a hardware failure or ‘human error’, it’s rare that such events are the single, solitary ‘cause’ in a system that is impervious in every other aspect. What led the hardware failure to result in an outage? What system properties led to the human error?

Of course, this is the precise purpose of root cause analysis: to peer beyond proximate causes to find the fundamental ‘root’ of the problem. The issue with this tends to be twofold:

Root cause analysis efforts seldom go deep enough.
There is rarely, if ever, a single root. Rather, there are multiple contributing factors.

A Swiss cheese metaphor is useful here too. Imagine a typical triangular wedge of cheese. It has a sharp end and a blunt end. The sharp end represents the events leading directly to the incident (that hardware failure again). In pathological or bureaucratic organisations, such events typically receive a painful poke from the bony finger of blame. The blunt end, on the other hand, represents system attributes such as culture, policies, procedures, resources and constraints. Examples include management style, hiring policies, training, performance management, salaries, working hours, pressure etc. Events at the sharp end emerge from the conditions at the blunt end. It’s also worth noting that events at the sharp end feed back into attributes at the blunt end, sometimes quickly (knee-jerk management actions) and sometimes more slowly.

It’s an unfortunate reality that many root cause analysis efforts following failure tend to stop near the sharp end (she did this, they failed to do that, process x did the other). In contrast, most efforts to understand the contributing factors of success tend to land at the blunt end (it was well managed, it was resourced well, the processes were just divine!). While this tendency says more about common motivations behind RCA than the practice itself, even when motivations are benevolent, the idea that a single cause can be found is misguided.

That’s not to say that root cause analysis should be consigned to the bin. When done effectively, RCA can surface the myriad sharp- and blunt-end conditions that created the disposition or propensity for the holes to align, resulting in an outage or incident. The ‘root cause’ is not really what you’re looking for anyway. You’re looking for the contributing factors (plural), such that you can learn from events and nudge your improvement efforts in the right direction.

So the next time someone asks, “What’s the ONE THING we can do to avoid future incidents?”, maybe your answer should be, “Eat a cheese sandwich”.

References

This blog draws repeatedly on Dr Richard Cook’s paper How Complex Systems Fail. This very short, readable paper isn’t specifically about resilience in technology, but if you read it, you’ll notice that it may as well have been.

Yoav Chudnoff

Digital Transformation Strategist | Business Development

7 months ago

This is a great article by Stuart Rimell at IG Group. There is no doubt that Root Cause Analysis (RCA) is essential for incident managers aiming to achieve long-term, systemic fixes rather than short-term remedies. Tools like fishbone diagrams and fault trees help uncover underlying issues. However, RCA often doesn't delve deep enough, and there isn't a single root cause but multiple contributing factors. James T. Reason’s Swiss Cheese Model shows that incidents result from multiple failures aligning, not just single points of failure. Remember: A proactive approach and continued training will always be the least expensive option.
