How to Perform Incident Post-Mortems: Identify Root Cause with “Five Whys”

Søren Pedersen

Published: November 12, 2021

Even with careful design and extensive testing—incidents happen.

An incident happens any time software behaves differently than expected. It can be as simple as one user not being able to download a CSV file, or as severe as no user of an application being able to log in.

“The important thing to realize is that failure is going to happen. It’s not a question of if, it’s a question of when.” (Paul Hammond, in his seminal 2009 talk with John Allspaw, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr”)

In a DevOps culture, where continuous integration and deployment (CI/CD) of cloud-based services increase deployment frequency, the probability of incidents is high. It’s just a price to pay for delivering value as often as possible. Consequently, the only realistic approach to decrease downtime is to create an organized plan for incident response and management by performing incident post-mortems to identify root cause.

Why You Should Invest in Incident Analysis

A production incident usually means the system is down for some time. But downtime means dissatisfied customers, damage to brand reputation, and revenue loss. For the Fortune 1000 companies, the average total cost of unplanned application downtime is $1.25B to $2.5B per year.

But instead of panicking about the inevitability of incidents and their cost to your organization, try looking at them in a different light: incidents are learning opportunities.

Analyzing incidents reveals insights. It uncovers relevant information about your team’s processes and contributes to a better understanding of what’s failing. Analyzing incidents therefore lays the foundation for preventing the same thing from happening again. You can learn from the past to build a better future.

If you feel sure about an incident’s cause, you might feel tempted to skip a formal analysis. But following through with the process will allow other people in your team to have a picture as clear as yours, therefore impacting their future contribution to the team and customers. Sharing your findings with the customers then builds trust.

Why You Should Figure Out The Root Cause of An Incident

In a fast-paced environment, tools like version control and continuous delivery make it easy to “undo” an incident. Often, incidents happen when a bug is pushed into production and rolling back that change can quickly revert the situation. While this is helpful for the teams and gets the service working correctly again in a short amount of time, it doesn’t provide any intelligence on why the incident happened in the first place.

In order to generate learning opportunities, any incident management process should include 5 phases:

1. Detection

It’s essential for any DevOps team to be prepared for eventual incidents. Identifying weaknesses in the system and setting up monitoring tools and system alerts help team members know what to do when an incident is detected.

2. Response

DevOps teams usually have several members on call and available for escalations. If the on-call engineers cannot solve the issue, they can escalate it and bring in the right people to help resolve the incident.

3. Resolution

This step is all about taking the necessary measures to fix the issue. It is when the problem is solved and the system goes back to functioning properly again.

4. Analysis

The analysis phase of incident management is often referred to as “post-mortem” or Root Cause Analysis (RCA).

While, historically, this phase has mostly been about performing RCA, as systems grow in complexity, teams increasingly look towards models that address complexity, such as the Cynefin framework.

5. Readiness

This is when the process comes full circle. Once an incident has been fixed and the system is restored, the team should reassess its readiness for the next incident. Ideally, everyone will be more prepared and will have learned important lessons from the post-mortem that equip them to deal better with upcoming incidents.

How the “Five Whys” Can Help You Reach Root Cause

An effective way to perform this analysis is to identify the root cause of an incident by using the “Five Whys.” Start by asking why the incident happened, then ask “Why?” again of each answer, five times in a row. Here is an example:

Q: What was the error?

A: I opened the website and got an error trying to log in with my Facebook account.

Q: Why did it return an error? (first Why)

A: The Facebook login hasn’t worked since the release on February 1.

Q: Why not? (second Why)

A: The API key for our Facebook login was incorrect.

Q: Why was the API key for the Facebook login incorrect? (third Why)

A: One configuration setting was incorrect on LIVE, meaning that Facebook rejected our login requests in their API.

Q: Okay, how could that happen?

A: The wrong configuration file was copied from the developer’s workstation without anybody noticing that it was the wrong file.

Q: Why didn’t anybody notice? (fourth Why)

A: We don’t test before a release that the configuration files are the right ones for our environment.

Q: Why don’t you perform regression tests prior to a release? (fifth Why)

A: There is no process in place to ensure that only valid configuration files are copied into production.

When we say “Five Whys,” this is not a strict number. Sometimes you’ll reach the root cause with only three “Whys”; other times you might need seven. The point is to keep asking until you reach the root cause. Only by gaining this knowledge will you be able to correct the problem. This is how you learn and improve incident management in your organization.
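The root cause in the example above is a missing check on configuration files before a release. A minimal sketch of what such a check could look like follows. It assumes the configuration is a JSON object with an `environment` field and a `facebook_api_key` field. Those field names, the `REQUIRED_KEYS` set, and the `validate_config` function are all hypothetical and would need to be adapted to your own setup.

```python
import json

# Hypothetical keys every config must define; adjust to your own application.
REQUIRED_KEYS = {"environment", "facebook_api_key"}


def validate_config(raw_json: str, expected_environment: str) -> list[str]:
    """Return a list of problems found in a config; an empty list means it passed."""
    try:
        config = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"config is not valid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["config must be a JSON object"]

    problems = []
    for key in sorted(REQUIRED_KEYS - set(config)):
        problems.append(f"missing required key: {key}")

    # Catch a config written for one environment being shipped to another.
    actual = config.get("environment")
    if actual is not None and actual != expected_environment:
        problems.append(
            f"config is for '{actual}', expected '{expected_environment}'"
        )
    return problems


if __name__ == "__main__":
    # A developer-workstation config accidentally headed for LIVE.
    workstation_config = '{"environment": "dev", "facebook_api_key": "dev-key"}'
    for problem in validate_config(workstation_config, "live"):
        print(problem)
```

Run as a step in the release pipeline, a check like this could block the deployment whenever the returned list is not empty.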

How to Perform A Post-Mortem

As mentioned, the analysis phase of incident management is often called an incident post-mortem. If this is not something your teams are used to doing, introducing post-mortems to an organization can be challenging. To avoid it turning into a game of blaming and pointing fingers, there are some guidelines to follow:

Stay Away from Finger Pointing

This is the most crucial rule to follow when performing an incident post-mortem. Focusing on finding the guilty people and blaming them for what happened causes more harm than good. Instead, focus on making sure that the whole team learns from the incident and performs better next time.

Appoint a Dedicated Incident Lead

Appoint a dedicated lead whose focus is making sure a post-mortem is held for each incident. Having one person responsible for handling the incident from start to finish makes it more likely that all important details are captured in the post-mortem, which contributes to its success.

Share your Findings

Document the knowledge acquired with each post-mortem in a way that is accessible to the whole organization, for example in a wiki page. This way, every team member and, consequently, the organization as a whole, will benefit from the lessons learned.
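One way to keep findings easy to share is to record every post-mortem in the same structure and render it as text for the wiki. The sketch below is only an illustration: the `PostMortem` class, its fields, and the output layout are assumptions, not a prescribed template.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """Hypothetical record of one incident analysis."""
    title: str
    summary: str
    whys: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)
    actions: list[str] = field(default_factory=list)

    def add_why(self, question: str, answer: str) -> None:
        self.whys.append((question, answer))

    @property
    def root_cause(self) -> str:
        # Convention used in this sketch: the last answer is the root cause.
        return self.whys[-1][1] if self.whys else "not yet identified"

    def to_text(self) -> str:
        """Render the record as plain text suitable for pasting into a wiki."""
        lines = [f"Post-mortem: {self.title}", "", f"Summary: {self.summary}", ""]
        for number, (question, answer) in enumerate(self.whys, start=1):
            lines.append(f"Why {number}: {question}")
            lines.append(f"  {answer}")
        lines += ["", f"Root cause: {self.root_cause}", "", "Follow-up actions:"]
        lines += [f"  - {action}" for action in self.actions]
        return "\n".join(lines)


if __name__ == "__main__":
    report = PostMortem(
        title="Facebook login failure",
        summary="Users could not log in with Facebook after a release.",
    )
    report.add_why("Why did login return an error?", "The API key was incorrect.")
    report.add_why("Why was it incorrect?", "The wrong config file reached LIVE.")
    report.actions.append("Validate configuration files before each release.")
    print(report.to_text())
```

Keeping the questions and answers together in order preserves the reasoning chain, so readers who were not involved can follow how the team arrived at the root cause.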

Start Small

If you don’t have a culture of post-mortems in your organization yet, start with small steps. Not all incidents are equal. They are usually put into categories of low, moderate, and severe. The categorization mostly depends on the impacted functionality, the number of users affected, and the duration of downtime. Naturally, start by analyzing severe incidents, as these cause the most damage to your organization and your customers. As the post-mortem culture becomes ingrained in your organization, you can proceed to analyze moderate and low-severity incidents.
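The three criteria above can be turned into a simple triage rule. The following sketch shows one possible way to do that. The `Incident` fields, the `classify_severity` function, and every threshold in it (50% and 10% of users, 60 minutes of downtime) are illustrative assumptions that each organization would have to set for itself.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    core_functionality_affected: bool  # e.g. login, as opposed to a CSV export
    users_affected_fraction: float     # share of users affected, 0.0 to 1.0
    downtime_minutes: int


def classify_severity(incident: Incident) -> str:
    """Return 'severe', 'moderate', or 'low' using illustrative thresholds."""
    # Assumed rule: core functionality broken for at least half of all users.
    if incident.core_functionality_affected and incident.users_affected_fraction >= 0.5:
        return "severe"
    # Assumed rule: any outage of an hour or more is severe.
    if incident.downtime_minutes >= 60:
        return "severe"
    if incident.core_functionality_affected or incident.users_affected_fraction >= 0.1:
        return "moderate"
    return "low"


if __name__ == "__main__":
    nobody_can_log_in = Incident(True, 1.0, 30)
    one_failed_csv_download = Incident(False, 0.001, 5)
    print(classify_severity(nobody_can_log_in))
    print(classify_severity(one_failed_csv_download))
```

With a rule like this, a team that is just starting out could hold post-mortems only for incidents classified as severe and widen the scope later.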

Conclusion

DevOps culture leads to more frequent deployments, and with them more incidents. You could therefore ask why you would even want to adopt DevOps. Maybe, instead, you should just do larger deployments less often?

Depending on the type of service you provide, this could be a better option. But not in all cases. Either way, don’t be afraid to fail. Be afraid of not learning from your organization’s failures. They will always happen, but if you do your post-mortems right, your teams will come out of each new incident stronger and better equipped to handle the next one.
