How to Perform Incident Post-Mortems: Identify Root Cause with “Five Whys”

Even with careful design and extensive testing, incidents happen.

An incident happens any time software behaves differently than expected. It can be as simple as one user not being able to download a CSV file, or as severe as none of an application's users being able to log in.

“The important thing to realize is that failure is going to happen. It’s not a question of if, it’s a question of when.” (Paul Hammond, from his seminal 2009 talk with John Allspaw, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr.”)

In a DevOps culture, where continuous integration and deployment (CI/CD) of cloud-based services increase deployment frequency, the probability of incidents is high. It’s simply the price to pay for delivering value as often as possible. Consequently, the only realistic approach to decreasing downtime is to create an organized plan for incident response and management, and to perform incident post-mortems that identify the root cause.

Why You Should Invest in Incident Analysis

A production incident usually means the system is down for some time. And downtime means dissatisfied customers, damage to brand reputation, and revenue loss. For Fortune 1000 companies, the average total cost of unplanned application downtime is $1.25B to $2.5B per year.

But instead of panicking about the inevitability of incidents and their cost to your organization, try looking at them in a different light: incidents are learning opportunities.

Analyzing incidents reveals insights. It uncovers relevant information about your team’s processes and contributes to a better understanding of what is failing. Analyzing incidents therefore lays the foundation for preventing the same thing from happening again. You can learn from the past to build a better future.

If you feel sure about an incident’s cause, you might be tempted to skip a formal analysis. But following through with the process gives the rest of your team a picture as clear as yours, which improves their future contributions to the team and to customers. Sharing your findings with customers, in turn, builds trust.

Why You Should Figure Out The Root Cause of An Incident

In a fast-paced environment, tools like version control and continuous delivery make it easy to “undo” an incident. Often, incidents happen when a bug is pushed into production and rolling back that change can quickly revert the situation. While this is helpful for the teams and gets the service working correctly again in a short amount of time, it doesn’t provide any intelligence on why the incident happened in the first place.

In order to generate learning opportunities, any incident management process should include 5 phases:

1. Detection

It’s essential for any DevOps team to be prepared for eventual incidents. Identifying weaknesses in the system and setting up monitoring tools and system alerts helps team members know what to do when an incident is detected.
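As a minimal sketch of what a system alert can boil down to, the check below flags a window of recent HTTP responses when too many of them are server errors. The 5% threshold and the names `error_rate`, `should_alert`, and `ERROR_RATE_THRESHOLD` are assumptions made for illustration; they are not taken from any particular monitoring tool.

```python
from typing import Sequence

# Hypothetical threshold: alert when more than 5% of recent requests failed.
ERROR_RATE_THRESHOLD = 0.05


def error_rate(status_codes: Sequence[int]) -> float:
    """Return the fraction of responses that are server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def should_alert(status_codes: Sequence[int],
                 threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Decide whether the on-call engineer should be notified."""
    return error_rate(status_codes) > threshold


if __name__ == "__main__":
    # Illustrative window: 10 out of 100 recent requests failed.
    recent = [200] * 90 + [500] * 10
    print(should_alert(recent))
```

In practice the threshold would be tuned per service, and the check would run inside whatever monitoring system the team already uses.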

2. Response

DevOps teams usually have several members on call. If the on-call engineers cannot solve the issue themselves, they can escalate it and bring in the right people to help resolve the incident.

3. Resolution

This step is all about taking the necessary measures to fix the issue. It is when the problem is solved and the system goes back to functioning properly again.

4. Analysis

The analysis phase of incident management is often referred to as “post-mortem” or Root Cause Analysis (RCA).

While, historically, this phase has mostly been about performing RCA, as systems grow in complexity, teams increasingly look towards models that address complexity, such as the Cynefin framework.

5. Readiness

This is when the process comes full circle. Once an incident has been fixed and the system is restored, the team should reassess its readiness for the next incident. Ideally, everyone will be more prepared, having learned important lessons from the post-mortem that equip them to deal better with upcoming incidents.

How the “Five Whys” Can Help You Reach Root Cause

An effective way to perform this incident analysis is to identify the root cause of an incident by adopting the “Five Whys.” Start by asking why an incident happened and keep repeating the question “Why?” to each answer, five times in a row. Here is an example:

Q: What was the error?

A: I opened the website and got an error trying to log in with my Facebook account.

Q: Why did it return an error? (first Why)

A: The Facebook login hasn’t worked since the release on February 1.

Q: Why not? (second Why)

A: The API key for our Facebook login was incorrect.

Q: Why was the API key for the Facebook login incorrect? (third Why)

A: One configuration setting was incorrect on LIVE, meaning that Facebook rejected our login requests in their API.

Q: Okay, how could that happen?

A: The wrong configuration file was copied from the developer’s workstation without anybody noticing that it was the wrong file.

Q: Why didn’t anybody notice? (fourth Why)

A: We don’t test before a release that the configuration files are the right ones for our environment.

Q: Why don’t you test the configuration files prior to a release? (fifth Why)

A: There is no process in place to ensure that only valid configuration files are copied into production.

When we say “Five Whys,” this is not a strict number. Sometimes you’ll reach the root cause with only three “Whys”; other times you might need seven. The point is to keep asking until you reach the root cause. Only by gaining this knowledge will you be able to correct the problem. This is how you learn and improve incident management in your organization.
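The question-and-answer chain above can also be recorded as a small data structure, so that it ends up in the post-mortem document rather than only in someone's memory. The following is a minimal sketch; the class names `WhyStep` and `FiveWhys` and their methods are hypothetical and chosen for this illustration. It deliberately allows any number of steps, not exactly five.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WhyStep:
    question: str
    answer: str


@dataclass
class FiveWhys:
    problem: str
    steps: List[WhyStep] = field(default_factory=list)

    def ask(self, question: str, answer: str) -> None:
        """Record one question and the answer the team agreed on."""
        self.steps.append(WhyStep(question, answer))

    def root_cause(self) -> Optional[str]:
        """Treat the most recent answer as the current root-cause candidate."""
        return self.steps[-1].answer if self.steps else None

    def to_text(self) -> str:
        """Render the chain as plain text, e.g. for a wiki page."""
        lines = [f"Problem: {self.problem}"]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"Why {number}: {step.question}")
            lines.append(f"  Because: {step.answer}")
        return "\n".join(lines)


if __name__ == "__main__":
    analysis = FiveWhys("Users get an error when logging in with Facebook.")
    analysis.ask("Why did it return an error?",
                 "The Facebook login hasn't worked since the February 1 release.")
    analysis.ask("Why not?",
                 "The API key for our Facebook login was incorrect.")
    analysis.ask("Why didn't anybody notice?",
                 "We don't test configuration files before a release.")
    print(analysis.to_text())
    print("Root-cause candidate:", analysis.root_cause())
```

Treating the last answer as a candidate rather than a verdict matches the point above: the team keeps asking until it is satisfied that it has reached the root cause.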

How to Perform A Post-Mortem

As mentioned, the analysis phase of incident management is often called an incident post-mortem. If this is not something your teams are used to doing, introducing post-mortems to an organization can be challenging. To avoid it turning into a game of blaming and pointing fingers, there are some guidelines to follow:

Stay Away from Finger Pointing

This is the most crucial rule to follow when performing an incident post-mortem. Focusing on finding the guilty people and blaming them for what happened causes more harm than good. Instead, focus on making sure that the whole team learns from the incident and performs better next time.

Appoint a Dedicated Incident Lead

Appoint a dedicated lead whose focus is making sure a post-mortem is performed for each incident. Having one person responsible for handling the incident from start to finish makes it more likely that all important details are captured in the post-mortem, which contributes to its success.

Share Your Findings

Document the knowledge acquired with each post-mortem in a way that is accessible to the whole organization, for example in a wiki page. This way, every team member and, consequently, the organization as a whole, will benefit from the lessons learned.

Start Small

If you don’t have a culture of post-mortems in your organization yet, start with small steps. Not all incidents are equal. They are usually put into categories of low, moderate, and severe. The categorization is mostly related to the impacted functionality, the number of users affected, and the duration of downtime. Naturally, start by analyzing severe incidents, as these cause the most damage to your organization and your customers. As the post-mortem culture becomes ingrained in your organization, you can proceed to analyze moderate and low-severity incidents.
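To make the categorization concrete, here is a minimal sketch that sorts an incident into low, moderate, or severe from the three factors just mentioned. The thresholds (50% and 10% of users, 60 minutes of downtime) and the names `Incident` and `classify` are assumptions for illustration only; every organization will draw these lines differently.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    core_functionality_impacted: bool  # e.g. users cannot log in at all
    affected_user_share: float         # fraction of users affected, 0.0 to 1.0
    downtime_minutes: int


def classify(incident: Incident) -> str:
    """Return "severe", "moderate", or "low" using illustrative thresholds."""
    # Hypothetical rule: core feature broken for at least half of all users.
    if incident.core_functionality_impacted and incident.affected_user_share >= 0.5:
        return "severe"
    # Hypothetical rule: any outage lasting an hour or more.
    if incident.downtime_minutes >= 60:
        return "severe"
    # Hypothetical rule: a core feature is touched, or 10% or more of users are hit.
    if incident.core_functionality_impacted or incident.affected_user_share >= 0.1:
        return "moderate"
    return "low"


if __name__ == "__main__":
    nobody_can_log_in = Incident(True, 1.0, 30)
    one_csv_download_fails = Incident(False, 0.001, 5)
    print(classify(nobody_can_log_in))
    print(classify(one_csv_download_fails))
```

With a rule like this in place, "start small" becomes a simple filter: run post-mortems only for incidents classified as severe at first, then widen the filter over time.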


Conclusion

DevOps culture leads to more frequent deployments and, with them, more incidents. You could therefore ask why you would even want to adopt DevOps. Maybe, instead, you should just do larger deployments less often?

Depending on the type of service you provide, this could be a better option. But not in all cases. Either way, don’t be afraid to fail. Be afraid of not learning from your organization’s failures. Incidents will always happen, but if you do your post-mortems right, your teams will come out of each new incident stronger and better equipped to handle the next one.


By Søren Pedersen
