Managing Incidents: Incident Handling
From time to time incidents will happen, especially in a complex system. Making sure that an incident can be handled well is as important as making the system more reliable. There are many discussions about incident handling, and Google's SRE book is an excellent reference. Hopefully, this article will enrich the discussion.
From what I have observed, there are two important aspects of incident management: the first is the cycle of an incident, and the second is the roles involved in handling it. An organization needs to define and understand both to build a good incident-handling culture.
Cycle of Incident
There are typically 5 stages in an incident. Let's discuss them one by one.
1. Incident declaration
Usually, incidents are declared following a triggered alert from monitoring tools. Some organizations even go to the extreme of declaring an incident for every alert in a certain category. Once an incident is declared, someone has to take charge as the incident handling leader. Their role is to gather the relevant people in the "war room". In most situations, I see that it is essential for the service owner and the infrastructure and DevOps teams to join, because in many organizations the infrastructure and DevOps teams are the ones with knowledge of and access to important resources. I like to call them operational experts.
Related business counterparts can be notified of the incident early, even when the impact is still mostly unknown. This helps them anticipate queries from consumers. Someone needs to assume the communicator role, whose task is to give timely updates to the stakeholders. When an incident is not very big or complex, the leader can also act as the communicator.
The important aspects of this stage are: 1) Failing to identify and declare an incident in time will prolong its impact. 2) Gathering the right people in the war room is important to ensure smooth handling.
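The declaration rule described above can be written down so that it is applied consistently. The following is only a minimal sketch: the category names, the `Alert` fields, and the fallback rule for critical alerts are all assumptions made for illustration, not something a particular monitoring tool provides.

```python
from dataclasses import dataclass

# Hypothetical categories where every alert is declared as an incident.
AUTO_DECLARE_CATEGORIES = {"payments", "login"}


@dataclass
class Alert:
    name: str
    category: str
    severity: str  # assumed values: "warning" or "critical"


def should_declare_incident(alert: Alert) -> bool:
    """Return True if the alert should be declared as an incident.

    Assumed policy: always declare for listed categories,
    otherwise declare only when the alert is critical.
    """
    if alert.category in AUTO_DECLARE_CATEGORIES:
        return True
    return alert.severity == "critical"
```

Keeping the rule this explicit makes it easier to review which alerts lead to a declaration, and to adjust the list when it turns out to be too noisy or too quiet.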
2. Understanding the incident
Once the necessary people have been gathered, the next step is to understand what's happening and estimate the impact on the business or customers. Having proper access to monitoring and logging tools is important, and sometimes direct access to the resources proves useful. Lots of data and facts will be gathered and discussed at this stage. The leader's role is to make sure everyone is actively investigating data, looking up facts, and communicating their findings.
The goals of this stage are to understand what's going wrong and estimate the scale of impact. This information will be useful to strategize user communication and formulate a stop-gap measure.
3. Stop the bleeding
In an emergency situation, a fast recovery is needed. This recovery may not restore the full capability or functionality of the system, but it will stop or minimize disruptions. A stop-gap measure needs to be taken before a proper solution can be delivered. The operational experts and the leader have to decide which stop-gap measures to take, which usually involve rolling back deployments, reloading a service, or disabling a feature. Unfortunately, some incidents may require complex, multi-step measures. The leader's role here is to keep the war room focused and to minimize noise and pressure.
As the situation can be hectic and stressful, each step may need to be taken deliberately and reviewed by all operational experts present. For example, the person executing a step can project their screen while typing the command and ask for confirmation before actually running it.
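The "ask for confirmation before running" habit can also be supported by a small wrapper. This is a minimal sketch under stated assumptions: `run_with_confirmation` is a hypothetical helper name, and the command is a plain command line without shell features such as pipes.

```python
import shlex
import subprocess
from typing import Callable, Optional


def run_with_confirmation(
    command: str,
    confirm: Callable[[str], str] = input,
) -> Optional[subprocess.CompletedProcess]:
    """Show the command to the reviewers and run it only after an explicit 'yes'."""
    print(f"About to run: {command}")
    answer = confirm("Type 'yes' once the war room has reviewed it: ")
    if answer.strip().lower() != "yes":
        print("Aborted, nothing was executed.")
        return None
    # shlex.split avoids passing the string through a shell.
    return subprocess.run(
        shlex.split(command), capture_output=True, text=True, check=False
    )
```

The `confirm` parameter defaults to the built-in `input`, so during an incident the executor types the answer by hand; passing a different callable is mainly useful for rehearsals and tests.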
4. Closing an incident
With the stop-gap measure complete and the problem mitigated, we're in the closing stage of incident handling. Here all the known facts and data are gathered and documented in one place. At this stage there is typically less pressure, so we have more time to find the real root cause. The goal of this stage is to document findings and formulate action items. The action items ideally contain a proper solution to the issue, and sometimes several actions to gather more data to validate assumptions. A good action item requires a clear person in charge (PIC), a time box, and clear deliverables.
After that, the incident can be closed and the team can work on the action items.
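The three requirements for a good action item (a PIC, a time box, and a deliverable) can be checked mechanically before the incident is closed. The sketch below is only an illustration: the `ActionItem` class and its field names are assumptions, not part of any specific incident tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    pic: str = ""                # person in charge
    due: Optional[date] = None   # end of the time box
    deliverable: str = ""        # what "done" looks like

    def problems(self) -> List[str]:
        """List what is still missing before this item counts as well defined."""
        issues = []
        if not self.pic.strip():
            issues.append("missing PIC")
        if self.due is None:
            issues.append("missing time box")
        if not self.deliverable.strip():
            issues.append("missing deliverable")
        return issues
```

An item whose `problems()` list is empty has all three attributes; anything else is worth fixing while the people involved are still in the room.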
5. RCA Postmortem sharing
After an incident has been resolved and a postmortem report has been written, a sharing session should be held. Preferably, these sessions are held regularly, as they are very important for building a learning culture in the organization. Here the postmortem is discussed and the action items are validated by a wider audience.
In every stage of the incident, it is important to keep a blameless mindset and to focus on the process, problem, and solution rather than on people. Mistakes are inevitable, so embracing them and keeping the right mindset will enable positive discourse and engagement.
To summarize our discussion so far:
Roles
Incident Handling Leader
Communication Leader
Operational experts
Cycle of Incident
1. Incident declaration
2. Understanding the incident
3. Stop the bleeding
Usually, it involves:
- Stopping or rolling back a deployment
- Reloading the application
- Scaling up the deployment
- Disabling features
4. Closing an incident
5. RCA Postmortem sharing