Managing Incident : Incident Handling

From time to time incidents will happen, especially in a complex system. Making sure that an incident can be handled well is as important as making the system more reliable. There are many discussions about incident handling, and Google's SRE book is an excellent reference. Hopefully, this article will enrich the discussion positively.

From what I have observed, there are two important aspects of incident management: the first is the cycle of an incident, and the second is the roles involved in incident handling. An organization needs to define and understand both to have a good incident-handling culture.

Cycle of Incident

There are typically 5 stages in an incident. Let's discuss them one by one.

1. Incident declaration

Usually, incidents are declared following a triggered alert from monitoring tools. Some organizations even go to the extreme of declaring an incident for every alert in a certain category. Once an incident is declared, someone has to take charge as the incident handling leader, whose role is to gather the related people in the "warroom". In most situations, I see that it is essential for the service owner and the infrastructure and DevOps teams to join, since in many organizations the infrastructure and DevOps teams are the ones who have the knowledge of and access to important resources. I like to call them operational experts.
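The "declare an incident for every alert in a certain category" policy can be sketched in a few lines. This is only an illustration, not any particular tool's API: the `Alert` and `Incident` types, the `AUTO_DECLARE_CATEGORIES` set, and the `declare_if_needed` function are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical categories that always warrant an incident; tune per organization.
AUTO_DECLARE_CATEGORIES = {"availability", "data-loss"}


@dataclass
class Alert:
    name: str
    category: str
    service: str


@dataclass
class Incident:
    title: str
    service: str
    leader: str
    declared_at: datetime
    responders: list[str] = field(default_factory=list)


def declare_if_needed(
    alert: Alert, on_call: str, experts: list[str]
) -> Optional[Incident]:
    """Return a new Incident when the alert's category requires one, else None."""
    if alert.category not in AUTO_DECLARE_CATEGORIES:
        return None
    return Incident(
        title=f"{alert.name} on {alert.service}",
        service=alert.service,
        leader=on_call,  # the on-call person takes charge as leader
        declared_at=datetime.now(timezone.utc),
        responders=[on_call, *experts],  # people to pull into the warroom
    )
```

In this sketch the on-call person becomes the leader by default. A real setup may pick the leader differently, for example through a rotation.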

Related business counterparts can be notified early, even when the impact is still mostly unknown. This helps them anticipate queries from consumers. Someone needs to assume this communicator role, and their task is to give timely updates to the stakeholders. When an incident is not that big or complex, the leader can also act as the communicator.

Important aspects of this stage are: 1) Failure to identify and declare an incident in time will prolong its impact. 2) Gathering the right people in the warroom is important to ensure smooth handling.

2. Understanding the incident

Once the necessary people have been gathered, the next step is to understand what's happening and estimate the impact on the business or customers. Having proper access to monitoring and logging tools is important, and sometimes direct access to the resources proves useful. Lots of data and facts will be gathered and discussed at this stage. The leader's role is to make sure everyone is actively investigating data, looking up facts, and communicating their findings.

The goals of this stage are to understand what's going wrong and estimate the scale of impact. This information will be useful to strategize user communication and formulate a stop-gap measure.

3. Stop the bleeding

In an emergency situation, a fast recovery is needed. This recovery may not restore the full capability or functionality of the system, but it will stop or minimize disruptions. A stop-gap measure needs to be taken before a proper solution can be delivered. The operational experts and the leader have to decide which stop-gap measures to take, which usually involve rolling back deployments, reloading a service, or disabling a feature. Unfortunately, some incidents may require complex, multi-step measures. The leader's role here is to keep the warroom focused and to minimize noise and pressure.

As the situation can be hectic and stressful, each step may need to be taken in a measured way and reviewed by all the operational experts present. For example, the person executing a step can project their screen while typing the command and ask for confirmation before actually running it.
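The "confirm before running" habit can also be backed by a small helper. The following is a minimal sketch, assuming the command is a plain program invocation with no shell features. The function name `run_with_confirmation` and its `confirm` parameter are hypothetical, chosen for this example.

```python
import shlex
import subprocess
from typing import Callable, Optional


def run_with_confirmation(
    command: str,
    confirm: Callable[[str], str] = input,
) -> Optional[subprocess.CompletedProcess]:
    """Show the command, and run it only after an explicit 'yes'."""
    print(f"About to run: {command}")
    answer = confirm("Type 'yes' once the warroom has reviewed it: ")
    if answer.strip().lower() != "yes":
        print("Skipped.")
        return None
    # The command is split into arguments and run without a shell, so no
    # expansion or command chaining happens beyond what was reviewed.
    return subprocess.run(shlex.split(command), capture_output=True, text=True)
```

The executor projects the terminal, the helper prints the exact command, and nothing runs until someone has agreed to it. Passing `confirm` as a parameter keeps the helper easy to test without a keyboard.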

4. Closing an incident

With the stop-gap measure complete and the problem mitigated, we're in the closing stage of incident handling. Here all the known facts and data are gathered and documented in one place. At this stage there is typically less pressure, so we have more time to find the real root cause. The goal of this stage is to document findings and formulate action items. The action items ideally include a proper solution to the issue and sometimes several actions to gather more data to validate assumptions. Good action items require a clear PIC (person in charge), a time box, and clear deliverables.
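The three requirements for a good action item map naturally onto a small record type. This is a sketch under assumptions: the `ActionItem` class and its field and method names are hypothetical, and the time box is modelled simply as a due date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActionItem:
    description: str
    pic: str          # person in charge
    due: date         # end of the time box
    deliverable: str  # what "done" looks like

    def problems(self) -> list[str]:
        """List what is missing for this item to count as well-formed."""
        issues = []
        if not self.pic.strip():
            issues.append("no PIC assigned")
        if not self.deliverable.strip():
            issues.append("no clear deliverable")
        return issues

    def is_overdue(self, today: date) -> bool:
        """True once the time box has passed."""
        return today > self.due
```

A check like `problems()` could be run over the postmortem's action items before the incident is closed, so that items without an owner or a deliverable are caught early.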

After that, the incident can be closed and the team can work on the action items.

5. RCA Postmortem sharing

After an incident has been resolved and a postmortem report has been written, a sharing session should be held. Preferably this is done frequently, as it is very important for building a learning culture in the organization. Here the postmortem is discussed, and the action items are validated by a wider audience.


In every stage of the incident, it is important to keep a blameless mindset and focus on the process, problem, and solution rather than on people. Mistakes are inevitable, so embracing them and keeping the right mindset will enable positive discourse and engagement.

To summarize our discussion so far:

Roles

Incident Handling Leader

  • Gathering related people
  • Making sure everyone involved has clear action items to do
  • Documenting an issue
  • Facilitating decision making

Communication Leader

  • Communicating incidents to related stakeholders
  • Giving frequent updates as needed to related stakeholders

Operational experts

  • Usually individual contributors (ICs), including the infrastructure team
  • Gathering data and facts
  • Executing incident handling plan

Cycle of Incident

1. Incident declaration

  • Communicating the incident
  • Gathering relevant people

2. Understanding the incident

  • Understanding what's going wrong
  • Gathering facts and data
  • Understanding the scale of impact

3. Stop the bleeding

  • Stop-gap measures that can be done to minimize impact instantly

Usually, this involves:

  • Stopping or rolling back a deployment
  • Reloading the application
  • Scaling up the deployment
  • Disabling features

4. Closing an incident

  • Monitoring stop-gap solution impact
  • Finding Root Cause (Root Cause Analysis)
  • Formulating action item (for further investigation and proper solution)

5. RCA Postmortem sharing

  • Nurturing learning culture
  • Validating action items

