Managing Incident : Incident Handling

Gian Giovani

Bytedance RD Leader

Published: August 28, 2023

From time to time incidents will happen, especially in a complex system. Making sure that an incident can be handled well is as important as making the system more reliable. There are many discussions about incident handling, and Google's SRE book is an excellent reference. Hopefully, this article will enrich the discussion positively.

From what I have observed, there are two important aspects of incident management: the first is the cycle of an incident, and the second is the roles involved in incident handling. An organization needs to define and understand both to build a good incident-handling culture.

Cycle of Incident

There are typically 5 stages in the cycle of an incident. Let's discuss them one by one.

1. Incident declaration

Usually, incidents are declared following a triggered alert from monitoring tools. Some organizations even go to the extreme of declaring an incident for every alert in a certain category. Once an incident is declared, someone has to take charge as the incident handling leader; their role is to gather the related people in the "warroom". In most situations, I see that it is essential for the service owner and the infrastructure and DevOps teams to join, as in many organizations the infrastructure and DevOps teams are the ones who have knowledge of and access to important resources. I like to call them operational experts.

Related business counterparts can be notified early, even when the impact is still mostly unknown. This helps them anticipate queries from consumers. Someone needs to assume this communicator role; their task is to give timely updates to the stakeholders. When an incident is not that big or complex, the leader can also act as the communicator.

Important aspects of this stage are: 1) Failure to identify and declare an incident in time will prolong its impact. 2) Gathering the right people in the warroom is important to ensure smooth handling.
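The "declare an incident for every alert in a certain category" policy can be sketched as a simple rule. This is only an illustration: the `Alert` type, the `should_declare_incident` function, and the category names are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Alert:
    """A triggered alert (hypothetical shape, for illustration only)."""
    name: str
    category: str


# Assumed example categories where every alert becomes an incident.
AUTO_DECLARE_CATEGORIES: FrozenSet[str] = frozenset({"payment", "login"})


def should_declare_incident(
    alert: Alert,
    auto_categories: FrozenSet[str] = AUTO_DECLARE_CATEGORIES,
) -> bool:
    """Return True when the alert's category is on the auto-declare list."""
    return alert.category in auto_categories
```

Alerts outside the listed categories would still need a human decision on whether to declare an incident.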

2. Understanding the incident

Once the necessary people have been gathered, the next step is to understand what's happening and estimate the impact on the business or customers. Having proper access to monitoring and logging tools is important, and sometimes direct access to the resources proves useful. Lots of data and facts will be gathered and discussed at this stage. The leader's role is to make sure everyone is actively investigating data, looking up facts, and communicating their findings.

The goals of this stage are to understand what's going wrong and to estimate the scale of the impact. This information will be useful for planning user communication and formulating a stop-gap measure.

3. Stop the bleeding

In an emergency, a fast recovery is needed. This recovery may not restore the full capability or functionality of the system, but it will stop or minimize disruptions. A stop-gap measure needs to be taken before a proper solution can be delivered. The operational experts and the leader have to decide which stop-gap measures to take, which usually involve rolling back deployments, reloading a service, or disabling a feature. Unfortunately, some incidents may require complex, multi-step measures. The leader's role here is to keep the warroom focused and to minimize noise and pressure.

As the situation can be hectic and stressful, each step may need to be taken deliberately and reviewed by all the operational experts present. For example, the person executing a step can project their screen while typing the command and ask for confirmation before actually running it.
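The "show the command, then ask before running it" habit can also be built into tooling. Below is a minimal sketch, assuming a hypothetical `run_with_confirmation` helper; the `confirm` callback stands in for whatever review step the warroom agrees on, and `ask_operator` is just one possible interactive version.

```python
import shlex
import subprocess
from typing import Callable, List


def run_with_confirmation(
    command: List[str],
    confirm: Callable[[str], bool],
    run=subprocess.run,
):
    """Show the exact command to the reviewers and run it only if confirmed.

    Returns None when the command is rejected, otherwise whatever `run` returns.
    """
    printable = shlex.join(command)  # the text everyone in the warroom sees
    if not confirm(printable):
        return None
    return run(command, check=False)


def ask_operator(printable: str) -> bool:
    """Interactive confirmation: only an explicit 'yes' proceeds."""
    answer = input(f"About to run: {printable}\nType 'yes' to proceed: ")
    return answer.strip().lower() == "yes"
```

For example, `run_with_confirmation(["echo", "rollback"], ask_operator)` prints the command and waits for an explicit "yes" before executing it.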

4. Closing an incident

With the stop-gap measure complete and the problem mitigated, we are in the closing stage of incident handling. Here, all the known facts and data are gathered and documented in one place. At this stage there is typically less pressure, so we have more time to find the real root cause. The goal of this stage is to document findings and formulate action items. These action items ideally contain a proper solution to the issue and sometimes several actions to gather more data to validate assumptions. Good action items require a clear PIC (person in charge), a time box, and clear deliverables.
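The three requirements for a good action item can be checked mechanically before an incident is closed. The sketch below is one possible way to do it; the `ActionItem` structure and `missing_fields` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One follow-up task from an incident (hypothetical structure)."""
    title: str
    pic: str = ""                 # person in charge
    due: Optional[date] = None    # the time box
    deliverable: str = ""         # what "done" looks like


def missing_fields(item: ActionItem) -> List[str]:
    """List which of the three requirements the action item still lacks."""
    missing = []
    if not item.pic.strip():
        missing.append("pic")
    if item.due is None:
        missing.append("due")
    if not item.deliverable.strip():
        missing.append("deliverable")
    return missing
```

An incident document could then refuse to close while any action item returns a non-empty list.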

After that, the incident can be closed and the team can work on the action items.

5. RCA Postmortem sharing

After an incident has been resolved and a postmortem report has been written, a sharing session should be held. Preferably, such sessions are held frequently, as they are very important for building a learning culture in the organization. Here the postmortem will be discussed, and the action items will be validated by a wider audience.

In every stage of the incident, it is important to have a blameless mindset and to focus on the process, problem, and solution rather than on people. Mistakes are inevitable, so embracing them and keeping the right mindset will enable positive discourse and engagement.

To summarize our discussion so far:

Roles

Incident Handling Leader

Gathering related people
Making sure everyone involved has clear action items to do
Documenting an issue
Facilitating decision making

Communication Leader

Communicating incidents to related stakeholders
Giving frequent updates as needed to related stakeholders

Operational experts

Usually individual contributors (ICs), including the infrastructure team
Gathering data and facts
Executing incident handling plan

Cycle of Incident

1. Incident declaration

Communicate the incident
Gathering relevant people

2. Understanding the incident

Understanding what's going wrong
Gathering facts and data
Understanding the scale of impact

3. Stop the bleeding

Stop-gap measures that can be done to minimize impact instantly

Usually, it involves:

- stopping or rolling back a deployment
- reloading the application
- scaling up the deployment
- disabling features

4. Closing an incident

Monitoring stop-gap solution impact
Finding Root Cause (Root Cause Analysis)
Formulating action items (for further investigation and a proper solution)

5. RCA Postmortem sharing

Nurturing learning culture
Validating action items
