The Most Common Incident Management Problems
The Semicircle of Doom
You know the story. The incident is kicking off, the clock is ticking, all hands are on deck, and the focussed, purposeful wheels of the Incident Response playbook are in motion. Soon, however, the audience arrives. Senior managers swoop in one by one, like ravens onto a telephone wire, searching for scraps of information to allay their increasing anxiety.
Meanwhile, the incident responders struggle to think while the impotent stares of the highly paid onlookers burn holes in their concentration. Get rid of them.
Anyone who’s not playing an active role in the incident response should be gone, with the expectation that they’ll be regularly informed.
I need a hero
I’m holding out for a hero ‘til the end of the night. They’ve gotta be strong and they’ve gotta be fast and they’d better not be on holiday when the incident happens or we’re doomed…
Every organisation has their heroes, their rockstars, their ninjas, and we love them and appreciate them dearly. But to depend on them entirely is to invite failure.
Make sure you have a breadth of experience and skills on your team so you can have confidence in your incident response without risking a single point of failure. Even better, find those true heroes who are generous and selfless with their knowledge and can help grow your strength in depth.
Failing to keep stakeholders informed
Failing to communicate with your stakeholders is a sure way to invite a Semicircle of Doom. Keep the ravens in their nest by setting an expectation of regular, timely communication and sticking to your promise.
Usual Suspects
It’s a network issue. It’s always a network issue. It’s definitely a network issue. Not so fast. Common problems are common, and case-based reasoning is a powerful diagnostic tool, but jumping to conclusions without evidence is a form of premature convergence that can result in a long journey down the wrong diagnostic rabbit hole. What’s the evidence? What does the data say?
It’s been a while…
The last major incident was back in 2017. Since then, we’ve grown complacent, overly confident that our resiliency efforts have made us invincible. Then, suddenly, we’re reminded that incidents can occur, and we’re not prepared.
Incident response skills need to be embedded into muscle memory, and if you don’t use them, you lose them. How are you keeping your incident response capability sharp so you’re ready at all times? Uptime Labs can help with that.
How big?
Failing to effectively size the issue is the first opportunity for your incident response to slip off the rails.
Is the issue minor, perhaps non-customer-impacting or affecting a non-critical feature, or is it major? Who’s affected? Everyone or a specific segment? Global or local? Degradation or outage?
Your assessment of the size and severity of the issue will impact your response and communication strategy so get it right.
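The sizing questions above can be written down as a small decision rule, so the first assessment is made the same way every time. The following is only a minimal sketch: the three severity tiers, the field names, and the thresholds are all assumptions made for illustration, and your own incident protocol should define the real ones.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical tiers; the article only distinguishes minor from major.
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Impact:
    customer_facing: bool    # are customers affected at all?
    critical_feature: bool   # is a critical feature involved?
    everyone_affected: bool  # everyone, or only a specific segment?
    is_global: bool          # global or local?
    full_outage: bool        # outage, or degradation?


def size_incident(impact: Impact) -> Severity:
    """Map the sizing questions to a severity (assumed rule of thumb)."""
    # Not customer impacting, or only a non-critical feature: minor.
    if not impact.customer_facing or not impact.critical_feature:
        return Severity.MINOR
    # A full outage that hits everyone or every region: assumed top tier.
    if impact.full_outage and (impact.everyone_affected or impact.is_global):
        return Severity.CRITICAL
    # Anything else that touches customers on a critical feature: major.
    return Severity.MAJOR


if __name__ == "__main__":
    degraded_segment = Impact(
        customer_facing=True,
        critical_feature=True,
        everyone_affected=False,
        is_global=False,
        full_outage=False,
    )
    print(size_incident(degraded_segment).value)
```

A rule like this is a starting point for the conversation, not a replacement for judgement. The responder should still revisit the severity as new evidence arrives.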
To fail to plan is to plan to fail
It’s said that no plan survives first contact with the enemy, and while this may be true, that’s not to say that planning isn’t critically important.
Your incident protocol or playbook is your “break glass here” action plan or checklist that will help you to get the basics done on autopilot, leaving your conscious brain free to work with the agility that a complex, emerging incident scenario demands.
How well established is your incident response playbook within your team?
Authority Through Seniority
So you’ve just got off the phone with the CTO and she’s convinced that the incident’s caused by a load balancer issue.
You feel like you have to focus your triage in this direction due to the respect you have for the CTO’s seniority. Be careful.
Outside opinions provided at a distance may be useful and they’re worth listening to, but the basic principles of effective triage remain. What’s your evidence? What does the data say?
You’re in charge of this incident and the seniority of an opinion should make no difference to how valid it is.
Get to the point
The incident bridge isn’t the place for your life story; leave that for the retrospective. Your job on a communications bridge is to keep the signal-to-noise ratio as high as possible. You can use helpful acronyms such as C.A.N to help focus your communication: