How to be an effective Major Incident Manager
Glyn Tomkins
IT Service Management and Information Security Operations Executive with over 30 years experience leading digital, IT and business excellence
As an ITIL 4 Master with more than 35 years of experience, I’ve had the opportunity to step into numerous major incident situations across various industries and customer environments.
I confidently assert that I can walk into any major incident scenario, regardless of the environment or technology involved, and rapidly become a productive Major Incident Commander, handling the situation competently and efficiently.
You might wonder how I can make such a bold claim without prior knowledge of the specific environment or technology. The answer lies in my approach: I base all my actions around asking a set of essential questions. These questions, combined with the structured framework of the PagerDuty Incident Response model, guide the entire incident management process. This dual approach ensures that no critical aspect is overlooked, and that the incident is resolved as quickly and effectively as possible.
So here is the list of questions which form the solid foundation for handling major incidents, as they focus on understanding the situation, coordinating efforts, and ensuring effective communication:
1. What is the business impact?
Why it matters: Understanding the business impact helps prioritize the incident based on its effect on the organization. It guides decision-making on resource allocation and urgency. Your leaders will also be asking about the business impact, so it is important to establish it very quickly.
2. What is the users’ experience?
Why it matters: Knowing the users' experience provides insights into the scope and severity of the incident. It helps in understanding how the issue is affecting end-users, which can drive the incident resolution strategy.
3. Do you need any other assistance/help/information/resources/skills?
Why it matters: This ensures that the incident team has everything they need to address the issue effectively. It prevents delays that could arise from lacking critical resources or information.
4. Who can answer that question?
Why it matters: Quickly identifying the right person or team to provide answers is crucial in fast-paced incident management scenarios. It reduces downtime caused by miscommunication or waiting for responses.
5. Who can get that person/team on the call?
Why it matters: Coordinating efforts in real-time is vital. Having the right people on the call ensures that decisions are informed and that actions can be taken promptly.
6. What is the working hypothesis?
Why it matters: Establishing a working hypothesis allows the team to focus their efforts on a potential cause of the incident. It drives the investigation and troubleshooting process. There may be more than one hypothesis, and if you can, you should investigate different hypotheses in parallel.
7. How long do you need to work on that?
Why it matters: Setting time expectations helps in managing the incident timeline and communicating progress to stakeholders. It also helps in assessing whether additional resources are needed to meet deadlines.
8. What could go wrong with the proposed plan to resolve the incident?
Why it matters: Assessing risks associated with the resolution plan helps in identifying potential pitfalls before they occur, allowing the team to prepare contingency plans.
9. How will we tell if the proposed action has been successful?
Why it matters: Defining success criteria ensures that everyone is aligned on what constitutes a resolved incident. It also provides a clear metric for when the incident can be closed.
10. Are there any strong objections to the proposed action?
Why it matters: Encouraging dissenting opinions helps in identifying overlooked risks or alternative solutions. It ensures that the resolution plan has been thoroughly vetted before implementation.
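For teams that keep their incident notes in a tool or script, the ten questions above could be tracked as a simple checklist. The following is only a minimal sketch: the `IncidentChecklist` class, its method names, and the sample answers are hypothetical illustrations, not part of the PagerDuty Incident Response model or any other product.

```python
from dataclasses import dataclass, field

# The ten questions from the list above, in order.
QUESTIONS = [
    "What is the business impact?",
    "What is the users' experience?",
    "Do you need any other assistance/help/information/resources/skills?",
    "Who can answer that question?",
    "Who can get that person/team on the call?",
    "What is the working hypothesis?",
    "How long do you need to work on that?",
    "What could go wrong with the proposed plan to resolve the incident?",
    "How will we tell if the proposed action has been successful?",
    "Are there any strong objections to the proposed action?",
]


@dataclass
class IncidentChecklist:
    """Hypothetical helper: records answers by question number (1-10)."""

    answers: dict[int, str] = field(default_factory=dict)

    def record(self, number: int, answer: str) -> None:
        # Reject numbers that do not correspond to a question in the list.
        if not 1 <= number <= len(QUESTIONS):
            raise ValueError(f"No question numbered {number}")
        self.answers[number] = answer

    def open_questions(self) -> list[str]:
        # Questions that nobody on the call has answered yet.
        return [
            f"{i}. {question}"
            for i, question in enumerate(QUESTIONS, start=1)
            if i not in self.answers
        ]


# Example usage with made-up answers (assumptions for illustration only).
checklist = IncidentChecklist()
checklist.record(1, "Online checkout is unavailable for all customers")
checklist.record(6, "A configuration change broke the payment service")

for question in checklist.open_questions():
    print(question)
```

Run during an incident call, a list like this shows at a glance which questions still need an owner.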
There are a hundred different ways of asking those questions, which will help you avoid sounding like a robot reading from a script. But if you base your game around them, you will look and sound like a seasoned veteran.
I'd love to hear which questions you think we should add to this list, and what your experience has been of using this framework of questions.
Let me know what other topics you are interested in learning about.
Image credit: Incident by Nick Youngson CC BY-SA 3.0 Pix4free
Incident Management
7 months ago: Agreed. This is a very useful approach and I utilize it on my major incidents.
IT Service Management Lead at QataEnergy LNG
7 months ago: Yes, these are very critical questions, and we need to get the answers to them as quickly as possible. But with a complex IT environment and the business's growing dependency on IT, along with the challenge of balancing business growth against service and environment stability, we have to develop a major incident model that integrates with the service catalog, disaster recovery plan, business continuity plan, CMDB, etc., and is supported by high-end monitoring and analysis technology. These capabilities will allow us to identify the root cause and the right people in a timely manner, reducing the impact by getting the issue resolved as soon as possible.
Agree. It is mostly about common sense: do you know what's not working correctly, who do we need to fix it, when do we need to fix it, and how do we keep the customer informed and aware that we know they are important? The incident manager isn't the person who fixes the issue (diagnose, isolate, resolve), but the one who ensures the right people are involved.