
How do you handle complex or escalated incidents that require root cause analysis and remediation?

Powered by AI and the LinkedIn community


1 Define the incident scope and impact

The first step in handling a complex or escalated incident is to define its scope and impact: identify the affected services, users, locations, and processes, and assign a severity and priority. Gather as much information as you can about the symptoms, timeline, and triggers, and record it in a ticketing system or incident management tool. A shared record makes it easier to communicate status to stakeholders and establishes a common understanding of the problem.
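
The scoping details described above can be captured in a structured record. The following is a minimal sketch, not a prescribed schema: the field names, the three-point `Level` scale, and the `priority_for` mapping are illustrative assumptions. Impact and urgency matrices of this kind are common in service management practice, but the exact mapping varies by organization.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Level(IntEnum):
    """Three-point scale used for both impact and urgency (assumed)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority_for(impact: Level, urgency: Level) -> int:
    """Map impact and urgency to a priority from 1 (highest) to 5 (lowest).

    Hypothetical matrix: the sum of the two levels minus one.
    """
    return int(impact) + int(urgency) - 1


@dataclass
class Incident:
    summary: str
    affected_services: list[str]
    affected_users: int
    locations: list[str]
    impact: Level
    urgency: Level
    symptoms: list[str] = field(default_factory=list)
    # Timeline entries as (timestamp, note) pairs, oldest first.
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    @property
    def priority(self) -> int:
        return priority_for(self.impact, self.urgency)

    def log(self, note: str) -> None:
        """Append a timestamped note to the incident timeline."""
        self.timeline.append((datetime.now(timezone.utc), note))


# Made-up example values, for illustration only.
incident = Incident(
    summary="Checkout requests time out",
    affected_services=["checkout-api"],
    affected_users=1200,
    locations=["EU"],
    impact=Level.HIGH,
    urgency=Level.MEDIUM,
)
incident.log("First alert received from monitoring")
print(incident.priority)
```

Deriving the priority from impact and urgency, rather than storing it by hand, keeps the value consistent when either input is revised as the scope becomes clearer.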


Calvin Kramer

PA GE Healthcare
I would very often go to the owners of said information store, network parts, team leaders, or managers who would put a team to task immediately. Or, I would formulate a team to drill down to the root cause, then again target the owner of said fault. IT problems still just boil down to people not getting their 1's and 0's. Incident management was often just taking time away from resolution, so often the incidents were written up after resolution because reporting properly was key to making sure the true cause of the problem was identified and hopefully not repeated.


2 Form an incident response team

The next step is to form an incident response team to coordinate the investigation and resolution. Include representatives from the relevant IT functions, such as service desk, infrastructure, applications, security, and vendor management, and from the affected business units or customers. Appoint a clear leader who assigns tasks, monitors progress, and facilitates communication. The team should meet regularly, in person or virtually, to share updates, findings, and actions.


3 Analyze the root cause and identify solutions

The third step is to analyze the root cause of the incident and identify possible solutions. This may involve running tests, collecting logs, reviewing configurations, or consulting experts. Use a systematic method, such as the 5 Whys, a fishbone diagram, or fault tree analysis, to trace cause-and-effect relationships and to avoid jumping to conclusions. Then evaluate the feasibility, effectiveness, and risks of each candidate solution, and prioritize them by impact and urgency.
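
Of the methods named above, fault tree analysis lends itself most directly to a small program. The sketch below models a fault tree with AND and OR gates and lists its minimal cut sets, that is, the smallest combinations of basic events that are enough to cause the top-level failure. The `Event` and `Gate` classes and the example event names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Event:
    """A basic event (leaf) in the fault tree."""
    name: str


@dataclass(frozen=True)
class Gate:
    """An intermediate event: 'AND' needs all inputs, 'OR' needs any one."""
    kind: str
    inputs: tuple


def cut_sets(node) -> set[frozenset[str]]:
    """Return the combinations of basic events that trigger this node."""
    if isinstance(node, Event):
        return {frozenset([node.name])}
    child_sets = [cut_sets(child) for child in node.inputs]
    if node.kind == "OR":
        # Any child's cut set is enough on its own.
        return set().union(*child_sets)
    if node.kind == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate kind: {node.kind}")


def minimal(sets: set[frozenset[str]]) -> set[frozenset[str]]:
    """Drop any cut set that strictly contains another one."""
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical tree: the service is down if the database is unavailable
# (primary down AND failover misconfigured) OR the TLS certificate expired.
top = Gate("OR", (
    Gate("AND", (Event("primary_db_down"), Event("failover_misconfigured"))),
    Event("expired_tls_cert"),
))

for cut_set in sorted(minimal(cut_sets(top)), key=sorted):
    print(sorted(cut_set))
```

A cut set with a single event points to a single point of failure, which is usually a good place to start when prioritizing candidate fixes.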


Paul Teodorescu

IT Executive | Speaker - Authentic Leader, Data & Decision Velocity Driven, Customer Obsessed, Thinking Like an Entrepreneur
Respectfully, in resolving an incident, irrespective of its magnitude of escalation, there is no focus on identifying the root cause. That is a step in the Problem Management space. Focus should be on response and resolution times.

Rob Mulder

Sr. IT Infrastructure Architect
I think an intermediate step should be included: how do I end the service interruption as quickly as possible? Oftentimes a reboot or a stop/start of a subsystem is enough to reestablish the service, after which the more time-consuming root cause analysis can take place and a scheduled change can be planned for the maintenance window to install a more permanent fix.


4 Implement and verify the remediation

The fourth step is to implement and verify the remediation. Apply the chosen solution, whether a permanent fix or a temporary workaround, and test its functionality and performance. Document the steps, results, and approvals, and update the ticketing system or incident management tool accordingly. Finally, confirm that the service is restored and notify stakeholders that the incident is resolved.
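
Verification is easier to trust when it is repeatable. As one possible approach, the sketch below polls a health check and reports success only after several consecutive passes, which guards against declaring victory on a single lucky probe. The `check` callable is a placeholder for whatever probe fits the service (an HTTP request, a test transaction, a queue depth query); the function name and default values are assumptions, not a standard.

```python
import time
from typing import Callable


def verify_restored(
    check: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 10,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `check` passes `required_successes` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated as a failed check.
            healthy = False
        streak = streak + 1 if healthy else 0
        if streak >= required_successes:
            return True
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return False


# Simulated probe results; a real check would query the restored service.
results = iter([False, True, True, True])
print(verify_restored(lambda: next(results), sleep=lambda _: None))
```

The `sleep` parameter is injectable so the function can be exercised in tests without waiting.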


5 Conduct a post-incident review

The fifth step is to conduct a post-incident review to learn from the incident and improve IT operations. Review the incident details, such as the root cause, solution, impact, and timeline, and identify what went well and what did not in the response. Analyze the underlying factors that contributed to the incident, such as gaps in processes, policies, skills, or tools, and propose recommendations for improvement. Document the findings, actions, and lessons learned in a report, and share it with stakeholders and IT management.
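
A review of the timeline usually starts from a few durations. The sketch below derives them from three milestones; the milestone names (`started`, `detected`, `restored`) and the metric names are assumptions for illustration, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def review_metrics(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Compute durations between the milestones of one incident."""
    required = ("started", "detected", "restored")
    missing = [key for key in required if key not in timeline]
    if missing:
        raise KeyError(f"missing milestones: {missing}")
    return {
        "time_to_detect": timeline["detected"] - timeline["started"],
        "time_to_restore": timeline["restored"] - timeline["detected"],
        "total_duration": timeline["restored"] - timeline["started"],
    }


# Made-up timestamps, for illustration only.
timeline = {
    "started": datetime(2024, 1, 15, 9, 0),
    "detected": datetime(2024, 1, 15, 9, 12),
    "restored": datetime(2024, 1, 15, 10, 30),
}
for name, duration in review_metrics(timeline).items():
    print(f"{name}: {duration}")
```

Separating detection time from restoration time helps the review distinguish monitoring gaps from response gaps.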


6 Implement and monitor the improvement actions

The sixth and final step is to implement and monitor the improvement actions that came out of the post-incident review. Execute the recommendations, such as updating procedures, training staff, enhancing systems, or changing vendors, and measure their outcomes and benefits. Track the progress and status of each action and report it to stakeholders and IT management. Finally, verify that the actions have prevented similar incidents or reduced their recurrence or severity.
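
One simple way to check whether an improvement action reduced recurrence is to compare the rate of similar incidents before and after the action took effect. The sketch below does this over equal windows; the function names, the 90-day window, and the incident dates are illustrative assumptions. With counts this small the comparison is only indicative and does not establish that the action caused the change.

```python
from datetime import date, timedelta


def recurrence_rate(incident_dates: list[date], start: date, end: date) -> float:
    """Incidents per 30 days in the half-open window [start, end)."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    count = sum(1 for d in incident_dates if start <= d < end)
    return count * 30 / days


def before_and_after(
    incident_dates: list[date], action_date: date, window_days: int = 90
) -> tuple[float, float]:
    """Compare rates in equal windows before and after an improvement action."""
    window = timedelta(days=window_days)
    before = recurrence_rate(incident_dates, action_date - window, action_date)
    after = recurrence_rate(incident_dates, action_date, action_date + window)
    return before, after


# Made-up dates of similar incidents, for illustration only.
dates = [
    date(2024, 1, 10), date(2024, 2, 2), date(2024, 2, 25),
    date(2024, 3, 20), date(2024, 5, 15),
]
before, after = before_and_after(dates, action_date=date(2024, 4, 1))
print(f"before: {before:.2f} per 30 days, after: {after:.2f} per 30 days")
```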


7 Here’s what else to consider

This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?


Paul Teodorescu

IT Executive | Speaker - Authentic Leader, Data & Decision Velocity Driven, Customer Obsessed, Thinking Like an Entrepreneur
Key to an effective response and remediation approach is making the appropriate SMEs and stakeholders aware through timely notifications. Automating notifications across multiple channels (SMS, voice, MS Teams/Slack chat, etc.) is a must, as is automating the logistics around huddling: creating a channel in Teams, posting bridge info, adding SMEs to the bridge, etc.
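
The multi-channel notification described above can be sketched as a simple fan-out. This is a minimal illustration, not an integration with any real service: each channel is a plain callable, and an actual SMS gateway or Teams/Slack webhook sender would be plugged in behind the same interface. The recipient structure and the `notify_all` name are hypothetical.

```python
from typing import Callable

# A channel is any callable that delivers a message to one recipient.
Channel = Callable[[str, str], None]


def notify_all(
    recipients: list[dict],
    channels: dict[str, Channel],
    message: str,
) -> list[tuple[str, str, str]]:
    """Send `message` to each recipient on each preferred channel.

    Returns a list of (recipient, channel, reason) for failed deliveries,
    so that one broken channel does not stop the rest.
    """
    failures = []
    for person in recipients:
        for channel_name in person["channels"]:
            sender = channels.get(channel_name)
            if sender is None:
                failures.append((person["name"], channel_name, "channel not configured"))
                continue
            try:
                sender(person["name"], message)
            except Exception as exc:
                failures.append((person["name"], channel_name, str(exc)))
    return failures


# In-memory stand-ins for real senders, for illustration only.
sent = []
channels = {
    "sms": lambda to, msg: sent.append(("sms", to, msg)),
    "chat": lambda to, msg: sent.append(("chat", to, msg)),
}
recipients = [
    {"name": "dba-on-call", "channels": ["sms", "chat"]},
    {"name": "service-owner", "channels": ["chat", "voice"]},
]
failures = notify_all(recipients, channels, "P1: checkout-api degraded, bridge open")
print(len(sent), failures)
```

Collecting failures instead of raising on the first one means the incident lead can see who was not reached and follow up by another route.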

Gabriel Peter Calvanese

Manager, Operations Management
RCAs are, and have always been, a part of proactive project management. While they do not have a specific section, they are always at the forefront of the PM's mind when dealing with any issue that arises during a project.
