Root Cause Analysis Techniques Series - Change Analysis and Event Analysis


Welcome to a series of articles that will share practical information about effective Root Cause Analysis (RCA) techniques.

In incident management and continuous improvement, understanding and addressing the core issues that drive problems is essential. Over the coming articles, I will walk through a diverse toolkit of Root Cause Analysis techniques. My goal is to give you general information about each technique and then share practical guidelines from my real-life experience as an Incident Management Director. Implementing these techniques in my company reduced its incident rate by 60%.

Whether you're an Incident Management trainer, consultant, or a professional seeking to enhance your problem-solving skills, this series is designed to equip you with the knowledge and guidelines to use these techniques effectively and make a change. Even though I will be using examples from the Incident Management field, these techniques can be used to prevent problems in many other areas. When you reach the end of the series, you will have a complete set of tools enabling you to address problem prevention properly.

If you need my help understanding which tools will be most effective in solving the problems you are facing, you are welcome to reach out to me.

This article is the first of a series of five articles:

  1. RCA and the Change Analysis and Event Analysis
  2. The power of the 5 Whys
  3. Navigating to the Root with Fishbone Diagrams
  4. Global overview using Pareto Charts
  5. Mastering Improvement with DMAIC

In this first article, I will start by explaining the main purpose of the RCA process and then kick off this journey with a deep dive into Change Analysis and Event Analysis, the first of the RCA techniques we will explore.

But first, a little about myself.

Who Am I

I'm Dani Tweig, an experienced Incident Management Director with extensive practical experience in global hi-tech companies.

For the last couple of years, I have worked at a cybersecurity company that processed billions of time-sensitive transactions every day and was globally distributed across hybrid-cloud infrastructure, with several diverse services and hundreds of components.

That environment sure kept me busy :)

In the last couple of years, I have built and trained several incident management teams and designed many incident management processes from scratch, which helped prevent incidents and achieve continuous improvement. We reduced the number of severe incidents from 91 per year to 15. This cut the hours of effort invested in fixing production incidents, reduced the number of interruptions to the technical teams and the whole company, and increased the quality and availability of the company's services.

Root Cause Analysis (RCA)

Root Cause Analysis (RCA) is a structured approach used to uncover the reasons or causes behind problems, incidents, or failures within an organization. It is a powerful tool for identifying not only the visible symptoms but, more importantly, the fundamental causes that contribute to these problems.

The RCA process involves a methodical investigation that seeks to answer the critical question of "why" an issue occurred rather than merely addressing its immediate symptoms. By uncovering the root causes, organizations can effectively implement preventive measures and continuous improvements, ultimately leading to enhanced service quality, higher customer satisfaction, and increased availability, safety, and operational efficiency.

RCA is used in numerous fields, including healthcare, manufacturing, information technology, and incident management, which makes it an essential practice for organizations seeking continuous positive changes.

The Power of Change Analysis and Event Analysis

Introduction

The first technique in the series is Change Analysis and Event Analysis. This technique is the baseline for all the other techniques in this series.

All problems are caused by a specific sequence of events or changes within an organization or system. To uncover the root causes in these scenarios, we turn to the Change Analysis and Event Analysis techniques. In this article, we'll explore how these techniques can be used to analyze events and changes and reveal issues that need to be addressed.

What is Change Analysis and Event Analysis?

Change Analysis and Event Analysis are techniques used to understand how a sequence of events or changes has led to a specific issue or incident. These techniques focus on identifying the root causes by examining and documenting the timeline of changes and events that led to the problem.

How to implement Change Analysis and Event Analysis

To implement Change Analysis and Event Analysis effectively, follow these steps:

1. Define the Issue or Incident: Clearly articulate the problem or incident you are investigating.

2. Collect Data: Gather detailed information about the changes and events leading up to the issue.

3. Timeline Creation: Create a chronological timeline of the events and changes.

4. Find Links: Analyze the connections between the events and changes, and how they contributed to the problem.

5. Identify Root Causes: Determine the fundamental causes within the timeline.
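The mechanical parts of these steps (ordering the collected facts and shortlisting recent changes) can be sketched in Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `Entry` class, the `changes_before` helper, the two-hour window, and the sample entries are all hypothetical. A change that falls inside the window is only a candidate link to investigate, not a proven root cause.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Entry:
    """One fact on the timeline: either a change or an event."""
    when: datetime
    kind: str          # "change" or "event"
    description: str


def build_timeline(entries):
    """Step 3: order all collected entries chronologically."""
    return sorted(entries, key=lambda e: e.when)


def changes_before(timeline, incident_start, window):
    """Step 4 (first pass): list changes made shortly before the incident."""
    earliest = incident_start - window
    return [
        e for e in timeline
        if e.kind == "change" and earliest <= e.when < incident_start
    ]


# Step 2: collected data (hypothetical sample entries).
collected = [
    Entry(datetime(2024, 1, 10, 9, 40), "event", "Error-rate alert fired"),
    Entry(datetime(2024, 1, 8, 14, 0), "change", "Certificate rotated"),
    Entry(datetime(2024, 1, 10, 9, 15), "change", "Config deployed to gateway"),
    Entry(datetime(2024, 1, 10, 10, 5), "event", "Config rolled back"),
]

# Step 1: the incident under investigation starts with the alert.
incident_start = datetime(2024, 1, 10, 9, 40)

timeline = build_timeline(collected)
candidates = changes_before(timeline, incident_start, timedelta(hours=2))

for entry in candidates:
    print(entry.when.isoformat(sep=" "), entry.description)
```

Step 5, identifying the root cause, stays a human judgment: the shortlist only narrows down where to look.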

When to Use Change Analysis and Event Analysis

These techniques will be most effective when you're dealing with incidents or issues that can be tied to specific changes or a sequence of events that have occurred. The application of these tools is particularly valuable in the areas of incident management, cybersecurity, and process analysis.

I found it to be an extremely valuable tool when collecting data for production incident RCAs, and it can also be used to investigate any problem whatsoever.

Benefits and Advantages

Comprehensive Understanding: They provide a comprehensive view of the problem.

Sequence of Events: They help identify the exact sequence of events leading to the issue.

Preventative Insights: These techniques offer insights that make it easier to prevent similar incidents in the future.

My Experience

In the RCA process I developed, I used the Change Analysis and Event Analysis method to collect all the relevant information related to the incident. The information gathered covered everything that led to the problem and everything that happened from the moment the incident was detected until it was resolved.

All the information was gathered into a timeline table, documenting information relevant to the incident. Each row held the data of a single event or change.

After several iterations of refining my RCA process, the fields I found most effective in the timeline table were the following: date, time, time from the beginning of the incident, relevant system, who was involved, and the information about the event.

The rows were ordered by date and time so that one could get a comprehensive view of the whole flow of events which led to the incident until its resolution.

An example of a piece of a timeline table describing events related to a production incident
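A minimal Python sketch of such a timeline table follows. It is an illustration rather than the tooling behind the process described here: the `TimelineRow` class, the helper functions, and all sample events are hypothetical. It uses the fields listed above and orders the rows by date and time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineRow:
    timestamp: datetime   # date and time of the event or change
    system: str           # relevant system
    who: str              # person or team involved
    info: str             # information about the event


def format_offset(timestamp, incident_start):
    """Return the time from the beginning of the incident as +HH:MM or -HH:MM."""
    total_minutes = int((timestamp - incident_start).total_seconds() // 60)
    sign = "-" if total_minutes < 0 else "+"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{sign}{hours:02d}:{minutes:02d}"


def render_table(rows, incident_start):
    """Render the rows, ordered by date and time, as a plain-text table."""
    lines = ["Date       | Time  | Offset | System  | Who      | Information"]
    for row in sorted(rows, key=lambda r: r.timestamp):
        lines.append(
            f"{row.timestamp:%Y-%m-%d} | {row.timestamp:%H:%M} | "
            f"{format_offset(row.timestamp, incident_start)} | "
            f"{row.system:<7} | {row.who:<8} | {row.info}"
        )
    return "\n".join(lines)


# Hypothetical sample rows; the incident is taken to begin when the alert fires.
incident_start = datetime(2024, 1, 10, 9, 40)
rows = [
    TimelineRow(datetime(2024, 1, 10, 10, 5), "gateway", "on-call", "Config rolled back"),
    TimelineRow(datetime(2024, 1, 10, 9, 15), "gateway", "dev team", "Config deployed"),
    TimelineRow(datetime(2024, 1, 10, 9, 40), "monitor", "alerting", "Error-rate alert fired"),
]

print(render_table(rows, incident_start))
```

Rows before the incident start get a negative offset, which makes it easy to separate what led to the incident from how it was handled.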

Important Instructions

It is very important that the information you gather in the timeline table helps you understand what happened and why. To understand the sequence of events related to an incident, I developed the following set of guidelines:

  • Gather all the information to establish a clear flow of what has occurred before and during the incident.
  • It is very important not to be biased or to rely on assumptions. Gather only facts: check relevant emails, notifications, related change tickets (e.g. Jira), and relevant communication channels (e.g. Slack) for incident-related information. Check technical sources such as logs, reports, and history/storage tools like Icinga / Kibana / Zabbix / DB, or any other tool that can provide facts.
  • Make sure the documented information is thorough and comprehensive but also brief.
  • Ensure that you have access to accurate and comprehensive data about the changes and events.
  • Make sure you can understand from the timeline what the root of the problem was, what led to the incident, and what exactly was done to fix it.

If you build the timeline table correctly, it will be easy to understand the flow of events that led to the incident, identify the root cause, judge whether the incident was handled efficiently, and decide which preventative measures should be taken to keep the issue from happening again.
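One way to judge how efficiently an incident was handled is to measure the time that passed between consecutive timeline entries. The following Python sketch shows the idea under stated assumptions: the `time_between_events` helper and the sample timeline are hypothetical, and a long gap is only a prompt for a closer look, not a verdict.

```python
from datetime import datetime


def time_between_events(timeline):
    """Pair each event with the time elapsed until the next event.

    `timeline` is a list of (timestamp, description) tuples in any order.
    """
    ordered = sorted(timeline, key=lambda item: item[0])
    gaps = []
    for (start, description), (end, _) in zip(ordered, ordered[1:]):
        gaps.append((description, end - start))
    return gaps


# Hypothetical sample timeline.
timeline = [
    (datetime(2024, 1, 10, 9, 40), "Alert fired"),
    (datetime(2024, 1, 10, 9, 15), "Config deployed"),
    (datetime(2024, 1, 10, 10, 5), "Rollback completed"),
    (datetime(2024, 1, 10, 9, 52), "On-call engineer acknowledged"),
]

gaps = time_between_events(timeline)
for description, elapsed in gaps:
    print(f"{description}: {elapsed} until the next event")

# The longest gap shows where most of the time went.
longest = max(gaps, key=lambda gap: gap[1])
print("Longest gap after:", longest[0])
```

In this sample, the longest gap is between the deployment and the alert, which would point the discussion toward detection time rather than response time.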

Conclusion

Change Analysis and Event Analysis are valuable tools when dealing with problems that are the result of a sequence of events or changes. Gathering relevant information is the foundation of all RCA techniques to come.

By going over the context and timeline of events, you can uncover the root causes, helping you to understand which steps to take in order to prevent similar incidents in the future.

Next Steps

Continue your exploration of Root Cause Analysis to find out how the Change Analysis and Event Analysis establish the foundation for an effective RCA process. Discover other techniques, including the 5 Whys, Fishbone Diagrams, Pareto Charts, and DMAIC, which we'll explore in the coming articles.

Closing Thoughts

Thank you for joining me on this journey into the world of Root Cause Analysis techniques. As I did in this article, I promise to keep sharing practical tips based on my real-life experience, so that you can be more effective when implementing the Root Cause Analysis process in your organization.

Please feel free to contact me for any further discussion or leave a comment with your thoughts or experience about the Change Analysis and Event Analysis or the RCA process.

If you want me to help you pick the most effective technique for your organization, or if you think your team should be trained in Incident Management and the RCA techniques, feel free to reach out to me. I can help.

Dani Tweig

Parameswaran A B

Principal Consultant - Incident & Escalation Management

1 year ago

The timeline table is very effective, though Problem Managers gather all the facts including error reports, and timelines this table is essential to put together in a single place and also calculate how much time is spent on which action item. Good one!!
