Keys to Building a Successful Root Cause Program

In November of 2018 I took on a new role to create and manage a root cause program for the engineering department of the software company I work for. When I tried to research existing root cause programs at software companies, the search results were for root cause software products, not at all what I was looking for. I broadened my search while setting up the program and quickly found that many manufacturing and health care protocols did not translate well to database failures or code issues. Maybe you are in that same search right now. If you are starting the same journey I began in 2018, I want to give you the ‘how to’ list that did not exist for me. In this article I’ll lay out a few key principles that are the foundation of the program. Over the following weeks I’ll take a deep dive into each section, giving each its own dedicated article.

Our program’s directive was to have one place where we tracked incident root causes and their corresponding mitigations. The business goal was to reduce the recurrence of incidents. We wanted to identify the root causes of our issues and put preventive actions in place to keep them from happening again. I built the program from some of the best practices I found, a lot of trial and error, and plenty of input from the participants. We call our program RCAPA, Root Cause Analysis and Preventive Action. Three years later I’ve reviewed over 800 incidents, ranging from emergency change requests to full-blown outages. Over that time I’ve seen a shift from hesitancy (do I really have to do this for an emergency change request?) to full participation, with teams using the process on their own when their issue doesn’t require a full RCAPA review. Our root cause program is well received, popular, and seen as a place where teams can come together to problem solve.

Warning: If you do this right, you are not going to see a reduction in incidents right away. You are actually going to see more, or at least it will seem that way. When I set up the program, we thought we would see 5 or 6 incidents a month. Three years later I review 8 to 10 incidents a week. It’s not that we are having more incidents; it is that we are looking at more than originally intended. Two things contributed to the perceived increase. First, I expanded the parameters of what we review. If I can get in front of something while it is a small, lower environment issue, we can prevent it from becoming a production issue. Second, the program became popular. Teams saw it as a place to collaborate and build effective solutions. Once word got out that you didn’t get in trouble for speaking up in RCAPA, my calendar started to fill up. We weren’t having more incidents; we were shining a light in some dark forgotten corners and picking up rugs that had plenty of technical debt piled beneath them. Keep that in mind when building out your first set of OKRs. If incident reduction is a key result, you might want to rethink how you measure it.

Welcome to Switzerland

The number one rule of our root cause program is that it must remain blameless; RCAPA is a neutral zone. I state at the beginning of every RCAPA review meeting that the process is “not a finger pointing exercise”. Root cause analysis requires transparency from your participants, and for that to happen the process must be blameless. No one gets in trouble when they participate in our RCAPA program. Human error isn’t listed as a root cause. I don’t accept it as one, even when someone walks into a meeting and announces, “it was me, I did it, I screwed up.” There is always a deeper systemic issue to be found underneath that human error. When participants are safe, it is much easier to uncover the root cause beneath the missed keystroke. If a human error has taken down a production system, then we need to understand why that system is vulnerable. We don’t take sides or single out specific teams or people. When finger pointing and blame are absent, teams are freed up to work together to create solutions. Your root cause program should be considered sacred neutral ground; it is the Switzerland of your company.

Ask the stupid questions

Have you ever asked an engineer or developer “what happened” when something broke? Did they give you an easy-to-follow breakdown of their database workflows and then tell you in layman’s terms exactly where it broke? I’m guessing they talked about nodes, clusters, and I/O failures as if you worked in those systems right alongside them every day. If you don’t understand what is being described, stop and ask the stupid questions. As the program facilitator I am not tied to any of the teams going through the RCAPA process. I don’t have deep technical knowledge of their products, their workflows, or their dependencies. Getting to root cause requires asking those “stupid” questions. Of course, there is no such thing as a stupid question, but I sure felt like the weakest link in the room the first time I asked an obvious question. I’ve since lost count of how many times an obvious question has pointed us to a deeper systemic issue that the teams involved did not see. I can see the forest for the trees because I am coming to the table with a different view. Your root cause facilitator does not need to be technical; it may be better if they are not. All participants in the process should be willing to ask the obvious questions when reviewing an incident. Those questions are more likely to be asked when you follow rule number one and give people a safe place to ask them.

Meet your participants where they are at

The majority of the participants in the RCAPA program are engineers, developers, and program and product managers. When building the program, I quickly found that if I wanted participation, I needed to meet people where they were at. I didn’t have budget for third-party root cause software, and even if I had, I was hesitant to add one more place for people to log into and manage. I built the program tracking out of existing tools that we all work in. The documents are in Confluence and all ticketing is in Jira, where our engineering department tracks its work. Some RCAPAs have dedicated Slack channels. We are a global company, so the RCAPA review meetings span multiple time zones. I saw a big increase in participation from several teams once I started to hold meetings one evening a week in my time zone, which corresponds to business hours in India. When I meet participants where they are at, the program becomes more accessible.

Elephants are welcome

I have a special category on my calendar for “elephants”. Those are meetings I schedule with leadership after we surface an elephant in an RCAPA review meeting. A successful root cause program will uncover systemic issues that may need a decision by someone outside the group on hand. I call those big issues “elephants”, after one of my first meetings where someone spoke up and said, “Can I talk about the elephant in the room?”. An elephant might be lower environments that do not match production, a lack of ownership for end-to-end testing, or unmapped dependencies. Uncovering an elephant in a review meeting can cause some discomfort in the room, because most employees don’t feel they have any control over those decisions. They may have even accepted the innovation killer of “it’s always been that way”. When we hit that wall, I ask the teams, “if you were to put the right people in the room to make a decision on this, who would it be?” What teams don’t know walking in is that I may have seen this same elephant in other RCAPAs from other teams. I take my data to leadership to help them make business decisions when it comes to prioritizing initiatives. When the transparency in the RCAPA process uncovers an elephant, it doesn’t get swept back under the rug. When an elephant shows up in your root cause program, give it the attention it needs, back it up with data, and surface it to decision makers.

Track Action Items Through the Program

One of the reasons my company wanted a formal root cause program was to keep track of all the mitigation items in one place. We use quarterly planning to map out the work that will be completed each year. Once the plan of record is in place, there isn’t much room to deviate. When a mid-quarter incident requires several technical mitigations, those need to be shoehorned into the quarterly plan. By tying each action item to its corresponding incident in RCAPA, we can prioritize the most critical action items and get the others into planning. We know an incident is mitigated when all the action items linked to it are completed. If an action item does not make it into a quarterly plan, and another incident occurs that refers to that action item as a mitigation, we can come back to leadership with data to help reprioritize it. Over time we can show the return on investment of prioritizing action items versus the risk of pushing them further down the roadmap, i.e., you can expect to see X number of incidents costing approximately Y number of dollars until this action item is completed.
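For readers who think in code, the linkage between incidents and action items can be sketched in a few lines of Python. This is only an illustration of the idea, not how our Jira setup is built: the class names, the ticket keys (INC-1, ACT-1), and the summaries are all made up, and I am assuming that an incident with no linked action items does not count as mitigated.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    key: str          # hypothetical ticket key, e.g. "ACT-1"
    summary: str
    done: bool = False


@dataclass
class Incident:
    key: str          # hypothetical ticket key, e.g. "INC-1"
    action_items: list = field(default_factory=list)

    def is_mitigated(self) -> bool:
        # Mitigated only when every linked action item is complete.
        # Assumption: an incident with no linked items is not mitigated.
        return bool(self.action_items) and all(a.done for a in self.action_items)


def open_item_recurrence(incidents):
    """Count how many incidents refer to each action item that is still open."""
    counts = {}
    for incident in incidents:
        for item in incident.action_items:
            if not item.done:
                counts[item.key] = counts.get(item.key, 0) + 1
    return counts


# Made-up example data: one open action item shared by two incidents.
staging = ActionItem("ACT-1", "Align lower environment config with production")
alerting = ActionItem("ACT-2", "Add disk space alert", done=True)

incidents = [
    Incident("INC-1", [staging, alerting]),
    Incident("INC-2", [staging]),
    Incident("INC-3", [alerting]),
]

for incident in incidents:
    print(incident.key, "mitigated" if incident.is_mitigated() else "open")
print(open_item_recurrence(incidents))
```

An open action item that keeps showing up across incidents is the kind of data point that helps the reprioritization conversation with leadership.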

Capture Learnings – Can you explain it to your mother?

After each RCAPA review meeting I write up the review in a ticketing system where I track all the action items, root causes, mitigations, owners, and dates. That ticket starts with an executive summary. If I’ve done my job well, anyone in my company should be able to read that summary and understand what happened, why it happened, and what we are doing about it. I step back after writing the summary and ask myself “can my mother understand this?”. If the answer is “no”, I have more work to do. The clearer the language, the better it translates across products, teams, and business leaders. You will also be doing your legal team a favor if they have to sanitize the language for a customer-facing RCA. The program data should be searchable and accessible to your teams so they can use it when they want to know “has anyone else seen this issue before?”. This is one of the many things that sets a formal root cause program apart from postmortems. The lessons learned do not stay siloed; teams are encouraged to learn from each other’s incidents.

It is not “your” program

The RCAPA program at my company could not be successful if it were “my” program. RCAPA works because it is “our” program. Each team has an RCAPA Champion who leads investigations and presents at the RCAPA review meeting. In the meeting I state that it is an open forum, and everyone is encouraged to speak up if they have questions or recommendations. Our best solutions often come from those with feet on the ground. They are the ones who might be working in a piece of legacy architecture each day and know the improvements that are required to mitigate risks. I encourage new hires to attend a few RCAPAs and to speak up if they have suggestions. There may have been a similar incident at their previous company, and no one has yet brought up the solution that worked there. Another set of eyes will help us see the forest for the trees. The biggest indicator that the program is owned by all is when teams start to use it on their own for issues that do not require your formal root cause process.

Treat it like a program

When creating or improving your root cause program, I encourage you to define the set of must-haves that will become its foundation. While some of ours were defined along the way, the blameless process has been a requirement from day one. Over this series of articles, I’ll dive into each of the areas I laid out and get into the finer details, like the parameters of what we review and tracking action items. My number one recommendation for creating a root cause program is to treat it like a program. Invest in one or more resources that can manage the program with a focus on a defined business goal, build your program on a set of principles, and get ready to start shining a light in some dark corners.
