Why the Five Whys is so hard to do well.
“Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.”
H. L. Mencken, "The Divine Afflatus," New York Evening Mail (16 November 1917)
~~~~~~~~~~~~~~~~~~~~~~~~~~
“Your product is broken!”
Sometime in 2007 or 2008, a customer called with bad news. The company’s employees used our product to sign on to almost all the applications they needed to do their jobs. Attempts to reach the sign-on page returned an error message, and the unknown failure locked employees out of every tool. In industry parlance, they were experiencing a P0 issue: a total outage with no functionality working and no clear cause.
Our customer success engineers had so far failed to diagnose the issue. They struggled to get timely information from the customer, especially detailed log data from hundreds of distributed servers. The customer’s support engineers appeared panicked and lacked a troubleshooting methodology.
The customer, short on patience, demanded a product manager join an open conference call where the two teams worked together to figure out the problem. My phone rang half a dozen times, each caller demanding I drop everything I was doing and immediately join the call. I grabbed a member of my product team, and we jumped on.
Product manager as punching bag
The executive leading the customer’s efforts was irate. I waited while he screamed like a banshee for ten minutes or so. He was having a bad day and needed to vent. When he finished, my product manager and I started asking questions. Did we have error logs from the product server and the web servers where all the distributed agents were running? Was the behavior the same for all of the distributed agents? What had changed in the environment before the outage? Was anyone from the customer’s infrastructure or networking teams available to help us troubleshoot?
The clock ticked while we tested several hypotheses. We uncovered that every agent had started refusing connections at the same time–at midnight the night before the outage. Finally, my product manager asked whether anyone had checked the agents’ digital certificates–the things used to secure the connections between the server and the agents. We quickly learned the certificates on all of the agents had expired. At midnight. The night before.
We verified that updating the agents’ certificates restored the product to normal operations. There was no product bug, no code to fix. Even though the product relied on the certificates to communicate securely, certificate handling was outside our product’s control. The product documentation recommended regularly rotating the certificates, and we reiterated that recommendation before dropping off the call. The customer executive was angrily contrite (if that is even a thing).
Root Cause Analysis
Feeling pretty beat up, we followed the incident with a half-hearted root cause analysis exercise. We promptly vindicated ourselves, indicted the customer for their lack of insight and process, and added a couple of low-priority features to have the agent warn the server upon imminent certificate expiration and improve the information collected in our logs.
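To make the expiration-warning idea concrete, here is a minimal sketch in Python of the kind of check an agent could run. It is not the product's actual implementation. The 30-day warning window, the function names, and the sample dates are all assumptions made for illustration. The expiry string uses the "notAfter" format that Python's ssl module reports for a certificate.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Assumed threshold: start warning 30 days before the certificate expires.
WARNING_WINDOW = timedelta(days=30)


def time_until_expiry(not_after: str, now: datetime) -> timedelta:
    """Return the time remaining before a certificate's notAfter date.

    not_after uses the format Python's ssl module reports,
    e.g. "Jan  1 00:00:00 2008 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return expires - now


def expiry_status(
    not_after: str, now: datetime, window: timedelta = WARNING_WINDOW
) -> str:
    """Classify a certificate as "expired", "expiring_soon", or "ok"."""
    remaining = time_until_expiry(not_after, now)
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= window:
        return "expiring_soon"
    return "ok"


# Hypothetical certificate that expires at midnight (UTC) on 1 January 2008.
HYPOTHETICAL_NOT_AFTER = "Jan  1 00:00:00 2008 GMT"

if __name__ == "__main__":
    two_weeks_before = datetime(2007, 12, 15, tzinfo=timezone.utc)
    print(expiry_status(HYPOTHETICAL_NOT_AFTER, two_weeks_before))
```

An agent running a check like this on a schedule could report the "expiring_soon" status to the server well before midnight arrives, rather than refusing connections with no warning.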
The customer’s Six Sigma Black Belt conducted a more thorough root cause analysis using (I found out later) the Five Whys method. Their analysis concluded that our product needed to include a full-blown certificate management system. I decided the conclusion was ludicrous. Certificate management was its own market category occupied by mature vendors with sophisticated products developed over many years.
During the incident’s post-mortem, I explained our position on building certificate management into the product while the customer’s executive excoriated us through gritted teeth. Years later, I ran into the executive at an industry conference, and he was still mad at me...
If I had a dollar for every time there was a cert issue, I would retire a third time. I just went through this process recently, and again the Five Whys indicated several areas for improvement across the board. RCA and the associated tools go a long, long way, regardless of whether you are the end user or the vendor.
CEO & Co-Founder @ TrustFour | Workload Identity & Segmentation Security
Well said. I haven't found an RCA methodology that works for all situations either; some work for problems that are huge in scale, and others work for problems that are smaller in scale. Looking at your problem statement, if you took the first you could easily see that 2/5 of the whys would have been consumed had you just started with the third statement.