Putting an End to Human Error Outages
Tony Grayson
VADM Stockdale Leadership Award Recipient | Tech Executive | Ex-Submarine Captain | Top 10 Datacenter Influencer | Veteran Advocate
Over the past decade, human error has played a significant role in most of the industry’s Data Center outages. According to the 2021 Uptime Institute Annual Outage Analysis, an aggregated year-on-year average of 63% of failures are attributable to human error.
If human error continues to be such a problem in the industry, we must ask ourselves why we haven’t been able to fix it. I suspect this is because we have not been addressing the real root cause, one that I believe traces back to the Challenger explosion.
On 28 January 1986, the Space Shuttle Challenger broke apart after the failure of an O-ring that had degraded in the cold weather at launch. In her 1997 book, The Challenger Launch Decision, sociologist Diane Vaughan theorized that NASA’s decision to launch on such a cold day was the result of a social normalization of deviance. This theory describes a situation in which people within an organization come to tolerate behaviors or practices once considered unacceptable, even when those practices fall below their own standards for personal or equipment safety. This often happens for several reasons, including, but not limited to, a feeling that the rules don’t apply, inconsistencies in levels of knowledge, a lack of understanding, or fear of speaking up. Vaughan theorizes that all of these factors are common in high-pressure environments.
Starting With Technicians
Consider a Data Center technician on shift, continually engaged in activities that carry varying degrees of risk. When a problem is identified during the workday, the technician must either accept the risk of a quick fix or correct the problem via established procedures. But why would a technician feel enough pressure to put themselves or their equipment at risk rather than follow the procedures?
On a typical workday, like anyone, a technician must first balance work with their personal life, which is especially hard considering the rhythm of shift work. Then, while on the job, they must conduct preventive and corrective maintenance, practice casualty response, study for continuation training and additional certifications, fill out and update tickets, and still find time to eat. This leaves little time for extra work.
But because the cost of an outage is so high, risk tolerance in Data Centers is extremely low. This results in a heavy reliance on controls and on evidence of completed work. Unfortunately, this bureaucratic approach can create a culture of apathy among technicians, because the process is time-consuming and there are consequences to raising a problem.
If a technician identifies a problem and follows the established procedures, there will be a fact-finding investigation, which some technicians consider a witch hunt. Fact-finding is followed by corrective actions, which are often viewed as punitive or cumbersome, particularly when they result in additional work for an already overtasked, small team.
Furthermore, the technician might lose their job if the problem turns out to be the result of their own mistake. Together, these factors can create a culture in which it becomes acceptable to do whatever is needed to get the job done without considering the possible repercussions, such as an outage or, in more extreme cases, injury to the technician.
How Can Data Center Leaders Evaluate Their Processes?
To address the potential for a social normalization of deviance to develop in a Data Center, leaders should consider the following four points and then tailor additional actions based on their findings:
As leaders in the Data Center industry, we are responsible for reflecting on how our leadership, and our broader organizational and sometimes overly bureaucratic processes, affect our people. Simply adding checks might have the reverse effect, creating a culture where risks are taken just to get the job done. Sixty-three percent of outages caused by human error is too high for something we should be able to control. I believe we have the power to make things better, but can we overcome our own inertia to do so?
Owner at Txcellence Services
1 year
Ok, so a dated thought
Risk Specialist | Cyber/IT Security | Critical Infrastructure | Leading Change in Uncertainty | Risk to Opportunities | Seabed-to-Space Analytics | ISACA CISM/CRISC/CISA, PMI PMP/RMP, Veteran & Intelligence Professional
2 years
The threat is real… I unlocked my office area one morning to find a “not typical” smell coming from the server room. Apparently the temp alarms went off at the end of the day, and facilities/security missed the sign reading “for emergency access to the space, please contact the 24/7 team down the hall,” logged the event, and carried on… The silver lining: training and an off-site data plan were developed shortly thereafter.
SpaceX
2 years
Great post, Tony. There's a book you might like, "The Field Guide to Understanding 'Human Error'" by Sidney Dekker. It has some great insights about designing systems to withstand mistakes and errors.
Data Center Real Estate Portfolio Management, Colocation & Strategic Partnerships, Global Build Programs, Site Selection/Acquisition, Economic Development, M&A, investment, Technical Program Management, Construction
2 years
That has been one of the critical questions across industries for years. We need to dig deep and think, and take the right actions not only at the early QA/QC stage but also later at the Cx stage and during periodic maintenance, with maker-checker controls in place as well!
Government Relations | Public Policy | Technical Standards | Advocacy
2 years
When things fail, I would consider whether it is due to a confluence of factors, which might include the work, the work environment, the worker, and external circumstances. If the issue is not about the worker, then more training or better communication skills will not address the root cause. On the flip side, it is often a combination of factors, and that is where we have to go down a deep rabbit hole to appreciate the cause and effect of any incident.