Putting an End to Human Error Outages

Over the past decade, human error has played a significant role in most of the industry’s Data Center outages. According to the Uptime Institute’s 2021 Annual Outage Analysis, an aggregated year-on-year average of 63% of failures is due to human error.

If human error continues to be such a problem in the industry, we must ask ourselves why we haven’t been able to fix it. I suspect that this is because we have not been addressing the real root cause, which I believe traces back to the Challenger explosion.

On 28 January 1986, the Space Shuttle Challenger broke apart after the failure of an O-ring that degraded in the launch’s cold weather. In her 1997 book, The Challenger Launch Decision, sociologist Diane Vaughan theorized that NASA’s decision to launch on such a cold day was due to a social normalization of deviance. This theory describes a situation in which people within an organization come to tolerate behaviors or practices once considered unacceptable, even when they fall below their personal or equipment safety standards. This often happens for several reasons, including, but not limited to, a feeling that the rules don’t apply, inconsistencies in the level of knowledge, a lack of understanding, or fear of speaking up. Vaughan theorizes that all of these factors are often found in high-pressure environments.

Starting With Technicians

Consider a Data Center technician on shift who is continually engaged in activities involving varying levels of risk. When a problem is identified during a workday, the technician must either accept the risk of a quick fix to correct it or fix the problem via established procedures. But why would a technician feel enough pressure to take a risk to themselves or their equipment to fix a problem rather than follow the procedures?

On a typical workday, like anyone, a technician must first balance work with their personal life, which is especially hard considering the rhythm of shift work. Then, while on the job, they must conduct preventive and corrective maintenance, practice casualty response, study for continuation training and additional certifications, fill out and update tickets, and still find time to eat. This leaves little time for extra work.

But because the cost of an outage is so high, risk tolerance in Data Centers is extremely low. This results in a heavy reliance on controls and evidence for work completed. Unfortunately, this bureaucratic method can create a culture of apathy among technicians because the process can be time-consuming, and there are implications to bringing up a problem.

If a technician identifies a problem and follows the established procedures, there will be a fact-finding, which some technicians consider a witch hunt. The fact-finding is followed by corrective actions, which are often viewed as punitive or cumbersome, especially when they result in additional work for an already overtasked small team.

Furthermore, the technician might lose their job if the problem results from their own mistake. Together, these factors can create a culture in which it becomes acceptable to do whatever is needed to get a job done without considering the possible repercussions, such as an outage or, in more extreme situations, the technician getting hurt.

How Can Data Center Leaders Evaluate Their Processes?

To address the potential for a social normalization of deviance developing in a Data Center, leaders should consider the following four points and then tailor additional actions based on their findings:

  1. Leaders need to ensure that they are effectively communicating core values and beliefs in a way that develops buy-in on the deck plate. The message should be delivered so that the team can absorb what is being said and why things are being done that way. The “why” is essential because the team needs to see its role in the organization’s future.
  2. Data Center leaders need to develop processes that ensure there is meaningful work and pathways to success while evaluating for and removing the tearing-down forces that might affect personal integrity. Teams must feel confident that leaders have their back and are invested in their careers and families.
  3. Data Center leaders must take a hard look at their safety culture to see where organizational pressures could influence a technician’s risk tolerance. This will help reveal where program policies can lead to deviance and ensure communication and supervision are effective (but not overbearing), creating a safe environment and helping prevent poor risk decisions.
  4. The Data Center industry needs to strengthen its ability to incorporate human factors into its root-cause analysis by adopting the Human Factors Analysis and Classification System (HFACS). This system was developed in response to a trend that showed some form of human error was the primary cause of 80% of all Navy and Marine Corps flight accidents. It can provide Data Centers with a more comprehensive approach to identifying and mitigating human-factor problems by looking at human factors holistically (a rough sketch of what this could look like in practice follows this list).
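
To make the fourth point a little more concrete, below is a minimal, hypothetical Python sketch of how outage findings could be tagged against the four HFACS tiers during a root-cause review. The tier names (Unsafe Acts, Preconditions for Unsafe Acts, Unsafe Supervision, and Organizational Influences) come from the published HFACS framework; the OutageReport and IncidentFinding structures and the missing_tiers check are illustrative assumptions on my part, not part of any established tool.

```python
# Hypothetical sketch: tagging Data Center outage findings against HFACS tiers.
# Only the four tier names come from the HFACS framework; everything else is illustrative.

from dataclasses import dataclass, field
from enum import Enum


class HfacsTier(Enum):
    """The four levels of the HFACS framework, from the act itself up to the organization."""
    UNSAFE_ACTS = "Unsafe Acts"                      # e.g., skipping a step in an approved procedure
    PRECONDITIONS = "Preconditions for Unsafe Acts"  # e.g., fatigue from shift work, time pressure
    UNSAFE_SUPERVISION = "Unsafe Supervision"        # e.g., inadequate oversight of a quick fix
    ORGANIZATIONAL_INFLUENCES = "Organizational Influences"  # e.g., understaffing, punitive fact-findings


@dataclass
class IncidentFinding:
    """One contributing factor identified during a post-outage fact-finding."""
    description: str
    tier: HfacsTier


@dataclass
class OutageReport:
    """A root-cause analysis that captures human factors at every HFACS tier, not just the last act."""
    summary: str
    findings: list[IncidentFinding] = field(default_factory=list)

    def missing_tiers(self) -> list[HfacsTier]:
        """Flag tiers with no findings -- a hint the analysis stopped at 'technician error'."""
        covered = {f.tier for f in self.findings}
        return [t for t in HfacsTier if t not in covered]


if __name__ == "__main__":
    report = OutageReport(summary="UPS transfer failure during maintenance")
    report.findings.append(IncidentFinding("Bypass step performed out of order", HfacsTier.UNSAFE_ACTS))
    report.findings.append(IncidentFinding("Technician at the end of a double shift", HfacsTier.PRECONDITIONS))
    print("Tiers not yet examined:", [t.value for t in report.missing_tiers()])
```

The idea behind the missing_tiers check is simply to prompt the analyst: if every finding sits at the Unsafe Acts level, the review has probably stopped at “the technician made a mistake” and has not yet looked upstream at preconditions, supervision, and organizational pressure.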

As leaders in the Data Center industry, we are responsible for reflecting on how our leadership, and our broader and sometimes overly bureaucratic organizational processes, affect our people. Simply adding more checks can have the reverse effect by creating a culture where risks are taken just to get the job done. Sixty-three percent of outages caused by human error is too high for something we should be able to control. I believe we have the power to make things better, but can we overcome our own inertia to do so?

Mark Stilley

Owner at Txcellence Services

1y

Ok so a dated thought

Andrew J Clark

Risk Specialist | Cyber/IT Security | Critical Infrastructure | Leading Change in Uncertainty | Risk to Opportunities | Seabed-to-Space Analytics | ISACA CISM/CRISC/CISA, PMI PMP/RMP, Veteran & Intelligence Professional

2y

The threat is real… I unlocked my office area one morning to find a “not typical” smell coming from the server room. Apparently the temp alarms went off at the end of the day, and facilities/security missed the sign “for emergency access to space, please contact the 24/7 team down the hall,” logged the event and carried on… The silver lining: training and an off-site data plan were developed shortly thereafter.

Great post, Tony. There's a book you might like, "The Field Guide to Understanding 'Human Error'" by Sidney Dekker. It has some great insights about system design to withstand mistakes and errors.

Mahendra Choubey

Data Center Real Estate Portfolio Management, Colocation & Strategic Partnerships, Global Build Programs, Site Selection/Acquisition, Economic Development, M&A, investment, Technical Program Management, Construction

2y

That’s been one of the critical questions across all industries for years. We need to dive deep and think, and take the right actions not only early at the QA/QC stage but also later at the Cx stage and during periodic maintenance, with double maker/checker verification as well!

Joshua Au

Government Relations | Public Policy | Technical Standards | Advocacy

2y

When things fail, I would consider whether it's due to a confluence of factors, which might include the work, the work environment, the worker, and external circumstances. If the issue is not about the worker, then more training or better communication skills will not address the root cause. On the flip side, it could often be a combination of factors, and that is where we have to go down a deep rabbit hole to appreciate the cause and effect of any incident.
