Putting an End to Human Error Outages

Over the past decade, human error has played a significant role in most of the industry’s Data Center outages. According to the Uptime Institute’s 2021 Annual Outage Analysis, an aggregated year-on-year average of 63% of failures is due to human error.

If human error continues to be such a problem in the industry, we must ask ourselves why we haven’t been able to fix it. I suspect that this is because we have not been addressing the real root cause, which I believe traces back to the Challenger explosion.

On 28 January 1986, the Space Shuttle Challenger broke apart after the failure of an O-ring that degraded in the launch’s cold weather. In her 1997 book, The Challenger Launch Decision, sociologist Diane Vaughan theorized that NASA’s decision to launch on such a cold day was due to a social normalization of deviance. This theory describes a situation in which people within an organization come to tolerate behaviors or practices once considered unacceptable, even when they fall below their personal or equipment safety standards. This often happens for several reasons, including, but not limited to, a feeling that the rules don’t apply, inconsistencies in the level of knowledge, a lack of understanding, or fear of speaking up. Vaughan theorizes that all of these factors are often found in high-pressure environments.

Starting With Technicians

Consider a Data Center technician on shift who is continually engaged in activities involving varying levels of risk. When a problem is identified during a workday, the technician must either accept the risk of a quick fix to correct it or fix the problem via established procedures. But why would a technician feel enough pressure to take a risk to themselves or their equipment to fix a problem rather than follow the procedures?

On a typical workday, like anyone, a technician must first balance work with their personal life, which is especially hard considering the rhythm of shift work. Then, while on the job, they must conduct preventive and corrective maintenance, practice casualty response, study for continuation training and additional certifications, fill out and update tickets, and still find time to eat. This leaves little time for extra work.

But because the cost of an outage is so high, risk tolerance in Data Centers is extremely low. This results in a heavy reliance on controls and evidence for work completed. Unfortunately, this bureaucratic method can create a culture of apathy among technicians because the process can be time-consuming, and there are implications to bringing up a problem.

If a technician identifies a problem and follows the established procedures, there will be a fact-finding, which some technicians consider a witch hunt. The fact-finding is followed by corrective actions, which are often viewed as punitive or cumbersome, especially when they result in additional work for an already overtasked small team.

Furthermore, the technician might lose their job if the problem results from their own mistake. Together, these factors can create a culture in which it becomes acceptable to do whatever is needed to get a job done without considering the possible repercussions, such as an outage or, in more extreme situations, the technician getting hurt.

How Can Data Center Leaders Evaluate Their Processes?

To address the potential for a social normalization of deviance developing in a Data Center, leaders should consider the following four points and then tailor additional actions based on their findings:

  1. Leaders need to ensure that they are effectively communicating core values and beliefs in a way that develops buy-in on the deck plate. The message should be delivered so that the team can absorb what is being said and why things are being done that way. The “why” is essential because the team needs to see its role in the organization’s future.
  2. Data Center leaders need to develop processes that ensure there is meaningful work and pathways to success while evaluating for and removing the tearing-down forces that might affect personal integrity. Teams must feel confident that leaders have their back and are invested in their careers and families.
  3. Data Center leaders must take a hard look at their safety culture to see where organizational pressures could influence a technician’s risk tolerance. This will help reveal where program policies can lead to deviance and ensure communication and supervision are effective (but not overbearing), creating a safe environment and helping prevent poor risk decisions.
  4. The Data Center industry needs to strengthen its ability to incorporate human factors into its root-cause analysis by adopting the Human Factors Analysis and Classification System (HFACS). This system was developed in response to a trend that showed some form of human error was the primary cause of 80% of all Navy and Marine Corps flight accidents. It can provide Data Centers with a more comprehensive approach to identifying and mitigating human-factor problems by looking at human factors holistically (a rough sketch of what this could look like in practice follows this list).
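
To make the fourth point a little more concrete, below is a minimal, hypothetical Python sketch of how outage findings could be tagged against the four HFACS tiers during a root-cause review. The tier names (Unsafe Acts, Preconditions for Unsafe Acts, Unsafe Supervision, and Organizational Influences) come from the published HFACS framework; the OutageReport and IncidentFinding structures and the missing_tiers check are illustrative assumptions on my part, not part of any established tool.

```python
# Hypothetical sketch: tagging Data Center outage findings against HFACS tiers.
# Only the four tier names come from the HFACS framework; everything else is illustrative.

from dataclasses import dataclass, field
from enum import Enum


class HfacsTier(Enum):
    """The four levels of the HFACS framework, from the act itself up to the organization."""
    UNSAFE_ACTS = "Unsafe Acts"                      # e.g., skipping a step in an approved procedure
    PRECONDITIONS = "Preconditions for Unsafe Acts"  # e.g., fatigue from shift work, time pressure
    UNSAFE_SUPERVISION = "Unsafe Supervision"        # e.g., inadequate oversight of a quick fix
    ORGANIZATIONAL_INFLUENCES = "Organizational Influences"  # e.g., understaffing, punitive fact-findings


@dataclass
class IncidentFinding:
    """One contributing factor identified during a post-outage fact-finding."""
    description: str
    tier: HfacsTier


@dataclass
class OutageReport:
    """A root-cause analysis that captures human factors at every HFACS tier, not just the last act."""
    summary: str
    findings: list[IncidentFinding] = field(default_factory=list)

    def missing_tiers(self) -> list[HfacsTier]:
        """Flag tiers with no findings -- a hint the analysis stopped at 'technician error'."""
        covered = {f.tier for f in self.findings}
        return [t for t in HfacsTier if t not in covered]


if __name__ == "__main__":
    report = OutageReport(summary="UPS transfer failure during maintenance")
    report.findings.append(IncidentFinding("Bypass step performed out of order", HfacsTier.UNSAFE_ACTS))
    report.findings.append(IncidentFinding("Technician at the end of a double shift", HfacsTier.PRECONDITIONS))
    print("Tiers not yet examined:", [t.value for t in report.missing_tiers()])
```

The idea behind the missing_tiers check is simply to prompt the analyst: if every finding sits at the Unsafe Acts level, the review has probably stopped at “the technician made a mistake” and has not yet looked upstream at preconditions, supervision, and organizational pressure.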

As leaders in the Data Center industry, we are responsible for reflecting on how our leadership, and our broader and sometimes overly bureaucratic organizational processes, affect our people. Simply adding more checks can have the reverse effect by creating a culture where risks are taken just to get the job done. Sixty-three percent of outages caused by human error is too high for something we should be able to control. I believe we have the power to make things better, but can we overcome our own inertia to do so?

Mark Stilley

Owner at Txcellence Services

1y

Ok so a dated thought

Andrew J Clark

Risk Specialist | Cyber/IT Security | Critical Infrastructure | Leading Change in Uncertainty | Risk to Opportunities | Seabed-to-Space Analytics | ISACA CISM/CRISC/CISA, PMI PMP/RMP, Veteran & Intelligence Professional

2y

The threat is real… I unlocked my office area one morning to find a “not typical” smell coming from the server room. Apparently the temp alarms went off at the end of the day, and facilities/security missed the sign “for emergency access to space, please contact the 24/7 team down the hall,” logged the event and carried on… The silver lining: training and an off-site data plan were developed shortly thereafter.

Great post, Tony. There's a book you might like, "The Field Guide to Understanding 'Human Error'" by Sidney Dekker. It has some great insights about system design to withstand mistakes and errors.

Mahendra Choubey

Data Center Real Estate Portfolio Management, Colocation & Strategic Partnerships, Global Build Programs, Site Selection/Acquisition, Economic Development, M&A, investment, Technical Program Management, Construction

2y

That’s been one of the critical questions across all industries for years. We need to dive deep and think, and take the right actions not only early at the QA/QC stage but also later at the Cx stage and during periodic maintenance, with double maker/checker verification as well!

Joshua Au

Government Relations | Public Policy | Technical Standards | Advocacy

2y

When things fail, I would consider whether it's due to a confluence of factors, which might include the work, the work environment, the worker, and external circumstances. If the issue is not about the worker, then more training or better communication skills will not address the root cause. On the flip side, it could often be a combination of factors, and that is where we have to go down a deep rabbit hole to appreciate the cause and effect of any incident.
