Why Data Centers Are Crashing Like Planes—And How to Fix It

Yesterday, the Uptime Institute Global Data Center Survey 2024 dropped, and while it would appear at first blush to be stating “not much has changed,” a detailed analysis of the report shows that the industry is still in very deep trouble.

PERFORMANCE

In the past, I’ve tried to apply the failure rates of the data center world to the commercial aviation industry, and the numbers were very alarming. Unfortunately, my math was wrong; it’s far worse than I had estimated.

Here’s the question I plugged into ChatGPT-4o:

“From the 2024 Uptime Institute Global Data Center Survey, "Outage frequency and severity," compare this performance to the commercial aviation industry. If the aviation industry had similar rates of failure, and a "severe" outage from a data center was the equivalent of a catastrophic plane crash, how often would planes crash, while delivering the same performance?”

To compare the performance of data centers to the commercial aviation industry using the analogy that a "severe" data center outage equates to a catastrophic plane crash, let's first establish the equivalent failure rates:

Commercial Aviation Analogy

In the aviation context:

  • A "severe" data center outage is equivalent to a catastrophic plane crash.
  • The failure rate is analyzed over a three-year period.

Data Center Outage Statistics (2024 Uptime Institute Survey)

  • 53% of data centers experienced an outage in the past three years.
  • 3% of those outages were considered "severe" (catastrophic).
  • 6% were "serious."
  • The remaining outages were classified as significant, minor, or negligible.

Calculating the Equivalent Aviation Crash Rate

  1. Total Data Centers Experiencing an Outage: 53% over three years. Annually, this would be approximately 53% / 3 ≈ 17.67% per year.
  2. Severe Outages (Catastrophic Events): 3% of 53% over three years. Annually, this would be approximately (3% × 53%) / 3 ≈ 0.53% per year.
  3. Commercial Flights: Assume there are roughly 100,000 flights per day globally (a standard estimate for pre-pandemic levels).
  4. Annual Flights: 100,000 flights/day × 365 days/year = 36,500,000 flights/year.
  5. Catastrophic Crash Rate: If 0.53% of flights were to experience a catastrophic event (equivalent to severe outages): 0.53% × 36,500,000 ≈ 193,450 catastrophic crashes per year. Daily, this equates to 193,450 / 365 ≈ 530 crashes per day.
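The arithmetic in the steps above is easy to check for yourself. Here is a minimal Python sketch that takes the quoted figures at face value; the 100,000 flights per day is the assumption from the answer above, the variable names are mine, and the analogy's simplification of treating a per-site annual rate as a per-flight rate is carried over unchanged.

```python
# Figures as quoted in the answer above; none of these are new data.
OUTAGE_SHARE_3YR = 0.53    # share of data centers with an outage over three years
SEVERE_SHARE = 0.03        # share of those outages classed as "severe"
YEARS = 3                  # survey look-back window
FLIGHTS_PER_DAY = 100_000  # assumed global flight count (pre-pandemic estimate)

# Annualized rate of severe outages.
annual_severe_rate = OUTAGE_SHARE_3YR * SEVERE_SHARE / YEARS

# Apply that rate to flights, as the analogy does.
flights_per_year = FLIGHTS_PER_DAY * 365
crashes_per_year = annual_severe_rate * flights_per_year
crashes_per_day = crashes_per_year / 365

print(f"Annual severe-outage rate: {annual_severe_rate:.2%}")
print(f"Equivalent crashes per year: {round(crashes_per_year):,}")
print(f"Equivalent crashes per day: {round(crashes_per_day)}")
```

Change any of the four inputs and the daily figure scales in direct proportion, so the result is only as good as the survey percentages and the flight-count assumption.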

Punch-Line:

  • Aviation Industry Equivalent: If commercial aviation experienced failures at the same rate as data centers, with "severe" outages equating to catastrophic crashes, there would be approximately 530 plane crashes per day.

I don’t know about you, but with those odds, I’d never fly again; I’d drive. Heck, I’d walk if I had to!

WHY IS THIS SO?

Power Systems Failures:

From the report, 54% of unplanned outages are due to power issues.

And the Uptime report helpfully states, “As discussed in previous Uptime Intelligence reports, challenges with electrical grids, exacerbated by aging infrastructure, rising demand, severe weather events and a reliance on intermittent renewable energy sources, may worsen this trend.”

This shouldn’t be news to anyone; I’ve been talking to the media about this since November of 2023, and wrote a detailed analysis of the power grid situation, “The Impending Storm: Keeping Data Centers Alive As The Electric Grid Fails.”

The growing risk of electric grid instability puts the enterprise IT organization of EVERY major corporation in the world at risk.

If you bother to read the latest Uptime report, you'll notice how it talks about network and cooling, but does NOT address the #1 cause of data center outages: power.

Why do you think that is?

I’ve written a full analysis of what’s really happening, “Shattering the Illusion: Maintenance MIS-Management is the #1 Cause of Data Center Outages.” I strongly recommend you check it out.

In short, colocation providers are fully aware of when deficiencies occur, but they choose to NOT repair them, because any losses which they (the colo provider) may incur due to an outage are merely credits for future use by the client.

They suffer no ACTUAL financial penalties if their client experiences an outage. It’s a classic “heads I win, tails you lose” scenario.

Colo companies have a financial cost associated with maintaining the critical facilities, yet they receive no direct benefit. But they do enjoy a financial benefit (cost avoidance) if they do NOT maintain their critical facilities, and there is no penalty for NOT maintaining their facilities.

So the colo companies divert the money that should be used for maintenance to other activities which artificially boost their stock, and the management can cash out hundreds of millions of dollars in bonuses.

Their IT clients are left utterly exposed to unplanned outages with absolutely zero agency.

“But WAIT a minute!” you say, “Aren’t the colo companies adhering to some sort of oversight, like SOC-2?!?”

Of course they are. But the SOC-2 certificates aren’t worth the paper they’re not written on.

SOC-2 certificates, like their European ISO equivalents, merely provide a veneer of legitimacy to an otherwise crooked game.

Want proof? OK, what’s the only requirement to be a SOC-2 auditor?

Answer: you have to be a CPA, a Certified Public Accountant. That's IT.

When I tell IT professionals and executives this, they’re always stunned. Then they ask, “What does a CPA know about engineering systems?”

The answer, of course, is nothing.

That’s the point!

Imagine you’re about to board a jetliner to fly across the ocean, and a CPA comes running up to you and says, “Don’t worry, I audited the airline, THIS plane is safe!”


"Don't worry, I audited the airline. This plane is SAFE!!!"

Would you bet on that with your LIFE? What about your business, your retirement, your future?

Because that’s what it boils down to.

RESULTS

The poor performance cited at the beginning is the aggregate result of all the crooked, sleazy games being played by colocation data centers (and many service providers serving enterprise IT organizations). The games can be hidden from the individual client, but the final numbers cannot.

Tier-III sites are designed to have a statistical probability of 1:5,555 of having an outage in a given year. The actual probability of a data center having an outage in any given year is ~1:5.5.

Put simply, data centers are delivering a product 1,000x WORSE than what they promise.
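For anyone who wants to check the "1,000x" figure, here is a quick sketch using only the two odds quoted above (1:5,555 by design, ~1:5.5 in practice). The variable names are mine, and the odds themselves are taken as given rather than re-derived.

```python
# Odds as quoted above: "1 in N" chance of an outage in a given year.
DESIGN_ODDS = 5_555  # Tier-III design figure cited in the text
ACTUAL_ODDS = 5.5    # observed figure cited in the text

design_probability = 1 / DESIGN_ODDS
actual_probability = 1 / ACTUAL_ODDS

# How many times more likely an outage is in practice than by design.
ratio = actual_probability / design_probability

print(f"Design probability: {design_probability:.3%}")
print(f"Actual probability: {actual_probability:.1%}")
print(f"Actual vs. design: about {ratio:.0f}x")
```

The ratio comes out at roughly 1,010, which is where the rounded "1,000x" comes from.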

WHAT’S THE SOLUTION?

In contrast to the appalling performance of the data center industry at large, the Amerruss Resilience Program delivered a flawless operational record for 6½ years across more than 60 sites scattered around the globe. That portfolio included owned, leased and colocated spaces, in a 50-50 mix of Tier-III and Tier-II sites!

In effect, while the rest of the industry delivered an actual uptime of ~83%, our program delivered 100%, with no excuses.

We can deliver the same results for your company, whether you have owned facilities, lease buildings with 3rd-party vendors maintaining your systems, or depend on colocation facilities for your IT needs (whether a full-on enterprise IT presence, POP or DRaaS sites). We have the only proven solution to help you regain agency for your IT presence, protect your IT operations and remove the crippling costs of unplanned outages due to power and/or cooling system failures. Our program is scalable, replicable, vendor-neutral, fully transparent and easily auditable.

The costs (especially compared to unplanned outages) are relatively minimal, and SLA results are insurable by Lloyd’s of London. NO other “audit” program (whether ISO, SOC, or anything else you can reference) can match this, much less beat it. An additional benefit is a decreased need for fail-over sites, which reduces IT operational costs for personnel and equipment that are simply no longer necessary.

Contact us at www.amerruss.com. We can help your IT organization be more reliable, AND cost-effective, without stressing your budget or your nerves!

Ajay Varma (CDCP)

Datacenter Infrastructure Operations Engineer | Mission Critical Load Handling at WebWerks & Iron Mountain Datacenter, APAC Region

3 months ago

Great insights. The average unplanned data center outage costs anywhere from a minimum of $9,000 to a maximum of $2,409,991, and the cost of downtime continues to increase year on year.

Chris Hale, MBA

Infrastructure Services| Datacenter | MBA | Naval Nuclear Power -Veteran

5 months ago

I propose that DC providers develop an internal audit program separate from Operations. In the US Navy nuclear power program, this is present as an annual Operational Reactors Safeguards Exam (ORSE). It covers training reports, level-of-knowledge interviews, maintenance reviews, observed evolutions, and so forth.

More articles by Dr. Eric Woodell
