Shattering the Illusion: Maintenance MIS-Management is the #1 Cause of Data Center Outages
Dr. Eric Woodell
World's #1 expert in data center resilience. I audit and certify colocation facilities, ensuring secure, continuous operations—insured by Lloyd's of London.
Infrastructure Topology Is Just ONE Side of the Coin
Words mean things: they express concepts and ideas, and describe the physical realities around us. We use industry-accepted conventions for things like infrastructure architecture in the mission-critical facilities discipline, so that while the subtleties of a particular site may be confidential, the overall performance expectations are easily understood. That being said, the Uptime Institute Tier Standard: Topology is a superb means to define data center infrastructure.
A close examination of that standard will make it abundantly clear that a Tier-III or Tier-IV data center should never suffer an unplanned outage due to power or cooling, because every single component has a fully redundant backup. Thus, a single failure anywhere in the facility will never compromise the operation of the overall facility. Put another way, there are no single points of failure that can cause the complete loss of power or cooling.
As with a jetliner, multiple levels of redundancy mean that the failure of a single component in any system, no matter how critical, doesn’t compromise the entire system. The failure of an engine on a jetliner doesn’t doom the plane; it still has the ability to land safely.
It’s the same for Tier-III or Tier-IV data centers. They’re designed with the knowledge that anything man-made will fail, sooner or later, and they account for that in the design. Hence the Uptime Tier standards, and the hundreds of construction companies who are fantastic at meeting those Uptime criteria. This area of the critical facilities arena is well-covered, and the results of their efforts have been nothing short of amazing.
And yet, according to the 2023 Uptime Institute outage analysis, outages keep happening! Roughly 60% of data center operators say they’ve had an outage in the past three years, with 44% caused by power system failures and 13% by cooling system failures.
If the power and cooling system failures account for 57% of data center outages, despite being fully redundant, how in the world is this happening?
There Are TWO Mandatory Requirements for Data Center Operational Excellence
As mentioned above, the first key aspect of having a critical facility that will support completely reliable IT operations for decades is having a fully redundant architecture. Tier-III facilities have this as their standard; Tier-IV expands upon this to add “fault tolerance,” the ability to adapt to changing conditions to preclude cascade failures from occurring. But Tier-III is a far more economical approach, and should deliver perfect performance.
The fact that they’re obviously not delivering perfect performance is due to the second key aspect of reliability in data centers: meticulous maintenance.
IF a data center is properly maintained 100% of the time, then a Tier-III facility can suffer any single failure and ride through until the failure has been mitigated. However, if the data center is not being properly maintained, then every single failure of any given component has the very real risk of becoming a cascade failure.
Let me say this again: Superior infrastructure topology negates data center outages due to a single failure. Superior MAINTENANCE prevents single failures from turning into cascade failures and outages. They are two sides of the SAME coin.
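To put rough numbers on the two-sided argument, here is a minimal sketch that is not taken from the article. It assumes a simple redundant pair in which an outage needs two things to go wrong at once: the primary path fails AND the backup is unavailable when called upon. The function name and every probability below are hypothetical, chosen only to illustrate the shape of the effect.

```python
def outage_probability(p_primary_fails: float, p_backup_unavailable: float) -> float:
    """Chance that a primary failure becomes an outage, assuming the two
    events are independent (a simplifying assumption, not a field result)."""
    for p in (p_primary_fails, p_backup_unavailable):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_primary_fails * p_backup_unavailable


# Hypothetical values, for illustration only.
P_PRIMARY_FAILS = 0.05       # same topology in both cases
BACKUP_MAINTAINED = 0.01     # backup kept in good repair
BACKUP_NEGLECTED = 0.50      # known defects left uncorrected

maintained = outage_probability(P_PRIMARY_FAILS, BACKUP_MAINTAINED)
neglected = outage_probability(P_PRIMARY_FAILS, BACKUP_NEGLECTED)

print(f"maintained backup: {maintained:.4%}")
print(f"neglected backup:  {neglected:.4%}")
print(f"risk multiplier:   {neglected / maintained:.0f}x")
```

The topology is identical in both cases; only the condition of the backup changes, which is the point of the paragraph above.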
I wrote an article in 2016 titled “What IS ‘The Secret’ To Running Data Centers?”, where I poked holes in the claims of vendors of various types of technology that their offerings were THE KEY to attaining perfect uptime. I then proceeded to point out that, as in the aviation industry, everything has to work, and described, point-by-point, what it takes to have an effective mission-critical facility that delivers >99.999% reliability for power and cooling:
Manpower is the first element to consider; do you have enough people to perform the work that needs to be accomplished on a daily basis? Do they escort vendors for service calls, which will take them away from their normal duties?
Training: Have the staff received proper training on site-specific gear, and has it been documented? Are shift personnel qualified for specific shift operations and maintenance functions?
Financial management: are OPEX and CAPEX budgets of sufficient magnitude to fund critical-facilities projects and normal operations? Are they separate from other budgets? [As one friend pointed out, “Data Center managers are squeezed to save every penny that they get so what suffers, the infrastructure.” And I would add, the people as well…]
Reference library: do you have one, with as-built drawings, operations and maintenance documents, warranty, commissioning, and automation sequences for operation? How about studies for infrastructure systems, soil, waters, structural? Are the master copies being kept in a safe, centralized location? Is there a system for managing floor-space, power and cooling? Are these aspects being monitored and a process in place to forecast future growth?
Organization: Does the staff have an established reporting chain, including a call-out list during emergencies? Do the staff have job descriptions available, which state their duties AND expectations? Are roles and responsibilities clearly delineated? Are key roles clearly established with their own duties and responsibilities? Is there a succession plan in place?
Maintenance: Is there an effective preventative maintenance (PM) program in place, including detailed procedures that are clearly written for technicians to follow? Is there a QA system to validate PMs are performed as desired? Is there an effective maintenance management system in place, where hours and equipment history are tracked? Is there a detailed inventory of spare parts on-hand, and list of suppliers and vendors when un-stocked spares are needed rapidly? Is there a life-cycle planning system in place? How about a failure analysis program, and tracking of deferred maintenance?
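The last checklist item, tracking of deferred maintenance, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a description of any particular maintenance management system: the `VendorFinding` record, the `overdue_findings` helper, the 30-day threshold, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class VendorFinding:
    """One recommendation taken from a vendor PM report (hypothetical schema)."""
    equipment: str
    recommendation: str
    reported_on: date
    resolved_on: Optional[date] = None  # None means the repair is still deferred


def overdue_findings(findings: List[VendorFinding], today: date,
                     max_open_days: int = 30) -> List[VendorFinding]:
    """Return unresolved findings that have been open longer than max_open_days."""
    return [
        f for f in findings
        if f.resolved_on is None and (today - f.reported_on).days > max_open_days
    ]


# Illustrative entries only; not taken from any real report.
findings = [
    VendorFinding("UPS-A", "Replace battery string", date(2024, 1, 10)),
    VendorFinding("GEN-1", "Replace coolant hoses", date(2024, 3, 1),
                  resolved_on=date(2024, 3, 8)),
    VendorFinding("ATS-2", "Clean and retorque contacts", date(2024, 3, 20)),
]

overdue = overdue_findings(findings, today=date(2024, 4, 1))
for f in overdue:
    print(f"{f.equipment}: {f.recommendation} (open since {f.reported_on})")
```

The design choice worth noting is that a finding stays on the list until someone records a resolution date, so a deferred repair cannot quietly disappear from view.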
What’s REALLY Going On…?
Returning to the 2023 Uptime Institute outage analysis, let’s more closely examine the failure rates by components in the power system:
The report contains an interesting aspect, single-corded IT device failures, which industry professionals will instantly recognize as a ‘smoking gun’ of human error. Pretty much every serious enterprise IT organization prohibits the installation of single-corded devices, to preclude IT operational failures when a single-asset failure occurs.
But unless you’re an expert in the critical facilities management arena, you won’t notice the other ‘smoking gun’ that is hidden in plain sight:
Each of these systems (except for the single-corded IT device failures) is serviced by extremely well-trained technicians, almost always external vendors trained and licensed by the manufacturers of the equipment.
For example, Caterpillar generators have local vendors licensed and trained by Caterpillar. They are the very best at what they do; that’s all they do. So the likelihood of their making mistakes is essentially zero.
It must also be pointed out that those vendors, if the equipment they service fails because of a maintenance oversight, will be liable for losses incurred due to negligence on their part. Thus, they’re tremendously thorough in their maintenance operations. Their meticulousness assures their place as THE preferred vendors for mission-critical service in their region, both with customers and with the manufacturers they represent.
The likelihood that they make mistakes often enough to account for 44% of outages simply isn’t plausible.
So what IS happening?
The vendor reports that I’ve examined over the past six years, which I conservatively estimate at more than 20,000 documents, are consistently detailed, with the job plan, steps taken, parameters noted, testing results, observations for end-of-life component replacements (such as batteries or UPS capacitors), and recommendations at the end for mitigating conditions which compromise future reliability or operational readiness.
Of those >20,000 reports I have personally read, I can count on one hand the number of times a vendor report has dutifully recorded the parameters of a piece of equipment, where those parameters indicated an unusual condition was developing, and the vendor failed to mark in the summary that the condition was developing and needed to be addressed. That means the vendor failed to flag an unusual condition <0.025% of the time.
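The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that reads “count on one hand” as at most five misses and takes 20,000 as the lower bound on reports reviewed; both readings are assumptions made for the calculation.

```python
# Assumed inputs, read from the paragraph above.
reports_reviewed = 20_000  # "more than 20,000", so this is a lower bound
missed_findings = 5        # "count on one hand", taken as at most five

# Upper bound on the miss rate: most misses over fewest reports.
miss_rate = missed_findings / reports_reviewed

print(f"miss rate is at most {miss_rate:.3%}")
```

Because the true report count is higher and the true miss count may be lower, the actual rate sits below this bound, which matches the “less than” in the text.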
So if the vendors are so accurate in their reports, WHY do electrical systems account for 44% of outages?
The answer is the failure of the critical facilities management to take timely corrective action, whenever defects are discovered.
The Many Facets of Critical Facilities MIS-Management
There are a variety of ways that data center operators fail to heed vendor reports:
The vast majority of the time, the responsible manager who received the vendor report:
Assuming the manager received and read the report, AND understood the ramifications of defects detected, the most common failure-mode of maintenance management that often comes into play is financial; repairs were deferred, due to an inadequate budget.
I have seen this last element dozens of times.
UPS Batteries Are Primary Reason For Data Center Outages
Again, if you’re not in the business, you would look at the graph above and say that UPS failure is a big part of downtime, but this is actually very deceptive. You see, failure of a UPS system itself is exceedingly rare. These systems have 50 years of design history, and the manufacturers have gotten very good at what they do. So stating that it’s the UPS units is not accurate. You’ll also notice that UPS BATTERIES are not listed here; they’re invisible.
But UPS BATTERIES are the primary cause of power failures in data centers, for several reasons:
So data center operators prefer to avoid replacing failing batteries until they absolutely have to, often leaving IT clients exposed to the risk of unplanned outages, due to degraded UPS batteries.
Example situation:
This happens ALL THE TIME… I kid you not.
UPS Capacitor Example
Another recent example came from reading the annual UPS PM report from a vendor, where they stated that all of the UPS filter capacitors on both the input and output sides needed to be replaced immediately, as they were at end of life. The local manager had spotted the recommendations and tried to get the replacements done, but was overruled by upper management more concerned about saving a few bucks than keeping their equipment in excellent condition. It should be noted that when UPS capacitors fail, there are no warning signs of imminent failure, and when failure does occur, the results are usually catastrophic, involving explosions, fire, smoke… Yeah. And it’ll happen when the UPSs are at their maximum loading, i.e., when you need them MOST.
In that example, the local manager specifically asked me to mark this down (which meant the site automatically failed the audit) so as to put pressure on his management to get the repairs done. He was trying to do the right thing, but was being overridden by upper management who were willing to put the IT clients at risk.
Other Power System Failures- Causes
As I have explained with the UPS batteries, all of the other power systems listed as failure causes are serviced by outside vendors, due to the deep expertise required to service the individual systems, the variables introduced by manufacturers, and the need to maintain factory warranties (which are voided if maintenance is not performed by a manufacturer-approved vendor).
So the other failures listed (generators, transfer switches, ABTs, STSs, etc.) are all in vendor-maintained systems, where the maintenance is almost always perfect. As with the “UPS failure,” the cause of the outage ends up being not the equipment, and not the vendor, but a failure of management (for whatever reason).
How To Solve This?
The Uptime Report clearly demonstrates that the efforts by enterprise IT organizations to move to the cloud have not delivered relief from outages; in fact, outages are creeping up. The reasons are exactly what I described above.
While moving to colocation facilities has financial benefits, there are other costs associated with such a move, as I describe in Colocation’s Hidden Flaw: Lack of Agency. And the SOC-2 certification from any colocation, with regard to availability, is prima facie fraudulent; CPAs can no more audit data center engineering operations than they can audit brain surgery.
Outages will get worse, I assure you: the increasing strain on the national electric grid and the soaring power demands are setting the stage for more frequent and severe outages. The mis-management of maintenance of the safety-nets that support your equipment is going to become more obvious and more expensive.
So the logical question now arises: HOW can the IT client of a colocation company make sure the infrastructure supporting their IT assets is properly maintained?
HOW do you make sure that the Tier-III or Tier-IV facility you’ve leased is actually maintaining its infrastructure, so that single-failure events don’t turn into cascade failures?
The Amerruss LLC Audit Program IS THE ANSWER
We offer the only proven availability assurance audit program in the world, bar NONE.
We achieve incredible results by performing the following steps:
It is important to understand that any defects found are contractually the responsibility of the colocation vendor to mitigate, as the colocation contracts always stipulate the equipment will be maintained in accordance with industry practices and manufacturer recommendations. Thus, all repair costs are the responsibility of the colo vendor, NOT the IT client.
The audit program I developed and operated over the past 6 1/2 years, and now offer to you, delivered perfect uptime at >60 sites spread across a global portfolio. When compared to the probabilities of downtime as published by the Uptime Institute 2023 outage report, where 60% of respondents had suffered an outage in the previous three years, the likelihood of the portfolio under my audit program NOT suffering an outage was 2.04x10^-126. In other words, even competing against other critical facilities management professionals, my audit program delivered results that are statistically impossible, yet this was the result.
With our audit program, you no longer need to keep expanding your IT portfolio into more and more colocation facilities in a redundancy “arms race,” and you can save millions by having our audit program resolutely monitor the equipment that keeps your company safe and secure.
You won’t lose sleep at night, wondering if your company is exposed to hidden availability risks; we keep a sharp eye on things for you.
With increasing risks to the electric grid, you can no longer rely on an ersatz certificate like the SOC-2; you need to know the critical facilities components supporting your business are being properly maintained, so that WHEN utility interruptions occur, you sail through them without issue.
Reach out to Amerruss LLC today to initiate a tailored audit program for your data centers.
With our expertise and proven track record, we can transform your approach to data center management, ensuring resilience, efficiency, and, most importantly, uninterrupted service. The future of your IT infrastructure demands nothing less.