What IS “The Secret” To Running Data Centers?
Dr. Eric Woodell
World's #1 expert in data center resilience. I audit and certify colocation facilities, ensuring secure, continuous operations—insured by Lloyd's of London.
I’ve recently seen quite a few posts supposedly offering “the secret” to effectively running data centers: a new building management system, a data center infrastructure management (DCIM) system, staffing, a new UPS design or hydrogen fuel cells… Of course, those “secret” weapons are sold by the same people who write the posts. In other words, they’re marketing ploys to generate interest in whatever the newest product is. That is certainly understandable from their perspective; it’s how you generate business.
But this does bring up a relevant point, and that is, “what IS the secret to effectively running a mission critical facility?”
IS THIS EVEN A VALID QUESTION?
I often compare the industry of mission-critical facilities to the airline industry. The performance expectations are similarly high, the infrastructure is similarly sophisticated and demanding, and the training levels and behaviors are very close. And while big data-center operators are not prone to telling their secrets, or airing their dirty laundry, the airline industry is very public. Airplane failures and crashes are always publicized, and the lessons learned from them not only make the news but also feed several television shows that examine crashes, why they happened, and what steps are being taken to prevent similar incidents.
So if we look at the airline industry, would we be able to point to a single factor, and say “THIS is the key to successfully flying an airplane”?
Well, we could point to the fuel used by the engines… or the engines themselves. We could point to the design of the wing, or the pilots, or the air-traffic controllers, or even the maintenance crews, their procedures and the certifications required of any spare parts used… At one time or another, the failure of every single aspect I’ve just named has resulted in airplane crashes and loss of life. It is obvious that ALL of the pieces have to work properly, effectively and efficiently; the individual aspects are all crucial, and the failure of any one leads to predictable (and sometimes devastating) results. The same is true of any mission-critical facility.
“The Secret” to having your data center be as reliable, safe and profitable as you are expecting it to be, is that EVERY aspect must properly function within the operational framework.
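To put rough numbers on that idea, here is a minimal sketch of how the availability of a chain of interdependent elements is often approximated. It assumes the elements fail independently, so the overall availability is simply the product of the individual ones. The element names and figures below are hypothetical, chosen only for illustration, and are not measurements from any facility.

```python
from math import prod

# Hypothetical availability figures (fraction of time each element works).
# These are illustrative assumptions, not data from any real data center.
element_availability = {
    "power": 0.9999,
    "cooling": 0.9999,
    "staffing": 0.999,
    "procedures": 0.999,
    "maintenance": 0.999,
}


def overall_availability(elements):
    """Availability of a chain in which every element must work,
    assuming the elements fail independently of one another."""
    return prod(elements.values())


def weakest_element(elements):
    """Name of the element with the lowest availability."""
    return min(elements, key=elements.get)


total = overall_availability(element_availability)
print(f"overall availability: {total:.4f}")
print(f"weakest element: {weakest_element(element_availability)}")
```

Under these assumptions the chain as a whole is always less available than its weakest element, which is why improving only one component (say, a new UPS) cannot make up for gaps elsewhere.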
IT’S ABOUT THE PEOPLE
The Uptime Institute has said for nearly a decade that 70% of data center outages are due to human error; equipment failure accounts for only 30%. This logically means that while executive managers of data centers and other critical facilities focus their attention on infrastructure components (UPS systems, generators, Tier-II vs. Tier-III, etc.), these factors are not the main reason data centers fail. The biggest factor is the people. [And I’m not referring to who you’re putting on your staff; I’m referring to what your staff has to work with.]
The following items seem like common sense, but you might be surprised how many executives simply don’t consider them:
Manpower: This is the first element to consider. Do you have enough people to perform the work that needs to be accomplished on a daily basis? Do they escort vendors for service calls, which takes them away from their normal duties?
Training: Have the staff received proper training on site-specific gear, and has it been documented? Are shift personnel qualified for specific shift operations and maintenance functions?
Financial management: Are OPEX and CAPEX budgets of sufficient magnitude to fund critical-facilities projects and normal operations? Are they separate from other budgets? [As one friend pointed out, “Data Center managers are squeezed to save every penny that they get so what suffers, the infrastructure.” And I would add, the people as well…]
Reference library: Do you have one, with as-built drawings, operations and maintenance documents, warranty and commissioning records, and automation sequences of operation? How about studies for infrastructure systems, soil, water and structure? Are the master copies being kept in a safe, centralized location? Is there a system for managing floor space, power and cooling? Are these aspects being monitored, and is a process in place to forecast future growth?
Organization: Does the staff have an established reporting chain, including a call-out list during emergencies? Do the staff have job descriptions available, which state their duties AND expectations? Are roles and responsibilities clearly delineated? Are key roles clearly established with their own duties and responsibilities? Is there a succession plan in place?
Maintenance: Is there an effective preventative maintenance (PM) program in place, including detailed procedures that are clearly written for technicians to follow? Is there a QA system to validate that PMs are performed as intended? Is there an effective maintenance management system in place, where hours and equipment history are tracked? Is there a detailed inventory of spare parts on hand, and a list of suppliers and vendors for when unstocked spares are needed rapidly? Is there a life-cycle planning system in place? How about a failure analysis program, and tracking of deferred maintenance?
MANAGEMENT AWARENESS OF EXISTING RISKS
Take a look at the following graphic, which someone posted last week:
Two comments on the above graphic were “The question … becomes do you have any employees with the fortitude to inform leadership?” and “… can you find Executives that want to hear there is a problem?”
An essay I recently saw on LinkedIn was about the “authenticity” of employees. Basically, those people who dare to be honest have lower scores on their performance evaluations, receive lower bonuses and generally don’t survive long in most companies. Those who do survive are those who bury their feelings, don’t communicate openly and just work with what they have available. Or, as a friend succinctly stated, they’re “just collecting a check.” I’ve been in discussions about how employees who point out single points of failure are considered “trouble-makers” in some managers’ eyes. And even worse, there are some managers who use the simple motto: "If it ain't broke, don't fix it." In the arena of mission-critical facilities, this is not what you want. But human behavior, for both employees and managers, doesn’t really change, meaning that critical facilities will have hidden flaws that the boots-in-the-trenches know about but executive management does not.
This translates into hidden risks for the owners of critical facilities which could significantly impact their risk management profile, and of course their customers. This might be an acceptable situation for some leaders, in the short term. But sooner or later, hidden risks will manifest themselves, either through unplanned outages or other unusual events.
Executive managers must get objective information on issues that represent threats to their current operations, as well as opportunities wherever available, to reduce costs, increase operational effectiveness and increase overall profitability. When data center failures translate into millions of dollars of lost revenues, it is incumbent on the executive leadership to be fully informed of outstanding risks, so that they can make equally informed decisions on how to mitigate those risks.
HOW IS THIS DONE?
Getting an adequate determination of the condition of your critical facilities requires an objective observer, an outsider who can come in and perform an honest assessment. Naturally, the ideal situation requires a vendor with no financial stake in the results of the assessment that would represent a conflict of interest: no commissions on equipment sales, engineering fees or other “add-ons.”
For that assessment to be meaningful, it should cover not only the infrastructure, which accounts for only 30% of data center failures, but also the other 70%, which is due to human-oriented issues such as the items listed above. (There are many more, but I didn’t want to belabor the point.)
This, of course, goes far beyond engineering codes, typical tests and engineering classes. Such assistance requires direct engagement with the on-site staff, asking questions and letting them know they are being heard, without threat of retaliation. [Another comment on the Iceberg graphic: “I have seen leadership teams manage by fear and employees stop engaging. Then when something goes wrong, leadership questions why they were never told.”]
WHAT ARE THE BENEFITS?
With such assistance, executive leaders will be able to get the “full picture” of outstanding risks in their facilities and, with their managers, determine a corrective action plan to resolve outstanding issues. There will be payoffs in several aspects: increased reliability and redundancy, reduced complexity, issues identified before they become problems, a structured program of procedures and maintenance, enhanced effectiveness of staff and reduced operational risk.
For more information on how we can assist you, please visit our website at www.amerruss.com
Have a great 2024
8 years ago: Great article. People are key... as long as there is adequate equipment and funding. People need to be trained and educated on the value of the mission at hand. Put a good tool in the hands of a master and it will outperform the newest whiz-bang device in the hands of a novice every time. Put the new technology in the hands of a master with adequate training and now you have something.
Let's embrace curiosity together!
8 years ago: Great article Eric! To gain a seat at the table with the most sacred "Trench Leaders", you must first show yourself approved. You must take on that risky situation they have been complaining about for years, understand their point of view, do something about it, and be able to get down in the trenches with them. After that you will start to hear about the problem areas that only the trusted few will ever know about. Manager after manager just cares about meeting production and metric objectives. However, when seeking excellence in Critical Environments, you must understand your team first. They are your key to success, and when they know you care about them they will cover you and they will make you look good.
Global Real Estate Executive | Board Member | Non Exec | | Leading Transformation | Driving Innovation | Building Exceptional Teams |
8 years ago: A good read! Human Factors underlie every one of the components discussed. After 15 years the industry still focuses disproportionately on product and design whilst failing to recognise the fundamental contribution to risk made by the behaviours, competence, motivation and culture prevalent in a DC.
Chairman & CEO at EPI Group of Companies
8 years ago: People are indeed the prominent factor, and yet they often do not get the focus and resources. It seems that buying that new fancy piece of hardware is far more justifiable than sending people to training (often at a fraction of the cost) and reviewing operational processes. That is the reason why we created the DCOS (Data Centre Operations Standard), which addresses all you mentioned above and more. It's not just about having great processes; it is just as important to have them all integrated, which is another key reason why data centres go down and one of the often ignored facts. People like to build their own empires, and that can be deadly in a data centre environment. For more info https://www.epi-ap.com/dcos