What We Can Learn from Delta Airlines' Data Center Outage

We all saw the news today of Delta Airlines' data center outage that stranded flights worldwide. Like many of you, I just assume that the systems supporting major airlines, and most major industries, will never inconvenience me. I leave for the airport about 2½ hours before most of my flights, building in a little time for the increasingly frequent traffic jams or security-line delays. I use the Waze app to get to the airport by the fastest, most efficient route. My phone is always in use, whether I'm checking email, returning texts, checking Twitter for the latest breaking news, or just catching up on Facebook. We assume all of these applications and systems will always be available for our convenience. Then there are mornings like today (and Southwest Airlines' computer issues earlier this month). I'm glad I was not traveling today.

I'm a Delta Medallion traveler, Silver Medallion to be accurate. In Atlanta, there are more Medallion passengers than non-Medallion passengers. OK, maybe that's a stretch, but it certainly seems like it. Being a Silver Medallion is akin to being the worst player on a good team. At least I can get an exit row. Every day, thousands of business travelers jump on Delta jets for that next meeting or to finally come home. They expect Delta to acknowledge their loyalty. They expect Delta to know who they are, where they are, where their bags are, and to give them up-to-the-minute updates on seat upgrades and flight status. Then there are mornings like today.

This is in no way a criticism of Delta; in fact, quite the opposite. By most reports, Delta employees were extremely courteous and understanding. Delta did the right thing and honored change requests and refunds for all flights through the end of the week. Delta did everything it could to rectify the situation. As a customer, you want to know how a company will respond when there is an issue. Delta responded very well today, in my humble opinion.

What could they have done differently to prevent this costly interruption in service? Only Delta will know the actual details of how all of its reservation systems failed on Monday morning. It has been reported that the outage may have been the result of a power failure somewhere in their system; Delta is now referring to it as "failed switchgear." In residential terms, we would call it a "fuse box." In the data center world, we call it a remote power panel. As someone in the data center industry, I have a hard time believing this was the case, but I'll take them at their word. Computers do break down. Power supplies do fail. Most companies build redundancy into their applications and systems. But a company the size of Delta, with the resources it has available, should never have a power failure, or failed switchgear, as the main cause of an outage.

Today's data centers are built to be robust and resilient so that they can work around a single-source power failure. Today's better data centers are fed by multiple concrete-encased underground power feeds from diverse substations. Generators are sized to carry a data center's IT load for multiple days. UPS systems are designed to carry the IT load for a few minutes while the generators come up to full speed. Static transfer switches allow multiple electrical feeds to each remote power panel (RPP), which in turn provides redundant (A and B) circuits to each cabinet.
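To make that layering concrete, below is a minimal sketch of a generic 2N power path, written in Python purely for illustration. The component names, the A/B topology, and the failover rules are my simplifying assumptions about a typical design, not a model of Delta's actual facility.

```python
# Minimal sketch of a generic 2N power path, purely illustrative.
# The topology and rules below are simplifying assumptions about a
# typical design, not a description of Delta's actual facility.

def side_has_power(utility_ok: bool, generator_ok: bool, ups_ok: bool) -> bool:
    """One side (A or B) can feed its RPP if its UPS is healthy and
    either the utility feed or the backup generator is available."""
    return ups_ok and (utility_ok or generator_ok)

def cabinet_has_power(side_a_ok: bool, side_b_ok: bool) -> bool:
    """With redundant A and B circuits to the cabinet, the IT load
    stays up as long as at least one side is still delivering power."""
    return side_a_ok or side_b_ok

# Example: utility feed A is lost, its generator picks up the load,
# and side B is untouched -- the cabinet never sees an outage.
side_a = side_has_power(utility_ok=False, generator_ok=True, ups_ok=True)
side_b = side_has_power(utility_ok=True, generator_ok=True, ups_ok=True)
print(cabinet_has_power(side_a, side_b))  # True
```

The point of the sketch: to blacken a cabinet, failures have to stack up on both sides at once, which is exactly what the list below walks through.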

So, in order to experience a power failure, one or more of the following had to occur (a rough illustration of the combined odds follows this list):

  • A server or servers had to fail, and the backup servers failed to respond
  • Both PDUs within a cabinet or multiple PDUs failed to provide power to the servers
  • Multiple RPPs had to fail simultaneously (if the reports are accurate)
  • Multiple static transfer switches had to fail simultaneously
  • The primary and failover UPS systems had to fail simultaneously
  • The generators supporting the UPS systems failed to generate enough power to support the load. The redundant (N+1 or 2N) generator also failed to start in time to support the IT load
  • Both substations had to fail at the same time and one or more of the above scenarios had to occur.
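As a rough illustration of those combined odds, here is a back-of-the-envelope calculation. The failure probabilities are made-up round numbers chosen only to show the effect of independence; they are not measured rates for any real UPS, generator, or switchgear.

```python
# Back-of-the-envelope look at why truly independent redundant components
# rarely fail together. The probabilities are made-up illustrative numbers,
# not measured failure rates for any real UPS, generator, or switchgear.

p_single = 0.001            # assumed chance one unit fails in a given window
p_both = p_single ** 2      # both redundant units fail independently in that window

print(f"One unit fails in the window:   {p_single:.3%}")   # 0.100%
print(f"Both units fail in the window:  {p_both:.4%}")     # 0.0001%

# The catch: this multiplication only holds if the failures are independent.
# A shared cause -- a surge, an untested failover, a single point of failure
# in the design -- can take out "redundant" gear together.
```

That caveat about shared causes is exactly where my suspicion lands, as the next list suggests.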

With a Tier III-type data center, a power failure is highly unlikely; there should be no single point of failure in the power design. I won't say outages can't occur. Customers are more likely to see an outage due to network issues (cable cuts, router failures, etc.), application design (bugs, untested code), or malicious intent (hacking, sabotage). That would lead me to believe that, if it indeed was a power failure, one of the following actually occurred:

  • The servers are located in an older data center not designed to handle today's higher-capacity equipment.
  • The data center design had a single point of failure and it was exposed.
  • A power surge damaged the redundant equipment designed to handle the load in the event of failure.
  • Temperatures in the data center reached a level where servers began to power down, possibly due to a failure of the data center's cooling system.
  • Malicious activity kept systems from operating as they should in the event of an outage.

So what can we learn from Delta's misfortune? Protect your company by finding a data center that is designed to be "always on." Ask questions about the power feeds. Ask what happens if a generator fails to respond. Ask how power is distributed to the floor. Ask the data center operator to show and explain their electrical line-ups. If it's a true Tier III-type data center, you'll understand my skepticism.

You can reach Bob at [email protected]

Considering how long certain aircraft are kept in the fleet, it is not surprising that Delta may have tried to milk their hardware for as long as possible before needing to replace it.

Desmond Hardy

Cybersecurity Consultant | VoIP Specialist | MSSP Maven | Cloud Consultant

8y

This essay was delivered with surgical precision, Bob. I found myself asking the same exact questions. Whoever manages the Delta account for their storage and datacenter operations should expect a hefty commission for the adjustments that Delta should be making over the coming months.

It goes without saying for those in the Infrastructure Applications/Data Center industry, this is cringeworthy. Too many variables had to fail, if they were even there??? Could it be they had no logical failover? Was it really all physical on-site and hadn't been tested? The details need to come out soon; blaming Georgia Power was not a good response by any measure. :o)

Lori O'Toole

Account Executive at CBTS Company, formerly OnX Enterprise Solutions

8y

Great points on how there truly is no single small failure that should take an entire system down, and certainly not for the length of time it was down. Planning, testing, and retesting are crucial! A failover plan only works efficiently if it is tested often. Workloads and applications change too frequently for a system to stay resilient if it's never tested.
