What We Can Learn from Delta Airlines' Data Center Outage

We all saw the news today of Delta Airlines' data center outage that stranded flights worldwide. Like many of you, I just assume that the systems supporting major airlines, and most major industries, will never inconvenience me. I leave for the airport about 2½ hours before most of my flights, building in a little time for the increasingly frequent traffic jams or security-line delays. I use the Waze app to get to the airport by the fastest, most efficient route. My phone is always in use, whether I'm checking email, returning texts, checking Twitter for the latest breaking news, or just catching up on Facebook. We assume all of these applications and systems will always be available for our convenience. Then there are mornings like today (and Southwest Airlines' computer issues earlier this month). I'm glad I was not traveling today.

I'm a Delta Medallion traveler, Silver Medallion to be accurate. In Atlanta, there are more Medallion passengers than non-Medallion passengers. OK, maybe that's a stretch, but it certainly seems like it. Being a Silver Medallion is akin to being the worst player on a good team. At least I can get an exit row. Every day, thousands of business travelers jump on Delta jets for that next meeting or to finally come home. They expect Delta to acknowledge their loyalty. They expect Delta to know who they are, where they are, where their bags are, and to give them up-to-the-minute updates on seat upgrades and flight status. Then there are mornings like today.

This is in no way a criticism of Delta; in fact, quite the opposite. By most reports, Delta employees were extremely courteous and understanding. Delta did the right thing and honored change requests and refunds for all flights through the end of the week. Delta did everything it could to rectify the situation. As a customer, you want to know how a company will respond when there is an issue. Delta responded very well today, in my humble opinion.

What could they have done differently to prevent this costly interruption in service? Only Delta will know the actual details of how all of its reservation systems failed on Monday morning. It has been reported that the outage may have been the result of a power failure somewhere in their system; Delta is now referring to it as "failed switchgear." In residential terms, we would call it a "fuse box." In the data center world, we call it a remote power panel. As someone in the data center industry, I have a hard time believing this was the case, but I'll take them at their word. Computers do break down. Power supplies do fail. Most companies build redundancy into their applications and systems. But a company the size of Delta, with the resources it has available, should never have a power failure, or failed switchgear, as the main cause of an outage.

Today's data centers are built to be robust and resilient so that they can work around a single-source power failure. Today's better data centers are fed by multiple concrete-encased underground power feeds from diverse substations. Generators are sized to carry a data center's IT load for multiple days. UPS systems are designed to carry the IT load for a few minutes while the generators come up to full speed. Static transfer switches allow multiple electrical feeds to each remote power panel (RPP), which in turn provides redundant (A and B) circuits to each cabinet.
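To make that layering concrete, below is a minimal sketch of a generic 2N power path, written in Python purely for illustration. The component names, the A/B topology, and the failover rules are my simplifying assumptions about a typical design, not a model of Delta's actual facility.

```python
# Minimal sketch of a generic 2N power path, purely illustrative.
# The topology and rules below are simplifying assumptions about a
# typical design, not a description of Delta's actual facility.

def side_has_power(utility_ok: bool, generator_ok: bool, ups_ok: bool) -> bool:
    """One side (A or B) can feed its RPP if its UPS is healthy and
    either the utility feed or the backup generator is available."""
    return ups_ok and (utility_ok or generator_ok)

def cabinet_has_power(side_a_ok: bool, side_b_ok: bool) -> bool:
    """With redundant A and B circuits to the cabinet, the IT load
    stays up as long as at least one side is still delivering power."""
    return side_a_ok or side_b_ok

# Example: utility feed A is lost, its generator picks up the load,
# and side B is untouched -- the cabinet never sees an outage.
side_a = side_has_power(utility_ok=False, generator_ok=True, ups_ok=True)
side_b = side_has_power(utility_ok=True, generator_ok=True, ups_ok=True)
print(cabinet_has_power(side_a, side_b))  # True
```

The point of the sketch: to blacken a cabinet, failures have to stack up on both sides at once, which is exactly what the list below walks through.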

So, in order to experience a power failure, one or more of the following had to occur (a rough illustration of the combined odds follows this list):

  • A server or servers had to fail, and the backup servers failed to respond
  • Both PDUs within a cabinet or multiple PDUs failed to provide power to the servers
  • Multiple RPPs had to fail simultaneously (if the reports are accurate)
  • Multiple static transfer switches had to fail simultaneously
  • The primary and failover UPS systems had to fail simultaneously
  • The generators supporting the UPS systems failed to generate enough power to support the load. The redundant (N+1 or 2N) generator also failed to start in time to support the IT load
  • Both substations had to fail at the same time and one or more of the above scenarios had to occur.
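As a rough illustration of those combined odds, here is a back-of-the-envelope calculation. The failure probabilities are made-up round numbers chosen only to show the effect of independence; they are not measured rates for any real UPS, generator, or switchgear.

```python
# Back-of-the-envelope look at why truly independent redundant components
# rarely fail together. The probabilities are made-up illustrative numbers,
# not measured failure rates for any real UPS, generator, or switchgear.

p_single = 0.001            # assumed chance one unit fails in a given window
p_both = p_single ** 2      # both redundant units fail independently in that window

print(f"One unit fails in the window:   {p_single:.3%}")   # 0.100%
print(f"Both units fail in the window:  {p_both:.4%}")     # 0.0001%

# The catch: this multiplication only holds if the failures are independent.
# A shared cause -- a surge, an untested failover, a single point of failure
# in the design -- can take out "redundant" gear together.
```

That caveat about shared causes is exactly where my suspicion lands, as the next list suggests.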

With a Tier III-type data center, a power failure is highly unlikely; there should be no single point of failure in the power design. I won't say outages can't occur. Customers are more likely to see an outage due to network issues (cable cuts, router failures, etc.), application design (bugs, untested code), or malicious intent (hacking, sabotage). That would lead me to believe that, if it indeed was a power failure, one of the following actually occurred:

  • The servers are located in an older data center not designed to handle today's higher-capacity equipment.
  • The data center design had a single point of failure and it was exposed.
  • A power surge damaged the redundant equipment designed to handle the load in the event of failure.
  • Temperatures in the data center reached a level where servers began to power down, possibly due to a failure of the data center's cooling system.
  • Malicious activity kept systems from operating as they should in the event of an outage.

So what can we learn from Delta's misfortune? Protect your company by finding a data center that is designed to be "always on." Ask questions about the power feeds. Ask what happens if a generator fails to respond. Ask how power is distributed to the floor. Ask the data center operator to show and explain their electrical line-ups. If it's a true Tier III-type data center, you'll understand my skepticism.

You can reach Bob at [email protected]

Considering how long certain aircraft are kept in the fleet, it is not surprising that Delta may have tried to milk their hardware for as long as possible before needing to replace it.

Desmond Hardy

Cybersecurity Consultant | VoIP Specialist | MSSP Maven | Cloud Consultant

8y

This essay was delivered with surgical precision, Bob. I found myself asking the same exact questions. Whoever manages the Delta account for their storage and datacenter operations should expect a hefty commission for the adjustments that Delta should be making over the coming months.

It goes without saying for those in the Infrastructure Applications/Data Center industry, this is cringeworthy. Too many variables had to fail, if they were even there??? Could it be they had no logical failover? Was it really all physical on-site and hadn't been tested? The details need to come out soon; blaming Georgia Power was not a good response by any measure. :o)

Lori O'Toole

Account Executive at CBTS Company, formerly OnX Enterprise Solutions

8y

Great points on how there truly is no single small failure that should take an entire system down, and certainly not for the length of time it was down. Planning, testing, and retesting are crucial! A failover plan only works efficiently if it is tested often. Workloads and applications change too frequently for a system to stay resilient if it's never tested.
