The SWA Meltdown
Sanjeev Khadilkar
Program Management for Software Product Engineering - Intelligent, Web Scale, Distributed, Mobile, Real Time, Embedded
A Case Study of Catastrophic Cascades
with lessons for the technology program management of large complex systems
Abstract
Catastrophic cascades, chains of low-likelihood events with cumulative impact, have the potential to severely challenge business continuity and disaster recovery plans. A system-of-systems approach, resilient anti-fragile design, cybernetic design of robotic process automation with a human in the loop, validation by stress testing, and adaptation to political, economic, societal and regulatory evolution are key elements in the technical program management of large complex systems.
Part 1: What Happened
Context
In the last week of 2022, Southwest Airlines, which carries more passengers within the United States than any other airline, cancelled between 60% and 75% of its flights, in a meltdown that began when the holiday winter storm hit. The operational meltdown at Southwest Airlines stranded thousands of travelers across the U.S., with many enduring long lines and missing luggage.
Elderly passengers sat in wheelchairs for hours, mothers ran out of formula for their infants, critical medicines remained out of reach in lost checked-in baggage and a lot of families slept on the floor. "It was just imploding, and no one could tell you anything," one traveler said. “(The airline’s frontline employees) were desperately trying to help, but you could tell they were just as clueless as everybody else … it was scary."
Frustrated travelers vented their emotions online. “I have been to the airport 5 times in last 6 days. (It) IS A MESS!!!!!!!!! OMG you should see the bags in baggage claim.” “It took 3-4 hrs of lines to get to the SWA desk both in the terminal and gates. Waits on the phone were 7-8 hrs.” “Food was out in the food courts at the terminals and the gates because of all the people. Toilets clogged, vomit from drunks hanging for days with nothing to do, no charging of electronics as they were being fought over, floors and wherever were bedrooms.”
“Our flight to San Francisco was cancelled on Christmas night (12/25). The plane we were expecting was coming from San Francisco and would be flying back. At the time, a flight to Oakland was already delayed, and in fact, they de-boarded the plane once due to a lack of crew. An announcement came in that a crew was found and at almost the exact same time, our flight was completely cancelled. Sure enough, our plane arrived, a crew came off, and they were whisked away to the Oakland plane. I'm guessing there are additional regulations or perhaps penalties that come with a plane that has boarded.”
“My wife boarded a 9:30am flight at almost 5pm and then they made everyone deplane because the pilot timed out. Probably for the best, since they would've just been stranded after their connecting flight most likely would've cancelled.” “Our flight was cancelled 10 hours after scheduled departure… after they told us every hour the flight would still leave, once they found a flight attendant (which obviously never happened).” “This happened to us. Plane was there, pilots there, 4 crew all at the gate. 1 crew member needed to be added to the flight. They couldn’t get hold of corporate to do the addition of the crew member, so pilots timed out and flight cancelled.”
“Wish I had known. Would never have checked the family's bags in. Now the bags are in the wind.” “After being cancelled, one family that had already checked in was turned away from baggage claim without the car seats of their young children which by then were lost in the black hole of airline luggage transport, creating a handicap for them to travel by road.”
Unable to find plane, train or bus seats, one family felt lucky to manage a rental car and drove 12 hours to reach their destination hundreds of miles away. “My two nephews had to drive from central Texas back to California because of the mess.” “My two primary rental agencies had no cars.” “My wife and I had to drive 21 hours home due to cancelled flight.” One couple with two young children rented a car and drove 26 hours from New York to Denver, taking a quick breather at a hotel in Illinois, about halfway through.
The cancellation of so many flights has also had an impact on cargo operations. Brian Patrick Bourke, CCO of SEKO Logistics, tweeted that the disruptions had severely impacted cargo operations and that this was the second peak season in a row that they had to find alternatives for larger clients.
Lack of communication from the company was especially frustrating. Some passengers didn't know their flight had been cancelled until they happened to check the status online. Others say they waited hours to speak to a Southwest representative on the phone, and later waited days to retrieve checked bags after flights were cancelled.
All other domestic airlines recovered fast, returning to pre-storm delay and cancellation levels soon after being knocked off-kilter by the severe winter storm. A tweet from Southwest directing customers to self-service options had more than 1,000 replies -- many of them angry. One of the replies in part read: "Stop blaming the WEATHER! Had to buy a first-class ticket on another airline but it TOOK OFF ON TIME! You still have our luggage with medication inside! Can't get through on the phone!" Blaming the problems on the storm seems reasonable, until you realize that the same storm didn't grind the rest of the airlines to a halt.
For Southwest Airlines, the travel mess lingered like a vicious hangover with migraine-proportioned headaches, and the pain spread further even after the storm had passed. The crisis worsened, spiraling into a complete meltdown of its flight system that paralyzed the airline, with Southwest at one point accounting for between half and nearly three-quarters of all U.S. flight cancellations. In the days that followed, the carrier’s scramble to recover was slow, with flight cancellations continuing far longer than at its competitors. Southwest was still cancelling thousands of flights three days after the storm was over, long after its rivals had resumed normal service.
"In short, everything possible went wrong for Southwest" said an editor of a travel website. "At this point, we can very safely say that this is no longer a weather-related disturbance. We've had clear skies in the United States for several days now, more or less, and Southwest is the only airline that is failing so spectacularly here." “This has been a week. Not a couple days.” “The ‘wait here all day to not go anywhere’ is just brutal.”
“This is the largest-scale event that I’ve ever seen,” Chief Executive Bob Jordan said in an interview, as the Dallas-based airline proved unable to stabilize its operations, adding that Southwest planned to operate just over one-third of its typical schedule in the coming days to give itself leeway for crews to get into the right positions.
The US Transportation Department called Southwest’s rate of cancellations “disproportionate and unacceptable” and said it would examine whether the cancellations were controllable and whether the airline is complying with its customer-service plan.
"We're going to expect them to go beyond the letter of the law in terms of how they treat passengers, making sure they pay for things like hotels, ground travel expenses, meals and of course, refunds," Buttigieg said.
A proposed class action filed on Dec. 30 in New Orleans federal court (Capdeville v Southwest Airlines Co, U.S. District Court, Eastern District of Louisiana, No. 22-05590) accuses Southwest of breach of contract, seeking damages for passengers on Southwest flights cancelled since Dec. 24, and who did not receive refunds or expense reimbursements.
President Joe Biden urged consumers to check if they’re eligible for compensation as cascading airline delays have disrupted holiday travel across the country. “Our Administration is working to ensure airlines are held accountable,” Biden tweeted.
Later, the U.S. Transportation Department (USDOT) said in a notice posted on its website it intended to hold airlines, ticket agents and others “accountable and deter future misconduct by seeking higher penalties that would not be viewed as simply a cost of doing business.”
What went wrong
Early reactions identified the cause as a combination of the location of bad weather and, some said, execution challenges including a crew-scheduling system that was overwhelmed by and buckled under the volume of changes. Some flights were boarded, then cancelled and deplaned due to flight crews running out of allowable work time.
A huge swath of the US was under some form of weather advisory or alert in a very compressed timetable, during a time of year when travel was already extremely pressured. Southwest has a major presence in locations which were heavily affected by the storm. Planes froze overnight, and were unusable until the following midday. Airports ran out of space for de-icing, hobbling operations. Unexpected problems, such as fog, staffing shortage and a logjam created by planes grounded for the night, continued to crop up one after another.
Essentially a short and medium haul airline that mostly doesn’t do long haul services except for Hawaii, Southwest turns aircraft quickly, in less than 30 minutes, to achieve higher aircraft utilization than any other major US airline. They often run their crews on tight loops where they’re out from home and back the same day so they can save money on accommodating crews who overnight away from their home base. A Southwest pilot may take off in the morning from one city, then fly to two, three, four, five, six other cities, before returning home to spend the night.
For longer circuits, there's a flight crew change somewhere in there, with the timed-out crew overnighting while the plane flies on with a rested crew for more legs across the country and returns to the city where it started. When you get a situation like this, everything tends to get out of position and backfilling can be a nightmare.
When Southwest melted down, they didn’t have nearly the number of rooms reserved that they needed for their own crew, and it was Christmas so hotels were full and crews got stuck. Pilots booked their own hotels when the airline didn’t assign them. Crews often did not get rooms and just got dumped like passengers at airports. Though there are crew break rooms at most airports, they are not very comfortable for sleeping over. Some flight attendants spent the night on cots in crew lounges. Lacking accommodations, some found their own way home.
The system couldn’t keep up with the changes, meaning crews no longer were where the scheduling system thought they were and they lost track of most of their employees. With operational conditions forcing daily changes to schedules at a volume and magnitude that swamped the system’s ability to recover, not knowing where their crews were undermined the airline’s efforts to restore full capacity. Plagued by staffing shortages and failures of automated systems, the airline attempted manual workarounds that overwhelmed the capacity of available staff.
Many crews were “unaccounted” for over several days. The airline flight crew has no front-end technology to input their whereabouts into the system. The only way for the crew to update their location was to call in on the phone to share that information with a scheduler.
To make matters worse, the phone system was not working well, heavily overloaded, with some employees trying to get in touch with scheduling for 12-16 hours. The schedulers quickly put together a web form as an alternate way for crew to submit updates, but without backend integration it was too much to keep up on manually and ultimately that method for tracking crews also failed.
Putting flight crews and aircraft back together became a train wreck. If a flight attendant or pilot at Southwest Airlines Co. gets reassigned to another flight or a different hotel, someone has to call them or chase them down physically in the airport and let them know. Almost everything still has to be done by paper, such as assigning ramp agents to certain zones or aircraft for unloading bags.
“Matching crew members with aircraft broke down as the airline struggled to meet Federal Aviation Administration (FAA) regulations.” "With those cancellations and as a result, we end up with flight crews and airplanes that are out of place and not in the cities that they need to be in to continue to run our operations," the airline said at a news conference. By Christmas evening the crew scheduling department was essentially unable to do anything but simple, one-off assignments.
Technology handicaps
In trying to explain later how Southwest Airlines melted down, airline executives, labor leaders and IT engineers responsible for fixing, maintaining and keeping the software running also pointed to inadequate technology systems. Woefully antiquated home-grown systems developed in silos during the 1990s were not integrated with each other. The flight cancellation system could not talk to the flight reservation system. There were no app or internet options; it was all manual entry.
Outdated software packages had not been upgraded. CPU, memory and disk space on the server had not been expanded to keep pace with growing utilization. The system had settings that you DO NOT fiddle with for fear of crashing the whole enchilada. One former software developer commented online, “Wow. I'd bet the nerds that know all the ways to make the system work were called in to fix the problem. They know all the hacks, all the data inconsistencies and back doors and weird behaviors that add up to a ‘functioning’ system. I'm imagining code comments like ‘Don't change this!’ everywhere.”
For reassigning crews after flight disruptions, Southwest uses an operations research (OR) system called SkySolver that optimizes the route planning using a complicated mathematical system. SkySolver is an off-the-shelf application that Southwest has customized and updated, but that is nearing the end of its life, according to the airline. The program was developed decades ago and is now owned by General Electric Co., which said its software isn’t an end-to-end solution, but rather a backend algorithm that airlines can supplement with other software to resolve crew-related disruptions.
The SkySolver based crew scheduling system creates the automated flow of crewmembers moving about their day and publishes the assignments. It doesn’t know whether they actually flew the assigned leg. It just assumes so and moves the piece forward for scheduling the next leg. In the event of a disruption, crewmembers call scheduling, the scheduling managers manually adjust data, and the system re-computes a fresh solution.
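The optimistic "assume it flew" bookkeeping described above can be illustrated with a minimal Python sketch. This is hypothetical; the class and method names are invented for illustration and are not taken from SkySolver:

```python
# Minimal sketch (hypothetical, not SkySolver's actual design) of a scheduler
# that advances each crew's *assumed* location as legs are published,
# without ever confirming the leg actually flew.

class CrewScheduler:
    def __init__(self):
        self.assumed_location = {}   # crew_id -> airport the system thinks they're at
        self.actual_location = {}    # ground truth, updated only by phone calls

    def add_crew(self, crew_id, airport):
        self.assumed_location[crew_id] = airport
        self.actual_location[crew_id] = airport

    def publish_leg(self, crew_id, origin, destination):
        """Assign a leg and immediately assume it will be flown."""
        if self.assumed_location[crew_id] != origin:
            raise ValueError("crew not assumed at origin")
        self.assumed_location[crew_id] = destination  # optimistic update

    def report_disruption(self, crew_id, real_airport):
        """A crew member phones in their true whereabouts."""
        self.actual_location[crew_id] = real_airport

    def out_of_sync(self):
        """Crews whose assumed and actual positions have diverged."""
        return [c for c in self.assumed_location
                if self.assumed_location[c] != self.actual_location[c]]

sched = CrewScheduler()
sched.add_crew("crew1", "DAL")
sched.publish_leg("crew1", "DAL", "DEN")   # system now assumes crew1 is in DEN
sched.report_disruption("crew1", "DAL")    # but they never left Dallas
print(sched.out_of_sync())                 # ['crew1']
```

Once assumed and actual positions diverge, every subsequent leg published from the assumed position compounds the error, which is why phoned-in corrections became the only path back to consistency.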
System overload
The system does work; it just works for an airline one-third the size of Southwest, or during a more typical disruption like past hurricanes and snowstorms. Southwest had expanded faster than its operational infrastructure could keep up, meaning something that should have been a worse-than-average-but-not-catastrophic winter event turned into an utter disaster because it was just enough to stretch everything past its breaking point.
When the storm came it impacted ground ops very badly. When the weather hit all those stations at once the ramp crews had to work in shifts to not become injured due to the cold. That slowed down the turns and backed up the planes.
Southwest had anticipated this, and issued a "State of Operational Emergency" at the Denver airport on Dec. 21, ahead of the winter storm, threatening to terminate employees who did not work mandatory overtime and those who were sick and did not provide in-person doctors' notes, among other things. But it was not enough in view of the system collapse.
When each news alert came, the update had to be manually input to the crew scheduling software and the whole solution re-computed. Given the pace of new alerts and the duration needed to re-compute each time, the system simply couldn’t keep up. Even as it tried to solve one set of problems, new ones would emerge.
The scale of the 2022 storm overwhelmed its capacity to process disruption alerts and re-compute schedules before the next disruption alert came. SkySolver was unable to track and coordinate all the pilots, crew members, and airplanes under the rapidly changing conditions of nationwide disruptions, and got backlogged on catching up with status updates in real time.
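The overload dynamic, in which each re-computation takes longer than the gap between disruption alerts, can be sketched with a toy queueing model. The numbers below are illustrative, not Southwest's actual figures:

```python
# Toy queueing sketch: if each schedule re-computation takes longer than the
# average gap between disruption alerts, the solver's lag behind real time
# grows without bound. All numbers are illustrative.

def solver_lag(n_alerts, alert_gap, recompute_time):
    """Minutes by which the solver trails the latest alert after n_alerts,
    processing alerts one at a time in arrival order (FIFO)."""
    finish = 0.0                       # time the solver finishes its current run
    for i in range(n_alerts):
        arrival = i * alert_gap        # alert i arrives at this minute
        start = max(arrival, finish)   # wait if the solver is still busy
        finish = start + recompute_time
    return finish - (n_alerts - 1) * alert_gap

# Typical storm: an alert every 10 min, recompute in 8 min -> lag stays bounded.
print(solver_lag(100, alert_gap=10, recompute_time=8))   # 8.0
# 2022 storm: an alert every 5 min, recompute in 8 min -> lag grows linearly.
print(solver_lag(100, alert_gap=5, recompute_time=8))    # 305.0
```

Below the saturation point the lag stays constant at one recompute time; past it, the lag grows with every new alert, matching the description of a system that could never catch up.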
Meltdown
With the actual positions of so many crew members and airplanes out of sync with the data entered up to that point, the system lost situational awareness. The airline had no idea who was where, or what plane was where. With data updates lagging far behind real-time events, the crew scheduling software was operating on each crew's assumed location instead of their actual location.
Much of the bad data was generated by itself, causing a nasty feedback loop. For instance, it might put someone on a flight, then cancel that flight, yet, since flight scheduling was not integrated with crew scheduling, still think the crew would arrive at the destination city to pick up another flight.
The only way for the schedulers to figure it out was for the crews themselves to call dispatch and report their locations for manual updating in the system. But it was a major weather event, and the airline was cancelling flights for lack of crews faster than the humans could tell it where crews were actually available. Phone lines jammed up, with pilots and flight attendants trying to get assignments kept on hold for hours, while planes sat idle for lack of a crew and the airline scrambled just to figure out where its crew members were located.
As bad data led SkySolver to progressively publish unimplementable flight assignments out of sync with the ground reality of plane and crew availability, it ceased to be a trustworthy source for decision makers. Eventually the scheduling application and other APIs and services went offline due to server resource over-utilization.
Human schedulers started to comb through records by hand, a burden for which they were understaffed, ill-equipped and unrehearsed, in the hope that once they knew where everyone was, they could assign crews manually.
Southwest employees worked heroically to keep things moving despite an outdated system. “They would make great progress, and then some other disruption would happen, and it would unravel their work,” COO Watterson said. “So, we spent multiple days where we kind of got close to finishing the problem, and then it had to be reset.” Cancellations snowballed.
Before a flight can leave the gate, it must be dispatched by a company dispatcher. This person puts the crew names on the dispatch paperwork to make it legal. They get the names from the scheduling department that monitors legality issues such as rest and duty day. The dispatchers were not getting names from scheduling, so they didn't know whether the crews present were legally eligible and therefore could not send the dispatch paperwork. There were multiple times when a full crew was on the airplane, ready to take passengers and waiting for dispatch paperwork that never came.
One pilot wrote online, “What should have been one minor inconvenient day of travel turned into a nightmare. The airline (lost) track of all its crews. ALL of us. We were there. With our customers. At the jet. Ready to go. But there was no way to assign us. To confirm us. To release us to fly the flight.”
A catastrophic cascade
The key to recovery is typically to keep operating; after day three or day four, one is usually in pretty decent shape. That philosophy did not work this time, though. Several times, the airline’s leaders believed they’d gotten control of the problem, only to encounter some new roadblock that required them to cancel more flights, undoing the carefully set crew plans.
Scheduling pilots and flight attendants for their next flight assignments became chaotic due to incorrect data on crew availability. Many waited for hours for instructions before hitting regulatory limits on how many hours they could work without rest. This caused more flights to be cancelled as they didn’t have a full complement of rested crew eligible to fly. Those cancelled flights meant even more crews were out of position for their next flight.
Operations at several key airports were in danger of becoming gridlocked due to arriving flights competing for space with those held up at gates waiting for crews. To avoid those snarls, Southwest cancelled more flights, starting the process all over again. Inability to catch up with the rapid pace of changes thus snowballed into a complete meltdown.
The fallout of this cascading chain of failures was a destabilized network of fragmented connectivity, with hundreds of planes and hundreds of crews idled, unable to be matched to each other for flight assignments, while customers continued to book tickets for upcoming flights, maintaining ongoing pressure on the system.
“Winter storms and staff shortages were only the tipping point that sent Southwest Airlines IT infrastructure over the edge, leaving thousands still stranded across the US”, chief operating officer Andrew Watterson explained. “In effect the winter storm that flowed across much of the US triggered a cascade event which the company's IT infrastructure was ill-equipped to manage.”
System restart
The final hope of all trouble-shooters before giving up is to hit the reset button. As a reader commented online, “The not so pretty, but simple answer: call everyone back to home base. (Then) get everyone where they need to be (+ bags) and provide a fair refund for the inconvenience.”
Eventually Southwest executives realized SkySolver’s incremental fix algorithm might not catch up any time soon and that they needed a full reboot. This would allow pilots, flight attendants and planes to get into position, and let the system’s data updates re-sync with ground reality.
To achieve this, the airline cancelled about two-thirds of its planned flights for multiple days, and locked up seat inventory on its website so customers couldn’t buy tickets for a flight that might ultimately be cancelled.
By reducing the company’s flights to one-third, Southwest effectively tripled the resources-to-flights ratio, a level more than ample to handle that amount of activity and with enough slack to enable scheduling their way out of the crisis and into a stable operating state.
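The arithmetic behind that reset can be made explicit. The fleet and crew counts below are hypothetical round numbers chosen only to show the ratio effect:

```python
# Illustrative arithmetic (hypothetical round numbers): cancelling two-thirds
# of flights while keeping the same crews and aircraft triples the resources
# available per remaining flight, creating slack for repositioning.
crews, flights = 900, 900            # one crew per scheduled flight: no slack
ratio_before = crews / flights
flights_reduced = flights / 3        # fly only one-third of the schedule
ratio_after = crews / flights_reduced
print(ratio_before, ratio_after)     # 1.0 3.0
```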
"After days of trying to operate as much of our full schedule across the busy holiday weekend, we reached a decision point to significantly reduce our flying to catch up," CEO Bob Jordan said. "Clearly, we need to double down on our already existing plans to upgrade systems for these extreme circumstances so that we never again face what's happening right now."
As the COO of another airline said, “Trying to get Humpty Dumpty back together again is not that easy.”
Southwest is known for its exceptional customer service; it should survive and bounce back. Before this happened, Southwest was ranked #1 in customer satisfaction (in basic economy; it doesn't have business class). A frequent-flier program member said that except during this crisis the carrier is typically very proactive, with alerts for even a five-minute delay. CEO Bob Jordan, a long-time company executive who has been in the current job for less than a year, publicly apologized.
Ryan Green, Southwest’s chief commercial officer, said in an interview the airline is taking steps such as covering customers’ reasonable travel costs, including hotels, rental cars and tickets on other airlines, and will be communicating the process for customers to have expenses reimbursed. He also said customers whose flights are being cancelled as the airline recovers are entitled to refunds if they opt not to travel.
Part 2: Underlying issues
Legacy systems and technical debt
It was not that the technology failed its spec per se, though that was the immediate excuse. The technology did what it was designed to do; it just hadn't been updated in 30 years. At the time of development, the volume of data that required processing was significantly smaller than it is today. Software that worked for the data volume and environment for which it was written began to experience systemic failures as those parameters changed. It was only a matter of time before something triggered a cascading series of failures that spiraled out of control, overwhelming an antiquated business process that relied on manual human input into IT processes. They should have seen it coming.
Many carriers still rely on solutions largely built on legacy mainframe computers by vendors or homegrown by the airlines themselves. Airlines have expanded fast, and the scale and the growth of business has got ahead of their technology. Tech and operating procedures from the '90s, when companies were half their current size, still prevail.
A lot of the problems stem from technical debt. Before it grew from a small player to a national and then international airline, Southwest didn’t need the same kinds of commercial platforms that rivals used, and developed many of its own systems instead. Now business needs have changed.
Specifically, SkySolver, an off-the-shelf piece of software that Southwest has customized and updated, was nearing the end of its life, the airline said. The SkySolver scheduling system was not up to the task of dealing with storms, and deficiencies in scheduling processes needed to be addressed.
Unfortunately, it is difficult to measure the return on investment of upgrading the backend technology infrastructure needed to operate efficiently and consistently. Too many companies see it as a sunk cost rather than as an investment. Why spend on systems when we are making money? Technology is like plumbing: it's never worth investing in until it explodes. No one ever thinks of the risks associated with excessive cost control. There are no accolades or great press when things run normally and so upgrading supply chain processes and philosophy is always next year’s problem…until it becomes today’s crisis.
Trying to show stakeholders the intrinsic value of spending millions of dollars on new technology to prevent future risk is hard. Most executives are older and lack the technical know-how to lead businesses critically dependent on technology. They see technology as a complex change and more time-consuming than it needs to be. So, they wait until something happens to cost them billions or even their business.
Mid-level leaders who are not sufficiently technical fail to see that the wheels are about to fall off the bus. Unfamiliar with the architecture and software, yet charged with assessing technical issues, they end up mis-reporting to the C-suite on the potential customer impact. Ignoring the reliability and performance of systems critical to the business ultimately leads to disaster.
Of course, risks are going to present themselves; that's out of our control. Having inadequate supply chains and systems makes the risks of outlier events worse, because we aren't able to be agile in these situations. Consider how the Covid pandemic exposed the weaknesses of supply chains. The impact of meteorological events is no different.
As business volume ramped back up after Covid, the lack of attention to the technology was waiting to rear its ugly head. The system was intended to provide situational awareness and enable a real-time response, but SWA found out the hard way that failing to keep its software modernized and well maintained is as negligent as failing to do the same for its aircraft fleet.
Day-to-day problem areas that kept arising while coming back up to full operations post-pandemic were the initial warnings of an oncoming crisis, a complete operational failure that was by then inevitable. All the links in the chain had been established except the final one: the winter storm.
On November 14th (2022) Capt. Casey Murray, the President of the SWAPA pilots’ union said, "I fear we are one thunderstorm, one ATC event, one router brownout away from a complete meltdown. Whether that's Thanksgiving, or Christmas, or the New Year, that's the precarious situation we are in."
By the time someone acknowledged the elephant in the room, it was too late. The automation had not been developed with the goal of running a sophisticated operation of the current size and complexity. The house of cards came tumbling down as a routine winter storm broke the antiquated systems, with managers scrambling to manage 20,000 frontline employees by phone calls.
The airline was undone by problems of their own making, running razor thin margins and leaving customers to end up paying the price. De-prioritizing & under-investing in modernizing outdated backend technology for decades finally led to an inability to recover from disruption in extreme circumstances that occur 1% of the time and pushed them over the cliff.
As readers commented online, “This is what technical debt looks like. MBA kids, start studying here.” “This will become a business school case study on why it’s important to never have tech that is behind or just enough.”
UX versus the backend stack
Southwest IT investments focused primarily on tech for customers, tech for cost cutting & utilization and regulator-facing applications, like self-service capabilities, maintenance and record-keeping. A primary goal of systems modernization remains to “take time out of the turn”, quickening passenger movement by speeding up the boarding process, making more gate-to-gate flight time available for scheduling during a flying day.
Much of the recent IT investment has focused on user experience improvement and automatic data collection to help improve on-ground coordination, like a $500 million reservation system for consumers, hand-held tablets for ramp workers & maintenance to communicate important ground operations data like luggage and cargo weight electronically and flier loyalty programs.
An upgrade or replacement of core systems like SkySolver would bring the same focus to the backend stack, which past executive leadership may have taken for granted, possibly prioritizing UX exclusively and opting to expand and grow without the technology infrastructure needed to handle scale.
Most often, the work involved in upgrading the backend stack is the kind that does not deliver immediate, obvious value, but is necessary for long-term sustainability. The value of the investment becomes evident only over a time period that’s far longer than the fiscal periods on which most company managers focus.
As a reader commented online, “Southwest's stock price plunged 11% this week, costing shareholders $2.2 billion. The next time a company says they can't afford to spend money to upgrade their supply chain technology, it's worth asking the question -- can they afford not to?”
Fragility of route optimization
Over time, a realization has emerged that the setup of Southwest’s flight routes, differing from that of other airlines, also had a role to play in the crisis. The routing approach left Southwest building an incredibly delicate house of cards that could quickly tumble when the company encountered a problem.
Dr. Edward Rothberg, chief scientist of Gurobi Optimization LLC, a startup that develops mathematical optimization software used by carriers including Air France-KLM, said Southwest’s hopscotched “point-to-point” model—rather than the hub-and-spoke model—greatly increases the difficulty of the problem, requiring more computational power than its current systems are likely able to handle.
Unlike some rivals that concentrate on flying around central hubs, Southwest planes generally hopscotch from one city to another, with its fleet crisscrossing the country each day. Southwest does operate local mini-hubs (Las Vegas, Phoenix, Dallas, Houston, Denver, Kansas City, Chicago, Baltimore, Atlanta, etc.) that are connected to each other as a point-to-point network, but it has no major central hub for the entire country.
Hub-and-spoke is fine on busy routes and the best way to run a feeder into long haul. For a budget airline with no long-haul links, point-to-point has the advantage in normal conditions, as it follows demand (provided one can fill flights).
It enables passengers to travel directly between smaller markets, whereas United, American and Delta typically fly from smaller markets to hubs, forcing passengers travelling between small cities to change planes. It also lets Southwest maximize the use of its planes and crews, increasing the utilization and efficiency of each plane.
It allows SWA to operate a fleet of narrow-body planes of a single size, which works well when flying into smaller airports in mid-size markets. In fact, other airlines have to some extent been moving back to the direct-flight system over time.
Airlines based on the traditional hub-and-spoke system can handle nasty weather better than point-to-point carriers like Southwest because more crews are based near the hubs. That model has the operational advantage of quickly flying crews and planes out of the hub to where they’re needed.
In contrast, point-to-point is more susceptible to resource disruption. The daisy chain structure makes its network more vulnerable. Southwest’s point-to-point model involves planes flying consecutive routes and picking up crews at those locations. They just kind of build on from city to city to city.
During large weather systems and snow conditions, overnighting crews can get stuck in the wrong places all over the country, without airplanes arriving for them to fly out when they are rested and back on duty.
Because Southwest flights hopscotch city to city, cancelling flights into one city could impact the entire network. When there are cancellations in one area, the disruption ripples across the country, leaving crews trapped everywhere, as they are not in the right positions relative to the planes. Once that happens, it’s very difficult to get the operations flowing smoothly again.
It becomes hard to contain problems in one region and isolate them from the rest of the country and it becomes difficult to catch back up when things start to go wrong, causing disruptions to ripple. Chain reactions propagate back and forth through the network for days, costing millions, and the airline struggles to rebound after upsets.
Doing point-to-point, instead of hub-and-spoke, left planes and especially crews stranded away from each other far more than happened with other airlines. Southwest ended up with planes in some locations that were ready to take off and crews in other locations that were available to fly, rendering the schedulers unable to match them with each other.
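A minimal sketch (with hypothetical cities and counts) makes the failure mode concrete: a plane and a crew can only be paired when they are in the same city, so a disrupted network can hold plenty of both yet produce zero flyable pairings.

```python
# Toy model: planes and rested crews pair up only when co-located.
# City codes and counts below are hypothetical, for illustration only.

def flyable_pairs(planes_by_city, crews_by_city):
    """Count the plane-crew pairings possible, city by city."""
    pairs = 0
    for city, planes in planes_by_city.items():
        crews = crews_by_city.get(city, 0)
        pairs += min(planes, crews)
    return pairs

# Normal day: resources are co-located.
normal = flyable_pairs({"DAL": 3, "MDW": 2}, {"DAL": 3, "MDW": 2})

# After a cascade: identical totals, wrong places.
disrupted = flyable_pairs({"DAL": 3, "MDW": 2}, {"DEN": 3, "BWI": 2})

print(normal, disrupted)  # 5 flyable pairings vs 0
```

Note that the totals never change: five planes and five crews exist in both scenarios, yet the disrupted placement yields nothing the scheduler can dispatch.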
“Clearly the weather was the tip of the iceberg; however, the company's unusually complex flight coordination model (is) at the root of the problem”, commented one reader online. A scheduling system that normally works pretty well ran into a perfect storm.
It is the fragility of the network that is the problem. It is when something happens and the outdated programs start trying to work out solutions that the failures begin. The meltdown is evidence of a broken system. It is not until SkySolver’s scheduling roulette begins that the network spirals out of control. The house of cards falls, and it falls with severity.
Ecosystem anti-patterns
Operator incentives misaligned with consumer protection strategic objectives?
The phenomenon of incentives misaligned with consumer protection needs can be seen even with some urban online cab aggregator services that fail when needed most, either simply by being unavailable, or by charging extortionate demand-based prices for limited seats.
Enforceable consumer protection accountabilities in the form of punitive damages are a market-based solution to encourage airlines to improve reliability. Organizations respond to incentives, and we need to create the right ones.
Which matters more – Minimizing flight delays or Minimizing flight cancellations?
One way to handle failures is proven but expensive: holding crews and aircraft in reserve to recover from irregular operations. Qantas successfully does this. When a Qantas A380 unexpectedly landed in Azerbaijan with a problem that couldn’t be promptly repaired over there, Qantas sent a rescue flight. Because Qantas plans ahead for emergencies (and they absorb the expense of doing so), they were effectively able to recover their operation.
With Southwest, operating lean with minimal slack had driven systemic risks and instabilities to unprecedented levels. Why do airlines plan routes like a very aggressive “just-in-time” supply chain, with little slack or contingency planning, knowing full well that if any flight on any leg has a problem, the follow-on impact would leave no option but to simply dump the problem on the customers? It seems crazy, right? Well, it’s a question of incentives.
In Oct. 2021, just over a year earlier, Southwest had incurred a cost of ~$75 million from cancelling more than 2,000 flights over a four-day period. That financial loss, relative to ~$24 billion in annual revenue, was apparently not incentive enough to put in place the network routing improvements and resource buffers that could have prevented the Dec. 2022 disaster, in which more than 15,000 flights were cancelled in total. Smarter ways to enforce airline customer service plans are needed that do not just put consumers in a different kind of mess.
Just blame it on the weather
Airlines, after all, are legally covered by their Contract of Carriage and by US Department of Transportation rules, through which the federal government has given airlines a liability shield: they bear only bounded responsibility when a failure can be attributed to weather. If a failure is due to technology, the airline is more liable to pay for things like rental cars and hotels. With weather, it's a different story, since it's out of their control.
In the absence of heavy fines for cancellation, the heavy fines for long delays of planes filled with passengers sitting for hours on the tarmac, together with limited liability for weather-related cancellations, simply incentivize airlines to cancel such at-risk flights altogether.
It makes business sense to over-book flights, avoid warning customers early enough that their flights are likely to be cancelled, and cancel the flight at the last possible minute, blaming the weather to escape fines and deny FAA-mandated compensation. No big deal; rope the customer in on the off chance their flight does go out, and leave them hanging to make alternative accommodations if it does not.
Greatest good of the greatest number or Leaving no one behind?
All too often, the doctrine of the greatest good of the greatest number goes hand in hand with a notion of “acceptable” levels of collateral damage, abandoning and sacrificing those whom triage deems lost causes. Such outdated business-as-usual approaches put average rates of progress ahead of the worst-off, threatening to leave them irrevocably behind. Random short-term shocks derail the progress trajectory beyond prospect of easy recovery for a few, and, once left behind, they tend to remain locked out of any possibility of re-joining the flow.
In contrast, leaving no one behind connotes prioritizing the progress of the most marginalized first. It means endeavouring to fast-track and support first those furthest behind, or at most risk of being left behind, because of their exposure and vulnerability to shocks on fragile systems. It implies moving beyond solely improving the average towards also closing the downside gap from the median to the worst extreme. The furthest behind need to be benefited to a greater degree and at a faster pace to ensure that those who have been left behind can catch up to those who have experienced greater progress.
The market and the regulatory environment tend to normalize an “acceptable” rate of failure to meet the quality-of-service bar, without a sliding scale of punitive compensation based on the extent of failure when the bar is not met. This incentivizes utility operators to optimize for at-quality performance on the large majority of transactions, and to abandon the remaining small minority of transactions altogether once it is clear that they are not going to meet the quality bar.
Acts of God
It is cheaper to pay off the rare customer eligible for damages, with even that risk typically packaged and outsourced to insurance companies, than to commit to service level agreements with guaranteed worst-case quality of service and heavy penalties. In any case, “Acts of God” are always available as the no-liability clause in any transaction. As per the Force Majeure (Acts of God) clause, the airline cannot be held responsible for weather-related issues or airport closures by the FAA. Cancellations due to weather are not subject to refund. So, the more extreme the adverse impact attributable to Acts of God, the safer the utility operator is from liability risk.
TINA factor
As per free market dynamics thinking, if consumers get mad enough, they will stop using Southwest. Consumers will now be aware of the possibilities of such a disaster and will either be willing to gamble on them or switch. According to online comments, “Passengers have said I'll never fly this airline again. Well great. That's how we roll in America. My guess is after sampling other low-cost airlines these folks will be back to free checked bags and open seating by Easter.” “Airlines get away with this because consumers get amnesia when they see a low fare.” “Thousands of gift certificates will be given out by Southwest to 'make it right'. All will be forgotten within a year.” “Would you rather pay a bit more or roll the dice on getting stranded with no hotel, no options, and no idea where your luggage is? Some customers made a voluntary choice to roll the dice and lost.” Hence, the argument goes, no legislative response from Congress is required.
Of course, what this fails to take into account is that the proportion of consumers who suffer severely enough to leave permanently in spite of having little choice, is so small that their attrition and the nuisance value of the noise they make does not impact the business materially, especially if the utility operator’s PR narrative pitches it as an “Act of God”.
As readers commented online, “Yes, they will suffer from this by losing customers. But, given airline consolidations, people will have no choice but to take these flights from Southwest.” “Consumers can’t discipline companies when they don’t have other options.” “Most consumers can’t afford to discipline companies…they’ve paid their money, took their chances, and are getting screwed. The companies will be fine, and the consumers are paying the freight.” “(Which) consumers (will) punish Southwest by going to Delta to pay $4,300 for a one-way ticket with a 12-hour layover?” “Letting companies serving as a low-cost alternative go bankrupt won’t increase competition”, a “there-is-no-alternative” (TINA) factor to keep them safe.
Consumer protection
Under a 1958 law, passenger airlines are exempt from Federal Trade Commission (FTC) oversight and most state investigations of consumer complaints. Federal officials are limited in what they can do beyond a harshly worded social media callout urging an airline to honor its customer service commitments. Those commitments are not required for an airline to keep its license, but were established in a 2011 law as a sort of travelers' bill of rights that airlines are supposed to enforce on themselves.
In the wake of the mess, Transportation Secretary Pete Buttigieg and federal lawmakers have stepped up calls for more stringent consumer protection measures. "We're going to expect them to go beyond the letter of the law in terms of how they treat passengers, making sure they pay for things like hotels, ground travel expenses, meals and of course, refunds," Buttigieg said. President Joe Biden urged consumers to check if they’re eligible for compensation as cascading airline delays have disrupted holiday travel across the country. “Our Administration is working to ensure airlines are held accountable,” Biden tweeted.
While this is in response to the widespread customer pain, it also points to a deeper root cause of systemic nature that underlies the episode. There is an asymmetric balance of power and consumer protection regulation is needed to address it. Some industries play critical roles in the economy and the consequences of industry failure can be catastrophic. These industries, like airlines, have a greater responsibility to the public than just providing in-flight pretzels.
Market competition cannot be the only thing constraining bad behavior. It is unfair for an airline to maximize profit by stranding consumers in the middle of nowhere indefinitely on Christmas Eve, as long as it provides a refund. The ground rules of any capitalist system require that profit maximization exist within a framework of minimal fairness assured to all participants. There is a point beyond which the human cost of failure becomes intolerable. Regulators need to set minimum standards and penalties for the very real impacts airlines have on people’s lives when people are treated like parcels.
What is the right metric?
Delta’s automated rebooking system called Viper reduces the average wait time rather than the worst wait time. With Viper, the average delay during irregular operations like bad weather has ranged between two hours and nine hours, compared to 15 hours in 2012, the airline says. Whether the worst-case wait time improved to the same extent is not known.
United’s new rebooking system reduces the number of passengers that get told they’ll have to wait three or four days for a flight. Southwest’s Baker flights cancellation system reduces the number of people that cannot be re-accommodated. Whether these few unfortunates get adequately compensated for their pain is not known.
In all these cases, the metric being optimized isolates pain to the fewest passengers, but does not limit the extent of that pain. Pivoting to a metric that optimizes the worst pain of any consumer, say, “net delay in end-to-end trip arrival at the final trip destination” (treating cancellation without re-accommodation as equivalent to, say, a week’s delay) would promote consumer protection by incentivizing airlines to adopt different optimization algorithms.
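The difference between the metrics can be shown with a toy comparison (delay numbers are invented for illustration; following the text's suggestion, cancellation without re-accommodation is counted as a week, 168 hours): two rebooking plans can rank in opposite order depending on whether the optimizer minimizes the mean delay or the worst delay.

```python
# Illustrative sketch: mean-delay vs worst-delay objectives rank the
# same two plans differently. All numbers are hypothetical.

CANCELLED = 168  # hours: proxy pain for "cancelled, never re-accommodated"

def mean_delay(delays):
    return sum(delays) / len(delays)

def worst_delay(delays):
    return max(delays)

plan_a = [1, 1, 1, 1, CANCELLED]   # most fliers fine, one abandoned
plan_b = [40, 40, 40, 40, 40]      # pain spread evenly across everyone

print(mean_delay(plan_a), mean_delay(plan_b))    # 34.4 vs 40.0
print(worst_delay(plan_a), worst_delay(plan_b))  # 168 vs 40
```

A mean-minimizing scheduler prefers plan A and abandons one passenger; a worst-case-minimizing scheduler prefers plan B, which leaves no one behind at the cost of a higher average.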
Liability shapes incentives
What’s the fix? Liability. Money is the biggest driver of everything. If the burden were put on the airline to insure against weather risks, denying force majeure protection, if aggregate delays from all causes in end-to-end trip arrival at the final destination triggered punitive damages in addition to compensation, if such damages weren’t excluded from class action liability, the incentive would shift towards minimizing the extreme adverse impact for any customer, rather than minimizing the percentage of customers affected by extreme adverse impact, and airlines would suddenly become extremely interested in leaving no one behind.
Compensation could be made proportionate to the aggregate duration of delay from all causes, including due to over-booking, extra hops and re-booking, with the airline paying for hotel rooms, meals and ground transportation. In case of delay more than three hours or flight cancellation, the compensation could be multiplied punitively to (say) three times the full cost including any extra purchases such as bag fees or a seat assignment. Time limits could be imposed to ensure promptness in making the compensation pay-outs, putting such creditors first in line ahead of all other creditors. There are many options available to regulators to protect consumers from extremes.
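The scheme above can be sketched as a small payout function. The hourly rate, 3-hour threshold and 3x multiplier are the text's illustrative numbers, not any actual regulation:

```python
# Hedged sketch of the proposed compensation rule: payout proportional
# to aggregate delay, plus a punitive multiple of the full ticket cost
# (including extras such as bag fees) beyond 3 hours or on cancellation.
# All parameter values are illustrative assumptions.

def compensation(ticket_cost, extras, delay_hours, cancelled=False,
                 hourly_rate=25.0, punitive_multiplier=3):
    payout = delay_hours * hourly_rate          # proportional component
    if cancelled or delay_hours > 3:            # punitive component
        payout += punitive_multiplier * (ticket_cost + extras)
    return payout

# 2-hour delay: proportional compensation only.
print(compensation(ticket_cost=200, extras=40, delay_hours=2))   # 50.0
# 6-hour delay: proportional plus 3x (ticket + bag fee + seat fee).
print(compensation(ticket_cost=200, extras=40, delay_hours=6))   # 870.0
```

The design point is that the punitive term dominates once the threshold is crossed, which is what shifts the airline's incentive from minimizing the count of affected passengers to minimizing the extremity of any passenger's impact.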
Consumer protection across the globe
Apparently, Ryanair and Wizzair have similar point-to-point networks at a large scale, but haven't had the same kind of meltdown as Southwest in the US. What they DO have is European Union passenger compensation rules that would bankrupt them if the same fiasco happened to them. EU261 makes a huge difference in how often flights are delayed or cancelled last minute.
As online readers commented, “I wonder why the airline industry gets a free pass on failing to provide the service people pay for with little or no penalty. In Europe, if you are delayed by a few hours you are entitled to get 600 euros from the airline for the screw up. I'd like to see that in place here.” “In Europe, the airlines are mostly regulated, are they doing worse than our non-regulated airline industry?” “If the government didn't intervene the company simply could throw their hands in the air and say "oh well" and everyone would be left stranded with no recourse while the airline continues to stockpile cash knowing the next suckers just bought tickets.” “What we need is a #PassengersBillOfRights.”
Part 3: Takeaways
Comprehensive system failure doesn't materialize overnight. Most aviation accidents are the culmination of a long cascade of small errors. Introspection and innovation are needed at all levels to make sure this never happens again.
Addressing Cascading Failures
Catastrophic cascades are one category of disaster, wherein an initial cause becomes the trigger for a delicately balanced house of cards to come tumbling down. For example, a major earthquake might cause powering down of entire facilities and disconnecting of all networking & communications links in a local area, and hence the loss of all infrastructure assets there, including serving zones, op centers as well as offices.
This then might become the triggering cause of a catastrophic cascade. The resilient self-management response of (say) compartmentalization and redundant failover may choke due to failure of other infrastructure assets being overloaded by temporary load spikes or having hitherto-unrecognized dependencies on the already-lost assets. Or it may fail simply because critical emergency response leaders get locked out of their workstations.
The cascading failure grows over time as a result of positive feedback. When a portion of an overall system fails, it increases the probability that other portions of the system will fail. Reduction in the rate of useful work being done by one component overloads other components and triggers failure in them, causing the problem to snowball and potentially spread globally.
A well-planned system incorporates resilience slack to limit this risk, but slow increases in usage over the years crossing above a threshold can invisibly eat up the resilience slack until the system reaches the brink, unnoticed by business leaders.
Most cascading failures are initiated by variations on overload scenarios that waste work capacity in some critical component. Once initiated, such resource-exhaustion scenarios often feed on one another in complex causal chains. For example, “crash looping” occurs when, the moment a crashed system tries to recover, it gets bombarded with an overload from the pending queue and fails again almost immediately.
Another example is when requests of a particular type or category that are more prone to failure start accumulating in the pending queues and hog more and more of the re-booking slots available, resulting in denial of service to requests of other types or categories. A third example is when re-booked requests fail over and again and pile up over time; accumulation of peak load in cyclic patterns of oscillation at the system's resonance frequency can lead to destructive breakdown.
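The crash-looping pattern described above can be reproduced in a toy discrete-time simulation (all numbers invented): a server whose capacity comfortably exceeds steady-state arrivals still crash-loops forever once a single spike pushes its backlog past the crash threshold, because each restart lands straight back in the bombardment from the pending queue.

```python
# Toy simulation of crash looping. Parameters are illustrative.

def simulate(ticks, arrivals, capacity, crash_threshold, initial_backlog=0):
    queue, crashes, served = initial_backlog, 0, 0
    crashed = False
    for _ in range(ticks):
        queue += arrivals               # demand keeps arriving regardless
        if crashed:
            crashed = False             # one tick of downtime, then restart
            continue
        if queue > crash_threshold:     # restart straight into the backlog
            crashes += 1
            crashed = True
            continue
        done = min(queue, capacity)
        queue -= done
        served += done
    return crashes, queue, served

# Same steady load in both runs; only the initial spike differs.
calm  = simulate(50, arrivals=8, capacity=10, crash_threshold=30)
storm = simulate(50, arrivals=8, capacity=10, crash_threshold=30,
                 initial_backlog=40)
print(calm)    # (0, 0, 400): no spike, every request served
print(storm)   # perpetual crash loop: backlog diverges, nothing served
```

Note that capacity (10/tick) exceeds arrivals (8/tick) in both runs; the spike alone flips the system from fully healthy to permanently dead, which is the signature of positive feedback.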
Stress testing
The unexpected happens. Things will go wrong. We cannot do much to prevent it, but there is a lot we can do to be prepared for it so business operations continue to run following a disaster. Failures are a means of learning.
It is hard to predict exactly which resource will be exhausted and how that resource exhaustion will manifest. System validation is best done by intentional stress-testing; exposing the system to various combinations of gradual ramp-up and impulse-shock load-patterns under different failure-modes instigated deliberately and simultaneously in multiple components. By intentionally causing failures in critical systems and business processes, we can find and fix vulnerabilities before such failures happen in an uncontrolled manner.
During stress testing, single or multiple failures are triggered during a system maintenance downtime window in order to observe and study subsequent events possibly caused by the cascading propagation of failure. When such a maintenance window cannot be taken due to 24x7x365 safety-critical operation, stress testing has to be done on a simulation model.
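The intentional fault injection described above can be sketched as a tiny test harness. The component and mode names here are hypothetical; the point is the pattern of deliberately failing a dependency and asserting that the system degrades gracefully rather than cascading:

```python
# Minimal fault-injection sketch: break a dependency on purpose and
# verify the fallback path. All component names are hypothetical.

class CrewLookup:
    """Stand-in for a crew-scheduling backend with an injectable fault."""
    def __init__(self):
        self.failed = False
    def find(self, city):
        if self.failed:
            raise ConnectionError("crew service down")
        return f"crew-{city}"

def assign_crew(lookup, city):
    """Fall back to a manual dispatch queue when the lookup is down."""
    try:
        return lookup.find(city)
    except ConnectionError:
        return "MANUAL_DISPATCH"   # degraded mode, not a crash

lookup = CrewLookup()
assert assign_crew(lookup, "DEN") == "crew-DEN"       # normal operation

lookup.failed = True                                  # injected fault
assert assign_crew(lookup, "DEN") == "MANUAL_DISPATCH"
print("degraded gracefully under injected fault")
```

Production fault-injection frameworks do the same thing at scale against live traffic; the assertion is always the same shape: under this injected failure, the observable behavior must stay within the acceptable degraded envelope.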
Which what-if scenarios should be stress tested? When conditions are changing, when stakeholders and technologies have evolved, war gaming can help focus the investment.
War gaming simulations differ from traditional stochastic models in that participants roleplay networks of stakeholders that may include political, economic, judicial and regulatory agencies, market players, financial institutions, consumer groups, labour unions, etc. In addition to realistic nature models of inanimate forces like the environment that simulate earthquakes, cyclones, heatwaves, etc., war gaming has each role simulating the intentionality and irrationality of the stakeholder it represents, based on the various emotions, conflicts of interest and asymmetries of information and power represented.
Even a moderate level of uncertainty, with (say) two or three outcomes plausible along each of several dimensions and competitive dynamics due to conflicts of interest between stakeholders, can make the system complex enough to preclude traditional analysis, yet war gaming can shed valuable light on the range of possibilities that executives should be considering, providing strategic guidance on the industry’s direction, the most promising types of moves, the company’s strengths and weaknesses, and where to focus further analysis.
Stress testing and war gaming of technology systems and associated standard operating procedures helps in understanding the limits of both systems and people. Examining operational readiness for events such as the recent winter storm will yield savings down the road. An analogy from the military reminds us that the sweat we leave on the training field limits the blood left on the battlefield.
Apart from continuously testing routine failure conditions, interfaces and complex scenarios in staging environments, we can test both technical and operational resilience by breaking live systems while explicitly preventing critical experts and leaders from participating.
Like fire drills, periodic but unannounced stress testing during live operation mimics realistic crises, reveals where the breaking point is, measures the behavior of people, process and technology in emergency situations, allows estimation of the system’s reduced capacity during the crisis, checks the system’s self-managing ability to return automatically to normal without any intervention after the crisis ends, enables provision for worst-case thresholds, informs the trade-off of utilization versus safety margins and provides education to staff.
Such live stress testing of systems in prod is neither cheap nor devoid of risk; it entails a sizable engineering investment, considerable disruption to productivity, and a high risk of outages, user-facing issues and / or revenue loss. For an airline with planes in the air, safety considerations preclude adopting the full spectrum of live unannounced stress testing. Simulations are often the only (and an inadequate) way to validate end-to-end resilience against certain catastrophic cascades in highly safety-critical systems.
Reliability analysis by simulation of complex systems helps to determine the stability and resilience of different infrastructure options, analyze the effect of control measures on the adaptiveness to unpredicted changes, maintain the ability to provide required services and decide the safety margins needed to avoid cascades and avalanches.
Stochastic models and ‘‘what-if’’ scenarios are needed for the interdependency analysis, reliability assessment and behavior prediction in system-of-systems that include uncertainties taking into account the dynamic interaction of complex adaptive sub-systems with high degree of coupling.
Detecting impending collapse and taking proactive measures
Predicting an accurate sequence of events during the propagation of rare dependent failures in a cascade is difficult as there is practically an infinite number of possible operating contingencies and system changes which would have to be considered. Some collapses are inevitable and must be managed.
A service already experiencing cascading overload often has a host of secondary symptoms that can look like the root cause draining the capacity, making it difficult to unravel the dependencies and sequence the recovery actions. During early stages, when the system is on the brink, increasing queue length / backlog size, proportion of requests missing delay thresholds, decreasing capacity utilization and shrinking slack in turnaround time, especially all occurring together, are early indicators of a death spiral towards impending collapse.
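The early indicators named above can be combined into a single "brink" check over a sliding window of metric samples. The metric names and thresholds here are illustrative assumptions, not a production alerting rule:

```python
# Hedged sketch: flag an impending death spiral when backlog and
# deadline misses rise monotonically while utilization and slack fall,
# all together over a recent window. Field names are hypothetical.

def on_the_brink(samples, window=3):
    recent = samples[-window:]
    if len(recent) < window:
        return False
    rising  = lambda k: all(a[k] < b[k] for a, b in zip(recent, recent[1:]))
    falling = lambda k: all(a[k] > b[k] for a, b in zip(recent, recent[1:]))
    return (rising("backlog") and rising("missed_pct")
            and falling("utilization") and falling("slack_min"))

healthy = [dict(backlog=10, missed_pct=1, utilization=0.80, slack_min=45)] * 3
spiral = [
    dict(backlog=10, missed_pct=1,  utilization=0.80, slack_min=45),
    dict(backlog=40, missed_pct=5,  utilization=0.70, slack_min=20),
    dict(backlog=90, missed_pct=12, utilization=0.55, slack_min=5),
]
print(on_the_brink(healthy), on_the_brink(spiral))  # False True
```

Requiring all four trends simultaneously is the design choice: any one signal fluctuates in normal operation, but their joint monotonic movement is the early signature of the spiral described above.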
Southwest didn't have to go this way. There are ways to temporarily turn the point-to-point network into something else, something workable. Skip some cities, and focus on moving people around the weather. But you have to have some slack in your operations and you need timely information to adapt. Clearly Southwest lacked both.
For Alaska, which operates a hybrid network, taking out Seattle and Portland in the ice-and-snow storm forced a total shutdown of flights. Still, they rebooked guests and the cancellations didn’t snowball. They reconfigured the network on the fly to get more flights and guests moving, and as soon as the runways opened, they were digging out.
Airlines need to plan for a rainy day differently than they would for a sunny day, preparing for forecasted events ahead of time and pivoting to emergency operational processes sufficiently early. Knowing that there were upcoming meltdown risks, they should cancel more flights ahead of the storm and plan for compartmentalized operation until full-scale recovery is feasible.
Mitigating a collapse: Adaptive SOPs for emerging crisis scenarios
Once systems are overloaded, it is not possible to fully serve every request; something needs to give. But it is not desirable that in case of technical failures or unfavorable circumstances in the operational environment, the system immediately shuts down. After a service passes its breaking point, it is better to allow some failures in order to manage the situation.
The system must pick the most acceptable degraded operation mode as per a configured partial order, from the preferred mode through progressively less wanted modes down to the least desirable modes that are still safer than a shutdown in the context of the prevailing environmental conditions. The partial order essentially captures the relative importance of the combinations of safety hazards, user annoyances and cost over-runs that might arise from failures or unfavorable environmental conditions.
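One simple realization of such a configured ordering is a preference list of modes, each guarded by a feasibility predicate over current conditions; the first feasible mode wins, and shutdown is only the fallback when nothing else is safe. Mode names and conditions below are hypothetical:

```python
# Sketch of degraded-mode selection by configured preference order.
# Modes and condition flags are invented for illustration.

MODES = [
    ("full_schedule",        lambda c: not c["storm"] and c["crews_ok"]),
    ("reduced_frequency",    lambda c: c["crews_ok"]),
    ("skip_storm_cities",    lambda c: c["runways_open"]),
    ("ferry_and_reposition", lambda c: True),   # last resort before shutdown
]

def pick_mode(conditions):
    for name, feasible in MODES:
        if feasible(conditions):
            return name
    return "shutdown"

print(pick_mode({"storm": False, "crews_ok": True,  "runways_open": True}))
# -> full_schedule
print(pick_mode({"storm": True,  "crews_ok": False, "runways_open": True}))
# -> skip_storm_cities
print(pick_mode({"storm": True,  "crews_ok": False, "runways_open": False}))
# -> ferry_and_reposition
```

In a real system the list would be much longer and the predicates would encode the weighed trade-offs of safety hazard, annoyance and cost the text describes, but the control structure — walk the partial order, take the first feasible mode — stays the same.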
During the crisis, there may be only partial information due to sub-system disconnection, and mechanisms such as fault tolerance to data unavailability have to be implemented. Shared vulnerabilities need to be decoupled by introducing mechanisms such as buffering devices. Infrastructures are highly dynamic systems, and the capacity of an infrastructure to change in a timely manner is crucial for adaptation to failures.
Making the backend system self-managing prevents needless fires, and making the frontend system well-integrated and intelligently assistive saves the ops team from the need for heroic efforts to control the burn on the remaining fires that are inevitable.
The five basic ways for a deployed system to handle overload are spinning up more capacity, slowing down the service, operating with degraded quality, throttling the intake of fresh customer demand and abandoning the contractual commitments already made. Thus, an airline might wet lease planes or book seats on other airlines, delay flights, take off without some meal options, stop selling more tickets and / or refund passengers with valid paid-up tickets.
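The five levers can be sketched as an escalation ladder, applied cheapest first until demand fits within effective capacity. The multipliers are purely illustrative assumptions about how much each lever buys:

```python
# Sketch of the five overload levers as an escalation ladder.
# All multipliers are illustrative, not calibrated values.

def handle_overload(demand, capacity):
    actions = []
    if demand > capacity:
        capacity *= 1.2            # spin up capacity (e.g. wet-lease planes)
        actions.append("add_capacity")
    if demand > capacity:
        capacity *= 1.15           # slow down service (delay flights)
        actions.append("slow_down")
    if demand > capacity:
        capacity *= 1.1            # degrade quality (drop meal options)
        actions.append("degrade_quality")
    if demand > capacity:
        demand *= 0.8              # throttle intake (stop selling tickets)
        actions.append("throttle_intake")
    if demand > capacity:
        demand = capacity          # abandon commitments (refund overflow)
        actions.append("refund_overflow")
    return actions

print(handle_overload(100, 95))    # mild overload: first lever suffices
print(handle_overload(300, 95))    # severe overload: all five levers fire
```

The ordering encodes the trade-off in the text: refunding already-sold tickets is the last resort precisely because it abandons contractual commitments, while adding capacity and slowing down merely cost money.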
Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. Ad hoc incident management practices can cause an incident to spiral out of control. Gaming out a principled response to potential incidents in advance can make all the difference in real-life situations.
Recovering from collapse
Cascade vulnerability of an infrastructure system renders it susceptible to incapacitation when a risk materializes and reduces its capacity to resume new stable conditions. Recovering from a cascade takes longer than it would for restoring an isolated system.
In order to dig out of an outage, the load on the frontends must be dramatically reduced or eliminated until the backends stabilize. “Lame ducking” is a mitigation that drops the rated throughput for planning / scheduling purposes of each impacted component to a small fraction of the normal value along all performance dimensions and holds it there for an extended period. Together with load-shedding of the unmanageable bulge already choking the queues, this buys time for the system to stabilize and recover.
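A back-of-envelope sketch of lame ducking plus load shedding (the 25% fraction and other numbers are illustrative): plan each impacted component at a small fraction of its rated throughput for the hold period, and shed whatever queued bulge that reduced system cannot digest in that time.

```python
# Sketch of "lame ducking" + load shedding. Fractions are illustrative.

LAME_FRACTION = 0.25   # plan at 25% of rated throughput while recovering

def recovery_plan(rated_throughput, backlog, hold_ticks):
    """Return (planned throughput, load to shed now)."""
    planned = rated_throughput * LAME_FRACTION
    digestible = planned * hold_ticks          # what the lame-duck system
    shed = max(0, backlog - digestible)        # can absorb; drop the rest
    return planned, shed

planned, shed = recovery_plan(rated_throughput=100, backlog=600, hold_ticks=8)
print(planned, shed)   # plan at 25.0/tick; shed 400.0 of the 600 backlog
```

The counter-intuitive part is deliberately planning far below rated capacity: the headroom between planned and rated throughput is what absorbs retries and secondary failures, preventing the crash-loop pattern from restarting.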
Once a collapse has happened, instead of considering only complete perfect solutions, the scheduling algorithm needs the ability to get the passengers most distant from their trip destination progressively closer to where they want to go. Each successive flight leg should reduce, say, the estimated aggregate cost of hotel, food, rental car & gasoline for abandoned passengers to complete the remaining portion of the journey to their end-to-end trip destination by road.
This is a different optimization metric that follows a principle of prioritizing “pain minimization” for the extreme of the distribution, spreading the inevitable pain in tolerable amount among many to avoid unbearable suffering for some, and may help save the airline to an extent from punitive and reputational damages.
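The "pain minimization" principle above can be sketched as a greedy seat allocator: with limited seats per recovery leg, each seat goes to the stranded passenger currently farthest (by estimated remaining road cost) from their final destination, moving the worst-off progressively closer. Passengers, costs and the per-leg reduction are all hypothetical:

```python
# Hedged sketch: greedily assign scarce recovery seats to whoever is
# currently worst off. Data and cost figures are invented.

import heapq

def assign_seats(remaining_cost, seats, reduction_per_leg):
    """remaining_cost: passenger -> est. road cost to finish the trip.
    Each seat cuts one passenger's remaining cost by reduction_per_leg."""
    # heapq is a min-heap, so negate costs to pop the maximum first.
    heap = [(-cost, p) for p, cost in remaining_cost.items()]
    heapq.heapify(heap)
    for _ in range(seats):
        neg, p = heapq.heappop(heap)           # current worst-off passenger
        cost = max(0, -neg - reduction_per_leg)
        heapq.heappush(heap, (-cost, p))
    return {p: -neg for neg, p in heap}

stranded = {"A": 900, "B": 400, "C": 150}
after = assign_seats(stranded, seats=3, reduction_per_leg=300)
print(after)   # A gets the first two seats, B the third; C waits
```

With three seats, passenger A (worst off at 900) is served twice before B is served once, and C is not served at all: the maximum remaining cost falls from 900 to 300 while no one is pushed into the extreme tail, which is exactly the distribution-shaping the text argues for.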
Improving backend systems, resilience to risk and cybernetics
Paying off technical debt
This meltdown was a consequence of something technology people call ‘ignoring tech debt’. Backend technology typically gets neglected until it breaks. It unfortunately takes chaos to reveal the extent of the problem.
Airlines’ increasingly complicated networks and operational processes require a better technology foundation. Technology has a life and death cycle that many do not appreciate. Software is a very fast-evolving facet of business handling and infrastructure management, perhaps the fastest of all facets, and it needs semi-frequent updating and evolving in turn.
There are engineering teams within organizations that push to make changes that are ‘not sexy’, unlike features (like a shiny new app) that make the user go ‘Aah!’. Tech debt is a very serious issue, and needs really good technology program managers who understand the business implications of old code, as well as compliance and security concerns, when the operations scale up dramatically.
Managers who are not technical enough will always limit the spend on internal projects that don't clearly show how they affect revenue. Back-end features are never money-bringing but money-keeping ventures; the ROI on "reliable and robust" is that it reduces the frequency and severity of losses. Worse, technology upgrades have long time-to-payoff horizons that do not align with the short-term focus on quarterly results. Non-technical management doesn't understand this and keeps delaying it. With no idea of the severity of tech debt, technology upgrade proposals often get side-lined until, unfortunately, the antiquated systems, and the processes supporting those systems, break down and trigger a collapse.
Maintenance upgrades of outdated equipment, software and processes, including, say, internet-based communication as an alternative to telephones, should not be deferred. Apart from re-thinking design, implementation and maintenance, there is a need to re-think from scratch how to specify, monitor and evaluate systems and processes to build more resiliency into the solutions. Declining to invest in backend technology and relying on execution agility to compensate only makes things more costly in the long run.
The cost of ignoring risk
Bolstering efficiency at the price of losing resilience creates an Achilles’ heel. Southwest was at the sharp end of this spear; their operating model was not designed for resilience. With cost-cutting leading to unhedged operational risk, pushing systems & processes to the brink of meltdown, the focus on efficiency eventually led to the 2022 crisis. Efficiency with resilience is good; unhedged hyper-efficiency is not.
Cost focus heedless of risk is just plain bad business. Crises will keep happening until businesses stop trying to run lean at the expense of resilience. It may seem sub-optimal to have a systems landscape with a redundant and resilient structure (which creates extra work and may be less than real-time in all situations). But it would save the whole nation from grinding to a halt, with millions stranded, because of a glitch in a central computer.
This debacle should help the board and the finance function understand that you don't just look at how much something will drive revenue. You also look at what reduces risk. Otherwise, occasionally, the lack of technology investment will bite them, and they will lose a lot of money all at once. By the time all the damages are accounted for, it should be clear what risk a new system would have prevented for SWA.
Business leaders need to shift their perspective and start considering reliable and robust backend technology as part of the fleet, rather than as a cost centre to be minimized. Just as they invest in "reliable and robust" planes, they should invest in "reliable and robust" technology to keep those planes in the air, thinking about technology reliability the same way they think about airplane reliability.
Industrial hardening for resilience, anti-fragility & robustness
Much of the complexity behind airline-operations technology stems from the multimodal diversity of real-time data points and constraints a single system must take into account. The problem space is very complicated, with human, technological and meteorological network effects interacting with each other. Problems arise from applications developed in silos, and multi-dimensional data on rare events is too sparse to magically fix everything with AI.
While the operational technology shortcomings are front and center here (particularly the crew management system), the SWA network operations model, being a point-to-point multi-stop one (as opposed to the hub-and-spoke model), arguably is inherently more brittle under increasingly large near-continent-scale weather disruptions. During Winter Storm Elliot, when the point-to-point system came unglued and stopped spinning, everything just flew apart. It needed more people and planes than they had to spin it back up again.
Point-to-point operational models likely require more slack capacity in terms of spare planes, fuel reserves, backup on-ground facilities, crew staffing, etc. to be as resilient as hub-and-spoke models. This should be budgeted intentionally as an insurance premium for disaster recovery and business continuity and not considered to be low utilization.
As an alternative, the system should be capable of switching on the fly from point-to-point to hub-and-spoke until the crisis has passed. The scheduler should isolate storm-impacted cities to keep everyone else moving. Such storms are predicted at least a week ahead. The airline can adjust plans as soon as the forecast is available, pivoting temporarily to hub-and-spoke and reducing flights proactively to create temporary slack in the system.
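The isolation idea can be sketched in a few lines of Python (a hypothetical data model; a real scheduler tracks crews, aircraft and far more than origin-destination pairs): legs touching storm-impacted cities are grounded so unaffected legs keep flying.

```python
# Given a point-to-point schedule as (origin, destination) legs,
# isolate the storm-impacted cities to keep everyone else moving.

def isolate_impacted(legs, impacted):
    """Split the schedule into legs that keep flying and legs to ground."""
    kept = [(o, d) for (o, d) in legs if o not in impacted and d not in impacted]
    grounded = [(o, d) for (o, d) in legs if o in impacted or d in impacted]
    return kept, grounded

# Illustrative legs (airport codes chosen for the example):
legs = [("DEN", "MDW"), ("MDW", "BWI"), ("PHX", "LAS"), ("LAS", "OAK")]
kept, grounded = isolate_impacted(legs, impacted={"DEN", "MDW"})
assert kept == [("PHX", "LAS"), ("LAS", "OAK")]
assert grounded == [("DEN", "MDW"), ("MDW", "BWI")]
```

The point of the sketch is the shape of the decision, not its realism: the scheduler proactively carves the impacted sub-network out of the whole, rather than letting failures propagate through it.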
Cyber-physical systems with human-in-the-loop
Cyber-physical systems with human-in-the-loop handle complex tasks in unstructured environments by combining the cognitive skills of humans with autonomous systems behaviors to adapt at run time to new environmental conditions and unpredictable situations.
The diversity of sub-systems, expertise domains, operating environments, context situations, and socio-legal constraints requires human support to ensure complete and correct behavior in various situations. At the same time, human factors such as limited attention, stress and fatigue can put the overall system performance at risk, so control has to be shared between the human and the system.
Without burdening humans with unnecessary involvement while the system is operating autonomously, it must be possible to get the human’s attention quickly when participation is required. A scheduling system needs to be built with redundancy and a reset button. Scheduling managers need mechanisms to recognize emerging crises and gracefully degrading tools to keep the show going, just like air traffic controllers do when the radar goes down.
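A minimal sketch of such shared control (the threshold, names and confidence model are illustrative assumptions, not SWA's actual design): the system acts autonomously while its confidence is high and escalates for human attention otherwise.

```python
# Human-in-the-loop dispatch: autonomous when confident, escalate when not.

def dispatch(decision, confidence, threshold=0.8):
    """Return who handles this decision: the system or a human operator."""
    if confidence >= threshold:
        return ("auto", decision)       # system proceeds on its own
    return ("escalate", decision)       # flag quickly for human attention

# Routine re-assignment handled autonomously; ambiguous crisis escalated:
assert dispatch("reassign crew 101", 0.95) == ("auto", "reassign crew 101")
assert dispatch("cancel flight 202", 0.40) == ("escalate", "cancel flight 202")
```

The design choice worth noting is that escalation is built into the normal control path, so the handoff to humans is a routine, tested transition rather than an emergency improvisation.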
When an ops team must allocate a disproportionate amount of time to resolving failure cases at the cost of regular service operations, it is a warning sign of missed opportunities for software upgrades, automation improvements and/or process revamps. Eliminating scalability and reliability concerns requires shifting the focus from how to quickly address emergencies to how to reduce the number of emergencies.
Policy and strategy revamp
Technology’s place in the org chart
In today's world every company needs to be a technology company. Nowadays technology is so integral to a company that a company ignoring technology is basically ignoring itself. Technology is the life blood of all companies; it does not make sense that a company so heavily influenced by data would ignore its technology department and expect to be competitive. With production critically relying on software systems, technology is not a technology problem; it is a business problem. Technology becomes as strategic as Marketing or Finance.
Strangely, from a modern perspective, technology departments often report up through finance, which tends to see technology as an expense instead of as the vital revenue-generating tool it definitely is today. This old-school mentality that technology systems do not create business value and are just an expense to be cut only works so long, until shit blows up.
Technology departments need to roll up to the corporate strategy & planning function, not the finance function. There is a need to understand the evolving ecosystem in its entirety, to keep modernizing processes continuously and to make ongoing investments in updating the systems to handle the increased size, scope, complexity and risks as a table-stakes requirement.
The strategy is in the software and the software is deeply enmeshed with the system context. And Tech Strategy is different from IT. Most enterprises do not have a Tech Strategy org separate from an IT org. Staffed by technology program managers, the Tech Strategy org translates business strategy into technology strategy, specifies use cases formally, makes business logic choices and provides them for the IT org to build, operate and maintain.
Planning of routine lifecycle management upgrades driven by technology obsolescence and volume growth is done by the IT org. Updating the business logic or use cases based on ecosystem evolution and operational risk assessment, and providing them to IT as design change requests, is done by the Tech Strategy org.
Digital transformation
Transitioning from the as-is state to the aspire-to-be state typically needs a multi-year initiative with feasible yet ambitious annual targets. Technology program managers work as change agents to focus attention, build consensus for collaboration, chart a new path and act to overcome hurdles, confronting the gaps deemed unacceptable-to-persist and achievable-to-fix in the current context.
Analytical capacities are vital: to identify opportunities for maximum and progressive impact; to anticipate and analyze trade-offs given resource constraints, balancing improvement in the overall average with improvement in the worst-case extreme; and to assess the viability of investments, weighing financial and operational strategy options.
Proactive outreach and responsive “one-stop” customer service, integrated in well-designed, unified systems and embedded across functional strategies that break silos, are effective at preventing and recovering from disasters: adapting to unexpected change, insulating and re-booting operational systems after breakdown, preserving the continuity of business-critical services and enabling left-behind consumers to catch up.
Capacity support is necessary to enable staff to respond effectively to the needs of left-behind consumers, by anticipating risks, proactively planning backup logistics and delivering in real time on the strategy to improve their situation. The system must allow for, and adjust to, targeted interventions by operations managers without melting down.
Increasingly robotic and technologically-driven products and services put a premium on skilled, informed and digitally-connected workers and functional systems. Opportunity maximization depends on the ability to learn, innovate, iterate and leapfrog into new value chains, leveraging fail-safe access to resilient online services via robust last-mile communications infrastructure.
Adapting to evolving market imperatives
The business model and technology strategy that worked well in the environment of the past may not be optimal for the future given emerging political, economic, societal and regulatory trends. As regulators progressively tighten consumer protection measures, it shifts the optimum of the space of possible business models by changing the risks, hence the liabilities and hence the incentives. Fiduciary responsibility towards shareholders ideally should drive business leaders to adapt their strategy.
But business leaders often are not technical enough to realize the risk that is growing in their blind spot, namely the technology strategy. On the other side, traditional IT leaders frequently lack situational awareness of non-technical context, essentially being engineers focused narrowly on DevOps tasks and counting upon “Business” to give them “Requirements”.
The burden therefore falls upon technology program managers staffing the Tech Strategy function to be all-rounder generalists rather than functional specialists. They need to keep abreast of such developments and proactively advocate to non-technical business leaders for adapting the business model and accompanying tech strategy to changing conditions.
In the context of the recent meltdown, the “Passenger Bill of Rights” movement is likely to pick up momentum. Consumer protection approaches are likely to shift from the “Greatest good of the greatest number” towards “Leave no one behind”. This realization should impel technology program managers to pivot to new success metrics.
The status and progress of the furthest behind relative to everyone else need to be tracked and reported as key performance indicators. Market surveys and user research need an intentional focus on the 99.9th percentile (one in a thousand) of customer satisfaction relative to the 50th percentile (middle of the pack) to understand and address divergent rates of progress.
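A sketch of such a tail-focused KPI in Python (illustrative data; a real survey pipeline would be far richer), using a simple nearest-rank percentile to contrast the median customer with the furthest behind:

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an ascending list (0 < p <= 100)."""
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]

# Hypothetical wait times (hours) for 1000 travelers: most are fine,
# but a small tail of 10 travelers waits two full days.
waits = sorted([0.5] * 990 + [48.0] * 10)

assert percentile(waits, 50) == 0.5     # the median looks healthy
assert percentile(waits, 99.9) == 48.0  # the furthest behind do not
```

Reporting only the median would declare this operation a success; tracking the 99.9th percentile surfaces exactly the suffering a “leave no one behind” policy is meant to prevent.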
Closing observations
Complex systems-of-systems are characterized by having a large number of dimensions, nonlinearity, dynamic interaction coupling of sub-systems, stochastic uncertainty, time delays, adaptive emergent behavior and feedback loops. Complexity arises from the resonating interaction of multiple sub-systems, and the potential for chaos emerges from the complexity.
Interdependencies within critical infrastructures that allow failure propagation lead potentially to cascades affecting all systems in the network. In an interlinked network, component outages are not independent; they interact with each other through dependent failure chains that may be low-probability on an a priori basis, but may have high conditional probability once the cascade is triggered.
Breakdowns of such complex networks are often the result of relatively slow initial system degradation escalating into a fast avalanche of sub-system failures, potentially leading to a complete loss of service. While the first few outages might even be independent of each other, the causal failure chains usually become more pronounced in the course of the events, ending up in a fully cascading regime.
Catastrophic cascades, chains of low likelihood events with cumulative impact, cause chaos by the domino effect of failures successively triggering more failures, and have the potential to severely challenge business continuity and disaster recovery plans. The end-result of a catastrophic cascade, known as collapse, is the large-scale failure of an important system-of-systems.
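The domino dynamics described above can be sketched with a toy dependency graph in Python (topology and component names are hypothetical): a single triggering outage propagates along dependency edges until the cascade exhausts itself.

```python
# Simulate a failure cascade: a component fails when something it
# depends on fails, and its own failure triggers its dependents in turn.

def cascade(dependents, initial_failure):
    """Return the set of all components that fail, starting from one outage."""
    failed, frontier = set(), [initial_failure]
    while frontier:
        node = frontier.pop()
        if node in failed:
            continue
        failed.add(node)
        frontier.extend(dependents.get(node, []))
    return failed

# Hypothetical dependency edges: who fails when this component fails.
dependents = {
    "phones": ["crew_scheduling"],
    "crew_scheduling": ["flight_ops"],
    "flight_ops": ["bookings", "baggage"],
}
assert cascade(dependents, "phones") == {
    "phones", "crew_scheduling", "flight_ops", "bookings", "baggage"
}
```

Note the conditional-probability point in code form: each individual edge may rarely fire, yet once "phones" goes down, the downstream failures follow with near certainty.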
When there are people in the loop, there are added complications due to intentionality, irrationality and human error. The inevitable entropy of society, when brought into the complicated mix, eventually takes down functioning systems.
As in Jurassic Park, systems cannot be set up once and for all and then taken for granted; something or other will always go wrong, and catastrophic things can happen. Unless someone is aggressively combating entropy, systems naturally collapse.
One of the impressive things about nature is its resiliency, which rests on a foundation of redundancy. In contrast, industrial systems, like (say) global shipping and logistics, have been reducing redundancy for decades to save costs, and when Covid hit, they crumbled. We need to think differently henceforth about the systems we build, knowing how vulnerable we are to them.
Note: Facts and quotes herein are from the references linked at the end of the article. Readers are strongly encouraged to refer to these original sources for further details. The focus of this narrative is to learn from the experience and the opportunities surfaced by the incident.
References