How Complex Systems Fail 3
Reviewing the work of Richard Cook (https://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf) and applying it to satellite communications.
Catastrophe requires multiple failures: single point failures are not enough.
The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents. Most initial failure trajectories are blocked by designed system safety components. Trajectories that reach the operational level are mostly blocked, usually by practitioners.
I view number three here as building off of number two. A properly engineered system has backup systems that can handle single point failures, and maybe even multiple failures. In the end, though, there is always a breaking point where the odds stack up against the system and it fails catastrophically.
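To put rough numbers on the "odds stack up" idea, here is a minimal sketch. It assumes each defense fails independently on any given day, which real systems often violate, and the daily probabilities are made up purely for illustration. The function names are my own, not anything from Cook's paper.

```python
from math import prod


def p_all_layers_fail(layer_failure_probs):
    """Chance that every defense fails on the same day (assumes independence)."""
    return prod(layer_failure_probs)


def p_catastrophe_within(days, layer_failure_probs):
    """Chance of at least one day within `days` on which all defenses fail."""
    p_day = p_all_layers_fail(layer_failure_probs)
    return 1.0 - (1.0 - p_day) ** days


# Hypothetical daily failure probabilities for three independent defenses.
layers = [0.05, 0.02, 0.10]

print("all layers fail on one given day:", p_all_layers_fail(layers))
for days in (30, 365, 3650):
    print(days, "days:", round(p_catastrophe_within(days, layers), 4))
```

Any single day looks safe because the combined probability is tiny, but the chance of at least one bad alignment keeps growing the longer the system runs. Correlated failures, where one problem makes the next more likely, would make the real odds worse than this sketch suggests.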
One catastrophic failure that sticks out in my mind was when an uninterruptible power supply (UPS) caught fire, melted, and dropped power to our transmit amplifiers.
So what happened?
Once again, it was multiple failures that each seemed unlikely but, given enough time, eventually happened together.
This catastrophic failure was in a satellite terminal in Balad, Iraq. The shelter was getting old and starting to show it. Part of the issue was that a corner of the shelter had started to separate, creating a hole. Normally, with the crazy hot weather in Iraq, this wasn't a problem, but then we hit the rainy season. Management had been notified for months about the maintenance needed and the resources required to complete it. I think management was just waiting until their tour ended, hoping it wouldn't become an issue and that they could push it off to the next guy.
On a gloomy Iraq day it started to rain, a real downpour, and water started getting into the shelter through the hole. The water ran along the wall until it hit the exhaust vents of the traveling-wave tube amplifier (TWTA), pooled on them, and started to drip again. This time it dripped onto the uninterruptible power supplies. The water pooled there, soon got inside the uninterruptible power supplies, and started a fire. The uninterruptible power supplies melted and cut all power going to the TWTA.
The outage alarmed on our monitor and control system. When the technician got to the shelter, he was able to put the fire out and reroute power. The outage lasted 45 minutes. The management team was able to get the previously requested resources to fix the hole before the outage was even resolved. By the end of the day the hole was fixed, and this never happened again.
What's your favorite incident of all the pieces aligning to create a catastrophic failure?
That has been my experience and observation in the aviation world as well... rarely are catastrophic aviation incidents the result of a single point of failure... it's many small failures that align to cause a bigger issue. The "technical" term we used was: "The holes in the Swiss cheese lined up" (i.e. the holes by themselves don't cause an issue, but when they come together in the right way, you have a problem).