The Real Risk in Identifying Failure Modes
The theme of this article: Identifying the Single Point of Failure
Tags: #RCM #reliabilitycentredmaintenance #reliability
A Story:
The Battle of Stalingrad during World War II was the largest battle in history. With it came equally staggering stories of how people dealt with risk.
One such story came in late 1942, when a German tank unit sat in reserve on grasslands outside the city. When tanks were desperately needed on the front lines, something happened that surprised everyone: almost none of them worked.
Out of 104 tanks in the unit, fewer than 20 were operable. Engineers quickly found the reason the tanks were not working.
Historian William Craig writes: “During the weeks of inactivity behind the front lines, field mice had nested inside the vehicles and eaten away insulation covering the electrical systems.”
The Germans had the most sophisticated equipment in the world. Yet there they were, defeated by mice. You can imagine their disbelief. This almost certainly never crossed their minds. What kind of tank designer thinks about mouse protection? Not a reasonable one. And not one who had studied tank history.
Discussion:
But these kinds of things happen all the time. You can plan for every risk except the things that are too crazy to cross your mind. Such risks are known as 'black swan' events. They can do the most harm because they happen more often than you think and you have no plan for how to deal with them. Avoiding these kinds of unknown risks is, almost by definition, impossible: you can’t prepare for what you can’t envision. If there’s one way to guard against their damage, it’s by correctly identifying and avoiding single points of failure.
A good rule of thumb for a lot of things in life is that everything that can break will eventually break. So if many things rely on one thing working, and that thing breaks, you are counting the days to a seemingly 'black swan' catastrophe. That one thing is a single point of failure.
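To make the idea concrete, here is a minimal sketch in Python. It assumes a deliberately simple model: each function of a system is listed with the components that can deliver it, and any one of those components is enough. The function name `single_points_of_failure` and the tank-like example data are hypothetical, made up for illustration only.

```python
from collections import defaultdict


def single_points_of_failure(providers):
    """Return {component: [functions that depend on it alone]}.

    `providers` maps each system function to the components that can
    deliver it. A function with exactly one distinct provider has no
    backup, so that provider is a single point of failure for it.
    """
    spof = defaultdict(list)
    for function, components in providers.items():
        unique = set(components)
        if len(unique) == 1:
            (only,) = unique
            spof[only].append(function)
    return dict(spof)


# Hypothetical example data (not taken from any real vehicle design).
providers = {
    "engine start": ["wiring harness"],
    "turret traverse": ["wiring harness"],
    "radio": ["wiring harness"],
    "braking": ["hydraulic brake", "mechanical brake"],
}

result = single_points_of_failure(providers)
print(result)
```

In this made-up example three functions rely on the wiring harness alone, while braking has a backup. The more functions that hang on one component, the worse the day when it breaks.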
Conclusions:
Some people are remarkably good at avoiding single points of failure. Most critical systems on airplanes have backups, and the backups often have backups. Modern jets have four redundant electrical systems. You can fly with one engine and technically land with none, as every jet must be capable of stopping on a runway with its brakes alone, without reverse thrust from its engines. Suspension bridges can similarly lose many of their cables without falling.
Therefore, in RCM (reliability-centred maintenance), it is important to identify the failure modes that are single points of failure. Such failure modes must be identified for every stage in the life of an engineered system: a) during startup, b) during operation, c) during emergency stops, d) during stoppages, e) during stand-by, and f) during storage.
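One way to keep the stages from being overlooked is to record each failure mode against the stage in which it occurs and then check which stages have no entries at all. The sketch below is only an illustration under that assumption. The names `Stage`, `FailureMode`, `stages_not_reviewed` and `single_point_modes` are hypothetical, not part of any RCM standard or tool, and the two register entries are invented examples.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The six stages listed in the article."""
    STARTUP = "startup"
    OPERATION = "operation"
    EMERGENCY_STOP = "emergency stop"
    STOPPAGE = "stoppage"
    STANDBY = "stand-by"
    STORAGE = "storage"


@dataclass(frozen=True)
class FailureMode:
    description: str
    stage: Stage
    single_point: bool  # True if no backup covers this failure


def stages_not_reviewed(failure_modes):
    """Stages with no recorded failure mode, i.e. gaps in the review."""
    covered = {fm.stage for fm in failure_modes}
    return [stage for stage in Stage if stage not in covered]


def single_point_modes(failure_modes):
    """Failure modes flagged as single points of failure."""
    return [fm for fm in failure_modes if fm.single_point]


# Hypothetical register with two invented entries.
register = [
    FailureMode("Rodents eat wiring insulation", Stage.STANDBY, True),
    FailureMode("One of two brake circuits leaks", Stage.OPERATION, False),
]

missing = stages_not_reviewed(register)
critical = single_point_modes(register)

print("Stages still to review:", [stage.value for stage in missing])
print("Single points of failure:", [fm.description for fm in critical])
```

An empty stage in such a register does not mean the stage is safe. It means nobody has looked yet, which is exactly where the tanks in the story were caught: in reserve, not in battle.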