Risk can’t be eliminated, but it can be managed
Carl Forsling
Government and aerospace business development; Marine veteran; Shipley certified; award-winning military affairs writer
In 2007, I started my engines for a night-vision goggle (NVG) flight as the pilot in command on an MV-22B Osprey tiltrotor aircraft. “Some sparks out of the left EAPS, sir.”
The EAPS, or engine air particle separator, was a hydraulically driven impeller system that protected the engine by expelling particles from the intake to prevent what’s called foreign object damage, or “FOD.” The fan spun fast enough to impart static electricity to the particles as they flew out the exhaust, which often gave the appearance of sparks on NVGs. “Rog. No big deal.”
We flew to our landing zone (LZ) and began our training, picking up a concrete block with a hook, called a pendant, underneath the aircraft. Then we’d circle the LZ and set it down again. We completed several repetitions. On one, we were on final to drop off the load when the crew chief called, “I see flames coming out of the left nacelle.”
Almost simultaneously, the cockpit display lit up with a cascade of alerts, starting with a relatively minor HYD 3 PRESS LOW, meaning we’d lost pressure to our utility hydraulic system. Within just a few seconds, they increased in severity, ending in LEFT ENGINE FIRE. That one got our attention.
“That fire is getting pretty big, sir.” If you can see the engine fire on the outside of the nacelle, things are getting bad. We jettisoned the load we were carrying and landed the aircraft as quickly as possible.
As soon as we landed, we performed an emergency engine shutdown and hit the switch for the engine fire extinguishing system, filling the engine compartment with gas, but to no effect. We clambered out of the aircraft. By the time we were outside, flames had engulfed the left nacelle all the way from the exhaust past the top of the rotors—flames some 15 feet high.
Over the next half-hour we watched the left nacelle gradually melt as we waited for the Holly Ridge Volunteer Fire Department, who looked exactly as you might expect. They arrived on the scene only to spray water on a Class D fire—a metal fire that normal water is almost wholly ineffective against. Another half-hour later, the trucks from Camp Lejeune came with special foam for aircraft fires and finally extinguished the blaze. While the fire hadn’t claimed the whole aircraft, a $16 million nacelle was melted into a ball of toxic slag.
After the Bay of Pigs fiasco, President Kennedy said, “Victory has a thousand fathers, but defeat is an orphan.” That’s just what the folks wanting to avoid paternity say.
Nearly every adverse event of any consequence has not just one cause but dozens, if not more. Any given catastrophic event could have been stopped at many junctures but wasn’t.
These moments are described by something called the “Swiss Cheese Model.” Every complex undertaking has a series of procedures and safeguards, none of which are 100% solid, much like slices of Swiss cheese. If you picture the course to a catastrophe as a straight line from the beginning of an evolution to the adverse event, many things must go wrong in sequence—passing through several holes in several slices. If just one of those holes were filled or moved slightly, the disaster could have been averted.
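If you like to see the arithmetic behind the model, here’s a minimal sketch in Python. The layer names and numbers are made up purely for illustration, and it assumes the slices fail independently of one another, but it shows why filling even one hole pays off out of proportion to its size.

```python
# Hypothetical safeguards ("slices") and the chance each one fails (its hole lines up).
# Real numbers would come from mishap and maintenance data; these are illustrative only.
layers = {
    "design keeps hydraulic fluid away from hot sections": 0.01,
    "crew recognizes the warning signs": 0.10,
    "fire suppression puts out the fire": 0.05,
}

def catastrophe_probability(failure_probs):
    """If the layers fail independently, a catastrophe requires every one of them
    to fail, so the overall probability is the product of the per-layer chances."""
    p = 1.0
    for prob in failure_probs:
        p *= prob
    return p

p_before = catastrophe_probability(layers.values())

# Fill one hole slightly: say better training halves the crew's miss rate.
layers["crew recognizes the warning signs"] = 0.05
p_after = catastrophe_probability(layers.values())

print(f"before: {p_before:.6f}  after: {p_after:.6f}")  # 0.000050 vs 0.000025
```

Tighten any single slice and the end-to-end probability drops with it.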
The most common mistake in analyzing failure is to stop at the first-order cause: “The plane caught fire because of a faulty component.” But analysis can’t stop there.
In my case, first the impeller had to malfunction. Then my crew and I had to miss the significance of the sparks being visible to the naked eye rather than only through NVGs. Then the first malfunction had to lead to a hydraulic line breaking and spilling fluid down into the hot exhaust section. Then the fire extinguishing system had to fail to put out the fire. Even that cursory analysis gives at least four points at which the chain of events leading to disaster could have been broken. There are certainly many more.
No one intentionally screws up. No pilot goes flying with the intention of wrecking his aircraft. He made a mistake because he either forgot how to do the task correctly or was never taught properly in the first place. Perhaps he didn’t sleep well the night prior. Why didn’t he?
Perhaps there was an equipment malfunction. Who serviced it last? Was that technician up-to-date on his training? Were the tools he used properly calibrated in accordance with published procedures? If we want to think bigger, were the aircraft flight instruments easy to use? Were they in accordance with current best practices in human factors design?
It might seem an impossible task to fill every hole in the Swiss cheese where things might go wrong. Redesigning entire airplanes to avoid an adverse event that occurs only once in millions of flight hours is not an efficient use of resources.
On the other hand, we often think of the Butterfly Effect as a science fiction trope about how time travelers to the past irrevocably and significantly screw up the present by altering events in small, nearly imperceptible ways.
We rarely appreciate how small actions in the present can positively affect the future. The Swiss Cheese Model helps identify points where we should use the Butterfly Effect to our advantage.
If we can efficiently put small, often almost unnoticeable control measures somewhere in the chain of events leading to an adverse event, we can keep those events from happening in the future.
We can’t eliminate all risk unless we decide never to do anything at all. What we can do is evaluate the risks that do exist and make plans to mitigate them. The simplest way to do this is to think of risk as the product of the likelihood of an adverse event and the consequence of that event.
If something is very likely to happen and the consequence is severe, the risk is high. The extreme would be a 100% chance of death or severe injury. Most people would agree that such a risk is unacceptable and that the task shouldn’t be undertaken. While that’s an extreme example, every field has a maximum acceptable risk level, whether measured in lives or dollars. If you can’t get below that level, don’t do it at all.
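Here’s that calculus as a minimal sketch, again in Python. The 1-to-5 scales and the ceiling are made up; the point is only that “too risky” is a comparison against a maximum your field has already decided it can accept.

```python
# Hypothetical 1-5 scales for likelihood and consequence; the ceiling stands in
# for whatever maximum your field will accept, in lives or dollars.
MAX_ACCEPTABLE = 15

def risk_score(likelihood: int, consequence: int) -> int:
    """Risk as the product of likelihood and consequence."""
    return likelihood * consequence

def acceptable(likelihood: int, consequence: int) -> bool:
    return risk_score(likelihood, consequence) <= MAX_ACCEPTABLE

print(acceptable(likelihood=5, consequence=5))  # False: near-certain and severe, don't do it
print(acceptable(likelihood=2, consequence=5))  # True: rare but severe, control it instead
```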
Too often, people stop their flowcharts right there. They think something is either too risky or it isn’t.
Most risks don’t have to be eliminated. They just have to be controlled. How you control a risk depends on what resources you have and how big the risk is.
In the case of my extra-crispy fried Osprey, the perfect solution might have been to redesign the particle separator system to eliminate hydraulics altogether. Some other rotorcraft use what are called “barrier filters,” much like those in your home air conditioner, to protect their engines. A system like that would have eliminated any chance of a hydraulic fire.
That also would have entailed redesigning the entire nacelle and replacing the nacelles of every aircraft in service, with impacts across the entire supply system. All told, the bill would have run to billions of dollars and taken several years. The cost would have exceeded the price of several aircraft, far more than it would likely save over the life of the fleet.
When the cost of mitigation exceeds the cost of the hazard, our controls may be hurting more than they help. We can’t bubble-wrap the world. Take a step back and devise a more efficient control for the risk. That could be a less ambitious technical solution, or it could be a combination of a technical solution and a personnel solution. One of the corrective actions for my mishap was less complete than the perfect solution, but still filled a hole in the Swiss cheese—they simply added a drain to vent hydraulic fluid overboard in the event of a system failure.
Just as impactful was the implementation of training. In aviation, one of the first things that happens after an investigation uncovers the root causes of a mishap is to tell everyone exactly what was done wrong. In my case, it meant broadcasting to the world that I should have realized the significance of sparks visible to the naked eye coming from the particle separator. The circumstances of my mishap were read aloud in every Osprey squadron in the Marine Corps. It wasn’t the right solution for my ego, but it was the right solution for the safety of Marine aircrews and their passengers.
Look outward at the risks and inward at how to fix them. Good organizations look at their failures and see where their controls failed. The best organizations go one step further. They envision worst-case outcomes and perform “pre-mortems”: “If this planned action were to fail catastrophically, why would it have happened?”
You shouldn’t need to suffer a catastrophe before filling the holes in your Swiss cheese. Some pre-mortems are brainstorming sessions, looking for “Black Swans.” Others may be a matter of forcing people to fully consider more routine risks.
In military aviation, the people who schedule the flights will check for certain key factors known to be warning signs. If one of the pilots hasn’t flown in a long time, the other pilot might need to be an instructor, for example.
Before the flight leaves, the flight crew is charged with a final assessment of the risks, not only double-checking the previously identified risks, but looking for ones that could have emerged just prior to the flight. They look at whether all the systems on the aircraft are working, the weather, and even whether the whole crew rested properly the night before.
Certain risks are minor enough that they are just noted for awareness. Some may require a change in plans—bad weather might force a change in the route. Others might require approval from a higher-level leader to decide whether the benefits of the flight are worth the risk. Still others require the flight to be cancelled, with no room for negotiation or mitigation. If the crew didn’t get their mandated rest, there’s no way to mitigate the risk after the fact. The only choices at that point are to cancel the flight or to accept the risk because the mission is important enough.
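A sketch of that escalation ladder, with hypothetical thresholds rather than any squadron’s actual matrix, looks something like this:

```python
def disposition(risk_level: int) -> str:
    """Map an assessed risk level (1 = minor .. 4 = severe) to the required response.
    The thresholds and wording here are illustrative only."""
    if risk_level <= 1:
        return "note it for crew awareness and proceed"
    if risk_level == 2:
        return "change the plan (e.g., re-route around the weather)"
    if risk_level == 3:
        return "elevate to a higher-level leader for a go/no-go decision"
    return "cancel the flight unless leadership explicitly accepts the risk"

# Crew didn't get mandated rest: there's nothing left to mitigate before takeoff.
print(disposition(4))
```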
You don’t need to work in a life-or-death job to need controls for risks. Too often, businesses look neither backward nor forward at risks. People do not want to admit, or be reminded, that their mistakes cost money.
The analysis usually ends with, “Finance/sales/PR/etc. screwed up. Don’t let it happen again.” Going back to the premise that no one screws up on purpose—why was the task done wrong? How can we make sure it’s done right? Answering that requires a truly honest accounting of what happened after a major error and a forthright assessment of weaknesses.
Are your standard procedures faulty? Perhaps they are, but perhaps the issue is that they aren’t easily referenced and are buried in a poorly organized hard drive or SharePoint site. Perhaps just having a procedure was sufficient to pass a safety or ISO inspection, but for actual work, people must be able to use real procedures, which means they need to be accessible and usable.
Follow the holes back another slice. If you don’t provide training or good references, people get by just observing others, who might not know the right way, either. Sometimes this situation is even actively promoted by people for whom esoteric knowledge is their source of organizational power.
Every time someone passes knowledge on, there’s the potential for each generation to add its own errors and idiosyncrasies. Like a cassette tape copied from a copy, it gets a little worse each time—what aviators call the “normalization of deviance.” That isn’t as fun as the term implies.
These sorts of arrangements point to an inadequate training process and shortcomings in organizational culture. Are you really training people to do critical tasks, or just counting on them to figure it out as they go?
Follow a different sequence of holes and you might end up at your hiring process. Are you hiring for skills or for talent? Which is the right one for your organization’s mission?
You might not have the resources to fill all the holes you find with risk controls. Increasing pay to attract better employees might not be in the budget and could take months or years to have an effect. But perhaps if you spent a couple of weeks documenting your key procedures and a few thousand dollars on website design to make them easily searchable, you could eliminate at least some of the holes.
If you prevent just one expensive error, it’s time and money well spent.
Just as success is a multi-step process of interdependent variables, so is failure. Either one can be easily interrupted at the wrong, or right, time. Half of success is just avoiding failure. Address the root causes of your failures, and success becomes that much more likely.