Black swans and fatigue failures
Stephen Dosman
FRAeS, Consulting services incl. CVE, Dynamics, Test Development, Fatigue/Damage Tolerance, & Continuing Structural Integrity
What does the global financial crash of 2008 have in common with the uncontained engine failure on an A380, Qantas Flight 32, over Indonesia in 2010? A failure which came within a whisker of causing major fatalities. {1}
tl;dr – The costs to Rolls Royce and the insurers were well in excess of US$200 million, yet the cost of preventing the failure via Service Bulletin – it was due to a known issue in the engine oil feed stub pipes – would have been insignificant in comparison {2} (the 737 Max is another obvious example of the cost of prevention vs cure). Engineering managers are incentivised to prioritise short-term thinking and to take risks on the assumption that “it will be ok” or “it won’t happen to me”. Fund managers in the 2008 crash made the same assumptions. In both cases the individuals did not understand the true probability of risks being realised.
This article is about similarities in the way that financial institutions and engineers poorly model risks of failure. It is about the irony of researchers demonstrating that existing methods didn’t work, only for their findings to be ignored while their underlying data became enshrined in regulators’ sanctioned methods. It is about the process of retconning old methods and rationalising the factors used to generate safe lives. Finally, it is about why it’s important to understand all this if you are accountable for aviation risks. This article will help you understand those risks.
Put another way – we use bad methods, failures happen when they shouldn’t, and we shrug our shoulders and keep on using those bad methods.
The Question
Back to our question; both the 2008 crash and the A380 engine failure could be considered Black Swan events, something which “comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight.” Black swans are examples of ‘fat tails’ in the distribution of catastrophic events (see Figures 1 vs 2). These fat tails occur where contributing factors combine to exacerbate each other. Nassim Nicholas Taleb, who developed the black swan theory, and who arguably predicted the 2008 crash, calls this a ‘scalable’ property. An example of a non-scalable property might be the distribution of people’s heights, because nature physically limits how tall we can get; wealth, on the other hand, is scalable – the wealthier we become the more wealth we can gather, without limit.
In the case of the financial crash the fat tails were due to the financial services industry’s use of bad risk models and poor forecasting methods. For Flight 32 the problem came as a result of a poorly made oil feed stub pipe in Engine #2 – a known issue to the engine manufacturer – which cracked, leading to a fire, an explosion, and then very nearly the loss of the aircraft with 469 passengers and crew on board. This event fits the definition because it was an unanticipated effect, the impact on the aircraft was significant and out of all proportion to what one would consider reasonable, and after the fact much rationalising of the events that led to the engine failure occurred (cf. ATSB report). {3}
All well and good, but what does this have to do with fatigue scatter factors? I’m going to argue here that the scatter factors used on fatigue test results {4}, intended to provide an acceptable minimum level of safety, use the same modelling techniques and ignore the effect of fat tails; yet fat tails can, and almost certainly do, exist here {5}, and as a result the real risks are unquantified. This is important because decision makers in the realm of Structural Integrity need to quantify their risks {6}. (Note: I’m not talking here about probabilistic risk assessment, which is a different aspect.)
“They think that intelligence is about noticing things that are relevant (detecting patterns); in a complex world, intelligence consists in ignoring things that are irrelevant (avoiding false patterns)” Nassim Nicholas Taleb
Fat tails
Modelling of fatigue failure generally uses a log-normal distribution (Figure 1), which is a thin-tailed (Gaussian) distribution. What is it about aircraft structural failures in the real world that can lead to fat tails? Remember the idea of scalability – we may test a structure to identify its fatigue characteristics, and even if we attempt to build a certain degree of manufacturing variation and accidental and environmental damage (AD/ED) into the test article {7}, the fatigue performance of that article still in many ways assumes that it is built right, used right, and maintained right.
How can poor build, misuse, and poor maintenance fatten the tails? Well, they are all ‘scalable’ – they can all exacerbate each other. Remember the example of the oil feed stub pipe? A minor part cracks and an uncontained engine failure results. Similarly, we can have a poorly maintained part begin to corrode or fail and then lead to a cascade of other failures (e.g. the AW101 Merlin Helicopter ZJ123 NLG striker plate failure leading to an inability to lock down the gear – in that case the impact was luckily small). Likewise, if the aircraft usage changes then this can significantly alter fatigue damage accumulation and/or create new unanticipated failure modes. All these things involve small unanticipated issues that can lead to large changes in when, or even if, fatigue failure is likely to occur. For this reason, I would suggest that fat tails and black swan events can be expected in real world aircraft usage. Now I’m not saying here that the industry doesn’t recognise and ameliorate the impact of build, use, and maintenance on safety, just that their impact is not quantified. If the risk created was small, then perhaps the effect could be ignored; however, just considering maintenance for the moment, even the FAA states that “Maintenance was involved in 15% of accidents (39 of 264) during 1982-1991, and ranks second in contributing factors to onboard fatalities”. So even here we can see that our assumption that the aircraft is ‘maintained right’, which underpins our safe life, is starting to look extremely weak. {8}
Safety provided
What is the impact of these fat tails on our assumptions of the safety provided by fatigue scatter factors? (I’m limiting the discussion to fatigue test factors here but will talk about Damage Tolerance factors another time.) Firstly, we need to look at what level of safety the scatter factor is intended to provide. I’m going to summarise the commonly used methods later in the article, but for now we can confine ourselves to the following. A fatigue test (or a number of fatigue tests) is used to create a sample mean life. The distribution of fatigue failures anticipated in service is assumed to follow a probability distribution, commonly the log-normal distribution shown in Figure 1, and an acceptable total probability of failure is defined – the area under a section of the PDF. For a given standard deviation, s, there will be a fixed relationship between that probability and the mean life. And ‘Hey Presto’ we get the scatter factor – apply this to your test result and you now have a safe life: the life at which the probability of failure has risen to the threshold that you set.
Figure 1 – Probability Density Function (PDF) for a Gaussian distribution, safe life = 1/1000 PoF {9}
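As a rough check on the arithmetic just described, here is a minimal sketch in Python (illustrative only, not any regulator’s sanctioned method; the s = 0.14 and 1/1000 values are the example figures used elsewhere in this article):

```python
# Minimal sketch: deriving a fatigue scatter factor from a thin-tailed
# (log-normal) failure assumption. All values are illustrative only.
from scipy.stats import norm

s = 0.14           # assumed standard deviation of log10(life), cf. Table 1
target_pof = 1e-3  # acceptable total probability of failure (left-tail area)

# Standard-normal quantile for the target PoF (about -3.09 for 1/1000)
z = norm.ppf(target_pof)

# The safe life sits |z| * s below the mean in log10(life) space, so the
# scatter factor on the (true) mean life is 10**(|z| * s).
scatter_factor = 10 ** (abs(z) * s)
print(f"z = {z:.2f}, scatter factor on the true mean life = {scatter_factor:.2f}")
# ~2.7; published factors are larger because a test gives only a sample
# estimate of the true mean (see footnote {9})
```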
Of course, the big assumption here is not only that your test mean life is reflective of the mean in service, but also that the distribution of failures in service will reflect your assumed parameters. These assumptions are necessary because it is not feasible to test a design again and again in order to develop the needed statistical parameters. Therefore, in the main, the standard deviation of log-life used in scatter factors is taken from the results of fatigue tests and in-service aircraft teardowns, on the assumption that this data will reflect our new design.
What happens if real-world effects mean that the failure distribution is scalable, i.e. fat tailed? We get something more like Figure 2, even if our test mean life is accurate. It is interesting to note that the most likely event is still the mean life; it’s just that extreme events are also much more likely.
Figure 2 – PDF with fat tails, safe life represents >> 1/1000 PoF (note: the mean is still the most likely outcome!)
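To get a feel for how much the tail assumption matters, the sketch below compares the left-tail probability at the same safe-life point under a thin-tailed model and a fat-tailed stand-in (a Student-t distribution with few degrees of freedom – an arbitrary illustrative choice, not a claim about the true distribution of fatigue lives):

```python
# Illustrative comparison: probability of failure at the same safe-life
# point under a thin-tailed (normal) vs a fat-tailed (Student-t) model of
# log-life. The t-distribution is an arbitrary fat-tailed stand-in.
from scipy.stats import norm, t

z_safe = norm.ppf(1e-3)  # safe-life point, ~3.09 sd below the mean

p_thin = norm.cdf(z_safe)    # 1/1000 by construction
p_fat = t.cdf(z_safe, df=3)  # same point under a heavy-tailed model

print(f"thin-tailed PoF: {p_thin:.1e}")  # 1.0e-03
print(f"fat-tailed PoF:  {p_fat:.1e}")   # roughly 2.7e-02
print(f"ratio: about {p_fat / p_thin:.0f}x higher")
```

The exact ratio depends entirely on the stand-in chosen; the point is only that the same ‘safe life’ can carry a far larger risk once the tails fatten.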
The following features are fairly common to these scatter-factor methods:
- Assumptions are made about the independence of cracks found within a single specimen (did a crack appear here because one formed over there and the load paths altered?), but it is unclear whether this assumption has ever been tested within the data sets used.
- Assumptions are made as to the validity of using the log-normal distribution, but it appears in many cases that it is only appropriate near the mean and does not work in the tails (cf. USAF AFFDL-TR-66-197, sect II, para 1) – exactly where the results are used!
- The life factor created by this analysis is then very often reduced further to account for unquantified conservatisms – chiefly that the methods take no credit for the effect of inspections and fleet experience on the true risk of failure – yet at the same time the methods ignore the detrimental effect of possible poor maintenance, misuse, poor build etc {10}. I.e. they ‘hand-wave’ some over-conservatisms into the factor but then ignore the unconservatisms; you can’t have it both ways!
So the most common method of generating a fatigue life factor is to take a bunch of fatigue test crack findings from a number of tests, fit a distribution which may only be accurate near the mean log-life, extrapolate out to a defined acceptable probability of failure, and then fudge that factor down a bit to a) make the result a bit more acceptable from a cost point of view and b) account for some difficult-to-quantify conservatisms (while excluding some significant unconservatisms) {11}.
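Putting that recipe end-to-end in a hedged sketch (the test lives below are invented for illustration, and no real method is quite this naive):

```python
# End-to-end sketch of the common recipe with invented test data:
# fit a normal distribution to log10(lives), then extrapolate to 1/1000.
import numpy as np
from scipy.stats import norm

lives = [98_000, 121_000, 143_000, 151_000, 176_000,
         188_000, 205_000, 240_000]   # invented fatigue test results
log_lives = np.log10(lives)

mu = log_lives.mean()
s = log_lives.std(ddof=1)
safe_log_life = mu + norm.ppf(1e-3) * s  # extrapolate deep into the tail

print(f"fitted mean life ~ {10**mu:,.0f} cycles, s = {s:.3f}")
print(f"1/1000 'safe life' ~ {10**safe_log_life:,.0f} cycles")
# Eight points say nothing about the 0.1% tail: the answer rests entirely
# on the thin-tailed assumption, which is exactly the concern here.
```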
Does that mean that, say, one can really count on a Def Stan 00-970 based test factor to provide a 1/1000 probability of failure? I would argue that the answer is emphatically ‘no’. I’m not saying that this means a structure may be unsafe – what I am saying is that if a decision maker, say a Type Airworthiness Authority (TAA), is operating under degraded levels of safety for any reason (which is rather often!), then relying heavily on that PoF as part of their justification to their Aviation Duty Holder (ADH) may not be appropriate.
Now it is worth mentioning here that being able to work with risk is core to MoD type airworthiness requirements. These require risk to be made ‘As Low As Reasonably Practicable’ (ALARP) and Tolerable {12}. A key feature is that what is ‘tolerable’, in terms of probability of occurrence, is not specified – it is down to the ADH to decide for herself what she feels is a tolerable risk in that situation. It is exactly here that an ADH will need to understand what risks she is running, and where a fatigue test factor may give false optimism.
One obvious question at this point is whether we can add fat tails to our statistics. The problem is that these distributions are not always amenable to concepts like mean and standard deviation {13}, and there does not appear to be much in the literature at the moment where researchers have established how to take fat tails into account for structural failure (though there is some {14}). Therefore, practically speaking, the main takeaway is this: the use of a scatter factor can give some assurance of structural integrity, but it cannot account for real world issues associated with build quality, usage, and maintenance, and thus the level of assurance is not quantified. Now Taleb would suggest using an antifragile strategy to deal with this – a way to minimise negative risks whilst accentuating positive risks; however, the aviation community has been moving towards the use of Safety Management Systems (SMS), with SMS already in many parts of civil regulation and coming soon to Part 21 and Part 145. I guess the big question is whether SMS will adequately address the fat tails issue? This is not something we can answer right now, so let’s deal with something we can answer…
Understanding the statistics behind commonly used scatter factors
If we’re going to work with these scatter factors, then it makes sense to understand the statistics and assumptions that underlie them – i.e. you can’t break the rules until you understand them. See Table 1 below for a run-through of common factors used in civil and military Acceptable Means of Compliance (AMC) {15}. Even a quick look shows that there is a fair amount of ‘hand-waving’ and a lot of unjustified assumptions underlying the factors.
Table 1 – Basis for common fatigue scatter factors (s is the standard deviation of log-life, PoF = Probability of Failure)
Note: Guy Habermann, 2007, came up with a novel alternative he calls the Mean Estimation Method, which addresses a number of the issues with the factors discussed in Table 1; however, it still assumes a thin-tailed distribution (albeit it appears that it could be extended to fat-tailed distributions if those distributions were understood, which they currently are not).
Closure
What to make of this?
- There is plenty of evidence that the same arguably faulty methods used in the financial sector to model risks (i.e. the Bell Curve/Gaussian distribution) are just as faulty in aviation safety, and plenty of evidence that both sectors suffer unexpected, high-impact failures as a result.
- It appears that, although some in the industry understood the failings of the methods considered, the regulators universally pushed for these faulty methods to continue to be used, and furthermore reverse-engineered their numbers to justify the continued use of existing factors.
- Those owning aviation risks are told that risks are no higher than ‘x’, yet they are being sold an oversimplification. The risks are higher than ‘x’. If this is understood, then a duty holder can make a better judgement when operating at reduced levels of safety.
If you’re not yet convinced, here is another nail in the coffin… uncontained engine failures are meant to be 1E-9/hr events – i.e. they should never happen – yet on the 30th of September 2017, Air France Flight 66, another A380, this time with Engine Alliance rather than Rolls Royce engines, suffered its own uncontained engine failure (Figure 3), attributed to fatigue cracks in the fan stage after just 6 years in service. This stuff keeps happening.
Figure 3 – Air France Flight 66, A380 #4 Engine failure, 2017.
All this said, I guess I could have just replaced this article with a warning on the dangers of extrapolation…
Credit: xkcd/605 (CC BY-NC 2.5)
“Things always become obvious after the fact” Nassim Nicholas Taleb
Fin
Footnotes
{1} The ATSB report cannot convey how close Qantas Flight 32 came to disaster; at so many stages it was whisker-close – if the shrapnel distribution had been slightly different, if there hadn’t been two extra check pilots on board to deal with the massive increase in crew workload associated with landing with so many significant aircraft systems inoperative, if the literal tons of fuel leaking from the damaged wing onto red-hot brakes on the ground had ignited, etc. For a discussion of the role played by Captain Richard Champion de Crespigny see this article here. Note also that overall safety is improving in the civil sector, but I’m talking about something different here – preventable accidents.
{2} With respect to costs – the repair on VH-OQA (Flight 32) cost Qantas’ insurers A$139 million, and Rolls Royce agreed to pay Qantas A$95 million in compensation for other impacts on the airline (source). The situation was finally resolved when all the non-complying oil feed stub pipes had been replaced and an overspeed protection system put in place (cf. EAD 2010-0242-CN). How much would it have cost to inspect and replace the discrepant oil feed stub pipes, which were responsible for the failure, under a service bulletin? Service Bulletins are not public domain, so without access to Rolls Royce NMSB 72-AG590 we won’t know. It wouldn’t have been cheap, as it required engine swaps, albeit ones that could have been aligned with existing maintenance. The 17 engine swaps, combined with the fleet of 6 aircraft grounded for 20 days, an aircraft out of action for 18 months, and other costs, were estimated at A$80 million (same source), but the vast majority of that would be the cost of over 600 aircraft-days of lost revenue due to grounding and repair.
The background here is that the failure was due to oil feed stub pipes being manufactured with overly thin walls as a result of tolerancing issues. Rolls Royce realised back in 2007 that their supplier Hucknall had been supplying non-conforming parts due to a culture of inspectors turning a blind eye to non-conformances (ATSB report section 4.10), so 100 non-conforming oil feed stub pipes were retrospectively concessed – a process judged by the ATSB report to have been carried out incorrectly. There would have been huge pressure on Hucknall to ‘show the oil pipes good’, and they did. As a result, no containment action was believed to be required and the oil feed stub pipes remained in service. However, even after the Hucknall Major Quality Investigation was complete, the facility was still producing non-conforming units that were concessed as “having no limitations or conditions on its use”. Unfortunately the problem remained because the datums used by the inspectors differed from those in the design data.
{3} Other examples of black swan events include: the 737 Max disasters; A380 rib feet cracking, which reflected a largely unknown damage mechanism for certain types of aluminium alloys; the Nimrod disaster; 747 explosive decompression due to temporary repair paperwork being lost; the 737 ‘cabriolet’ (roof ripped off) due to poor understanding of the effect of multi-site fatigue damage; multiple DC-10 floor collapses after cargo door failures; and of course the Comet disasters.
{4} What we mean here by a fatigue scatter factor is a factor on the assumed mean life of the component or aircraft. This can account for variation in fatigue performance, uncertainties in usage, uncertainties in the accuracy of analysis, and uncertainties in the sample test life vs the considered ‘true’ mean life for the type, assuming it is built in accordance with the design standard and used as assumed by that standard. In many cases a log-normal distribution is assumed, and if the mean and standard deviation are known, then a constant factor on that mean life represents a fixed probability of failure. Note also that although civil aircraft and many military aircraft are damage tolerant, fatigue scatter factors on test results are still used throughout the aviation domain.
{5} If they didn’t, then we wouldn’t see events like QF32 in practice; indeed, it appears that the unconservatism was recognised back in the 1960s, though the concept of fat tails was not well understood at that time. For example, USAF AFFDL-TR-66-197 by Abelkis in 1967 assessed a large amount of fatigue test data and found that outside of 2 standard deviations from the mean the data did not fit. If you look at Figures 32 and 33 you see the fat tail effect – orders of magnitude higher probability of failure at 3 standard deviations from the mean and beyond. This was pointed out, yet ironically it appears that the paper is now only used to establish the standard deviation for aluminium structures used by the FAA (s=0.14) – data plugged into a method discredited by its own author!
{6} Now of course this is a dilemma in its own right – structures-based regulations are set around deterministic requirements. You state what requirement the structure must meet, and in the main you don’t consider what probability of failure that incurs (though probabilities are baked into the back end). On the other hand, engineers designing aircraft systems (rather than Systems Engineers) work almost entirely in probabilities. Which is not to say that they cover themselves in statistical glory either, but at least there is some attempt made to quantify risks, as per CS25.1309.
{7} Note it is normal to put ‘worst case’ manufacturing flaws (damage below the detection threshold of the manufacturing process) and worst-case accidental damage into test articles in order to mitigate this effect. The problem is that these tests can only incorporate so much induced damage, and doing so relies on those developing the test knowing with certainty what the critical damage might be. In many cases the most critical manufacturing errors, maintenance errors, and AD/ED were not anticipated (otherwise we wouldn’t be seeing the FAA statistics discussed above), so introducing damage to the test article is not a panacea. Furthermore, it is common for designers to misjudge the most critical usage, the most critical parts to test (as all tests involve compromises on the accuracy and completeness of loads), and the adequacy of the assumptions and the inevitable gaps in the test pyramid.
{8} Now obviously I’ve not split out maintenance contributions to structural failures, but the inference is clear – we didn’t maintain our aircraft the way we intended over that time period, and I would say that, notwithstanding the big improvements in safety since then, we still don’t.
{9} The log-normal distribution looks like a normal distribution of the probability of failure over time, except that the time axis is transformed to a log scale. If the standard deviation of log-life, s, is 0.14 (see Table 1), then 3.09 standard deviations equates to a total probability of failure (the area under the PDF to the left of this point) of 0.001, and this point sits 3.09 x 0.14 = 0.4326 below the mean log-life, i.e. a life 10^0.4326 = 2.71 times shorter than the true mean. The reason this number does not tie up with that in Table 1 is that it assumes our test result is identical to the true mean; a greater factor is needed since we have not sampled the full population.
{10} Note some of the data used to support the assumed standard deviations of log-life are based upon teardown data and thus represent a fat tail distribution (though more often they are based on fatigue test results, which will usually exclude these fat tail effects); however, this data is then fitted to a thin-tailed distribution, so any factors generated will not reflect the fat tails.
{11} I’ve not mentioned Safe-SN yet, which is a UK Def-Stan 00-970 methodology for carrying out fatigue analysis (so not used to factor test lives). That is a subject which I’ll hopefully address another time.
{12} Intriguingly, the ‘tolerable’ bit has only recently entered the requirements – before that it was fine for a Duty Holder to hold an ALARP risk regardless of whether it was tolerable or not. The whole question of what is ALARP and Tolerable is covered in RA1210 Annex B. When contrasting civil and military required safety levels, we must remember that for military aircraft the goal of no fatigue failures can backfire, making the aircraft less suitable for its role and more likely to succumb to enemy action. There is also inherently a greater accepted level of risk for military personnel (one civilian life being equated to three military lives, per RA1210 Annex A), though the possible effect of a military aircraft accident on civilians also needs to be considered.
{13} This is discussed in detail in Taleb’s book The Black Swan. See here for more information, and tips for how to address the issue.
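As a rough illustration of the issue (a hedged sketch with invented parameters, echoing Taleb’s argument rather than any fatigue data set), for a sufficiently fat-tailed distribution the sample standard deviation never settles down, however much data you collect:

```python
# Sketch of why fat tails break moment-based statistics: a Pareto (Lomax)
# distribution with tail exponent alpha <= 2 has infinite variance, so the
# sample standard deviation fails to converge. Parameters are invented.
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.8  # fat-tailed: theoretical variance is infinite for alpha <= 2

for n in (1_000, 100_000, 10_000_000):
    sample = rng.pareto(alpha, size=n)
    print(f"n = {n:>10,}: sample sd = {sample.std():.2f}")
# The sample sd tends to grow erratically with n rather than stabilising,
# unlike a thin-tailed distribution where it would converge quickly.
```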
{14} Attempts to assess overall (rather than just structural) aircraft accident statistics in terms of extreme value distributions, which are a type of fat tail, do exist (e.g. Diamoutene 2019 and Das 2016), but these are currently very difficult to develop, and the researchers have had to ignore significant effects – such as the fact that overall accident rates significantly decreased over the period covered by their samples whilst the exposure to risk (number of flights per year) significantly increased. More pertinently, Abelkis in USAF AFFDL-TR-66-197 attempted to come up with a distribution which better fit the data in the tails back in 1967, but it is not clear whether this method was ever widely accepted, nor whether the distributions could be adequately fitted to the data set (though at least one researcher, Bhonsle 1991, attempted to compare Abelkis’ results with other methods).
{15} I’m ignoring here the extra factors to account for things like unmonitored loading and how many critical parts are on the aircraft. The basis for these factors is even less transparent, as they don’t tend to be published in the open literature.
Banner credits: KatrinaTuliao (CC BY 2.0), Kiril Krastev (CC BY-SA 3.0), & the ATSB (CC BY-SA 3.0 AU)
© Stephen Dosman, 2019. Unauthorized use and/or duplication of this material without express and written permission from the author is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Stephen Dosman and this site with appropriate and specific direction to the original content.