Normal accident theory and learning from major accidents at the National Aeronautics and Space Administration (NASA)

This paper discusses four major NASA accidents in the context of Normal Accident Theory (NAT), high reliability theory and some other aspects of organisational theory.

They then discuss some ‘remedies’ to counter the organisational risk factors.

I’ve skipped large amounts of this paper, so much that maybe even parts of this summary won’t make a lot of sense…who knows.

First they observe that despite some high profile disasters, “the overwhelming majority of [NASA’s] human space missions have been successful”.

The four major accidents were the Apollo launch pad fire, Apollo 13, Challenger and Columbia. NASA’s last major accident occurred back in 2003, and hence the period from 2003 to today has been NASA’s “longest period in its history without a major accident”.

A recent report highlighted the “reality distortion field” that can be imposed by cost considerations, political pressure and scheduling pressures, which have beset NASA as they have other organisations.

In that vein, they say NAT can provide useful insights which may help the agency navigate this reality distortion field.

NAT focuses on system accidents that arise in complex and tightly coupled systems despite considerable engineering effort to prevent them. In these complex and tightly coupled systems, unexpected interactions can emerge, making it extremely difficult for people to understand the emerging signals of failure.

Based on their analyses, they contend that NAT accidents are extremely rare, and only Apollo 13 meets Perrow’s criteria, with the Apollo launch pad fire, Challenger and Columbia not being normal accidents. Based on Perrow’s typology, these other accidents align more with component failure accidents – described, for example, as “failures in the design, manufacture and approval of individual components installed in the system”. [*** Note this is based on Perrow’s typology, and others would count these more as organisational/system accidents rather than component failure accidents.]

Curiously, they argue that “The evidence from the investigations into the four major accidents suggests that managers and engineers of complex and tightly coupled systems should worry less about the unexpected system interactions that lead to normal accidents and worry more about component failures”.

[*** I suppose purely in the context of extremely rare and highly selective ‘normal accidents’, rather than unexpected system interactions, which are found in almost all sources of organisational success and failure? E.g. I think they’re saying not to worry too much about ‘normal accidents’, since they’re really rare, and instead focus on the mechanisms behind other major accident types.]

Normal accident theory and the inevitable accident

Here it’s said that “NAT’s unit of analysis is the system, which comprises the technology, and the human operators who directly interface with it”. It’s further argued that NAT focuses on “the cognitive impediments faced by human operators when making such inferences. It does not dwell on the effects of management practices and organizational structure”.

In support of this argument, it’s stated that “Perrow received considerable criticism for what many viewed as technological determinism”. In NAT, systems have two defining characteristics – their complexity and coupling.

Linear systems are characterised by sequential interactions between parts, units and subsystems. In complex systems, parts, units etc. may be in close proximity but may not be in a direct production sequence. Hence, “operators … must rely on indirect or inferential information about system status”.

Moreover, the amount of information generated within ‘malfunctioning’ complex and tightly coupled systems can “quickly become overwhelming”. As a result, people can find it difficult to sense emerging signals of failure. Complexity also “means that damaged components cannot be removed without considering overall system effect”, which can trigger unexpected outcomes.

Coupling ranges from loose to tight. A loosely coupled system is one where the constituent parts and their relationships are buffered from each other and aren’t critically linked with immediate sequences. A tightly coupled system has time-dependent and inflexible processes.


It’s said that any buffers or redundancies in tightly coupled systems “must be deliberately included by design”. The likelihood of a normal accident increases as the interactive complexity and coupling tightness increases, because understanding all possible interactions is nigh impossible.
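
[*** Purely as an illustration, and not from the paper: below is a minimal Python sketch of Perrow’s complexity/coupling framing described above. The 0–1 scales, the 0.5 thresholds and all names here are my own assumptions for the example.]

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    """Rough placement of a system on Perrow's two dimensions (0.0 to 1.0 scales)."""
    interactive_complexity: float  # 0 = linear, 1 = highly interactively complex
    coupling: float                # 0 = loosely coupled, 1 = tightly coupled


def perrow_quadrant(profile: SystemProfile) -> str:
    """Classify a system into a quadrant of Perrow's complexity/coupling matrix.

    NAT flags the 'complex + tightly coupled' quadrant as the one where
    normal (system) accidents, though rare, are to be expected.
    """
    is_complex = profile.interactive_complexity >= 0.5  # assumed threshold
    is_tight = profile.coupling >= 0.5                   # assumed threshold
    if is_complex and is_tight:
        return "complex + tightly coupled (normal-accident prone)"
    if is_complex:
        return "complex + loosely coupled"
    if is_tight:
        return "linear + tightly coupled"
    return "linear + loosely coupled"


# Example: a crewed spacecraft would sit firmly in the high-complexity,
# tight-coupling quadrant.
print(perrow_quadrant(SystemProfile(interactive_complexity=0.9, coupling=0.9)))
```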

They discuss the tension between centralisation and decentralisation in managing complex and tightly coupled systems. On the one hand, “engineering logic dictates that a high degree of centralization is necessary to reduce the risk”, but high centralisation also means “that an even smaller number of operators must now make sense of the system”.

In any case, the author points out that normal accidents per Perrow’s definition are rare, and instead most accidents “in complex, tightly coupled systems are not attributable to multiple unexpected failures that are misunderstood by operators. They are, instead, attributable to organizational failures associated with design, equipment, procedures, operators, supplies and materials”, etc.

The author then discusses the relationship between power and system safety, as covered by Perrow. All organisations and interconnected ‘systems’, regardless of their complexity and coupling, “are administered in organizations where power is routinely exercised”. Power relations can be an additional risk factor for major collapses, and can have the effect of increasing production pressure.

Production pressures may not be a direct factor in an accident’s genesis, but “may act as a catalyst that accelerates its operation – with a consequent increase in risk”. Perrow warned of the effects of production pressures and other political pressures, particularly in the context of “increasingly privatized and deregulated systems”.

Analysis of the accidents

Next the author provides the rationale for whether the four listed accidents were normal accidents. I’ve skipped all of this section pretty much—a lot of the paper. Hopefully the resulting arguments still make sense without these sections.

In any case, Apollo 13 is said to be the only accident of this cohort that meets the criteria for being a normal accident. According to Perrow, it was ‘normal’ because the accident wasn’t rooted in the way that NASA organised the Apollo program, but “in the complex and tightly coupled Apollo technology”. [** Which relates to the criticism by some that NAT was marked by technological determinism.]

Here, the review board supported Perrow’s contention that the NASA management structure for Apollo was effective, and that the organisation was “as well positioned to avoid a similar accident as could be expected”. NASA also had management controls around design review approvals, manufacturing processes, test procedures, hardware acceptance and more.

Hence, by this line of argumentation “Apollo 13 may, in fact, be a case where decision-making at NASA – at least at that time in its history – was effectively de-centralized during the accident recovery operation”.

In contrast, Challenger had issues relating to decision processes, production pressures, evolution of adaptive norms and practices and more. The structure of NASA and the program also “impeded the flow of relevant risk information to decision-makers about the component and its associated risks”. The other three accidents are argued to not be normal accidents.

Can NASA become a high reliability organization?

Next the author addresses whether NASA can become an HRO. I’ve skipped a lot in this section.

They discuss the five common HRO principles, which ultimately are said to sustain “an organizational culture focused on safety and imbued with the appropriate symbols and values”. These values then shape how people sense-make and respond in crises. Two principles of crisis management, based on the HRO values, are said to be:

1) Let technical experts figure more prominently in decision-making.

2) Internalized organizational values should prevail over hierarchy.

They point to Weick’s description of the Mann Gulch forest fire, according to which “the meaning derived from structured organizational roles can be lost when its members are surprised or confounded by low probability events”. It’s said that resilience in these crises isn’t derived from hierarchy, but from trust (built through social ties between members), cultural values, and respect.


Further, during crises, “good leaders are not autocratic, rather they display humility and a willingness to learn from the experts”. Leaders, then, should project competence, but also acknowledge their fallibility to their teams by listening more and talking less.

The ability to listen effectively can be hindered by organisational production and scheduling pressures. Following the CAIB report, there was optimism that NASA could transform into a high reliability organisation, which might help counter some of these pressures. However, one researcher cynically commented that “HRT may not be the solution”, suspecting that the root of the problem may “lurk in NASA’s relations with both Congress and the space agency’s own extensive contractual community”.

Instead, some argue that NASA shouldn’t seek to become an HRO or adopt these principles verbatim, but should instead aim “at becoming an HRT variant they described as a reliability seeking organization (RSO)”.

Moreover, existing HROs seem to develop in organisations that are closely regulated and shielded from full exposure to market and other pressures or competition – shielding that NASA apparently has never had (e.g. from Congress or the public).

RSOs have three dimensions:

1) Safety is a core value, but it is assessed against a set of alternatives

2) A continuous search for errors or major failure potential – this can include having practices that detect the inhibition of upward communication (particularly relating to major failures), and having “veto points” distributed through the organisation. Veto points allow operations to call a stop

3) The preservation of institutional integrity

They also say this reliability seeking approach could incorporate recommendations from Diane Vaughan’s work, relating to how “managers and engineers adapted to production pressures and tight flight schedules not by breaking safety rules, but rather by interpreting and adapting them to the circumstances”.

In the case of Challenger, a culture of production drove business logics, which “necessitated a quick review of evidence about performance deviations and failures and timely decisions about their significance”. Hence, the issues that resulted were not born of malice, but of “a well-intentioned, gradual and subtle cultural adaptation to a demanding production culture”.

Structural secrecy was another factor, where the organisational structure of NASA “systematically undermined senior managers’ efforts to learn the details of a safety issue”. Again, this wasn’t a deliberate decision or malice, but “the result of benign organizational practices instituted to reduce information overload and distraction in materials sent to decision-makers”.

Hence, these practices “succeeded” in minimising information overload, but also kept important details and signals out of sight, or left them “obscured in small print during presentations to senior managers”.

Next they talk about the use of the Technical Authority in NASA. I’ve skipped a lot of this, but notably the paper argues that even though it was supposed to have broad powers and provide dissenting technical opinions, “it cannot override a decision made by the Program or Project Manager”.

Conclusion

In sum:

· While NAT revolutionised thinking about major risks, it needs more empirical validation and conceptual rigour

· While only one of the four major NASA accidents is considered a normal accident, the concept still provides useful insights

· They propose that a modified version of HRT/HRO, which incorporates suggestions from others like Vaughan, may be useful for organisations like NASA

· Nevertheless, they aptly note that “NASA, of course, can never escape its political and social environment”; and few concepts can change that

· Moreover, “while the changes suggested in this article may help prevent component failure accidents, none of these changes would prevent the occurrence of a normal accident”

· This is because once a system is interactively complex and tightly coupled, and has major accident potential, those accidents, “though rare, are to be expected”

· They say that AI and machine learning may help to mitigate some of these risks, but “These enhancements, however, may also serve to render the system even more opaque for the human operators who must ultimately decide whether or not to intervene”

· Further, it’s argued that NAT isn’t really about preventing major accidents, but more about addressing the consequences of the choices made to use particular high-risk technologies

Link in comments.

Ref: Tasca, L. (2024). Normal accident theory and learning from major accidents at the National Aeronautics and Space Administration (NASA). Space Policy, 101653.

Dr Kevin J. Foster PhD CPEng NER FIEAust

Regional VP ASIS Int (Australia), Standards Au Committee MB-025 (Security & Resilience), ISO 22366 (Energy Resilience) & 22372 (Infrastructure Resilience); Past-Chair Risk Eng Soc (WA); Chair (WA) Inst Strat Risk Mgt

5 days ago

Perhaps HRO and NAT describe two different states of a dynamic decision system? HRO for when routine risk averse operations are important and NAT when risk taking is important for an opportunistic payoff.

Dr Richard Agnew

Managing Director Business Assurance Australia

5 days ago

Interesting... my doctorate was also on HROs – Todd La Porte, Donaldson, Perrow, et al... so, so much has been written on this, but still organisational managers keep repeating the same old things.

Mike Lutomski

Spaceflight | STEM/STEAM | Sustainability Advocate/Experimenter | Speaker | Mentor | Professional Boat Captain

5 days ago

Some very familiar ideas, topics, and approaches. System interactions versus components. NASA tries to capture both of these things. But as a lot of us have been saying, a lot of it lies in the culture and human behavior. There is no magic process or singular idea that keeps us from having accidents. We use many processes. But the culture of ownership, accountability, and responsibility might be the most important.

Richard Lisewski

Experienced Manager working in Chemicals, Pharmaceutical and Biotechnology industries. Specialising in implementing change. Chartered Chemist. C.Chem, MRSC.

5 days ago

Ben, it’s worth taking a look at C Perrow’s book, Normal Accidents, Living With High Risk Technologies.

Dr. Megan Tranter

Helping Purpose-Driven Leaders Elevate their Impact | Former Netflix, Amazon, PepsiCo Exec | Leadership & Career Strategist | Follow for Daily Insights on Purpose, Resilience & Impact.

6 days ago

Thanks for sharing Ben Hutchinson
