The Aggregation of Marginal Failures: How Small Problems Can Lead To Disaster
David Denyer
Professor at Cranfield School of Management | Strategic Adviser | Executive Development
The aggregation of marginal gains is a performance improvement concept made famous by British cycling coach Dave Brailsford. In 1996 Team GB was ranked 17th in the world and won just two bronze medals at the Olympics in Atlanta. By 2012, Britain was the number one team in the world, and British riders won 12 medals (8 gold) at the London Olympic Games. Brailsford took the winning formula to Team Sky, recently renamed Team Ineos, winning seven of the last eight editions of the Tour de France. The approach is not unique to Brailsford. Rugby World Cup-winning coach Sir Clive Woodward emphasized the importance of ‘critical non-essentials’ in the run-up to England’s 2003 success.
So, what does this mean? To make a 100% improvement in performance,
Do 100 Things 1% Better
The aggregation of marginal gains means focusing on a 1% margin for improvement in everything you do.
“The whole principle came from the idea that if you broke down everything you could think of that goes into riding a bike, and then improved it by 1%, you will get a significant increase when you put them all together.”
(Brailsford speaking to the BBC)
That is, focus on improving every aspect of performance, and these small improvements will add up. As Brailsford says:
“Forget about perfection; focus on progression and compound the improvement.”
Examples of seemingly insignificant things they focused on included:
· Hiring surgeons to teach athletes about proper hand-washing to avoid illnesses during competition
· Choosing not to shake any hands during the Olympics for the same reason
· Implementing precise food preparation procedures
· Bringing their own mattresses and pillows so athletes could sleep in the same posture every night
· Painting the floor of their bike trailer white so they could more easily identify dust and remove it
James Clear’s article on marginal gains makes two crucial points:
1. A 1% gain isn’t notable, and sometimes it isn’t even noticeable
2. Marginal failures occur in the same way
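To see why changes that are barely noticeable on their own matter once they accumulate, here is a minimal Python sketch. It assumes, purely for illustration and not as Brailsford’s or Clear’s own calculation, that each 1% change multiplies performance by 1.01 (or 0.99 for a failure) rather than adding to it; the function name compound is hypothetical.

```python
def compound(rate: float, steps: int, start: float = 1.0) -> float:
    """Apply the same small relative change `steps` times in a row.

    Illustrative assumption: each change multiplies performance by (1 + rate).
    """
    return start * (1 + rate) ** steps


baseline = 1.0
single = compound(0.01, 1, baseline)       # one 1% gain on its own
gains = compound(0.01, 100, baseline)      # 100 things, each 1% better
failures = compound(-0.01, 100, baseline)  # 100 things, each 1% worse

print(f"One 1% gain:       {single:.2f}x baseline")
print(f"100 x 1% gains:    {gains:.2f}x baseline")
print(f"100 x 1% failures: {failures:.2f}x baseline")
```

Under this multiplicative assumption, a single 1% change is almost invisible, yet 100 gains of 1% leave performance at roughly 2.7 times the baseline, while 100 failures of 1% leave only about 37% of it.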
The Aggregation of Marginal Failures
I suggest that the aggregation of marginal failures causes many of the challenges that we see in complex systems (organizations, infrastructure, critical services, etc.).
Incidents in complex socio-technical systems do not necessarily have notable and noticeable causes. The causality of events involves the combination and interaction of numerous marginal failures over time. Individually, each failure is necessary but insufficient to cause the incident.
Consider Snook’s (2000) systems analysis of one of the worst air-to-air friendly fire accidents involving U.S. aircraft in military history. The accidental shootdown of two Black Hawk helicopters carrying 26 people occurred during ‘Operation Provide Comfort’, a campaign involving the coordinated action of the US Army and Air Force. On that fateful day, collective effort broke down between the Black Hawk helicopters, two F-15 fighters, and an AWACS airplane.
Although the shootdown unraveled in a matter of minutes, Snook’s ‘causal map’ extends over many years, starting with the macro-level effects of the fall of the Soviet Union and the resulting shrinking defense budget. Snook also draws attention to the unfolding of a multinational humanitarian effort to relieve the suffering of hundreds of thousands of Kurdish refugees, which created an emerging and competing logic of ‘operations’ rather than ‘war.’ At an organizational level, Snook reports downsizing and undermanning, a long history of inter-service rivalry between the Army and Air Force, a lack of training opportunities, and local practices that had taken over from written procedure.
These situational factors led to the collapse of collective action over time, resulting in communication failures, diffused responsibilities, inattention, anxiety, and misinterpretation of vital information. Snook calls this slow, steady uncoupling of local practice from written procedure ‘practical drift’. He observes it at nested levels as well as across the whole system, and uses it to explain the event.
You really must read Snook’s book if you are interested in risk, resilience, and safety (no, I am not on commission! I think that it is brilliant).
See the diagram below for a list of failures identified in the incident:
You can download a PDF here: https://bit.ly/2KQ4esr
What Was The Root Cause?
Snook’s causal map emphasizes systemic factors at different levels of analysis, combining and interacting over time, setting the conditions for the event to occur. It was the aggregation of marginal failures, and the fact that they happened together, which provided the trigger for system failure.
Snook concluded:
“Nothing broke, and nobody was to blame, yet everything broke, and everyone was to blame.”
Typically, the work of both Barry Turner and James Reason is used to explain incidents like this friendly fire accident. According to Turner, events proceed through the phases of warning signs, incubation period, precipitating event, rescue and salvage, and ‘cultural readjustment.’ Turner revealed that events could have long ‘incubation periods’, in which a chain of discrepant events, or several such chains, develops and accumulates unnoticed. James Reason’s Swiss Cheese model shows how a series of barriers, represented as slices of cheese, can erode over time and, if their holes line up, result in system failure. I also like John Stuart Mill’s notion of ‘chemical causation’: a modest variation in the mix of ingredients can make a big difference. Just think how changing one ingredient can significantly alter the taste of an otherwise appetizing meal.
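One way to make the Swiss Cheese intuition concrete is a simple probability sketch in Python. It assumes, for illustration only, that each barrier fails independently with a fixed probability, so an accident requires every hole to line up at once. The numbers and the function name breach_probability are hypothetical, not taken from Reason or Snook.

```python
from math import prod
from typing import Sequence


def breach_probability(hole_probs: Sequence[float]) -> float:
    """Chance that every barrier fails at the same time.

    Hypothetical simplification: barriers fail independently, so the joint
    probability is the product of the individual failure probabilities.
    """
    return prod(hole_probs)


# Four barriers, each with an illustrative 1% chance of a "hole".
healthy = [0.01, 0.01, 0.01, 0.01]
# The same four barriers after each has eroded marginally, to a 2% chance.
eroded = [0.02, 0.02, 0.02, 0.02]

p_healthy = breach_probability(healthy)
p_eroded = breach_probability(eroded)

print(f"Healthy barriers: {p_healthy:.1e}")
print(f"Eroded barriers:  {p_eroded:.1e}")
print(f"Risk multiplied by about {p_eroded / p_healthy:.0f}x")
```

In this sketch, each barrier degrades by only one percentage point, yet the chance of all four failing together rises sixteen-fold (2 to the power of 4). Real barriers are not independent, and Snook’s point is precisely that failures interact and drift over time, so this is a simplification rather than a risk model.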
What Does This Mean?
Snook’s analysis reveals several things that parallel the marginal gains concept:
1. Serious events such as this are often complex, multi-level and extended in time
2. They are usually caused by a combination of multiple, small failures, each individually insufficient to cause an incident
3. These failures are slow-moving, often over months or years
4. 1% failures aren’t notable, and sometimes they aren’t even noticeable
5. There is a compound effect of not getting back on track quickly
What Does This Mean for Organizations?
For complex organizations, it means that the system’s characteristics make it inherently vulnerable to such incidents.
It means that you are preparing for threats that do not yet exist.
You will need to deal with latent underlying issues you don’t even know are problems yet, using systems, processes, and technologies that are degrading.
What does all of this mean?
We need to organize in ways that increase the quality of mindfulness across the organization, thereby enhancing people’s anticipation, awareness, and attention, so that they can detect and attend to subtle marginal failures and adapt and adjust to prevent these problems from compounding.
How does your organization anticipate, prepare for, respond to, and adapt to the aggregation of marginal failures?
Please add your comments below.
About David Denyer
David Denyer is a highly cited author, engaging keynote speaker and an inspiring educator. He is a Professor of Leadership and Organizational Change, as well as a Commercial Director, at Cranfield School of Management. He runs the Organizational Resilience and Change Leadership Group. David is a trusted advisor to the leaders of some of the world's most renowned companies and government organisations. He helps them to understand issues, identifies their specific needs and then works with them to produce solutions that bring immediate improvement to their business. David also runs the Leading Organisational Resilience Programme at Cranfield, which is consistently rated as one of the world's top providers of executive development.