Goldmine of near misses: learn from the big mistakes you almost make. Heinrich's 1:29:300 rule
A near miss not reported is the next major incident

Context:

Near-miss reporting is a common practice across manufacturing, asset-intensive, healthcare, and airline industries. It provides valuable insights that help reduce the occurrence of major injuries and fatalities.

Recently, a US airline flight mistakenly took off from a runway that had been closed for construction. In another incident, at Mumbai International Airport, an airline flight landed on a runway as another flight was taking off from it. An employee in a workspace nearly slips on a recently mopped floor that was not marked with a wet floor sign. A worker in a factory trips over a small pile of boxes left in a walkway, but manages to hold on to a nearby table.

All these are near-miss incidents. Each is a potential hazard in which no individual was injured and no property was damaged. But with a slight shift in timing or position, damage or injury could have occurred.

Why am I bringing up this topic today? We recently witnessed a massive outage caused by a software update, which threw global businesses, from airlines and banks to healthcare, into chaos. In 2018, millions of customers of a UK bank were locked out of their accounts after a software upgrade led to a massive banking outage. In another case, a software update for a ‘smart’ thermostat went wrong, draining the device’s batteries and causing the temperature to drop.

There have been similar incidents with lower business impact in the past, where software patches or upgrades gone wrong caused disruptions. The root causes of these incidents vary, pointing to gaps in quality assurance and in software update testing processes. The open question is: could reporting near-misses, deriving learnings from them, and cascading those learnings to the teams reduce major incidents and system outages in Managed IT Services? Will Gen AI help extract the learnings from near-misses? This is an area worth delving into.

Caveat: There is a ton of content on the root cause of the recent EDR outage. The intent here is not to go over the same ground, but to focus on the value of near-miss reporting and the culture around it.


What is near-miss reporting? What is the significance of the 1:29:300 rule?

Near-miss reporting can deliver visibility into the elements that contribute to an incident before the incident occurs. The focus is on prevention, rather than purely on fixes.

Origin of the 1:29:300 rule: Herbert Heinrich was an American industrial safety pioneer and occupational health and safety (OHS) researcher in the 1930s. He proposed the 1:29:300 hypothesis. It states that in a workplace, for every accident that causes a major injury, there are 29 accidents that cause minor injuries and 300 accidents that cause no injuries (near-misses). In essence, if organisations focus on mitigating near-misses and minor injuries, they can effectively reduce the occurrence of major injuries and fatalities.

The hypothesis identified the causal factors of industrial accidents as a combination of “unsafe acts of people” and “unsafe mechanical or physical conditions”.
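To make the 1:29:300 ratio concrete, here is a small arithmetic sketch. It only restates the hypothesis as proportions; the function names are my own, and scaling the ratio to an observed near-miss count is purely illustrative, not a prediction.

```python
from fractions import Fraction

# Heinrich's hypothesised ratio:
# 1 major injury : 29 minor injuries : 300 no-injury accidents (near-misses).
HEINRICH_RATIO = {"major": 1, "minor": 29, "near_miss": 300}


def heinrich_shares(ratio=HEINRICH_RATIO):
    """Return each category's share of all accidents implied by the ratio."""
    total = sum(ratio.values())  # 1 + 29 + 300 = 330
    return {name: Fraction(count, total) for name, count in ratio.items()}


def implied_counts(near_misses_observed, ratio=HEINRICH_RATIO):
    """Scale the ratio to an observed near-miss count (illustrative only)."""
    scale = Fraction(near_misses_observed, ratio["near_miss"])
    return {name: float(count * scale) for name, count in ratio.items()}


if __name__ == "__main__":
    for name, share in heinrich_shares().items():
        print(f"{name}: {float(share):.1%}")
    # prints major: 0.3%, minor: 8.8%, near_miss: 90.9%
    print(implied_counts(600))
```

Under the hypothesis, roughly nine out of ten accidents are near-misses, which is why they are the largest pool of learning material.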

Can we extend this hypothesis to Managed IT Services, with the intent to track, report, and learn from near-miss incidents? It starts with defining how to identify a near-miss.


What types of indicators gauge Managed IT Services delivery?

Organisations across industries tend to have a robust Managed IT Services construct. These services are based on the comprehensive ITIL framework, adopting proven practices, standardising processes, and steering effective IT service management.

The delivery of Managed IT Services is measured by key performance indicators (KPIs) such as First Contact Resolution (FCR), Average Handling Time (AHT), and System Uptime. These KPIs tend to showcase the outcomes achieved. In addition, metrics such as defect rates, CSAT, and incident reduction are essential to track continuous improvement.
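As a rough illustration of how such KPIs are computed, the sketch below uses common textbook definitions of FCR, AHT, and uptime. The Ticket structure and its field names are hypothetical, and real service desks may define these measures differently.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """Hypothetical service-desk ticket; field names are illustrative."""
    contacts_to_resolve: int   # number of contacts needed before resolution
    handling_minutes: float    # total agent handling time for the ticket


def first_contact_resolution(tickets):
    """Share of tickets resolved on the first contact (non-empty list assumed)."""
    resolved_first = sum(1 for t in tickets if t.contacts_to_resolve == 1)
    return resolved_first / len(tickets)


def average_handling_time(tickets):
    """Mean handling time per ticket, in minutes (non-empty list assumed)."""
    return sum(t.handling_minutes for t in tickets) / len(tickets)


def system_uptime(total_minutes, downtime_minutes):
    """Fraction of the period during which the system was available."""
    return (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    tickets = [Ticket(1, 10), Ticket(2, 30), Ticket(1, 20), Ticket(1, 20)]
    print(first_contact_resolution(tickets))   # 0.75
    print(average_handling_time(tickets))      # 20.0
    print(system_uptime(43200, 432))           # 30-day month, 432 min down
```

Note that all three numbers describe outcomes that have already happened, which is the point made next about performance versus risk indicators.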

Attention is focused on performance indicators because they provide an immediate, tangible view of the outcomes achieved. In addition, a few industries use ‘Risk’ indicators. These signal the occurrence of a specific event, with the aim of preventing its potential consequences. In essence, Risk indicators are akin to detecting a spark which, if caught early and remedied, helps prevent a serious fire.


What is the fundamental difference between ‘Performance’ and ‘Risk’ indicators?

Performance indicators are easier to identify and more tangible than risk indicators. They are defined upfront as a desired result. A near-miss, on the other hand, is the consequence of an unexpected gap, and the timing of its occurrence is unknown.

A near-miss is an unplanned event that could have led to unintended consequences but did not. Identifying risk indicators like near-misses tends to be difficult, as they are not part of the original idea of what is to be achieved. The elements that determine a near-miss are discovered only in the operational phase.

Performance indicators are akin to ‘knowledge’: they are important and easy to achieve. Risk indicators like near-misses are akin to ‘wisdom’: they take time to build. Without tracking near-misses, the IT team could have blind spots, and their actions could lead to unintended consequences.


How can near-misses provide insights, including for Managed IT Services?

Near-misses are a valuable source of information and ideal candidates for Risk indicators. They help identify gaps and weaknesses in an organisation’s risk assessment and management programme, so that these can be corrected to prevent future incidents.

In the Managed IT Services context, a near-miss is an opportunity to improve system resilience and reduce downtime for conditions with potentially serious consequences. A few examples of near-misses for Managed IT Services are:

  • A support team member chooses the target environment for a software patch from a drop-down list. They could inadvertently select Production instead of UAT, and cause disruption by deploying untested software.
  • For Unix and Linux admin roles, using the rm * command could have serious consequences if a single character is misplaced, for example when a stray space turns rm *.log into rm * .log.
  • BCP-DR tests exercise failover between primary and secondary environments. At times this testing skips restoring the backed-up content and testing the applications against it. This step is essential to meet the RPO.
  • Not keeping software and libraries in sync across the development, test, and production environments could result in inadequate testing.
  • Near-misses could include a violation of policies, guidelines, or regulations, or gaps in certain guidelines when they are applied in a new context.

Near-miss reporting is often overlooked, yet it is as important as incident reporting. Documenting these events helps identify potential hazards before they result in real harm. It involves recording the nature of the near-miss, the conditions at the time, and any immediate corrective actions taken. This fosters a learning culture that proactively addresses safety risks. Keeping a log of regular checks, and of the safety issues identified during them, supports both immediate risk mitigation and long-term safety planning.
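The three items to record (nature, conditions, corrective actions) map naturally onto a small record structure. What follows is only a sketch: the class names, field names, and the in-memory log are assumptions for illustration, not a reference to any particular ITSM tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class NearMissReport:
    """One near-miss entry; fields mirror the items listed in the text."""
    nature: str                     # what almost went wrong
    conditions: str                 # circumstances at the time
    corrective_actions: list = field(default_factory=list)
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class NearMissLog:
    """Minimal in-memory log (a real system would persist the reports)."""

    def __init__(self):
        self._reports = []

    def record(self, nature, conditions, corrective_actions=()):
        report = NearMissReport(nature, conditions, list(corrective_actions))
        self._reports.append(report)
        return report

    def open_items(self):
        """Reports with no corrective action recorded yet."""
        return [r for r in self._reports if not r.corrective_actions]


if __name__ == "__main__":
    log = NearMissLog()
    log.record(
        nature="Production selected instead of UAT in the deploy drop-down",
        conditions="Late-evening patch window, single operator",
        corrective_actions=["Deployment cancelled before execution"],
    )
    log.record("DR test skipped restore of backed-up content", "Quarterly BCP-DR drill")
    print(len(log.open_items()))  # 1
```

Keeping the record this small is deliberate: the less effort it takes to file a report, the more likely people are to file one.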


Do employees have the psychological safety to report and reflect on near-misses? How can near-miss reporting be standardised?

An encounter with disaster can inspire significant innovations. How can businesses learn from their near-misses before these convert into major outages, and without incurring the associated costs?

Do employees have the psychological safety to report these near-misses, or do they fear coming under scrutiny? It is crucial that near-misses are framed as key learning opportunities. This encourages psychological safety among employees, so that businesses can encourage discussion of near-misses, and elicit and cascade the learnings. This can help avoid costly errors in the future.

If near-misses are tagged as failures, employees will not report them for fear of being admonished, and no one will hear about them. Businesses need to frame near-misses as examples of vigilance and as learning opportunities, and encourage people to speak up.

Currently, few organisations apply structured near-miss management systems (NMS) covering collection, analysis, and dissemination of knowledge to all stakeholders. There is an opportunity to standardise NMS for the benefit of industries.
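The analysis and dissemination stages of such a system can start very simply. The sketch below assumes each collected report already carries a category label (the "category" key and the sample categories are hypothetical). It counts reports per category and renders a plain-text digest that could be circulated to teams.

```python
from collections import Counter


def summarise_near_misses(reports, top_n=3):
    """Analysis step: count reports per category, most frequent first."""
    counts = Counter(report["category"] for report in reports)
    return counts.most_common(top_n)


def format_digest(summary):
    """Dissemination step: render the summary as plain text for teams."""
    lines = ["Near-miss digest"]
    for category, count in summary:
        lines.append(f"- {category}: {count} report(s)")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical collected reports.
    reports = [
        {"category": "wrong target environment"},
        {"category": "environment drift"},
        {"category": "wrong target environment"},
    ]
    print(format_digest(summarise_near_misses(reports)))
```

A recurring category in the digest is a signal that the fix belongs in the process or the tooling, not with the individual who reported it.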


Conclusion and key take-aways:

The concept of a near-miss, in the context of worker safety and avoiding equipment damage, is spreading from pioneering sectors like aviation and chemicals to construction, manufacturing, and hospitality. Heinrich’s 1:29:300 hypothesis states that in a workplace, for every accident that causes a major injury, there are 29 accidents that cause minor injuries and 300 accidents that cause no injuries (near-misses).

It is worth extending this hypothesis to Managed IT Services, with the intent to track, report, and learn from near-miss incidents. Near-misses need to be reported and investigated, and refinements identified, to strengthen the ‘protection’ and risk management system.

Currently, few organisations apply structured near-miss management systems (NMS). There is an opportunity to standardise NMS for the benefit of industries.

Near-misses are a ‘goldmine’ of avoided catastrophes. They give employees the wisdom to avoid blind spots, and enable learning from the Big Mistake You Almost Make.

References: Heinrich's Theory of accident Causation by Abd El-Rahman Abd El-Hafez (LinkedIn). Artwork by Anita D'Souza.


Supal Desai

Innovative IT Leader | Driving Digital Transformation, Cloud Solutions & Cybersecurity | Passionate About Enhancing Operational Efficiency & Business Growth

7 months ago

Insightful! The increase in reported near-misses can be a double-edged sword. While some may view this as a reflection of the IT team’s faltering competency, I believe it underscores a culture of transparency and continuous learning. I also advocate for acceptance and a robust safety net. This empowers our responders to openly report near-misses, transforming potential issues into valuable lessons to improve service and avoid major incidents.

Satyaki Mookerjee

Chief Digital Officer Jio-bp II Ex Accenture II Digital Transformation

7 months ago

Reporting “near misses” as an institutionalised practice takes a significant amount of cultural transformation. Many near misses occur at an operational level where junior members of the team are operating (not always, but in many cases). Do people across levels feel comfortable sharing these incidents without feeling “judged” or having them held against them? Do people at a senior level drive the culture of sharing their own stories of “near misses”? Does the company have a culture of NOT blaming the individual in the first instance, but looking objectively at processes, methods and automated interlocks for preventing near misses? Without the psychological safety net, people would not be open to sharing these “near misses”. Once the near misses are reported in the system, Gen AI can then do the work of bringing out key learnings.

More articles by Prashant Dhume