Assessment of Single Points of Failure (SPOF) in Business and Industry Assets
Mosaed Al Garni
Consulting | Training | Reliability | Leadership | Asset Integrity | Organization Effectiveness | Risk Analysis | RCA | Business Health Analysis
What Does SPOF Mean?
Experts define SPOF in broadly similar ways across platforms and applications. Recognized definitions include the following:
· A component in a device, or a point in a network, that, if it were to fail, would cause the entire device or network to fail; normally eliminated by adding redundancy. (Wiktionary).
· A point or part of a system where there is no backup in case of failure. The whole system will become disabled. (The Law Dictionary).
· Any computer or communications system that contains only one component to do a job creates a single point of failure. If that single component fails, there is no alternate one to take its place. (The Free Dictionary).
· A potential risk posed by a flaw in the design, implementation or configuration of a circuit or system in which one fault or malfunction causes an entire system to stop operating. (TechTarget).
· A single point of failure is essentially any event or business component outage that has the potential to shut down the revenue generating part of the business. (David Taylor, Thinking Business blog)
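For the network flavor of these definitions, a rough sketch of an automated check is shown below. It assumes the system can be drawn as a directed graph and that "failure" means users can no longer reach a target service; the node names and the helper functions are hypothetical and serve only as an illustration.

```python
from collections import deque

def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, skipping the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

def single_points_of_failure(graph, start, goal):
    """Return intermediate nodes whose loss alone cuts start off from goal."""
    candidates = set(graph) - {start, goal}
    return sorted(n for n in candidates
                  if not reachable(graph, start, goal, removed=n))

# Hypothetical network: two switches in parallel, then one server.
network = {
    "users": ["switch_a", "switch_b"],
    "switch_a": ["server"],
    "switch_b": ["server"],
    "server": ["database"],
    "database": [],
}
spofs = single_points_of_failure(network, "users", "database")
print(spofs)
```

In this made-up layout either switch can fail without cutting users off, while the lone server cannot, so only the server is reported.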
How Critical Is a SPOF?
As the definitions above show, SPOF assets are extremely critical to any business: a single asset can drive the whole business into complete paralysis, severe physical damage, or unrecoverable financial loss.
The sad fact is that the majority of industry and business leaders do not invest enough time and resources to properly identify SPOF assets within their domains. Worse, once they identify them, they develop only limited mitigation for them.
Examples of SPOFs
Examples of SPOFs are too many to count. The following examples, however, could be among the most relevant to people in industry and emergency services:
· A single server in a network hosting a system that operates the entire business. When this single server malfunctions, the whole business comes to a grinding halt.
· A pump for the cooling medium of a critical system, such as a reactor cooling loop in a nuclear power plant. If such a pump fails and no functional standby pump is available, an extremely dangerous runaway condition could occur and cause a devastating disaster.
· A single ambulance in a hospital providing emergency services. Imagine the situation if such a hospital had to provide emergency services during a major natural disaster or on a battlefield during war.
· A subject matter expert (SME) overseeing a critical function without a standby SME. If this SME falls sick or resigns suddenly, that function would struggle severely or even stop altogether.
· A single-source supplier for a key spare part needed for a critical system. If this supplier goes bankrupt or declines to provide the needed spares for some reason, the relevant system may come to a complete halt, with a possibly devastating impact on the business.
· A single power supply for a critical industrial complex. If there is no backup power supply, at least for emergency control of operations, that complex will stop functioning. Worse, an outage may lead to uncontrollable operational disturbances such as overheating or overstressing.
Some people may argue that these examples are not realistic, as no such critical system would ever be left dependent on a single component or link.
However, experience shows that, although we almost always install or assign redundancies (backups or standbys), we tend to neglect the regular checks that ensure those redundancies will operate seamlessly when needed.
Throughout my long career and my investigations, I have witnessed this kind of shortfall time and again. Hence, although in theory, or by design, we have a redundancy for a function, in reality we have a single component or link: a SPOF!
How Do We Identify SPOFs?
Systematically identifying SPOFs requires following a defined sequence that leaves no gaps.
Once we assign a capable facilitator supported by a strong team, the sequence typically goes as follows:
Listing all our assets
Assets in the present context are not only physical parts and systems. Rather, we should account for four main categories of assets:
1. Physical parts and components (basically, equipment and machines).
2. Enablers (basically, software and tools).
3. People (essentially, key subject matter experts, operators, and technicians).
4. Logistics (especially in regard to critical spare parts and key support services).
Conducting Asset Criticality Analysis (ACA)
For each category, we perform a comprehensive ACA based on well-established risk criteria (typically represented by a formal Risk Matrix). This analysis, performed by a highly experienced and well-informed team, should categorize assets into 3 to 5 classes according to their criticality to the business.
Typically, we end up with a criticality classification that looks like the following:
Class 1 (or A): If the ACA is performed properly, this class should normally be within the range of 3-5% of all assets in the category.
Class 2 (or B): This class should be within the range of 5-15% of all assets in the category.
Class 3 (or C): This class should be within the range of 15-30% of all assets in the category.
Class 4 (or D): This class should cover the remaining assets in the category.
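As a minimal sketch of how such a classification could be sanity-checked, the snippet below scores each asset as likelihood times impact on an assumed 5x5 risk matrix, assigns a class using assumed cut-off scores, and then compares the share of assets in Classes 1 to 3 with the ranges quoted above. The cut-offs (20, 12, 6), the function names, and the sample ratings are all hypothetical; a real ACA would take them from the organization's own Risk Matrix.

```python
from collections import Counter

# Expected share of assets per class, taken from the ranges above.
# Class 4 holds the remainder, so it has no band of its own.
EXPECTED_SHARE = {1: (0.03, 0.05), 2: (0.05, 0.15), 3: (0.15, 0.30)}

def criticality_class(likelihood, impact):
    """Map 1-5 likelihood and 1-5 impact ratings to a class (assumed cut-offs)."""
    score = likelihood * impact
    if score >= 20:
        return 1
    if score >= 12:
        return 2
    if score >= 6:
        return 3
    return 4

def share_check(classes):
    """Return {class: (share, within_expected_band)} for Classes 1 to 3."""
    counts = Counter(classes)
    total = len(classes)
    result = {}
    for cls, (low, high) in EXPECTED_SHARE.items():
        share = counts[cls] / total
        result[cls] = (share, low <= share <= high)
    return result

# Hypothetical ratings for 100 assets: (likelihood, impact) pairs.
ratings = [(5, 5)] * 4 + [(4, 3)] * 10 + [(3, 2)] * 20 + [(1, 2)] * 66
classes = [criticality_class(l, i) for l, i in ratings]
report = share_check(classes)
for cls, (share, ok) in report.items():
    print(f"Class {cls}: {share:.0%} of assets, within expected range: {ok}")
```

A class that falls well outside its range does not prove the ACA is wrong, but it is a reasonable prompt for the team to revisit the ratings.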
Analysis of Class 1 redundancies
Here, we investigate within this top criticality class two key aspects:
1. Are there well-defined and properly established alternatives for every asset listed in the class?
2. Are these alternatives fully functional and ready to take over as soon as the original asset ceases to perform its function?
This redundancy analysis helps us identify SPOFs that are either overlooked altogether, or known but not given enough attention in terms of regular maintenance, inspection, and readiness for reliable operation.
It is to be emphasized here that, in some critical industries such as power generation and oil and gas, we may need to extend such detailed analysis to Class 2 or even Class 3.
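The two questions above can be sketched as a simple screening routine. This is only an illustration: the `Class1Asset` record, its fields, and the sample assets are hypothetical, and "verified" stands in for whatever evidence of readiness (test run, inspection record, drill) the team accepts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Class1Asset:
    name: str
    backup: Optional[str] = None   # question 1: is an alternative defined?
    backup_verified: bool = False  # question 2: is it proven ready to take over?

def redundancy_status(asset):
    """Classify one Class 1 asset according to the two questions."""
    if asset.backup is None:
        return "SPOF: no alternative"
    if not asset.backup_verified:
        return "SPOF: alternative not verified"
    return "redundant"

# Hypothetical Class 1 list covering equipment, people, and logistics.
assets = [
    Class1Asset("cooling pump P-101", backup="P-102", backup_verified=True),
    Class1Asset("ERP server", backup="standby server", backup_verified=False),
    Class1Asset("turbine SME"),
]
findings = {a.name: redundancy_status(a) for a in assets}
for name, status in findings.items():
    print(f"{name}: {status}")
```

Treating an unverified backup as a SPOF mirrors the point made earlier: a redundancy that exists only by design offers no protection in practice.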
How Do We Mitigate Identified SPOFs?
Investing in mitigating a SPOF could be prohibitively expensive. Therefore, before deciding to invest in creating alternatives or establishing redundancies, we should first investigate whether we already have alternatives or redundancies that are not functioning properly. If so, we should address their deficiencies and restore their fail-proof functionality.
If no alternatives or redundancies exist, we dive deeper and choose one of the following three options.
The decision should be based on the outcome of a comprehensive business case analysis, balancing the collective risk (likelihood and impact) against the investment required to establish the alternatives or redundancies that we conclude are needed.
1. Invest in establishing the required alternatives or redundancies for the SPOFs that we identified.
2. Examine whether there are ways to bypass the SPOF and/or effectively employ less costly measures such as more frequent maintenance and/or close monitoring.
3. If these two options are deemed extremely costly or impractical, we may decide to live with the SPOF. However, it is worth reemphasizing that such a decision must be thoroughly contemplated, based on extensive experience, and accompanied by an informed understanding of all its major consequences.
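One possible way to frame the business case numerically is sketched below. It assumes risk can be expressed as an expected annual loss (annual probability of failure times cost of impact) and that each option has a known annual cost and residual probability. These are strong simplifications, and every figure and name here is hypothetical; safety, regulatory, and reputational consequences would need to be weighed separately.

```python
def choose_mitigation(annual_probability, impact_cost, options):
    """Pick the option with the largest positive net annual benefit.

    options: list of (name, annual_cost, residual_probability) tuples.
    Falls back to accepting the SPOF when no option pays for itself.
    """
    baseline_loss = annual_probability * impact_cost
    best_name, best_net = "accept the SPOF", 0.0
    for name, annual_cost, residual_probability in options:
        residual_loss = residual_probability * impact_cost
        net_benefit = (baseline_loss - residual_loss) - annual_cost
        if net_benefit > best_net:
            best_name, best_net = name, net_benefit
    return best_name, best_net

# Hypothetical SPOF: 10% chance per year of a 2,000,000 loss.
options = [
    ("install standby pump", 120_000, 0.01),  # option 1: add redundancy
    ("closer monitoring", 30_000, 0.06),      # option 2: less costly measure
]
choice, net = choose_mitigation(0.10, 2_000_000, options)
print(choice, round(net))
```

If neither option shows a positive net benefit, the function returns "accept the SPOF", which corresponds to option 3 and should still be reviewed by experienced people rather than taken at face value.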
In conclusion, it may be enriching to depict these insights below by means of an overall flowchart summarizing their key concepts.
What are the Key Requirements for Analysis of SPOFs?
Essentially, the following requirements should be met so that SPOFs are identified and mitigated effectively:
1. A highly experienced and capable facilitator who is familiar with risk analysis, root cause analysis (RCA), failure modes and effects analysis (FMEA), and problem solving.
2. An authorized team. The team should be technically strong and highly aware of business opportunities and potential risks.
3. Visible ownership and support from top management, not only for the SPOF assessment activities but also for securing the investment and resources necessary for effective mitigation.
4. Tools, meaning the software we normally use to facilitate typical reliability analysis. Although they can save considerable effort and time, they are not a must, provided that we have the right facilitator and team who can do without them.
Reliability & Asset Management Expert
2 years ago: Thanks Abu Abdullah Mosaed Al Garni for the great sharing. SPOF analysis usually supports business continuity and strategic risk identification.
Reliability & Asset Management Consultant @ Freelance | Certified Reliability Leader
2 years ago: Thanks for sharing these experiences, Mosaed Al Garni. In fact, an endless number of outages come from what are supposed to be redundant systems! So, the human factor is key in addition to strong organisational structures, processes and leadership. RCAs are telling precious stories here!