Operating Redundant Systems
Dane Boers
End to end, data to decisions. Maximize value through modelling, optimization and knowledge sharing. Your asset analytics coach - Modla
Introduction:
There are two schools of thought (scenarios) when it comes to the operation of redundant systems:
1) Operate Asset A as much as possible; if/when Asset A fails, switch to Asset B and schedule a replacement for Asset A.
2) Cycle operation as evenly as possible between Asset A and Asset B then replace them both at the end of life (typically in a planned shutdown).
TLDR: Scenario 1 is preferred in most cases (See exceptions).
Assumptions / Setup:
For the purposes of this article we're going to assume a dual redundant system where Asset A is identical to Asset B.
Most assets have a combination of failure modes tied to either calendar time (ageing) or operating time (wear out). Since the failure modes of Asset A and Asset B are identical and changes in operation only affect the modes tied to operational time, we will only compare the operational time modes.
We're also going to assume that if both assets are in a failed state at the same time, this is a catastrophic event.
Reasoning:
For Scenario 1
If the asset life estimates are conservative, there are no premature replacements, because Asset A runs until it actually fails. This maximises "useful life" and decreases the cost / life ratio of each asset. If the estimates are inaccurate, then when Asset A fails, the chance that the redundant asset is also in poor condition is low, because Asset B has accumulated very little operating time. This means there is a low risk of a catastrophic event.
For Scenario 2, the rate of wear on each asset is halved, since operation is shared between both assets as evenly as possible. In this scenario there are 2 asset failures over the same period as Scenario 1; however, these are both expected to occur at a similar time.
The idea is that you can replace both assets at the "End of life" proactively during a shutdown. If the asset life estimates are conservative, it is likely that the premature replacement wastes "useful life" and increases the cost / life ratio of each asset. If the estimates are inaccurate, then when Asset A fails, the chance that the redundant asset is also in poor condition is high, because both assets have accumulated similar operating time. This means there is a high risk of a catastrophic event.
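The comparison above can be sketched as a small Monte Carlo simulation. This is only an illustration under assumed conditions, not a model from the article: operating life is assumed to follow a Weibull distribution, both assets are run to failure (no planned shutdown replacement), and calendar ageing and standby degradation are ignored. The function name and the parameters `eta`, `beta` and `lead_time` are hypothetical.

```python
import random


def double_failure_probability(scenario, eta=10_000.0, beta=3.0,
                               lead_time=500.0, trials=100_000, seed=0):
    """Estimate P(second asset fails before the first one is replaced).

    All parameters are illustrative assumptions, not values from the article:
    eta, beta  -- Weibull scale (operating hours) and shape of asset life
    lead_time  -- hours needed to replace the asset that failed first
    """
    if scenario not in (1, 2):
        raise ValueError("scenario must be 1 or 2")
    rng = random.Random(seed)
    catastrophic = 0
    for _ in range(trials):
        life_a = rng.weibullvariate(eta, beta)
        life_b = rng.weibullvariate(eta, beta)
        if scenario == 1:
            # Asset B is unused when Asset A fails, so its full life remains.
            remaining = life_b
        else:
            # Both assets carry equal hours at the first failure, so the
            # survivor only has the difference between the two lives left.
            remaining = abs(life_a - life_b)
        if remaining < lead_time:
            catastrophic += 1
    return catastrophic / trials


if __name__ == "__main__":
    for s in (1, 2):
        print(f"Scenario {s}: {double_failure_probability(s):.4f}")
```

With these assumed parameters the Scenario 2 estimate should come out well above the Scenario 1 estimate, which is the effect the reasoning describes. The size of the gap depends heavily on the assumed spread of asset life and on the replacement lead time.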
Conclusion:
Both scenarios have the same effective cost (assuming replacements in Scenario 2 are timed optimally), but Scenario 1 reduces the risk of catastrophic events because the redundant asset is in the best possible state at the time of the first failure.
Exceptions to this recommendation:
1) If lack of operation increases the likelihood of other failure modes, e.g. lack of rotation causes static corrosion due to uneven distribution of grease. Note: The presence of a mode like this does not invalidate the above conclusion, and the goal should still be to minimize the cost/life ratio and risk. This means that periodic rotation, or another mitigating task for this mode, may be effective whilst still adhering to Scenario 1's operation.
2) If unplanned switching losses are disproportionately high.
Helping out @ Carrapateena
4 years ago: Scenario 1 is usually better, until Asset B fails before a replacement for Asset A is ready for deployment because of external factors such as Covid 19.
Reliability Leadership - Asset Management, RAM(S) Engineering, Maintenance, RCM / FMECA, ERP/EAM, Reliability, FTA, RCFA. KTP Supervision.
4 years ago: Both options assume that no defects are introduced during the passive standby phase, such as stiction, binding, false brinelling (as noted), lube oil settling out, etc. As mentioned above, the loss of the dormant standby is a hidden functional failure, hence the need to test a start on demand; that operation also plays a useful maintenance role in getting everything moving again and redistributing lubricants. That said, a while back I came across a case where the unofficial strategy was as follows: upon failure of the duty unit, the standby unit was operated, which they believed gave them plenty of time to investigate the repair, strip down, discover difficulty in part identification, and find that parts were not spared, were difficult to order, were slow through the system and were then slow to be installed. This meant that for significant periods they were operating 1oo1, which creates a significant window of opportunity for concurrent failure with ensuing production loss. The point is: whatever strategy you adopt, when one unit fails, get it serviceable again as quickly as possible, whether it's returned to duty or becomes the standby. Get it fixed; having the standby doesn't give you unlimited time to repair at your leisure.
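The 1oo1 exposure window described in the comment above can be put into rough numbers. The sketch below assumes a constant failure rate for the remaining unit (an exponential model), which is a simplification and not something stated in the thread; the function name and the example MTBF of 20,000 hours are hypothetical.

```python
import math


def concurrent_failure_probability(repair_hours, mtbf_hours):
    """P(the only running unit fails while the other is still being repaired).

    Assumes a constant failure rate of 1 / mtbf_hours for the running unit.
    """
    if repair_hours < 0 or mtbf_hours <= 0:
        raise ValueError("repair_hours must be >= 0 and mtbf_hours > 0")
    return 1.0 - math.exp(-repair_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 20_000.0  # hypothetical MTBF of the remaining unit, in hours
    for weeks in (1, 4, 12):
        p = concurrent_failure_probability(weeks * 168.0, mtbf)
        print(f"{weeks:>2} week repair: {p:.3%}")
```

Under this model the risk grows almost in proportion to the repair time while the repair is short relative to the MTBF, so halving the time spent at 1oo1 roughly halves the chance of a concurrent failure.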
Maintenance Reliability Engineer
4 years ago: The swapping strategy for systems with redundant equipment should be carefully selected, as it directly impacts availability. The consequences of failure of any of those assets should be kept in mind for this decision. If it's a critical system, keeping the redundant Asset B idle until Asset A fails may not be the best choice, as some hidden failure modes such as false brinelling on the bearings, rotor bend or corrosion inside the casing could lead to Asset B failing when needed on demand, and the consequence would be extended downtime. On the other hand, too many starts and stops of both assets under a 50:50 philosophy could put unnecessary stress on the components, causing premature failure. The swapping philosophy and the swapping cycle duration should be aligned with the predictive/preventative maintenance plans so that no PPMs are missed because equipment is not ready for them.
Strategy, Asset Management, Digital Engineering, IoT, Industry 4.0
4 years ago: Or three: operate Asset A 30% more than Asset B, which staggers your replacement costs and allows both assets to be replaced in a planned manner, with Asset A acting as your leading indicator of asset performance.
RAMS Engineer
4 years ago: The purpose of redundancy is to achieve the highest possible system availability, but the threat of common cause failures always remains alongside redundancy. Here your assumptions miss the possibility of a hidden failure of Asset B. You can detect a hidden failure by switching operation to the redundant equipment to see whether it is able to operate on demand. The problem is that switching itself puts a heavy load on the equipment, reducing its life. So once you have switched, it is advisable to operate for a long time rather than turning it off again. Your reasoning for Scenario 2 is right that we should not operate both assets equally, as it would lead to simultaneous failure (not due to common cause). So the best strategy is to operate both 30/70, so that operation is optimized under both concerns.
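The uneven splits suggested in the comments (Asset A run 30% more than Asset B, or a 30/70 split) can be compared with a little arithmetic. The sketch below assumes the system runs continuously, both assets have the same operating-life estimate, and only operating-time wear matters; the function name and the 10,000 hour life are hypothetical.

```python
def wear_out_times(life_hours, share_a):
    """Return the calendar hours until Asset A and Asset B each reach life_hours.

    share_a is the fraction of system running time carried by Asset A;
    Asset B carries the remainder.
    """
    if not 0.0 < share_a < 1.0:
        raise ValueError("share_a must be strictly between 0 and 1")
    t_a = life_hours / share_a
    t_b = life_hours / (1.0 - share_a)
    return t_a, t_b


if __name__ == "__main__":
    life = 10_000.0  # hypothetical operating-life estimate, in hours
    splits = {
        "50/50": 0.5,
        "A runs 30% more than B": 1.3 / 2.3,
        "70/30": 0.7,
    }
    for label, share in splits.items():
        t_a, t_b = wear_out_times(life, share)
        print(f"{label}: A at {t_a:,.0f} h, B at {t_b:,.0f} h, gap {t_b - t_a:,.0f} h")
```

An even split gives no gap between the two expected wear-out times, which is the simultaneous-failure concern raised above. The more uneven the split, the wider the gap available for a planned replacement of Asset A before Asset B approaches its end of life.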