Maintenance History is (mostly) bunk
Mark Horton
Reliability-centred Maintenance trainer, consultant, presenter and auditor. Specialist in risk, reliability and scientific computing.
I have been trying to remember the last occasion when an engineer told me they had learned something important from their maintenance history. Perhaps I'm keeping the wrong company at the moment, but I'm struggling to think of one.
Doesn’t that seem strange? As far as "big data" goes, the failure and planned maintenance records in an ERP are about as big as they come. Every year we add tens or hundreds of thousands of them. The result is gigabytes of data waiting to be coaxed into giving up their secrets: the key to improved profitability, safety, environmental integrity and a better working life.
Every large industrial organisation spends money on collecting and storing all this maintenance history and I meet engineers who feel genuinely guilty about not using it. If you are one of them, the core message of this article is for you.
You probably couldn’t use most of your maintenance history if you tried.
And it’s not your fault.
Optimum Maintenance from Failure History
You want to set up a planned maintenance policy for product dryers used in a chemical plant. Examination of the dryers’ bearings has shown that, unlike most bearings of this type which fail at random, they are subject to a pattern of wear. You expect to put some sort of preventive task in place.
How often should the bearing be replaced?
Here is the plan of campaign.
- Download the bearing’s failure history from your organisation’s ERP.
- Draw a survival chart using the recorded failure history.
- Fit a Weibull curve to model its reliability.
- Use the reliability curve and cost information to work out the optimum planned maintenance interval.
1 Download failure history
Easy. A quick query for all the dryer bearings produces a total of 20 failure records. A little arithmetic gives you each bearing’s age at failure.
2 Draw a survival chart
Each circle on the chart below represents a failure, and its age at failure is plotted along the x-axis. The y-axis is the proportion of bearings surviving to any given age. The shortest bearing life is about 13000 hours, the longest just over 60000 hours, and half the bearings have lasted about 35000 hours.
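As a rough sketch of how the points on a chart like this can be computed, the snippet below turns a list of ages at failure into survival proportions. The ages are hypothetical values chosen only to be consistent with the description above (20 failures, shortest about 13000 hours, longest just over 60000, half gone by about 35000); they are not the data behind the chart, and the function name is illustrative. It also assumes that every bearing ran to failure.

```python
def survival_points(ages_at_failure):
    """Return (age, proportion surviving beyond that age) for each failure, sorted by age.

    Assumes every unit ran to failure (no planned replacements, no censoring).
    """
    ages = sorted(ages_at_failure)
    n = len(ages)
    return [(age, (n - rank) / n) for rank, age in enumerate(ages, start=1)]


# HYPOTHETICAL ages at failure in operating hours, for illustration only.
hypothetical_ages = [
    13000, 18500, 22000, 25000, 27500, 29500, 31000, 32500, 34000, 35000,
    36000, 37500, 39000, 41000, 43000, 45500, 48000, 51500, 55500, 60500,
]

for age, surviving in survival_points(hypothetical_ages):
    print(f"{age:>6} h  {surviving:.2f}")
```

Each printed pair corresponds to one circle: the x value is the age at failure and the y value is the proportion still running after it.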
3 Fit a Weibull curve
The orange curve represents a best-fit Weibull survival curve through the data.
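One common way to obtain a best-fit curve of this kind is median rank regression: plot ln(-ln(1 - F)) against ln(age) and fit a straight line, whose slope is the Weibull shape parameter. The sketch below does this using Bernard's approximation for the median ranks. It is not necessarily the method used to draw the orange curve, the ages are hypothetical values chosen to resemble the description of the chart, and it assumes there are no censored records.

```python
import math


def fit_weibull_median_rank(ages_at_failure):
    """Estimate Weibull (shape, scale) by least squares on the linearised CDF.

    ln(-ln(1 - F)) = shape * ln(t) - shape * ln(scale)
    F is estimated with Bernard's median rank approximation.
    Assumes every unit ran to failure (no censoring).
    """
    ages = sorted(ages_at_failure)
    n = len(ages)
    xs, ys = [], []
    for i, age in enumerate(ages, start=1):
        f = (i - 0.3) / (n + 0.4)  # Bernard's approximation
        xs.append(math.log(age))
        ys.append(math.log(-math.log(1.0 - f)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    shape = sxy / sxx                      # slope of the fitted line
    intercept = y_mean - shape * x_mean    # equals -shape * ln(scale)
    scale = math.exp(-intercept / shape)
    return shape, scale


def weibull_survival(t, shape, scale):
    """Probability that a unit survives beyond age t."""
    return math.exp(-((t / scale) ** shape))


# HYPOTHETICAL ages at failure in operating hours, for illustration only.
hypothetical_ages = [
    13000, 18500, 22000, 25000, 27500, 29500, 31000, 32500, 34000, 35000,
    36000, 37500, 39000, 41000, 43000, 45500, 48000, 51500, 55500, 60500,
]

shape, scale = fit_weibull_median_rank(hypothetical_ages)
print(f"shape = {shape:.2f}, scale = {scale:.0f} hours")
print(f"estimated survival at 15000 hours: {weibull_survival(15000, shape, scale):.3f}")
```

A shape parameter well above 1 indicates wear-out, which is what the examination of the bearings suggested; a value near 1 would indicate random failure.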
4 Work out the optimum planned maintenance interval
Knowing the cost of planned maintenance and the cost of an unplanned breakdown together with the Weibull curve and some maths, you can work out the optimum maintenance interval. This is the period that results in the best trade-off between two types of expenditure:
- Planned maintenance costs, which increase as the interval becomes shorter
- Breakdown costs, which increase as the maintenance period gets longer and more breakdowns occur.
There are a number of uncertainties in the costs incurred, but the best estimates available are these.
- Cost of planned replacement: $30,000
- Cost of unplanned breakdown maintenance including lost production: $250,000
In the graph below you can see how the hourly costs of unexpected failures and planned maintenance change as the bearing’s replacement interval is varied. The minimum total cost is the best compromise between the risk of unplanned failure and the cost of planned maintenance.
Changing the bearing at about 14000 hours incurs the lowest overall cost of about $3 per hour.
Maintenance history and costs in, optimum maintenance out: that’s a perfect result.
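The "some maths" can be sketched with the classic age-replacement model: the expected cost of one replacement cycle divided by the expected length of that cycle. The two costs below are the estimates given above. The Weibull shape and scale are assumed values for illustration, not the fitted parameters behind the graph, so the printed numbers should not be expected to reproduce the figures quoted above exactly.

```python
import math


def cost_per_hour(interval, shape, scale, planned_cost, failure_cost, steps=2000):
    """Expected cost per operating hour when replacing at `interval` hours or on failure."""

    def survival(t):
        return math.exp(-((t / scale) ** shape))

    # Expected cycle length = integral of survival from 0 to interval (trapezoidal rule).
    h = interval / steps
    expected_life = h * (
        0.5 * (survival(0.0) + survival(interval))
        + sum(survival(k * h) for k in range(1, steps))
    )
    p_fail = 1.0 - survival(interval)  # chance of failing before the planned change
    expected_cost = planned_cost * (1.0 - p_fail) + failure_cost * p_fail
    return expected_cost / expected_life


def best_interval(candidates, **model):
    """Return the candidate interval with the lowest expected cost per hour."""
    return min(candidates, key=lambda t: cost_per_hour(t, **model))


params = dict(
    shape=3.0,              # ASSUMED wear-out shape, not taken from the article
    scale=40000.0,          # ASSUMED characteristic life in hours
    planned_cost=30000.0,   # from the article
    failure_cost=250000.0,  # from the article
)
best = best_interval(range(1000, 60001, 500), **params)
print(f"lowest-cost interval: about {best} hours, "
      f"${cost_per_hour(best, **params):.2f} per hour")
```

With a shape of 1 (purely random failure) the same function shows the cost per hour falling as the interval lengthens, which is the usual argument that fixed-interval replacement cannot pay for itself when failures are random.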
Wake up, sleepy head
If you can, imagine that you have just woken up. You were dreaming that you wanted to find the most cost-effective planned replacement interval for a dryer bearing. You found the relevant failure history, fitted a Weibull survival curve and used cost data to find the optimum maintenance period.
Now you are awake.
Today’s job is to review the planned maintenance interval for the dryer bearings.
1 Download failure history
Log in to your ERP and download the bearing’s failure records.
2 Draw a survival chart
There’s one record, and here it is in all its glory on a chart.
This is why you can’t use your historical failure data.
It’s because you don’t have any.
When the dryers were first installed, planned bearing replacement was scheduled every 15000 hours. Why? Because an in-service failure matters: it costs around $250,000 every time it happens. The original maintenance schedule was intended to prevent in-service failures by replacing the bearing before it failed, assuming that it would last for at least 15000 hours.
As it turns out, that assumption was wrong: one bearing did fail early. But the scheduled replacement means that the other 19 failure records you saw in your dream, those to the right of the red line, never happened. There is a big, blank space where they aren’t. Every event recorded in the maintenance database applies only to equipment that is less than 15000 hours old. You wanted to use the recorded history to set an optimum maintenance policy, but one record is all that is available.
You have no idea what would have happened if the equipment had been allowed to fail in that region. None at all.
There is no hope of predicting the bearing’s reliability at 20000 hours, 40000 hours or any other time from the evidence of a single failure. The survival curve could slope gracefully down because of random failure. It could fall off a cliff edge, showing sudden wear out. There could be a plateau; the single recorded failure might have been a manufacturing defect, the result of misoperation or just very bad luck. Without knowing the physics of failure development, no amount of analysis of the one data point below 15000 hours could tell us which. The chances of achieving any level of certainty seem minimal even if we can find a physical failure model.
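To put a rough number on this, the sketch below compares three very different Weibull models against the same evidence: one failure and a set of bearings removed unfailed at 15000 hours. The inputs are assumptions made for illustration: the failure age of 13000 hours is borrowed from the shortest life in the dream, and 19 unfailed removals stand in for the 19 failures that never happened. For each assumed shape it finds the best-fitting scale and reports how well the model explains the data and what it predicts at 40000 hours.

```python
import math


def log_likelihood(shape, scale, failures, censored):
    """Weibull log-likelihood with right-censored (removed unfailed) units."""
    ll = 0.0
    for t in failures:
        ll += (math.log(shape / scale)
               + (shape - 1.0) * math.log(t / scale)
               - (t / scale) ** shape)
    for c in censored:
        ll -= (c / scale) ** shape  # log of survival to the removal age
    return ll


def best_scale(shape, failures, censored):
    """Maximum-likelihood scale for a fixed shape."""
    total = sum(t ** shape for t in failures) + sum(c ** shape for c in censored)
    return (total / len(failures)) ** (1.0 / shape)


failures = [13000.0]        # ASSUMED age of the single recorded failure
censored = [15000.0] * 19   # ASSUMED: 19 bearings replaced unfailed at 15000 hours

for shape in (1.0, 3.0, 6.0):  # random failure, wear-out, cliff edge
    scale = best_scale(shape, failures, censored)
    ll = log_likelihood(shape, scale, failures, censored)
    r40 = math.exp(-((40000.0 / scale) ** shape))
    print(f"shape {shape:.0f}: log-likelihood {ll:.2f}, survival at 40000 h {r40:.3g}")
```

On these assumed inputs the three log-likelihoods come out within about one unit of each other, so the data barely prefer any of the models, yet their predictions at 40000 hours range from most bearings surviving to essentially none.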
This is the paradox of failure recording. There are plenty of records that don’t matter: those that have trivial consequences such as dead indicator lights and seals where leakage caused a slight inconvenience.
Unless something is very wrong, the database doesn't contain many records of turbines failing catastrophically in service, pressure vessels exploding or critical pipelines corroding right through. If a failure matters, the chances are that a pro-active task has already been put in place to prevent the failure. If the failure is prevented, there’s no failure history and no data for analysing reliability trends.
As a result you probably don’t have historical data that could be used to optimise age-based replacement intervals for failures that really matter.
Just to be clear, a failure “really matters” if one or more of these conditions applies to it.
- It would be expensive to fix
- It would lead to significant and costly production downtime
- The failure could lead to a safety or environmental incident
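The way a scheduled replacement hides failure history can be sketched with a small simulation. Bearing lives are drawn from a Weibull distribution whose shape and scale are assumed purely for illustration; any bearing that would have lasted beyond the replacement age is recorded as a planned change, and its true life is never observed.

```python
import random


def recorded_history(n_bearings, replace_at, shape, scale, seed=1):
    """Simulate what the maintenance database would contain under scheduled replacement.

    Returns (ages of recorded failures, number of planned replacements).
    """
    rng = random.Random(seed)
    failures = []
    planned = 0
    for _ in range(n_bearings):
        # random.weibullvariate takes (scale, shape) in that order
        true_life = rng.weibullvariate(scale, shape)
        if true_life < replace_at:
            failures.append(true_life)   # failed in service: a failure record exists
        else:
            planned += 1                 # replaced first: the true life is never seen
    return failures, planned


# ASSUMED wear-out behaviour (shape 3, scale 40000 hours); 15000 hours is from the article.
failures, planned = recorded_history(20, replace_at=15000, shape=3.0, scale=40000.0)
print(f"failure records: {len(failures)}, planned replacements: {planned}")
print("oldest recorded failure:", round(max(failures)) if failures else "none")
```

However many bearings are simulated, no recorded failure is ever older than the replacement age, so the part of the survival curve that matters for choosing a longer interval stays empty.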
Resnikoff Already Knew
None of this is new.
If there were a prize for honesty in reliability analysis, it would have to be awarded to Howard L Resnikoff. His extended paper Mathematical Aspects of Reliability-Centered Maintenance published in 1978 is a sort of companion piece to Stanley Nowlan and Howard Heap’s legendary Reliability-Centered Maintenance.
Resnikoff says this in his introduction before he even begins the main sections.
“One of the most important contributions of the Reliability-Centered Maintenance Program is its explicit recognition that certain types of information heretofore actively sought as a product of maintenance activities are, in principle, as well as in practice, unobtainable.”
After six chapters covering the statistics of survival distributions, hazard rates, inference, Bayes’ theorem and system reliability modelling, you might expect him to conclude by emphasising how important thorough statistical analysis is to RCM decisions. Not H L Resnikoff. Instead, this is what he says about the availability of data in the real world.
“The more effective the [existing maintenance] program is, the fewer critical failures will occur, and correspondingly less information about operational failures will be available to the maintenance policy designer.
“That the optimal policy must be designed in the absence of critical failure information, utilizing only the results of component tests and prior experience with related but different complex systems, is an apparently paradoxical situation.
“Moreover, the applicability of statistical theories of reliability to the very small populations of large-scale complex systems typically encountered in practice is questionable and calls for some discussion. Each of these distinct viewpoints leads to the conclusion that maintenance policy design is necessarily conducted with extremely limited information of dubious reproducibility, and we must consider why it is nevertheless possible, and how it can be done.”
In other words: “You may think that you have usable information, but you don’t have it and you probably can’t get it.”
Does it matter?
Here’s the good news. Finding the right maintenance task doesn’t often depend on having failure data.
Although that information is sometimes useful, it’s not an essential part of the RCM task selection process. To see why, take a look at the RCM task questions in the order they are asked.
There are two reasons why RCM review groups don’t often answer “yes” to the scheduled restoration and scheduled discard task questions.
1. Condition-based maintenance is the first task selection question, before any other maintenance policy. Most failures that could be candidates for scheduled replacement or overhaul are managed by condition-based tasks instead.
2. Very few failure rates increase with time. Nowlan and Heap’s table shows a total of 11% of failure modes with age-related failure rates.
Perhaps you aren’t convinced that RCM task schedules contain so few hard time (restoration and discard) tasks. As a natural sceptic, neither was I. So I extracted RCM maintenance tasks from a database of nearly 20,000 high quality equipment failure modes on a mixture of real systems. I’m not going to pretend that the results are generally valid or representative of your own industry, but they are derived from real RCM-based schedules.
Only 2% of the failure modes were managed by a fixed interval scheduled discard task. That’s two out of every hundred failure modes where the RCM group could even have considered using some form of lifetime analysis to determine the optimum replacement time.
I was surprised that 9% of failure modes were managed by scheduled restoration, so I took a closer look. Was there some scope for age analysis here?
Of all the scheduled restoration tasks, very nearly half were routine calibrations. A substantial proportion of the remainder were simple cleaning and lubrication tasks. Although there is some scope for working out the rate of instrumental drift or the condition of equipment when it was cleaned, I could find only a few tasks that might benefit from statistical analysis, and even then the cost of doing the analysis could easily have been more than any benefit from optimising the maintenance interval.
In this data set I discovered something that Nowlan and Heap knew when they developed RCM: there are only a few failure modes that can be managed by an age-based policy. Of those that could be handled by scheduled discard, most are managed by condition monitoring instead. Why?
RCM deliberately selects a condition-based task in preference to scheduled overhaul or discard because it maximises the life that is achieved from the asset.
Here are four failure events from the history of the dryer bearing with the scheduled 15000-hour replacement task.
A planned replacement is made every 15000 hours, so the total life of the four bearings is 60000 hours.
The picture is very different with the same failures but applying a condition-based task.
The total life achieved is now 144000 hours. Scheduled replacement—and remember, this is at the lowest cost, so-called “optimum” interval—throws away over half the available asset life. You can see what happens to all that life in the first diagram, where only one of the four bearings is anywhere near the start of deterioration before it is replaced.
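The "over half" figure follows directly from the two totals quoted above, as this short calculation shows. It uses only numbers from the text; the variable names are illustrative.

```python
scheduled_interval = 15000       # hours between planned replacements (from the article)
bearings = 4                     # bearings in the example (from the article)
life_condition_based = 144000    # total hours achieved with a condition-based task (from the article)

life_scheduled = scheduled_interval * bearings
fraction_discarded = 1 - life_scheduled / life_condition_based

print(f"life used under scheduled replacement: {life_scheduled} hours")
print(f"share of available life thrown away: {fraction_discarded:.0%}")  # 58%
```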
Conclusions
I’ve never been tempted to write about the problems of generating solar power during the night or getting teenagers out of bed in the morning. What was the point of writing about something that can’t be done most of the time?
There are a few reasons.
1 Don’t get obsessed with fixed-time replacement and overhaul
The RCM task logic puts condition-based maintenance first for a good reason.
Most of the time there are alternative tasks that provide at least the same level of reliability, cost less than fixed-time replacement and deliver longer average asset life.
2 Spend your time where you get results
Don’t feel guilty about not being able to analyse your maintenance history to get useful life data. Your maintenance system can’t deliver what isn’t there.
If you find a failure that really, genuinely, absolutely needs life analysis, make a plan that will get the information you need: from manufacturers, OEMs, other asset users or—if it’s really worth it—your own age exploration study.
Otherwise don’t fret about the data you will never have. Spend your time on other ways of managing the failure and making improvements that will deliver benefits.
3 Know how to use historical data to develop effective maintenance policies
Wait.
Wasn’t this whole paper about why we can’t use historical data?
Yes, but if you are feeling less guilty about never reviewing your historical maintenance data, don’t relax yet.
I have covered only half the historical data story: that you probably can’t optimise preventive maintenance for critical failures because you have so little failure data. This paper is long enough already.
The second part of the story is far more worthwhile. By knowing what information the RCM process needs, you can use maintenance history to make focussed improvements to your maintenance schedules.
That’s Part 2.
Terms of use and Copyright
Neither the author nor the publisher accepts any responsibility for the application of the information and techniques presented in this document, nor for any errors or omissions. The reader should satisfy herself or himself of the correctness and applicability of the techniques described in this document, and bears full responsibility for the consequences of any application.
Copyright © 2017 numeratis.com.
Licensed for personal use only under a Creative Commons Attribution-Noncommercial-No Derivatives 3.0 Unported Licence. You may use this work for non-commercial purposes only. You may copy and distribute this work in its entirety provided that it is attributed to the author in the same way as in the original document and includes the original Terms of Use and Copyright statements. You may not create derivative works based on this work. You may not copy or use the images within this work except when copying or distributing the entire work.
Tech Product Strategy & Management @ Amazon Ads | Servant Leader
7 years ago: The only organization that we would hope to have any significant failure records would be the OEM. But even here, there is only a small set of OEMs manufacturing critical and expensive equipment which have any significant failure history. And frankly even here, for a specific failure mode you may find records in the high single to low double digits (Statistically insignificant to say) Even for OEMs, once you find those low double digit cases, how can you truly compare them if the operation, environmental and maintenance conditions and practices suffered significantly for each of the equipment instances??? Unless you have a huge fleet of equipment with very similar and controlled operating & environmental conditions with standardized maintenance practices (very few verticals exist with these characteristics where you also have data available), Condition Based Maintenance is the only way to achieve the most cost effective outcome.
WOW....I wish I wrote it myself!
Head (Director) - Industry Consulting | Business Transformation | Products and Engineering Innovation
7 years ago: Nice article Mark Horton. Good to see your step-by-step approach that I really missed as a reference when I executed Weibull based maintenance optimization work over 10 years ago. If I may guess for Part 2, you will include Probabilistic simulation as a method for augmenting historical life data?
CMRP, CRL, CRE , Reliability Engineering Trainer & Consultant
7 years ago: What Doug said! All you really need to do to have a reliable plant is create excellent lubrication, essential care, condition monitoring and precision maintenance programs. In fifteen years of rolling around the globe I have not found one organization that has anything like data good enough to perform life data analysis.
Summit Reliability
7 years ago: Excellent article. Look forward to part 2.