It's About Time! 5 Points About Reliability (& Failure)
Alejandro Erives
Creating the future in maintenance, reliability, and your organization.
We've all heard the phrase "Time is money". In the reliability world, understanding how time affects failure or reliability can be a competitive advantage. The five points below illustrate how the ever-forward march of time can affect reliability and, in turn, profitability.
- Reliability, by at least some definitions, is literally a function of TIME! In their report, Reliability-Centered Maintenance, Nowlan & Heap defined reliability as "... the probability that an item will survive to a specified operating age..." (i.e. to a certain time). In some cases, the probability of failure increases as the equipment ages. In other cases, the probability of failure decreases. Weibull analysis can identify which failure patterns apply to your equipment. Most failure patterns look like straight lines on a Weibull plot (see figure below). The second figure shows the conditional probability of failure for various Weibull shape parameters (i.e. these are the more familiar failure patterns).
- Benjamin Franklin once said "nothing can be said to be certain, except death and taxes". If he had worked in Reliability, he might have added "failures" to this list. Failure is inevitable: given enough TIME, failures will happen. Trending how quickly failures accumulate provides a lot of valuable insight. From a Crow-AMSAA plot, we can determine whether our overall reliability is improving or deteriorating. We do this by observing the slope (beta) on the plot: a slope below 1 means failures are arriving more slowly as time goes on (improving), while a slope above 1 means they are arriving faster (deteriorating). In addition to showing how reliability is trending over time, the Crow-AMSAA plot may hint at major step changes in our reliability (perhaps an improvement was made to a pump). It can also hint at the nature of the failure modes involved (batch problems and infant mortality failure patterns tend to have a unique appearance).
- When failures seem to be predictable and repeatable, we tend to think we can avoid failure by prescribing a TIME-based maintenance action. The plot below shows two curves. The first, narrower curve shows a typical range in which a PM may be completed for a filter replacement on a rotary lobe blower (in this case a time-based PM frequency of 6 months). Depending on your maintenance program, this PM may be completed a little earlier or later than the nominal 6 months (this leads to a distribution around 6 months). The second curve shows the time in service it may take for the filter to plug with debris and collapse (on average ~9 months for this curve). Unfortunately, due to the variation in execution time of the PM in this example (considerable) and the variation in how quickly the filter collects dust/debris (also considerable), the PM curve and the failure curve have substantial overlap. This overlap can lead to a plugged or collapsed filter prior to the PM change-out, and dust/debris in the blower (and ultimately in the process). Being able to plot the normal variation in time of both maintenance activities and failure mechanisms can provide real insights that we can use to improve reliability. As maintenance & reliability professionals, we can use this information to guide an updated PM frequency or perhaps a switch to a condition-monitoring approach (e.g. monitoring pressure drop).
- Failures are not typically isolated events, but rather a sequence of cause & effect events that take place over TIME. An example of this can be a gear drive failure. The gear drive's reliability can be described as a function of time, but it may be more relevant to our reliability and maintenance roles to understand how time affects the events that lead up to the failure. The example below shows how a mineral oil may age and oxidize in a gear drive (in this case over a period of 1-5 years). Understanding the time frame involved can influence decisions related to lubricant analysis frequency, oil change frequency, system design and possibly even lubricant selection.
- Failures rarely happen without warning, although we may wish we had more TIME between a potential failure and a true functional failure. The time from an observed potential failure to a functional failure is referred to as the P-F interval. In the case of the gear drive above, the observation of oxidized oil (by lubricant testing) can be considered a potential failure. The seizure of the gearbox or dangerous vibration levels could be considered the functional failure (the functional requirement is defined by the user). With proper analysis and sufficient data, P-F intervals can be estimated. The plot below shows the P-F interval estimates for several potential failures observable with a typical lubricant analysis report (e.g. contamination, oxidation or viscosity issues, and excess wear metals). As noted in the previous point, failures do not happen as singular events, and each of these potential failures may lead to eventual functional failure (water contamination may lead to oil oxidation and viscosity changes, which in turn may lead to mechanical wear / metal debris, which could lead to high vibration and ultimately functional failure).
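The first three points above can be made concrete with a little arithmetic. The Python sketch below is illustrative only and is not taken from the article's figures. It assumes a two-parameter Weibull model for the failure patterns, the power-law (Crow-AMSAA) model with a time-terminated maximum-likelihood estimate of the slope, and independent normal distributions for both PM timing and filter life. The function names, the failure dates, and the standard deviations are hypothetical; only the 6-month PM interval and the ~9-month average filter life come from the text.

```python
import math


def weibull_hazard(t, beta, eta):
    """Weibull instantaneous failure rate: h(t) = (beta/eta) * (t/eta)**(beta - 1).

    beta < 1 gives a decreasing rate (infant mortality), beta = 1 a constant
    rate (random failures), beta > 1 an increasing rate (wear-out).
    """
    return (beta / eta) * (t / eta) ** (beta - 1)


def crow_amsaa_beta(failure_times, end_time):
    """Slope of the power-law model N(t) = lambda * t**beta.

    Maximum-likelihood estimate for time-terminated data:
    beta = n / sum(ln(T / t_i)), with cumulative failure times t_i <= T.
    """
    n = len(failure_times)
    return n / sum(math.log(end_time / t) for t in failure_times)


def prob_failure_before_pm(pm_mean, pm_sd, fail_mean, fail_sd):
    """P(failure time < PM time), ASSUMING both are independent and normal.

    The difference (failure - PM) is then normal as well, so the answer is
    the standard normal CDF evaluated at (pm_mean - fail_mean) / combined_sd.
    """
    combined_sd = math.sqrt(pm_sd ** 2 + fail_sd ** 2)
    z = (pm_mean - fail_mean) / combined_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Point 1: failure rate at 5 and 8 time units for three shape parameters
    # (eta = 10 is an arbitrary, hypothetical characteristic life).
    for beta in (0.5, 1.0, 3.0):
        print(beta, weibull_hazard(5, beta, 10), weibull_hazard(8, beta, 10))

    # Point 2: hypothetical cumulative failure times (days) observed over
    # 400 days, with gaps between failures growing longer.
    print(crow_amsaa_beta([10, 30, 70, 150, 310], end_time=400))

    # Point 3: PM at 6 +/- 1 months, filter life 9 +/- 2 months
    # (both standard deviations are assumed, not from the article).
    print(prob_failure_before_pm(6, 1, 9, 2))
```

With the assumed spreads of 1 month on the PM and 2 months on filter life, the last call comes out at roughly 9%, meaning about one filter in eleven would plug before its scheduled change-out. Narrowing either spread, or shortening the PM interval, shrinks that number, which is the trade-off the third point describes. Real data may well call for a different distribution (for example, Weibull or lognormal filter life), in which case the overlap would need to be computed numerically.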
No company or organization has the TIME or money (resources) to completely eliminate failure. Even on our best days, with a great design, precision installation, capable predictive maintenance, and great planning / scheduling / maintenance response times, the probability of failure at any given moment is always greater than zero. Understanding how failure or reliability relates to TIME is therefore a prerequisite to optimal management of our organizations. Contact Alejandro Erives at Blackstart Reliability LLC to get started applying these ideas to your business.
Organizational Effectiveness Resource & Certified RCFA Principal Investigator - Retired yet seeking opportunities to teach & mentor investigators
5 years ago: Nice article and easy to follow and understand. I agree with the information but the filter and plugging is not a clear relationship issue. The equipment plugging is the filter so they are inclusive events. I think a better example is a redundant pair or trio of pumps with different failure rates, where eventually you lose the redundancy because multiple pumps fail at the same time, i.e. one with 6 months, the other with 8 months.
How to Get Your Boss's Boss to Understand by Communicating with FINESSE | Solutions for people, facilities, infrastructure, and the environment.
5 years ago: Nice, straightforward article. Time is indeed the important dimension. Maybe your next article can be on why it is missing from the definition of risk. Thanks for posting this one.
Maintenance & Reliability Professional | Latina Entrepreneur | Passionate Advocate for Diversity & Inclusion
5 years ago: Great article Alejandro! My concept of time relating to a failure, however, isn't always linear. Failure oftentimes starts at conception: the conception of a design to fulfill the objective it is meant to serve. We don't seem to give sufficient time and consideration to defining the parameters of what we intend our equipment/system to do to meet our end goal. Terrance's comment brings me back to our conversations about what is a Maintenance Engineer vs. a Reliability Engineer. They are not one and the same, but they do influence each other (in my opinion). Hope all is well and good luck to you in your new endeavor!