Predictive Maintenance part 1 - a supervised approach
Abstract
Improving equipment effectiveness is a key concern in industry, particularly when it comes to maximizing uptime and avoiding catastrophic failures. Every piece of machinery wears out and occasionally fails, which is why we have to constantly take care of it, checking and maintaining it.
Long story short, we can identify four different approaches to maintenance:
- reactive: we simply run the equipment until it fails, then we react by repairing or replacing the broken component;
- preventive (scheduled): we schedule a periodic program of maintenance to check and prevent failures;
- condition based: we act whenever we have a clue of an anomaly, based on a particular signal crossing a threshold or deviating from a known behavior;
- predictive: we try to analytically describe the wear status of the equipment and predict if it’s failing and possibly when.
It is clear that the latter brings substantial advantages with respect to the previous ones: we can plan maintenance more efficiently, avoiding extra and unnecessary actions, and, crucially, we have an additional safety measure against catastrophic events with potentially costly consequences (e.g. an expensive piece of machinery, or an airplane engine during flight).
Here (and in the following issues) we want to focus on predictive maintenance using Machine Learning, in both supervised and unsupervised settings.
Predictive Maintenance: some insights
Given the variety of possible cases and applications, it's difficult to end up with a single generic solution. First: what do we want to predict? We want to know if our asset is breaking down and, possibly, how many hours/cycles are left. This means that in order to train our predictor we should collect this information. But here we hit our first obstacle: can we really collect it? Imagine we are monitoring our asset and collecting all the sensor data: there is no intrinsic way to directly determine the remaining useful life (and if we could, well, we would not need any predictor at all). One way could be running the asset until failure and then counting the time backwards, but this clearly isn't feasible for the majority of applications. And even when a failure occurs it would be difficult to track, given that we wouldn't be able to measure the impact of any maintenance on the remaining life. In short: it's hard, if not impossible, to consistently measure the asset's wear status. This means that, if we do not possess an output parameter to predict, we have to find another way. Going back to our Machine Learning framework we can identify two approaches:
- Supervised learning: we have both the input data and the output data to predict. In this case all the settings and sensor data as input and the remaining useful life (or any wear metric) as output.
- Unsupervised learning: we have to rely on the input data only, so we have to find a way to order the samples in a meaningful way and coherently identify anomalous observations.
We will explore both approaches: we start with the supervised one in this paper, move on to some unsupervised algorithms in the next one, and then to some unlabeled data augmentation and an attempt at a novel transfer learning technique.
Before going on we should focus on the data structure. Analyzing an asset we could encounter different situations: binary data (e.g. a valve opening condition), categorical data (e.g. a machine status), continuous data that could be transient, stationary, cycling with different frequencies, and so on. Different scenarios should be treated differently. Here we are facing the stationary case.
Case Definition
In this paper we use sensor time series from a fleet of turbofan engines of the same type that start with different wear conditions and unknown manufacturing variation. The dataset is provided by NASA (courtesy of: A. Saxena, K. Goebel, D. Simon, and N. Eklund, "Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation", in the Proceedings of the 1st International Conference on Prognostics and Health Management (PHM08), Denver CO, Oct 2008); here the source and here the reference paper.
We report the experimental scenario as attached to the dataset:
Data sets consists of multiple multivariate time series. Each data set is further divided into training and test subsets. Each time series is from a different engine: i.e., the data can be considered to be from a fleet of engines of the same type. Each engine starts with different degrees of initial wear and manufacturing variation which is unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance. These settings are also included in the data. The data is contaminated with sensor noise.
The engine is operating normally at the start of each time series, and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure. In the test set, the time series ends some time prior to system failure. The objective of the competition is to predict the number of remaining operational cycles before failure in the test set, i.e., the number of operational cycles after the last cycle that the engine will continue to operate. Also provided a vector of true Remaining Useful Life (RUL) values for the test data.
We additionally report that the data is provided in four sets:
- one operating condition (sea level) and one fault (HPC Degradation);
- six conditions and one fault (HPC Degradation);
- one operating condition (sea level) and two faults (HPC Degradation, Fan Degradation);
- six conditions and two faults (HPC Degradation, Fan Degradation).
In this issue we concentrate on the first case.
Giving a good view of the data structure can be difficult due to the high dimensionality (21 sensors plus 3 settings), but we can still perform a dimensionality reduction via PCA (Principal Component Analysis, of course after having normalized the whole data set) in order to plot it in two dimensions.
In short, with this operation we are projecting the 24-dimensional data cloud onto the plane spanned by the two directions of maximum variance and plotting the point projections on that 2-dimensional plane (a code sketch of this step follows below). We have also subdivided the points into three risk categories:
- High risk when there are less than 50 life cycles left;
- Medium risk when we have between 50 and 100 cycles left;
- Low risk when we have more than 100 cycles;
and colored them to better show how engines operating in good conditions differ from those operating in bad conditions. We have also tracked a single engine's trajectory, showing its transition as it wears down to failure.
This figure alone gives us some intuition. First, we clearly see how the Low risk zone is separated (maybe with some overlap; remember that this is still a 2-dimensional projection) from the High risk one. We can also see how the trajectory behaves, persisting in the compact Low risk zone in the first part of its life and then rapidly moving away towards the outskirts of the High risk zone.
Thus we could sketch a quick (-and-not-so-dirty) classifier, just approximately splitting the Low risk region from the Medium risk and so on, and then predicting the engine wear just by looking at where it sits in the space. Furthermore, we could use more advanced clustering techniques, but this goes beyond the purposes of this paper.
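Before moving on, here is a minimal sketch of the projection and risk-labeling step described above, in Python/scikit-learn rather than the original MATLAB code; it assumes the data has already been loaded into a NumPy matrix with the 3 settings and 21 sensors as columns and the per-cycle RUL computed (variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def project_and_label(X, rul):
    """X: (n_samples, 24) settings + sensors; rul: (n_samples,) remaining cycles."""
    # normalize each column before PCA, then keep the two directions of maximum variance
    X_std = StandardScaler().fit_transform(X)
    X_2d = PCA(n_components=2).fit_transform(X_std)
    # the three risk categories used for coloring the scatter plot
    risk = np.where(rul < 50, "High", np.where(rul < 100, "Medium", "Low"))
    return X_2d, risk
```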
Data selection
Basically all the data has already been selected by the original authors in order to be meaningful. Anyway, since we are in the first dataset our engines work in one stationary condition -we can imagine them still on the test bench- so all the settings and some sensors are constant. This means that we will not use all of them. In particular we have discarded all the settings and kept just 13 out of the 21 sensors.
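A rough sketch of this selection step (a pandas-based reconstruction, not the original code; the variance threshold and column names are assumptions):

```python
import pandas as pd

def drop_flat_columns(df: pd.DataFrame, threshold: float = 1e-6) -> pd.DataFrame:
    """Remove settings/sensors that are (nearly) constant in this single operating condition."""
    stds = df.std()
    return df[stds[stds > threshold].index]

# usage: keep only the informative sensor columns
# sensors = drop_flat_columns(raw_df.drop(columns=["unit", "cycle"]))
```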
For the Shallow Neural Network we work on a single-cycle basis: a sample consists of all the sensors' information for a given life cycle, formatted as a single vector, and its corresponding output is the RUL of that cycle.
For the Recurrent Neural Network, as we will see, we reason on a sequence basis, meaning that we pick many consecutive cycles (25 in this case) and shape them into a matrix (one direction is the sensor id, the other is time). The output is the RUL of the final cycle.
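A possible sketch of how the sequence samples are built for a single engine (an illustration in Python/NumPy, not the original MATLAB code):

```python
import numpy as np

def make_samples(signals: np.ndarray, rul: np.ndarray, window: int = 25):
    """signals: (n_cycles, n_sensors) for one engine; rul: (n_cycles,) remaining cycles.

    Shallow-net samples are simply (signals[i], rul[i]); here we build the
    sequence samples: X has shape (n_windows, window, n_sensors), y is the
    RUL of each window's last cycle."""
    X, y = [], []
    for end in range(window, len(signals) + 1):
        X.append(signals[end - window:end])  # 25 consecutive cycles
        y.append(rul[end - 1])               # RUL of the final cycle
    return np.asarray(X), np.asarray(y)
```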
Shallow Neural Network
This approach is quite simple: for each observation (a single sample consisting of all the data picked up at a certain timestamp/cycle) we want to directly predict the RUL for that observation. For this purpose we build a simple Neural Network and feed it with our observations.
For the architecture we used a rule of thumb, without a proper "hyperparameter search" (assuming we include the network choice as a hyperparameter): a Neural Network with 3 hidden layers (with respectively 10, 5 and 3 neurons) trained with a Bayesian Regularization algorithm.
Of course, here the purpose of the Neural Network is to capture highly non-linear patterns in the input space and then infer the corresponding RUL. Without wasting more time, to better visualize this fact we refer to this nice and straightforward toy from TensorFlow.
The training "regularization" was a simple early stopping, performed by looking at the training and validation performance.
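A minimal Keras sketch of this network (the original was trained in MATLAB with Bayesian Regularization, which has no direct Keras equivalent; here an L2 penalty plus the early stopping mentioned above stand in for it, and the activation, penalty and patience values are assumptions):

```python
import tensorflow as tf

def build_shallow_net(n_features: int) -> tf.keras.Model:
    reg = tf.keras.regularizers.l2(1e-3)  # stand-in for Bayesian regularization
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(10, activation="tanh", kernel_regularizer=reg),
        tf.keras.layers.Dense(5, activation="tanh", kernel_regularizer=reg),
        tf.keras.layers.Dense(3, activation="tanh", kernel_regularizer=reg),
        tf.keras.layers.Dense(1),  # predicted RUL
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# early stopping on the validation loss
# stop = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=500, callbacks=[stop])
```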
Recurrent Neural Network
The Neural Network is certainly a powerful tool for detecting non-linearities, but we can go a little further. Knowing the time-series nature of the data, we can enrich the input information: not just the single instant observation, but a sample containing the pattern of subsequent observations. Why? Well, as we have seen, the "wear trajectory" is not smooth, so predicting the RUL from a single point can be tricky if not a little misleading. Conversely, if we feed our algorithm with the current point and the previous ones we should reach better results. In particular we chose a network architecture specialized for time series: an RNN (Recurrent Neural Network) with LSTM units. A discussion is due here on the choice of unit type: the LSTM units were chosen due to MATLAB restrictions (yes, I started this project with MATLAB; of course using Python could have given me a little more flexibility): for RNNs only LSTM units are available there. But thinking about the data structure, we can imagine that the engine decay is a continuous process, so we probably do not need cells capable of memorizing events over significantly long timespans (the feature that characterizes LSTM cells), and simpler units, e.g. GRU units, could do the job.
Anyway, the RNN stack is: Input layer, LSTM layer (return sequences), Dropout layer, LSTM layer (return last), Dropout layer, Fully Connected layer, Regression layer.
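A Keras equivalent of this stack could look as follows (the original was built with MATLAB layers; the unit counts and dropout rates below are illustrative, as they are not specified above):

```python
import tensorflow as tf

def build_rnn(window: int = 25, n_sensors: int = 13) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, n_sensors)),
        tf.keras.layers.LSTM(64, return_sequences=True),   # LSTM layer (return sequences)
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(32, return_sequences=False),  # LSTM layer (return last)
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1),  # fully connected + regression output (predicted RUL)
    ])
    return model
```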
Training was performed in the "classic" way: running and manually tuning the learning rate (gradually reducing it) and the batch size until the RMSE converged, i.e. until the training error stabilized, so we can infer that we've reached the maximum accuracy and further training would bring no more improvement. Of course we know that this is not the whole story, that we could do some architecture search and apply some novel techniques such as cycling the learning rate and so on, but we also want to keep it simple, so for the moment we will say that we are satisfied with this.
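A sketch of this manual, staged training (learning rates, batch size and epoch counts are illustrative; recompiling keeps the learned weights while restarting the optimizer with a lower rate):

```python
import tensorflow as tf

# build_rnn and the windowed data (X_seq_train, y_seq_train, ...) come from the
# sketches above; all of these names are illustrative, not the original code
model = build_rnn()
for lr in (1e-3, 3e-4, 1e-4):  # gradually reduce the learning rate
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    model.fit(X_seq_train, y_seq_train, batch_size=64, epochs=50,
              validation_data=(X_seq_val, y_seq_val))
```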
Training, validation and test
After reaching a relatively good outcome from the networks' training we can try to visualize how they perform on new samples from the test set.
An immediate way is plotting for each sample the predicted RUL against the true one.
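Such a plot is straightforward to produce; a matplotlib sketch follows, where the prediction arrays are assumed to come from the two trained models (names are illustrative):

```python
import matplotlib.pyplot as plt

# y_test: true RUL; preds_shallow / preds_rnn: predictions of the two networks (assumed)
plt.scatter(y_test, preds_shallow, s=8, alpha=0.4, label="Shallow NN")
plt.scatter(y_test, preds_rnn, s=8, alpha=0.4, label="LSTM RNN")
plt.plot([0, y_test.max()], [0, y_test.max()], "k-", label="100% accuracy")  # black diagonal
plt.xlabel("True RUL [cycles]")
plt.ylabel("Predicted RUL [cycles]")
plt.legend()
plt.show()
```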
Two facts:
- in the early stage there's a lot of confusion and both predictors seem pretty inaccurate, overestimating the engine wear. But as we move towards the end of life the predictions begin to lie around the 100% accuracy curve (the black diagonal line), and fortunately this is the stage we are most interested in;
- it is clear that the LSTM network performs way better than the simple Neural Network.
Finally we plot the distribution of the prediction error, checking it over different percentiles of the engine population, and obtaining a "distribution cone".
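The exact construction of this plot is not spelled out here; one plausible reconstruction groups the prediction error by true RUL and tracks a few percentiles across the engine population:

```python
import numpy as np

def error_cone(y_true, y_pred, percentiles=(5, 25, 50, 75, 95)):
    """For each true-RUL value, compute percentile bands of the prediction error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    err = y_pred - y_true
    ruls = np.unique(y_true)
    bands = {p: np.array([np.percentile(err[y_true == r], p) for r in ruls])
             for p in percentiles}
    return ruls, bands  # plotting each band against ruls gives the "cone"
```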
Now, this was if we want to directly predict the RUL. Imagine instead that we are not interested in this in-depth knowledge, but simply want an alarm to fire whenever the engine is operating in near-failing conditions. For this purpose we can recover the three risk categories from the introduction and move to a binary problem, i.e. whether the engine has crossed the Warning threshold (Medium risk zone) or not, and whether it has crossed the Alarm threshold (High risk zone) or not.
From this new perspective we can investigate further by plotting the ROC (Receiver Operating Characteristic) curve for both the Warning and Alarm threshold-crossing binary problems. This curve illustrates how the true positive rate changes against the false positive rate as we move the threshold used to discriminate whether the engine is, for example, in Alarm or not. If the curve lies on the diagonal, we are unable to predict anything; as it moves towards the upper-left corner we have more prediction power, until we reach perfect accuracy when it overlaps with the borders of the graph.
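A sketch of how the two binary problems and their ROC curves can be derived from the regression output (the thresholds follow the risk categories defined earlier; variable names are assumptions):

```python
from sklearn.metrics import roc_curve, auc

# y_test: true RUL, preds_rnn: predicted RUL (assumed from the previous steps)
for name, threshold in (("Warning", 100), ("Alarm", 50)):
    labels = (y_test < threshold).astype(int)  # 1 = engine has crossed the threshold
    scores = -preds_rnn                        # lower predicted RUL -> higher alarm score
    fpr, tpr, _ = roc_curve(labels, scores)
    print(f"{name} AUC: {auc(fpr, tpr):.3f}")
```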
Just a mention. At this point I personally had an intuition/doubt: looking at the "distribution cone" plot we can see how the predictions drift away from the zero-error line, so why not try to re-parameterize the outputs to bend the median onto the zero line (using a monotone transformation function)?
Well, if we look at the ROC we would see no difference at all (just recall the definitions of true positive rate and false positive rate and what would happen if we applied a monotone transformation on top of those, how the threshold values would change, and so on). Looking at the cone instead, we would end up with more dispersed predictions and a wider distribution (just imagine horizontal sections of the regression plot and the reassignment of the predictions moving from left to right along the section). So it was worth a try, but it didn't pay off.
To conclude, we find a valid threshold value for both binary problems and then we investigate when the alarms would fire over time and how accurate they are across the whole engine population.
Here we see what percentage of alarms (one for each of the 100 engines) has fired over time. Of course changing the threshold would translate the curves left or right; anyway we want a good compromise between false positives and false negatives (or true positives, staying consistent with the ROC).
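A sketch of how this firing curve can be computed (assuming the per-engine predicted-RUL series are available; the data structure and names are illustrative):

```python
import numpy as np

def first_alarm_cycles(preds_by_engine, threshold):
    """preds_by_engine: list of per-engine arrays of predicted RUL, one value per cycle.
    Returns the first cycle at which each engine's alarm fires (np.inf if it never does)."""
    first_fire = []
    for preds in preds_by_engine:
        below = np.where(np.asarray(preds) < threshold)[0]
        first_fire.append(below[0] if below.size else np.inf)
    return np.asarray(first_fire, dtype=float)

# cumulative percentage of fired alarms over time:
# fired = first_alarm_cycles(preds_by_engine, threshold=50)
# pct = [(fired <= t).mean() * 100 for t in range(max_cycles)]
```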
Conclusions
We started with the most complete information and we got a pretty accurate model. Not only can we get a good estimate of how many life cycles an engine can still support -especially in the most interesting end-of-life phase- but we have also set up an alarm that gives us a good safety margin before the engine fails. Just look at the last picture: we would have been warned for every single engine before any eventual catastrophic event. And while we could have some false Warnings, we will have few if any false Alarms.
Even the simple Neural Network behaves well, especially given the stationary nature of the data. Anyway, the RNN performs much better: as we have seen, its input information is richer and more complete. But there is another advantage: due to the nature of the network architecture we could expect good results not only for stationary data but also for other kinds of behavior.
But remember what we said at the beginning: for these stunning results we had to rely on a precious but rare piece of information -the RUL- so we are still unarmed in front of the majority of cases, which are unsupervised. So, in the next issue we will try to deal with that.