Intro to Predictive Maintenance on NASA turbofan engine dataset using Machine Learning

An explanation of what Predictive Maintenance is, and a demonstration of how a PdM algorithm may be implemented in the real world.

Predictive maintenance (PdM) is maintenance that monitors the performance and condition of equipment during normal operation to reduce the likelihood of failures.

There are generally three different types of maintenance:

  • Reactive maintenance is the process of repairing assets to standard operating conditions after poor performance or breakdown is observed.
  • Preventive maintenance usually occurs on some type of schedule. Preventive maintenance is designed to keep machinery and parts in good condition but does not take the state or process into account.
  • Predictive maintenance occurs as needed, drawing on real-time collection and analysis of machine operation data to identify issues before they can interrupt production. With predictive maintenance, repairs happen during machine operation and address an actual problem. If a shutdown is required, it will be shorter and more targeted.

While the planned downtime in preventive maintenance may reduce overall capacity and/or availability, it is favoured over the unplanned downtime of reactive maintenance, where costs and duration may be unknown until the problem is diagnosed and addressed. Unplanned downtime is also likely to interrupt other scheduling and planning, which causes further time losses downstream.

The aim of this post is to demystify some technical aspects of predictive maintenance through a Python solution to a real-world problem: turbofan engine degradation.

Problem Statement

Our task is to determine whether a Machine Learning model could be used to perform Predictive Maintenance on turbofan engines. For the purposes of this tutorial, we will assume that the following information has been ascertained through consultation with the company operating the turbofans:

  • The maintenance schedule of the turbofans is flexible. There would be no use carrying out this analysis if the schedule cannot be changed.
  • The analysis would generate long-term value for the operating company.

Given that the above points are true, the problem now lies in the analysis. We will use sensor data to predict the Remaining Useful Life (RUL) of turbofan engines. This RUL prediction can then be used to facilitate predictive maintenance.

Dataset Description

The data used in this notebook is based on a subset of the popular NASA Turbofan Engine Degradation Simulation Data Set. It contains data for 100 different turbofans.

Engine degradation was simulated using C-MAPSS. Four different sets were simulated under different combinations of operational conditions and fault modes; we use the third set (FD003) here. Each set records several sensor channels to characterise fault evolution. The data set was provided by the Prognostics CoE at NASA Ames.


The first step is to import the libraries:

import pandas as pd
import numpy as np
import scipy as sp
import scipy.signal as ss
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost
import catboost
plt.rcParams['figure.figsize'] = 20, 20

Next, we import the data into Pandas data frames. The data are in the form of text files, which can be obtained from the above link. There is one data frame for the training set, one for the test set, and one holding the true RUL values for the test set:

index_names = ['unit_number', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i+1) for i in range(0,21)]
col_names = index_names + setting_names + sensor_names
directory = r'C:\PATH\TO\YOUR\DATA\FOLDER'
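# NOTE: the original read_csv calls were truncated; the whitespace separator,
# header=None and the column names below are assumed, not taken from the source
train_df = pd.read_csv(directory+r'\train_FD003.txt', sep=r'\s+', header=None, names=col_names)
train = train_df.copy()
test_df = pd.read_csv(directory+r'\test_FD003.txt', sep=r'\s+', header=None, names=col_names)
test = test_df.copy()
y_test = pd.read_csv(directory+r'\RUL_FD003.txt', sep=r'\s+', header=None, names=['RUL'])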

At this stage, we note that we will be using only a training and a test set. In practice, a validation set should also be used to ensure the model works well on multiple test sets.
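One way to carve out a validation set is to hold out whole engines rather than individual rows, so that cycles from the same engine never end up on both sides of the split. The sketch below does this on a small synthetic frame. The helper name `split_by_unit`, the 20% fraction and the seed are illustrative assumptions and are not part of the original code base.

```python
import numpy as np
import pandas as pd


def split_by_unit(df, val_fraction=0.2, seed=42):
    """Hold out a random subset of engines (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    units = df['unit_number'].unique()
    n_val = max(1, int(round(len(units) * val_fraction)))
    val_units = rng.choice(units, size=n_val, replace=False)
    is_val = df['unit_number'].isin(val_units)
    return df[~is_val].copy(), df[is_val].copy()


# Synthetic stand-in for the training frame: 10 engines, 5 cycles each
toy = pd.DataFrame({
    'unit_number': np.repeat(np.arange(1, 11), 5),
    'time_cycles': np.tile(np.arange(1, 6), 10),
})
train_part, val_part = split_by_unit(toy)
```

Splitting by engine matters here because consecutive cycles of one engine are highly correlated; a row-wise random split would make the validation score look better than it should.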

Now we can perform some basic diagnostic analysis of the data to confirm that it has been imported correctly, and to get an idea of what the entire data set looks like through descriptive statistics:

(24720, 26)
Table of train head
Table showing the trained data for location

We can see that there is a row for each time cycle for each unit, and that there is a column for each sensor reading.

We now define a function that adds a column “RUL” to a given data frame based on a unit’s maximum cycle number and the time cycle it is in currently. We use this function to modify the training set:

def add_remaining_useful_life(df):
    grouped_by_unit = df.groupby(by='unit_number') 
    max_cycle = grouped_by_unit['time_cycles'].max() 
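    # NOTE: the merge keys were truncated in the original; joining on unit_number is assumed
    result_frame = df.merge(max_cycle.to_frame(name='max_cycle'), left_on='unit_number', right_index=True)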
    # Calculate remaining useful life for each row 
    remaining_useful_life = result_frame["max_cycle"] - result_frame['time_cycles']
    result_frame["RUL"] = remaining_useful_life 
    # drop max_cycle as it's no longer needed 
    result_frame = result_frame.drop("max_cycle", axis=1) 
    return result_frame
train = add_remaining_useful_life(train)
Table with trained data after RUL
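As a quick check of the idea, the same RUL column can be computed on a tiny hand-made frame. This sketch uses `groupby(...).transform('max')` instead of a merge, which is an alternative formulation rather than the tutorial's own function; the toy data are made up for illustration.

```python
import pandas as pd

# Two hypothetical engines: unit 1 fails after 3 cycles, unit 2 after 2
toy = pd.DataFrame({
    'unit_number': [1, 1, 1, 2, 2],
    'time_cycles': [1, 2, 3, 1, 2],
})

# Broadcast each unit's last observed cycle back onto its rows
max_cycle = toy.groupby('unit_number')['time_cycles'].transform('max')
toy['RUL'] = max_cycle - toy['time_cycles']
print(toy['RUL'].tolist())  # [2, 1, 0, 1, 0]
```

Each engine's RUL counts down to 0 at its final recorded cycle, which is the behaviour the training target relies on.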

We then find the maximum RUL for each unit and visualise their frequencies using a histogram:

max_ruls = train.groupby('unit_number').max().reset_index()
Table with data after max RUL
Histogram of max RUL results

The distribution of maximum RULs is right-skewed (it looks roughly log-normal), with most engines lasting between 150 and 250 cycles. One insight we can gather from this is that, if there were many more simulations, we could be fairly confident that an engine's total life would almost never be less than about 150 cycles. This information could be used to clip the RUL at a particular maximum, which may make the algorithm more reliable and accurate in production.

Visualising sensor signals

Now we investigate what the sensor signals look like, which will help determine “good” and “bad” sensors (i.e. sensors that contain a lot of information vs ones that don’t). We use a function to plot the sensor signals for every 10th turbofan. Plots for sensors 1, 2, 6 and 7 are displayed below as examples:
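A plotting helper of the kind described might look like the sketch below. The function name `plot_signal`, its parameters and the synthetic data are assumptions made for illustration; the original plotting code is not shown in this post.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_signal(df, signal_name, every=10, ax=None):
    """Plot one sensor against RUL for every `every`-th unit (hypothetical helper)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(13, 5))
    plotted_units = []
    for unit in df['unit_number'].unique():
        if unit % every == 0:
            unit_df = df[df['unit_number'] == unit]
            ax.plot(unit_df['RUL'], unit_df[signal_name])
            plotted_units.append(int(unit))
    ax.set_xlim(df['RUL'].max(), 0)  # reversed axis: failure sits on the right
    ax.set_xlabel('Remaining Useful Life')
    ax.set_ylabel(signal_name)
    return ax, plotted_units


# Synthetic stand-in: 30 engines, 50 cycles each, one slowly drifting sensor
rng = np.random.default_rng(0)
frames = []
for unit in range(1, 31):
    rul = np.arange(49, -1, -1)
    frames.append(pd.DataFrame({
        'unit_number': unit,
        'RUL': rul,
        's_2': 642.0 + 0.02 * (49 - rul) + rng.normal(0.0, 0.2, size=50),
    }))
toy = pd.concat(frames, ignore_index=True)
ax, plotted_units = plot_signal(toy, 's_2')
```

Plotting against RUL rather than against the cycle number lines all engines up at the point of failure, which makes a shared degradation trend easier to spot.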

Brief sensor analysis

Looking at all visualisations, it seems as though sensors 1 (included above), 5, 16, 18 and 19 have very little to no information to help predict the RUL. These will be removed before prediction below to help the speed and generalisation of the algorithm. Further testing may reveal that sensors with patterns like sensor 6 (above) should also be removed from the features.

Next, we prepare the training and test sets:

drop_sensors = ['s_1','s_5','s_16','s_18','s_19']
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']
X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)

X_train and X_test contain only the sensor readings, whilst y_train and y_test contain the RUL values.

Now, we define a function that evaluates the total error in our predictions. We have chosen the root mean square error (RMSE) as our error metric, although there may be other metrics that would work better:

def evaluate(y_true, y_hat, label='test'):
    mse = mean_squared_error(y_true, y_hat)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_hat)  # coefficient of determination
    print('{} set RMSE:{}, R2:{}'.format(label, rmse, r2))
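To make the two metrics concrete, the sketch below computes them by hand with NumPy on three made-up values. The numbers are purely illustrative.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the same units as RUL (cycles)
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus the ratio of residual variation to variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

Note that R2 becomes negative whenever the residual sum of squares exceeds the total sum of squares, i.e. when the model does worse than always predicting the mean of the true values.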

Training the algorithm

This section of the tutorial will cover the building and implementation of a Random Forest Regression model. It is good to test multiple different types of models in practice – as such, XGBoost and CatBoost models were also built and tested, but that will not be covered here.

We note here that many model types (e.g. SVMs, regularised linear models and neural networks) need or benefit from feature scaling to normalise the data before training. All of the models built as part of this process are tree-based regressors, which split on feature thresholds and therefore do not require feature scaling.
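The claim that tree-based models are insensitive to feature scaling can be checked empirically. The sketch below fits the same decision tree on raw and on rescaled synthetic features and compares the predictions; the data, scale factor and tree depth are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 0.1, size=200)

# Rescale and shift every feature, much as a scaler would
X_scaled = X * 1000.0 + 50.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# Splits depend only on the ordering of feature values, so predictions should match
same_predictions = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
```

A distance-based or gradient-based model would generally not pass this check, which is why scaling matters for those model families.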

Building and training the model is surprisingly straightforward thanks to Scikit-learn’s powerful Random Forest module:

rf = RandomForestRegressor(max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

# predict and evaluate
y_hat_train = rf.predict(X_train)
evaluate(y_train, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:21.025240351169202, R2:0.9547545019533699 
test set RMSE:46.358798830427006, R2:-0.254167389706895

We will now interrogate our results through a visualisation to confirm whether they are reasonable. It is always important to visualise the predictions against the actual values, as a single metric does not always tell the whole story. For example, most predictions may be good but there could be a large outlier which would render the model unacceptable to put into production.
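A predicted-versus-actual plot is one simple way to do this. The sketch below is a minimal, hypothetical helper (`plot_predicted_vs_actual` is not from the original code base) run on synthetic values that are deliberately biased upwards.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_predicted_vs_actual(y_true, y_hat, ax=None):
    """Scatter predictions against true values with a y = x reference line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_hat = np.asarray(y_hat, dtype=float).ravel()
    ax.scatter(y_true, y_hat, alpha=0.6)
    upper = max(y_true.max(), y_hat.max())
    ax.plot([0, upper], [0, upper], linestyle='--', color='grey')  # perfect prediction
    ax.set_xlabel('Actual RUL')
    ax.set_ylabel('Predicted RUL')
    return ax


# Synthetic example: predictions that overshoot by about 15 cycles on average
rng = np.random.default_rng(1)
actual = rng.uniform(5.0, 120.0, size=100)
predicted = actual + rng.normal(15.0, 20.0, size=100)
ax = plot_predicted_vs_actual(actual, predicted)
over_share = np.mean(predicted > actual)  # fraction of overestimates
```

Points above the dashed line are overestimates of the RUL, so a cloud sitting mostly above the line signals a systematic bias that a single RMSE figure would hide.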

Our algorithm appears to overestimate in its predictions. We will try to remedy this by clipping the RUL as mentioned previously. Applying RUL clipping makes the maximum RUL value 115, which helps the predictions as we will see below. Intuitively, this makes sense: the sensor values with a 200 RUL are similar to those with a 115 RUL, so the algorithm will not be able to distinguish between these well. Also, the maximum RUL in the test set is 115, which is around the point where the sensor readings start to change significantly (referring to the line graphs above).

drop_sensors = ['s_1','s_5','s_16','s_18','s_19']
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']
X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
y_train_clipped = y_train.clip(upper=115)  # apply RUL clipping
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)
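To see what the clipping step does to the target, consider a single hypothetical engine that runs for 200 cycles. The sketch below is illustrative only and uses made-up data.

```python
import pandas as pd

# Linear RUL for one hypothetical engine: 199, 198, ..., 1, 0
rul = pd.Series(range(199, -1, -1))
rul_clipped = rul.clip(upper=115)

# The target is now flat at 115 early in life, then counts down linearly
n_flat = int((rul_clipped == 115).sum())
print(rul_clipped.iloc[0], rul_clipped.iloc[85], rul_clipped.iloc[-1], n_flat)  # 115 114 0 85
```

The clipped target is piecewise linear: constant while the engine is healthy, then decreasing once degradation is expected to be visible in the sensors.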

Now, we retrain the model:

rf = RandomForestRegressor(max_features="sqrt", random_state=42)
rf.fit(X_train, y_train_clipped)

# predict and evaluate
y_hat_train = rf.predict(X_train)
evaluate(y_train_clipped, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:5.158896101752112, R2:0.9802797863716584 
test set RMSE:20.225672522811198, R2:0.761275442379237

After clipping the RUL, we see a decrease in the RMSE by more than half, indicating that our algorithm now makes better predictions. This is visualised below:

Histogram of results with RUL clipping

Another step we can take to improve the predictive power of our algorithm is signal smoothing. We can see from the sensor reading plots that the signals are “noisy”, i.e. they contain a lot of short-term variance. To reduce this, we smooth the signals using SciPy’s Savitzky-Golay filter:

def apply_scipy_filter(df, scipy_filter):
    for unit in df['unit_number'].unique():
        for sensor in df.loc[:,'s_1':]:
            if sensor != 'RUL': 
                df.loc[df['unit_number']==unit,sensor] = scipy_filter(df.loc[df['unit_number']==unit,sensor], 
                window_length=19, polyorder=1, deriv=0, mode='interp')  
    return df

We apply the filter three times, then visualise the results:

train = apply_scipy_filter(train, ss.savgol_filter)
train = apply_scipy_filter(train, ss.savgol_filter)
train = apply_scipy_filter(train, ss.savgol_filter)

test = apply_scipy_filter(test, ss.savgol_filter)
test = apply_scipy_filter(test, ss.savgol_filter)
test = apply_scipy_filter(test, ss.savgol_filter)
Line chart with S1 smoothed
Line chart with S2 Smoothed
Line chart with S6 smoothed
Line chart with S7 smoothed

The plots are now much smoother.
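The effect of the filter can also be quantified on a synthetic signal whose true trend is known. The sketch below uses the same filter settings as above (window of 19, first-order polynomial, three passes); the signal itself is made up for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
cycles = np.arange(200)
trend = 642.0 + 0.01 * cycles  # slowly drifting "true" signal
noisy = trend + rng.normal(0.0, 0.3, size=cycles.size)

# Three smoothing passes with the same settings as in the tutorial
smoothed = noisy
for _ in range(3):
    smoothed = savgol_filter(smoothed, window_length=19, polyorder=1,
                             deriv=0, mode='interp')

# Compare how far each version sits from the known trend
rmse_noisy = np.sqrt(np.mean((noisy - trend) ** 2))
rmse_smoothed = np.sqrt(np.mean((smoothed - trend) ** 2))
```

On real sensor data the true trend is unknown, so the window length is a judgement call: a wider window removes more noise but can also blur genuine changes near the end of an engine's life.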

The model may now be fit to the transformed data:

drop_sensors = ['s_1','s_5','s_16','s_18','s_19'] 
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']

X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
y_train_clipped = y_train.clip(upper=125)  # apply RUL clipping

# Since the true RUL values for the test set are only provided for the last time cycle of 
# each engine, the test set is subsetted to represent the same
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)
train set RMSE:1.65129502601594, R2:0.998347906044556 
test set RMSE:18.805190001699, R2:0.7936299799906349

The RMSE on the training set has been reduced substantially again, but it did not decrease by much on the test set. This suggests that our model may be overfitted: it is very accurate on the training set but much less accurate on the test set.
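One way to estimate out-of-sample error without touching the test set is cross-validation that keeps each engine's cycles in the same fold. The sketch below uses scikit-learn's `GroupKFold` on synthetic data; the data, the number of folds and the model settings are assumptions made for illustration and were not part of the original analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_units, n_cycles = 12, 40
groups = np.repeat(np.arange(n_units), n_cycles)  # engine id for every row
rul = np.tile(np.arange(n_cycles - 1, -1, -1), n_units).astype(float)
X = np.column_stack([
    -0.05 * rul + rng.normal(0.0, 0.3, size=rul.size),  # sensor that drifts with wear
    rng.normal(0.0, 1.0, size=rul.size),                 # uninformative sensor
])

model = RandomForestRegressor(n_estimators=20, max_features="sqrt", random_state=42)
scores = cross_val_score(model, X, rul, groups=groups,
                         cv=GroupKFold(n_splits=4),
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores  # scikit-learn reports negated RMSE so that higher is better
```

A cross-validated RMSE far above the training RMSE points to the same overfitting seen in the train/test comparison above.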

An additional step we took (not covered in this tutorial) was to optimise the model, given our training and test data, using Bayesian Optimisation. We then fed the parameters obtained through this optimisation into the Random Forest model:

rf = RandomForestRegressor(n_estimators = 32,
                            max_depth = 22,
                            min_samples_split = 6,
                            max_features = 1,
                            min_samples_leaf = 8,
                            random_state = 42)
rf.fit(X_train, y_train_clipped)

y_hat_train = rf.predict(X_train)
evaluate(y_train_clipped, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:8.092176262356801, R2:0.9603250795515506 
test set RMSE:19.824936715182147, R2:0.7706415346514391

This accuracy looks worse than before, but the model generalises better: the gap between its performance on the training set and on the test set is much smaller. It is important to remember that, with these parameters, the algorithm performs well across multiple different training sets. If the Bayesian Optimisation algorithm had been allowed to run for longer, or if more arguments had been allowed to vary, even better accuracy might have been achieved.

Potential (further) improvements

Many more improvements and adjustments are possible in models like this. Some of them are:

  • Adding features. It would be possible to generate new features from already-existing features, and perhaps find extra data to add to the data set. Some examples are:
    • Lagged features, i.e. columns for time t-1, t-2 etc.
    • Derivatives of features, i.e. average rate of change over a lagged period
    • Geographic data e.g. location, humidity, ambient temp
  • Detailed feature analysis. In practice, it is good to have an expert analyse the data set and determine which features to use in the model. It is also possible to employ feature selection methods to avoid problems such as the algorithm fitting to noisy data. Further statistical analysis may also be used to find features particularly prone to under/overfitting.
  • Voting/stacking regressors. Some implementations have multiple regression models working in tandem (known as ensemble learning), which reduces variance in predictions and may increase accuracy.
  • Using a CNN to capture more steps. CNNs are used primarily in Computer Vision problems, but they are also useful for capturing data from multiple time steps simultaneously.
  • Using a different loss function. In problems like this, overestimating the RUL is potentially far more dangerous and costly than underestimating it. So, it may be prudent to choose a loss function that penalises overestimates more heavily than underestimates.
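As an illustration of the last point, the sketch below defines a simple asymmetric squared error in which overestimates are weighted more heavily. The function name and the weight of 3 are hypothetical choices; in practice the weight should reflect the relative cost of late versus early maintenance.

```python
import numpy as np


def asymmetric_squared_error(y_true, y_hat, over_weight=3.0):
    """Mean squared error where overestimates count `over_weight` times as much."""
    error = np.asarray(y_hat, dtype=float) - np.asarray(y_true, dtype=float)
    weights = np.where(error > 0, over_weight, 1.0)  # error > 0 means RUL was overestimated
    return float(np.mean(weights * error ** 2))


y_true = np.array([50.0, 50.0])
late = asymmetric_squared_error(y_true, y_true + 10.0)   # predicts 10 cycles too many
early = asymmetric_squared_error(y_true, y_true - 10.0)  # predicts 10 cycles too few
print(late, early)  # 300.0 100.0
```

The same 10-cycle error costs three times as much when it is an overestimate, which pushes a model tuned against this metric towards conservative predictions.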

That’s the end of the tutorial – well done for reading through to the end! If you think that you or your business could benefit from a predictive Machine Learning implementation like this, please contact the 4CDA team. We would love to help!


A special thank you goes to 4CDA alumnus James White for creating the initial code base used in the tutorial section.

Please find the link to the full codebase for this example.
