Intro to Predictive Maintenance on NASA turbofan engine dataset using Machine Learning

An explanation of what Predictive Maintenance is, and a demonstration of how a PdM algorithm may be implemented in the real world.

An explanation of what Predictive Maintenance is, and a demonstration of how a PdM algorithm may be implemented in the real world.

Predictive maintenance (PdM) is maintenance that monitors the performance and condition of equipment during normal operation to reduce the likelihood of failures

There are generally three different types of maintenance:

  • Reactive maintenance is the process of repairing assets to standard operating conditions after poor performance or breakdown is observed.
  • Preventive maintenance usually occurs on some type of schedule. Preventive maintenance is designed to keep machinery and parts in good condition but does not take the state or process into account.
  • Predictive maintenance occurs as needed, drawing on real-time collection and analysis of machine operation data to identify issues before they can interrupt production. With predictive maintenance, repairs happen during machine operation and address an actual problem. If a shutdown is required, it will be shorter and more targeted.

While the planned downtime in preventive maintenance may cause a decrease in overall capacity and/or availability, it is favoured over the unplanned downtime of reactive maintenance, where costs and duration may be unknown until the problem is diagnosed and addressed. It is also likely to interrupt other scheduling and planning which will cause further downstream time losses.

The aim of this post is to demystify some technical aspects of predictive maintenance through a Python solution to a real-world problem: turbofan engine degradation.

Problem Statement

Our task is to determine whether a Machine Learning model could be used to perform Predictive Maintenance on turbofan engines. For the purposes of this tutorial, we will assume that the following information has been ascertained through consultation with the company operating the turbofans:

  • The maintenance schedule of the turbofans is flexible. There would be no use carrying out this analysis if the schedule cannot be changed.
  • The analysis would generate long-term value for the operating company.

Given that the above points are true, the problem now lies in the analysis. We will use sensor data to predict the Remaining Useful Life (RUL) of turbofan engines. This RUL prediction can then be used to facilitate predictive maintenance.

Dataset Description

The data used in this notebook is based off a subset of the popular NASA Turbofan Engine Degradation Simulation Data Set. It contains data for 100 different turbofans.

Engine degradation simulation was carried out using C-MAPSS. Four different sets were simulated (using set 3 here) under different combinations of operational conditions and fault modes. Records several sensor channels to characterize fault evolution. The data set was provided by the Prognostics CoE at NASA Ames.


The first step is to import the libraries:

import pandas as pd
import numpy as np
import scipy as sp
import scipy.signal as ss
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost
import catboost
plt.rcParams['figure.figsize'] = 20, 20

Next, we import the data into a Pandas data frame. The data are in the form of text files, which can be obtained from the above link. There is a data frame for the training set, the test set and the validation testing data:

index_names = ['unit_number', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i+1) for i in range(0,21)]
col_names = index_names + setting_names + sensor_names
directory = r'C:\PATH\TO\YOUR\DATA\FOLDER'
train_df = pd.read_csv(directory+r'\train_FD003.txt', 
train = train_df.copy()
test_df = pd.read_csv(directory+r'\test_FD003.txt', 
test = test_df.copy()
y_test = pd.read_csv(directory+r'\RUL_FD003.txt', 

At this stage, we note that we will be using only a training and a test set. In practice, a validation set should also be used to ensure the model works well on multiple test sets.

Now we can perform some basic diagnostic analysis of the data to confirm that it has been imported correctly, and to get an idea of what the entire data set looks like through descriptive statistics:

(24720, 26)
Table of train head
Table showing the trained data for location

We can see that there is a row for each time cycle for each unit, and that there is a column for each sensor reading.

We now define a function that adds a column “RUL” to a given data frame based on a unit’s maximum cycle number and the time cycle it is in currently. We use this function to modify the training set:

def add_remaining_useful_life(df):
    grouped_by_unit = df.groupby(by='unit_number') 
    max_cycle = grouped_by_unit['time_cycles'].max() 
    result_frame = df.merge(max_cycle.to_frame(name='max_cycle'), 
    # Calculate remaining useful life for each row 
    remaining_useful_life = result_frame["max_cycle"] - result_frame['time_cycles']
    result_frame["RUL"] = remaining_useful_life 
    # drop max_cycle as it's no longer needed 
    result_frame = result_frame.drop("max_cycle", axis=1) 
    return result_frame
train = add_remaining_useful_life(train)
Table with trained data after RUL

We then find the maximum RUL for each unit and visualise their frequencies using a histogram:

max_ruls = train.groupby('unit_number').max().reset_index()
Table with data after max RUL
Histogram of max RUL results

The distribution of RULs looks log-normal with most of the max RUL data in the 150-250 range. One insight we can gather from this is that, if there were many more simulations, we could be confident that the RUL would almost never be less than 150 cycles. This information could be used to clip predictions at a particular maximum which may make the algorithm more reliable/accurate in production.

Visualising sensor signals

Now we investigate what the sensor signals look like, which will help determine “good” and “bad” sensors (i.e. sensors that contain a lot of information vs ones that don’t). We use a function to plot the sensor signals for every 10th turbofan. Plots for sensors 1, 2, 6 and 7 are displayed below as examples:

Brief sensor analysis

Looking at all visualisations, it seems as though sensors 1 (included above), 5, 16, 18 and 19 have very little to no information to help predict the RUL. These will be removed before prediction below to help the speed and generalisation of the algorithm. Further testing may reveal that sensors with patterns like sensor 6 (above) should also be removed from the features.

Next, we prepare the training and test sets:

drop_sensors = ['s_1','s_5','s_16','s_18','s_19']
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']
X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)

X_train and X_test contain only the sensor readings, whilst y_train and y_test contain the RUL values.

Now, we define a function that evaluates the total error in our predictions. We have chosen the root mean square error (RMSE) as our error metric, although there may be other metrics that would work better:

def evaluate(y_true, y_hat, label='test'):
    mse = mean_squared_error(y_true, y_hat)
    rmse = np.sqrt(mse)
    variance = r2_score(y_true, y_hat)
    print('{} set RMSE:{}, R2:{}'.format(label, rmse, variance))

Training the algorithm

This section of the tutorial will cover the building and implementation of a Random Forest Regression model. It is good to test multiple different types of models in practice – as such, XGBoost and CatBoost models were also built and tested, but that will not be covered here.

We note here that, in most model types (e.g. Linear Regression, SVM), feature scaling is required to normalise the data for training. All of the models built as part of this process are forms of Decision Tree regressors, which do not require feature scaling.

Building and training the model is surprisingly straightforward thanks to Scikit-learn’s powerful Random Forest module:

rf = RandomForestRegressor(max_features="sqrt", random_state=42), y_train)

# predict and evaluate
y_hat_train = rf.predict(X_train)
evaluate(y_train, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:21.025240351169202, R2:0.9547545019533699 
test set RMSE:46.358798830427006, R2:-0.254167389706895

We will now interrogate our results through a visualisation to confirm whether they are reasonable. It is always important to visualise the predictions against the actual values, as a single metric does not always tell the whole story. For example, most predictions may be good but there could be a large outlier which would render the model unacceptable to put into production.

Our algorithm appears to overestimate in its predictions. We will try to remedy this by clipping the RUL as mentioned previously. Applying RUL clipping makes the maximum RUL value 115, which helps the predictions as we will see below. Intuitively, this makes sense: the sensor values with a 200 RUL are similar to those with a 115 RUL, so the algorithm will not be able to distinguish between these well. Also, the maximum RUL in the test set is 115, which is around the point where the sensor readings start to change significantly (referring to the line graphs above).

drop_sensors = ['s_1','s_5','s_16','s_18','s_19']
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']
X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
y_train_clipped = y_train.clip(upper=115)  # apply RUL clipping
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)

Now, we retrain the model:

rf = RandomForestRegressor(max_features="sqrt", random_state=42), y_train_clipped)

# predict and evaluate
y_hat_train = rf.predict(X_train)
evaluate(y_train_clipped, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:5.158896101752112, R2:0.9802797863716584 
test set RMSE:20.225672522811198, R2:0.761275442379237

After clipping the RUL, we see a decrease in the RMSE by more than half, indicating that our algorithm now makes better predictions. This is visualised below:

Histogram of results with RUL clipping

Another step we can take to improve the predictive power of our algorithm is signal smoothing. We can see from the sensor reading plots that the signals are “noisy” – i.e. lots of short-term variance. To fix this, we smooth the signals using Scipy’s Savitsky-Golay filter:

def apply_scipy_filter(df, scipy_filter):
    for unit in df['unit_number'].unique():
        for sensor in df.loc[:,'s_1':]:
            if sensor != 'RUL': 
                df.loc[df['unit_number']==unit,sensor] = scipy_filter(df.loc[df['unit_number']==unit,sensor], 
                window_length=19, polyorder=1, deriv=0, mode='interp')  
    return df

We fit the filter three times, then visualise the results:

train = apply_scipy_filter(train, ss.savgol_filter)
train = apply_scipy_filter(train, ss.savgol_filter)
train = apply_scipy_filter(train, ss.savgol_filter)

test = apply_scipy_filter(test, ss.savgol_filter)
test = apply_scipy_filter(test, ss.savgol_filter)
test = apply_scipy_filter(test, ss.savgol_filter)
Line chart with S1 smoothed
Line chart with S2 Smoothed
Line chart with S6 smoothed
Line chart with S7 smoothed

The plots are now much smoother.

The model may now be fit to the transformed data:

drop_sensors = ['s_1','s_5','s_16','s_18','s_19'] 
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']

X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
y_train_clipped = y_train.clip(upper=125)  # apply RUL clipping

# Since the true RUL values for the test set are only provided for the last time cycle of 
# each engine, the test set is subsetted to represent the same
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)
train set RMSE:1.65129502601594, R2:0.998347906044556 
test set RMSE:18.805190001699, R2:0.7936299799906349

The RMSE on the training set has been reduced substantially again, but it did not decrease by much on the test set. This suggests that perhaps our model is overfitted, which means the model is very accurate on the training set but inaccurate on the test set.

An additional step we took (not covered in this tutorial) was to optimise the model, given our training and test data, using Bayesian Optimisation. We then fed the parameters obtained through this optimisation into the Random Forest model:

rf = RandomForestRegressor(n_estimators = 32,
                            max_depth = 22,
                            min_samples_split = 6,
                            max_features = 1,
                            min_samples_leaf = 8,
                            random_state = 42), y_train_clipped)

y_hat_train = rf.predict(X_train)
evaluate(y_train_clipped, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:8.092176262356801, R2:0.9603250795515506 
test set RMSE:19.824936715182147, R2:0.7706415346514391

This accuracy looks worse than before, but it is more general (not much worse performance on the test set than on the training set). It is important to remember that, with these parameters, the algorithm is performant on multiple different training sets. If the Bayesian Optimisation algorithm had been allowed to run for longer, or if more arguments were allowed to be varied, even better accuracy could be achieved as well.

Potential (further) improvements

Many more improvements and adjustments are possible in models like this. Some of them are:

  • Adding features. It would be possible to generate new features from already-existing features, and perhaps find extra data to add to the data set. Some examples are:
    • Lagged features, i.e. columns for time t-1, t-2 etc.
    • Derivatives of features, i.e. average rate of change over a lagged period
    • Geographic data e.g. location, humidity, ambient temp
  • Detailed feature analysis. In practice, it is good to have an expert analyse the data set and determine which features to use in the model. It is also possible to employ feature selection methods to avoid problems such as the algorithm fitting to noisy data. Further statistical analysis may also be used to find features particularly prone to under/overfitting.
  • Voting/stacking regressors. Some implementations have multiple regression models working in tandem (known as ensemble learning), which reduces variance in predictions and may increase accuracy.
  • Using a CNN to capture more steps. CNN is used primarily in Computer Vision problems, but they are also useful for capturing data from multiple time steps simultaneously.
  • Use a different loss function. In problems like this, overestimating the RUL is potentially far more dangerous and costly than underestimating it. So, it may be prudent to choose a loss function that penalises overestimates more than it does for underestimates.

That’s the end of the tutorial – well done for reading through to the end! If you think that you or your business could benefit from a predictive Machine Learning implementation like this, please contact the 4CDA team. We would love to help!


A special thank you goes to 4CDA alumni James White for creating the initial code base used in the tutorial section.

Please find a the link to the full codebase for this example.

Ever wondered how an Artificial Neural Network works? Here’s a simple tutorial on the basics of implementing an ANN as a solution to a data problem using a real-world example.

In this tutorial, we will use artificial neural networks (ANN), implemented via the popular open source library TensorFlow, to predict the outcome of games of Australian Rules Football (AFL). We will use historical data to train the model and then use those parameters in the predictive model. The process will include an end-to-end approach including data acquisition, cleansing and organising, before the application of the ANN and verification of results.

This tutorial assumes some prior basic knowledge in Python.

What is an artificial neural network (ANN)?

The definition according to Wikipedia for an artificial neural network is:

ANNs, usually simply called neural networks (NNs), are computing systems vaguely inspired by the biological neural networks that constitute animal brains.

– Wikipedia definition

ANNs can solve new problems by analysing previous examples of similar problems. In this example, we have chosen ANNs to demonstrate their power in classification and regression problems. Similar results could be gained using linear or logistic regression (see our introductory post on linear regression), random forest classification, etc..

Project Structure

The structure we’ll follow to create an AFL predictor involves six steps:

  1. Find/create your dataset
  2. Clean the data
  3. Organise the data
  4. Set up the Neural Network
  5. Feed the data into the Neural Network
  6. Verify results

This structure may be generalised to solve many different problems, so I encourage readers to use this article as an example of how to implement an ANN in Python and then use the method to solve an alternative problem. Furthermore, I would suggest choosing a project that you’re passionate and interested in – I found that this motivated me to continue working and tinkering even when I wasn’t sure what to do next.

The final code used in this tutorial is available in this GitHub repository.

Step 1: find/create your dataset

Before we can do anything Data Science-related, we need data. I found a Kaggle dataset containing AFL player data from (almost) all matches between 2012 and 2018. It has a manageable 63,712 rows and 37 columns.

We need to import it into Jupyter Notebook, the Python workbook we’ll be using, before we can see or do anything with it. Let’s start by importing some essential libraries and loading our data:

import tensorflow as tf
import numpy as np
import pandas as pd

TensorFlow is an open source machine learning platform from Google that contains the Neural Networks we will use in our predictor, Numpy is used extensively for matrix and array operations, and Pandas is used to put the data into a data frame just to make our lives easier.

The next part of this step is to load the data using Pandas:

afl = pd.read_csv('C:/path/to/your/csv/file.csv')

This will place the data from your CSV file into a Pandas data frame. Data frames are great for working with large data sets, as they allow for easy row/column operations and sub-setting amongst many other things.

Now that the data is loaded, we can see the column names using the .info() method:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63712 entries, 0 to 63711
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Team                    63712 non-null  object 
 1   Player                  63712 non-null  object 
 2   D.O.B                   63712 non-null  object 
 3   Height                  63712 non-null  int64  
 4   Weight                  63712 non-null  int64  
 5   Position                63712 non-null  object 
 6   Season                  63712 non-null  int64  
 7   Round                   63712 non-null  object 
 8   Date                    63624 non-null  object 
 9   Score                   63624 non-null  float64
 10  Margin                  63624 non-null  float64
 11  WinLoss                 63624 non-null  object 
 12  Opposition              63624 non-null  object 
 13  Venue                   63624 non-null  object 
 14  Disposals               63712 non-null  int64  
 15  Kicks                   63712 non-null  int64  
 16  Marks                   63712 non-null  int64  
 17  Handballs               63712 non-null  int64  
 18  Goals                   63712 non-null  int64  
 19  Behinds                 63712 non-null  int64  
 20  Hitouts                 63712 non-null  int64  
 21  Tackles                 63712 non-null  int64  
 22  Rebound50s              63712 non-null  int64  
 23  Inside50s               63712 non-null  int64  
 24  Clearances              63712 non-null  int64  
 25  Clangers                63712 non-null  int64  
 26  FreesFor                63712 non-null  int64  
 27  FreesAgainst            63712 non-null  int64  
 28  BrownlowVotes           63712 non-null  int64  
 29  ContendedPossessions    63712 non-null  int64  
 30  UncontendedPossessions  63712 non-null  int64  
 31  ContestedMarks          63712 non-null  int64  
 32  MarksInside50           63712 non-null  int64  
 33  OnePercenters           63712 non-null  int64  
 34  Bounces                 63712 non-null  int64  
 35  GoalAssists             63712 non-null  int64  
 36  PercentPlayed           63712 non-null  int64  
dtypes: float64(2), int64(26), object(9)
memory usage: 18.0+ MB

Step 2: clean the data

An essential component of any data-centric project is to clean your data set. Luckily, this one did not require too much cleaning – the only necessary change was to get rid of any rows containing NA entries – but I chose to demonstrate some data handling capabilities of the Pandas library by also adding an “Age” column, which may also be relevant in what we are trying to predict:

import datetime

afl = afl.dropna(axis=0)
afl = afl[afl['WinLoss'] != 'D']
afl['D.O.B'] = pd.to_datetime(afl['D.O.B'])
afl['Date'] = pd.to_datetime(afl['Date'])
age_in_days = (afl['Date']-afl['D.O.B'])
age_in_years = age_in_days.dt.days/365.2425
afl['Age'] = age_in_years

The first line deletes any rows containing NA entries. The second line deletes rows where the match result was a draw – I chose to do this because draws are so infrequent in the AFL that it’s pretty much pointless to try predicting them. The rest of this code block finds each player’s age in years at the time they played each game and adds a column for this data.

Step 3: organise the data

This is arguably the most critical component of the project. We need to decide which data to use, and how to feed it into the ANN. I considered many different approaches to this step, but the structure I settled on is described below:

This may seem complicated, but the code snippets below should help explain what’s going on.

Firstly, the data is grouped by match, a list of players is generated for each of these matches, the team lists are shuffled and the result is converted to a Numpy array:

import random
grouped_data = afl.groupby(['Team','Season','Round','WinLoss','Opposition','Venue'])['Player'].apply(list).reset_index()
players = grouped_data['Player'].to_numpy()
for item in players:
grouped_data['Player'] = players
grouped_data = grouped_data.to_numpy()

We shuffle the team lists to prevent bias towards some players in the neural network. For example, Gary Ablett is one of the best players of all time, so his average stats would probably be in the top 5% of players across the league. His name would be first on almost every one of his team lists, meaning the neural network may assign weights differently to his stats columns than other player columns. Shuffling the team lists ensures we don’t encounter this issue.

We can look at the first entry to see what each array element looks like:

array(['Adelaide', 2012, 'PF', 'L', 'Hawthorn', 'M.C.G.',
       list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])],

We can now see that each array element are themselves arrays containing the team, season, round, result, opposing team, ground and list of players on the team.

The next part of this step is to convert non-numerical information into numerical data so we can feed it into the ANN. The best way of doing this is by using column encoding – we won’t go into too much detail in this article, but there are plenty of tutorials online covering this topic (here is an example). So, we convert the match result, opposing team and ground columns into encoded columns, which requires the OneHotEncoder and ColumnTransformer modules out of Scikit-learn.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct1 = ColumnTransformer([('encoder',OneHotEncoder(),[3,4,5])], remainder='passthrough',sparse_threshold=0)
grouped_data = ct1.fit_transform(grouped_data)

That’s all there is to it – let’s take another look at the first array element to see what each entry looks like after the transformation:

array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 'Adelaide', 2012, 'PF',
       list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])],

Note how the encoded columns were pushed to the front of the array. Now, we need a training set and a test set. The training set will be fed into the ANN and used to develop the model, and the test set will allow us to measure the performance of the model. Usually, I would recommend choosing training and test sets using train_test_split out of Scikit-learn, however I went for a different approach in this project: I chose the data from the 2012-2017 season as my training set and data from the 2018 season as my test set. Firstly, split grouped data:

training = [x for x in grouped_data if x[43]<2018]
test = [x for x in grouped_data if x[43]==2018]

So “training” and “test” are just subsets of grouped_data.

Next, we find an encoded column corresponding to whether the team won or lost and use this as our dependent variable (as this is what we’re trying to predict):

y_train = np.array([x[0] for x in training])
y_train = y_train.reshape(-1,1)
y_test = np.array([x[0] for x in test])
y_test = y_test.reshape(-1,1)

The result is a two-dimensional array with 1’s corresponding to a win and 0’s to a loss:


Next, we find average stats for each player for the respective training and test sets and shuffle the resulting data to prevent bias:

# player_stats is each players' average stats between 2012 and 2017.
player_stats = afl[afl['Season'] < 2018].groupby('Player',sort=False).mean()
player_stats = player_stats.reset_index()
player_stats = player_stats.drop(['Season','Score','Margin'],axis=1)
player_stats = player_stats.sample(frac=1).reset_index(drop=True)
player_stats = player_stats.to_numpy()

# Player average stats for 2018
player_stats_2018 = afl[afl['Season'] == 2018].groupby('Player',sort=False).mean()
player_stats_2018 = player_stats_2018.reset_index()
player_stats_2018 = player_stats_2018.drop(['Season','Score','Margin'],axis=1)
player_stats_2018 = player_stats_2018.sample(frac=1).reset_index(drop=True)
player_stats_2018 = player_stats_2018.to_numpy()

We choose to ignore the season, score and margin columns. I guessed that the “season” column would not improve the prediction, which was later verified through testing. The “score” and “margin” columns are direct indicators of the match result, so if they were included in the analysis the neural network would achieve an accuracy of close to 100% on the training set but would perform very poorly on the test set.

We can see what each entry of these player statistic arrays looks like:

array(['Goodes, Brett', 183.0, 89.0, 17.045454545454547,
       10.772727272727273, 4.045454545454546, 6.2727272727272725,
       0.18181818181818182, 0.09090909090909091, 0.13636363636363635,
       2.6363636363636362, 2.8181818181818183, 2.272727272727273,
       1.3636363636363635, 2.772727272727273, 0.9545454545454546, 1.0,
       0.09090909090909091, 6.2727272727272725, 9.909090909090908,
       0.22727272727272727, 0.0, 1.8636363636363635, 0.5,
       0.18181818181818182, 78.5, 29.979335024613157], dtype=object)

Now we want to combine our encoded venue and opposition data with our average player data. Firstly, we get our encoded data from the “training” and “test” arrays we created previously:

opp_teams_train = np.array([x[2:42] for x in training])
opp_teams_test = np.array([x[2:42] for x in test])

The following “for” loop then reads the list of players on a team for each match, finds each player’s average stats in player_stats and adds them to an array, i.e. each row of X_train contains every player statistic for each game. The same thing is done for player_stats_2018, with the results stored in X_test:

X_train = [0]*len(training)
for i in range(0,len(training)):
    player_list = []
    j = 0
    for j in range(0,len(player_stats)):
        if player_stats[j][0] in training[i][-1]:
    X_train[i] = player_list

X_train = [np.concatenate(x) for x in X_train]
X_test = [0]*len(test)
for i in range(0,len(test)):
    player_list = []
    j = 0
    for j in range(0,len(player_stats_2018)):
        if player_stats_2018[j][0] in test[i][-1]:
    X_test[i] = player_list

X_test = [np.concatenate(x) for x in X_test]

The final part of this data engineering step involves something called feature scaling. Again, we won’t explore this concept in-depth, but there’s a lot of learning material out there if you’re curious (this Towards Data Science article, for example). Essentially, feature scaling improves the convergence rate of the ANN we’re about to create. We only need to scale our non-binary data, so we can leave our dependent variable arrays and opposition/venue data alone.

For this project I just used StandardScaler class out of the Scikit-learn pre-processing module:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled_X_train = sc.fit_transform(X_train)
scaled_X_test = sc.fit_transform(X_test)

After feature scaling, we just combine our player and opposition/venue data, which gives us our matrix of features:

scaled_X_train = np.hstack((scaled_X_train, opp_teams_train))
scaled_X_train = np.asarray(scaled_X_train).astype(np.float32)

scaled_X_test = np.hstack((scaled_X_test, opp_teams_test))
scaled_X_test = np.asarray(scaled_X_test).astype(np.float32)

Run the below print statement to check that the data combined successfully:


The reason we now have negative values is because of the feature scaling.

Step 4: building the ANN

This part of the project is surprisingly simple, as all the heavy lifting is done by the TensorFlow library. We’ll use a network with two hidden layers, each with six neurons and both utilising the rectified linear unit (ReLU) activation function. ReLU is one of the most diverse and adaptable activation functions, which is why we’re using it. We’re opting for this network layout for simplicity, but feel free to experiment by adding more layers and/or neurons! For the output layer, we will use a sigmoid activation function as we are looking for a probabilistic output. For more reading on how a basic neural network functions, there is plenty of online material.

In TensorFlow (tf), the layers are added individually using the .add() method:

# Build the ANN
ann = tf.keras.models.Sequential()
# First hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Second hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Output layer
ann.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))

We then compile the ANN using the Adam optimiser and the binary cross-entropy loss function. Adam is selected again for its adaptability to a range of problems, whilst the loss function is chosen due to its effectiveness for binary outputs.

ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

Step 5: feed the data into the ANN

After compiling, the ANN is ready to be trained on the data. We will use a for loop to fit our model, as we would like to analyse the results later. For each loop, we re-fit the model and evaluate the loss and accuracy on both the training and test sets:

steps = []
accs = []
test_accs = []
for i in range(0, 25):, y_train, epochs = 1)
    [loss, accuracy] = ann.evaluate(scaled_X_train, y_train)
    [loss_t, accuracy_t] = ann.evaluate(scaled_X_test, y_test)
    accs.append(accuracy), steps.append(i), test_accs.append(accuracy_t)

Step 6: verify results

While the training loop is running, we can see the training progress being printed, including the accuracy and loss. These are the last few lines of output:

78/78 [==============================] - 0s 628us/step - loss: 0.3769 - accuracy: 0.7887
78/78 [==============================] - 0s 462us/step - loss: 0.3508 - accuracy: 0.7952
13/13 [==============================] - 0s 539us/step - loss: 0.8428 - accuracy: 0.6176

On the bottom line, we can see the final accuracy on the test set of 61.8%. So, after all that work, we only get an accuracy of 61.8%… hardly seems worth the effort, right?

We can plot training set accuracy (blue) and test accuracy (orange) over time steps to investigate how accuracy fluctuates over training steps (epochs):

import matplotlib.pyplot as plt
plt.plot(steps, accs)
plt.plot(steps, test_accs)
[<matplotlib.lines.Line2D at 0x1f797461688>]

We can see that accuracy on the test set increases steadily over the first five epochs, then remains relatively flat while the training accuracy climbs gradually. Ideally, we would like to see a gradual increase in the test set accuracy as well.

Potential improvements

There are several improvements that can be made to this basic model:

  • Adjusting hyperparameters using a validation set. Good neural networks use validation sets to investigate the effects of adjusting their hyperparameters, which in this case include the number of layers, the learning rate, the type of activation function and the batch size. This article sums these up in an easy-to-understand way.
  • Using better data. The data set I found online was relatively simplistic in that some vital statistics weren’t included. For example, it is well known that “points from clearances” and “points from turnovers” are key indicators of successful teams. If such statistics were included in the data, the neural network may have performed better.
  • Changing the structure of the data fed into the ANN. In this project, for simplicity, I used players’ data averaged over the whole time period for the respective training and test sets. However, this doesn’t really make sense: for example, the average stats for a player who played a game in 2012 would include data from 2012 to 2017, i.e. from the future. This problem could be remedied, but it would be more challenging to derive the training and test sets. Check out this code on Kaggle for an example of how a (slightly more realistic) statistical Machine Learning approach may be used on this kind of data set.

So, the moral of the story is that it’s relatively straightforward to create a simple AI. There are several libraries out there – such as Pandas, Numpy, Scikit-learn and Tensorflow – to help you out. Hopefully this article has provided you with some ideas and inspiration on how you can apply machine learning techniques for prediction.

Get in touch with the 4CDA team if you would like to learn more on how you can apply machine learning approaches to solve problems!