An Introduction to Discrete Event Simulation

An overview of discrete event simulation and its applications.

Discrete event simulation is a powerful tool used across industries to analyse and optimise operations and processes. Since its inception in the late 1950s, alongside the emergence of computer science, discrete event simulation has come a long way in both capability and usefulness.

In this introduction to discrete event simulation, we aim to:

  • Provide an overview of discrete event simulation
  • Provide an overview of the software tools available for discrete event simulation
  • Pinpoint what problems discrete event simulation can solve and how it can be useful for your business
  • Explain the steps involved in building a discrete event simulation model
  • Provide example scenarios in which discrete event simulation can be applied, including simple and complex scenarios.

What is Discrete Event Simulation (DES)?

A discrete-event simulation (DES) models the operation of a system as a (discrete) sequence of events in time.

Discrete Event Simulation, Wikipedia

DES can be used to simulate almost any process you can think of, whether it be as simple as waiting at the deli to receive an order or as complicated as advanced manufacturing. Regardless of how complicated a DES model may be, there are three key characteristics shared by all simulations:

  1. Event-Based: The simulation progresses by jumping from one event to the next in time (rather than advancing in evenly spaced time slices).
  2. Discrete State Changes: The state of the system changes only at these discrete points in time.
  3. Stochastic or Deterministic: Events can occur based on a probability distribution (stochastic) or can be predetermined (deterministic).

The third characteristic, particularly its stochastic form, is likely the most important, as it allows us to model real-life operations using realistic parameters and data gathered from past observations, which helps us produce realistic results and solve real problems.

Each DES model requires the following to properly function:

  1. Simulation Entity – A basic entity that can run through the simulation
  2. Entity Generator – Generates entities to run in the simulation (usually serves as the beginning of a process)
  3. Queue – Stores received entities until they are needed for the next component
  4. Server – Processes a received entity for a specific duration (represents a timed event that requires a specific entity)
  5. Entity Sink – Destroys any entity received (serves as the end of a process).

To show how all of this comes together, a very basic model that links each part into a process flow is illustrated below.

DES Model Example

An explanation of the above example:

  • The simulation begins with the simulation entity “SimEntity1” being generated by the entity generator “EntityGenerator1”
  • After being generated, the entity travels to the queue “Queue1” and waits there until the server “Server1” has processed all of the entities ahead of it
  • Once it is past the queue, the server processes the entity for a specific time
  • Finally, after the server, the entity travels to the entity sink “EntitySink1” where it is destroyed, representing the end of the process for that entity.

Note: All the DES model examples in this article are made using the open-source simulation software JaamSim, which is discussed in more detail later in this article.
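
For readers who prefer code to pictures, the same generator → queue → server → sink flow can be sketched in Python using the open-source SimPy library (an illustration only – the models in this article are built in JaamSim, not SimPy, and all durations below are made up):

import random
import simpy

def entity_generator(env, server):
    # EntityGenerator1: create a new entity roughly every 2 time units
    for i in range(10):
        env.process(entity(env, f"SimEntity{i}", server))
        yield env.timeout(random.expovariate(1/2))  # stochastic inter-arrival time

def entity(env, name, server):
    # One entity's life: wait in Queue1, get processed by Server1, then exit (sink)
    arrived = env.now
    with server.request() as req:   # join the queue
        yield req                   # wait until the server is free
        yield env.timeout(random.expovariate(1/3))  # processing time
    print(f"{name} done at t={env.now:.2f} (spent {env.now - arrived:.2f} in system)")

env = simpy.Environment()
server = simpy.Resource(env, capacity=1)  # Server1 with its implicit queue
env.process(entity_generator(env, server))
env.run()  # runs until every entity has been destroyed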

DES Approach to Problem Solving

Discrete event simulation thrives when it comes to modelling many interacting nodes at once, debottlenecking, and optimisation.

To put this into perspective, let’s explore the hypothetical example of designing a new airport. You know that for people to get to their respective gate and flight, there must be several service desks for checking in bags and other processes, a security process, and several different terminals for passengers to travel to. Other amenities include shops, restaurants, bathrooms, elevators, escalators, and so on. Applying DES here would take all of these “nodes” into account when designing the airport and determine how to utilise each node optimally. For this example, we would build a DES model of the airport under the allocated resources (money, staff, etc.) that are available, and run that model within specific parameters to gather passengers’ total time travelling through the airport, along with other data. Then, we would examine the data to determine which node has the greatest impact on passengers’ overall travel time. Finally, we would return to the model, edit this node or “potential bottleneck”, and run the model again to get different results. We would repeat these steps in a trial-and-error fashion until we find the bottleneck of the airport system and arrive at the most optimised layout of the airport.

Ultimately, DES is utilised to solve efficiency-related business process problems, as DES always proposes the following question: “What are the best changes that can be made in order to truly optimise a process?”  

The specific steps taken for organisations to reach this optimisation and efficiency can be shown by the DES Engagement Process. The engagement process is outlined below.

1. Understanding the Process

When you approach a problem, understanding what you are actually solving and optimising is likely the most important part of the process. This includes not only having well-documented background information, but also tacit knowledge of the problem. There is a lot of value in this “discovery phase”, as it can surface opportunities for improvement right away if done properly.

Important considerations include:

  • Learning what process you are modelling
  • What events are involved and the order of these events
  • How these events start, how they end, and what they produce
  • Which parts of the model can represent which events (what servers and other process flow parts can represent).

2. Gathering Statistical Distributions

It is important to gather the proper statistics for a problem while keeping the objective function in mind. This means that the data you collect should represent something that can be measured and improved by going through this entire process, whether that means making something cheaper, more efficient, higher throughput, and so on. Keeping the realistic parameters of the problem in mind is also key, as every system has its trade-offs. Finally, you must determine whether the data actually applies to the problem at hand (a sketch of fitting a distribution to observed data follows the list below).

Important considerations include:

  • Which distributions and values apply to what events and elements
  • Distributions can be based on observed data, information in a database, or assumptions depending on the desired accuracy.
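
As a sketch of this step, service times observed during the discovery phase can be fitted to a candidate distribution with scipy.stats (the observations below are hypothetical):

import numpy as np
from scipy import stats

# Hypothetical observed service times, in minutes
observed = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.0, 4.7, 5.3])

mu, sigma = stats.norm.fit(observed)  # maximum-likelihood fit of a normal distribution
print(f"Fitted service time distribution: N(mean={mu:.2f}, sd={sigma:.2f})")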

3. Building the Model

Typical steps in building a DES model include:

  • Creating the flow and its respective distributions in a DES
  • Showing what the process produces as a whole
  • Linking each event to its respective element, distributions, values, and code within the program.

4. History Matching

It’s important to note that if you are unable to history match the model against its real-life counterpart, then it is highly likely that you have misunderstood something or that an important assumption you previously made was incorrect. This is why steps 1 and 2 are the most important: an incorrect perception of the problem, or incorrect data, undermines the validity of the model. In summary, if you fail to history match the model, return to steps 1 or 2, as there is likely a mistake (a minimal numerical check is sketched after the list below).

Important considerations include:

  • Verifying that the model reproduces the behaviour of its real-life counterpart
  • Ensuring values and distributions used are realistic and make sense
  • Ensuring the final product represents what is made/desired in reality.
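
A minimal numerical check for history matching might compare a summary statistic of the model’s output against historical records (all numbers below are hypothetical):

import numpy as np

# Hypothetical daily throughput: plant records vs model output
historical = np.array([102, 98, 105, 99, 101, 97, 103])
simulated = np.array([100, 99, 104, 101, 100, 98, 102])

# Relative error of the means; a large value suggests returning to steps 1-2
rel_err = abs(simulated.mean() - historical.mean()) / historical.mean()
print(f"Mean throughput error: {rel_err:.1%}")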

5. Running the Experiment

The final step is running the experiment. Considerations include (a sketch of a simple replication loop follows this list):

  • Utilising scenarios and replications to create accurate and realistically distributed results
  • Changing nodes to see the effects on results
  • Learning how to achieve efficiency through minimalistic, realistic, and cost-efficient changes.
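
In code, running replications might look like the sketch below, where run_model is a hypothetical stand-in for whatever executes one pass of your simulation:

import statistics

def run_scenario(params, replications=30):
    # Run the model many times with the same parameters but different
    # random draws, then summarise the output metric across replications
    results = [run_model(**params) for _ in range(replications)]
    return statistics.mean(results), statistics.stdev(results)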

DES Software Tools

There are various DES software tools available. Some of these tools include, but are not limited to, the following:

  1. AnyLogic – A large, general-purpose, multi-method modelling tool that combines agent-based, system dynamics, and discrete event modelling.
  2. Arena – Now part of Rockwell Automation; a large, general-purpose, multi-method simulation modelling tool.
  3. Simul8 – A commercial simulation software, especially popular in Australia.
  4. Simcad Pro – A simulation software that includes a virtual reality and physics engine, and allows for model changes while the simulation is running.
  5. Enterprise Dynamics – a platform that allows you to create a digital twin with applications in manufacturing, material handling, and logistics.
  6. JaamSim – Open source simulation software that has 2D/3D graphics and a drag-and-drop interface.

It’s important to note that each of these tools has its own uses depending on the scale of the project. For example, if a project has a smaller scale and a tighter budget, “Simul8” may be an appropriate option due to its lightweight nature, or the open-source “JaamSim” may be appropriate due to it being free and simple to use. For larger projects with bigger budgets and high complexity/compute requirements, “AnyLogic” and “Arena” may be a better fit.

Some of the tools mentioned above have free and “pro” versions. For example, the free version of “JaamSim” provides all the basic tools used to build a model, but “JaamSim Pro” provides faster execution speed, more flexible data interoperability, and access to web models where users can access a model and run it through their own parameters.

Finally, the range of scales these tools cover is part of what makes discrete event simulation so interesting. The flexibility seen across these tools shows how many opportunities there are to utilise DES, and that there is a lot of value to be had from models regardless of how big or small they are.

Example DES Models

Below are some examples of hypothetical “real world” applications of discrete event simulation models in two separate industries/use cases.

Example 1: Basic DES Model – Airport Security

In this example, let us imagine that the manager of an airport is looking to improve the efficiency of its security process. A basic model could look like the one below.

Basic Airport Security DES Model

A summary of the above:

  • The entity is presented as a “Flight_Passenger”, and the passenger’s arrival at the airport is represented by the entity generator labeled “Airport_Arrival”.
  • The passenger waits in the security queue, “Security_Queue”, until they can enter the server labeled “Security”, which represents the passenger going through security.
  • After making it through security, the passenger goes to board their flight, which is represented by the entity sink labeled “Board_Flight”, marking the end of the process.

Example 2: Basic DES Model – Gold Beneficiation

In this example, we have a gold mine manager who is seeking to improve the process of finding and preparing gold for the market. A basic model could look like the one below.

Basic Gold Beneficiation DES Model

A summary of the above:

  • The entity is presented as “Gold_Ore”, and this ore is mined/created via the entity generator labeled “Gold_Mine”.
  • After being mined, the gold ore waits in the queue labeled “Beneficiation_Queue” until it can enter the server labeled “Beneficiation”.
  • After the gold ore goes through the beneficiation process, it is shipped off to the gold market which is represented by the entity sink labeled “Gold_Market”.

Notice how this model involves a much more niche scenario than the previous one. This relates back to the point that DES can be applied to any process, whether it involves niche manufacturing or common day-to-day tasks. Both models can achieve a degree of efficiency analysis by applying pre-determined values to each part of the model. However, to obtain realistic data, we will need to build a stochastic model with many more process-flow parts.

Example 3: Complex DES Model – Gold Beneficiation

The basic Gold Beneficiation DES model can be useful to an extent, but its simplicity is a major obstacle to gathering realistic data. Some unrealistic implications of the model include:

  1. The entire gold beneficiation process is represented as a single server. In reality however, the beneficiation process is composed of several other steps including crushing, grinding, and extracting the ore.
  2. There is no transportation time between each part, even though in reality the ore takes some time to travel from event to event.
  3. Each event takes a set time to complete. In reality, almost no process repeats with exactly the same duration as the last run-through.
  4. There is always a chance that some of the ore may go to waste, and the model above has no way of identifying which ore is wasteful.

To compensate for these issues, we can (a sketch of the distributions in items 3 and 4 follows this list):

  1. Add more servers and queues
  2. Create entity conveyor belts which transport received entities on a specific path at a certain speed
  3. Add normal distributions to each server that each have their own mean, maximum, minimum, and standard deviation value
  4. Add a discrete distribution connected to a branch event. The branch, representing the success rate of the ore, uses the discrete distribution (with set probabilities) to choose whether the ore is wasted or shipped to the market.
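
Items 3 and 4 amount to sampling from the right distributions. A minimal sketch in Python (all parameter values are illustrative):

import random

def service_time(mean=5.0, sd=1.0, lo=3.0, hi=8.0):
    # Normal distribution for a server's processing time, truncated to
    # [lo, hi] so that unrealistic (e.g. negative) samples are rejected
    while True:
        t = random.gauss(mean, sd)
        if lo <= t <= hi:
            return t

def ore_is_wasted(p_waste=0.05):
    # Discrete distribution for the branch: True routes the ore to the
    # waste sink, False ships it to the market
    return random.random() < p_waste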

Putting this all together now, a more complex version of this model may look like the below:

Complex Gold Beneficiation DES Model

A summary of the above:

  • The previous server in the Basic Gold Beneficiation model labelled “Beneficiation” has been divided into three separate servers labelled “Crushing_Process”, “Grinding_Process”, and “Extraction_Process”.
  • Each of these servers is linked to its respective distribution shown above it. Additionally, various entity conveyors, including “Ore_Transportation”, “Conveyor_Belt” and “Shipping_to_Market”, have been added, as well as a new entity sink, branch, and discrete distribution.
  • The “Success_rate” branch chooses whether the ore is shipped to the market or goes to the “Wasted_Gold” entity sink, based on the value drawn from the “Success_Rate_Distribution”.
  • This success rate represents the yield of gold: in practice, not all gold is perfectly recovered when ore goes through this process.
  • Additionally, we can add graphical components to our model so we can understand what is happening on a more visual and analytical basis.

Complex Gold Beneficiation DES Graphical Model with Statistics

In the above graphical model, a statistics section containing average queue wait times and counters has been added, along with a graph of the combined average queue length and bar gauges next to each server indicating progress on the entity currently being processed.

Upon running the simulation, we can see the following results.

Complex Gold Beneficiation DES Graphical Model with Statistics Running

Now that the model is more realistic and stochastic (despite the demo values of seconds and minutes used above), the mine manager can use the data presented above, together with the data in the downloaded Excel report, to tweak one node at a time in a trial-and-error fashion and determine which changes produce the most efficient and optimised outcome.

Even this “complex” example only scratches the surface. A real life DES model for gold processing would be closely linked to the process flow of that specific gold mine taking into consideration the specific processing and equipment used at that site, such as whether it employs carbon in leach (CIL) or carbon in pulp (CIP) processes, the various types/stages of grinding, the type of mill, whether gravity screening is employed, the amount/quality of sensors, reliability of data in the historian, and so on.

In Summary

Discrete Event Simulation is an incredibly useful and flexible technique that can be used in many organisations. Whether you are collecting a couple of data points from a simple model, building a model with hundreds of nodes, or setting your own parameters in a pre-built model, DES tools are flexible enough to let you do as you please.

DES modelling can help your organisation to improve decision making, identify bottlenecks, reduce costs and wastage, optimise resources, minimise downtime, predict system performance under various conditions, and conduct risk-free experimentation before costly investment.

Despite its versatility, DES is not a technique that can do everything efficiently. Because its events are instantaneous, discrete occurrences rather than periods during which the system’s variables evolve, it is not well suited to modelling continuous change within an event or across the system as a whole, such as fluids, kinematics, and traffic flow. Other modelling techniques, such as continuous modelling or agent-based modelling, can be used for these problems.

Hopefully, this article was an informative introduction to DES, its applications, and how it may apply to your organisation. Please do not hesitate to contact the team if you would like to know more about discrete event simulation, or the implementation of other simulation and optimisation techniques.

Acknowledgements

A special thank you goes to our New York based 4CDA intern Nick Vowinkel, who helped put this article together during his internship.
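
An Introduction to Predictive Maintenance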

An explanation of what Predictive Maintenance is, and a demonstration of how a PdM algorithm may be implemented in the real world.

Predictive maintenance (PdM) is maintenance that monitors the performance and condition of equipment during normal operation to reduce the likelihood of failures

www.reliableplant.com

There are generally three different types of maintenance:

  • Reactive maintenance is the process of repairing assets to standard operating conditions after poor performance or breakdown is observed.
  • Preventive maintenance usually occurs on some type of schedule. Preventive maintenance is designed to keep machinery and parts in good condition but does not take the state or process into account.
  • Predictive maintenance occurs as needed, drawing on real-time collection and analysis of machine operation data to identify issues before they can interrupt production. With predictive maintenance, repairs happen during machine operation and address an actual problem. If a shutdown is required, it will be shorter and more targeted.

While the planned downtime in preventive maintenance may cause a decrease in overall capacity and/or availability, it is favoured over the unplanned downtime of reactive maintenance, where costs and duration may be unknown until the problem is diagnosed and addressed. It is also likely to interrupt other scheduling and planning which will cause further downstream time losses.

The aim of this post is to demystify some technical aspects of predictive maintenance through a Python solution to a real-world problem: turbofan engine degradation.

Problem Statement

Our task is to determine whether a Machine Learning model could be used to perform Predictive Maintenance on turbofan engines. For the purposes of this tutorial, we will assume that the following information has been ascertained through consultation with the company operating the turbofans:

  • The maintenance schedule of the turbofans is flexible. There would be no use carrying out this analysis if the schedule cannot be changed.
  • The analysis would generate long-term value for the operating company.

Given that the above points are true, the problem now lies in the analysis. We will use sensor data to predict the Remaining Useful Life (RUL) of turbofan engines. This RUL prediction can then be used to facilitate predictive maintenance.

Dataset Description

The data used in this notebook is based on a subset of the popular NASA Turbofan Engine Degradation Simulation Data Set. It contains data for 100 different turbofans.

Engine degradation simulation was carried out using C-MAPSS. Four different sets were simulated (using set 3 here) under different combinations of operational conditions and fault modes. Records several sensor channels to characterize fault evolution. The data set was provided by the Prognostics CoE at NASA Ames.


https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository

Implementation

The first step is to import the libraries:

import pandas as pd
import numpy as np
import scipy as sp
import scipy.signal as ss
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost
import catboost
plt.rcParams['figure.figsize'] = 20, 20

Next, we import the data into a Pandas data frame. The data are in the form of text files, which can be obtained from the above link. There is a data frame for the training set, one for the test set, and one for the true RUL values of the test set:

index_names = ['unit_number', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i+1) for i in range(0,21)]
col_names = index_names + setting_names + sensor_names
directory = r'C:\PATH\TO\YOUR\DATA\FOLDER'
train_df = pd.read_csv(directory+r'\train_FD003.txt', 
                     sep='\s+', 
                     header=None,
                     index_col=False,
                     names=col_names)
train = train_df.copy()
test_df = pd.read_csv(directory+r'\test_FD003.txt', 
                     sep='\s+', 
                     header=None,
                     index_col=False,
                     names=col_names)
test = test_df.copy()
y_test = pd.read_csv(directory+r'\RUL_FD003.txt', 
                      sep='\s+', 
                      header=None,
                      index_col=False,
                      names=['RUL'])

At this stage, we note that we will be using only a training and a test set. In practice, a validation set should also be used to ensure the model works well on multiple test sets.

Now we can perform some basic diagnostic analysis of the data to confirm that it has been imported correctly, and to get an idea of what the entire data set looks like through descriptive statistics:

train.shape
(24720, 26)
train.head()
Table of train head
train.loc[:,'s_1':].describe().transpose()
Table of descriptive statistics for the sensor columns

We can see that there is a row for each time cycle for each unit, and that there is a column for each sensor reading.

We now define a function that adds a column “RUL” to a given data frame based on a unit’s maximum cycle number and the time cycle it is in currently. We use this function to modify the training set:

def add_remaining_useful_life(df):
    grouped_by_unit = df.groupby(by='unit_number')
    max_cycle = grouped_by_unit['time_cycles'].max()
    result_frame = df.merge(max_cycle.to_frame(name='max_cycle'),
                            left_on='unit_number',
                            right_index=True)
    # Calculate remaining useful life for each row
    remaining_useful_life = result_frame["max_cycle"] - result_frame['time_cycles']
    result_frame["RUL"] = remaining_useful_life
    # drop max_cycle as it's no longer needed
    result_frame = result_frame.drop("max_cycle", axis=1)
    return result_frame
train = add_remaining_useful_life(train)
train = add_remaining_useful_life(train)
Table with trained data after RUL

We then find the maximum RUL for each unit and visualise their frequencies using a histogram:

max_ruls = train.groupby('unit_number').max().reset_index()
max_ruls.head()
Table with data after max RUL
Histogram of max RUL results

The distribution of max RULs looks log-normal, with most of the data in the 150-250 range. One insight we can gather from this is that, if there were many more simulations, we could be confident that the maximum RUL would almost never be less than 150 cycles. This information could be used to clip predictions at a particular maximum, which may make the algorithm more reliable/accurate in production.

Visualising sensor signals

Now we investigate what the sensor signals look like, which will help determine “good” and “bad” sensors (i.e. sensors that contain a lot of information vs ones that don’t). We use a function to plot the sensor signals for every 10th turbofan. Plots for sensors 1, 2, 6 and 7 are displayed below as examples:

Brief sensor analysis
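
The plotting helper itself might look something like the sketch below (the exact figure styling of the original plots is an assumption):

def plot_signal(df, sensor_name):
    # Plot one sensor's trace against RUL for every 10th engine
    plt.figure(figsize=(13, 5))
    for unit in df['unit_number'].unique():
        if unit % 10 == 0:  # every 10th turbofan
            X = df.loc[df['unit_number'] == unit, 'RUL']
            y = df.loc[df['unit_number'] == unit, sensor_name]
            plt.plot(X, y)
    plt.xlim(250, 0)  # reverse the x-axis so failure sits on the right
    plt.xticks(np.arange(0, 275, 25))
    plt.xlabel('Remaining Useful Life')
    plt.ylabel(sensor_name)
    plt.show()

for sensor in ['s_1', 's_2', 's_6', 's_7']:
    plot_signal(train, sensor)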

Looking at all visualisations, it seems as though sensors 1 (included above), 5, 16, 18 and 19 have very little to no information to help predict the RUL. These will be removed before prediction below to help the speed and generalisation of the algorithm. Further testing may reveal that sensors with patterns like sensor 6 (above) should also be removed from the features.

Next, we prepare the training and test sets:

drop_sensors = ['s_1','s_5','s_16','s_18','s_19']
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']
X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)

X_train and X_test contain only the sensor readings, whilst y_train and y_test contain the RUL values.

Now, we define a function that evaluates the total error in our predictions. We have chosen the root mean square error (RMSE) as our error metric, although there may be other metrics that would work better:

def evaluate(y_true, y_hat, label='test'):
    mse = mean_squared_error(y_true, y_hat)
    rmse = np.sqrt(mse)
    variance = r2_score(y_true, y_hat)
    print('{} set RMSE:{}, R2:{}'.format(label, rmse, variance))

Training the algorithm

This section of the tutorial will cover the building and implementation of a Random Forest Regression model. It is good to test multiple different types of models in practice – as such, XGBoost and CatBoost models were also built and tested, but that will not be covered here.

We note here that many model types (e.g. Linear Regression, SVM) require feature scaling to normalise the data for training. All of the models built as part of this process are forms of decision tree regressors, which do not require feature scaling.

Building and training the model is surprisingly straightforward thanks to Scikit-learn’s powerful Random Forest module:

rf = RandomForestRegressor(max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

# predict and evaluate
y_hat_train = rf.predict(X_train)
evaluate(y_train, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:21.025240351169202, R2:0.9547545019533699 
test set RMSE:46.358798830427006, R2:-0.254167389706895

We will now interrogate our results through a visualisation to confirm whether they are reasonable. It is always important to visualise the predictions against the actual values, as a single metric does not always tell the whole story. For example, most predictions may be good but there could be a large outlier which would render the model unacceptable to put into production.

Our algorithm appears to overestimate in its predictions. We will try to remedy this by clipping the RUL as mentioned previously. Applying RUL clipping makes the maximum RUL value 115, which helps the predictions as we will see below. Intuitively, this makes sense: the sensor values with a 200 RUL are similar to those with a 115 RUL, so the algorithm will not be able to distinguish between these well. Also, the maximum RUL in the test set is 115, which is around the point where the sensor readings start to change significantly (referring to the line graphs above).

drop_sensors = ['s_1','s_5','s_16','s_18','s_19']
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']
X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
y_train_clipped = y_train.clip(upper=115)  # apply RUL clipping
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)

Now, we retrain the model:

rf = RandomForestRegressor(max_features="sqrt", random_state=42)
rf.fit(X_train, y_train_clipped)

# predict and evaluate
y_hat_train = rf.predict(X_train)
evaluate(y_train_clipped, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:5.158896101752112, R2:0.9802797863716584 
test set RMSE:20.225672522811198, R2:0.761275442379237

After clipping the RUL, we see a decrease in the RMSE by more than half, indicating that our algorithm now makes better predictions. This is visualised below:

Histogram of results with RUL clipping

Another step we can take to improve the predictive power of our algorithm is signal smoothing. We can see from the sensor reading plots that the signals are “noisy” – i.e. they have lots of short-term variance. To fix this, we smooth the signals using Scipy’s Savitzky-Golay filter:

def apply_scipy_filter(df, scipy_filter):
    # Smooth each sensor channel separately for each engine unit
    for unit in df['unit_number'].unique():
        for sensor in df.loc[:, 's_1':]:
            if sensor != 'RUL':
                df.loc[df['unit_number'] == unit, sensor] = scipy_filter(
                    df.loc[df['unit_number'] == unit, sensor],
                    window_length=19, polyorder=1, deriv=0, mode='interp')
    return df

We apply the filter three times, then visualise the results:

train = apply_scipy_filter(train, ss.savgol_filter)
train = apply_scipy_filter(train, ss.savgol_filter)
train = apply_scipy_filter(train, ss.savgol_filter)

test = apply_scipy_filter(test, ss.savgol_filter)
test = apply_scipy_filter(test, ss.savgol_filter)
test = apply_scipy_filter(test, ss.savgol_filter)
Line chart with S1 smoothed
Line chart with S2 Smoothed
Line chart with S6 smoothed
Line chart with S7 smoothed

The plots are now much smoother.

The model may now be fit to the transformed data:

drop_sensors = ['s_1','s_5','s_16','s_18','s_19'] 
drop_labels = index_names+setting_names+drop_sensors
remaining_sensors = ['s_2', 's_3', 's_4', 's_6', 's_7', 's_8', 's_9', 's_10',
        's_11', 's_12', 's_13', 's_14', 's_15', 's_17', 's_20', 's_21']

X_train = train.drop(drop_labels, axis=1)
y_train = X_train.pop('RUL')
y_train_clipped = y_train.clip(upper=115)  # apply RUL clipping

# Since the true RUL values for the test set are only provided for the last time cycle of 
# each engine, the test set is subsetted to represent the same
X_test = test.groupby('unit_number').last().reset_index().drop(drop_labels, axis=1)
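
Fitting and evaluating then follows exactly the same pattern as before (repeated here for completeness):

rf = RandomForestRegressor(max_features="sqrt", random_state=42)
rf.fit(X_train, y_train_clipped)

# predict and evaluate
y_hat_train = rf.predict(X_train)
evaluate(y_train_clipped, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)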
train set RMSE:1.65129502601594, R2:0.998347906044556 
test set RMSE:18.805190001699, R2:0.7936299799906349

The RMSE on the training set has been reduced substantially again, but it did not decrease by much on the test set. This suggests that perhaps our model is overfitted, which means the model is very accurate on the training set but inaccurate on the test set.

An additional step we took (not covered in detail in this tutorial) was to optimise the model’s hyperparameters, given our training and test data, using Bayesian Optimisation.
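
As a rough illustration, such a search could be set up with Optuna (imported at the start); the parameter ranges below are assumptions for the sketch, not the exact ones we used:

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 100),
        'max_depth': trial.suggest_int('max_depth', 4, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features': trial.suggest_int('max_features', 1, 16),
    }
    model = RandomForestRegressor(random_state=42, **params)
    model.fit(X_train, y_train_clipped)
    # minimise RMSE on the held-out set
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)

We then fed the parameters obtained through this optimisation into the Random Forest model: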

rf = RandomForestRegressor(n_estimators = 32,
                            max_depth = 22,
                            min_samples_split = 6,
                            max_features = 1,
                            min_samples_leaf = 8,
                            random_state = 42)
rf.fit(X_train, y_train_clipped)

y_hat_train = rf.predict(X_train)
evaluate(y_train_clipped, y_hat_train, 'train')

y_hat_test = rf.predict(X_test)
evaluate(y_test, y_hat_test)
train set RMSE:8.092176262356801, R2:0.9603250795515506 
test set RMSE:19.824936715182147, R2:0.7706415346514391

This accuracy looks worse than before, but the model is more general (performance on the test set is not much worse than on the training set). It is important to remember that, with these parameters, the algorithm performs well across multiple different training sets. If the Bayesian Optimisation algorithm had been allowed to run for longer, or if more hyperparameters were allowed to vary, even better accuracy could be achieved.

Potential (further) improvements

Many more improvements and adjustments are possible in models like this. Some of them are:

  • Adding features. It would be possible to generate new features from already-existing features, and perhaps find extra data to add to the data set. Some examples are:
    • Lagged features, i.e. columns for time t-1, t-2 etc. (sketched after this list)
    • Derivatives of features, i.e. average rate of change over a lagged period
    • Geographic data, e.g. location, humidity, ambient temperature
  • Detailed feature analysis. In practice, it is good to have an expert analyse the data set and determine which features to use in the model. It is also possible to employ feature selection methods to avoid problems such as the algorithm fitting to noisy data. Further statistical analysis may also be used to find features particularly prone to under/overfitting.
  • Voting/stacking regressors. Some implementations have multiple regression models working in tandem (known as ensemble learning), which reduces variance in predictions and may increase accuracy.
  • Using a CNN to capture more time steps. CNNs are used primarily in Computer Vision problems, but they are also useful for capturing data from multiple time steps simultaneously.
  • Using a different loss function. In problems like this, overestimating the RUL is potentially far more dangerous and costly than underestimating it, so it may be prudent to choose a loss function that penalises overestimates more heavily than underestimates.
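
As a sketch of the first idea, lagged sensor features can be generated with a groupby/shift in Pandas (the lag depths are arbitrary):

# Add each sensor's readings from the previous 1-3 cycles as new columns
for lag in [1, 2, 3]:
    for s in remaining_sensors:
        train[f'{s}_lag{lag}'] = train.groupby('unit_number')[s].shift(lag)
train = train.dropna()  # the first few cycles of each unit have no history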

That’s the end of the tutorial – well done for reading through to the end! If you think that you or your business could benefit from a predictive Machine Learning implementation like this, please contact the 4CDA team. We would love to help!

Acknowledgements

A special thank you goes to 4CDA alumnus James White for creating the initial code base used in the tutorial section.

Please find the link to the full codebase for this example.
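
An Introduction to Artificial Neural Networks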

Ever wondered how an Artificial Neural Network works? Here’s a simple tutorial on the basics of implementing an ANN as a solution to a data problem using a real-world example.

In this tutorial, we will use artificial neural networks (ANN), implemented via the popular open source library TensorFlow, to predict the outcome of games of Australian Rules Football (AFL). We will use historical data to train the model and then use those parameters in the predictive model. The process takes an end-to-end approach, covering data acquisition, cleansing and organising, before the application of the ANN and verification of results.

This tutorial assumes some prior basic knowledge in Python.

What is an artificial neural network (ANN)?

The definition according to Wikipedia for an artificial neural network is:

ANNs, usually simply called neural networks (NNs), are computing systems vaguely inspired by the biological neural networks that constitute animal brains.

– Wikipedia definition

ANNs can solve new problems by analysing previous examples of similar problems. In this example, we have chosen ANNs to demonstrate their power in classification and regression problems. Similar results could be gained using linear or logistic regression (see our introductory post on linear regression), random forest classification, and so on.

Project Structure

The structure we’ll follow to create an AFL predictor involves six steps:

  1. Find/create your dataset
  2. Clean the data
  3. Organise the data
  4. Set up the Neural Network
  5. Feed the data into the Neural Network
  6. Verify results

This structure may be generalised to solve many different problems, so I encourage readers to use this article as an example of how to implement an ANN in Python and then use the method to solve an alternative problem. Furthermore, I would suggest choosing a project that you’re passionate about and interested in – I found that this motivated me to continue working and tinkering even when I wasn’t sure what to do next.

The final code used in this tutorial is available in this GitHub repository.

Step 1: find/create your dataset

Before we can do anything Data Science-related, we need data. I found a Kaggle dataset containing AFL player data from (almost) all matches between 2012 and 2018. It has a manageable 63,712 rows and 37 columns.

We need to import it into Jupyter Notebook, the Python workbook we’ll be using, before we can see or do anything with it. Let’s start by importing some essential libraries and loading our data:

import tensorflow as tf
import numpy as np
import pandas as pd

TensorFlow is an open source machine learning platform from Google that contains the Neural Networks we will use in our predictor, Numpy is used extensively for matrix and array operations, and Pandas is used to put the data into a data frame just to make our lives easier.

The next part of this step is to load the data using Pandas:

afl = pd.read_csv('C:/path/to/your/csv/file.csv')

This will place the data from your CSV file into a Pandas data frame. Data frames are great for working with large data sets, as they allow for easy row/column operations and sub-setting amongst many other things.

Now that the data is loaded, we can see the column names using the .info() method:

afl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63712 entries, 0 to 63711
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Team                    63712 non-null  object 
 1   Player                  63712 non-null  object 
 2   D.O.B                   63712 non-null  object 
 3   Height                  63712 non-null  int64  
 4   Weight                  63712 non-null  int64  
 5   Position                63712 non-null  object 
 6   Season                  63712 non-null  int64  
 7   Round                   63712 non-null  object 
 8   Date                    63624 non-null  object 
 9   Score                   63624 non-null  float64
 10  Margin                  63624 non-null  float64
 11  WinLoss                 63624 non-null  object 
 12  Opposition              63624 non-null  object 
 13  Venue                   63624 non-null  object 
 14  Disposals               63712 non-null  int64  
 15  Kicks                   63712 non-null  int64  
 16  Marks                   63712 non-null  int64  
 17  Handballs               63712 non-null  int64  
 18  Goals                   63712 non-null  int64  
 19  Behinds                 63712 non-null  int64  
 20  Hitouts                 63712 non-null  int64  
 21  Tackles                 63712 non-null  int64  
 22  Rebound50s              63712 non-null  int64  
 23  Inside50s               63712 non-null  int64  
 24  Clearances              63712 non-null  int64  
 25  Clangers                63712 non-null  int64  
 26  FreesFor                63712 non-null  int64  
 27  FreesAgainst            63712 non-null  int64  
 28  BrownlowVotes           63712 non-null  int64  
 29  ContendedPossessions    63712 non-null  int64  
 30  UncontendedPossessions  63712 non-null  int64  
 31  ContestedMarks          63712 non-null  int64  
 32  MarksInside50           63712 non-null  int64  
 33  OnePercenters           63712 non-null  int64  
 34  Bounces                 63712 non-null  int64  
 35  GoalAssists             63712 non-null  int64  
 36  PercentPlayed           63712 non-null  int64  
dtypes: float64(2), int64(26), object(9)
memory usage: 18.0+ MB

Step 2: clean the data

An essential component of any data-centric project is to clean your data set. Luckily, this one did not require too much cleaning – the only necessary change was to get rid of any rows containing NA entries – but I chose to demonstrate some data handling capabilities of the Pandas library by also adding an “Age” column, which may also be relevant in what we are trying to predict:

import datetime

afl = afl.dropna(axis=0)
afl = afl[afl['WinLoss'] != 'D']
afl['D.O.B'] = pd.to_datetime(afl['D.O.B'])
afl['Date'] = pd.to_datetime(afl['Date'])
age_in_days = (afl['Date']-afl['D.O.B'])
age_in_years = age_in_days.dt.days/365.2425
afl['Age'] = age_in_years

The dropna call deletes any rows containing NA entries. The next line deletes rows where the match result was a draw – I chose to do this because draws are so infrequent in the AFL that it’s pretty much pointless to try predicting them. The rest of the code block finds each player’s age in years at the time they played each game and adds a column for this data.

Step 3: organise the data

This is arguably the most critical component of the project. We need to decide which data to use, and how to feed it into the ANN. I considered many different approaches to this step, but the structure I settled on is described below.

This may seem complicated, but the code snippets below should help explain what’s going on.

Firstly, the data is grouped by match, a list of players is generated for each of these matches, the team lists are shuffled and the result is converted to a Numpy array:

import random
grouped_data = afl.groupby(['Team','Season','Round','WinLoss','Opposition','Venue'])['Player'].apply(list).reset_index()
players = grouped_data['Player'].to_numpy()
for item in players:
    random.shuffle(item)
grouped_data['Player'] = players
grouped_data = grouped_data.to_numpy()

We shuffle the team lists to prevent bias towards some players in the neural network. For example, Gary Ablett is one of the best players of all time, so his average stats would probably be in the top 5% of players across the league. His name would be first on almost every one of his team lists, meaning the neural network may assign weights differently to his stats columns than other player columns. Shuffling the team lists ensures we don’t encounter this issue.

We can look at the first entry to see what each array element looks like:

grouped_data[0]
array(['Adelaide', 2012, 'PF', 'L', 'Hawthorn', 'M.C.G.',
       list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])],
      dtype=object)

We can now see that each array element is itself an array containing the team, season, round, result, opposing team, ground, and list of players on the team.

The next part of this step is to convert non-numerical information into numerical data so we can feed it into the ANN. The best way of doing this is by using one-hot (column) encoding – we won’t go into too much detail in this article, but there are plenty of tutorials online covering this topic (here is an example). So, we convert the match result, opposing team and ground columns into encoded columns, which requires the OneHotEncoder and ColumnTransformer classes from Scikit-learn.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct1 = ColumnTransformer([('encoder',OneHotEncoder(),[3,4,5])], remainder='passthrough',sparse_threshold=0)
grouped_data = ct1.fit_transform(grouped_data)

That’s all there is to it – let’s take another look at the first array element to see what each entry looks like after the transformation:

grouped_data[0]
array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 'Adelaide', 2012, 'PF',
       list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])],
      dtype=object)

Note how the encoded columns were pushed to the front of the array. Now, we need a training set and a test set. The training set will be fed into the ANN and used to develop the model, and the test set will allow us to measure the performance of the model. Usually, I would recommend creating training and test sets using train_test_split from Scikit-learn; however, I went for a different approach in this project: I chose the data from the 2012-2017 seasons as my training set and the data from the 2018 season as my test set. Firstly, we split the grouped data:

training = [x for x in grouped_data if x[43]<2018]
test = [x for x in grouped_data if x[43]==2018]

So “training” and “test” are just subsets of grouped_data.

Next, we find an encoded column corresponding to whether the team won or lost and use this as our dependent variable (as this is what we’re trying to predict):

y_train = np.array([x[0] for x in training])
y_train = y_train.reshape(-1,1)
y_test = np.array([x[0] for x in test])
y_test = y_test.reshape(-1,1)

The result is a two-dimensional array with 1’s corresponding to a win and 0’s to a loss:

print(y_train)
[[1.]
 [1.]
 [0.]
 ...
 [0.]
 [1.]
 [1.]]

Next, we find average stats for each player for the respective training and test sets and shuffle the resulting data to prevent bias:

# player_stats is each players' average stats between 2012 and 2017.
player_stats = afl[afl['Season'] < 2018].groupby('Player',sort=False).mean()
player_stats = player_stats.reset_index()
player_stats = player_stats.drop(['Season','Score','Margin'],axis=1)
player_stats = player_stats.sample(frac=1).reset_index(drop=True)
player_stats = player_stats.to_numpy()

# Player average stats for 2018
player_stats_2018 = afl[afl['Season'] == 2018].groupby('Player',sort=False).mean()
player_stats_2018 = player_stats_2018.reset_index()
player_stats_2018 = player_stats_2018.drop(['Season','Score','Margin'],axis=1)
player_stats_2018 = player_stats_2018.sample(frac=1).reset_index(drop=True)
player_stats_2018 = player_stats_2018.to_numpy()

We choose to ignore the season, score and margin columns. I guessed that the “season” column would not improve the prediction, which was later verified through testing. The “score” and “margin” columns are direct indicators of the match result, so if they were included in the analysis the neural network would achieve an accuracy of close to 100% on the training set but would perform very poorly on the test set.

We can see what each entry of these player statistic arrays looks like:

player_stats[0]
array(['Goodes, Brett', 183.0, 89.0, 17.045454545454547,
       10.772727272727273, 4.045454545454546, 6.2727272727272725,
       0.18181818181818182, 0.09090909090909091, 0.13636363636363635,
       2.6363636363636362, 2.8181818181818183, 2.272727272727273,
       1.3636363636363635, 2.772727272727273, 0.9545454545454546, 1.0,
       0.09090909090909091, 6.2727272727272725, 9.909090909090908,
       0.22727272727272727, 0.0, 1.8636363636363635, 0.5,
       0.18181818181818182, 78.5, 29.979335024613157], dtype=object)

Now we want to combine our encoded venue and opposition data with our average player data. Firstly, we get our encoded data from the “training” and “test” arrays we created previously:

opp_teams_train = np.array([x[2:42] for x in training])
opp_teams_test = np.array([x[2:42] for x in test])

The following “for” loop then reads the list of players on a team for each match, finds each player’s average stats in player_stats and adds them to an array, i.e. each row of X_train contains every player statistic for each game. The same thing is done for player_stats_2018, with the results stored in X_test:

X_train = [0]*len(training)
for i in range(len(training)):
    # collect the average stats of every player named in this match's team list
    player_list = []
    for j in range(len(player_stats)):
        if player_stats[j][0] in training[i][-1]:
            player_list.append(player_stats[j][1:])
    X_train[i] = player_list

X_train = [np.concatenate(x) for x in X_train]

X_test = [0]*len(test)
for i in range(len(test)):
    player_list = []
    for j in range(len(player_stats_2018)):
        if player_stats_2018[j][0] in test[i][-1]:
            player_list.append(player_stats_2018[j][1:])
    X_test[i] = player_list

X_test = [np.concatenate(x) for x in X_test]
print(len(X_train[0]))
572

The final part of this data engineering step involves something called feature scaling. Again, we won’t explore this concept in-depth, but there’s a lot of learning material out there if you’re curious (this Towards Data Science article, for example). Essentially, feature scaling improves the convergence rate of the ANN we’re about to create. We only need to scale our non-binary data, so we can leave our dependent variable arrays and opposition/venue data alone.

For this project I just used the StandardScaler class from the Scikit-learn pre-processing module:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled_X_train = sc.fit_transform(X_train)
scaled_X_test = sc.transform(X_test)  # fit only on the training data; reuse the same scaling for the test set

After feature scaling, we just combine our player and opposition/venue data, which gives us our matrix of features:

scaled_X_train = np.hstack((scaled_X_train, opp_teams_train))
scaled_X_train = np.asarray(scaled_X_train).astype(np.float32)

scaled_X_test = np.hstack((scaled_X_test, opp_teams_test))
scaled_X_test = np.asarray(scaled_X_test).astype(np.float32)

Run the below print statement to check that the data combined successfully:

print(scaled_X_train[0])

We now have negative values because of the feature scaling.

Step 4: building the ANN

This part of the project is surprisingly simple, as all the heavy lifting is done by the TensorFlow library. We’ll use a network with two hidden layers, each with six neurons and both utilising the rectified linear unit (ReLU) activation function. ReLU is one of the most versatile and widely used activation functions, which is why we’re using it. We’re opting for this network layout for simplicity, but feel free to experiment by adding more layers and/or neurons! For the output layer, we will use a sigmoid activation function, as we are looking for a probabilistic output. For more reading on how a basic neural network functions, there is plenty of online material.

In TensorFlow (tf), the layers are added individually using the .add() method:

# Build the ANN
ann = tf.keras.models.Sequential()
# First hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Second hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Output layer
ann.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))

We then compile the ANN using the Adam optimiser and the binary cross-entropy loss function. Adam is selected again for its adaptability to a range of problems, whilst the loss function is chosen due to its effectiveness for binary outputs.

ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

Step 5: feed the data into the ANN

After compiling, the ANN is ready to be trained on the data. We will use a for loop to fit our model, as we would like to analyse the results later. For each loop, we re-fit the model and evaluate the loss and accuracy on both the training and test sets:

steps = []
accs = []
test_accs = []
for i in range(0, 25):
    ann.fit(scaled_X_train, y_train, epochs = 1)
    [loss, accuracy] = ann.evaluate(scaled_X_train, y_train)
    [loss_t, accuracy_t] = ann.evaluate(scaled_X_test, y_test)
    accs.append(accuracy), steps.append(i), test_accs.append(accuracy_t)

Step 6: verify results

While the training loop is running, we can see the training progress being printed, including the accuracy and loss. These are the last few lines of output:

78/78 [==============================] - 0s 628us/step - loss: 0.3769 - accuracy: 0.7887
78/78 [==============================] - 0s 462us/step - loss: 0.3508 - accuracy: 0.7952
13/13 [==============================] - 0s 539us/step - loss: 0.8428 - accuracy: 0.6176

On the bottom line, we can see the final accuracy on the test set of 61.8%. So, after all that work, we only get an accuracy of 61.8%… hardly seems worth the effort, right?

We can plot training set accuracy (blue) and test accuracy (orange) over time steps to investigate how accuracy fluctuates over training steps (epochs):

import matplotlib.pyplot as plt
plt.plot(steps, accs)
plt.plot(steps, test_accs)
[<matplotlib.lines.Line2D at 0x1f797461688>]

We can see that accuracy on the test set increases steadily over the first five epochs, then remains relatively flat while the training accuracy climbs gradually. Ideally, we would like to see a gradual increase in the test set accuracy as well.

Potential improvements

There are several improvements that can be made to this basic model:

  • Adjusting hyperparameters using a validation set. Good neural network workflows use validation sets to investigate the effects of adjusting hyperparameters, which in this case include the number of layers, the learning rate, the type of activation function and the batch size (a sketch follows this list). This article sums these up in an easy-to-understand way.
  • Using better data. The data set I found online was relatively simplistic in that some vital statistics weren’t included. For example, it is well known that “points from clearances” and “points from turnovers” are key indicators of successful teams. If such statistics were included in the data, the neural network may have performed better.
  • Changing the structure of the data fed into the ANN. In this project, for simplicity, I used players’ data averaged over the whole time period for the respective training and test sets. However, this doesn’t really make sense: for example, the average stats for a player who played a game in 2012 would include data from 2012 to 2017, i.e. from the future. This problem could be remedied, but it would make deriving the training and test sets more challenging. Check out this code on Kaggle for an example of how a (slightly more realistic) statistical Machine Learning approach may be used on this kind of data set.
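
As a sketch of the first idea, Keras can carve out a validation set automatically via the validation_split argument (the 20% fraction is an assumption):

# Hold out 20% of the training data for validation and track its accuracy
history = ann.fit(scaled_X_train, y_train, epochs=25, validation_split=0.2)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.show()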

So, the moral of the story is that it’s relatively straightforward to create a simple AI. There are several libraries out there – such as Pandas, Numpy, Scikit-learn and TensorFlow – to help you out. Hopefully this article has provided you with some ideas and inspiration on how you can apply machine learning techniques for prediction.

Get in touch with the 4CDA team if you would like to learn more on how you can apply machine learning approaches to solve problems!