In this tutorial, we will use artificial neural networks (ANN), implemented via the popular open source library TensorFlow, to predict the outcome of games of Australian Rules Football (AFL). We will use historical data to train the model and then use those parameters in the predictive model. The process will include an end-to-end approach including data acquisition, cleansing and organising, before the application of the ANN and verification of results.
This tutorial assumes some prior basic knowledge in Python.
What is an artificial neural network (ANN)?
The definition according to Wikipedia for an artificial neural network is:
ANNs, usually simply called neural networks (NNs), are computing systems vaguely inspired by the biological neural networks that constitute animal brains.
– Wikipedia definition
ANNs can solve new problems by analysing previous examples of similar problems. In this example, we have chosen ANNs to demonstrate their power in classification and regression problems. Similar results could be gained using linear or logistic regression (see our introductory post on linear regression), random forest classification, etc.
Project Structure
The structure we’ll follow to create an AFL predictor involves six steps:
- Find/create your dataset
- Clean the data
- Organise the data
- Set up the Neural Network
- Feed the data into the Neural Network
- Verify results
This structure may be generalised to solve many different problems, so I encourage readers to use this article as an example of how to implement an ANN in Python and then use the method to solve an alternative problem. Furthermore, I would suggest choosing a project that you’re passionate about and interested in – I found that this motivated me to continue working and tinkering even when I wasn’t sure what to do next.
The final code used in this tutorial is available in this GitHub repository.
Step 1: find/create your dataset
Before we can do anything Data Science-related, we need data. I found a Kaggle dataset containing AFL player data from (almost) all matches between 2012 and 2018. It has a manageable 63,712 rows and 37 columns.
We need to import it into Jupyter Notebook, the Python workbook we’ll be using, before we can see or do anything with it. Let’s start by importing some essential libraries and loading our data:
import tensorflow as tf
import numpy as np
import pandas as pd
TensorFlow is an open source machine learning platform from Google that provides the neural network layers we will use in our predictor, NumPy is used extensively for matrix and array operations, and Pandas is used to put the data into a data frame, which makes it much easier to work with.
The next part of this step is to load the data using Pandas:
afl = pd.read_csv('C:/path/to/your/csv/file.csv')
This will place the data from your CSV file into a Pandas data frame. Data frames are great for working with large data sets, as they allow for easy row/column operations and sub-setting amongst many other things.
Now that the data is loaded, we can see the column names using the .info() method:
afl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63712 entries, 0 to 63711
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Team                    63712 non-null  object
 1   Player                  63712 non-null  object
 2   D.O.B                   63712 non-null  object
 3   Height                  63712 non-null  int64
 4   Weight                  63712 non-null  int64
 5   Position                63712 non-null  object
 6   Season                  63712 non-null  int64
 7   Round                   63712 non-null  object
 8   Date                    63624 non-null  object
 9   Score                   63624 non-null  float64
 10  Margin                  63624 non-null  float64
 11  WinLoss                 63624 non-null  object
 12  Opposition              63624 non-null  object
 13  Venue                   63624 non-null  object
 14  Disposals               63712 non-null  int64
 15  Kicks                   63712 non-null  int64
 16  Marks                   63712 non-null  int64
 17  Handballs               63712 non-null  int64
 18  Goals                   63712 non-null  int64
 19  Behinds                 63712 non-null  int64
 20  Hitouts                 63712 non-null  int64
 21  Tackles                 63712 non-null  int64
 22  Rebound50s              63712 non-null  int64
 23  Inside50s               63712 non-null  int64
 24  Clearances              63712 non-null  int64
 25  Clangers                63712 non-null  int64
 26  FreesFor                63712 non-null  int64
 27  FreesAgainst            63712 non-null  int64
 28  BrownlowVotes           63712 non-null  int64
 29  ContendedPossessions    63712 non-null  int64
 30  UncontendedPossessions  63712 non-null  int64
 31  ContestedMarks          63712 non-null  int64
 32  MarksInside50           63712 non-null  int64
 33  OnePercenters           63712 non-null  int64
 34  Bounces                 63712 non-null  int64
 35  GoalAssists             63712 non-null  int64
 36  PercentPlayed           63712 non-null  int64
dtypes: float64(2), int64(26), object(9)
memory usage: 18.0+ MB
Step 2: clean the data
An essential component of any data-centric project is to clean your data set. Luckily, this one did not require too much cleaning – the only necessary change was to get rid of any rows containing NA entries – but I chose to demonstrate some data handling capabilities of the Pandas library by also adding an “Age” column, which may also be relevant in what we are trying to predict:
import datetime
afl = afl.dropna(axis=0)
afl = afl[afl['WinLoss'] != 'D']
afl['D.O.B'] = pd.to_datetime(afl['D.O.B'])
afl['Date'] = pd.to_datetime(afl['Date'])
age_in_days = (afl['Date']-afl['D.O.B'])
age_in_years = age_in_days.dt.days/365.2425
afl['Age'] = age_in_years
After the import, the dropna call deletes any rows containing NA entries. The next line deletes rows where the match result was a draw – I chose to do this because draws are so infrequent in the AFL that it’s pretty much pointless to try predicting them. The rest of this code block finds each player’s age in years at the time they played each game and adds a column for this data.
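To make the cleaning step concrete, here is a minimal sketch on a made-up three-row frame. The player names and dates are hypothetical; only the column names match the AFL data. It shows which rows the NA filter and the draw filter remove, and how subtracting two datetime columns gives an age in years.

```python
import pandas as pd

# Hypothetical three-row frame using the same column names as the AFL data.
toy = pd.DataFrame({
    "Player": ["A, Player", "B, Player", "C, Player"],
    "D.O.B": ["1990-01-01", "1992-06-15", "1988-03-20"],
    "Date": ["2012-01-01", None, "2013-03-20"],
    "WinLoss": ["W", "L", "D"],
})

toy = toy.dropna(axis=0)                   # drops the row with the missing Date
toy = toy[toy["WinLoss"] != "D"].copy()    # drops the drawn match

toy["D.O.B"] = pd.to_datetime(toy["D.O.B"])
toy["Date"] = pd.to_datetime(toy["Date"])

# Subtracting two datetime columns gives a timedelta column;
# .dt.days converts it to whole days.
toy["Age"] = (toy["Date"] - toy["D.O.B"]).dt.days / 365.2425

print(toy[["Player", "Age"]])
```

Only the first row survives both filters, and its age comes out just under 22 years.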
Step 3: organise the data
This is arguably the most critical component of the project. We need to decide which data to use, and how to feed it into the ANN. I considered many different approaches to this step, but the structure I settled on is described below:
This may seem complicated, but the code snippets below should help explain what’s going on.
Firstly, the data is grouped by team and match, a list of players is generated for each of these groups, the team lists are shuffled and the result is converted to a NumPy array:
import random
grouped_data = afl.groupby(['Team','Season','Round','WinLoss','Opposition','Venue'])['Player'].apply(list).reset_index()
players = grouped_data['Player'].to_numpy()
for item in players:
    random.shuffle(item)
grouped_data['Player'] = players
grouped_data = grouped_data.to_numpy()
We shuffle the team lists to prevent bias towards some players in the neural network. For example, Gary Ablett is one of the best players of all time, so his average stats would probably be in the top 5% of players across the league. His name would be first on almost every one of his team lists, meaning the neural network may assign weights differently to his stats columns than other player columns. Shuffling the team lists ensures we don’t encounter this issue.
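The loop above relies on random.shuffle reordering each list in place. A small sketch with a hypothetical four-player team list (apart from Gary Ablett, the names are placeholders) shows that shuffling changes the order but not the membership:

```python
import random

# Hypothetical team list; in the tutorial each list holds 22 player names.
team = ["Ablett, Gary", "Player, B", "Player, C", "Player, D"]
original = list(team)      # keep a copy for comparison

random.seed(0)             # fixed seed only so the sketch is repeatable
random.shuffle(team)       # shuffles in place and returns None

print(team)
print(sorted(team) == sorted(original))
```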
We can look at the first entry to see what each array element looks like:
grouped_data[0]
array(['Adelaide', 2012, 'PF', 'L', 'Hawthorn', 'M.C.G.', list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])], dtype=object)
We can now see that each array element is itself an array containing the team, season, round, result, opposing team, ground and list of players on the team.
The next part of this step is to convert non-numerical information into numerical data so we can feed it into the ANN. The best way of doing this is by using column encoding – we won’t go into too much detail in this article, but there are plenty of tutorials online covering this topic (here is an example). So, we convert the match result, opposing team and ground columns into encoded columns, which requires the OneHotEncoder and ColumnTransformer modules out of Scikit-learn.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct1 = ColumnTransformer([('encoder',OneHotEncoder(),[3,4,5])], remainder='passthrough',sparse_threshold=0)
grouped_data = ct1.fit_transform(grouped_data)
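As a toy illustration of the same transformer, the sketch below encodes two columns of a hypothetical three-row array (the rows are made up, and the column positions differ from the real data). It also prints the categories the encoder learned, which shows that they are kept in sorted order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical rows: [team, result, opposition]
rows = np.array([
    ["Adelaide", "L", "Hawthorn"],
    ["Hawthorn", "W", "Adelaide"],
    ["Carlton", "W", "Hawthorn"],
], dtype=object)

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [1, 2])],
    remainder="passthrough",
    sparse_threshold=0,   # always return a dense array
)
encoded = ct.fit_transform(rows)

# Categories learned for each encoded column, in sorted order.
categories = ct.named_transformers_["encoder"].categories_
print(categories)
print(encoded[0])
```

Each encoded column becomes one 0/1 indicator per category, and the untouched team column is passed through at the end of each row.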
That’s all there is to it – let’s take another look at the first array element to see what each entry looks like after the transformation:
grouped_data[0]
array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 'Adelaide', 2012, 'PF', list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])], dtype=object)
Note how the encoded columns were pushed to the front of the array. Now, we need a training set and a test set. The training set will be fed into the ANN and used to develop the model, and the test set will allow us to measure the performance of the model. Usually, I would recommend choosing training and test sets using train_test_split out of Scikit-learn; however, I went for a different approach in this project: I chose the data from the 2012-2017 seasons as my training set and data from the 2018 season as my test set. Firstly, we split grouped_data on the season, which now sits at index 43:
training = [x for x in grouped_data if x[43]<2018]
test = [x for x in grouped_data if x[43]==2018]
So “training” and “test” are just subsets of grouped_data.
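The same season-based split can be sketched on a handful of made-up rows. The tuples below are hypothetical and keep the season at position 0 rather than 43; the point is only that every training row comes from before the test season, which a random split would not guarantee.

```python
# Hypothetical rows: (season, team, result).
matches = [
    (2016, "Adelaide", "W"),
    (2017, "Carlton", "L"),
    (2018, "Hawthorn", "W"),
    (2018, "Geelong", "L"),
]
SEASON = 0  # position of the season value within each toy row

training = [m for m in matches if m[SEASON] < 2018]
test = [m for m in matches if m[SEASON] == 2018]

print(len(training), len(test))
```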
Next, we find an encoded column corresponding to whether the team won or lost and use this as our dependent variable (as this is what we’re trying to predict):
y_train = np.array([x[0] for x in training])
y_train = y_train.reshape(-1,1)
y_test = np.array([x[0] for x in test])
y_test = y_test.reshape(-1,1)
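The result is a two-dimensional array of 1’s and 0’s. Note that OneHotEncoder orders its categories alphabetically, so column 0 is the indicator for 'L': a 1 corresponds to a loss and a 0 to a win (the first row of grouped_data above, an Adelaide loss, starts with 1.0):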
print(y_train)
[[1.] [1.] [0.] ... [0.] [1.] [1.]]
Next, we find average stats for each player for the respective training and test sets and shuffle the resulting data to prevent bias:
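# player_stats is each player's average stats between 2012 and 2017.
# numeric_only=True keeps only numeric columns; pandas 2.x raises on text columns otherwise.
player_stats = afl[afl['Season'] < 2018].groupby('Player',sort=False).mean(numeric_only=True)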
player_stats = player_stats.reset_index()
player_stats = player_stats.drop(['Season','Score','Margin'],axis=1)
player_stats = player_stats.sample(frac=1).reset_index(drop=True)
player_stats = player_stats.to_numpy()
# Player average stats for 2018 (numeric columns only, as above)
player_stats_2018 = afl[afl['Season'] == 2018].groupby('Player',sort=False).mean(numeric_only=True)
player_stats_2018 = player_stats_2018.reset_index()
player_stats_2018 = player_stats_2018.drop(['Season','Score','Margin'],axis=1)
player_stats_2018 = player_stats_2018.sample(frac=1).reset_index(drop=True)
player_stats_2018 = player_stats_2018.to_numpy()
We choose to ignore the season, score and margin columns. I guessed that the “season” column would not improve the prediction, which was later verified through testing. The “score” and “margin” columns are direct indicators of the match result, so if they were included in the analysis the neural network would achieve an accuracy of close to 100% on the training set but would perform very poorly on the test set.
We can see what each entry of these player statistic arrays looks like:
player_stats[0]
array(['Goodes, Brett', 183.0, 89.0, 17.045454545454547, 10.772727272727273, 4.045454545454546, 6.2727272727272725, 0.18181818181818182, 0.09090909090909091, 0.13636363636363635, 2.6363636363636362, 2.8181818181818183, 2.272727272727273, 1.3636363636363635, 2.772727272727273, 0.9545454545454546, 1.0, 0.09090909090909091, 6.2727272727272725, 9.909090909090908, 0.22727272727272727, 0.0, 1.8636363636363635, 0.5, 0.18181818181818182, 78.5, 29.979335024613157], dtype=object)
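The averaging step can be checked on a toy frame. The rows below are hypothetical and use only a few of the real column names; the sketch groups per-game rows by player, averages the numeric columns and drops the season, which mirrors the chain of calls used for player_stats.

```python
import pandas as pd

# Hypothetical per-game rows for two players.
games = pd.DataFrame({
    "Player": ["A, Player", "A, Player", "B, Player"],
    "Season": [2016, 2017, 2017],
    "Disposals": [20, 30, 10],
    "Kicks": [10, 14, 6],
    "Venue": ["M.C.G.", "Gabba", "M.C.G."],   # non-numeric, excluded from the mean
})

averages = (
    games[games["Season"] < 2018]
    .groupby("Player", sort=False)
    .mean(numeric_only=True)    # average every numeric column per player
    .reset_index()
    .drop(["Season"], axis=1)   # the averaged season is not a useful feature
)
print(averages)
```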
Now we want to combine our encoded venue and opposition data with our average player data. Firstly, we get our encoded data from the “training” and “test” arrays we created previously:
opp_teams_train = np.array([x[2:42] for x in training])
opp_teams_test = np.array([x[2:42] for x in test])
The following “for” loop then reads the list of players on a team for each match, finds each player’s average stats in player_stats and adds them to an array, i.e. each row of X_train contains the average stats of every player on the team for one match. The same thing is done for player_stats_2018, with the results stored in X_test:
X_train = [0]*len(training)
for i in range(0,len(training)):
    player_list = []
    for j in range(0,len(player_stats)):
        if player_stats[j][0] in training[i][-1]:
            player_list.append(player_stats[j][1:])
    X_train[i] = player_list
X_train = [np.concatenate(x) for x in X_train]
X_test = [0]*len(test)
for i in range(0,len(test)):
    player_list = []
    for j in range(0,len(player_stats_2018)):
        if player_stats_2018[j][0] in test[i][-1]:
            player_list.append(player_stats_2018[j][1:])
    X_test[i] = player_list
X_test = [np.concatenate(x) for x in X_test]
print(len(X_train[0]))
572
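The 572 comes from 22 players per team multiplied by 26 averaged stats per player. The sketch below shows the same flattening on hypothetical data (three players with two stats each), using a dictionary lookup as an alternative to scanning every player for every match. The names lookup and team_list are my own and not part of the original code.

```python
import numpy as np

# Hypothetical averages: player name followed by two stats (the tutorial has 26).
player_stats = np.array([
    ["A, Player", 20.0, 5.0],
    ["B, Player", 15.0, 3.0],
    ["C, Player", 10.0, 1.0],
], dtype=object)

team_list = ["C, Player", "A, Player"]   # hypothetical two-player team

# Map each name to its stats once, instead of scanning every player per match.
lookup = {row[0]: row[1:] for row in player_stats}

# One flat feature vector per match: the players' stats placed end to end.
features = np.concatenate([lookup[name] for name in team_list])

print(features)
print(len(features))
```

Note that this version orders the stats by the team list and raises a KeyError for an unknown player, whereas the loop above follows the row order of player_stats and silently skips players it cannot find.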
The final part of this data engineering step involves something called feature scaling. Again, we won’t explore this concept in-depth, but there’s a lot of learning material out there if you’re curious (this Towards Data Science article, for example). Essentially, feature scaling improves the convergence rate of the ANN we’re about to create. We only need to scale our non-binary data, so we can leave our dependent variable arrays and opposition/venue data alone.
For this project I just used the StandardScaler class out of the Scikit-learn pre-processing module:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled_X_train = sc.fit_transform(X_train)
scaled_X_test = sc.fit_transform(X_test)
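For reference, standardisation subtracts each column’s mean and divides by its standard deviation. The sketch below does this by hand in NumPy on hypothetical numbers, and follows the common convention of learning the mean and standard deviation from the training rows only and reusing them on the test rows. In Scikit-learn terms that would be fit_transform on the training set followed by transform on the test set, which differs from the code above, where the scaler is refitted on the test data.

```python
import numpy as np

# Hypothetical feature matrices: rows are matches, columns are features.
X_train = np.array([[180.0, 20.0], [190.0, 30.0], [200.0, 10.0]])
X_test = np.array([[185.0, 25.0]])

# Statistics are learned from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)      # population standard deviation (ddof=0)

scaled_train = (X_train - mean) / std
scaled_test = (X_test - mean) / std   # reuse the training mean and std

print(scaled_train.mean(axis=0))
print(scaled_test)
```

After scaling, each training column has mean 0 and standard deviation 1, while test values are expressed relative to the training distribution.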
After feature scaling, we just combine our player and opposition/venue data, which gives us our matrix of features:
scaled_X_train = np.hstack((scaled_X_train, opp_teams_train))
scaled_X_train = np.asarray(scaled_X_train).astype(np.float32)
scaled_X_test = np.hstack((scaled_X_test, opp_teams_test))
scaled_X_test = np.asarray(scaled_X_test).astype(np.float32)
Run the below print statement to check that the data combined successfully:
print(scaled_X_train[0])
We now have negative values because of the feature scaling.
Step 4: building the ANN
This part of the project is surprisingly simple, as all the heavy lifting is done by the TensorFlow library. We’ll use a network with two hidden layers, each with six neurons and both utilising the rectified linear unit (ReLU) activation function. ReLU is cheap to compute and is a common default for hidden layers, which is why we’re using it. We’re opting for this network layout for simplicity, but feel free to experiment by adding more layers and/or neurons! For the output layer, we will use a sigmoid activation function, as it squashes the output into the range 0 to 1 so it can be read as a probability. For more reading on how a basic neural network functions, there is plenty of online material.
In TensorFlow (tf), the layers are added individually using the .add() method:
# Build the ANN
ann = tf.keras.models.Sequential()
# First hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Second hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Output layer
ann.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))
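To show what these three layers compute, here is a rough NumPy sketch of a forward pass through a 6-6-1 network. The weights are random stand-ins rather than trained values, the four input rows are hypothetical, and the input width of 612 assumes the 572 player stats plus the 40 opposition/venue columns built earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Rectified linear unit: negative values become zero.
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

n_features = 612   # 572 player stats + 40 opposition/venue columns

# Random weights stand in for trained ones; shapes follow the 6-6-1 layout.
W1, b1 = rng.normal(size=(n_features, 6)) * 0.01, np.zeros(6)
W2, b2 = rng.normal(size=(6, 6)), np.zeros(6)
W3, b3 = rng.normal(size=(6, 1)), np.zeros(1)

x = rng.normal(size=(4, n_features))   # four hypothetical matches

h1 = relu(x @ W1 + b1)       # first hidden layer
h2 = relu(h1 @ W2 + b2)      # second hidden layer
p = sigmoid(h2 @ W3 + b3)    # output layer: one value per match

n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(p.shape, n_params)
```

Almost all of the parameters sit in the first layer, because every one of the 612 inputs connects to each of its six neurons.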
We then compile the ANN using the Adam optimiser and the binary cross-entropy loss function. Adam is selected again for its adaptability to a range of problems, whilst the loss function is chosen due to its effectiveness for binary outputs.
ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
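Binary cross-entropy penalises confident wrong answers far more than hesitant ones. The plain-Python sketch below computes the mean loss for hypothetical labels and predictions; the function is my own illustration of the formula, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]   # mostly right and fairly sure
unsure = [0.5, 0.5, 0.5]      # no information either way

print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, unsure))
```

Predicting 0.5 for everything gives a loss of ln 2, roughly 0.693, which is a handy reference point when reading the loss values printed during training.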
Step 5: feed the data into the ANN
After compiling, the ANN is ready to be trained on the data. We will use a for loop to fit our model, as we would like to analyse the results later. On each iteration, we train the model for one more epoch (continuing from the current weights) and evaluate the loss and accuracy on both the training and test sets:
steps = []
accs = []
test_accs = []
for i in range(0, 25):
    ann.fit(scaled_X_train, y_train, epochs = 1)
    [loss, accuracy] = ann.evaluate(scaled_X_train, y_train)
    [loss_t, accuracy_t] = ann.evaluate(scaled_X_test, y_test)
    accs.append(accuracy)
    steps.append(i)
    test_accs.append(accuracy_t)
Step 6: verify results
While the training loop is running, we can see the training progress being printed, including the accuracy and loss. These are the last few lines of output:
78/78 [==============================] - 0s 628us/step - loss: 0.3769 - accuracy: 0.7887
78/78 [==============================] - 0s 462us/step - loss: 0.3508 - accuracy: 0.7952
13/13 [==============================] - 0s 539us/step - loss: 0.8428 - accuracy: 0.6176
On the bottom line, we can see the final accuracy on the test set of 61.8%. So, after all that work, we only get an accuracy of 61.8%… hardly seems worth the effort, right?
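For a sigmoid output, accuracy is usually obtained by thresholding each predicted probability at 0.5 and counting how often the result matches the label. A minimal sketch with hypothetical predictions for five matches:

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels for five matches.
probs = np.array([0.91, 0.42, 0.55, 0.08, 0.63])
labels = np.array([1, 0, 0, 0, 1])

predictions = (probs > 0.5).astype(int)      # threshold at 0.5
accuracy = (predictions == labels).mean()    # fraction of correct predictions

print(predictions, accuracy)
```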
We can plot training set accuracy (blue) and test set accuracy (orange) against the epoch number to see how accuracy changes as training progresses:
import matplotlib.pyplot as plt
plt.plot(steps, accs)
plt.plot(steps, test_accs)
[<matplotlib.lines.Line2D at 0x1f797461688>]
We can see that accuracy on the test set increases steadily over the first five epochs, then remains relatively flat while the training accuracy climbs gradually. Ideally, we would like to see a gradual increase in the test set accuracy as well.
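The same pattern can be read off the recorded lists without a plot. The numbers below are hypothetical, shaped like accs and test_accs; the sketch finds the epoch with the highest held-out accuracy and the gap between the two curves, which widens as the model starts to overfit. In practice this kind of selection is better done on a separate validation set than on the test set.

```python
# Hypothetical per-epoch accuracies, shaped like the accs and test_accs lists.
accs = [0.55, 0.62, 0.68, 0.72, 0.75, 0.78]
test_accs = [0.52, 0.57, 0.60, 0.62, 0.61, 0.61]

# Epoch with the highest held-out accuracy (the first one wins on ties).
best_epoch = max(range(len(test_accs)), key=lambda i: test_accs[i])

# Gap between training and held-out accuracy at each epoch.
gaps = [round(a - t, 2) for a, t in zip(accs, test_accs)]

print(best_epoch, gaps)
```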
Potential improvements
There are several improvements that can be made to this basic model:
- Adjusting hyperparameters using a validation set. Good neural networks use validation sets to investigate the effects of adjusting their hyperparameters, which in this case include the number of layers, the learning rate, the type of activation function and the batch size. This article sums these up in an easy-to-understand way.
- Using better data. The data set I found online was relatively simplistic in that some vital statistics weren’t included. For example, it is well known that “points from clearances” and “points from turnovers” are key indicators of successful teams. If such statistics were included in the data, the neural network may have performed better.
- Changing the structure of the data fed into the ANN. In this project, for simplicity, I used players’ data averaged over the whole time period for the respective training and test sets. However, this doesn’t really make sense: for example, the average stats for a player who played a game in 2012 would include data from 2012 to 2017, i.e. from the future. This problem could be remedied, but it would be more challenging to derive the training and test sets. Check out this code on Kaggle for an example of how a (slightly more realistic) statistical Machine Learning approach may be used on this kind of data set.
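One possible way to address the last point is to give each game only the averages of that player’s earlier games. The sketch below is an assumption about how this might be done, not the approach used in this project: the frame is hypothetical and PriorAvgDisposals is a made-up column name.

```python
import pandas as pd

# Hypothetical per-game rows for two players.
games = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(
        ["2012-04-01", "2012-04-08", "2012-04-15", "2012-04-01", "2012-04-08"]
    ),
    "Disposals": [20, 30, 10, 8, 12],
})

# Games must be in date order within each player for the running average.
games = games.sort_values(["Player", "Date"])

# shift(1) hides the current game, so each average only uses earlier games.
games["PriorAvgDisposals"] = games.groupby("Player")["Disposals"].transform(
    lambda s: s.shift(1).expanding().mean()
)
print(games)
```

A player’s first game has no history, so its value is NaN and would need to be dropped or filled before training.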
So, the moral of the story is that it’s relatively straightforward to create a simple AI. There are several libraries out there – such as Pandas, NumPy, Scikit-learn and TensorFlow – to help you out. Hopefully this article has provided you with some ideas and inspiration on how you can apply machine learning techniques for prediction.
Get in touch with the 4CDA team if you would like to learn more on how you can apply machine learning approaches to solve problems!