An introduction to using artificial neural networks via TensorFlow to predict AFL football game results

Ever wondered how an Artificial Neural Network works? Here’s a simple tutorial on the basics of implementing an ANN as a solution to a data problem using a real-world example.

Ever wondered how an Artificial Neural Network works? Here’s a simple tutorial on the basics of implementing an ANN as a solution to a data problem using a real-world example.

In this tutorial, we will use artificial neural networks (ANN), implemented via the popular open source library TensorFlow, to predict the outcome of games of Australian Rules Football (AFL). We will use historical data to train the model and then use those parameters in the predictive model. The process will include an end-to-end approach including data acquisition, cleansing and organising, before the application of the ANN and verification of results.

This tutorial assumes some prior basic knowledge in Python.

What is an artificial neural network (ANN)?

The definition according to Wikipedia for an artificial neural network is:

ANNs, usually simply called neural networks (NNs), are computing systems vaguely inspired by the biological neural networks that constitute animal brains.

– Wikipedia definition

ANNs can solve new problems by analysing previous examples of similar problems. In this example, we have chosen ANNs to demonstrate their power in classification and regression problems. Similar results could be gained using linear or logistic regression (see our introductory post on linear regression), random forest classification, etc..

Project Structure

The structure we’ll follow to create an AFL predictor involves six steps:

  1. Find/create your dataset
  2. Clean the data
  3. Organise the data
  4. Set up the Neural Network
  5. Feed the data into the Neural Network
  6. Verify results

This structure may be generalised to solve many different problems, so I encourage readers to use this article as an example of how to implement an ANN in Python and then use the method to solve an alternative problem. Furthermore, I would suggest choosing a project that you’re passionate and interested in – I found that this motivated me to continue working and tinkering even when I wasn’t sure what to do next.

The final code used in this tutorial is available in this GitHub repository.

Step 1: find/create your dataset

Before we can do anything Data Science-related, we need data. I found a Kaggle dataset containing AFL player data from (almost) all matches between 2012 and 2018. It has a manageable 63,712 rows and 37 columns.

We need to import it into Jupyter Notebook, the Python workbook we’ll be using, before we can see or do anything with it. Let’s start by importing some essential libraries and loading our data:

import tensorflow as tf
import numpy as np
import pandas as pd

TensorFlow is an open source machine learning platform from Google that contains the Neural Networks we will use in our predictor, Numpy is used extensively for matrix and array operations, and Pandas is used to put the data into a data frame just to make our lives easier.

The next part of this step is to load the data using Pandas:

afl = pd.read_csv('C:/path/to/your/csv/file.csv')

This will place the data from your CSV file into a Pandas data frame. Data frames are great for working with large data sets, as they allow for easy row/column operations and sub-setting amongst many other things.

Now that the data is loaded, we can see the column names using the .info() method:

afl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63712 entries, 0 to 63711
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Team                    63712 non-null  object 
 1   Player                  63712 non-null  object 
 2   D.O.B                   63712 non-null  object 
 3   Height                  63712 non-null  int64  
 4   Weight                  63712 non-null  int64  
 5   Position                63712 non-null  object 
 6   Season                  63712 non-null  int64  
 7   Round                   63712 non-null  object 
 8   Date                    63624 non-null  object 
 9   Score                   63624 non-null  float64
 10  Margin                  63624 non-null  float64
 11  WinLoss                 63624 non-null  object 
 12  Opposition              63624 non-null  object 
 13  Venue                   63624 non-null  object 
 14  Disposals               63712 non-null  int64  
 15  Kicks                   63712 non-null  int64  
 16  Marks                   63712 non-null  int64  
 17  Handballs               63712 non-null  int64  
 18  Goals                   63712 non-null  int64  
 19  Behinds                 63712 non-null  int64  
 20  Hitouts                 63712 non-null  int64  
 21  Tackles                 63712 non-null  int64  
 22  Rebound50s              63712 non-null  int64  
 23  Inside50s               63712 non-null  int64  
 24  Clearances              63712 non-null  int64  
 25  Clangers                63712 non-null  int64  
 26  FreesFor                63712 non-null  int64  
 27  FreesAgainst            63712 non-null  int64  
 28  BrownlowVotes           63712 non-null  int64  
 29  ContendedPossessions    63712 non-null  int64  
 30  UncontendedPossessions  63712 non-null  int64  
 31  ContestedMarks          63712 non-null  int64  
 32  MarksInside50           63712 non-null  int64  
 33  OnePercenters           63712 non-null  int64  
 34  Bounces                 63712 non-null  int64  
 35  GoalAssists             63712 non-null  int64  
 36  PercentPlayed           63712 non-null  int64  
dtypes: float64(2), int64(26), object(9)
memory usage: 18.0+ MB

Step 2: clean the data

An essential component of any data-centric project is to clean your data set. Luckily, this one did not require too much cleaning – the only necessary change was to get rid of any rows containing NA entries – but I chose to demonstrate some data handling capabilities of the Pandas library by also adding an “Age” column, which may also be relevant in what we are trying to predict:

import datetime

afl = afl.dropna(axis=0)
afl = afl[afl['WinLoss'] != 'D']
afl['D.O.B'] = pd.to_datetime(afl['D.O.B'])
afl['Date'] = pd.to_datetime(afl['Date'])
age_in_days = (afl['Date']-afl['D.O.B'])
age_in_years = age_in_days.dt.days/365.2425
afl['Age'] = age_in_years

The first line deletes any rows containing NA entries. The second line deletes rows where the match result was a draw – I chose to do this because draws are so infrequent in the AFL that it’s pretty much pointless to try predicting them. The rest of this code block finds each player’s age in years at the time they played each game and adds a column for this data.

Step 3: organise the data

This is arguably the most critical component of the project. We need to decide which data to use, and how to feed it into the ANN. I considered many different approaches to this step, but the structure I settled on is described below:

This may seem complicated, but the code snippets below should help explain what’s going on.

Firstly, the data is grouped by match, a list of players is generated for each of these matches, the team lists are shuffled and the result is converted to a Numpy array:

import random
grouped_data = afl.groupby(['Team','Season','Round','WinLoss','Opposition','Venue'])['Player'].apply(list).reset_index()
players = grouped_data['Player'].to_numpy()
for item in players:
    random.shuffle(item)
grouped_data['Player'] = players
grouped_data = grouped_data.to_numpy()

We shuffle the team lists to prevent bias towards some players in the neural network. For example, Gary Ablett is one of the best players of all time, so his average stats would probably be in the top 5% of players across the league. His name would be first on almost every one of his team lists, meaning the neural network may assign weights differently to his stats columns than other player columns. Shuffling the team lists ensures we don’t encounter this issue.

We can look at the first entry to see what each array element looks like:

grouped_data[0]
array(['Adelaide', 2012, 'PF', 'L', 'Hawthorn', 'M.C.G.',
       list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])],
      dtype=object)

We can now see that each array element are themselves arrays containing the team, season, round, result, opposing team, ground and list of players on the team.

The next part of this step is to convert non-numerical information into numerical data so we can feed it into the ANN. The best way of doing this is by using column encoding – we won’t go into too much detail in this article, but there are plenty of tutorials online covering this topic (here is an example). So, we convert the match result, opposing team and ground columns into encoded columns, which requires the OneHotEncoder and ColumnTransformer modules out of Scikit-learn.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct1 = ColumnTransformer([('encoder',OneHotEncoder(),[3,4,5])], remainder='passthrough',sparse_threshold=0)
grouped_data = ct1.fit_transform(grouped_data)

That’s all there is to it – let’s take another look at the first array element to see what each entry looks like after the transformation:

grouped_data[0]
array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 'Adelaide', 2012, 'PF',
       list(['Thompson, Scott', 'Smith, Brodie', 'Porplyzia, Jason', 'Sloane, Rory', 'Reilly, Brent', 'Rutten, Ben', 'Douglas, Richard', 'van Berlo, Nathan', 'Tippett, Kurt', 'Callinan, Ian', 'Henderson, Ricky', 'Thompson, Luke', 'Johncock, Graham', 'Doughty, Michael', 'Jacobs, Sam', 'Otten, Andy', 'Wright, Matthew', 'Walker, Taylor', 'Mackay, David', 'Petrenko, Jared', 'Dangerfield, Patrick', 'Vince, Bernie'])],
      dtype=object)

Note how the encoded columns were pushed to the front of the array. Now, we need a training set and a test set. The training set will be fed into the ANN and used to develop the model, and the test set will allow us to measure the performance of the model. Usually, I would recommend choosing training and test sets using train_test_split out of Scikit-learn, however I went for a different approach in this project: I chose the data from the 2012-2017 season as my training set and data from the 2018 season as my test set. Firstly, split grouped data:

training = [x for x in grouped_data if x[43]<2018]
test = [x for x in grouped_data if x[43]==2018]

So “training” and “test” are just subsets of grouped_data.

Next, we find an encoded column corresponding to whether the team won or lost and use this as our dependent variable (as this is what we’re trying to predict):

y_train = np.array([x[0] for x in training])
y_train = y_train.reshape(-1,1)
y_test = np.array([x[0] for x in test])
y_test = y_test.reshape(-1,1)

The result is a two-dimensional array with 1’s corresponding to a win and 0’s to a loss:

print(y_train)
[[1.]
 [1.]
 [0.]
 ...
 [0.]
 [1.]
 [1.]]

Next, we find average stats for each player for the respective training and test sets and shuffle the resulting data to prevent bias:

# player_stats is each players' average stats between 2012 and 2017.
player_stats = afl[afl['Season'] < 2018].groupby('Player',sort=False).mean()
player_stats = player_stats.reset_index()
player_stats = player_stats.drop(['Season','Score','Margin'],axis=1)
player_stats = player_stats.sample(frac=1).reset_index(drop=True)
player_stats = player_stats.to_numpy()

# Player average stats for 2018
player_stats_2018 = afl[afl['Season'] == 2018].groupby('Player',sort=False).mean()
player_stats_2018 = player_stats_2018.reset_index()
player_stats_2018 = player_stats_2018.drop(['Season','Score','Margin'],axis=1)
player_stats_2018 = player_stats_2018.sample(frac=1).reset_index(drop=True)
player_stats_2018 = player_stats_2018.to_numpy()

We choose to ignore the season, score and margin columns. I guessed that the “season” column would not improve the prediction, which was later verified through testing. The “score” and “margin” columns are direct indicators of the match result, so if they were included in the analysis the neural network would achieve an accuracy of close to 100% on the training set but would perform very poorly on the test set.

We can see what each entry of these player statistic arrays looks like:

player_stats[0]
array(['Goodes, Brett', 183.0, 89.0, 17.045454545454547,
       10.772727272727273, 4.045454545454546, 6.2727272727272725,
       0.18181818181818182, 0.09090909090909091, 0.13636363636363635,
       2.6363636363636362, 2.8181818181818183, 2.272727272727273,
       1.3636363636363635, 2.772727272727273, 0.9545454545454546, 1.0,
       0.09090909090909091, 6.2727272727272725, 9.909090909090908,
       0.22727272727272727, 0.0, 1.8636363636363635, 0.5,
       0.18181818181818182, 78.5, 29.979335024613157], dtype=object)

Now we want to combine our encoded venue and opposition data with our average player data. Firstly, we get our encoded data from the “training” and “test” arrays we created previously:

opp_teams_train = np.array([x[2:42] for x in training])
opp_teams_test = np.array([x[2:42] for x in test])

The following “for” loop then reads the list of players on a team for each match, finds each player’s average stats in player_stats and adds them to an array, i.e. each row of X_train contains every player statistic for each game. The same thing is done for player_stats_2018, with the results stored in X_test:

X_train = [0]*len(training)
for i in range(0,len(training)):
    player_list = []
    j = 0
    for j in range(0,len(player_stats)):
        if player_stats[j][0] in training[i][-1]:
            player_list.append(player_stats[j][1:])
    X_train[i] = player_list

X_train = [np.concatenate(x) for x in X_train]
X_test = [0]*len(test)
for i in range(0,len(test)):
    player_list = []
    j = 0
    for j in range(0,len(player_stats_2018)):
        if player_stats_2018[j][0] in test[i][-1]:
            player_list.append(player_stats_2018[j][1:])
    X_test[i] = player_list

X_test = [np.concatenate(x) for x in X_test]
print(len(X_train[0]))
572

The final part of this data engineering step involves something called feature scaling. Again, we won’t explore this concept in-depth, but there’s a lot of learning material out there if you’re curious (this Towards Data Science article, for example). Essentially, feature scaling improves the convergence rate of the ANN we’re about to create. We only need to scale our non-binary data, so we can leave our dependent variable arrays and opposition/venue data alone.

For this project I just used StandardScaler class out of the Scikit-learn pre-processing module:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled_X_train = sc.fit_transform(X_train)
scaled_X_test = sc.fit_transform(X_test)

After feature scaling, we just combine our player and opposition/venue data, which gives us our matrix of features:

scaled_X_train = np.hstack((scaled_X_train, opp_teams_train))
scaled_X_train = np.asarray(scaled_X_train).astype(np.float32)

scaled_X_test = np.hstack((scaled_X_test, opp_teams_test))
scaled_X_test = np.asarray(scaled_X_test).astype(np.float32)

Run the below print statement to check that the data combined successfully:

print(scaled_X_train[0])

The reason we now have negative values is because of the feature scaling.

Step 4: building the ANN

This part of the project is surprisingly simple, as all the heavy lifting is done by the TensorFlow library. We’ll use a network with two hidden layers, each with six neurons and both utilising the rectified linear unit (ReLU) activation function. ReLU is one of the most diverse and adaptable activation functions, which is why we’re using it. We’re opting for this network layout for simplicity, but feel free to experiment by adding more layers and/or neurons! For the output layer, we will use a sigmoid activation function as we are looking for a probabilistic output. For more reading on how a basic neural network functions, there is plenty of online material.

In TensorFlow (tf), the layers are added individually using the .add() method:

# Build the ANN
ann = tf.keras.models.Sequential()
# First hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Second hidden layer
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
# Output layer
ann.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))

We then compile the ANN using the Adam optimiser and the binary cross-entropy loss function. Adam is selected again for its adaptability to a range of problems, whilst the loss function is chosen due to its effectiveness for binary outputs.

ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

Step 5: feed the data into the ANN

After compiling, the ANN is ready to be trained on the data. We will use a for loop to fit our model, as we would like to analyse the results later. For each loop, we re-fit the model and evaluate the loss and accuracy on both the training and test sets:

steps = []
accs = []
test_accs = []
for i in range(0, 25):
    ann.fit(scaled_X_train, y_train, epochs = 1)
    [loss, accuracy] = ann.evaluate(scaled_X_train, y_train)
    [loss_t, accuracy_t] = ann.evaluate(scaled_X_test, y_test)
    accs.append(accuracy), steps.append(i), test_accs.append(accuracy_t)

Step 6: verify results

While the training loop is running, we can see the training progress being printed, including the accuracy and loss. These are the last few lines of output:

78/78 [==============================] - 0s 628us/step - loss: 0.3769 - accuracy: 0.7887
78/78 [==============================] - 0s 462us/step - loss: 0.3508 - accuracy: 0.7952
13/13 [==============================] - 0s 539us/step - loss: 0.8428 - accuracy: 0.6176

On the bottom line, we can see the final accuracy on the test set of 61.8%. So, after all that work, we only get an accuracy of 61.8%… hardly seems worth the effort, right?

We can plot training set accuracy (blue) and test accuracy (orange) over time steps to investigate how accuracy fluctuates over training steps (epochs):

import matplotlib.pyplot as plt
plt.plot(steps, accs)
plt.plot(steps, test_accs)
[<matplotlib.lines.Line2D at 0x1f797461688>]

We can see that accuracy on the test set increases steadily over the first five epochs, then remains relatively flat while the training accuracy climbs gradually. Ideally, we would like to see a gradual increase in the test set accuracy as well.

Potential improvements

There are several improvements that can be made to this basic model:

  • Adjusting hyperparameters using a validation set. Good neural networks use validation sets to investigate the effects of adjusting their hyperparameters, which in this case include the number of layers, the learning rate, the type of activation function and the batch size. This article sums these up in an easy-to-understand way.
  • Using better data. The data set I found online was relatively simplistic in that some vital statistics weren’t included. For example, it is well known that “points from clearances” and “points from turnovers” are key indicators of successful teams. If such statistics were included in the data, the neural network may have performed better.
  • Changing the structure of the data fed into the ANN. In this project, for simplicity, I used players’ data averaged over the whole time period for the respective training and test sets. However, this doesn’t really make sense: for example, the average stats for a player who played a game in 2012 would include data from 2012 to 2017, i.e. from the future. This problem could be remedied, but it would be more challenging to derive the training and test sets. Check out this code on Kaggle for an example of how a (slightly more realistic) statistical Machine Learning approach may be used on this kind of data set.

So, the moral of the story is that it’s relatively straightforward to create a simple AI. There are several libraries out there – such as Pandas, Numpy, Scikit-learn and Tensorflow – to help you out. Hopefully this article has provided you with some ideas and inspiration on how you can apply machine learning techniques for prediction.

Get in touch with the 4CDA team if you would like to learn more on how you can apply machine learning approaches to solve problems!