Friday 27 January 2017

Feedforward Artificial Neural Network pt5: Additional analysis

Now that we've finally implemented our ANN, let's have a play around with some of the parameters to get an understanding of how they affect our network and its results.
The tricky part about training ANNs is that the loss function isn't necessarily convex, which means we can't rely on our usual convex optimisation guarantees. Because the loss isn't necessarily convex, finding a local minimum doesn't mean we've found the global minimum, so non-convex problems may converge to different local minima depending on the parameters of the optimisation routine. We'll explore some of these parameters below.

Learning Rate

Recall how the learning rate $\eta$ enters our optimisation procedure via the weight updates in gradient descent: $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ It essentially controls the step size at each update. Recall we had some funny bumps in our loss function at certain iterations - let's take a closer look. I've plotted the results of two different training runs of the ANN below.
The two lines show the total loss as a function of the number of iterations in our training of the ANN. The blue line has $\eta = 0.001$ and the green line has $\eta = 0.01$. You can see that the green line - the run with the larger learning rate - has those funny bumps we witnessed before. The spikes occur when the step size is too large and we overshoot the minimum. Notice that the blue line doesn't have these overshoots, however it takes more iterations to approach the minimum. If we take a step size which is far too large, then we consistently overshoot and never converge on the minimum:


The key is finding a learning rate which will find the minimum within a reasonable timeframe. Although our selection ($\eta = 0.001$ vs $\eta = 0.01$) didn't make a huge difference in this case, consider an ANN with multiple hidden layers and thousands of neurons in each layer. This network may take hours (or days) to train depending on how we choose our learning rate.
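To see the step-size behaviour in isolation, here is a minimal sketch on a hypothetical 1-D quadratic loss $L(w) = w^2$ (not our ANN loss - just the simplest case where the effect of $\eta$ is easy to see):

def gradient_descent(eta, w0=5.0, n_iter=50):
    # minimise L(w) = w^2, where dL/dw = 2w and the minimum is at w = 0
    w = w0
    for _ in range(n_iter):
        w = w - eta * 2 * w   # w -> w - eta * dL/dw = (1 - 2*eta) * w
    return w

print(gradient_descent(0.1))   # |1 - 2*eta| < 1: converges steadily towards 0
print(gradient_descent(0.9))   # factor is -0.8: overshoots each step, but still converges
print(gradient_descent(1.1))   # factor is -1.2: each overshoot grows and we never converge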
Depending on the problem at hand, you may value accuracy more than efficiency or vice versa; this will dictate how you choose your learning rate, which you would usually select using cross validation.
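As a rough sketch of what that selection might look like (reusing the calculate function from the pt4 post further down this page, and scoring on the final training loss for simplicity - a proper treatment would score each candidate $\eta$ on held-out data):

# sweep a few candidate learning rates and record the final loss
for eta in [0.0001, 0.001, 0.01, 0.1]:
    loss, model = calculate(X, y, 3, n_iter=2000, learning_rate=eta)
    print(eta, loss[-1])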

Regularisation / Weight Decay

Say we have our initial loss function (the cross entropy loss) $L_0$ and we add a regularisation term such that we now have $$L = L_0 + \frac{\lambda}{2n} \sum_{w} w^2$$ where the sum is over all weights. Now if $\lambda$ is large then the second term will dominate $L$, and the task of optimising the entire expression will be reduced to minimising $\sum_w w^2$. If $\lambda$ is small then the first term dominates and there are fewer restrictions placed on $w$. This regularisation term controls $w$ by preventing it from becoming overly large and helps keep us from overfitting the model. If we want to use gradient descent to minimise this regularised loss function, then for each individual weight $w$ we have $$ \frac{\partial L}{\partial w} = \frac{\partial L_0}{\partial w} + \frac{\lambda}{n} w$$ so our update at each iteration $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ becomes $$ w \rightarrow w - \eta \frac{\partial L_0}{\partial w} - \frac{\eta \lambda}{n} w$$ $$\implies w \rightarrow \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial L_0}{\partial w}$$ That is, at each iteration the weight $w$ is rescaled by a factor of $\left( 1 - \frac{\eta \lambda}{n} \right)$ before the usual gradient step; this is referred to as weight decay and, as mentioned before, limits the magnitude of $w$.
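Since the two forms of the update above are algebraically identical, we can check the equivalence numerically - a standalone numpy sketch with stand-in values (mirroring the regularisation_rate lines in the pt4 code further down this page):

import numpy as np

np.random.seed(0)
W = np.random.randn(2, 3)          # a weight matrix
dL0_dW = np.random.randn(2, 3)     # stand-in for dL0/dW from backpropagation
eta, lam_over_n = 0.01, 0.1        # learning rate and lambda/n

# direct form: add the regularisation gradient, then step
W_direct = W - eta * (dL0_dW + lam_over_n * W)

# weight decay form: rescale W, then take the unregularised step
W_decay = (1 - eta * lam_over_n) * W - eta * dL0_dW

print(np.allclose(W_direct, W_decay))   # True - the two forms agree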

Weight initialisation

In this section we'll take a look at why we chose to initialise our weights as we did (from a normal distribution with specific parameters). Recall the definition of the weight update from our gradient descent algorithm: $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ If the second term in this expression is small or zero, then there is effectively little or no update to $w$. This slows our training down dramatically - after each iteration our weight $w$ changes only ever so slightly - and obviously we would like to avoid this situation at the start of the procedure. Recall the backpropagation rules for $W^{(1)}$:

  • $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$
  • $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$

    We see that the $(1-\tanh^{2}(Z^{(1)}))$ term enters the equation (more generally, this will be the derivative of the activation function). So our update to the weight $w$ is $$w \rightarrow w - \eta (1-\tanh^{2}(Z^{(1)})) \odot {x}^T  \delta^{(2)}{W^{(2)}}^T$$ That is, the amount we update $w$ by is proportional to the derivative of our activation function, so we want to avoid initialising our weights in a region where this derivative is close to zero. Below is a plot of the function.
    We can see that this activation function has its derivative approach zero at both extremes: as $x \rightarrow \infty$ and as $x \rightarrow -\infty$. Let's think about a more general ANN for a moment - suppose we have an ANN with 1000 inputs and a single training example where each input is equal to $1$. We have as usual $$Z^{(1)} = x W^{(1)} +b^{(1)}$$ If we have initialised each entry $W^{(1)}_{ij}$ and $b^{(1)}_j$ from a standard normal distribution (iid), then each entry $Z^{(1)}_{j}$ will be the sum of 1001 iid standard normal variables. The sum of $N$ iid standard normal variables has mean $0$ and standard deviation $\sqrt{N}$, so here the standard deviation is $\sqrt{1001} \approx 31.6$ - a very wide distribution with a relatively high probability of giving a large negative or positive result (it almost looks like a uniform distribution). At such values the derivative of the activation function is very close to zero! This isn't what we want.
    What about if we initialise with a random normal with mean $0$ and standard deviation $\frac{1}{\sqrt{1000}}$? We know that the variance of a sum of iid normal variables is the sum of the variances, so now each entry of $Z^{(1)}$ has mean $0$ and standard deviation $$\sigma = \sqrt{1000 \times \frac{1}{1000}+1} = \sqrt{2}$$ which is a lot narrower than our distribution before - there is a much smaller chance of initialising at values where the derivative of the activation function is close to $0$. Below is a comparison of the resulting initialisation distributions from the toy example - the green line is the resulting distribution for the refined initialisation, while the red line results from initialisation by standard normal variables.

    More generally, for a given layer we will initialise the weights from a Gaussian distribution with mean $0$ and standard deviation $\frac{1}{\sqrt{N_{in}}}$, where $N_{in}$ is the number of inputs feeding into that layer.
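    Here's a quick numpy sketch of the toy example above. Each row of W plays the role of an independent draw of the 1000 weights, so the printed values estimate the standard deviation of an entry of $Z^{(1)}$ under each scheme:

    import numpy as np

    np.random.seed(0)
    n_in, n_trials = 1000, 10000
    x = np.ones(n_in)                        # a single training sample of all 1s

    # standard normal weights and bias: std of Z is about sqrt(1001) ~ 31.6
    W = np.random.randn(n_trials, n_in)
    b = np.random.randn(n_trials)
    print((W.dot(x) + b).std())

    # weights scaled to std 1/sqrt(1000), bias still standard normal: std ~ sqrt(2)
    W = np.random.randn(n_trials, n_in) / np.sqrt(n_in)
    print((W.dot(x) + b).std())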
    Next time we'll have a look at optimising our network using stochastic gradient descent and maybe play around with some different datasets.

    Thursday 12 January 2017

    Feedforward Artificial Neural Network pt4: Implementation!

    So if you've survived this far, it's all worth it! We've done all the groundwork required to code up our ANN to classify our toy example (make moons dataset kindly provided by sklearn).

    Recap

    Recall that a motivation for exploring ANNs was that our standard Logistic Regression classification technique could only ever give us a linear separating boundary:

    ANN structure and Matrices

    In the last blog post we derived general expressions for a single hidden layer feedforward ANN. The results below are for a network with the following features:
    • 2 inputs, $x_1$, $x_2$
    • 1 Hidden Layer with $n$ neurons - we will treat $n$ as a parameter and see how it affects the output classification
    • 2 outputs, $\hat{y}_1$, $\hat{y}_2$
    • Activation function: $\sigma(x) = \tanh(x)$
    • Softmax output
    • Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$

    Forward Propagation Summary

    • $Z^{(1)} = xW^{(1)} + b^{(1)}$
    • $a^{(1)} = \tanh(Z^{(1)})$
    • $Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$
    • $\hat{y} \equiv a^{(2)} = softmax(Z^{(2)})$
    Backpropagation Summary

    • $\delta^{(2)} = \hat{y} - y$
    • $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$
    • $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$
    • $\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$
    • $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$
    • $\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$
    With the above, we have entirely defined how our ANN classifies training examples, how the weights and biases of the network affect the loss function and hence have our gradient descent update rule completely specified (up to the learning rate parameter).

    Implementation in Python

    Now instead of defining our network and neurons as a class, I've decided to closely follow this page (which actually inspired all of these ANN related posts), as it is more transparent and makes it easier to grasp what is going on in the code. So without further delay, I present the code for an ANN with a single hidden layer of size 3:

    Code

    import numpy as np
    import sklearn.datasets
    import pandas as pd
    import matplotlib.pyplot as plt
    np.random.seed(0) 
    X, y = sklearn.datasets.make_moons(200, noise=0.20)
    df = pd.DataFrame()
    df['x1'] = X[:,0]
    df['x2'] = X[:,1]
    df['z'] = y


    ##define ANN which takes inputs X and y, with parameters n_iter (the number of iterations) and learning_rate (the rate used in gradient descent)
    def calculate(X ,y, num_hidden_neurons, n_iter=2000, learning_rate = 0.01, regularisation_rate = 0):
        loss = []
        #initialise weights and biases
        X_size = len(X)
        np.random.seed(0)
        W1 = np.random.randn(2, num_hidden_neurons) / np.sqrt(2)
        b1 = np.zeros((1, num_hidden_neurons))
        W2 = np.random.randn(num_hidden_neurons, 2) / np.sqrt(num_hidden_neurons)
        b2 = np.zeros((1, 2))
        
        model = {}
        
        for i in range(n_iter):
        #feedforward
            Z1 = X.dot(W1)+b1
            a1 = np.tanh(Z1)
            Z2 = a1.dot(W2)+b2
            yhat = softmax(Z2)
        #backpropagation
            d2 = yhat
            d2[range(X_size),y] -= 1      # d2 = yhat - y, using the labels y to index the correct class column
            d1 = (1-a1**2)*d2.dot(W2.T)   # (1 - tanh^2(Z1)) elementwise-times d2.W2^T
            dW2 = a1.T.dot(d2)
            db2 = np.sum(d2, axis=0, keepdims = True)
            dW1 = X.T.dot(d1)
            db1 = np.sum(d1, axis=0)
        #regularisation
            dW1 += regularisation_rate * W1     
            dW2 += regularisation_rate * W2  
        #gradient descent    
            W1 = W1 - learning_rate * dW1
            b1 = b1 - learning_rate * db1
            W2 = W2 - learning_rate * dW2
            b2 = b2 - learning_rate * db2
            
            model = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}
            loss.append(predict(model,X,y,True))
                    
        return loss, model

    Comments

    The code is fairly self explanatory, but I'll walk through the steps:
    1. Importing all the libraries we need
    2. Initialisation of the weights and biases (more on this later)
    3. Feedforward step - implemented via matrices exactly how we calculated them in the previous blog post
    4. Backpropagation step, again implemented just how we derived it before. Note that as the $b^{(i)}$ matrices consist of a single unique row each, we let numpy broadcast them into the appropriate dimensions when it does the matrix addition in the calculation of $Z^{(1)}$ and $Z^{(2)}$.
    5. Gradient descent to alter the weights after each iteration in order to minimize the loss function.
    6. Regularisation rate - I haven't mentioned this before; I've put it in there but set it to $0$ so as to have no effect on the results. I will discuss this term later.
    7. Output a dictionary model which contains our weight and bias matrices $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, $b^{(2)}$.
    8. Note that I call a function (not defined here) called predict which will make predictions based on the input $X$, or return the cross entropy loss of a set of weights and biases when applied to $X$. The calculation of the cross entropy loss is the reason that $y$ is an input into the overall calculate function; if we were just calculating the weights and biases we would only require $X$ as an input. See the code for predict at the bottom of this post.

    Results

    #function call - 3 neurons in the hidden layer
    loss, model = calculate(X, y, 3)
    The decision boundary is plotted below
    Notice the non-linear decision boundary, and how well it captures the separation of data points! This gives us great hope: if the data generating process is distinctly non-linear, then even though the standard GLM (i.e. logistic regression) decision boundaries are crude, with an ANN we can potentially get a good fit!
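    For reference, here's a minimal sketch of how such a plot can be produced - evaluate predict (defined at the bottom of this post) over a grid of points and colour by predicted class (the helper name and plotting choices here are my own):

    def plot_decision_boundary(model, X, y, h=0.01):
        # evaluate the classifier on a fine grid covering the data
        x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
        y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        grid = np.c_[xx.ravel(), yy.ravel()]
        Z = predict(model, grid, None).reshape(xx.shape)   # y is unused when not computing the loss
        plt.contourf(xx, yy, Z, alpha=0.4)                 # predicted class regions
        plt.scatter(X[:, 0], X[:, 1], c=y)                 # the training data
        plt.show()

    plot_decision_boundary(model, X, y)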
    The corresponding cross entropy loss as a function of iterations is also plotted below:
    Ok, the overall trend of the loss looks alright, but what the hell are the spikes??? Stay tuned...

    What about if we have 10 neurons in the hidden layer?
    #function call - 10 neurons in the hidden layer
    loss, model = calculate(X, y, 10)
    and the corresponding loss: 
    Now the spikes are gone and the loss is lower than in the 3 neuron case. Whoa, this is a really good fit - or is it? How good is too good? Clearly here, with the 10 neurons, the ANN is not only fitting the overall general shape of the data, it's capturing the noise from the input too. Just looking at the data (ignoring the decision boundary) we can imagine a kind of smooth sinusoidal shape separating the classes (as in the 3 neuron case above). Notice the jagged edge near the lone red dot amongst the blue dots at coordinates (-0.5, 0.7): you can see that the single red training example has 'pulled' the decision boundary towards it. This is an ominous sign, as the decision boundary is far too sensitive to a single data point - remember the idea is to capture the essence of the data generating process excluding noise.
    If we were to assess the performance of our ANN purely on the cross entropy training loss, it would seem that the network with 10 neurons in the hidden layer outperforms the one with 3. However we always need to test the ANN on data it hasn't seen before - cross validation is usually employed to do so.
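    A minimal sketch of that idea with a single held-out split (using sklearn's train_test_split; a full cross validation would average over several such splits):

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    for n in [3, 10]:
        loss_n, model_n = calculate(X_train, y_train, n)
        train_loss = predict(model_n, X_train, y_train, True)
        test_loss = predict(model_n, X_test, y_test, True)   # loss on unseen data
        print(n, train_loss, test_loss)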

    Well, there it is. If you've followed the posts from pt.1 of this series knowing the bare minimum about ANNs, armed only with a pen, paper and a desire to learn, then you should be able to implement your own feedforward ANN from scratch! It was a laborious task (all the matrix algebra, etc.) but I hope it provided insight into the construction and architecture of ANNs.

    Useful functions

    # Predict y_hat from input X or calculate loss
    def predict(model, X, y, calc_loss = False):
        W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
        # Forward propagation
        Z1 = X.dot(W1) + b1
        a1 = np.tanh(Z1)
        Z2 = a1.dot(W2) + b2
        probs = softmax(Z2)
        if calc_loss:
            loss = -np.log(probs[range(len(X)), y])
            tot_loss = np.sum(loss)
            return 1./len(X) * tot_loss
        return np.argmax(probs, axis=1)  
    #Compute softmax values for each set of scores in x (applied row-wise).
    def softmax(x):
        e = np.exp(x - np.max(x, axis=1, keepdims=True))   # shift each row by its max for numerical stability
        return e / np.sum(e, axis=1, keepdims=True)
    Given a model, the predict function returns a classification (0 or 1) for our input data X, or the cross entropy loss; the loss output is what we use to keep track of the loss at each iteration of the weight updates. Note that this is just for illustrative purposes - it's an expensive calculation, so in reality it isn't wise to call it every iteration; maybe every 100, or 1000 or so.

    The softmax function computes the softmax of the entries of each row of its input - used to calculate our $\hat{y}$.
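    A quick sanity check of the helper - each row of the output is a valid probability distribution:

    scores = np.array([[1.0, 2.0], [-1.0, 3.0]])
    probs = softmax(scores)
    print(probs)                 # entries lie in (0, 1)
    print(probs.sum(axis=1))     # each row sums to 1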

    Next steps

    Currently I'm working my way through the CS229 Machine Learning course by Andrew Ng, which takes up the majority of my spare time. It's a fantastic course with accompanying lectures available on YouTube, and is a great introduction to ML techniques.

    However, in the future I aim to continue with this implementation and tweak it to explore the following:
    • Why the weights were initialised in such a peculiar way (why not just set them all to 0?)
    • Extending to more than 1 hidden layer
    • How to prevent overfitting
      • Regularisation (I mentioned it above, but did not explain what/how/why)
      • Dropout techniques
    • Stochastic gradient descent
      • A variant of our batch gradient descent which is industry practice when training an ANN (due to massive datasets)
    • Different applications - when I get around to it, I'll implement an ANN for the classic MNIST handwritten digit recognition problem.


    Wednesday 4 January 2017

    Feedforward Artificial Neural Network pt3: Matrices!

    So far we've looked at the building blocks of an ANN - the different layers, weights, activation functions and error functions. We now have a decent understanding of what the objects are and how they relate, but so far we've only looked at relatively small parts of the network in isolation, for individual training samples. We'll look at implementing the aforementioned ideas using matrices, which will be very helpful when we build our full network and have hundreds (or thousands) of training samples. This is a very natural thing to do - if we can set up the propagation of our network and training samples through matrix multiplication, then the computer can do the work in optimising the calculations. The other option is the old fashioned way of using for loops to iterate through each neuron and each training example in our network, which is markedly slower and less efficient.
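    As a small sketch of the point (hypothetical shapes - 200 samples, 2 inputs, 3 hidden neurons), the single vectorised line below computes exactly what the explicit double loop does:

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(200, 2)          # 200 training samples, 2 inputs each
    W1 = np.random.randn(2, 3)           # weights into 3 hidden neurons
    b1 = np.random.randn(1, 3)           # one bias row, broadcast over samples

    Z1_matrix = x.dot(W1) + b1           # the matrix way: one line

    Z1_loop = np.zeros((200, 3))         # the old fashioned way
    for i in range(200):                 # loop over training samples
        for j in range(3):               # loop over hidden neurons
            Z1_loop[i, j] = sum(x[i, k] * W1[k, j] for k in range(2)) + b1[0, j]

    print(np.allclose(Z1_matrix, Z1_loop))   # True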

    Sample ANN - using matrices

    We'll use the below network to demonstrate the power of the matrix implementation. Note that we will do this step by step, so it can be rather laborious/repetitive but it's important to understand how each element of the matrix relates to the inputs and is propagated through the network. 
    The network has the following features:
    • 2 inputs, $x_1$, $x_2$
    • 1 Hidden Layer with 3 neurons
    • 2 outputs, $\hat{y}_1$, $\hat{y}_2$
    • Activation function: $\sigma(x) = \tanh(x)$
    • Softmax output
    • Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$

    Inputs

    We define our input matrix based on two training samples: $$x = \left[ \begin{array}{cc} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right]$$ 
    where the matrix is defined such that $x_{ij}$ is the $j^{th}$ input of the $i^{th}$ training sample.

    Weights

    Now we define a matrix $W^{(1)}$ of weights for the neurons in the Hidden Layer:
    $$W^{(1)} = \left[ \begin{array}{ccc} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \end{array} \right]$$ where $w_{ij}^{(1)}$ is the weight for the $i^{th}$ input in the $j^{th}$ neuron in the hidden layer. Thus the $j^{th}$ column of $W^{(1)}$ holds the weights that each input receives in the $j^{th}$ hidden neuron.

    Bias

    We introduce the concept of bias in an activation function. The bias is a translation of the entire activation function and is implemented as follows: $$\sigma(w \cdot x +b)$$ where $w$ is the weight, $x$ is the input into the particular neuron and $b$ is the bias. The bias adds flexibility to the activation function; we can achieve outputs that would otherwise be impossible for a given input.
    For example, consider an activation function without bias $$\sigma(x) = \tanh(w \cdot x)$$ and say we want to achieve an output of $0.5$ when $x=0$. There is no such $w$ that will allow us to achieve this result, as $\tanh(0) = 0$. However if we introduce bias we have $$\sigma(x) = \tanh(w \cdot x + b)$$ and we can set $b = \tanh^{-1}(0.5) \approx 0.549$ to achieve the desired result. Now we introduce the bias matrix - the matrix of biases for the hidden layer $$b^{(1)} = \left[ \begin{array}{ccc} b_{11}^{(1)} & b_{12}^{(1)} & b_{13}^{(1)} \\ b_{21}^{(1)} & b_{22}^{(1)} & b_{23}^{(1)} \end{array} \right]$$ with $b^{(1)}_{ij} = b^{(1)}_{kj}$ for all $k$, i.e. $b^{(1)}$ has a single unique row; all other rows in the matrix are a copy of this one. This is just to ensure that the matrix multiplication works as intended.

    Forward Propagation

    We'll now work our way through the network, starting at the input, traversing through the hidden layer and arriving at the output.

    Using the notation described in this previous blog post, we get our matrix $Z^{(1)}$ which contains the results of applying the weights and biases to our inputs. Thus $$Z^{(1)} = x W^{(1)}+b^{(1)}$$ is a $2 \times 3$ matrix with the following structure
    $$Z^{(1)} = \left[ \begin{array}{ccc} x_{11}w_{11}^{(1)}+x_{12}w_{21}^{(1)}+b_{11}^{(1)} & \cdots & \cdots \\ x_{21}w_{11}^{(1)}+x_{22}w_{21}^{(1)}+b_{21}^{(1)} & \cdots & \cdots \end{array} \right]$$ where I've only included the first column of the matrix. In matrix index notation we have $$Z_{ij}^{(1)} = \sum_{k} x_{ik} W^{(1)}_{kj} + b_{ij}^{(1)}$$ We can interpret $Z_{ij}^{(1)}$ as the input that the $j^{th}$ hidden neuron receives from the $i^{th}$ training sample.

    Once at the hidden layer, the activation function is applied (elementwise) to $Z^{(1)}$ as $$a^{(1)} = \left[ \begin{array}{ccc} a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} \\ a_{21}^{(1)} & a_{22}^{(1)} & a_{23}^{(1)} \end{array} \right] \equiv \tanh(Z^{(1)}) $$

    We now propagate these values through the hidden layer, applying the weights and biases resulting in a $2 \times 2$ matrix $Z^{(2)}$ defined as $$Z^{(2)} = a^{(1)}W^{(2)}+b^{(2)}$$ where $$W^{(2)} = \left[ \begin{array}{cc} w_{11}^{(2)} & w_{12}^{(2)} \\ w_{21}^{(2)} & w_{22}^{(2)} \\ w_{31}^{(2)} & w_{32}^{(2)} \end{array} \right]$$ and $$b^{(2)} = \left[ \begin{array}{cc} b_{11}^{(2)} & b_{12}^{(2)} \\ b_{21}^{(2)} & b_{22}^{(2)} \end{array} \right]$$ Similarly to $b^{(1)}$, this matrix has a single unique row, with the rest of the rows of the matrix being exact copies. Explicitly, $Z^{(2)}$ has the following entries
    $$\left[ \begin{array}{cc} a_{11}^{(1)}w_{11}^{(2)}+a_{12}^{(1)}w_{21}^{(2)}+a_{13}^{(1)}w_{31}^{(2)}+b_{11}^{(2)} & a_{11}^{(1)}w_{12}^{(2)}+a_{12}^{(1)}w_{22}^{(2)}+a_{13}^{(1)}w_{32}^{(2)}+b_{12}^{(2)} \\ a_{21}^{(1)}w_{11}^{(2)}+a_{22}^{(1)}w_{21}^{(2)}+a_{23}^{(1)}w_{31}^{(2)}+b_{21}^{(2)} & a_{21}^{(1)}w_{12}^{(2)}+a_{22}^{(1)}w_{22}^{(2)}+a_{23}^{(1)}w_{32}^{(2)}+b_{22}^{(2)} \end{array} \right]$$ More compactly, we have $$Z_{ij}^{(2)} = \sum_{k} a^{(1)}_{ik} W^{(2)}_{kj} + b^{(2)}_{ij} $$ where $Z_{ij}^{(2)}$ can be considered the input to the $j^{th}$ output neuron from the $i^{th}$ training sample.

    Finally, applying the softmax function in the output neurons we end up with $$a^{(2)} = \text{softmax}(Z^{(2)}) \equiv \left[ \begin{array}{cc} \hat{y}_{11} & \hat{y}_{12} \\ \hat{y}_{21} & \hat{y}_{22} \end{array} \right]$$ where $a_{ij}^{(2)}$ is the output of the $j^{th}$ output neuron for the $i^{th}$ training sample. Hopefully the explicit matrix multiplication helps illuminate how the input values are propagated through the network and how their values are affected by the various weights in the input/hidden/output layers. Next we'll take an explicit look at backpropagation and establish similar matrix results which will allow us to train the network in a relatively efficient manner.
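    To make the above concrete, here's a minimal numpy sketch of the full forward pass for this 2-input, 3-hidden-neuron, 2-output network on two training samples (weights and biases drawn at random purely for illustration):

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(2, 2)                              # 2 samples x 2 inputs
    W1, b1 = np.random.randn(2, 3), np.random.randn(1, 3)
    W2, b2 = np.random.randn(3, 2), np.random.randn(1, 2)

    Z1 = x.dot(W1) + b1                                    # 2x3 matrix
    a1 = np.tanh(Z1)                                       # elementwise activation
    Z2 = a1.dot(W2) + b2                                   # 2x2 matrix
    yhat = np.exp(Z2) / np.exp(Z2).sum(axis=1, keepdims=True)   # row-wise softmax

    print(yhat)                 # row i holds [y_hat_i1, y_hat_i2]
    print(yhat.sum(axis=1))     # each row sums to 1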

    Backpropagation

    In this section we'll derive the backpropagation rules for the network. Make sure you understand matrix index notation as we'll use it here to succinctly write the results.
    Recall our definition $$\delta_j \equiv \frac{\partial L}{\partial z_j} =  \sigma ' (z_j) \sum_{k \in path(j)} \delta_k w_{j \rightarrow k}$$ We'll slightly alter our notation to make it easier to track all of the indices floating around: $\delta^{(j)} \equiv \delta_j$. Starting at our output neurons and using the result from the previous blog post we have $$\delta^{(2)}_{ij} \equiv  \frac{\partial L}{\partial Z^{(2)}_{ij}} = \hat{y}_{ij} - y_{ij}$$ or in full matrix form
    $$\delta^{(2)} = \left[ \begin{array}{cc} \hat{y}_{11}-y_{11} & \hat{y}_{12}-y_{12} \\ \hat{y}_{21}-y_{21} & \hat{y}_{22}-y_{22} \end{array} \right]$$ Now $$\delta^{(1)}_{ij} \equiv  \frac{\partial L}{\partial Z^{(1)}_{ij}}$$ Using the chain rule we can decompose this into $$ \frac{\partial L}{\partial Z^{(1)}_{ij}} = \sum_{m,n,p,q} \frac{\partial L}{\partial{Z^{(2)}_{mn}}} \times \frac{\partial{Z^{(2)}_{mn}}}{\partial{a^{(1)}_{pq}}}\times \frac{\partial{a^{(1)}_{pq}}}{\partial {Z^{(1)}_{ij}}}$$ The first term is simply $\delta^{(2)}_{mn}$. For the second term, recall from above $$Z^{(2)}_{mn} = \sum_{k} a^{(1)}_{mk} W^{(2)}_{kn} + b^{(2)}_{mn} $$ $$\implies \frac{\partial Z^{(2)}_{mn}}{\partial a^{(1)}_{pq}} = W^{(2)}_{qn}$$ where the sum over $k$ has been collapsed since the derivative is only non-zero when $m=p$ and $k=q$. Now, the third term: $$\frac{\partial a^{(1)}_{pq}}{\partial Z^{(1)}_{ij}} = \frac{\partial}{\partial Z^{(1)}_{ij}} \left( \tanh(Z^{(1)}_{pq}) \right) = 1 - \tanh^{2}(Z^{(1)}_{ij})$$ where the only non-zero elements of the above expression are when $p=i$ and $q=j$. Putting this all together we have $$\frac{\partial L}{\partial Z^{(1)}_{ij}} = \left( 1 - \tanh^{2}(Z^{(1)}_{ij}) \right)  \sum_{n} \delta^{(2)}_{in} W^{(2)}_{jn} $$ Note the sum is now only over $n$, instead of $m, n, p, q$ - make sure you understand why this is. We can now write this in full matrix form (as we will use in the implementation) as
    $$\delta^{(1)} = \left( 1 - \tanh^{2}(Z^{(1)}) \right) \odot \delta^{(2)} {W^{(2)}}^T$$ where $\odot$ is the elementwise multiplication of the matrices (Hadamard product) and $T$ indicates the transpose of the matrix.
    Now we can use the above results for $\delta^{(1)}$ and $\delta^{(2)}$ to calculate how the loss function changes with respect to the weights - recall this is the value we use to alter our weights at each iteration in our gradient descent routine. Now $$\frac{\partial L}{\partial W^{(2)}_{ij}} = \sum_{p,q} \frac{\partial L}{\partial Z^{(2)}_{pq}} \times \frac{\partial Z^{(2)}_{pq}}{\partial W^{(2)}_{ij}}$$ The first term is simply $\delta^{(2)}_{pq}$ and the second term is $$\frac{\partial}{\partial W^{(2)}_{ij}} \left[ \sum_{k} a^{(1)}_{pk} W^{(2)}_{kq} + b^{(2)}_{pq} \right]$$ which forces $k=i$ and $q=j$, hence the sum collapses and the term evaluates to $a^{(1)}_{pi}$. Hence $$\frac{\partial L}{\partial W^{(2)}_{ij}} = \sum_{p} \delta^{(2)}_{pj} a^{(1)}_{pi}$$ or, in matrix form, $$ \frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$$ Similarly $$\frac{\partial L}{ \partial b^{(2)}_{ij}} = \sum_{p,q} \frac{\partial L}{\partial Z^{(2)}_{pq}} \times \frac{\partial Z^{(2)}_{pq}}{\partial b^{(2)}_{ij}} $$ where the second term forces $q=j$, resulting in $$\frac{\partial L}{ \partial b^{(2)}_{ij}} = \sum_p \delta^{(2)}_{pj}$$ i.e. a sum over the rows of $\delta^{(2)}$. Note the right hand side of the expression is independent of $i$; this implies that every row in $\frac{\partial L}{ \partial b^{(2)}}$ is the same. Similar analysis yields $$\frac{\partial L}{\partial W^{(1)}_{ij}} = \sum_{p} \delta^{(1)}_{pj} x_{pi} $$ and $$\frac{\partial L}{\partial b^{(1)}_{ij}} = \sum_p \delta^{(1)}_{pj}$$
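    A good way to convince yourself these formulas are right is a numerical gradient check - compare, say, $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$ against central finite differences. Below is a standalone sketch with randomly drawn values; note the $\frac{1}{N}$ factor from the loss is kept explicit here:

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(4, 2)                              # 4 training samples
    y = np.eye(2)[np.random.randint(0, 2, 4)]              # one-hot targets
    W1, b1 = np.random.randn(2, 3), np.random.randn(1, 3)
    W2, b2 = np.random.randn(3, 2), np.random.randn(1, 2)

    def loss(W2):
        a1 = np.tanh(x.dot(W1) + b1)
        yhat = np.exp(a1.dot(W2) + b2)
        yhat /= yhat.sum(axis=1, keepdims=True)
        return -np.sum(y * np.log(yhat)) / len(x)          # cross entropy loss

    # backpropagation formula: dL/dW2 = a1^T . delta2
    a1 = np.tanh(x.dot(W1) + b1)
    yhat = np.exp(a1.dot(W2) + b2)
    yhat /= yhat.sum(axis=1, keepdims=True)
    d2 = (yhat - y) / len(x)                               # delta2, carrying the 1/N from the loss
    dW2 = a1.T.dot(d2)

    # central finite differences, one entry of W2 at a time
    eps, dW2_num = 1e-6, np.zeros_like(W2)
    for i in range(W2.shape[0]):
        for j in range(W2.shape[1]):
            Wp, Wm = W2.copy(), W2.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            dW2_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

    print(np.allclose(dW2, dW2_num))                       # True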

    Recap

    Relax. Take a breath. The derivations are over (for now). I think going through all the derivations in detail at least once is paramount to understanding the inner workings of the neural network. Once you get your head around the notation, there actually isn't anything that fancy going on - just repeated applications of the chain rule. The key for me was attacking this problem via matrix index notation, which illustrated exactly how values are propagated (backwards and forwards) in the network via matrix multiplication.

    Forward Propagation Summary

    • $Z^{(1)} = xW^{(1)} + b^{(1)}$
    • $a^{(1)} = \tanh(Z^{(1)})$
    • $Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$
    • $\hat{y} \equiv a^{(2)} = softmax(Z^{(2)})$

    Backpropagation Summary


  • $\delta^{(2)} = \hat{y} - y$
  • $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$
  • $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$
  • $\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$
  • $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$
  • $\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$
  • Note that the $\frac{\partial L}{\partial W^{(i)}}$ and $\frac{\partial L}{\partial b^{(i)}}$ are not actually derivatives with respect to matrices; these are just labels for the matrices containing the derivatives with respect to the various weights and biases. The $\sum_{rows} \delta^{(i)}$ indicates that we sum over the rows of the matrix $\delta^{(i)}$, such that the matrices $\frac{\partial L}{\partial b^{(i)}}$ consist of a single unique row repeated $n$ times, where $n$ is the number of training examples.


    With these matrices we are now in a position to finally implement our ANN to classify our dataset! The next blog will detail the implementation in python and the associated results.