## Derivative of the softmax function

I've dedicated a separate post to the derivation of the derivative of the cross entropy loss function with softmax as the activation function in the output layer, which I'll reference in the future. It's nothing groundbreaking, but sometimes it's nice to work through results which are often quoted without derivation.

### The softmax function

The softmax function is often used as the activation function for neurons in the output layer of an ANN. We borrow the notation from a previous blog post, where $\hat{y}$ is the output of a particular neuron in the output layer and $$z_o = \sum_{j \in input(o)} a_j w_{j \rightarrow o}$$ i.e. we sum over all neurons $j$ whose output is an input into our output neuron $o$.

It is defined as follows:
$$\hat{y}_o = \frac{e^{z_o}}{\sum_k e^{z_k}}$$ where the sum is over the $k$ output neurons in the network. The softmax essentially takes an input $z_o \in \mathbb{R}$ and maps it to the interval $(0,1)$ such that the sum of all outputs is 1 - it gives us a probability distribution over our $k$ outputs (i.e. $k$ different classifications).
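A minimal numpy sketch of the softmax might look like the following (the max-shift is a standard numerical-stability trick not mentioned above; it cancels in the ratio, so the result is unchanged):

```python
import numpy as np

def softmax(z):
    """Map pre-activations z to a probability distribution over classes."""
    # Shift by the max before exponentiating; this cancels in the ratio
    # but avoids overflow for large entries of z.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))        # entries all lie in (0, 1)
print(softmax(z).sum())  # sums to 1
```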

### Cross entropy loss function

Recalling the definition of the cross entropy loss function from a previous blog post we have
$$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$
where $N$ is the number of training examples and $C$ is the number of classes (e.g. 2 for binary classification). Let us assume $N = 1$ without loss of generality, thus
$$L = - \sum_{o} y_{o} \ln{\hat{y}_{o}}$$ where the sum is over the output neurons in our network (given that we have one output neuron for each class, the sums over $i \in C$ and over $o$ are equivalent). Note that $y_o$ is the actual classification of our data, i.e. $y_o \in \{0,1\}$, and that if $y_o = 1$ then $y_j = 0$ for all $j \neq o$ - this is one-hot encoding. Now our object of interest is $$\frac{ \partial L}{\partial z_i} = \sum_o \frac{\partial L}{\partial \hat{y}_o} \times \frac{\partial \hat{y}_o}{\partial z_i}$$ where we sum over the output neurons because every $\hat{y}_o$ depends on $z_i$ through the softmax normalisation. Let's focus on the second term $\frac{\partial \hat{y}_o}{\partial z_i}$ for now and split it into two cases.

**Case 1:** $i = o$
$$\frac{\partial \hat{y}_o}{\partial z_i} = \frac{\partial \hat{y}_o}{\partial z_o} = \frac{\partial}{\partial z_o} \left( \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = \frac{e^{z_o}}{\sum_k e^{z_k}} - e^{z_o} \left(\sum_k e^{z_k}\right)^{-2}e^{z_o}$$
$$= \frac{e^{z_o}}{\sum_k e^{z_k}} \left ( 1 - \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = \hat{y}_o (1 - \hat{y}_o)$$

**Case 2:** $i \neq o$
$$\frac{\partial \hat{y}_o}{\partial z_i} = \frac{\partial}{\partial z_i} \left( \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = -e^{z_o}\left(\sum_k e^{z_k}\right)^{-2} e^{z_i} = -\hat{y}_o \hat{y}_i$$

Now for the first term, $$\frac{\partial L}{\partial \hat{y}_o} = - \frac{y_o}{\hat{y}_o}$$ Summing over all output neurons via the chain rule, $$\frac{\partial L}{\partial z_i} = -\sum_o \frac{y_o}{\hat{y}_o} \times \frac{\partial \hat{y}_o}{\partial z_i}$$ Substituting in our above computations and splitting out the $i=o$ term from the sum we get

$$\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_i} y_i \times \hat{y}_i(1-\hat{y}_i) - \sum_{o \neq i} \frac{1}{\hat{y}_o} y_o \times (- \hat{y}_o \hat{y}_i)$$
$$= -y_i(1-\hat{y}_i) + \sum_{o \neq i} y_o \hat{y}_i$$ $$= y_i \hat{y}_i -y_i + \sum_{o \neq i} y_o \hat{y}_i$$ Moving the first term inside the sum and summing over all $o$ we get
$$\frac{\partial L}{\partial z_i} = -y_i + \sum_{o} y_o \hat{y}_i$$ $$= \hat{y}_i \sum_o y_o - y_i$$ But $\sum_o y_o$ is the sum over all the actual labels of our data which is one hot encoded (as mentioned above). Thus $y_o = 1$ for exactly one class and $0$ for all else - hence $\sum_o y_o = 1$. We finally arrive at a result we'll use in the future in constructing and implementing our ANN to classify the make moons dataset
$$\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i$$
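This result is easy to sanity-check numerically. A quick sketch (the vectors `z` and `y` below are hypothetical examples, not data from this series) comparing the analytic gradient $\hat{y} - y$ against central finite differences of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # L = -sum_o y_o ln(y_hat_o), with y one-hot
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])   # hypothetical pre-activations
y = np.array([0.0, 1.0, 0.0])    # hypothetical one-hot label

analytic = softmax(z) - y        # the result derived above: y_hat - y

# Independent check via central finite differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, y) - cross_entropy(z - dz, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # close to zero
```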

## Feedforward Artificial Neural Network pt2

In the previous post, we briefly looked at the general structure of a Feedforward ANN and why such a construction should be useful. Now that we have a general handle on the idea behind a Feedforward ANN, we need to train the network - that is, find a method to calculate the appropriate weights for our hidden layer neurons. In the context of our classification case, we want to train the network by providing it a dataset with labels so that it can predict the labels of new data once trained - a classic supervised learning scenario.

### Backpropagation

Backpropagation is the algorithm we'll use to train (i.e. minimize the error function of) our ANN. It is essentially repeated application of the chain rule - chaining together all possible paths in the network to establish how the error changes as a function of the inputs and outputs of each neuron. This may not seem like a very profound approach, nor may it be clear why this use of the chain rule should deserve its own name; the power of this method comes from the implementation. The setup of backpropagation allows the derivatives used in the chain rule to be calculated from intermediate results in our network. This becomes especially important as the size of our network grows - it is efficient in that it saves the recalculation of derivatives.

### Error propagation

We will now generalise the above computation in order to get the derivative of the loss function with respect to any $z_j$ in any layer of the network. We define the error signal $$\delta_j = \frac{\partial L}{\partial z_j}$$ Now, using the chain rule as in the example above, we have $$\delta_j = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z_j}$$ Once we specify our loss function $L$ we'll have a nice expression for the first term; for now, let's keep it generic. Using the definitions of the $z_i$ and $a_i$ above (recalling $a_j = \sigma(z_j)$): $$\frac{\partial \hat{y}}{\partial z_j} = \frac{\partial \hat{y}}{\partial a_j} \times \frac{\partial a_j}{\partial z_j} = \sigma ' (z_j) \sum_{k \in path(j)} \frac{\partial \hat{y}}{\partial z_k} \frac{\partial z_k}{\partial a_j}$$ where the sum is over the $k$ paths which are connected to the output of neuron $j$. Since $z_k$ depends on $a_j$ only through the term $a_j w_{j \rightarrow k}$, we have $\frac{\partial z_k}{\partial a_j} = w_{j \rightarrow k}$ and our expression becomes $$\frac{ \partial \hat{y}}{\partial z_j} = \sigma '(z_j) \sum_{k \in path(j)} \frac{\partial \hat{y}}{\partial z_k} w_{j \rightarrow k}$$ $$\implies \delta_j = \sigma '(z_j) \sum_{k \in path(j)} \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_k} w_{j \rightarrow k}$$ Note that the first two terms in the summation are actually $\delta_k$ in disguise! Thus we have a recursive definition for our error propagation: $$\delta_j = \sigma ' (z_j) \sum_{k \in path(j)} \delta_k w_{j \rightarrow k}$$
Thus if we have $\delta_\text{output}$, we can work our way backwards through the neural network, calculating the error for each neuron as a function of the errors of the neurons that its output is connected to. This is the key concept of backpropagation - although it is just an application of the chain rule, because we can propagate the error at each point in the network through subsequent neurons, the calculations can be done on the fly by utilising calculations that have already been performed.
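As a concrete sketch of this backward pass, the recursion $\delta_j = \sigma'(z_j) \sum_k \delta_k w_{j \rightarrow k}$ can be written with matrices for a tiny fully connected network. The layer sizes, sigmoid hidden activation, and random weights below are illustrative assumptions, not the architecture of the later posts:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Hypothetical shapes: 2 inputs -> 3 hidden neurons -> 2 output classes
W1 = rng.normal(size=(2, 3))   # weights w_{j->k}, input to hidden
W2 = rng.normal(size=(3, 2))   # weights, hidden to output

x = np.array([0.4, -0.7])      # hypothetical input
y = np.array([1.0, 0.0])       # hypothetical one-hot label

# Forward pass, storing the intermediate z's for the backward pass
z1 = x @ W1
a1 = sigmoid(z1)
z2 = a1 @ W2
y_hat = np.exp(z2 - z2.max())
y_hat /= y_hat.sum()           # softmax output

# Backward pass: the output delta is y_hat - y (derived in the softmax post),
# and each hidden delta is sigma'(z_j) * sum_k delta_k * w_{j->k}.
delta2 = y_hat - y
delta1 = sigmoid_prime(z1) * (W2 @ delta2)

# Weight gradients then follow as outer products of inputs and deltas
grad_W2 = np.outer(a1, delta2)
grad_W1 = np.outer(x, delta1)
```

Note that `delta1` reuses `delta2` rather than re-deriving anything from the loss - exactly the "calculations on the fly" point above.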

Next time we'll describe the architecture of the actual ANN we'll use to classify the make moons example detailed in pt1. To be honest, these concepts didn't fully sink in until the first time I implemented an ANN and worked through the details of the calculations myself, so I suggest you work through this and the next blog post by hand in order to ensure you have a handle on how the feedforward and error propagations are affected by different topologies of ANNs.