Friday 24 February 2017

Optimisation routines

Now that we are able to train the Neural Network, let's take a look at how we can optimise the training, since in reality we will be dealing with deep networks (many hidden layers) that could potentially take days to train. I'm going to take a step back and have a look at how we might generally attack a convex optimisation problem. Convex optimisation is a very well researched and understood area, so it makes sense to have a look at it first.

The problem

We are going to fit a simple logistic regression to the following data 
  • http://cs229.stanford.edu/ps/ps1/logistic_x.txt
  • http://cs229.stanford.edu/ps/ps1/logistic_y.txt
There are two predictors (columns) in the first file and one outcome ($y = \pm 1$) in the second. Note that I've borrowed these from the cs229 class that I'm currently working through. Our task is to minimise the average empirical loss: $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log \left( 1 + \exp\left(-y^{(i)} \theta^{T} x^{(i)}\right) \right) $$ where $y^{(i)}$ is the actual outcome for observation $x^{(i)}$. Note that $x^{(i)}$ is a vector containing the predictors for that observation.
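As a concrete reference, here is a minimal sketch of this loss in numpy (the local filenames and the prepended intercept column are my assumptions; the data are the two files linked above):
import numpy as np

X = np.loadtxt("logistic_x.txt")                # shape (m, 2): the two predictors
y = np.loadtxt("logistic_y.txt")                # shape (m,): labels in {-1, +1}
X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend an intercept column, so theta lives in R^3

def loss(theta, X, y):
    # average empirical loss L(theta) = (1/m) sum_i log(1 + exp(-y_i theta^T x_i))
    margins = y * X.dot(theta)
    return np.mean(np.log(1 + np.exp(-margins)))

theta = np.zeros(X.shape[1])
print(loss(theta, X, y))                        # log(2) ~ 0.693 at theta = 0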

Is this a convex optimisation problem?

It turns out that if we can show that the Hessian $H$ of this loss function satisfies $$z^{T} H z \ge 0$$ $\forall z \in \mathbb{R}^{3}$, then $L(\theta)$ is a convex function (more generally, if $H \in \mathbb{R}^{n \times n}$ then this result must hold $\forall z \in \mathbb{R}^{n}$). Using the definition of $H$, we have $$ H_{pq} = \frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_q}$$ The details of the calculation are quite straightforward, so I'll omit them, but the result is $$H_{pq} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 + \exp\left(-\theta^T x^{(i)}\right)} \times \left( 1 - \frac{1}{1 + \exp\left(-\theta^T x^{(i)}\right)} \right) \times x^{(i)}_p x^{(i)}_q $$ where the sum is over all of our training examples. We can write the last two factors as a matrix product (and subsequently drop the indices) as $x^{(i)} x^{(i)^{T}}$. Since the first two factors each lie in $(0, 1)$, their product is positive, so we can ignore it when assessing whether $z^T H z \ge 0 \hspace{.1in} \forall z$. Thus $$z^T H z \propto \sum_{i=1}^{m} z^T x^{(i)} x^{(i)^{T}} z = \sum_{i=1}^{m} (z^T x^{(i)})^2 \ge 0 \hspace{0.1in} \forall z$$ Hence $H$ is positive semidefinite, which implies that $L(\theta)$ is convex. This means any local minimum is a global minimum, which makes our lives a lot easier. Since calculating the Hessian is rather easy in this setting, we can use Newton's method.
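We can also sanity check this numerically. Here is a sketch of the Hessian above along with an eigenvalue check that it is positive semidefinite (the result holds at any $\theta$; I've used $\theta = 0$, and the code continues on from the loss sketch earlier):
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(theta, X):
    h = sigmoid(X.dot(theta))                         # 1 / (1 + exp(-theta^T x)) for every example
    w = h * (1 - h)                                   # the scalar factor in (0, 1) for each example
    return (X * w[:, None]).T.dot(X) / X.shape[0]     # (1/m) sum_i w_i x_i x_i^T

H = hessian(np.zeros(X.shape[1]), X)
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))        # all eigenvalues >= 0  <=>  H is positive semidefinite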

Newton's method

Newton's method utilises the Hessian, and hence the curvature of the loss surface, to find a more direct path to the minimum. With this additional information, we can expect the algorithm to converge faster than gradient descent, which only uses first derivative information. The update rule is as follows $$ \theta \rightarrow \theta - H^{-1} \nabla_{\theta} L(\theta) $$
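A minimal sketch of this update on our logistic regression problem might look like the following, reusing the loss and hessian functions above (I've solved the linear system rather than explicitly inverting $H$):
def gradient(theta, X, y):
    # dL/dtheta = -(1/m) sum_i (1 - sigmoid(y_i theta^T x_i)) y_i x_i
    margins = y * X.dot(theta)
    return -(X * (y * (1 - sigmoid(margins)))[:, None]).mean(axis=0)

theta = np.zeros(X.shape[1])
for _ in range(20):                                                  # 20 Newton iterations, as in the comparison below
    theta -= np.linalg.solve(hessian(theta, X), gradient(theta, X, y))
print(theta, loss(theta, X, y))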
Let's see the performance of Newton's method vs Gradient descent:
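For reference, the plain gradient descent loop on the same problem simply drops the Hessian and follows the negative gradient, scaled by a learning rate (the value of $\eta$ here is an assumption, not necessarily the one behind the comparison):
eta = 0.1                                    # learning rate (assumed value, not tuned)
theta = np.zeros(X.shape[1])
for _ in range(20):                          # 20 iterations, matching the comparison
    theta -= eta * gradient(theta, X, y)     # no curvature information, just the gradient
print(loss(theta, X, y))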
I've performed 20 iterations for both Newton's method and gradient descent - clearly Newton's method converges a lot faster than gradient descent. Looking at the update step, it is obvious that this method won't scale well with more parameters, since each step requires calculating the gradient, the Hessian and the inverse of the Hessian. It's also obvious that this method will fall apart if $H$ is singular. So in our toy example above, with a convex loss function and only a handful of parameters to estimate, Newton's method was king, but we clearly couldn't apply such an algorithm to our neural network. The beauty of backpropagation was that after a forward and backward pass through the network, we had all of the first order derivatives we required in the update step. This fortune does not, however, extend to the second derivatives we would need to calculate $H$. It seems we're stuck with regular old batch gradient descent to train our neural network...or are we?

Stochastic Gradient Descent (SGD)

Recall our update step for batch gradient descent: $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ where $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log \left( 1 + \exp\left(-y^{(i)} \theta^{T} x^{(i)}\right) \right) $$ Note that $L(\theta)$ is a function of all $m$ training examples. That is, for a single update to parameter $\theta_j$ we need to use all of our training data. What if we could calculate our loss on a subset of the training data? The hope is that this subset is "representative enough" of the entire dataset that the resulting update to $\theta_j$ is generally in the same direction as the update calculated on the entire dataset. This is the essence of Stochastic Gradient Descent.

Consider also that when the training dataset has several hundred million rows, we may not even be able to fit all the data in memory to perform batch gradient descent!

I'll define the following terms commonly found when talking about SGD

  • Epoch - once every datapoint in the training set has been used, one epoch has occurred.
  • Mini batch - the subset of training data that is used in the parameter update
  • Batch size - the size of the mini batch
So per epoch, we will be able to update the parameters $\frac{N}{\text{batch size}}$ times, where $N$ is the number of training examples. Compare this to batch gradient descent, where by definition the parameters are updated once per epoch.
The pseudocode is as follows
for epoch in range(num_epochs):
    indices = np.random.permutation(N)                       # shuffle the N training examples each epoch
    for batch in np.array_split(indices, N // batch_size):   # N / batch_size mini batches per epoch
        grad = loss_gradient(theta, X[batch], y[batch])      # derivative of the loss on this mini batch only
        theta -= eta * grad                                   # update the parameters using just this mini batch

Since we are taking a subset of our training data to update our parameters, the results may be volatile, as each subset may contain slightly different information about the loss surface. This is troublesome as we approach the optimum: with SGD the optimisation path will tend to oscillate around the minimum we seek. In order to remedy this, we use a learning rate schedule, which is simply a scaling (generally based on heuristics) of our learning rate at each iteration. The hope is that by the time the algorithm is near the minimum, the learning rate has been scaled down enough that the successive parameter updates are relatively stable from that point forward. This process is also called annealing the learning rate. The trick is getting the balance right: the learning rate should be small by the time the algorithm is near the minimum, but if you are too aggressive with the schedule, the learning rate becomes too small too soon and you won't get near the minimum, because the updates to the parameters will tend to zero.
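One simple heuristic is a step decay schedule, sketched below, which divides the learning rate by a fixed factor every so many epochs (the factor of 10 and the drop point of 150 are assumptions, chosen to mirror the experiment further down):
def step_decay(eta0, epoch, drop_factor=10.0, drop_every=150):
    # eta0 for epochs 0-149, eta0/10 for epochs 150-299, and so on
    return eta0 / (drop_factor ** (epoch // drop_every))

eta0 = 0.1
for epoch in range(200):
    eta = step_decay(eta0, epoch)   # annealed learning rate for this epoch
    # ... run the mini batch updates for this epoch using eta ...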

Let's have a look at the performance of SGD vs batch gradient descent on our neural network. I've implemented SGD on the aforementioned neural network and have run the make moons dataset with 100,000 data points and 10 neurons in the hidden layer. See below for a plot of the loss vs. epoch for both SGD and batch gradient descent.

Batch gradient descent:
  • Epochs: 200 (i.e. 200 updates to the parameters, each based on the full training set)
  • Execution time: 17s
SGD:
  • Epochs: 1
    • Mini batch size: 500 (i.e. 200 updates to the parameters, each based on a mini batch of size 500)
  • Execution time: 6s
We can see that SGD actually outperforms batch gradient descent here and takes about a third of the time to run! I haven't applied any learning rate schedule here, and you can see an egregious spike in the loss from SGD at around the $160^{th}$ epoch. See what happens below when I simply set $\eta \rightarrow \frac{\eta}{10}$ at the $150^{th}$ epoch:


Notice that the aforementioned spike is now gone and the SGD results look very promising given the accuracy and performance (compared to batch gradient descent). In practice, batch gradient descent is rarely used, both because of the memory constraints and because the randomness of SGD can help the algorithm escape local minima that batch gradient descent would naturally get stuck in.

Momentum

We can introduce the concept of momentum to our parameter updates. It is applied as follows:
Recall the usual update to our parameters is as follows $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ which we can write in two steps, define
$$\Delta \theta_j = \frac{\partial}{\partial \theta_j}L(\theta)$$ such that the update becomes $$\theta_j \rightarrow \theta_j - \eta \Delta \theta_j$$ We'll now modify our definition of $\Delta \theta_j$ to $$\Delta \theta_j = \gamma \Delta \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ and our update is $$\theta_j \rightarrow \theta_j + \Delta \theta_j$$ What we've done is influence the current update of $\theta_j$ by the amount it was updated by in the previous step. Yes, we've introduced yet another hyperparameter ($\gamma$), but the idea here is to give the update some "momentum", such that each update is influenced by the previous updates and the optimisation keeps moving towards the minimum. Consider the case when the loss surface is a long narrow ravine with steep walls - SGD will typically oscillate across the ravine, as the update will point down the steep walls, whereas the momentum term will help move the algorithm along the ravine towards the minimum we seek. See the effect of momentum on training our neural network below.
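A minimal sketch of this update, applied here to the full batch logistic regression gradient from earlier for concreteness (the values of $\gamma$ and $\eta$ are assumptions; for SGD you would swap in the mini batch gradient):
gamma, eta = 0.9, 0.1                        # momentum and learning rate (assumed values, not tuned)
theta = np.zeros(X.shape[1])
delta = np.zeros_like(theta)                 # holds the previous update, Delta theta

for epoch in range(100):
    grad = gradient(theta, X, y)             # dL/dtheta (use the mini batch gradient for SGD)
    delta = gamma * delta - eta * grad       # Delta theta <- gamma * Delta theta - eta * dL/dtheta
    theta = theta + delta                    # theta <- theta + Delta theta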

We see that momentum allows the algorithm to descend very quickly, hitting the minimum loss around 40 epochs in - compared to ~140 epochs in our previous run.

Hopefully you've now got a good grasp of batch gradient descent vs SGD and some of the subtleties and hyperparameters that need to be considered when training a model with SGD. For a more in-depth exploration of various optimisation techniques, including some more advanced methods, see here. In production, we'll rely on the built-in SGD routines, which have been highly tuned for performance.
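As one illustration of such a built-in routine (my choice of library, not something prescribed above), scikit-learn's SGDClassifier fits the same kind of logistic regression with SGD and handles the shuffling and learning rate annealing internally:
from sklearn.linear_model import SGDClassifier

# loss="log" gives logistic regression; the learning rate schedule is handled internally
clf = SGDClassifier(loss="log")
clf.fit(X, y)              # X, y as loaded in the first sketch above
print(clf.score(X, y))     # mean accuracy on the training data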