Sunday 8 March 2020

Variational Autoencoders


What is an autoencoder?

An autoencoder is a neural network used to learn efficient (read: lower dimensional) encodings of input data in an unsupervised manner. This just means that we design the architecture of a neural network which takes an input of $k$ dimensions and train it to output that very input (i.e. the output layer has the same dimension $k$ as the input). That is, the loss that we try to minimise isn't based on the difference between the prediction and some label or value (as in the supervised case) but on the difference between the prediction and the input (also known as the reconstruction error); we are training the network to recreate the $k$ dimensional input vector as closely as possible. But why? I hear you ask.

Dense Representation

Study the architecture of the network below; you can see it is naturally broken up into two parts: an encoder and a decoder.

The encoder takes the aforementioned $k$ dimensional input vector and successively feeds it through hidden layers with dimension $< k$ (there are two in the image below). This results in the middle layer of the network, which is a key part in an autoencoder and its associated uses - more on that later.

The decoder takes the middle layer and feeds it through additional hidden layers - upscaling the dimensionality - to produce the output vector of dimension $k$.


The idea is that in a sufficiently trained network which has minimal error between the input and output vectors, the middle layer will have captured the essence (i.e. the signal) of the input in a lower dimensional, or dense, representation vector, stripping the input of irrelevant noise. The dense representation is also called a latent representation, which lives in a latent vector space.
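To make the encoder/decoder split concrete, here is a minimal sketch of such a network in Keras. The layer sizes (784 → 128 → 32 → 128 → 784), activations and optimiser are arbitrary choices of mine for something like flattened MNIST images, not taken from the figure.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

k = 784          # input dimension, e.g. a flattened 28x28 image
latent_dim = 32  # dimension of the dense / latent representation

# Encoder: successively reduce the dimensionality down to the latent layer
encoder = keras.Sequential([
    layers.Input(shape=(k,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),
])

# Decoder: upscale the latent representation back to the input dimension
decoder = keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(k, activation="sigmoid"),
])

# The autoencoder is trained to reproduce its own input, so the "label"
# passed to fit() is the input itself (the reconstruction error is the loss).
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

x_train = np.random.rand(1000, k).astype("float32")  # stand-in for real data
autoencoder.fit(x_train, x_train, epochs=5, batch_size=64, verbose=0)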

So what?
That's all nice and dandy but what can we actually use autoencoders for?
  • Outlier detection
    • By definition, outliers have characteristics or features distinct from the rest of the dataset - they are in the vast minority. We can therefore feed a given dataset through an autoencoder, training it to minimise the reconstruction error between input and output. When we then feed through a previously unseen observation, the autoencoder will hopefully identify it as either
      • similar to "typical" observations it was trained on, characterised by a small reconstruction error, or
      • an outlier, characterised by a relatively large reconstruction error (a minimal sketch of this idea follows this list).
  • De-noising inputs / dimensionality reduction of inputs
    • Autoencoders can be used to reduce noise from inputs to create a "de-noised" and lower dimensional version ready for use in machine learning pipelines. This can be seen as a form of regularisation.
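As mentioned in the first bullet above, here is a minimal sketch of the reconstruction-error approach to outlier detection, continuing from the autoencoder sketch earlier (it reuses the autoencoder and x_train defined there); the 99th percentile threshold is just one common heuristic, not a prescribed rule.

import numpy as np

# Reconstruction error per observation: mean squared error between input and output
recon = autoencoder.predict(x_train, verbose=0)
train_errors = np.mean((x_train - recon) ** 2, axis=1)

# Flag anything whose error exceeds, say, the 99th percentile of the training errors
threshold = np.percentile(train_errors, 99)

def is_outlier(x_new):
    """Return True if the reconstruction error of x_new is unusually large."""
    x_new = x_new.reshape(1, -1)
    err = np.mean((x_new - autoencoder.predict(x_new, verbose=0)) ** 2)
    return err > threshold

print(is_outlier(np.random.rand(784).astype("float32")))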
Latent vector space
One natural curiosity that may arise is how we can characterise or better understand the latent vector space. I've borrowed the results below from [2], which show a 2 dimensional dense representation learned by an autoencoder trained on MNIST.

What we can see are distinct and disjoint regions for each of the digits 0-9 - this makes it easy for the decoder to perform its job. However, it causes problems if we ever want to sample from this space to generate realistic examples. Because each digit's latent vectors are clustered locally, the decoder would likely decode a vector taken from, say, (-15, 5) into junk - it would not resemble any of the digits 0-9 - as the decoder hasn't seen any training examples from that local region. Hence we are limited to sampling local regions of the latent space that contained training data, and the generated samples will likely just replicate the training samples - this is not very intelligent generation! Even if we were to sample within a cluster but in a region that wasn't covered by the training data - for example (-22, -10) - there is no guarantee that the decoded image would make sense. If we started at the 1 indicated in the image and travelled along the vector toward 7, intuitively we would expect the output of the decoder to continuously deform from a 1 to a 7; however, this is not the case with the autoencoder - no such guarantee exists - and it is more than likely that unintelligible outputs would be produced.

The disjoint and discontinuous nature of the representations created by an autoencoder makes it a poor candidate for generating realistic samples. If we want to do so, we must use a Variational Autoencoder.

Variational Autoencoder (VAE)

The distinguishing feature of a VAE compared to an autoencoder is that - by design - the latent space is continuous, in the sense that two representations which are close in the latent space will result in similar looking outputs when decoded. "Close" can be with respect to any metric, but let's consider the Euclidean metric for intuition. This is achieved by borrowing some machinery from Bayesian analysis (hence the variational part of the name) and cleverly manipulating the loss function.

The loss function

In the regular autoencoder, for vectors $(x_i,y_i)$ we define the loss function as the standard sum of squares

$$ L(y, \hat{y}) = \sum_{i=1}^{n} (y_i - \hat{y}_{i})^2 $$

where $y_i = x_i$ for an autoencoder and $\hat{y}_{i}$ is the attempted reconstruction of the input $x_i$ by the autoencoder. This loss function has a single job - to minimise the reconstruction error; it has no constraints on how the latent vectors are represented or distributed in the latent space.

The trick is to introduce a penalty into this loss function such that we coerce the latent vectors to be drawn from a continuous probability distribution, thus ensuring the latent space has the aforementioned desirable properties for generating samples. This is achieved using the Kullback-Leibler (KL) divergence.

The KL divergence is a very important piece of machinery, often used in Bayesian analysis. It gives a (non-symmetric) measure of how similar or dissimilar two probability distributions are. We can use it to encourage our latent vectors to be drawn from a distribution of our choice - almost always a Gaussian distribution, due to the tractability of the maths and the existence of closed form solutions [3].
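To make the loss-function trick concrete, here is a sketch of the usual VAE objective for a Gaussian latent: the reconstruction error plus the closed-form KL divergence between the encoder's Gaussian $N(\mu, \sigma^2)$ and a standard normal prior. This is a generic formulation of mine (with an optional weighting term beta), not code from any particular library.

import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus KL(N(mu, sigma^2) || N(0, 1)), summed over a batch.

    x, x_hat    : arrays of shape (batch, k), inputs and their reconstructions
    mu, log_var : arrays of shape (batch, latent_dim), outputs of the encoder
    """
    # Reconstruction term: the same sum of squares as the plain autoencoder
    recon = np.sum((x - x_hat) ** 2)

    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1):
    #   KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

    return recon + beta * kl

# e.g. a batch of 4 observations with a 2-dimensional latent space
x, x_hat = np.random.rand(4, 10), np.random.rand(4, 10)
mu, log_var = np.zeros((4, 2)), np.zeros((4, 2))
print(vae_loss(x, x_hat, mu, log_var))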

ADDITIONAL MATERIAL TO FOLLOW.



Friday 2 February 2018

Recommender Systems: Collaborative Filtering and matrix factorization: Part 2

The previous post looked at the mechanics of SVD and how we can interpret and use it in the collaborative filtering approach to building a recommender system. I've borrowed heavily from [1] for the blog below - I've followed it step by step. As I've mentioned before, this blog is geared toward my continued education and understanding of machine learning topics, so I'll likely be implementing ideas that already exist to get a better understanding of the underlying mechanics.

User and Items rating matrix

Suppose we have a matrix $r \in \mathbb{R}^{m \times n}$ in which each row corresponds to a single user (i.e. person) and each column corresponds to a single item (i.e. movie), such that $r_{ui}$ is the rating user $u$ gave item $i$. This matrix will generally be constructed from all the user and rating information available, and it is quite obvious that not every user will have rated every item. In fact, when we use the MovieLens dataset we'll see that only about 6% of the entries of $r$ are valid - the rest are missing and are the ratings we would like to predict with our recommender system once it is built.

The sparsity of the matrix $r$ poses an issue for the SVD approach covered previously, as SVD aims to reconstruct a matrix as is; it treats every entry of $r$ as valid in the reconstruction process. So we need to make a slight tweak to our approach to building the recommender system. We still want a latent variable representation, but we only want to use the roughly 6% of ratings that are populated.

Latent space representation

In order to define the latent space representations, we make the following assumptions:
  1. Each user can be described by a $k$ dimensional latent vector $\textbf{x}_u =  \left[ x_{u1} \dots x_{uk} \right] \in \mathbb{R}^k$
  2. Each item can be described by a $k$ dimensional latent vector $\textbf{y}_i =  \left[ y_{i1} \dots y_{ik} \right]\in \mathbb{R}^k$
Hence we can approximate the rating user $u$ gave item $i$ as $$\hat{r}_{ui} = \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}$$
Hence we can define the loss function (with some regularisation) as trying to minimise the difference between the predicted and actual ratings every user gave each item:
$$ L = \sum_{u,i \in S}\left( r_{ui} - \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}\right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2$$ where $S$ is the set of valid ratings and $\lambda_x$, $\lambda_y$ are regularisation hyperparameters. We can rewrite this a little more succinctly using matrix notation: $$ L = \left( \textbf{r}_u - \textbf{x}^T_uY^T \right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2 $$ where $$\textbf{r}_u = \left[ r_{u1} \dots r_{un} \right]$$ is the vector of user $u$'s ratings of the $n$ items and $$Y = \left[ \begin{array}{ccc} -& y^{T}_{1} & - \\ - & y^{T}_{2} & - \\ & \vdots    &          \\ - & y^{T}_{n} & - \end{array} \right] $$ is the matrix whose rows are the item vectors $y_i^T$. Now this equation looks quite familiar (recall the definition of the OLS loss function...), hence if we fix one of $\textbf{x}_u$ or $\textbf{y}_{i}$ then we have effectively reduced the cost function to that of an OLS problem. The procedure is simple: we fix the $\textbf{y}_i$ and find the $\textbf{x}_u$ which minimise the loss function; this then feeds into the next iteration, where we fix the $\textbf{x}_u$ and find the $\textbf{y}_i$ which minimise the updated loss function, and we continue until convergence (or some other appropriate stopping criterion). This process is known (for obvious reasons) as Alternating Least Squares, and we'll derive the closed form solution in the next section.

Alternating Least Squares


We can then minimise $L$ by differentiating with respect to the components of our user vector $\textbf{x}_u$ and setting the result to $0$. Using summation notation:
$$ \frac{\partial L}{\partial x_{ua}} = \sum_{i=1}^{n} -2 \left(r_{ui} - \sum_{j=1}^k x_{uj} Y_{ij} \right)Y_{ia} + 2 \lambda_x x_{ua}$$ Setting this to zero for a minimum and writing it in matrix form gives $$ 0 = -\left( \textbf{r}_u - \textbf{x}^T_uY^T \right) Y + \lambda_x \textbf{x}_{u}^T$$ which we can re-arrange to get $$ \textbf{x}_{u}^T = \textbf{r}_u Y \left( \lambda_x I + Y^TY\right)^{-1}$$ An analogous calculation for $\textbf{y}_i$ yields $$ \textbf{y}_i^T = \textbf{r}_i X \left( \lambda_y I + X^TX \right)^{-1} $$ where $I$ is the $k \times k$ identity matrix and $\textbf{r}_i$ is the vector of ratings given to item $i$. Hence we have derived the parameter updates required at each iteration.
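Here is a minimal numpy sketch of these alternating updates on a small, fully observed ratings matrix (for brevity it ignores the masking of missing entries that a real implementation over the sparse MovieLens data needs); the variable names and sizes are mine, not from the linked notebook.

import numpy as np

m, n, k = 20, 30, 5          # users, items, latent dimensions
lam = 0.01                   # regularisation (lambda_x = lambda_y here)
R = np.random.rand(m, n)     # stand-in ratings matrix, fully observed for simplicity

X = np.random.rand(m, k)     # user latent vectors (one per row)
Y = np.random.rand(n, k)     # item latent vectors (one per row)
I = np.eye(k)

for _ in range(20):
    # Fix Y, solve for every user vector:  x_u = (lam I + Y^T Y)^{-1} Y^T r_u
    X = np.linalg.solve(lam * I + Y.T @ Y, Y.T @ R.T).T
    # Fix X, solve for every item vector:  y_i = (lam I + X^T X)^{-1} X^T r_i
    Y = np.linalg.solve(lam * I + X.T @ X, X.T @ R).T

print("MSE:", np.mean((R - X @ Y.T) ** 2))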

Next, let's have a look at the python implementation and results to see how well our recommendation system works!

Results

See my github for the python implementation in a jupyter notebook. I have run the model with the following parameters:
  • $k = 10$ latent features
  • $\lambda_x = 0.01$ user regularisation
  • $\lambda_y = 0.01$ item regularisation
  • $100$ iterations
You can see the training and test error plot for each iteration below:
We can see that learning occurs very quickly: after approximately the fifth iteration there is not much reduction in the training or test MSE. The training MSE sits around $5.5$ and the test MSE hovers just above $8$.

Additional flexibility and SGD

So far we've only considered a linear model which does not allow for the idiosyncrasies of individual user rating behaviour, i.e. a given user might tend to rate movies more highly in general compared to another user. Concretely, consider a movie I don't particularly like - I might give it a 3/5 because I'm a nice guy, whereas someone else might be more punitive and assign the movie a 1/5. Even though we have the same sentiment toward the movie, our individual rating philosophies differ. Similarly, we can address the fact that certain movies tend to be rated lower than others even when the average sentiment towards each movie is similar.


We can address this by normalising each user/item rating, refining our rating estimate:
$$\hat{r}_{ui} = \mu + b_i + b_u + \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}$$ where $b_i$ is the item bias, $b_u$ is the user bias and $\mu$ is the global bias. The global bias term just adds additional flexibility (another parameter = another degree of freedom) to our model. Hence our loss function is now defined as:

$$ L = \sum_{u,i \in S}\left( r_{ui} - \mu - b_i - b_u - \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}\right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2 + \lambda_{bi} \sum_{i} ||b_i||^2  + \lambda_{bu} \sum_{u} ||b_u||^2  $$

where we have just substituted our new definition for $\hat{r}_{ui}$ and added regularisation terms for our $b_i$ and $b_u$ biases.

The form of this loss function is no longer comparable to OLS, due to our new bias terms. In order to minimise, we turn to our old friend SGD. There's enough maths on this page so I won't go through the gory details of the updates for each parameter - if you're interested you can see the details here [1].
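For a rough idea of what those updates look like, here is a sketch of a single SGD step for one observed rating under the biased model; the exact form and notation are my own rendering of differentiating the loss above, so treat [1] as the reference derivation.

import numpy as np

def sgd_step(r_ui, x_u, y_i, b_u, b_i, mu, eta=0.005, lam=0.01):
    """One stochastic update driven by a single observed rating r_ui."""
    e = r_ui - (mu + b_u + b_i + x_u @ y_i)   # prediction error for this rating

    # Bias updates: gradient of the squared error plus the L2 penalty
    b_u_new = b_u + eta * (e - lam * b_u)
    b_i_new = b_i + eta * (e - lam * b_i)

    # Latent factor updates - note each uses the *old* value of the other
    x_u_new = x_u + eta * (e * y_i - lam * x_u)
    y_i_new = y_i + eta * (e * x_u - lam * y_i)
    return x_u_new, y_i_new, b_u_new, b_i_new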

Optimising this new loss function using SGD with the same parameters above:
  • $k = 10$ latent features
  • $\lambda_x = 0.01$ user regularisation
  • $\lambda_y = 0.01$ item regularisation
  • $\lambda_{b_u} = 0.01$ user bias regularisation
  • $\lambda_{b_i} = 0.01$ item bias regularisation
  • $100$ iterations
  • $\eta = 0.005$ learning rate

We can see that the training error shrinks to zero whilst the test error starts increasing with the number of iterations - these are the classic signs of overfitting! Looking at the plot, cutting off training at around iteration $25$ seems like a good idea. Comparing this to our ALS results, the test error is now around $1$, which is markedly better than the $8$ we observed earlier - this model fits the data a lot better, as expected with the additional flexibility.
Ok, we have fit our models - now what? The whole purpose of this exercise was to make recommendations! What we're going to do is use cosine similarity to compare our latent vectors: given the latent vector representation of a movie, we'll find the most 'similar' movies in the latent space to provide the user with recommendations (a small sketch of this follows below)!
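A small sketch of that cosine-similarity lookup over the item latent vectors (the matrix Y here plays the same role as in the ALS sketch earlier, and the query index is hypothetical):

import numpy as np

def most_similar(item_idx, Y, top_n=5):
    """Indices of the top_n items whose latent vectors are most cosine-similar
    to the latent vector of item item_idx."""
    norms = np.linalg.norm(Y, axis=1)
    sims = (Y @ Y[item_idx]) / (norms * norms[item_idx] + 1e-12)
    sims[item_idx] = -np.inf              # exclude the query item itself
    return np.argsort(-sims)[:top_n]

Y = np.random.rand(30, 5)                 # stand-in item latent matrix
print(most_similar(7, Y))                 # recommendations for a hypothetical item 7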

On the right below we have the recommendations for the movie Independence Day from our SGD model, and on the left those from the ALS model.




Which do you think is better? There are some common recommendations - The Rock, Twister - and the recommendations seem reasonable for the most part, aside from a stray Toy Story. In reality you'd probably combine these recommendations in the final model and perhaps supplement them with some content based filtering to get the 'best' recommendation.

Until next time...

References

[1] https://blog.insightdatascience.com/explicit-matrix-factorization-als-sgd-and-all-that-jazz-b00e4d9b21ea

Thursday 26 October 2017

Recommender Systems: Collaborative Filtering and matrix factorization

What is a recommender system?

Think about any website which has a community of users who purchase/interact with content (items) on that website. Based on users' interactions with the items, various personalised recommendations can be tailored to specific user behaviour. More concretely, consider the purchase suggestions Amazon offers you, or movies that Netflix recommend you watch based on your past viewing behaviour. The machinery behind these recommendations is the basis of recommender systems.

Companies are looking to personalise a user's experience in a way that resonates with them - to make the user experience feel unique and tailored to their needs. These days websites capture all sorts of attributes from a user's experience/interaction - time spent on pages, scroll speed, viewing behaviour, purchasing behaviour etc... One may ask how we can use such detailed information to offer a better user experience; through the construction of a recommender system.

Now recommender systems are generally based on two types of user behaviour:
  1. Implicit feedback
    • Information based on user behaviour such as viewing/purchasing/click behaviour; user preferences can be inferred.
  2. Explicit feedback
    • Where the user has given an explicit rating to a movie or item (a 5 star rating for example).
There are generally two approaches to building a recommender system, content based filtering and collaborative filtering.

Content based filtering

The approach of content based filtering is as follows:
  • Each item is characterised by a series of properties or features. For example if we are using content based filtering to build a movie recommendation system then each movie could have attributes such as lead actor, director, genre, etc. 
  • A profile based on these attributes is built for each user, and hence new items can be recommended to the user via a similarity measure between the user's profile and a specific item's profile.
This approach requires the features of each item to be defined explicitly, and hence it is limited by the level of feature engineering. It also means that as new items with new characteristics are added, these characteristics must be populated for the existing items too, and the dimensionality of our feature space can grow very quickly. We will not focus on this approach in the remainder of the blog.


Collaborative Filtering

As its name suggests, collaborative filtering uses an aggregate (or collaborative) approach to suggest new items, based on the premise that users who share a common interest in certain items will also share a common interest in other items. Unlike content based filtering, this approach doesn't require hand crafted features for each item and hence can be more easily scaled to larger and even different domains. We will focus on the collaborative filtering approach in building our recommender system and will use the MovieLens dataset in our example [1].

The approach we take in building the collaborative filter borrows from some linear algebra results, namely the Singular Value Decomposition (SVD) of a matrix. I will first define exactly what SVD is and then I'll add some context into how it helps us with creating a recommender system.


Singular Value Decomposition (SVD)

Given a matrix $M \in \mathbb{R}^{m \times n}$ there exists a factorisation $ M = U \Sigma V^{*}$ where 
  • $U \in \mathbb{R}^{m \times m}$ is a unitary matrix (i.e. $UU^{*} = U^{*}U = I$)
  • $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal, with non-negative entries (the singular values)
  • $V^{*} \in \mathbb{R}^{n \times n}$ is the conjugate transpose of a unitary matrix $V$
Note that since I've taken $\mathbb{R}$ as the field the matrix entries come from, the conjugate transpose is just the transpose (i.e. $V^{*} \equiv V^{T}$) and the unitary matrices are also known as orthogonal matrices.

What this is telling us is that any matrix can be decomposed into the above structure. If you think about the eigenvalue decomposition (EVD) of a matrix, it only exists for (certain) square matrices, whereas the SVD of an arbitrarily (non-square) shaped matrix always exists. There is a deep connection between the SVD and EVD of a matrix - the non-zero entries of $\Sigma$ are the square roots of the non-zero eigenvalues of $MM^T$. See section 3.2 of [2].

Yeah, yeah that's a cool result, but how does that help us with the recommender system? Well, the motivation is borrowed from Latent Semantic Analysis  (LSA) - a topic in Natural Language Processing which aims to understand relationships between documents and the words they contain. In LSA, a term-document matrix is created in which each entry contains the number of occurrences of a particular word in a particular document. This term-document matrix is then decomposed via SVD with the resultant matrices representing documents and terms which capture the pattern of term usage among documents. The top $k$ singular values are kept, such that the decomposition results in a representation of the original matrix in a lower dimensional space. The idea is that this lower dimensional representation captures the structure of the relationship between documents and terms, whilst filtering out the noise. In this lower dimensional 'latent' space one can simply find the similarity of documents or terms by using our $U$ and $V$ matrices.

So in our case, the analogy of a term-document matrix is a user-item matrix in which each row of the matrix is a user, and each column is an item. For our example the items will be movies. Each entry $(i,j)$ corresponds to the rating the $i^{th}$ user gave to the $j^{th}$ movie. The SVD will result in a latent space in which user and item vectors reside and we can calculate their similarities.

Interpretation of SVD

Geometric Interpretation

In the case of $M \in \mathbb{R}^{n \times n}$ - i.e. a linear map within a single space - there is a nice geometrical interpretation of the action of $M$, which the image below illustrates:
  • $V$ is an orthogonal matrix - which corresponds to a reflection or rotation
  • $\Sigma$ is diagonal and hence is a scaling matrix
  • $U$ is also an orthogonal matrix - which is another reflection or rotation


More generally, consider $M \in \mathbb{R}^{m \times n}$, a linear map between two different spaces. We can rewrite the SVD equation as $$ MV = U \Sigma $$ If we consider each of the columns of $V$ separately then it is apparent that $Mv_{j} = \sigma_j u_{j}$. Recalling that $U$ and $V$ are orthogonal matrices, and hence their columns form orthonormal bases, the action of $M$ is as follows: the basis of one coordinate system $\{v_1, \dots, v_n \}$ is mapped to a 'scaled' basis of a different coordinate system $\{\sigma_{1} u_{1}, \dots, \sigma_{m} u_{m} \}$. In words, every matrix $M \in \mathbb{R}^{m \times n}$ can be interpreted as having the following action:
  1. A first rotation in the input space
  2. A simple positive scaling that takes a vector in the input space to the output space
  3. And another rotation in the output space

SVD as an approximation

Note that in the SVD decomposition there is an equality; the matrix $M$ is exactly reconstructed. In practice, we don't actually want the exact matrix - instead we want a lower, $k$-dimensional approximation which will have less noise and will (hopefully) have captured the latent structure of the user/item relationships. It turns out that if we keep only the top $k$ singular values, the resulting matrix is the best rank-$k$ approximation of $M$, where the notion of 'best' is defined by the Frobenius norm. See Section 2.1 of [2] for details.

We can think of the rank-$k$ approximation via SVD of a real matrix $M$ as the decomposition of $M$ into the sum of $k$ rank-1 matrices. That is,
$$ M \approx \sum_{j=1}^{k} \sigma_j u_j v_{j}^{*}$$
with $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k > 0$. This is a kind of expansion of $M$ in terms of rank-1 matrices, with the first term capturing the 'most' of $M$, followed by the second term and so on, with each additional term improving the accuracy of the approximation.
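A quick numerical illustration of this with numpy: keep the top $k$ singular values, rebuild the matrix, and note that the Frobenius error of the truncated reconstruction is exactly the root of the sum of the squared discarded singular values.

import numpy as np

M = np.random.rand(8, 6)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 3
# Rank-k approximation: the sum of the first k rank-1 terms sigma_j * u_j * v_j^T
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("rank of M_k:", np.linalg.matrix_rank(M_k))
print("Frobenius error:", np.linalg.norm(M - M_k, "fro"))
print("sqrt of sum of discarded sigma^2:", np.sqrt(np.sum(s[k:] ** 2)))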

For a great exposition on SVD and some various angles on interpretation, check out [3] by Jeremy Kun which should help you gain some intuition.

Well, that wraps up a short summary of the techniques we'll use to build our recommender system! The next post will focus on some of the subtleties of the implementation and some tweaks which have been used to win competitions in the past!

References

[1] https://grouplens.org/datasets/movielens/
[2] http://www.math.ucla.edu/~dakuang/cse6040/lectures/6040_lecture15.pdf
[3] https://jeremykun.com/2016/04/18/singular-value-decomposition-part-1-perspectives-on-linear-algebra/

Thursday 22 June 2017

Transfer learning and hotdog classification!

Inspired by a recent episode of Silicon Valley where Jian Yang builds an app to classify pictures as either Hotdog or Not Hotdog, I decided to have a crack at retraining the final layer of the Inception V3 network to do so.

Thankfully, the lovely folks at Google make it ridiculously easy to do so. So instead of maxing out my poor little Macbook pro for weeks (or months :/) of training - we can utilise the already trained network and simply adjust the final layer for our use, more on this below.

Note that this post assumes familiarity with CNN architectures and their use in computer vision - see the Stanford course CS231n: Convolutional Neural Networks for Visual Recognition for a great introduction to CNNs.

Background

The Inception v3 image recognition neural network came out of Google at the end of 2015. The architecture builds upon the original inception architecture of GoogLeNet [1], which was also designed to perform well even under strict constraints on memory and computational budget [2].

Inception v1

The basis of this (at the time) new architecture was the so-called inception module. It builds on the traditional convolutional layer, which has a fixed filter height and width: instead, it applies an assortment of different sized filters to the input, allowing the model to perform multi-level feature extraction on each input:

Figure 1: Taken from Going deeper with convolutions [1] - the naive inception module.

As you can see above, there are $1 \times 1$, $3 \times 3$ and $5 \times 5$ convolutions as well as a $3 \times 3$ max pooling applied to the input. In theory, this is great - as we are able to exploit multiple level spatial correlations in the input, however using a large number of $5 \times 5$ filters can be computationally expensive.

This train of thought led to the next iteration of the inception module:
Figure 2: Taken from Going deeper with convolutions [1] - the inception module with dimension reductions.

This architecture is essentially the same as the initial proposal except that each of the larger filters is preceded by a $1 \times 1$ convolution. These  $1 \times 1$ convolutions act as a dimension reduction tool in the channel dimension, leaving the spatial dimension untouched. The idea is that the important cross channel information will be captured without explicitly keeping every channel. Once the channel dimension reduction has been performed, these results feed into the larger filters which will capture both spatial and channel correlations. The $1 \times 1$ filters also have ReLU activation functions which introduce additional non-linearities into the system.

Put simply in [1]:

One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously. 


Inception v2/3

A key realisation of the earlier iterations of Inception was that the improved performance compared to its peers was driven by dimensionality reduction using the $1 \times 1$ filters. This has a natural interpretation in computer vision; as the authors put it [2]:

In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.

With reducing computational complexity in mind (and hence the number of parameters in the network), the authors sought a way to keep the power of the larger filters but reduce how expensive the operations involving them were. Thus they investigated replacing the larger filters with a multi-layer network that has the same input/output size but fewer parameters.

Figure 3: Replacing a $5 \times 5$ filter with a mini-network of two $3 \times 3$ filters.

Thus the inception module now looks like
Figure 4: Inception module where each $5 \times 5$ filter has been replaced by two $3 \times 3$ filters.

This replacement results in a 28% saving in computation [2]. Inception v2 and v3 share the same inception module architecture; the difference arises in the training details - namely batch normalisation of the fully connected layer of the auxiliary classifier.

Transfer Learning

Now that we have a (very) high level understanding of the architecture of Inception v3, let's talk about transfer learning.

Given how computationally expensive it is to train a CNN from scratch - a task which can take months - we can use the results of a pre-trained CNN to do the feature extraction for us. The idea is that the generic features of an image (edges, shapes, etc) which are captured in the early layers of the CNN are common to most images. That is, we can utilise a CNN that has been trained for a specific task on millions of images to do the feature extraction for our image recognition task.

We then strip the CNN of the final fully connected layer (which feeds the softmax classifier) and replace it with a fully connected layer built for our own task, and train that. There are several approaches to transfer learning, which depend on the task: for example, we can use the pre-trained CNN as a fixed feature extractor as detailed above, or we can use it as a starting point and retrain the weights in the convolutional layers. The amount of training data you have and the details of your task will generally dictate the approach. See [3] for a detailed study on transfer learning.

The pre-trained CNN we'll be utilising is an Inception v3 network as described above, trained on the ImageNet 2012 dataset, which contained 1.2 million images across 1000 categories.
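As an aside, the 'strip the final layer and retrain it' idea can also be sketched in a few lines with the Keras applications API; this is just an illustrative alternative to the retrain.py script used below, with layer sizes of my own choosing.

from tensorflow import keras
from tensorflow.keras import layers

# Pre-trained Inception v3 as a frozen feature extractor (ImageNet weights, no top layer)
base = keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                      input_shape=(299, 299, 3))
base.trainable = False

# New classification head for our two classes: hotdog / notdog
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)   # train only the new head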

Implementation

Thankfully Google had the foresight to make it dead easy to perform the aforementioned transfer learning using Tensorflow.
For my training images, I utilised the work done in [4] which provided around 350 hotdog images and around 50 not hotdog (notdog) images. We then utilise the script retrain.py which we'll pass parameters to from the terminal.

The script will determine the number of classes based on the number of folders in the parent folder. You don't need to worry about specific names for the images; you just need to make sure each folder contains the correctly labelled images.

The rest of the parameters are fairly self explanatory and are explained in the tutorial. One interesting set of parameters to note are random_crop and random_scale; these distort the training images in ways that preserve their meaning but add new training examples. For example, a slightly cropped or re-scaled picture of a hotdog is still a hotdog; instead of sourcing additional hotdog pictures we can distort an existing training example to teach the classifier something new.

Results

After running retrain.py with 500 training steps, I received the below training summary


99% Validation accuracy - not bad at all!
Let's have a look at a specific example and try the below hotdog

hotdogs (score = 0.99956) 
random (score = 0.00044)
Great news, this image is in fact a hotdog and our CNN recognises it as such with a high degree of confidence. What about a notdog?



random (score = 0.92311)
hotdogs (score = 0.07689)
This is a sunflower, not a hotdog! Yet the classifier is only fairly confident, rather than completely confident, that it is a notdog. This can be attributed to the limited number of training examples used to train the final layer before classification.
What about if we try to trick it - with a HOTDOG CAR!?




random (score = 0.70404)
hotdogs (score = 0.29596)

Not too bad at all, obviously the shape is similar to that of a hot dog but the chassis, wheels and other car-like features offer enough difference for our CNN to make the distinction.

What's next?

I haven't decided if I'll build some CNNs or have a crack at an LSTM implementation next - both will be using Tensorflow but I've got some ideas I want to play around with first which will shape which direction I head in.

References

[1] Szegedy, C. et al. 2014. Going deeper with convolutions, https://arxiv.org/pdf/1409.4842.pdf
[2] Szegedy, C. et al. 2015. Rethinking the Inception Architecture for Computer Vision, https://arxiv.org/pdf/1512.00567.pdf.
[3] Donahue, J. et al. 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,  https://arxiv.org/abs/1310.1531.
[4] https://github.com/Fazelesswhite/Hotdog-Classification

Friday 31 March 2017

word2vec and word embeddings

Where to now?

Now that we've got the basics of neural networks down pat, this opens up a world of related techniques that can build on this knowledge. I thought to myself that I'd get stuck into some Natural Language Processing (NLP) with a view to eventually implement some sort of Recurrent Neural Network using TensorFlow.

The wonderful folks at Stanford have all the lecture notes and assignments up online at http://web.stanford.edu/class/cs224n/syllabus.html which is a fantastic resource. 

Anyway before we can even begin to think of the applications, we must understand how to represent our text for input into any machine learning algorithm.

One-hot encoding

Naively, we may think to take our entire vocabulary (i.e all words in the training set - size $V$) and make a huge vector, with each entry in the vector corresponding to a word in our vocabulary. Thus each word is represented by a $V \times 1$ vector with exactly one entry equal to $1$ and all other entries $0$. For example, a word $x_i$ would be represented as
$$ x_i = \left[ \begin{array}{c} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{array} \right]$$ This is called one-hot encoding. The representation seems simple enough, but it has one problem - there is no natural way to embed the contextual similarity of words. Imagine we had three words: $x_1$ = dog, $x_2$ = cat and $x_3$ = pencil. Intuitively, if you were asked which two words were "similar", you would say that in a given context $x_1$ and $x_2$ are similar, whilst $x_3$ shares less similarity with $x_1$ or $x_2$. Now suppose our three words have the following one-hot encoded representations $$ x_1 = \left[\begin{array}{c} \vdots  \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0  \\ \vdots\end{array} \right], x_2 = \left[ \begin{array}{c} \vdots  \\ 1 \\ \vdots \\ 0 \\ \vdots \\ 0 \\\vdots \end{array} \right], x_3 = \left[ \begin{array}{c} \vdots  \\ 0 \\ \vdots \\ 0 \\ \vdots \\ 1 \\\vdots \end{array} \right]$$ We could hope that the dot product between vectors would give us a notion of similarity, but each of the $x_i$ are orthogonal, that is $x_i \cdot x_j = 0 \hspace{.1in} \forall i \neq j$. This representation also isn't all that great given that our vocabulary could be of the order of millions of words, which would result in humongous one-hot encoded vectors in which almost every entry (except one) is a 0. This is called a sparse representation, for obvious reasons. We'll now have a look at some more computationally efficient (and more accurate!) dense word representations.

Efficient Estimation of Word Representations in Vector Space

In 2013, Mikolov et al. presented Efficient Estimation of Word Representations in Vector Space, which proposed two architectures:
  • Continuous Bag of Words (CBOW) and
  • Skip-gram model
with the aim of minimising the computational complexity traditionally associated with NLP models. Somewhat surprisingly, these representations capture a lot of syntactic and semantic information in the structure of the vector space: similar words tend to be near one another (in terms of Euclidean distance), and vector operations can be used to relate analogous pairs of words. There is the famous $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ equation, which is quite remarkable. In the one-hot view of the world, all of our vectors were orthogonal, so dot products or any sort of vector addition/subtraction had no intuitive meaning and captured no relationship between words. See here for a cool visualisation.

Continuous Bag of words

Simply put, the CBOW aims to predict a word given an input context. For example, we would like a CBOW model to predict the word fox from an input context "The quick brown...". To perform such a task, it was proposed to use a simple feedforward neural network (wait - we know all about those now!) without an activation function in each neuron in the hidden layer. The idea is to train the NN on this classification task and then simply pick out the weight matrix in the hidden layer - this will be our dense representation.

Architecture of CBOW

The CBOW model looks at $m$ words either side of the target/center word $w_c$ ($C$ words in total, as in the diagram below) in order to predict it. In the architecture below, notice that each input word is a one-hot encoded vector of dimension $V \times 1$, where $V$ is the size of our vocabulary. $W$ is an $N \times V$ matrix, where $N$ is the dimension of the representation of our input vectors $x_1, \dots, x_C$, and $x_{i}$ is the one-hot encoded vector of the $i^{th}$ context word. We define $W x_{i} \equiv v_{i}$, which is a single column of $W$ due to the one-hot encoding. Below is a visual representation of a CBOW model.

Forward propagation:
  •  We have $C$ one hot encoded input vectors which feed into the hidden layer via the weight matrix $W$.
  • Since we have $C$ inputs acting on $W$, the resulting hidden vector $\hat{v}$ is computed as the average: $$ \hat{v} = \frac{1}{C} W \left( x_{1} + x_{2} + \dots + x_{C} \right) = \frac{1}{C} \left ( v_{1} + v_{2} + \dots + v_{C} \right)$$
  • We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$
    • Define the $j^{th}$ row of $W'$ as $u^T_{j}$.
  • Thus the $j^{th}$ entry of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_{j} \hat{v}$
  • Finally, we apply our old friend the softmax function to obtain our output: $$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}}$$
  • To measure the loss of our network, we'll use the cross entropy loss function, which we saw in the derivation of a feedforward neural network. $$l = - \sum_{j=1}^{V} y_j \log(\hat{y}_j)$$
    • This is just the negative of the log likelihood $p \left( w_c | w_{I_1}, \dots , w_{I_C} \right)$, where $w_c$ is the correct target word and $w_{I_j}$ represents the $j^{th}$ word of the input context $I$.
    • Thus we have $$l = - u^{T}_{c} \hat{v} + \log \left(\sum_{k=1}^V e^{ u^{T}_k \hat{v}} \right)$$
    • where $c$ is the index of the correct word - the one-hot vector for the correct word has a $1$ at the $c^{th}$ position.
  • We now just need to optimise the loss function to recover our input representation $\hat{v}$ and the output representation $u^{T}_j$ for $j = 1, \dots, V$.
Let's now take a look at how we optimise our loss function $l$ via gradient descent using backpropagation. We won't go through the algebra, but it is exactly analogous to what we've seen before. Now $$\frac{\partial l}{\partial u_j} = \left\{ \begin{array}{ll} (\hat{y}_j-1)\hat{v} & j = c \\ \hat{y}_j \hat{v} & \text{otherwise} \end{array} \right. $$ and $$\frac{\partial l}{\partial \hat{v}} = -u_c + \sum_{k=1}^{V} \hat{y}_k u_k$$
These define the update rules for stochastic gradient descent. Note the sum over $V$, the size of our vocabulary, which can potentially contain millions of tokens. This isn't computationally efficient and will result in terrible performance if we have to do this when training the model. Suspend your disbelief for now - I'll cover some of the methods that people have come up with to tackle this problem in my next blog post. We'll carry on for now and introduce the skip-gram model.
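To tie the forward-propagation steps above together, here is a small numpy sketch of the CBOW forward pass and loss for a single training example, with toy sizes of my choosing; W and W_prime correspond to $W$ and $W'$ above.

import numpy as np

V, N, C = 10, 4, 2                      # vocabulary size, embedding size, context size
W = np.random.randn(N, V) * 0.01        # input -> hidden weights (column i is v_i)
W_prime = np.random.randn(V, N) * 0.01  # hidden -> output weights (row j is u_j^T)

context_ids = [3, 7]                    # indices of the C context words
target_id = 5                           # index c of the word we want to predict

# Hidden layer: average of the context words' input representations v_i = W x_i
v_hat = W[:, context_ids].mean(axis=1)

# Unnormalised scores z_j = u_j^T v_hat, then the softmax
z = W_prime @ v_hat
y_hat = np.exp(z) / np.sum(np.exp(z))

# Cross entropy loss: minus the log probability of the correct word
loss = -np.log(y_hat[target_id])
print(loss)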

Skip-Gram

The skip-gram model does somewhat the opposite of CBOW: it takes a single word (the center word) as input and predicts the surrounding words (the context). It uses the same approach in that we'll train a network to perform the prediction task and we'll just pick up the weight matrices it learns as our dense representation of words.

Architecture of the Skip-Gram model

The Skip-Gram model takes an input (center) word $x_k$ and predicts the context from it. The input word is a single one-hot encoded vector. We denote the output by a collection of $C$ vectors $\{\hat{y}_1, \dots, \hat{y}_C\}$, where each $\hat{y}_i$ is a vector of probabilities for each word in the vocabulary occurring. Below is an image of the architecture of a Skip-Gram model - notice that we have the same matrices $W$ and $W'$ as in CBOW; these operate on our single input in the same way they operated on the inputs and on $\hat{v}$ in CBOW.


Forward propagation:
  •  We have a single one hot encoded input vector $x_k$ which feeds into the hidden layer via the weight matrix $W$. Define $W x_k \equiv v_c$ - this is also called the center word input representation.
  • Again, We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$
  • Thus the $j^{th}$ element of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_j v_c$. $C$ copies of this output vector are produced - one for each of the words in the context we are trying to predict.
  • Again, we apply the softmax function to each of the $C$ output vectors to obtain our output: $$\hat{y}_{i,j} = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}} =  \frac{e^{u^{T}_j v_c}}{\sum_{k=1}^{V} e^{u^{T}_k v_c}}$$ where $i \in \{1, \dots , C\}$ indexes each of the output vectors.
  • Note that since we use $C$ copies of the same output matrix for the prediction of each surrounding word, this amounts to a conditional independence assumption. That is, the probability of each surrounding word occurring given the center word is independent of its relative position.
  • To measure the loss of our network, we'll again use the cross entropy loss function. However, now that there are $C$ copies of the output, we need to sum over them: $$l = - \sum_{j=1}^{V} \sum_{i=1}^{C}  y_{i,j} \log(\hat{y}_{i,j}) $$ where $y_{i}$ is the one-hot encoded vector of the $i^{th}$ actual output (context) word. For example, if the actual output words for the center word like were I and ice-cream, then there are two one-hot encoded vectors $y_{1}, y_{2}$. Suppose I has position $10000$ in our vocabulary; then $y_{1,10000}$ would be the only non-zero entry of $y_{1}$. Similarly, $y_{2,5000}$ would be the only non-zero entry of $y_{2}$ if ice-cream had position $5000$ in the vocabulary.
  • This is just the negative of the log likelihood $p \left( w_{c-m},\dots, w_{c-1}, w_{c+1}, \dots , w_{c+m} | w_c \right)$, where $w_{j}$ (for $j \neq c$) represents the $j^{th}$ word of the surrounding window and $w_c$ is the center word. We have introduced the index $m$ to index the words surrounding the center word; that is, the window of size $C = 2m$ is symmetric about the center word, with $m$ words to the left and right of it.
    • Thus we have $$l = - \sum_{j=0, j\neq m}^{2m} u^{T}_{c-m+j} v_c + 2m \log{\sum_{j=1}^{V} e^{u^{T}_j v_c }} $$
  • It's important to see that this loss function also suffers from the problem that we need to sum over our entire vocabulary $V$ at each parameter update step - which is computationally far too expensive and sometimes not even possible (the sketch below makes this cost explicit).
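As flagged in the last bullet, the expensive part is the normalisation over the whole vocabulary; the sketch below computes the skip-gram loss for one (center word, context window) pair and makes that $O(V)$ term explicit. Again, the sizes and names are toy choices of mine.

import numpy as np

V, N = 10, 4
W = np.random.randn(N, V) * 0.01        # input representations (column c is v_c)
W_prime = np.random.randn(V, N) * 0.01  # output representations (row j is u_j^T)

center_id = 5
context_ids = [3, 7, 1, 8]              # the 2m surrounding words

v_c = W[:, center_id]
z = W_prime @ v_c                       # scores u_j^T v_c for every word in the vocabulary
log_Z = np.log(np.sum(np.exp(z)))       # the O(V) normalisation term

# l = - sum_j u_{context_j}^T v_c  +  2m * log( sum_k exp(u_k^T v_c) )
loss = -np.sum(z[context_ids]) + len(context_ids) * log_Z
print(loss)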

Results

There are some interesting remarks on the word2vec Google Code page regarding model selection (skip-gram vs CBOW) and hyper parameter selection. We won't get into the details until we've covered the aforementioned sampling approaches for evaluating the models, but one point to note is 
  • architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
We can understand the skip-gram performing better for infrequent words because the embedded words are not averaged as they are in CBOW (i.e. when we form $\hat{v}$). The averaging will dampen information from infrequently occurring words, in contrast to the skip-gram model where no such averaging takes place.

In a second paper published by Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, they published the following results for a $1000$ dimensional skip-gram model. Taking the first two principal components, the following image was created:



The caption says it all really - note the geometric similarity of differences between capital cities and their countries! This was captured by the skip-gram model without providing any supervised information regarding these relationships. You can imagine simple vector addition holding approximately true in this 2D vector space: $$\text{Portugal} - \text{Lisbon} + \text{Beijing} \approx \text{China}$$

Conclusion

In the next blog I'll look at some clever sampling based approaches that people have come up with to tackle the problem of having to normalise over the entire vocabulary $V$ in order to efficiently train the model. Then we can mess around with some text classification tasks, by using our word embeddings in our favourite ML techniques! The hope is that once we've played around with some toy problems to get a strong handle of the implementations of word2vec, then I'll introduce Recurrent Neural Networks and then we can get into the nitty gritty of some super interesting deep learning architectures :)




Friday 24 February 2017

Optimisation routines

Now that we are able to train the neural network, let's take a look at how we can optimise the training, as in reality we will be dealing with deep networks (with many hidden layers and neurons) which can potentially take days to train. I'm going to take a step back and have a look at how we might generally attack a convex optimisation problem. Convex optimisation is a very well researched and understood area, so it makes sense to have a look at it first.

The problem

We are going to fit a simple logistic regression to the following data 
  • http://cs229.stanford.edu/ps/ps1/logistic_x.txt
  • http://cs229.stanford.edu/ps/ps1/logistic_y.txt
There are two predictors (columns) in the first file and one outcome ($y = \pm 1$) in the second. Note that I've borrowed these from the cs229 class that I'm currently working through. Our task is to minimise the average empirical loss: $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log \left(1 + e^{-y^{(i)} \theta^{T} x^{(i)}}\right) $$ where $y^{(i)}$ is the actual outcome for observation $x^{(i)}$. Note that $x^{(i)}$ is a vector containing the predictors for that observation.

Is this a convex optimisation problem?

It turns out that if we can show that the Hessian $H$ of this loss function satisfies $$z^{T} H z \ge 0$$ $\forall z \in \mathbb{R}^{3}$, then $L(\theta)$ is a convex function (more generally, if $H \in \mathbb{R}^{n \times n}$ then this must hold $\forall z \in \mathbb{R}^{n}$). Using the definition of $H$, we have $$ H_{pq} = \frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_q}$$ The details of the calculation are quite straightforward, so I'll omit them, but the result is $$H_{pq} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 + e^{-\theta^T x^{(i)}}} \left( 1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}} \right) x^{(i)}_p x^{(i)}_q $$ where the sum is over all of our training examples. We can write the last two factors as a matrix product (and subsequently drop the indices): $x^{(i)} x^{(i)^{T}}$. Since the first two factors both lie in $(0, 1)$, their product is positive, so we can ignore it when assessing whether $z^T H z \ge 0 \hspace{.1in} \forall z$. Thus $$z^T H z \propto \sum_{i=1}^{m} z^T x^{(i)} x^{(i)^{T}} z = \sum_{i=1}^{m} (z^T x^{(i)})^2 \ge 0 \hspace{0.1in} \forall z$$ Hence $H$ is positive semidefinite, which implies that $L(\theta)$ is convex. This means it has a global minimum, which makes our lives a lot easier. Since calculating the Hessian is rather easy in this setting, we can use Newton's method.

Newton's method

Newton's method utilises the Hessian and hence the curvature of the loss surface to find an optimal path to the minima. With this additional information, we can expect the algorithm to converge faster than gradient descent, which only uses first derivative information. The update rule is as follows $$ \theta \rightarrow \theta - H^{-1} \nabla_{\theta} L(\theta) $$
Let's see the performance of Newton's method vs Gradient descent:
I've performed 20 iterations of both Newton's method and gradient descent - clearly Newton's method converges a lot faster than gradient descent. Looking at the update step, it is obvious that this method won't scale well to many parameters, since each step requires the calculation of the gradient and the Hessian (and the inversion of the latter). Furthermore, it's obvious that this method will fall apart if $H$ is singular. So in our toy example above, which had a convex loss function and minimal parameters to estimate, Newton's method was king, but obviously in our neural network we couldn't apply such an algorithm. The beauty of backpropagation was that after a forward pass through the network we had all of the first order derivatives we required for the update step. This fortune does not, however, extend to the second derivatives we would require to calculate $H$. It seems we're stuck with regular old batch gradient descent to train our neural network...or are we?
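For reference, here is a compact numpy sketch of Newton's method on this logistic loss; the data is randomly generated as a stand-in for the linked cs229 files, so only the mechanics (gradient, Hessian, update) should be taken from it.

import numpy as np

# Stand-in data: m observations, an intercept column plus 2 predictors, labels +/- 1
m = 200
rng = np.random.default_rng(0)
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
y = np.sign(rng.normal(size=m))

theta = np.zeros(3)
for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-y * (X @ theta)))     # P(correct label | x) under theta

    # Gradient and Hessian of L(theta) = (1/m) sum log(1 + exp(-y theta^T x))
    grad = -(1.0 / m) * X.T @ (y * (1.0 - p))
    H = (1.0 / m) * (X * (p * (1.0 - p))[:, None]).T @ X

    theta -= np.linalg.solve(H, grad)              # Newton update: theta -= H^{-1} grad

print(theta)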

Stochastic Gradient Descent (SGD)

Recall our update step for batch gradient descent: $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ where $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log \left(1 + e^{-y^{(i)} \theta^{T} x^{(i)}}\right) $$ Note that $L(\theta)$ is a function of all $m$ training examples; that is, for a single update to parameter $\theta_j$ we need to use all of our training data. What if we could calculate our loss on a subset of the training data? The hope is that this subset is "representative enough" of the entire dataset such that the resulting update to $\theta_j$ is generally in the same direction as the update calculated on the entire dataset. This is the essence of Stochastic Gradient Descent.

Consider also that when the training dataset has several hundred million rows, we may not even be able to fit all the data in memory to perform batch gradient descent!

I'll define the following terms commonly found when talking about SGD

  • Epoch - once every datapoint in the training set has been used, one epoch has occurred.
  • Mini batch - the subset of training data that is used in the parameter update
  • Batch size - the size of the mini batch
So per epoch, we will be able to update the parameters $\frac{N}{\text{batch size}}$ times, where $N$ is the number of training examples. Compare this to batch gradient descent where, by definition, the parameters are updated once per epoch.
The pseudocode is as follows
for epoch in range(num_epochs):
    shuffle(data)                                  # so each epoch sees different mini batches
    for mini_batch in split(data, batch_size):
        grads = loss_gradient(params, mini_batch)  # derivative of the loss on this mini batch
        params = params - learning_rate * grads    # parameter update

Since we are taking a subset of our training data to update our parameters, the updates may be volatile, as each subset may contain slightly different information about our loss surface. This is troublesome as we approach the optimum: in SGD the optimisation path will tend to oscillate around the minimum we seek. To remedy this, we use a learning rate schedule, which is simply a scaling (generally based on heuristics) of our learning rate at each iteration. The hope is that by the time the algorithm is near the minimum, the learning rate has been scaled down enough that successive parameter updates are relatively stable from that point forward. This process is also called annealing the learning rate. The trick is timing the schedule so that the learning rate is small enough when the algorithm is near the minimum - if you are too aggressive with the schedule, the learning rate will become too small too soon and you won't get near the minimum, as the updates to the parameters will tend to zero.

Let's have a look at the performance of SGD vs batch gradient descent on our neural network. I've implemented SGD for the aforementioned neural network and run the make_moons dataset with 100,000 data points and 10 neurons in the hidden layer. See below for a plot of the loss vs. epoch for both SGD and batch gradient descent.

Batch gradient descent:
  • Epochs: 200 (i.e 200 updates to the parameters, each based on the full training set)
  • Execution time: 17s
SGD:
  • Epochs: 1
    • Mini batch size: 500 (i.e 200 updates to the parameters, each based on mini batch of size 500)
  • Execution time: 6s
We can see that SGD actually outperforms batch gradient descent here and takes about a third of the time to run! I haven't actually applied any learning rate schedule here, you can see an egregious spike in the loss from SGD at around the $160^{th}$ epoch. See what happens below when I simply set $\eta \rightarrow \frac{\eta}{10}$ at the $150^{th}$ epoch:


Notice that the aforementioned spike is now gone and the SGD results look very promising given the accuracy and performance (compared to batch gradient descent). In practice, batch gradient descent is rarely used, due to memory constraints and the fact that the randomness of SGD can help the algorithm escape local minima that batch gradient descent would naturally get stuck in.

Momentum

We can introduce the concept of momentum to our parameter updates. It is applied as follows:
Recall the usual update to our parameters is as follows $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ which we can write in two steps, define
$$\Delta \theta_j = \frac{\partial}{\partial \theta_j}L(\theta)$$ such that the update becomes $$\theta_j \rightarrow \theta_j - \eta \Delta \theta_j$$ We'll now modify our definition of $\Delta \theta_j$ as $$\Delta \theta_j = \gamma \Delta \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ and our update is $$\theta_j \rightarrow \theta_j + \Delta \theta_j$$ What we've done is influenced our current update of $\theta_j$ by the amount it was updated by in the previous step. Yes we've introduced yet another hyperparameter, but the idea here is to give it some "momentum" such that the updates are influenced by previous updates and the optimsation continues towards the minimum. Consider the case when the loss surface is a long narrow ravine with steep walls - SGD will typically oscillate across the ravine as the update will point down the steep walls, the momentum term will help move the algorithm down the ravine towards to minimum we seek. See the effect of momentum on training our neural network below

We see that momentum allows the algorithm to descend very quickly and hits the minimum loss around 40 epochs in - compared to ~140 in our previous iteration.

Hopefully you've now got a good grasp of batch gradient descent vs SGD and some of the subtleties and hyperparameters that need to be considered when training a model with SGD. For a more in depth exploration of various optimisation techniques, including some more advanced methods see here. In production, we'll rely on the built in SGD routines which have been highly tuned for performance.



Friday 27 January 2017

Feedforward Artificial Neural Network pt5: Additional analysis

Now that we've finally implemented our ANN, let's have a play around with some of the parameters to get an understanding of how they affect our network and its results.
The tricky part about training ANNs is that the loss function isn't necessarily convex, which means that we can't use our usual convex optimisation routines. The fact that the loss function isn't necessarily convex means that just because we find a local minimum, it doesn't mean it's the global minimum; non-convex problems may converge to different local minima depending on the parameters of the optimisation routine. We'll explore some of these parameters below.

Learning Rate

Recall how the learning rate $\eta$ enters our optimisation procedure via the weight updates in gradient descent: $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ It essentially controls the step size at each update. Recall we had some funny bumps in our loss function at certain iterations - let's take a closer look. I've plotted the results of two different training runs of the ANN below.
The two lines correspond to the total loss as a function of the number of iterations in our training of the ANN. The blue line has $\eta = 0.001$ and the green line has $\eta = 0.01$. You can see that the green line has those funny bumps we witnessed before - this is the training run with the larger learning rate. The spikes occur when the step size is too large and we overshoot the minimum. Notice that the blue line doesn't have these overshoots, however it takes more iterations to approach the minimum. If we take a step size which is far too large, then we consistently overshoot the minimum - never converging:


The key is finding a learning rate which will find the minimum within a reasonable timeframe. Although our selection ($\eta = 0.001$ vs $\eta = 0.01$) didn't make a huge difference in this case, consider an ANN with multiple hidden layers and thousands of neurons in each layer - such a network may take hours (or days) to train depending on how we choose the learning rate.
Depending on the problem at hand, you may value accuracy more than efficiency or vice versa; this will dictate how you choose your learning rate, which you will usually select using cross validation.

Regularisation / Weight Decay

Say we have our initial loss function (the cross entropy loss) $L_0$ and we add a regularisation term such that we now have $$L = L_0 + \frac{\lambda}{2n} \sum_{w} w^2$$ where the sum is over all weights. Now if $\lambda$ is large then the second term will dominate $L$ and the task of optimising the entire expression will be reduced to minimising $\sum_w w^2$. If $\lambda$ is small then the first term dominates and there are fewer restrictions placed on $w$. This regularisation term controls $w$ by preventing it from becoming overly large and helps stop us from overfitting the model. If we want to use gradient descent to minimise this regularised loss function we have, for each weight $w$, $$ \frac{\partial L}{\partial w} = \frac{\partial L_0}{\partial w} + \frac{\lambda}{n} w$$ so the update $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ becomes $$ w \rightarrow w - \eta \frac{\partial L_0}{\partial w} - \frac{\eta \lambda}{n} w$$ $$\implies w \rightarrow \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial L_0}{\partial w}$$ That is, at each iteration the weight $w$ is rescaled by a factor of $\left( 1 - \frac{\eta \lambda}{n} \right)$; this is referred to as weight decay and, as mentioned before, limits the magnitude of $w$.

Weight initialisation

In this section we'll take a look at why we chose to initialise our weights as we did (from a normal distribution with specific parameters). Recall the definition of the weight update from our gradient descent algorithm: $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ If the second term in this expression is small or zero, then there is effectively no (or very little) update to $w$. This slows our training down incredibly, such that after each iteration our weight $w$ changes only ever so slightly; obviously we would like to avoid this situation at the start of the procedure. Recall the backpropagation rules for $W^{(1)}$:

  • $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$
  • $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$

We see that the $(1-\tanh^{2}(Z^{(1)}))$ term enters the equation (more generally, this will be the derivative of the activation function), so the update to the weights $W^{(1)}$ is $$W^{(1)} \rightarrow W^{(1)} - \eta \, {x}^T \left( (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T \right)$$ That is, the amount we update the weights by is proportional to the derivative of our activation function. Thus we want to avoid initialising our weights in a region where this derivative is close to zero. Below is a plot of the $\tanh$ activation function.

We can see that this activation function has its derivative approach zero at both extremes: as $x \rightarrow \infty$ and as $x \rightarrow -\infty$. Let's think about a more general ANN for a moment - suppose we have an ANN with 1000 inputs and a single training example where each input is equal to $1$. We have as usual $$Z^{(1)} = x W^{(1)} +b^{(1)}$$ If we have initialised each entry $W^{(1)}_{ij}$ and $b^{(1)}_j$ from a standard normal distribution (iid), then each entry of $Z^{(1)}$ will be the sum of 1001 iid standard normal variables. Such a sum has mean $0$ and standard deviation $\sqrt{1001}$, i.e. a very wide distribution with a relatively high probability of giving a large negative or positive result, so the derivative of the activation function will very likely be close to zero. This isn't what we want.

What about if we instead initialise the weights from a normal distribution with mean $0$ and standard deviation $\frac{1}{\sqrt{1000}}$? We know that the variance of a sum of independent random variables is the sum of their variances, so (keeping the bias initialised from a standard normal) each entry of $Z^{(1)}$ now has mean $0$ and standard deviation $$\sigma = \sqrt{\frac{1000}{1000}+1} = \sqrt{2}$$ which is a lot narrower than before - there is a much smaller chance of initialising at values where the derivative of the activation function is close to $0$. Below is a comparison of the resulting initialisation distributions from the toy example - the green line is the distribution for the refined initialisation, while the red line results from initialisation with standard normal variables.

More generally, for a given network we will initialise the weights from a Gaussian distribution with mean $0$ and standard deviation $\frac{1}{\sqrt{N_{in}}}$, where $N_{in}$ is the number of inputs feeding into the layer being initialised.
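A tiny numpy check of the toy example above, comparing the spread of the pre-activations $Z^{(1)}$ under standard normal initialisation versus the $\frac{1}{\sqrt{N_{in}}}$ scaling (the bias is kept standard normal in both cases, matching the arithmetic above):

import numpy as np

n_in = 1000
x = np.ones(n_in)                 # the toy training example: 1000 inputs, all equal to 1
rng = np.random.default_rng(0)
n_hidden = 2000                   # lots of hidden units, just to estimate the spread well

# Standard normal initialisation: entries of Z have standard deviation ~ sqrt(1001)
W_std = rng.normal(0.0, 1.0, size=(n_in, n_hidden))
b_std = rng.normal(0.0, 1.0, size=n_hidden)
print(np.std(x @ W_std + b_std))              # roughly 31.6

# Scaled initialisation N(0, 1/sqrt(n_in)): entries of Z have standard deviation ~ sqrt(2)
W_scaled = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_hidden))
print(np.std(x @ W_scaled + b_std))           # roughly 1.41, far less likely to saturate tanh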
Next time we'll have a look at optimising our network using stochastic gradient descent and maybe play around with some different datasets.