This is another "learning the learning rate" paper, which predates (and might have inspired) the "Speed learning on the fly" paper I recently wrote notes about (see \cite{journals/corr/MasseO15}). In this paper, they consider the offline training scenario, and propose to do gradient descent on the learning rate by unrolling the *complete* training procedure and treating it all as a function to optimize, with respect to the learning rate. This way, they can optimize directly the validation set loss. The paper in fact goes much further and can tune many other hyperparameters of the gradient descent procedure: momentum, weight initialization distribution parameters, regularization and input preprocessing. #### My two cents This is one of my favorite papers of this year. While the method of unrolling several steps of gradient descent (100 iterations in the paper) makes it somewhat impractical for large networks (which is probably why they considered 3layer networks with only 50 hidden units per layer), it provides an incredibly interesting window on what are good hyperparameter choices for neural networks. Note that, to substantially reduce the memory requirements of the method, the authors had to be quite creative and smart about how to encode changes in the network's weight changes. There are tons of interesting experiments, which I encourage the reader to go check out (see section 3). One experiment on training the learning rates, separately for each iteration (i.e. learning a learning rate schedule), for each layer and for either weights or biases (800 hyperparameters total) shows that a good schedule is one where the top layer first learns quickly (large learning), then the bottom layer starts training faster, and finally the learning rates of all layers is decayed towards zero. Note that some of the experiments presented actually optimized the training error, instead of the validation set error. Another looked at finding optimal scales for the weight initialization. Interestingly, the values found weren't that far from an often prescribed scale of $1 / \sqrt{N}$, where $N$ is the number of units in the previous layer. The experiment on "training the training set", i.e. generating the 10 examples (one per class) that would minimize the validation set loss of a network trained on these examples is a pretty cool idea (it essentially learns prototypical images of the digits from 0 to 9 on MNIST). Another experiment tried to optimize a multitask regularization matrix, in order to encourage forms of softweighttying across tasks. Note that approaches like the one in this paper make tools for automatic differentiation incredibly valuable. Python autograd, the author's automatic differentiation Python library https://github.com/HIPS/autograd (which inspired our own Torch autograd https://github.com/twitter/torchautograd) was in fact developed in the context of this paper. Finally, I'll end with a quote from the paper, that I found particularly funny: "The last remaining parameter to SGD is the initial parameter vector. Treating this vector as a hyperparameter blurs the distinction between learning and metalearning. In the extreme case where all elementary learning rates are set to zero, the training set ceases to matter and the metalearning procedure exactly reduces to elementary learning on the validation set. Due to philosophical vertigo, we chose not to optimize the initial parameter vector."
Your comment:
