Learning to learn by gradient descent by gradient descent on ShortScience.org


Learning to learn by gradient descent by gradient descent
Marcin Andrychowicz and Misha Denil and Sergio Gomez and Matthew W. Hoffman and David Pfau and Tom Schaul and Nando de Freitas
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.NE, cs.LG


Summary by Peter O'Connor 8 years ago

# Very Short

The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**.  This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**.

# Short

Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$ (the *optimizee*), which we wish to minimize with respect to $\theta$.  This is the typical situation when training a neural network with stochastic gradient descent, where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here).

The "vanilla" gradient descent update is: $\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate.  Other optimizers (Adam, RMSProp, etc.) replace the scaled gradient $-\alpha_t \nabla_{\theta_t} f(\theta_t)$ with an update computed from the history of gradients, for example from exponentially weighted running averages of past gradients and of their squares.
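The difference between the two kinds of update can be sketched in a few lines of plain Python. This is only an illustration: the quadratic toy problem, the constant `CURVATURE`, the function names, and the step sizes are assumptions made here, and classical momentum is used as the simplest history-based rule rather than Adam or RMSProp.

```python
# Hypothetical toy optimizee: f(theta) = 0.5 * sum(a_i * theta_i**2)
CURVATURE = [1.0, 10.0]

def f(theta):
    return 0.5 * sum(a * x * x for a, x in zip(CURVATURE, theta))

def grad_f(theta):
    return [a * x for a, x in zip(CURVATURE, theta)]

def sgd_step(theta, alpha):
    """Vanilla update: theta_{t+1} = theta_t - alpha * grad f(theta_t)."""
    return [x - alpha * g for x, g in zip(theta, grad_f(theta))]

def momentum_step(theta, velocity, alpha, beta):
    """Update driven by an exponentially weighted sum of past gradients."""
    grads = grad_f(theta)
    velocity = [beta * v + g for v, g in zip(velocity, grads)]
    theta = [x - alpha * v for x, v in zip(theta, velocity)]
    return theta, velocity

theta_sgd = [1.0, 1.0]
theta_mom, velocity = [1.0, 1.0], [0.0, 0.0]
for _ in range(50):
    theta_sgd = sgd_step(theta_sgd, alpha=0.05)
    theta_mom, velocity = momentum_step(theta_mom, velocity, alpha=0.05, beta=0.5)

print(f(theta_sgd), f(theta_mom))
```

Both rules are hand-designed: the way the gradient history enters the update is fixed in advance, which is the part the paper proposes to learn instead.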

This paper proposes to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$: 

$$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$

where, in their implementation, $h_t$ is the hidden state of the recurrent network.  To make the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinatewise* LSTM (i.e. a set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$, each keeping its own hidden state).  They train the optimizer network's parameters $\phi$ by "unrolling" $T$ successive steps of optimization, and minimizing:

$$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))]  \approx \frac1T \sum_{t=1}^T f(\theta_t)$$

Here $\theta^*(f, \phi)$ are the final optimizee parameters.  In order to avoid computing second derivatives of $f$ while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi}  \nabla_{\theta_t}f(\theta_t) \approx 0$, i.e. the optimizee's gradients are treated as if they did not depend on the optimizer parameters (corresponding to the dotted lines in the figure, along which gradients are not backpropagated).
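To see where the second derivatives would come from, one can write out the chain rule for the unrolled updates. The expansion below is a sketch in the notation of this summary (with $\nabla_t := \nabla_{\theta_t} f(\theta_t)$ and total derivatives written as $\mathrm d / \mathrm d\phi$); it is not copied from the paper.

$$\frac{\mathrm d \theta_{t+1}}{\mathrm d \phi} = \frac{\mathrm d \theta_t}{\mathrm d \phi} + \frac{\mathrm d g_t}{\mathrm d \phi}, \qquad \frac{\mathrm d g_t}{\mathrm d \phi} = \frac{\partial g_t}{\partial \phi} + \frac{\partial g_t}{\partial h_t} \frac{\mathrm d h_t}{\mathrm d \phi} + \frac{\partial g_t}{\partial \nabla_t} \underbrace{\nabla^2_{\theta_t} f(\theta_t) \, \frac{\mathrm d \theta_t}{\mathrm d \phi}}_{\mathrm d \nabla_t / \mathrm d \phi \; \approx \; 0}$$

The last term contains the Hessian of $f$, so setting it to zero leaves only first-order quantities.  The loss still depends on $\phi$ through the parameters $\theta_t$ themselves:

$$\frac{\mathrm d \mathcal L}{\mathrm d \phi} \approx \frac1T \sum_{t=1}^T \nabla_t^\top \frac{\mathrm d \theta_t}{\mathrm d \phi}$$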

https://i.imgur.com/HMaCeip.png
**The computational graph of the optimization of the optimizer, unrolled across 3 time-steps.  Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$.  The dotted line indicates that we do not backpropagate across this path.**
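The forward pass of this unrolled computation can be sketched in plain Python. Everything named here is an assumption made for illustration: the quadratic optimizee, the constant `CURVATURE`, and in particular `cell`, which is a tiny tanh recurrent unit with a scalar hidden state standing in for the paper's LSTM. The sketch only evaluates the unrolled loss for a given $\phi$ and does not train the optimizer.

```python
import math

# Hypothetical optimizee: f(theta) = 0.5 * sum(a_i * theta_i**2)
CURVATURE = [1.0, 2.0]

def f(theta):
    return 0.5 * sum(a * x * x for a, x in zip(CURVATURE, theta))

def grad_f(theta):
    return [a * x for a, x in zip(CURVATURE, theta)]

def cell(phi, grad, h):
    """Toy stand-in for m_phi on ONE coordinate: (grad, h) -> (g, h_next)."""
    w_g, w_h, u_g, u_h = phi
    h_next = math.tanh(u_g * grad + u_h * h)
    g = w_g * grad + w_h * h_next
    return g, h_next

def unrolled_loss(phi, theta0, T):
    """Run T optimizer steps and return (1/T) * sum_{t=1..T} f(theta_t)."""
    theta = list(theta0)
    hidden = [0.0] * len(theta)  # one hidden state per coordinate
    total = 0.0
    for _ in range(T):
        grads = grad_f(theta)
        # The same parameters phi are applied to every coordinate.
        outputs = [cell(phi, g, h) for g, h in zip(grads, hidden)]
        theta = [x + g for x, (g, _) in zip(theta, outputs)]
        hidden = [h for _, h in outputs]
        total += f(theta)
    return total / T

theta0 = [1.0, -1.0]
phi_idle = (0.0, 0.0, 0.0, 0.0)  # outputs g_t = 0, so theta never moves
phi_sgd = (-0.1, 0.0, 0.0, 0.0)  # reproduces gradient descent with alpha = 0.1
loss_idle = unrolled_loss(phi_idle, theta0, T=20)
loss_sgd = unrolled_loss(phi_sgd, theta0, T=20)
print(loss_idle, loss_sgd)
```

Training the optimizer would mean differentiating `unrolled_loss` with respect to `phi`, typically with an automatic differentiation library while treating the values returned by `grad_f` as constants; that step is omitted here.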

The authors demonstrate that their method usually outperforms traditional optimizers (Adam, RMSProp, SGD, NAG) on synthetic quadratic functions, MNIST, CIFAR-10, and Neural Style Transfer.  They argue that their algorithm constitutes a form of transfer learning, since a pre-trained optimizer can be applied to accelerate training of a newly initialized network.
