[link]
# Very Short The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**. This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**. # Short Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$, (the *optimizee*) which we wish to minimize with respect to $\theta$. Note that this is the typical situation we encounter when training a neural network with Stochastic Gradient Descent  where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here). The "vanilla" gradient descent update is: $\theta_{t+1} = \theta_t  \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate. Other optimizers (Adam, RMSProp, etc) replace the multiplication of the gradient by $\alpha_t$ with some sort of weighted sum of the history of gradients. This paper proposes to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$: $$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$ Where in their implementation, $h_t \in \mathbb R^{\text{dim}(\theta)}$ is the hidden state of the recurrent network. To make the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinatewise* LSTM (i.e. A set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$). They train the optimizer networks's parameters $\phi$ by "unrolling" T subsequent steps of optimization, and minimizing: $$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))] \approx \frac1T \sum_{t=1}^T f(\theta_t)$$ Where $\theta^*(f, \phi)$ are the final optimizee parameters. In order to avoid computing second derivatives while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi} \nabla_{\theta_t}f(\theta_t) \approx 0$ (corresponding to the dotted lines in the figure, along which gradients are not backpropagated). https://i.imgur.com/HMaCeip.png **The computational graph of the optimization of the optimizer, unrolled across 3 timesteps. Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$. The dotted line indicates that we do not backpropagate across this path.** The authors demonstrate that their method usually outperforms traditional optimizers (ADAM, RMSProp, SGD, NAG), on a synthetic dataset, MNIST, CIFAR10, and Neural Style Transfer. They argue that their algorithm constitutes a form of transfer learning, since a pretrained optimizer can be applied to accelerate training of a newly initialized network.
Your comment:
