This paper presents a method for "learning the learning rate" of a stochastic gradient descent method, in the context of online learning. Indeed, variations on the chosen learning rate or learning rate schedule can have a large impact in observed performance of stochastic gradient descent. Moreover, in the context of online learning, where we are interested in achieving high performance not only at convergence but every step of the way, the "choosing the learning rate" problem is even more crucial. The authors present a method which attempts to train the learning rate itself by gradient descent. This is achieved by "unrolling" the parameter updates of our model across the time steps of online learning, which exposes the interaction between the learning rate and the sum of losses of the model across these time steps. The authors then propose a way to approximate the gradient of the sum of losses with respect to the learning rate, so that it can be used to perform gradient updates on the learning rate itself. The gradient on the learning rate has to be approximated, for essentially the same reason that gradients to train a recurrent neural network online must be approximated (see also my notes on another good paper by Yann Ollivier here: \cite{journals/corr/OllivierC15}). Another approximation is introduced to avoid having to compute an Hessian matrix. Nevertheless, results suggest that the proposed approximation works well and can improve over a fixed learning with a reasonable rate decay schedule #### My two cents I think the authors are right on the money as to the challenges posed by online learning. I think these challenges are likely to be greater in the context of training neural networks online, for which little satisfactory solutions exist right now. So this is a direction of research I'm particularly excited about. At this points, the experiments consider fairly simple learning scenarios, but I don't see any obstacle in applying the same method to neural networks. One interesting observation from the results is that results are fairly robust to variations of "the learning rate of the learning rate", compared to varying and fixing the learning rate itself. Finally, I haven't had time to entirely digest one of their theoretical result, suggesting that their approximation actually corresponds to an exact gradient taken "alongside the effective trajectory" of gradient descent. However, that result seems quite interesting and would deserve more attention.
Your comment:
