A few years ago, a paper came out demonstrating that adaptive gradient methods (which dynamically scale gradient updates on a per-parameter basis according to the magnitudes of past updates) tend to generalize less well than non-adaptive methods, even though adaptive methods sometimes look more performant in training and are easier to tune. The 2017 paper offered a theoretical explanation for this fact based on Adam learning less complex solutions than SGD; this paper offers a different one, namely that Adam performs poorly because it is typically implemented alongside L2 regularization, which has importantly different mechanical consequences under Adam than it does under SGD.

Specifically, in SGD, L2 regularization (where the objective includes both the actual loss and an L2 norm of the weights) can be made equivalent to weight decay by choosing the right parameters for each (see proof below).

https://i.imgur.com/79jfZg9.png

However, for Adam, this equivalence doesn't hold. In SGD, all the scaling factors are just constants, so each combination of learning rate and regularization parameter implies a particular weight decay parameter. Since Adam scales its parameter updates not by a constant learning rate but by a matrix, it's not possible to pick the other hyperparameters in a way that gets you something similar to constant-parameter weight decay. This matters because the L2 norm is added to the *loss function*, so it makes up part of the gradient, and thus gets scaled by Adam through the same adaptive mechanism that scales down historically large gradients. To solve this, the authors suggest using an explicit weight decay term, rather than just doing implicit weight decay via L2 regularization.
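The SGD equivalence described above can be checked numerically. The following is a minimal sketch, not the authors' code: the quadratic loss, the `CURVATURE` values, and the function names are all made up for illustration. It assumes plain SGD without momentum, and it picks the L2 coefficient as `decay / lr`, which is the choice that makes the two update rules algebraically identical.

```python
import math

# Hypothetical toy loss f(w) = 0.5 * sum(c_i * w_i**2); not from the paper.
CURVATURE = [1.0, 10.0]

def grad(w):
    """Gradient of the toy loss above."""
    return [c * wi for c, wi in zip(CURVATURE, w)]

def sgd_l2(w, lr, l2, steps):
    """SGD on f(w) + (l2 / 2) * ||w||^2: the penalty enters via the gradient."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * (gi + l2 * wi) for wi, gi in zip(w, g)]
    return w

def sgd_weight_decay(w, lr, decay, steps):
    """SGD on f(w) alone, with the weights shrunk by (1 - decay) each step."""
    for _ in range(steps):
        g = grad(w)
        w = [(1.0 - decay) * wi - lr * gi for wi, gi in zip(w, g)]
    return w

w0 = [1.0, -2.0]
lr, decay = 0.05, 0.01

# With l2 = decay / lr, lr * l2 * w equals decay * w, so both rules coincide.
w_l2 = sgd_l2(w0, lr, decay / lr, steps=20)
w_wd = sgd_weight_decay(w0, lr, decay, steps=20)
sgd_match = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(w_l2, w_wd))
```

Note that the matching L2 coefficient depends on `lr`, so changing the learning rate changes the implied weight decay unless the L2 coefficient is retuned as well.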
When weight decay is moved out of the loss function, so that it is no longer a norm term passing through the adaptive scaling calculation but something applied directly to the final update, the authors find that 1) Adam is able to get comparable performance on image and sequence tasks (where it has previously had difficulty), and 2) even for SGD, where it was already possible to find a parameter setting that reproduces weight decay, having an explicit and decoupled weight decay parameter made hyperparameters whose optimal values were previously dependent on one another (regularization and learning rate) more independent.
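To make the coupled versus decoupled distinction concrete, here is a rough single-weight sketch of one Adam step. It is an illustration under stated assumptions rather than the paper's exact algorithm: `adam_step`, its `l2` and `decay` arguments, and the choice to subtract `decay * w` directly (without multiplying by the learning rate or a schedule) are all hypothetical simplifications. The loss gradient is set to zero so that only the regularizer moves the weight.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decay=0.0):
    """One Adam step on a scalar weight (illustrative sketch).

    l2    -- coupled: l2 * w is added to the gradient, so it is rescaled
             by the adaptive denominator like any other gradient term.
    decay -- decoupled: decay * w is subtracted outside the adaptive scaling.
    """
    g = g + l2 * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps) - decay * w
    return w_new, m, v

small, large = 1.0, 100.0

# Coupled L2: the penalty gradient is normalized by its own magnitude.
l2_small, _, _ = adam_step(small, 0.0, 0.0, 0.0, t=1, l2=0.01)
l2_large, _, _ = adam_step(large, 0.0, 0.0, 0.0, t=1, l2=0.01)

# Decoupled decay: each weight shrinks in proportion to its own size.
wd_small, _, _ = adam_step(small, 0.0, 0.0, 0.0, t=1, decay=0.01)
wd_large, _, _ = adam_step(large, 0.0, 0.0, 0.0, t=1, decay=0.01)
```

On this first step the coupled version moves both weights by roughly `lr` in absolute terms, regardless of their size, so the large weight is barely regularized in relative terms. The decoupled version multiplies both weights by the same factor `1 - decay`.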
