Deep rectified neural networks are overparameterized in the sense that a rescaling of the weights in one layer can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added that preserves the norm of the "product weight" formed along each path in the network (from input node to output node). Path-SGD is thus principled, and is shown to yield faster convergence for a standard 2-layer rectifier network across a variety of datasets (MNIST, CIFAR-10, CIFAR-100, SVHN). As the method implicitly regularizes the network weights, this also translates to better generalization performance on half of the datasets.

At its core, Path-SGD belongs to the family of learning algorithms that aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG) \cite{amari_natural_1998}, whose importance has resurfaced in the area of deep learning. Path-SGD can thus be cast as an approximation to NG which focuses on a particular type of rescaling between neighboring layers. In my opinion, the paper would greatly benefit from such a discussion. I also believe NG to be a much more direct way to motivate Path-SGD than the heuristics of max-norm regularization.
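To make the rescaling invariance concrete, here is a minimal sketch (my own illustration, not code from the paper): for a two-layer ReLU network f(x) = W2 relu(W1 x), multiplying W1 by any c > 0 and dividing W2 by c leaves the function unchanged, because relu is positively homogeneous. The per-path "product weight" (the product of the weights along each input-to-output path) is likewise unchanged, which is exactly the quantity Path-SGD's update respects.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))   # first-layer weights
W2 = rng.normal(size=(2, 5))   # second-layer weights
x = rng.normal(size=3)

relu = lambda z: np.maximum(z, 0.0)

f = W2 @ relu(W1 @ x)

# Rescale: scale layer 1 up by c, layer 2 down by c.
# Since relu(c * z) = c * relu(z) for c > 0, the output is identical.
c = 7.3
f_rescaled = (W2 / c) @ relu((c * W1) @ x)

assert np.allclose(f, f_rescaled)

# The product weight along each path i -> j -> k is W2[k, j] * W1[j, i];
# the factors c and 1/c cancel, so path products are invariant too.
paths = np.einsum('kj,ji->kji', W2, W1)
paths_rescaled = np.einsum('kj,ji->kji', W2 / c, c * W1)
assert np.allclose(paths, paths_rescaled)
```

Plain SGD is not invariant to this transformation (the gradients with respect to W1 and W2 scale by c and 1/c respectively), which is the asymmetry Path-SGD is designed to remove.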