Path-SGD: Path-Normalized Optimization in Deep Neural NetworksPath-SGD: Path-Normalized Optimization in Deep Neural NetworksNeyshabur, Behnam and Salakhutdinov, Ruslan and Srebro, Nathan2015
Paper summarynipsreviewsDeep rectified neural networks are over-parameterized in the sense that scaling of the weights in one layer, can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule, whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). Path-SGD is thus principled and shown to yield faster convergence for a standard 2 layer rectifier network, across a variety of dataset (MNIST, CIFAR-10, CIFAR-100, SVHN). As the method implicitly regularizes the neural weights, this also translates to better generalization performance on half of the datasets.
At its core, Path-SGD belongs to the family of learning algorithms which aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG) \cite{amari_natural_1998}, whose importance has resurfaced in the area of deep learning. Path-SGD can thus be cast an approximation to NG, which focuses on a particular type of rescaling between neighboring layers. The paper would greatly benefit from such a discussion in my opinion. I also believe NG to be a much more direct way to motivate Path-SGD, than the heuristics of max-norm regularization.
Deep rectified neural networks are over-parameterized in the sense that scaling of the weights in one layer, can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule, whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). Path-SGD is thus principled and shown to yield faster convergence for a standard 2 layer rectifier network, across a variety of dataset (MNIST, CIFAR-10, CIFAR-100, SVHN). As the method implicitly regularizes the neural weights, this also translates to better generalization performance on half of the datasets.
At its core, Path-SGD belongs to the family of learning algorithms which aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG) \cite{amari_natural_1998}, whose importance has resurfaced in the area of deep learning. Path-SGD can thus be cast an approximation to NG, which focuses on a particular type of rescaling between neighboring layers. The paper would greatly benefit from such a discussion in my opinion. I also believe NG to be a much more direct way to motivate Path-SGD, than the heuristics of max-norm regularization.