Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse TargetsEfficient Exact Gradient Update for training Deep Networks with Very Large Sparse TargetsVincent, Pascal and de Brébisson, Alexandre and Bouthillier, Xavier2015
Paper summaryhlarochelleThis paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very high-dimensional target with a (dense) output prediction. Most of the paper exposes the specific case of the squared error loss, though it can also be applied to some other losses such as the so-called spherical softmax. One use case could be for training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive (i.e. what most people currently do) implementation would scale in $O(Dd)$ where $D$ is the input dimensionality and d the hidden layer dimensionality, they show that their trick allows to scale in $O(d^2)$.
Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU.
#### My two cents
This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on D in the complexity would actually be achievable, even for the "simpler" case of the squared error.
The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex-999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time.
But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!
This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very high-dimensional target with a (dense) output prediction. Most of the paper exposes the specific case of the squared error loss, though it can also be applied to some other losses such as the so-called spherical softmax. One use case could be for training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive (i.e. what most people currently do) implementation would scale in $O(Dd)$ where $D$ is the input dimensionality and d the hidden layer dimensionality, they show that their trick allows to scale in $O(d^2)$.
Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU.
#### My two cents
This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on D in the complexity would actually be achievable, even for the "simpler" case of the squared error.
The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex-999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time.
But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!