[link]
This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very highdimensional target with a (dense) output prediction. Most of the paper exposes the specific case of the squared error loss, though it can also be applied to some other losses such as the socalled spherical softmax. One use case could be for training autoencoders with the squared error on very highdimensional but sparse inputs. While a naive (i.e. what most people currently do) implementation would scale in $O(Dd)$ where $D$ is the input dimensionality and d the hidden layer dimensionality, they show that their trick allows to scale in $O(d^2)$. Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU. #### My two cents This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on D in the complexity would actually be achievable, even for the "simpler" case of the squared error. The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative logsoftmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time. But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the socalled (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the preactivation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!
Your comment:
