First published: 2017/05/24 (6 years ago) Abstract: Compression and computational efficiency in deep learning have become a
problem of great significance. In this work, we argue that the most principled
and effective way to attack this problem is by adopting a Bayesian point of
view, where through sparsity inducing priors we prune large parts of the
network. We introduce two novelties in this paper: 1) we use hierarchical
priors to prune nodes instead of individual weights, and 2) we use the
posterior uncertainties to determine the optimal fixed point precision to
encode the weights. Both factors significantly contribute to achieving the
state of the art in terms of compression rates, while still staying competitive
with methods designed to optimize for speed or energy efficiency.
This paper described an algorithm of parametrically adding noise and applying a variational regulariser similar to that in ["Variational Dropout Sparsifies Deep Neural Networks"][vardrop]. Both have the same goal: make neural networks more efficient by removing parameters (and therefore the computation applied with those parameters). Although, this paper also has the goal of giving a prescription of how many bits to store each parameter with as well.
There is a very nice derivation of the hierarchical variational approximation being used here, which I won't try to replicate here. In practice, the difference to prior work is that the stochastic gradient variational method uses hierarchical samples; ie it samples from a prior, then incorporates these samples when sampling over the weights (both applied through local reparameterization tricks). It's a powerful method, which allows them to test two different priors (although they are clearly not limited to just these), and compare both against competing methods. They are comparable, and the choice of prior offers some tradeoffs in terms of sparsity versus quantization.
[vardrop]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1701.05369