Summary by jwturner 8 months ago
The central argument of the paper is that pruning deep neural networks by removing the smallest weights is not always wise. They provide two examples to show that regularisation in this form is unsatisfactory.
## **Pruning via batchnorm**
As an alternative to the traditional approach of removing small weights, the authors propose pruning filters using regularisation on the gamma term used to scale the result of batch normalization.
Consider a convolutional layer with batchnorm applied:
```
out = max{ gamma * BN( convolve(W,x) + beta, 0 }
```
By imposing regularisation on the gamma term the resulting image becomes constant almost everywhere (except for padding) because of the additive beta. The authors train the network using regularisation on the gamma term and after convergence remove any constant filters before fine-tuning the model with further training.
The general algorithm is as follows:
- **Compute the sparse penalty for each layer.** This essentially corresponds to determining the memory footprint of each channel of the layer. We refer to the penalty as lambda.
- **Rescale the gammas.** Choose some alpha in {0.001, 0.01, 0.1, 1} and use them to scale the gamma term of each layer - apply `1/alpha` to the successive convolutional layers.
- **Train the network using ISTA regularisation on gamma.** Train the network using SGD but applying the ISTA penalty to each layer using `rho * lambda` , where rho is another hyperparameter and lambda is the sparse penalty calculated in step 1.
- **Remove constant filters.**
- **Scale back.** Multiply gamma by `1 / gamma` and gamma respectively to scale the parameters back up.
- **Finetune.** Retrain the new network format for a small number of epochs.

more
less