SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt
2016
Paper summary
niz
While preserving accuracy:
- The architectural improvements reduce the model size 50x (240MB to 4.8MB).
- Deep Compression shrinks the model a further 10x (4.8MB to 0.47MB).
Adding simple bypass (shortcut) connections even improves accuracy by about 2%.
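The reduction factors follow directly from the model sizes quoted above. The short check below is only an arithmetic sketch; the variable and function names are illustrative and do not come from the paper.

```python
# Model sizes in MB, as quoted in the summary above.
alexnet_mb = 240.0
squeezenet_mb = 4.8
squeezenet_compressed_mb = 0.47


def reduction(before_mb, after_mb):
    """Return how many times smaller the model became."""
    return before_mb / after_mb


# Reduction from the architecture alone, from Deep Compression alone,
# and from both combined.
arch = reduction(alexnet_mb, squeezenet_mb)
comp = reduction(squeezenet_mb, squeezenet_compressed_mb)
total = reduction(alexnet_mb, squeezenet_compressed_mb)

print(f"architecture only: {arch:.1f}x")
print(f"Deep Compression:  {comp:.1f}x")
print(f"combined:          {total:.1f}x")
```

The combined factor is simply the product of the two individual factors, which is why the two techniques can be described as complementary.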
They present insightful architectural design strategies:
1. Use fewer 3x3 filters to reduce model size.
2. Decrease the number of input channels, also to reduce model size.
3. Downsample late so that layers have larger activation maps, which leads to higher accuracy.
They also offer useful insights into CNN design space exploration by parametrizing the microarchitecture:
- Squeeze ratio, to find a good balance between weight size and accuracy.
- Percentage of 3x3 filters, to find how many of them are enough.
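To see how these two knobs trade off weight count, here is a minimal sketch. It assumes the paper's Fire module layout: a squeeze layer of 1x1 filters feeding an expand layer that mixes 1x1 and 3x3 filters, with the squeeze ratio defined as squeeze filters divided by expand filters. The function name, its arguments, and the example channel counts are illustrative assumptions, and biases are ignored.

```python
def fire_params(in_channels, expand_filters, squeeze_ratio, pct_3x3):
    """Weight count of one Fire-style module (biases ignored).

    squeeze_ratio: squeeze filters / expand filters (assumed definition).
    pct_3x3: fraction of expand filters that are 3x3; the rest are 1x1.
    """
    squeeze = round(squeeze_ratio * expand_filters)
    expand_3x3 = round(pct_3x3 * expand_filters)
    expand_1x1 = expand_filters - expand_3x3

    # Squeeze layer: 1x1 filters over all input channels.
    squeeze_weights = in_channels * squeeze
    # Expand layer: its input channels are the squeeze outputs.
    expand_weights = squeeze * expand_1x1 + squeeze * expand_3x3 * 3 * 3
    return squeeze_weights + expand_weights


# Hypothetical module: 128 input channels, 128 expand filters.
baseline = fire_params(128, 128, squeeze_ratio=0.125, pct_3x3=0.5)
wider_squeeze = fire_params(128, 128, squeeze_ratio=0.25, pct_3x3=0.5)
all_3x3 = fire_params(128, 128, squeeze_ratio=0.125, pct_3x3=1.0)
plain_conv = 128 * 128 * 3 * 3  # ordinary 3x3 conv, same in/out channels

print(baseline, wider_squeeze, all_3x3, plain_conv)
```

Raising either the squeeze ratio or the 3x3 percentage increases the weight count, so sweeping them shows where extra weights stop paying off in accuracy.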
$\bf Summary:$
The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter).
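As a quick illustration of this formula, the sketch below counts weights for a few layer shapes. The helper name and the channel counts are made up for the example, and biases are ignored.

```python
def conv_params(in_channels, num_filters, kernel_size):
    """Weights in a conv layer: input channels x filters x (k x k)."""
    return in_channels * num_filters * kernel_size * kernel_size


# Hypothetical layer with 64 input channels and 128 filters.
params_3x3 = conv_params(64, 128, 3)
params_1x1 = conv_params(64, 128, 1)
# Same 3x3 layer, but fed with half as many input channels.
params_3x3_half_in = conv_params(32, 128, 3)

print(params_3x3, params_1x1, params_3x3_half_in)
```

Swapping a 3x3 filter for a 1x1 filter cuts the weights by a factor of 9, and halving the input channels halves them, which is what motivates the two strategies in the paper.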
The paper proposes two strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. The authors assume the filter budget is given, i.e., they do not tinker with the number of filters. Decreasing the number of parameters tends to reduce accuracy. To compensate, the authors propose downsampling late in the network.
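The effect of downsampling late can be made concrete with a toy calculation. The sketch below assumes "same"-style padding, so each layer's output side length is the input side length divided by the stride, rounded up. The six-layer stride schedules are hypothetical and are not SqueezeNet's actual configuration.

```python
import math


def activation_sizes(input_size, strides):
    """Spatial side length of the activation map after each layer."""
    sizes = []
    size = input_size
    for stride in strides:
        size = math.ceil(size / stride)
        sizes.append(size)
    return sizes


# Same three stride-2 layers, placed early versus late.
early = activation_sizes(224, [2, 2, 2, 1, 1, 1])
late = activation_sizes(224, [1, 1, 1, 2, 2, 2])

print("downsample early:", early)
print("downsample late: ", late)
```

Both schedules end at the same resolution, but with late downsampling most layers operate on larger activation maps, which is the property the authors associate with higher accuracy.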
The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction in model size while preserving accuracy. The model can be compressed further with existing methods such as Deep Compression, which are orthogonal to this paper's approach, giving a total reduction of around 510x while still preserving AlexNet-level accuracy.
$\bf Question$: The impact on running time (especially in the feed-forward phase, which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well, or at least to be *no worse* than for the baseline models?