[link]
$\bf Summary:$ The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter). The paper proposes 2 strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the budget of the filter is given, i,e., they do not tinker with the number of filters. Decrease in number of parameters will lead to less accuracy. To compensate, the authors propose to downsample late in the network. The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction is model size while preserving the accuracy. Their model can be further compressed with existing methods like Deep Compression which are orthogonal to this paper's approach and this can give in total of around 510x reduction while still preserving accuracy of AlexNet. $\bf Question$: The impact on running times (specially on feed forward phase which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well or at least be *no worse* than the baseline models?
Your comment:

[link]
* The authors train a variant of AlexNet that has significantly fewer parameters than the original network, while keeping the network's accuracy stable. * Advantages of this: * More efficient distributed training, because less parameters have to be transferred. * More efficient transfer via the internet, because the model's file size is smaller. * Possibly less memory demand in production, because fewer parameters have to be kept in memory. ### How * They define a Fire Module. A Fire Module contains of: * Squeeze Module: A 1x1 convolution that reduces the number of channels (e.g. from 128x32x32 to 64x32x32). * Expand Module: A 1x1 convolution and a 3x3 convolution, both applied to the output of the Squeeze Module. Their results are concatenated. * Using many 1x1 convolutions is advantageous, because they need less parameters than 3x3s. * They use ReLUs, only convolutions (no fully connected layers) and Dropout (50%, before the last convolution). * They use late maxpooling. They argue that applying pooling late  rather than early  improves accuracy while not needing more parameters. * They try residual connections: * One network without any residual connections (performed the worst). * One network with residual connections based on identity functions, but only between layers of same dimensionality (performed the best). * One network with residual connections based on identity functions and other residual connections with 1x1 convs (where dimensionality changed) (performance between the other two). * They use pruning from Deep Compression to reduce the parameters further. Pruning simply collects the 50% of all parameters of a layer that have the lowest values and sets them to zero. That creates a sparse matrix. ### Results * 50x parameter reduction of AlexNet (1.2M parameters before pruning, 0.4M after pruning). * 510x file size reduction of AlexNet (from 250mb to 0.47mb) when combined with Deep Compression. * Top1 accuracy remained stable. * Pruning apparently can be used safely, even after the network parameters have already been reduced significantly. * While pruning was generally safe, they found that two of their later layers reacted quite sensitive to it. Adding parameters to these (instead of removing them) actually significantly improved accuracy. * Generally they found 1x1 convs to react more sensitive to pruning than 3x3s. Therefore they focused pruning on 3x3 convs. * First pruning a network, then readding the pruned weights (initialized with 0s) and then retraining for some time significantly improved accuracy. * The network was rather resilient to significant channel reduction in the Squeeze Modules. Reducing to 2550% of the original channels (e.g. 128x32x32 to 64x32x32) seemed to be a good choice. * The network was rather resilient to removing 3x3 convs and replacing them with 1x1 convs. A ratio of 2:1 to 1:1 (1x1 to 3x3) seemed to produce good results while mostly keeping the accuracy. * Adding some residual connections between the Fire Modules improved the accuracy. * Adding residual connections with identity functions *and also* residual connections with 1x1 convs (where dimensionality changed) improved the accuracy, but not as much as using *only* residual connections with identity functions (i.e. it's better to keep some modules without identity functions).  ### Rough chapterwise notes * (1) Introduction and Motivation * Advantages from having less parameters: * More efficient distributed training, because less data (parameters) have to be transfered. * Less data to transfer to clients, e.g. when a model used by some app is updated. * FPGAs often have hardly any memory, i.e. a model has to be small to be executed. * Target here: Find a CNN architecture with less parameters than an existing one but comparable accuracy. * (2) Related Work * (2.1) Model Compression * SVDmethod: Just apply SVD to the parameters of an existing model. * Network Pruning: Replace parameters below threshold with zeros (> sparse matrix), then retrain a bit. * Add quantization and huffman encoding to network pruning = Deep Compression. * (2.2) CNN Microarchitecture * The term "CNN Microarchitecture" refers to the "organization and dimensions of the individual modules" (so an Inception module would have a complex CNN microarchitecture). * (2.3) CNN Macroarchitecture * CNN Macroarchitecture = "big picture" / organization of many modules in a network / general characteristics of the network, like depth * Adding connections between modules can help (e.g. residual networks) * (2.4) Neural Network Design Space Exploration * Approaches for Design Space Exporation (DSE): * Bayesian Optimization, Simulated Annealing, Randomized Search, Genetic Algorithms * (3) SqueezeNet: preserving accuracy with few parameters * (3.1) Architectural Design Strategies * A conv layer with N filters applied to CxHxW input (e.g. 3x128x128 for a possible first layer) with kernel size kHxkW (e.g. 3x3) has `N*C*kH*kW` parameters. * So one way to reduce the parameters is to decrease kH and kW, e.g. from 3x3 to 1x1 (reduces parameters by a factor of 9). * A second way is to reduce the number of channels (C), e.g. by using 1x1 convs before the 3x3 ones. * They think that accuracy can be improved by performing downsampling later in the network (if parameter count is kept constant). * (3.2) The Fire Module * The Fire Module has two components: * Squeeze Module: * One layer of 1x1 convs * Expand Module: * Concat the results of: * One layer of 1x1 convs * One layer of 3x3 convs * The Squeeze Module decreases the number of input channels significantly. * The Expand Module then increases the number of input channels again. * (3.3) The SqueezeNet architecture * One standalone conv, then several fire modules, then a standalone conv, then global average pooling, then softmax. * Three late max pooling laters. * Gradual increase of filter numbers. * (3.3.1) Other SqueezeNet details * ReLU activations * Dropout before the last conv layer. * No linear layers. * (4) Evaluation of SqueezeNet * Results of competing methods: * SVD: 5x compression, 56% top1 accuracy * Pruning: 9x compression, 57.2% top1 accuracy * Deep Compression: 35x compression, ~57% top1 accuracy * SqueezeNet: 50x compression, ~57% top1 accuracy * SqueezeNet combines low parameter counts with Deep Compression. * The accuracy does not go down because of that, i.e. apparently Deep Compression can even be applied to small models without giving up on performance. * (5) CNN Microarchitecture Design Space Exploration * (5.1) CNN Microarchitecture metaparameters * blabla we test various values for this and that parameter * (5.2) Squeeze Ratio * In a Fire Module there is first a Squeeze Module and then an Expand Module. The Squeeze Module decreases the number of input channels to which 1x1 and 3x3 both are applied (at the same time). * They analyzed how far you can go down with the Sqeeze Module by training multiple networks and calculating the top5 accuracy for each of them. * The accuracy by Squeeze Ratio (percentage of input channels kept in 1x1 squeeze, i.e. 50% = reduced by half, e.g. from 128 to 64): * 12%: ~80% top5 accuracy * 25%: ~82% top5 accuracy * 50%: ~85% top5 accuracy * 75%: ~86% top5 accuracy * 100%: ~86% top5 accuracy * (5.3) Trading off 1x1 and 3x3 filters * Similar to the Squeeze Ratio, they analyze the optimal ratio of 1x1 filters to 3x3 filters. * E.g. 50% would mean that half of all filters in each Fire Module are 1x1 filters. * Results: * 01%: ~76% top5 accuracy * 12%: ~80% top5 accuracy * 25%: ~82% top5 accuracy * 50%: ~85% top5 accuracy * 75%: ~85% top5 accuracy * 99%: ~85% top5 accuracy * (6) CNN Macroarchitecture Design Space Exploration * They compare the following networks: * (1) Without residual connections * (2) With residual connections between modules of same dimensionality * (3) With residual connections between all modules (except pooling layers) using 1x1 convs (instead of identity functions) where needed * Adding residual connections (2) improved top1 accuracy from 57.5% to 60.4% without any new parameters. * Adding complex residual connections (3) worsed top1 accuracy again to 58.8%, while adding new parameters. * (7) Model Compression Design Space Exploration * (7.1) Sensitivity Analysis: Where to Prune or Add parameters * They went through all layers (including each one in the Fire Modules). * In each layer they set the 50% smallest weights to zero (pruning) and measured the effect on the top5 accuracy. * It turns out that doing that has basically no influence on the top5 accuracy in most layers. * Two layers towards the end however had significant influence (accuracy went down by 510%). * Adding parameters to these layers improved top1 accuracy from 57.5% to 59.5%. * Generally they found 1x1 layers to be more sensitive than 3x3 layers so they pruned them less aggressively. * (7.2) Improving Accuracy by Densifying Sparse Models * They found that first pruning a model and then retraining it again (initializing the pruned weights to 0) leads to higher accuracy. * They could improve top1 accuracy by 4.3% in this way. 
[link]
This paper is about the reduction of model parameters while maintaining (most) of the models accuracy. The paper gives a nice overview over some key findings about CNNs. One part that is especially interesting is "2.4. Neural Network Design Space Exploration". ## Model compression Key ideas for model compression are: * singular value decomposition (SVD) * replace parameters that are below a certain threshold with zeros to form a sparse matrix * combining Network Pruning with quantization (to 8 bits or less) * huffman encoding (Deep Compression) Ideas used by this paper are * Replacing 3x3 filters by 1x1 filters * Decrease the number of input channels by using **squeeze layers** One key idea to maintain high accuracy is to downsample late in the network. This means close to the input layer, the layer parameters have stride = 1, later they have stride > 1. ## Fire module A Fire module is a squeeze convolution layer (which has only $n_1$ 1x1 filters), feeding into an expand layer that has a mix of $n_2$ 1x1 and $n_3$ 3x3 convolution filters. It is chosen $$n_1 < n_2 + n_3$$ (Why?) (to be continued) 
[link]
While preserving accuracy,  Network architecture improvement decreases parameters 51X (240MB to 4.8MB).  By using Deep Compression, parameters shrinks more 10X more (4.8MB to 0.47MB). Even improves more accuracy for about 2% by using Simple Bypass (shortcut connection). They show insightful architectural design strategies; 1. Less 3x3 filters to decrease size, 2. Decrease input channels also to decrease size, 3. Downsample late to have larger activation maps to lead to higher accuracy. And great insights about CNN design space exploration by parametrize microarchitecture,  Squeeze Ratio to find good balance between weight size and accuracy.  3x3 filter percentage to find enough number of it. 