evgeniizh's profile - ShortScience.org


Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation
Mark Sandler and Andrew Howard and Menglong Zhu and Andrey Zhmoginov and Liang-Chieh Chen
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV

Summary by evgeniizh 6 years ago

- **Linear Bottlenecks**. The authors show that, even though activations can in theory operate in a linear regime, removing the activation from the bottleneck layers of a residual network improves performance.
- **Inverted residuals**. Shortcuts connecting the bottlenecks perform better than shortcuts connecting the expanded layers.
- **SSDLite**. The authors propose replacing the regular convolutions in SSD with depthwise separable convolutions, significantly reducing both the number of parameters and the number of computations, with only a minor impact on precision.
- **MobileNetV2**. A new architecture, essentially a ResNet with the changes mentioned above, outperforms or performs comparably to MobileNetV1, ShuffleNet and NASNet for the same number of MACs. Object detection with SSDLite can run on an ARM core in 200 ms. The potential for semantic segmentation on mobile devices is also shown: a network achieves 75.32% mIOU on PASCAL while requiring only 2.75B MACs.
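
To make the SSDLite saving concrete, here is a minimal sketch that counts parameters and multiply-accumulates (MACs) for a regular convolution and for a depthwise convolution followed by a 1x1 pointwise convolution. The layer shape (19x19 feature map, 512 input and output channels, 3x3 kernel) is a hypothetical example, not a figure from the paper, and the function names are illustrative. Biases, stride and padding effects are ignored.

```python
def conv_cost(h, w, c_in, c_out, k):
    """Parameters and MACs of a regular k x k convolution (stride 1, no bias)."""
    params = k * k * c_in * c_out
    macs = h * w * params  # every output position applies the full kernel
    return params, macs


def separable_conv_cost(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    params = depthwise + pointwise
    macs = h * w * params
    return params, macs


# Hypothetical layer shape, chosen only for illustration.
regular = conv_cost(19, 19, 512, 512, 3)
separable = separable_conv_cost(19, 19, 512, 512, 3)
ratio = regular[1] / separable[1]

print(f"regular:   {regular[0]:,} params, {regular[1]:,} MACs")
print(f"separable: {separable[0]:,} params, {separable[1]:,} MACs")
print(f"reduction: {ratio:.2f}x")
```

For a 3x3 kernel the reduction approaches 9x as the number of output channels grows, which is why the saving is large in wide prediction layers.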


AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
Aditya Devarakonda and Maxim Naumov and Michael Garland
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.CV, cs.DC, stat.ML, 68T05, I.2.6; I.5.0

Summary by evgeniizh 6 years ago

**TL;DR**: You can increase the batch size in later phases of training without hurting accuracy, while gaining some speedup. When you do, multiply the learning rate by the same factor as the batch size.

**Long version**: The authors propose starting with a small batch size $r$ and progressively increasing it while adapting the learning rate $\alpha$ so that the ratio $\alpha/r$ remains constant (not counting the scheduled LR decay). In the paper, the batch size is doubled on a fixed schedule. At each such step the learning rate is decayed as usual and then multiplied by 2 to compensate for the larger batch: if the baseline multiplies the LR by $0.375$, it is now multiplied by $0.75$.
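
The schedule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function name `adabatch_schedule`, the starting batch size of 128, the base LR of 0.1, and the 20-epoch interval are all hypothetical. Only the decay factor $0.375$ and the doubling of the batch size come from the summary.

```python
def adabatch_schedule(epochs, base_batch, base_lr, interval, lr_decay, batch_factor=2):
    """Return a list of (epoch, batch_size, lr) tuples.

    Every `interval` epochs the batch size is multiplied by `batch_factor`
    and the learning rate by `lr_decay * batch_factor`: the usual decay,
    then a compensation for the larger batch.
    """
    schedule = []
    batch, lr = base_batch, base_lr
    for epoch in range(epochs):
        if epoch > 0 and epoch % interval == 0:
            batch *= batch_factor
            lr *= lr_decay * batch_factor
        schedule.append((epoch, batch, lr))
    return schedule


# Hypothetical settings; only lr_decay=0.375 and the doubling are from the text.
schedule = adabatch_schedule(
    epochs=60, base_batch=128, base_lr=0.1, interval=20, lr_decay=0.375
)

for epoch, batch, lr in schedule[::20]:
    # lr / batch is the per-sample step size, which follows the baseline decay.
    print(f"epoch {epoch:2d}  batch {batch:4d}  lr {lr:.5f}  lr/batch {lr / batch:.3e}")
```

At each step the learning rate itself shrinks by $0.75$, while the ratio of learning rate to batch size shrinks by $0.375$, exactly as it would in the fixed-batch baseline.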

The experiments on the CIFAR-100 dataset show that gradually increasing the batch size converges to the same values as training with a constant small batch size. However, larger batches train faster, giving a $\times 1.5$ speedup on AlexNet and around $\times 1.2$ on ResNet and VGG, for both forward and backward passes on a single GPU.

On multiple GPUs the approach allows the batch size to be increased further. In the most favorable setups the authors obtain up to a $\times 1.6$ speedup compared to a constant batch size equal to the initial value, while the error is almost unchanged. LR warmup is used for the larger batch sizes.

For ImageNet, the same behavior is shown for accuracy: gradually increasing the batch size converges to the same values as training with the initial batch size throughout. Because the authors did not have access to a system capable of processing large batches on ImageNet, no performance results are reported.

evgeniizh

sciscore: 4