First published: 2018/01/13
Abstract: In this paper we describe a new mobile architecture, MobileNetV2, that
improves the state of the art performance of mobile models on multiple tasks
and benchmarks as well as across a spectrum of different model sizes. We also
describe efficient ways of applying these mobile models to object detection in
a novel framework we call SSDLite. Additionally, we demonstrate how to build
mobile semantic segmentation models through a reduced form of DeepLabv3 which
we call Mobile DeepLabv3.
The MobileNetV2 architecture is based on an inverted residual structure where
the input and output of the residual block are thin bottleneck layers opposite
to traditional residual models which use expanded representations in the input.
MobileNetV2 uses lightweight depthwise convolutions to filter features in
the intermediate expansion layer. Additionally, we find that it is important to
remove non-linearities in the narrow layers in order to maintain
representational power. We demonstrate that this improves performance and
provide an intuition that led to this design. Finally, our approach allows
decoupling of the input/output domains from the expressiveness of the
transformation, which provides a convenient framework for further analysis. We
measure our performance on Imagenet classification, COCO object detection, VOC
image segmentation. We evaluate the trade-offs between accuracy, and number of
operations measured by multiply-adds (MAdd), as well as the number of
- **Linear Bottlenecks**. The authors show that, even though activations could in theory operate in a linear regime, removing the activation from the bottleneck layers of a residual network improves performance.
- **Inverted residuals**. Shortcuts connecting the bottleneck layers perform better than shortcuts connecting the expanded layers.
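- **SSDLite**. The authors propose replacing the regular convolutions in the SSD prediction layers with depthwise separable convolutions (a depthwise convolution followed by a 1×1 projection), significantly reducing both the number of parameters and the number of computations, with minor impact on precision.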
- **MobileNetV2**. A new architecture, essentially a ResNet with the changes mentioned above, that outperforms or matches MobileNetV1, ShuffleNet and NASNet for the same number of MACs. Object detection with SSDLite can be run on an ARM core in 200 ms. The potential of semantic segmentation on mobile devices is also shown: a network achieving 75.32% mIOU on PASCAL while requiring only 2.75B MACs.
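
To make the cost argument behind the points above concrete, here is a rough sketch that counts multiply-accumulates (MACs) for a regular convolution, a depthwise separable convolution, and an inverted residual block (1×1 expansion, depthwise convolution, linear 1×1 projection). The function names, the stride-1 assumption, and the layer sizes in the example are illustrative choices of mine, not taken from the paper's code; biases, activations and the shortcut addition are ignored.

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    """MACs of a regular k x k convolution on an h x w feature map (stride 1)."""
    return h * w * c_in * c_out * k * k


def separable_conv_macs(h, w, c_in, c_out, k=3):
    """MACs of a depthwise k x k convolution followed by a 1x1 projection."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1x1 convolution mixing the channels
    return depthwise + pointwise


def inverted_residual_macs(h, w, c_in, c_out, t=6, k=3):
    """MACs of an inverted residual block with expansion factor t (stride 1).

    Thin input -> 1x1 expansion to t * c_in channels -> depthwise k x k
    -> linear 1x1 projection back to a thin output.
    """
    hidden = t * c_in
    expand = h * w * c_in * hidden        # 1x1 expansion
    depthwise = h * w * hidden * k * k    # depthwise filtering in the wide layer
    project = h * w * hidden * c_out      # linear bottleneck (no activation)
    return expand + depthwise + project


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    regular = standard_conv_macs(10, 10, 32, 64)
    separable = separable_conv_macs(10, 10, 32, 64)
    print("regular 3x3 conv:   ", regular)
    print("separable 3x3 conv: ", separable)
    print("reduction factor:   ", round(regular / separable, 2))
    print("inverted residual:  ", inverted_residual_macs(10, 10, 24, 24))
```

The depthwise term does not depend on the number of output channels, which is why swapping regular convolutions for separable ones cuts the cost by close to a factor of $k^2$ when the layer has many output channels.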
First published: 2017/12/06
Abstract: Training deep neural networks with Stochastic Gradient Descent, or its
variants, requires careful choice of both learning rate and batch size. While
smaller batch sizes generally converge in fewer training epochs, larger batch
sizes offer more parallelism and hence better computational efficiency. We have
developed a new training approach that, rather than statically choosing a
single batch size for all epochs, adaptively increases the batch size during
the training process. Our method delivers the convergence rate of small batch
sizes while achieving performance similar to large batch sizes. We analyse our
approach using the standard AlexNet, ResNet, and VGG networks operating on the
popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate
that learning with adaptive batch sizes can improve performance by factors of
up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1%
relative to training with fixed batch sizes.
**TL;DR**: You can increase the batch size in later phases of training without hurting accuracy, and gain some speedup in the process. Whenever you multiply the batch size by some factor, multiply the learning rate by the same factor.
**Long version**: The authors propose starting with a small batch size $r$ and progressively increasing it, while adapting the learning rate $\alpha$ so that the ratio $\alpha/r$ remains constant (leaving aside the scheduled LR decay). In this paper, the batch size is doubled on a schedule. At each of these steps the learning rate is decayed as usual and then multiplied by 2 to compensate for the batch size increase: if the baseline multiplies the LR by $0.375$, the adaptive schedule multiplies it by $0.75$.
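
The schedule described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name, the starting values (`base_lr=0.01`, `base_batch=128`), the 20-epoch interval and the optional `max_batch` cap are all assumptions of mine; only the decay factor $0.375$ and the doubling of the batch size come from the summary.

```python
def adaptive_batch_schedule(base_lr, base_batch, lr_decay, interval, epochs,
                            batch_factor=2, max_batch=None):
    """Return a list of (epoch, batch_size, lr) tuples.

    Every `interval` epochs the learning rate is decayed by `lr_decay`.
    If the batch size can still grow, it is multiplied by `batch_factor`
    and the learning rate is multiplied by the same factor to compensate.
    """
    schedule = []
    lr, batch = base_lr, base_batch
    for epoch in range(epochs):
        if epoch > 0 and epoch % interval == 0:
            lr *= lr_decay                      # scheduled decay, as in the baseline
            if max_batch is None or batch * batch_factor <= max_batch:
                batch *= batch_factor           # grow the batch ...
                lr *= batch_factor              # ... and rescale the LR to match
        schedule.append((epoch, batch, lr))
    return schedule


if __name__ == "__main__":
    # Hypothetical settings: effective LR multiplier is 0.375 * 2 = 0.75 per step.
    for epoch, batch, lr in adaptive_batch_schedule(0.01, 128, 0.375, 20, 60):
        if epoch % 20 == 0:
            print(f"epoch {epoch:2d}: batch size {batch:4d}, lr {lr:.6f}")
```

Once the batch size reaches `max_batch` (for example, the largest batch that fits in memory), the sketch falls back to the plain baseline decay.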
The experiments on CIFAR-100 show that gradually increasing the batch size converges to the same values as training with a constant small batch size. Bigger batches, however, train faster, giving a $\times 1.5$ speedup on AlexNet and around $\times 1.2$ on ResNet and VGG, for both forward and backward passes on a single GPU.
On multiple GPUs, the approach allows the batch size to be increased further. In favorable setups the authors get up to a $\times 1.6$ speedup compared to a constant batch size equal to the initial value, while the error is almost unchanged. LR warmup is used for the bigger batch sizes.
For ImageNet, the same behavior is shown for accuracy: gradually increasing the batch size converges to the same values as the setup with the initial batch size. Since the authors did not have access to a system capable of processing large batches on ImageNet, no speed results are reported.