Large-Batch Training for LSTM and Beyond on ShortScience.org

arxiv.org
scholar.google.com

Large-Batch Training for LSTM and Beyond
You, Yang and Hseu, Jonathan and Ying, Chris and Demmel, James and Keutzer, Kurt and Hsieh, Cho-Jui
arXiv e-Print archive - 2019 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 1

[link] Summary by sudharsansai 4 years ago

Often the best learning rate for a DNN is sensitive to batch size and hence need significant tuning while scaling batch sizes to large scale training. Theory suggests that when you scale the batch size by a factor of $k$ (in the case of multi-GPU training), the learning rate should be scaled by $\sqrt{k}$ to keep the variance of the gradient estimator constant (remember the variance of an estimator is inversely proportional to the sample size?). But in practice, often linear learning rate scaling works better (i.e. scale learning rate by $k$), with a gradual warmup scheme. 

This paper proposes a slight modification to the existing learning rate scheduling scheme called LEGW (Linear Epoch Gradual Warmup) which helps us in bridging the gap between theory and practice of large batch training.

The authors notice that in order to make square root scaling work well in practice, one should also scale the warmup period (in terms of epochs) by a factor of $k$. In other words, if you consider learning rate as a function of time period in terms of epochs, scale the periodicity of the function by $k$, while scaling the amplitude of the function by $\sqrt{k}$, when the batch size is scaled by $k$. The authors consider various learning rate scheduling schemes like exponential decay, polynomial decay and multi-step LR decay and find that square root scaling with LEGW scheme often leads to little to no loss in performance while scaling the batch sizes. In fact, one can use SGD with LEGW with no tuning and make it work as good as Adam.

Thus with this approach, one can tune the learning rate for a small batch size and extrapolate it to larger batch sizes while making use of parallel hardwares.

Your comment: