Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
Sergey Ioffe
2017

Paper summary
qureai
[Batch Normalization (Ioffe et al., 2015)](https://arxiv.org/abs/1502.03167) is one of the remarkable ideas of the deep learning era, sitting alongside the likes of Dropout and residual connections. Nonetheless, the last few years have revealed a few shortcomings of the idea, which, two years later, Ioffe tries to address through a concept he calls Batch Renormalization.
Issues with Batch Normalization
- Different parameters are used to compute the normalized output during training and inference (see the sketch after this list)
- Batch Norm becomes less effective with small minibatches, whose statistics are noisy estimates of the population statistics
- Non-i.i.d. minibatches can have a detrimental effect on models with batchnorm. For example, in a metric learning scenario with a minibatch of size 32, we may randomly select 16 labels and then choose 2 examples for each of those labels; the examples interact at every layer, so the model may overfit to the specific distribution of minibatches and suffer when used on individual examples.

The problem with using moving averages during training is that it makes gradient optimization and normalization work against each other, which can lead to the model blowing up.
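To make the first issue concrete, here is a minimal sketch (NumPy, with hypothetical function names) of how standard batchnorm computes a *different* function of the input at training time than at inference time:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var,
                    momentum=0.99, eps=1e-5):
    # Training: normalize with the statistics of the current minibatch.
    mu_b = x.mean(axis=0)
    var_b = x.var(axis=0)
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)
    # The moving averages are only bookkeeping here; they do not affect the output.
    running_mean = momentum * running_mean + (1 - momentum) * mu_b
    running_var = momentum * running_var + (1 - momentum) * var_b
    return gamma * x_hat + beta, running_mean, running_var

def batchnorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # Inference: normalize with the moving averages instead, so the output
    # for a given example no longer matches what training produced.
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```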
Idea of Batch Renormalization
We know that,
${\frac{x_i - \mu}{\sigma} = \frac{x_i - \mu_B}{\sigma_B} \cdot r + d}$
where,
${r = \frac{\sigma_B}{\sigma}, \quad d = \frac{\mu_B - \mu}{\sigma}}$

Here $\mu_B, \sigma_B$ are the minibatch statistics and $\mu, \sigma$ are their moving averages.
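This identity is straightforward to check numerically (illustrative values only):

```python
import numpy as np

x = np.random.randn(32)            # a minibatch of activations
mu_b, sigma_b = x.mean(), x.std()  # minibatch statistics
mu, sigma = 0.3, 1.7               # hypothetical moving-average statistics

r = sigma_b / sigma
d = (mu_b - mu) / sigma

lhs = (x - mu) / sigma
rhs = (x - mu_b) / sigma_b * r + d
assert np.allclose(lhs, rhs)
```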
The Batch Renormalization algorithm is then defined as follows:
![Batch Renorm Algo](https://fractalanalytic-my.sharepoint.com/personal/shubham_jain_fractalanalytics_com/_layouts/15/guestaccess.aspx?docid=0c2c627424786442f8de65367755e1fd1&authkey=ARSCi3QfpM_uBVuWCYARKNg)
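In case the image does not load: during training, Batch Renorm computes $r$ and $d$ from the minibatch and moving statistics, clips them to $[1/r_{max}, r_{max}]$ and $[-d_{max}, d_{max}]$, and treats both as constants during backpropagation (a stop-gradient). A minimal sketch of one training step in NumPy (function and variable names are illustrative):

```python
import numpy as np

def batch_renorm_train(x, gamma, beta, mu, sigma,
                       r_max=3.0, d_max=5.0, alpha=0.01, eps=1e-5):
    # mu, sigma: moving mean / moving std carried across training steps.
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    # Correction factors; in an autograd framework these would be wrapped in a
    # stop-gradient so backprop treats them as constants.
    r = np.clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = np.clip((mu_b - mu) / sigma, -d_max, d_max)
    x_hat = (x - mu_b) / sigma_b * r + d
    y = gamma * x_hat + beta
    # Update the moving statistics with the batch statistics.
    mu = mu + alpha * (mu_b - mu)
    sigma = sigma + alpha * (sigma_b - sigma)
    return y, mu, sigma

def batch_renorm_infer(x, gamma, beta, mu, sigma):
    # Inference uses the moving statistics, exactly as in batchnorm, and now
    # matches the training-time transform by construction.
    return gamma * (x - mu) / sigma + beta
```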
Ioffe writes further that for practical purposes,
> In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed.
In experiments,
For Batch Renorm, the author used $r_{max} = 1$, $d_{max} = 0$ (i.e. simply batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{max} = 3$ at 40k steps and $d_{max} = 5$ at 25k steps. A training step means one update to the model.
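The paper does not spell out the exact shape of the ramp; a simple reading is a linear relaxation of both bounds starting at step 5000. The schedule below is an assumption that merely matches the endpoints quoted above:

```python
def renorm_bounds(step):
    # Hypothetical linear schedule matching the quoted endpoints:
    # r_max: 1 until step 5000, then linearly up to 3 by step 40000;
    # d_max: 0 until step 5000, then linearly up to 5 by step 25000.
    def ramp(start, end, lo, hi):
        if step <= start:
            return lo
        t = min((step - start) / (end - start), 1.0)
        return lo + t * (hi - lo)
    return ramp(5000, 40000, 1.0, 3.0), ramp(5000, 25000, 0.0, 5.0)
```

For example, `renorm_bounds(25000)` gives roughly `(2.14, 5.0)`.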
**First published:** 2017/02/10

**Abstract:** Batch Normalization is quite effective at accelerating and improving the
training of deep models. However, its effectiveness diminishes when the
training minibatches are small, or do not consist of independent samples. We
hypothesize that this is due to the dependence of model layer inputs on all the
examples in the minibatch, and different activations being produced between
training and inference. We propose Batch Renormalization, a simple and
effective extension to ensure that the training and inference models generate
the same outputs that depend on individual examples rather than the entire
minibatch. Models trained with Batch Renormalization perform substantially
better than batchnorm when training with small or non-i.i.d. minibatches. At
the same time, Batch Renormalization retains the benefits of batchnorm such as
insensitivity to initialization and training efficiency.
