Summary by Qure.ai 1 year ago
Batch Normalization (Ioffe et al., 2015, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift") is one of the remarkable ideas of the deep learning era, alongside the likes of Dropout and Residual Connections. Nonetheless, the last few years have exposed a few shortcomings of the idea, which Ioffe tried to address two years later with a concept he calls Batch Renormalization.
Issues with Batch Normalization
- Different parameters are used to compute the normalized output during training and inference: minibatch statistics during training, moving averages at inference.
- With small minibatches, the minibatch estimates of the mean and variance become less accurate, and Batch Norm suffers.
- Non-i.i.d. minibatches can have a detrimental effect on models with batchnorm. For example, in a metric learning scenario with a minibatch of size 32, we may randomly select 16 labels and then choose 2 examples for each of these labels. The examples interact at every layer, which may cause the model to overfit to the specific distribution of minibatches and suffer when used on individual examples.
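The sampling scheme in the last bullet can be sketched as follows. This is only an illustration of how such a non-i.i.d. minibatch might be built; the function name `sample_paired_minibatch` and the toy dataset are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_paired_minibatch(labels, num_labels=16, per_label=2, rng=None):
    """Return dataset indices for num_labels classes with per_label examples each."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    # Only classes with at least per_label examples can contribute a full group.
    classes, counts = np.unique(labels, return_counts=True)
    eligible = classes[counts >= per_label]
    chosen = rng.choice(eligible, size=num_labels, replace=False)
    batch = []
    for c in chosen:
        idx = np.flatnonzero(labels == c)
        batch.extend(rng.choice(idx, size=per_label, replace=False))
    return np.array(batch)

# Toy dataset (made-up numbers): 100 classes with 10 examples each.
labels = np.repeat(np.arange(100), 10)
batch_idx = sample_paired_minibatch(labels, rng=np.random.default_rng(0))
```

Every example in such a batch has exactly one same-label partner, so the batch statistics are computed over a distribution that a single test example never sees.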
Simply normalizing with the moving averages during training as well does not work: the gradient optimization and the normalization then counteract each other, and the model blows up.
Idea of Batch Renormalization
With $\mu_B, \sigma_B$ the minibatch statistics and $\mu, \sigma$ the moving averages, we know that
${\frac{x_i - \mu}{\sigma} = \frac{x_i - \mu_B}{\sigma_B} \cdot r + d}$
where
${r = \frac{\sigma_B}{\sigma}, \quad d = \frac{\mu_B - \mu}{\sigma}}$
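To see why this identity holds, add and subtract $\mu_B$ in the numerator and multiply the first term by $\sigma_B / \sigma_B$:
${\frac{x_i - \mu}{\sigma} = \frac{(x_i - \mu_B) + (\mu_B - \mu)}{\sigma} = \frac{x_i - \mu_B}{\sigma_B} \cdot \frac{\sigma_B}{\sigma} + \frac{\mu_B - \mu}{\sigma}}$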
The batch renormalization algorithm is then defined as follows:
![Batch Renorm Algo](https://fractalanalytic-my.sharepoint.com/personal/shubham_jain_fractalanalytics_com/_layouts/15/guestaccess.aspx?docid=0c2c627424786442f8de65367755e1fd1&authkey=ARSCi3QfpM_uBVuWCYARKNg)
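For readers who prefer code, here is a minimal NumPy sketch of the forward pass, assuming the per-feature formulation from the paper. The function names and the defaults for `alpha` and `eps` are illustrative assumptions. In a real framework $r$ and $d$ are treated as constants during backpropagation (stop-gradient), which plain NumPy cannot express.

```python
import numpy as np

def batch_renorm_train_step(x, gamma, beta, mu, sigma,
                            r_max=3.0, d_max=5.0, alpha=0.01, eps=1e-5):
    """Training-mode forward pass; x has shape (batch, features)."""
    mu_b = x.mean(axis=0)                      # minibatch mean
    sigma_b = np.sqrt(x.var(axis=0) + eps)     # minibatch standard deviation
    # Clipped corrections (constants with respect to the gradient).
    r = np.clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = np.clip((mu_b - mu) / sigma, -d_max, d_max)
    x_hat = (x - mu_b) / sigma_b * r + d
    y = gamma * x_hat + beta
    # Update the moving averages with rate alpha.
    new_mu = mu + alpha * (mu_b - mu)
    new_sigma = sigma + alpha * (sigma_b - sigma)
    return y, new_mu, new_sigma

def batch_renorm_inference(x, gamma, beta, mu, sigma):
    """Inference uses only the moving averages."""
    return gamma * (x - mu) / sigma + beta
```

With `r_max=1.0` and `d_max=0.0` the corrections are pinned to $r = 1$, $d = 0$ and the training step reduces to plain batchnorm. When neither clip is active, the training output matches the inference output for the same moving averages.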
Ioffe further notes that, for practical purposes:
> In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed.
In the experiments with Batch Renorm, the author used $r_{max}$ = 1, $d_{max}$ = 0 (i.e. plain batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{max}$ = 3 at 40k steps and $d_{max}$ = 5 at 25k steps. A training step means one update to the model.
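The schedule above could be implemented along these lines. The source only says the bounds were "gradually relaxed", so the linear interpolation here is an assumption, and `linear_ramp` and `renorm_bounds` are hypothetical helper names.

```python
def linear_ramp(step, start_step, end_step, start_value, end_value):
    """Hold start_value until start_step, then move linearly to end_value at end_step."""
    if step <= start_step:
        return start_value
    if step >= end_step:
        return end_value
    frac = (step - start_step) / (end_step - start_step)
    return start_value + frac * (end_value - start_value)

def renorm_bounds(step):
    """Return (r_max, d_max) for a given training step (linear ramp is an assumption)."""
    r_max = linear_ramp(step, 5_000, 40_000, 1.0, 3.0)
    d_max = linear_ramp(step, 5_000, 25_000, 0.0, 5.0)
    return r_max, d_max
```

The returned pair would be passed as the clipping bounds for $r$ and $d$ at each training step.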
