# Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
#### Paper summary

Network training is very sensitive to the learning rate and to weight initialization. During training, the distribution of each layer's inputs keeps changing as the parameters of the preceding layers change (internal covariate shift), so every layer must permanently adapt to a new input distribution. In this paper the authors introduce batch normalization, a new layer that reduces internal covariate shift.

_Datasets:_ [MNIST](http://yann.lecun.com/exdb/mnist/), [ImageNet](www.image-net.org/).

#### Inner workings:

Batch normalization fixes the means and variances of a layer's inputs by applying the following normalization over each training mini-batch (see the sketch below for an illustrative version):

[![screen shot 2017-04-13 at 10 21 39 am](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png)](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png)

The parameters $\gamma$ and $\beta$ are then learned by gradient descent. During inference, the statistics are computed with unbiased estimators over the whole dataset (and not just over a single batch).

#### Results:

Batch normalization provides several advantages:

1. Allows a higher learning rate without risk of divergence, since normalization stabilizes the scale of the gradients.
2. Regularizes the model.
3. Reduces the need for dropout.
4. Prevents the network from getting stuck in saturated regimes when using saturating nonlinearities.

#### What to do?

1. Add a batch norm layer before the activation layers.
2. Increase the learning rate.
3. Remove dropout.
4. Reduce L2 weight regularization.
5. Accelerate learning rate decay.
6. Reduce picture distortion for data augmentation.
_Reference:_ Ioffe, Sergey and Szegedy, Christian. International Conference on Machine Learning, 2015.
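As a rough illustration of the transform in the screenshot above (not code from the paper), here is a minimal NumPy sketch of the batch normalization forward pass. The function name, the exponential moving average with `momentum=0.1`, and `eps=1e-5` are assumptions made for the sketch; what follows the paper is the batch-statistics normalization, the learned scale and shift $\gamma$ and $\beta$, and the use of unbiased population estimates at inference time.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Batch normalization for a mini-batch x of shape (N, D).

    Training: normalize each feature with the batch mean/variance,
    then apply the learned scale (gamma) and shift (beta).
    Inference: normalize with running estimates of the population
    statistics instead of the batch statistics.
    """
    if training:
        mu = x.mean(axis=0)                    # per-feature batch mean
        var = x.var(axis=0)                    # per-feature batch variance (biased)
        x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
        # Track population statistics for inference; the variance gets the
        # n / (n - 1) correction so the running estimate is unbiased (assumes N > 1).
        n = x.shape[0]
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var * n / (n - 1)
    else:
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    y = gamma * x_hat + beta                   # learned scale and shift
    return y, running_mean, running_var
```

A hypothetical usage matching the "add a batch norm layer before the activation layers" advice above: normalize the pre-activation output of a fully connected layer, then apply the nonlinearity.

```python
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256))           # mini-batch: 64 examples, 256 features
W = 0.01 * rng.standard_normal((256, 128))   # weights of a fully connected layer
gamma, beta = np.ones(128), np.zeros(128)    # learned parameters (initial values)
run_mean, run_var = np.zeros(128), np.ones(128)

z = x @ W                                                # pre-activation
z_bn, run_mean, run_var = batchnorm_forward(z, gamma, beta,
                                            run_mean, run_var, training=True)
h = np.maximum(z_bn, 0.0)                                # ReLU applied after batch norm
```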

