ShortScience.org - Making Science Accessible!


Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Ioffe, Sergey and Szegedy, Christian
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by José Manuel Rodríguez Sotelo 7 years ago

The main contribution of this paper is a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that, during the training of deep neural networks (DNNs), the distribution of each layer's inputs changes as the parameters of the preceding layers change. This phenomenon is called internal covariate shift (ICS).

#### What is BN?
BN normalizes each (scalar) feature independently with respect to the mean and variance of the mini batch. The normalized values are then scaled and shifted with two new parameters (per activation) that are learned. In effect, BN makes normalization part of the model architecture.
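A minimal sketch of the transformation described above, for a single scalar feature over one mini batch. The function name `batch_norm`, the default values of `gamma` and `beta`, and the small constant `eps` (added for numerical stability) are illustrative assumptions, not taken from the summary.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one scalar feature over a mini batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    # Biased (population) variance of the mini batch.
    var = sum((v - mean) ** 2 for v in values) / n
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    # gamma (scale) and beta (shift) stand in for the learned parameters.
    return [gamma * x_hat + beta for x_hat in normalized]

batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(batch)                              # roughly zero mean, unit variance
shifted = batch_norm(batch, gamma=2.0, beta=3.0)     # roughly mean 3, standard deviation 2
```

In a real network `gamma` and `beta` would be trained by backpropagation together with the other weights; here they are plain arguments.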

#### What do we gain?
According to the authors, BN greatly speeds up the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN acts as a regularizer for the model, which allows the use of less dropout or less L2 regularization. Furthermore, since the distribution of the inputs is normalized, it also allows the use of sigmoids as activation functions without the saturation problem.

#### What follows?
This seems to be especially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the repeated application of transformations that scale the activations up or down in certain directions (eigenvectors). It seems that this normalization would be especially useful in this context, since it would allow the gradient to flow more easily. When we unroll RNNs, we usually get ultra deep networks.

#### Like
* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike
* Does not work with stochastic gradient descent (minibatch size = 1).
* This could reduce the parallelism of the algorithm since now all the examples in a mini batch are tied.
* Reporting results on an ensemble of networks for ImageNet makes it harder to evaluate the relevance of BN by itself (although they do mention the performance of a single model).


Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much
He, Bryan D. and Sa, Christopher De and Mitliagkas, Ioannis and Ré, Christopher
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 7 years ago

A study of how scan orders influence mixing time in Gibbs sampling.

This paper compares the mixing rates of Gibbs sampling under systematic scan and under random updates. There are two basic contributions. First, Section 2 gives a set of cases where systematic scan is polynomially faster than random updates. Together with a previously known case where it can be slower, this contradicts a conjecture that the speeds of systematic and random updates are similar. Second, Theorem 1 gives a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other).
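To make the two scan orders concrete, here is a small sketch of one sweep of each on a toy model. The model (a chain of ±1 spins with nearest-neighbour coupling), the function names and the coupling value are illustrative assumptions and are not one of the paper's constructions.

```python
import math
import random

def conditional_prob_one(state, i, coupling):
    """P(x_i = +1 | all other spins) for a chain Ising model with spins in {-1, +1}."""
    field = 0.0
    if i > 0:
        field += state[i - 1]
    if i < len(state) - 1:
        field += state[i + 1]
    return 1.0 / (1.0 + math.exp(-2.0 * coupling * field))

def resample(state, i, coupling, rng):
    # Gibbs update: draw x_i from its conditional distribution given the rest.
    state[i] = 1 if rng.random() < conditional_prob_one(state, i, coupling) else -1

def systematic_scan_sweep(state, coupling, rng):
    # Visit the variables in a fixed order: 0, 1, ..., n-1.
    for i in range(len(state)):
        resample(state, i, coupling, rng)

def random_scan_sweep(state, coupling, rng):
    # Perform n updates, each on a variable chosen uniformly at random.
    for _ in range(len(state)):
        resample(state, rng.randrange(len(state)), coupling, rng)

rng = random.Random(0)
state_sys = [1] * 10
state_rand = [1] * 10
for _ in range(100):
    systematic_scan_sweep(state_sys, 0.5, rng)
    random_scan_sweep(state_rand, 0.5, rng)
```

Both sweeps perform the same number of single-variable updates; only the order in which variables are chosen differs, which is the quantity whose effect on mixing time the paper studies.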

First, following a recent paper by Roberts and Rosenthal, the authors construct several examples that violate the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide a result, Theorem 1, which gives weaker bounds that they verify at least under some conditions. In fact, the theorem compares random scan to a lazy version of systematic scan and obtains bounds in terms of various other quantities, such as the minimum probability or the minimum holding probability.

MCMC is at the heart of many applications of modern machine learning and statistics. It is thus important to understand its computational and theoretical performance under various conditions. The present paper focuses on examining systematic scan Gibbs sampling in comparison to random scan Gibbs sampling. The authors do so first through the construction of several examples that challenge the dominant intuitions about mixing times, and then develop theoretical bounds that are much wider than previously conjectured.


Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
Sergey Ioffe
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG

[link] Summary by Qure.ai 7 years ago

Batch Normalization (Ioffe et al., 2015, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift") is one of the remarkable ideas of the deep learning era, alongside the likes of Dropout and Residual Connections. Nonetheless, the last few years have revealed a few shortcomings of the idea, which Ioffe tries to address two years later with a concept he calls Batch Renormalization.

Issues with Batch Normalization

- Different parameters are used to compute the normalized output during training and during inference.
- Batch norm is used with small minibatches, where the estimates of the mean and variance are less accurate.
- Non-i.i.d. minibatches can have a detrimental effect on models with batchnorm. For example, in a metric learning scenario with a minibatch of size 32, we may randomly select 16 labels and then choose 2 examples for each of these labels. The examples interact at every layer, which may cause the model to overfit to the specific distribution of minibatches and suffer when used on individual examples.

The problem with using the moving averages during training is that the gradient optimization and the normalization then counteract each other, which can lead to the model blowing up.

Idea of Batch Renormalization

We know that,

${\frac{x_i - \mu}{\sigma} = \frac{x_i - \mu_B}{\sigma_B} \cdot r + d}$

where

${r = \frac{\sigma_B}{\sigma}, \quad d = \frac{\mu_B - \mu}{\sigma}}$

So the batch renormalization algorithm is defined as follows

![Batch Renorm Algo](https://fractalanalytic-my.sharepoint.com/personal/shubham_jain_fractalanalytics_com/_layouts/15/guestaccess.aspx?docid=0c2c627424786442f8de65367755e1fd1&authkey=ARSCi3QfpM_uBVuWCYARKNg)
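Since the algorithm is only available as a figure, here is a rough sketch of one training step for a single scalar feature. It assumes the procedure as I read it in the paper: $r$ and $d$ are computed from the minibatch statistics and the moving averages, clipped to $[1/r_{max}, r_{max}]$ and $[-d_{max}, d_{max}]$, and treated as constants during backpropagation. The function names, the moving-average rate `alpha` and the constant `eps` are illustrative assumptions.

```python
import math

def clip(value, low, high):
    return max(low, min(high, value))

def batch_renorm_step(values, mu, sigma, r_max, d_max,
                      gamma=1.0, beta=0.0, alpha=0.01, eps=1e-5):
    """One batch renormalization step for a scalar feature.

    mu and sigma are the moving averages of the mean and standard deviation.
    """
    n = len(values)
    mu_b = sum(values) / n
    sigma_b = math.sqrt(sum((v - mu_b) ** 2 for v in values) / n + eps)
    # Correction terms; no gradient is propagated through r and d.
    r = clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = clip((mu_b - mu) / sigma, -d_max, d_max)
    outputs = [gamma * ((v - mu_b) / sigma_b * r + d) + beta for v in values]
    # Update the moving averages towards the minibatch statistics.
    new_mu = mu + alpha * (mu_b - mu)
    new_sigma = sigma + alpha * (sigma_b - sigma)
    return outputs, new_mu, new_sigma

batch = [1.0, 2.0, 3.0, 4.0]
# Loose bounds: the output equals (x - mu) / sigma, i.e. inference-style normalization.
renormed, mu_next, sigma_next = batch_renorm_step(batch, mu=2.0, sigma=1.0, r_max=3.0, d_max=5.0)
# r_max = 1 and d_max = 0 force r = 1 and d = 0, which recovers plain batchnorm.
plain, _, _ = batch_renorm_step(batch, mu=2.0, sigma=1.0, r_max=1.0, d_max=0.0)
```

When neither bound is active, the training-time output matches what the moving averages would produce at inference, which is the dependence on the minibatch that the method tries to reduce.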

Ioffe further notes:

> In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed.

In the experiments with Batch Renorm, the author used $r_{max}$ = 1, $d_{max}$ = 0 (i.e. simply batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{max}$ = 3 at 40k steps and $d_{max}$ = 5 at 25k steps. A training step means an update to the model.
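The schedule above can be sketched as follows. The summary only says the bounds are "gradually relaxed", so the linear interpolation used here is an assumption, as are the function names.

```python
def relaxed_bound(step, start_value, end_value, start_step, end_step):
    """Hold start_value until start_step, then interpolate linearly up to end_step."""
    if step <= start_step:
        return start_value
    if step >= end_step:
        return end_value
    fraction = (step - start_step) / (end_step - start_step)
    return start_value + fraction * (end_value - start_value)

def r_max_at(step):
    # r_max = 1 for the first 5000 steps, reaching 3 at 40k steps.
    return relaxed_bound(step, 1.0, 3.0, 5_000, 40_000)

def d_max_at(step):
    # d_max = 0 for the first 5000 steps, reaching 5 at 25k steps.
    return relaxed_bound(step, 0.0, 5.0, 5_000, 25_000)
```

Any other monotone ramp between the same endpoints would fit the description equally well.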


Bayesian Uncertainty Estimation for Batch Normalized Deep Networks
Teye, Mattias and Azizpour, Hossein and Smith, Kevin
International Conference on Machine Learning - 2018 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 4 years ago

Teye et al. show that neural networks with batch normalization can be used to give uncertainty estimates through Monte Carlo sampling. In particular, instead of using the test mode of batch normalization, where the statistics (mean and variance) of each batch normalization layer are fixed, these statistics are computed per batch, as in training mode. To this end, for a specific query image, random batches from the training set are sampled, and prediction uncertainty is estimated using Monte Carlo sampling to compute mean and variance. This is summarized in Algorithm 1, depicting the proposed Monte Carlo Batch Normalization method. In the paper, this approach is further interpreted as approximate inference in Bayesian models.

https://i.imgur.com/nRdOvzs.jpg
Algorithm 1: Monte Carlo approach for using batch normalization for uncertainty estimation.
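The sampling procedure can be illustrated with a toy sketch. The "network" here is a single batch-normalized feature followed by a fixed linear layer, which is a stand-in chosen for illustration. The function names, weights, batch size and number of samples are all assumptions and do not come from the paper.

```python
import math
import random

def predict_with_batch_stats(x, batch, weight=2.0, bias=0.5, eps=1e-5):
    """Toy model: normalize x with the statistics of `batch`, then apply a linear layer."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return weight * (x - mean) / math.sqrt(var + eps) + bias

def mc_batch_norm(x, train_data, num_samples=100, batch_size=8, seed=0):
    """Estimate the predictive mean and variance for the query x."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(num_samples):
        # Draw a random training mini batch and use its statistics, as in training mode.
        batch = rng.sample(train_data, batch_size)
        predictions.append(predict_with_batch_stats(x, batch))
    mean = sum(predictions) / num_samples
    variance = sum((p - mean) ** 2 for p in predictions) / num_samples
    return mean, variance

train_data = [float(i) for i in range(50)]
pred_mean, pred_var = mc_batch_norm(25.0, train_data)
```

The spread of the predictions across sampled batches is what serves as the uncertainty estimate. With fixed test-mode statistics every pass would return the same value.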

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).


Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 7 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often the goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix in which three customers (the rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1  \\\\
4 & 5 & 1 \\\\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First, let's do this without any non-negativity constraint, using an SVD and keeping only the largest singular value:


$$
W = \left[\begin{array}{c c c}
-0.656 \\\
 -0.652 \\\
 -0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.48 & -6.26 & -3.20\\\\
\end{array}\right]
$$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c c c}
0.388 \\\\
0.386 \\\\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41  \\\\
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\\
4.33 & 4.08 & 2.09 \\\
2.52 & 2.37 & 1.21 \\\
\end{array}\right]
$$
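As a quick check, the product can be recomputed from the rounded non-negative factors given above. Because the factors are rounded to two or three digits, the entries may differ from the matrix shown in the last decimal place. The variable names are illustrative.

```python
V = [[5, 4, 1],
     [4, 5, 1],
     [2, 1, 5]]
W = [0.388, 0.386, 0.224]    # the m x 1 factor, stored as a flat list
H = [11.22, 10.57, 5.41]     # the 1 x n factor, stored as a flat list

# With r = 1 the product WH is the outer product of the two vectors.
WH = [[w * h for h in H] for w in W]

# Frobenius norm of the residual V - WH.
error = sum((V[i][j] - WH[i][j]) ** 2 for i in range(3) for j in range(3)) ** 0.5

for row in WH:
    print([round(value, 2) for value in row])
print("reconstruction error:", round(error, 2))
```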


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively constructing non-negative $W$ and $H$, starting from random initial matrices. The first is a multiplicative update rule, where the elements of $W$ and $H$ are iteratively rescaled such that the reconstruction error never increases.
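A minimal sketch of the multiplicative updates for the squared-error objective, written in plain Python. The helper names, the number of iterations, the random initialization range and the small constant `eps` (which guards against division by zero) are illustrative assumptions.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, r, iterations=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    # Random positive initial matrices.
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)] for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)] for i in range(m)]
    return w, h

V = [[5, 4, 1], [4, 5, 1], [2, 1, 5]]
W, H = nmf(V, r=1)
nmf_error = frobenius_error(V, W, H)
```

Because every update multiplies by a ratio of non-negative quantities, the factors stay non-negative without any explicit projection step. The scale of $W$ and $H$ is not unique, so the individual factors need not match the numbers in the example above even when the product $WH$ does.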

The multiplicative approach is discussed in contrast to an additive, gradient descent based approach in which small corrections are applied iteratively. The additive approach reduces to the multiplicative one when the learning rate ($\eta$) of each element of $H$ is set to the ratio of that element to the corresponding element of $W^T W H$.
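The link between the two rules can be written out for $H$ with the squared-error objective. The index notation ($a$, $\mu$) is my own choice here, and the update for $W$ follows the same pattern:

$$
H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu}\left[(W^T V)_{a\mu} - (W^T W H)_{a\mu}\right],
\qquad
\eta_{a\mu} = \frac{H_{a\mu}}{(W^T W H)_{a\mu}}
\quad\Longrightarrow\quad
H_{a\mu} \leftarrow H_{a\mu}\,\frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}
$$

Substituting this $\eta_{a\mu}$ cancels the $-H_{a\mu}$ term, leaving a purely multiplicative rescaling.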



### Still a draft