Layer Normalization on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

Layer Normalization
Jimmy Lei Ba and Jamie Ryan Kiros and Geoffrey E. Hinton
arXiv e-Print archive - 2016 via Local arXiv
Keywords: stat.ML, cs.LG
more

Summaries/Notes 2

[link] Summary by Denny Britz 7 years ago

TLDR; The authors propose a new normalization scheme called "Layer Normalization" that works especially well for recurrent networks. Layer Normalization is similar to Batch Normalization, but only depends on a single training case. As such, it's well suited for variable length sequences or small batches. In Layer Normalization each hidden unit shares the same normalization term. The authors show through experiments that Layer Normalization converges faster, and sometimes to better solutions, than batch- or unnormalized RNNs. Batch normalization still performs better for CNNs.

Your comment:

[link] Summary by David Stutz 4 years ago

Ba et al. propose layer normalization, normalizing the activations of a layer by its mean and standard deviation. In contrast to batch normalization, this scheme does not depend on the current batch; thus, it performs the same computation at training and test time. The general scheme, however, is very similar. Given the $l$-th layer of a multi-layer perceptron,

$a_i^l = (w_i^l)^T h^l$ and $h_i^{l + 1} = f(a_i^l + b_i^l)$

with $W^l$ being the weight matrix, the activations $a_i^l$ are normalized by mean $\mu_i^l$ and standard deviation $\sigma_i^l$. For batch normalization these are estimated over the current mini batch:

$\mu_i^l = \mathbb{E}_{p(x)} [a_i^l]$ and $\sigma_i^l = \sqrt{\mathbb{E}_{p(x)} [(a_i^l - \mu_i^l)^2}$.

However, this estimation depends heavily on the batch size; additionally, models change during training and test time (at test time, these statistics are estimated over the training set). For layer normalization, instead, these statistics are evaluated over the activations in the same layer:

$\mu^l = \frac{1}{H}\sum_{i = 1}^H a_i^l$ and $\sigma^l = \sqrt{\frac{1}{H}\sum_{i = 1}^H (a_i^l - \mu^l)^2}$.

Thus, the normalization is not depending on the batch size anymore. Additionally, layer normalization is invariant to scaling and shifts of the weight matrix (for batch normalization, this only holds for the columns of the matrix). In experiments, this approach is shown to work well for a variety of tasks including models with attention mechanisms and recurrent neural networks. For convolutional neural networks, the authors state that layer normalization does not outperform batch normalization, but performs better than using no normalization at all.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private