Ba et al. propose layer normalization, which normalizes the activations of a layer by their mean and standard deviation. In contrast to batch normalization, this scheme does not depend on the current batch; thus, it performs the same computation at training and test time. The general scheme, however, is very similar. Given the $l$th layer of a multilayer perceptron, $a_i^l = (w_i^l)^T h^l$ and $h_i^{l + 1} = f(a_i^l + b_i^l)$ with $W^l$ being the weight matrix, the activations $a_i^l$ are normalized by mean $\mu_i^l$ and standard deviation $\sigma_i^l$. For batch normalization, these are estimated over the current mini-batch: $\mu_i^l = \mathbb{E}_{p(x)} [a_i^l]$ and $\sigma_i^l = \sqrt{\mathbb{E}_{p(x)} [(a_i^l - \mu_i^l)^2]}$. However, this estimate depends heavily on the batch size; additionally, the computation differs between training and test time (at test time, the statistics are estimated over the training set). For layer normalization, instead, these statistics are computed over the $H$ activations in the same layer: $\mu^l = \frac{1}{H}\sum_{i = 1}^H a_i^l$ and $\sigma^l = \sqrt{\frac{1}{H}\sum_{i = 1}^H (a_i^l - \mu^l)^2}$. Thus, the normalization no longer depends on the batch size. Additionally, layer normalization is invariant to scaling and shifts of the weight matrix (for batch normalization, this only holds for the columns of the matrix). In experiments, this approach is shown to work well for a variety of tasks, including models with attention mechanisms and recurrent neural networks. For convolutional neural networks, the authors state that layer normalization does not outperform batch normalization, but performs better than using no normalization at all. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
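
To make the difference between the two sets of statistics concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `layer_norm` and `batch_norm_train`, the small constant `eps` added for numerical stability, and the array shapes are assumptions made for illustration, and the learnable gain and bias used in the paper are omitted. Activations are stored as a matrix of shape (batch size, $H$), so layer normalization reduces over the last axis and batch normalization over the first.

```python
import numpy as np

def layer_norm(a, eps=1e-5):
    """Normalize each sample over its H hidden units (last axis)."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=-1, keepdims=True))
    return (a - mu) / (sigma + eps)  # eps is an assumption, for stability

def batch_norm_train(a, eps=1e-5):
    """Normalize each hidden unit over the mini-batch (first axis)."""
    mu = a.mean(axis=0, keepdims=True)
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=0, keepdims=True))
    return (a - mu) / (sigma + eps)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))  # 4 samples, H = 8 hidden units

# Layer normalization: the result for a sample does not change
# when the other samples in the batch are removed.
ln_batch_independent = np.allclose(layer_norm(a)[:1], layer_norm(a[:1]))

# Batch normalization: the result for the first two samples changes
# when the batch is reduced from four samples to two.
bn_batch_independent = np.allclose(
    batch_norm_train(a)[:2], batch_norm_train(a[:2])
)

print(ln_batch_independent, bn_batch_independent)
```

In this sketch the first check holds and the second does not, which mirrors the point above: the layer-wise statistics are a function of a single sample only, so the same computation can be used at training and test time.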