[link]
Wu and He propose group normalization as alternative to batch normalization. Instead of computing the statistics used for normalization based on the current minibatch, group normalization computes these statistics per instance but in groups of channels (for convolutional layers). Specifically, given activations $x_i$ with $i = (i_N, i_C, i_H, i_W)$ indexing along batch size, channels, height and width, batch normalization computes $\mu_i = \frac{1}{S}\sum_{k \in S} x_k$ and $\sigma_i = \sqrt{\frac{1}{S} \sum_{k \in S} (x_k  \mu_i)^2 + \epsilon}$ with the set $S$ holds all indices for a specific channel (i.e. across samples, height and width). For group normalization, in contrast, $S$ holds all indices of the current instance and group of channels. Meaning the statistics are computed across height, width and the current group of channels. Here, all channels can be divided into groups arbitrarily. In the paper, on ImageNet, groups of $32$ channels are used. Then, Figure 1 shows that for a batch size of 32, group normalization performs enpar with batch normalization – although the validation error is slightly larger. This is attributed to the stochastic element of batch normalization that leads to regularization. Figure 2 additionally shows the influence of the batch size of batch normalization and group normalization. https://i.imgur.com/lwP5ycw.jpg Figure 1: Training and validation error for different normalization schemes on ImageNet. https://i.imgur.com/0c3CnEX.jpg Figure 2: Validation error for different batch sizes. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
Your comment:
