Group Normalization on ShortScience.org


Group Normalization
Yuxin Wu and Kaiming He
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV, cs.LG

Summaries/Notes 3

Summary by Hadrien Bertrand 4 years ago

Batch Normalization (BN) performs poorly with small batch sizes, which are often required for memory-intensive tasks such as detection or segmentation, or for memory-intensive data such as 3D images, videos or high-resolution images.

Group Normalization (GN) is a simple alternative whose computation is independent of the batch size:
![image](https://user-images.githubusercontent.com/8659132/57881829-3e255080-77f0-11e9-8ba0-56089c711e7b.png)

It works like BN, except that the mean and std are computed over a different set of features:
![image](https://user-images.githubusercontent.com/8659132/57882429-ab85b100-77f1-11e9-86df-2c9865d28e8b.png)
The $\gamma$ and $\beta$ are learned per channel, as in BN, and applied as usual:
![image](https://user-images.githubusercontent.com/8659132/57882468-c9ebac80-77f1-11e9-9d19-82b83b49ea24.png)
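
In case the images above do not load, here is my transcription of the formulas as I understand them from the paper. The notation is an assumption on my part: $i = (i_N, i_C, i_H, i_W)$ indexes a feature, $S_i$ is the set of features normalized together with it, $m = |S_i|$, $C$ is the number of channels and $G$ the number of groups.

$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i), \qquad \mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i}(x_k - \mu_i)^2 + \epsilon}$$

$$S_i = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

$$y_i = \gamma \hat{x}_i + \beta$$

The only difference from BN is the definition of $S_i$: BN would use $S_i = \{k \mid k_C = i_C\}$, which pools over the batch axis, whereas the set above never mixes different samples.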

A group is defined as a set of channels, and the mean and std are computed over that set of channels (and all spatial positions) for a single sample, as illustrated:
![image](https://user-images.githubusercontent.com/8659132/57882184-200c2000-77f1-11e9-9d2c-8d3fad6d6827.png)
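By default, there are 32 groups, but they show GN works well as long as there is more than one group and fewer groups than channels. (A single group reduces GN to Layer Normalization, and one channel per group reduces it to Instance Normalization.)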
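
To make the grouping concrete, here is a minimal NumPy sketch of the forward pass, assuming inputs of shape `(N, C, H, W)`. The function name `group_norm`, the `eps` default and the per-channel shape of `gamma` and `beta` are my own choices for illustration, not the authors' code (see the Detectron link below for that).

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups=32, eps=1e-5):
    """Sketch of Group Normalization for an array of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    # Split the channel axis into groups: (N, G, C // G, H, W).
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics are computed per sample and per group;
    # the batch axis (axis 0) is never reduced.
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # Scale and shift, with one gamma and one beta per channel (assumed here).
    return x_hat * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64, 8, 8))
gamma, beta = np.ones(64), np.zeros(64)

y_batch = group_norm(x, gamma, beta, num_groups=32)
y_single = group_norm(x[:1], gamma, beta, num_groups=32)

# The first sample is normalized the same way alone or inside a batch of 4.
print(np.allclose(y_batch[:1], y_single))  # True
```

Because no statistic crosses the batch axis, the output for a sample does not depend on what else is in the batch, which is the property BN lacks. Setting `num_groups=1` or `num_groups=64` in this sketch gives the Layer Norm and Instance Norm extremes.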

In terms of experiments, they evaluate on ImageNet classification, detection and segmentation on COCO, and video classification on Kinetics. The conclusion is that **GN gives nearly the same performance no matter the batch size, and that performance is close to that of BN with large batches.** The most impressive result is with ResNet-50 on ImageNet at a batch size of 2, where GN has 10.6% lower error than BN.

# Comments

- This paper got an honorable mention at ECCV 2018.
- I don't understand how it works at the entrance of the network, when there are only 1 or 3 channels. Are we just not supposed to put GN there?
- Also, the number of channels tends to increase in the network, but the number of groups stays fixed. Should it scale with the number of channels?
- They tested GN on many tasks, but mostly on ResNet. There was only one experiment on VGG-16, where they found no big difference with BN. For now I'm not convinced GN is useful outside of ResNet.
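
On the question of scaling the number of groups: as far as I recall, the paper's ablations also include a variant that fixes the number of channels per group instead of the number of groups, so that the group count grows with the layer width. A small sketch of that bookkeeping follows. The helper `groups_for`, its default of 16 channels per group and the fallback to a single group for narrow layers are hypothetical choices of mine, not something the paper prescribes.

```python
def groups_for(num_channels, channels_per_group=16):
    """Hypothetical helper: group count when channels per group is fixed."""
    if num_channels <= channels_per_group:
        # Too few channels to split (e.g. an RGB input): use a single group,
        # which normalizes over all channels of the sample at once.
        return 1
    if num_channels % channels_per_group != 0:
        raise ValueError("num_channels must be divisible by channels_per_group")
    return num_channels // channels_per_group

for c in (3, 64, 256, 2048):
    print(c, groups_for(c))
# 3 1
# 64 4
# 256 16
# 2048 128
```

With a fixed group count of 32, by contrast, a 3-channel input cannot be split at all, since 3 is not divisible by 32.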

Code: https://github.com/facebookresearch/Detectron/tree/master/projects/GN
