Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng (2019)
Paper summary by hbertrand
Natural images can be decomposed into frequencies: higher frequencies contain small changes and details, while lower frequencies contain the global structure. We can see an example in this image:
Each filter of a convolutional layer focuses on different frequencies of the image. This paper proposes a way to group them explicitly into high and low frequency filters.
To do that, the low-frequency group is reduced by a factor of 2 along each spatial dimension (which they define as an octave) before the convolution is applied. The spatial reduction, which is a pooling operation, makes sense because it acts as a low-pass filter: small details are discarded but the global structure is kept.
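A minimal one-dimensional sketch of why pooling behaves like a low-pass filter. It assumes pairwise average pooling followed by nearest-neighbour upsampling, which are illustrative choices rather than anything taken from the paper's code. The helper name `low_pass` and the sample signal are hypothetical.

```python
def low_pass(signal):
    """Average each pair of neighbours, then repeat each average twice.

    Assumes an even-length signal. The result has the original length
    but only keeps the slowly varying part of the signal.
    """
    pooled = [(signal[i] + signal[i + 1]) / 2
              for i in range(0, len(signal) - 1, 2)]
    return [value for value in pooled for _ in range(2)]


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]

# Low-frequency part: the coarse structure that survives pooling.
low = low_pass(signal)

# High-frequency part: the fine detail that pooling discards.
high = [s - l for s, l in zip(signal, low)]

# The two parts add back up to the original signal.
reconstructed = [l + h for l, h in zip(low, high)]
```

In this toy setting the low-frequency part can be stored at half the length without losing anything, which is the intuition behind keeping that group at a lower resolution.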
More concretely, the layer takes as input two groups of feature maps, one with a higher resolution than the other. The output is also two groups of feature maps, separated into high and low frequencies. Information is exchanged between the two groups by pooling or upsampling as needed, as shown in this image:
The proportion of high- and low-frequency feature maps is controlled by a single parameter, and in their experiments the authors found that having around 25% low-frequency features gives the best performance.
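As a small illustration of that ratio parameter, the sketch below splits a layer's channels into the two groups. The function name `split_channels`, the argument name `alpha`, and the choice to round the low-frequency count down are assumptions made for this example.

```python
def split_channels(total_channels, alpha):
    """Return (high_channels, low_channels) for a low-frequency ratio alpha.

    alpha is the fraction of channels assigned to the low-frequency group.
    Rounding down is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    low_channels = int(alpha * total_channels)
    high_channels = total_channels - low_channels
    return high_channels, low_channels


# With 64 channels and 25% low-frequency features:
high_channels, low_channels = split_channels(64, 0.25)
```

With `alpha = 0` the layer has no low-frequency group at all, so it reduces to an ordinary convolution over full-resolution feature maps.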
One important fact about this layer is that it can be used as a drop-in replacement for a standard convolutional layer, and thus does not require other changes to the architecture. They test on various ResNets, DenseNets and MobileNets.
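The two-input, two-output layer described above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it assumes 1x1 kernels, 2x2 average pooling, nearest-neighbour upsampling, and even spatial sizes. All function and weight names (`octave_conv1x1`, `w_hh`, `w_hl`, `w_lh`, `w_ll`) are hypothetical. Feature maps are nested lists indexed `[channel][row][col]`.

```python
def avg_pool2(x):
    """2x2 average pooling with stride 2 (halves height and width)."""
    return [[[(ch[2*i][2*j] + ch[2*i][2*j+1]
               + ch[2*i+1][2*j] + ch[2*i+1][2*j+1]) / 4.0
              for j in range(len(ch[0]) // 2)]
             for i in range(len(ch) // 2)]
            for ch in x]


def upsample2(x):
    """Nearest-neighbour upsampling by 2 (doubles height and width)."""
    return [[[ch[i // 2][j // 2] for j in range(2 * len(ch[0]))]
             for i in range(2 * len(ch))]
            for ch in x]


def conv1x1(x, w):
    """1x1 convolution: w[o][c] mixes input channel c into output channel o."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(w[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(width)]
             for i in range(height)]
            for o in range(len(w))]


def add(a, b):
    """Element-wise sum of two feature-map groups of identical shape."""
    return [[[p + q for p, q in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def octave_conv1x1(x_high, x_low, w_hh, w_hl, w_lh, w_ll):
    """Four paths: high->high, low->high, low->low, high->low."""
    # High output: same-resolution path plus the upsampled low path.
    y_high = add(conv1x1(x_high, w_hh), upsample2(conv1x1(x_low, w_lh)))
    # Low output: same-resolution path plus the pooled high path.
    y_low = add(conv1x1(x_low, w_ll), conv1x1(avg_pool2(x_high), w_hl))
    return y_high, y_low


# Toy input: 3 high-frequency channels at 4x4, 1 low-frequency channel at 2x2.
x_high = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
x_low = [[[1.0] * 2 for _ in range(2)]]
w_hh = [[1.0] * 3 for _ in range(3)]   # 3 high in -> 3 high out
w_lh = [[1.0] for _ in range(3)]       # 1 low in  -> 3 high out
w_hl = [[1.0] * 3]                     # 3 high in -> 1 low out
w_ll = [[1.0]]                         # 1 low in  -> 1 low out
y_high, y_low = octave_conv1x1(x_high, x_low, w_hh, w_hl, w_lh, w_ll)
```

Because the layer consumes and produces the same pair of groups, such layers can be stacked in place of ordinary convolutions. Only the first and last layers of a network need to convert from or to a single full-resolution tensor.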
In terms of tasks, they get performance near state-of-the-art on [ImageNet top-1](https://paperswithcode.com/sota/image-classification-on-imagenet) and top-5. So why use this octave convolution? Because it reduces the amount of memory and computation required by the network.
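To get a rough feel for the savings, the sketch below estimates the relative compute and activation memory of one octave convolution layer compared with a standard convolution. It is a back-of-the-envelope model under stated assumptions: the same low-frequency ratio `alpha` at the input and output, the same kernel size on every path, cost proportional to input channels times output channels times spatial positions, every path that touches the low-frequency group computed at half resolution, and the cost of pooling and upsampling ignored. The function names are hypothetical.

```python
def relative_compute(alpha):
    """Multiply-add count relative to a standard convolution (1.0 = same cost)."""
    # High -> high runs at full resolution.
    full_res = (1 - alpha) ** 2
    # High -> low, low -> high and low -> low run on maps with
    # half the height and half the width, i.e. 1/4 of the positions.
    half_res = (2 * alpha * (1 - alpha) + alpha ** 2) / 4
    return full_res + half_res


def relative_memory(alpha):
    """Activation storage relative to keeping every map at full resolution."""
    return (1 - alpha) + alpha / 4


# Estimates for 25% low-frequency channels.
compute_at_quarter = relative_compute(0.25)
memory_at_quarter = relative_memory(0.25)
```

Under these assumptions, a ratio of 0.25 leaves roughly two thirds of the multiply-adds and about four fifths of the activation memory of the standard layer.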
- I would have liked to see more groups of varying frequencies. Since an octave is a spatial reduction of 2^n, the authors could do the same with n > 1. I expect this will be addressed in future work.
- While the results are not quite SOTA, octave convolutions seem compatible with EfficientNet, and I expect this would improve the performance of both.
- Since each octave convolution layer outputs a multi-scale representation of the input, doesn't that mean that pooling becomes less necessary in a network? If so, octave convolutions would give better performance on a new architecture optimized for them.
Code: [Official](https://github.com/facebookresearch/OctConv), [all implementations](https://paperswithcode.com/paper/drop-an-octave-reducing-spatial-redundancy-in)
arXiv e-Print archive - 2019 via Local arXiv
First published: 2019/04/10
Abstract: In natural images, information is conveyed at different frequencies where
higher frequencies are usually encoded with fine details and lower frequencies
are usually encoded with global structures. Similarly, the output feature maps
of a convolution layer can also be seen as a mixture of information at
different frequencies. In this work, we propose to factorize the mixed feature
maps by their frequencies and design a novel Octave Convolution (OctConv)
operation to store and process feature maps that vary spatially "slower" at a
lower spatial resolution reducing both memory and computation cost. Unlike
existing multi-scale methods, OctConv is formulated as a single, generic,
plug-and-play convolutional unit that can be used as a direct replacement of
(vanilla) convolutions without any adjustments in the network architecture. It
is also orthogonal and complementary to methods that suggest better topologies
or reduce channel-wise redundancy like group or depth-wise convolutions. We
experimentally show that by simply replacing convolutions with OctConv, we can
consistently boost accuracy for both image and video recognition tasks, while
reducing memory and computational cost. An OctConv-equipped ResNet-152 can
achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2