An Intriguing Failing of Convolutional Neural Networks and the CoordConv SolutionAn Intriguing Failing of Convolutional Neural Networks and the CoordConv SolutionRosanne Liu and Joel Lehman and Piero Molino and Felipe Petroski Such and Eric Frank and Alex Sergeev and Jason Yosinski2018
Paper summarydecodyngThis is a paper where I keep being torn between the response of “this is so simple it’s brilliant; why haven’t people done it before,” and “this is so simple it’s almost tautological, and the results I’m seeing aren’t actually that surprising”. The basic observation this paper makes is one made frequently before, most recently to my memory by Geoff Hinton in his Capsule Net paper: sometimes the translation invariance of convolutional networks can be a bad thing, and lead to worse performance. In a lot of ways, translation invariance is one of the benefits of using a convolutional architecture in the first place: instead of having to learn separate feature detectors for “a frog in this corner” and “a frog in that corner,” we can instead use the same feature detector, and just move it over different areas of the image. However, this paper argues, this makes convolutional networks perform worse than might naively be expected at tasks that require them to remember or act in accordance with coordinates of elements within an image.
For example, they find that normal convolutional networks take nearly an hour and 200K worth of parameters to learn to “predict” the one-hot encoding of a pixel, when given the (x,y) coordinates of that pixel as input, and only get up to about 80% accuracy. Similarly, trying to take an input image with only one pixel active, and predict the (x,y) coordinates as output, is something the network is able to do successfully, but only when the test points are sampled from the same spatial region as the training points: if the test points are from a held-out quadrant, the model can’t extrapolate to the (x, y) coordinates there, and totally falls apart.
The solution proposed by the authors is a really simple one: at one or more layers within the network, in addition to the feature channels sent up from the prior layer, add two addition channels: one with a with deterministic values going from -1 (left) to 1 (right), and the other going top to bottom. This essentially adds two fixed “features” to each pixel, which jointly carry information about where it is in space. Just by adding this small change, we give the network the ability to use spatial information or not, as it sees fit. If these features don’t prove useful, their weights will stay around their initialization values of expectation-zero, and the behavior should be much like a normal convolutional net. However, if it proves useful, convolution filters at the next layer can take position information into account. It’s easy to see how this would be useful for this paper’s toy problems: you can just create a feature detector for “if this pixel is active, pass forward information about it’s spatial position,” and predict the (x, y) coordinates out easily. You can also imagine this capability helping with more typical image classification problems, by having feature filters that carry with them not only content information, but information about where a pattern was found spatially.
The authors do indeed find comparable performance or small benefits to ImageNet, MNIST, and Atari RL, when applying their layers in lieu of normal convolutional layer. On GANs in particular, they find less mode collapse, though I don’t yet 100% follow the intuition of why this would be the case.
An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution
Felipe Petroski Such
arXiv e-Print archive - 2018 via Local arXiv
cs.CV, cs.LG, stat.ML
First published: 2018/07/09 (1 year ago) Abstract: Few ideas have enjoyed as large an impact on deep learning as convolution.
For any problem involving pixels or spatial representations, common intuition
holds that convolutional neural networks may be appropriate. In this paper we
show a striking counterexample to this intuition via the seemingly trivial
coordinate transform problem, which simply requires learning a mapping between
coordinates in (x,y) Cartesian space and one-hot pixel space. Although
convolutional networks would seem appropriate for this task, we show that they
fail spectacularly. We demonstrate and carefully analyze the failure first on a
toy problem, at which point a simple fix becomes obvious. We call this solution
CoordConv, which works by giving convolution access to its own input
coordinates through the use of extra coordinate channels. Without sacrificing
the computational and parametric efficiency of ordinary convolution, CoordConv
allows networks to learn either perfect translation invariance or varying
degrees of translation dependence, as required by the task. CoordConv solves
the coordinate transform problem with perfect generalization and 150 times
faster with 10--100 times fewer parameters than convolution. This stark
contrast raises the question: to what extent has this inability of convolution
persisted insidiously inside other tasks, subtly hampering performance from
within? A complete answer to this question will require further investigation,
but we show preliminary evidence that swapping convolution for CoordConv can
improve models on a diverse set of tasks. Using CoordConv in a GAN produced
less mode collapse as the transform between high-level spatial latents and
pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST
detection showed 24% better IOU when using CoordConv, and in the RL domain
agents playing Atari games benefit significantly from the use of CoordConv