Spatial Transformer Networks
Jaderberg, Max; Simonyan, Karen; Zisserman, Andrew; Kavukcuoglu, Koray. 2015.
Paper summary
#### Problem addressed:
A module to spatially transform feature maps, conditioned on the feature maps themselves. It attempts to improve rotation, scale and shift invariance in neural networks.
#### Summary:
This paper introduces Spatial Transformer Networks (STNs) for rotation, shift and scale invariance. The module consists of three parts: a localization network, a grid generator and a sampler. Each part is differentiable, so the module can be inserted at any point in a standard neural network architecture. The one constraint is that the learnt spatial transform must be parametrized. The localization network predicts these parameters by looking at the previous layer's output (typically HxWxC for convolutional layers) and regressing to them with fully connected or convolutional layers. The grid generator then applies the predicted transform to a regular grid over the output, yielding sampling points in the source feature map. Given these, the output of the STN is constructed by sampling the source feature map at those points with any differentiable kernel (e.g. bilinear).
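The grid-generator and sampler steps can be sketched in a few lines of NumPy. This is a minimal single-channel illustration, not the paper's implementation: the function names, the fixed 2x3 affine parametrization and the hard-coded bilinear kernel are my simplifications, and `theta` stands in for what the localization network would predict.

```python
import numpy as np

def affine_grid(theta, H, W):
    # theta: 2x3 affine parameters (would come from the localization network).
    # Build normalized target coordinates in [-1, 1] and map them to source coords.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    tgt = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)  # 3 x (H*W)
    src = theta @ tgt                                                  # 2 x (H*W)
    return src.reshape(2, H, W)

def bilinear_sample(U, grid):
    # U: H x W input feature map; grid: source coords in [-1, 1], shape 2 x H x W.
    H, W = U.shape
    x = (grid[0] + 1) * (W - 1) / 2   # back to pixel coordinates
    y = (grid[1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    # Interpolate horizontally on the two rows, then vertically between them.
    top = (1 - wx) * U[y0, x0] + wx * U[y0, x1]
    bot = (1 - wx) * U[y1, x0] + wx * U[y1, x1]
    return (1 - wy) * top + wy * bot

# Sanity check: the identity transform should reproduce the input.
theta_id = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
U = np.arange(16, dtype=float).reshape(4, 4)
V = bilinear_sample(U, affine_grid(theta_id, 4, 4))
```

Every step here is a composition of differentiable (or piecewise-differentiable) operations on `theta` and `U`, which is what lets the whole module be trained with standard back-propagation.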
#### Novelty:
A new module is introduced to increase invariance to rotation, scale and shift.
#### Drawbacks:
Since only some points in the source feature map are selected during grid generation, it is unclear how well the error is backpropagated to previous layers.
#### Datasets:
Distorted MNIST, CUB-200-2011 Birds, SVHN
#### Resources:
http://arxiv.org/pdf/1506.02025v1.pdf
#### Presenter:
Bhargava U. Kota
This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes re-sampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotation and non-rigid deformation, whose parameters are trained end-to-end with the rest of the model. The resulting re-sampling grid is then used to create a new representation of the underlying signal through bilinear or nearest-neighbour interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object; the transformation parameters localize the attention area explicitly; and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over the previous state of the art on a number of tasks.
The layer contains one mini neural network that regresses the parameters of a parametric transformation (e.g. affine); a second module applies the transformation to a regular grid; and a third more or less "reads off" the values at the transformed positions and maps them back to a regular grid, hence un-deforming the image or previous layer. Gradients for back-propagation are derived for a few cases. The results are mostly of the classic deep-learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases.
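For the bilinear kernel those gradients have a simple closed form. The sketch below is my reconstruction from memory of the paper's bilinear case, with $U_{nm}$ the input map, $V_i$ the $i$-th output value, and $(x_i, y_i)$ the (pixel-space) source coordinates produced by the grid generator:

$$V_i = \sum_{n}^{H} \sum_{m}^{W} U_{nm}\,\max(0,\, 1-|x_i - m|)\,\max(0,\, 1-|y_i - n|)$$

$$\frac{\partial V_i}{\partial U_{nm}} = \max(0,\, 1-|x_i - m|)\,\max(0,\, 1-|y_i - n|)$$

$$\frac{\partial V_i}{\partial x_i} = \sum_{n}^{H} \sum_{m}^{W} U_{nm}\,\max(0,\, 1-|y_i - n|)\,\begin{cases} 0 & |m - x_i| \ge 1 \\ 1 & m \ge x_i \\ -1 & m < x_i \end{cases}$$

and symmetrically for $\partial V_i / \partial y_i$. The coordinate gradients are piecewise constant but defined almost everywhere, which is what allows the loss to flow back into the localization network's parameters.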