Spatial Transformer Networks
Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and Kavukcuoglu, Koray
2015

Paper summary by dennybritz

TLDR; The authors introduce a new spatial transformer module that can be inserted into any neural network. The module consists of a localization network that predicts transformation parameters, a grid generator that produces a sampling grid over the input, and a sampler that produces the output. Possible learned transformations include cropping, translation, rotation, scaling, and attention. The module can be trained end-to-end using backpropagation. The authors evaluate the module on both CNNs and MLPs, achieving state-of-the-art results on distorted MNIST data, Street View House Numbers, and fine-grained bird classification.
#### Key Points:
- STMs can be inserted between any layers, typically after the input or extracted features. The transform is dynamic and happens based on the input data.
- The module is fast and doesn't adversely impact training speed.
- The actual transformation parameters (output of localization network) can be fed into higher layers.
- Attention can be seen as a special transformation that increases computational efficiency.
- Can also be applied to RNNs, but more investigation is needed.
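The grid-generator step described above can be sketched concretely. Below is a minimal NumPy sketch (not the authors' code; `affine_grid` is a hypothetical helper name): it maps a regular output grid through a 2×3 affine matrix, using coordinates normalized to [-1, 1] as in the paper.

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map a regular H x W output grid through a 2x3 affine matrix theta.

    Coordinates are normalized to [-1, 1], so the transform is
    independent of the input resolution.
    """
    ys, xs = np.meshgrid(
        np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    # Homogeneous target coordinates, shape (3, H*W).
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    # Source sampling positions (x_s, y_s) for each output pixel.
    src = theta @ coords
    return src.reshape(2, H, W)

# With the identity transform, the sampling grid is the output grid itself;
# a localization network would instead regress theta from the input.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 4, 4)
```

Cropping, translation, rotation, and (isotropic) scaling are all special cases of this single 6-parameter matrix, which is why one small localization network suffices for the transformations listed above.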

This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes re-sampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotation, and non-rigid deformation, whose parameters are trained end-to-end with the rest of the model. The resulting re-sampling grid is then used to create a new representation of the underlying signal through bilinear or nearest-neighbor interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameters localize the attention area explicitly, and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over the previous state of the art on a number of tasks.
The layer has one mini neural network that regresses the parameters of a parametric transformation (e.g. affine), then a module that applies the transformation to a regular grid, and a third that more or less "reads off" the values at the transformed positions and maps them back to a regular grid, hence un-deforming the image or previous layer. Gradients for back-propagation are derived for a few cases. The results are mostly of the classic deep learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers lead to improved results in all cases.
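The "reading off" step is a differentiable bilinear sampler. Here is a minimal NumPy sketch (an assumed helper, not the paper's implementation) that samples an image at the positions produced by the grid module; because each output value is a piecewise-linear function of the sampling coordinates, gradients flow back to the transformation parameters.

```python
import numpy as np

def bilinear_sample(img, grid):
    """Sample img (H, W) at positions grid (2, H', W') given in
    normalized [-1, 1] coordinates, using bilinear interpolation."""
    H, W = img.shape
    # Convert normalized coordinates back to pixel coordinates.
    x = (grid[0] + 1) * (W - 1) / 2
    y = (grid[1] + 1) * (H - 1) / 2
    # Corner indices of the 2x2 neighborhood, clipped to the image.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    # Fractional offsets act as interpolation weights.
    wx = x - x0
    wy = y - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Sampling with an identity grid reproduces the input exactly; a non-identity grid produces the warped (e.g. cropped or rotated) feature map that the next layer consumes.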
