Spatial Transformer Networks
Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and Kavukcuoglu, Koray, 2015

#### Problem addressed:
A module to spatially transform feature maps conditioned on the feature maps themselves. Attempts to improve rotation, scale and shift invariance in neural nets.
#### Summary:
This paper introduces Spatial Transformer Networks (STNs) for rotation, shift and scale invariance. The module consists of three parts - a localization function, grid point generation and sampling. Each of these parts is differentiable, and the module can be inserted at any point in a standard neural network architecture. The constraint is that the learnt spatial transform must be parametrized. The localization function learns these parameters by looking at the previous layer's output (typically HxWxC for convolutional layers) and regressing to the parameters using FC or convolutional layers. The source grid generator parameters are learnt the same way. Given these two, the output of the STN is constructed by sampling the source grid (using any differentiable kernel) according to the transform parameters.
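The grid-generation and sampling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under an affine parametrization, not the authors' implementation; the function names are my own, and normalized coordinates in [-1, 1] follow the paper's convention:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map a regular HxW target grid through a 2x3 affine matrix theta,
    yielding source sampling coordinates in normalized [-1, 1] space."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    tgt = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    src = theta @ tgt                                         # (2, H*W)
    return src[0].reshape(H, W), src[1].reshape(H, W)

def bilinear_sample(U, xs, ys):
    """Sample feature map U (H, W) at normalized coordinates with a
    bilinear (differentiable) kernel."""
    H, W = U.shape
    x = (xs + 1) * (W - 1) / 2  # back to pixel coordinates
    y = (ys + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    return (U[y0, x0] * (1 - wx) * (1 - wy)
            + U[y0, x0 + 1] * wx * (1 - wy)
            + U[y0 + 1, x0] * (1 - wx) * wy
            + U[y0 + 1, x0 + 1] * wx * wy)

# Identity transform: the STN output reproduces the input feature map.
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
U = np.arange(16, dtype=float).reshape(4, 4)
xs, ys = affine_grid(theta, 4, 4)
V = bilinear_sample(U, xs, ys)
```

In a full STN, `theta` would be regressed by the localization network from the previous layer's output rather than fixed by hand.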
#### Novelty:
A new module is introduced to increase invariance to rotation, scale and shift.
#### Drawbacks:
Since only some points in the source feature maps are selected by the grid generator, it is unclear how the error is backpropagated to previous layers.
#### Datasets:
Distorted MNIST, CUB-200-2011 Birds, SVHN
#### Resources:
http://arxiv.org/pdf/1506.02025v1.pdf
#### Presenter:
Bhargava U. Kota

TLDR; The authors introduce a new spatial transformation module that can be inserted into any neural network. The module consists of a localization network that predicts transformation parameters, a grid generator that chooses a sampling grid from the input, and a sampler that produces the output. Possible learned transformations include cropping, translation, rotation, scaling and attention. The module can be trained end-to-end using backpropagation. The authors evaluate the module on both CNNs and MLPs, achieving state-of-the-art results on distorted MNIST data, street view house numbers, and fine-grained bird classification.
#### Key Points:
- STNs can be inserted between any layers, typically after the input or after extracted features. The transform is dynamic, conditioned on the input data.
- The module is fast and doesn't adversely impact training speed.
- The actual transformation parameters (output of localization network) can be fed into higher layers.
- Attention can be seen as a special transformation that increases computational efficiency.
- Can also be applied to RNNs, but more investigation is needed.

This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes re-sampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and non-rigid deformations, whose parameters are trained end-to-end with the rest of the model. The resulting re-sampling grid is then used to create a new representation of the underlying signal through bilinear or nearest-neighbour interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameters localize the attention area explicitly, and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over previous state-of-the-art on a number of tasks.
The layer has one mini neural network that regresses on the parameters of a parametric transformation (e.g. affine), then a module that applies the transformation to a regular grid, and a third that more or less "reads off" the values at the transformed positions and maps them back to a regular grid, hence un-deforming the image or previous layer. Gradients for back-propagation are derived for a few cases. The results are mostly of the classic deep learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases.
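The derived gradients hinge on the sampling kernel admitting (sub-)gradients with respect to the sampling coordinates. A small NumPy check of this, using the paper's bilinear kernel k(d) = max(0, 1 - |d|) at a single point; this is my own sketch for illustration, with coordinates in pixel units:

```python
import numpy as np

def sample_point(U, x, y):
    """Bilinear-kernel sampling of a single point:
    V = sum_{n,m} U[n, m] * k(x - m) * k(y - n), k(d) = max(0, 1 - |d|)."""
    H, W = U.shape
    m, n = np.arange(W), np.arange(H)
    kx = np.maximum(0.0, 1.0 - np.abs(x - m))  # (W,)
    ky = np.maximum(0.0, 1.0 - np.abs(y - n))  # (H,)
    return ky @ U @ kx

def dV_dx(U, x, y):
    """Analytic sub-gradient of the sampled value w.r.t. the x coordinate:
    dk/dx is 0 where |x - m| >= 1, +1 where m >= x, -1 where m < x."""
    H, W = U.shape
    m, n = np.arange(W), np.arange(H)
    ky = np.maximum(0.0, 1.0 - np.abs(y - n))
    dkx = np.where(np.abs(x - m) >= 1, 0.0, np.where(m >= x, 1.0, -1.0))
    return ky @ U @ dkx

# Central finite difference agrees with the analytic sub-gradient
# away from the kernel's kink points.
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 5))
x, y, eps = 2.3, 1.7, 1e-6
num = (sample_point(U, x + eps, y) - sample_point(U, x - eps, y)) / (2 * eps)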

This paper introduces a neural networks module that can learn input-dependent
spatial transformations and can be inserted into any neural network. It supports
transformations like scaling, cropping, rotations, and non-rigid deformations.
Main contributions:
- The spatial transformer network consists of the following:
  - Localization network that regresses to the transformation parameters given the input.
  - Grid generator that uses the transformation parameters to produce a grid to sample from the input.
  - Sampler that produces the output feature map sampled from the input at the grid points.
- Differentiable sampling mechanism
  - The sampling is written in a way such that sub-gradients can be defined with respect to grid coordinates.
  - This enables gradients to be propagated through the grid generator and localization network, and for the network to jointly learn the spatial transformer along with the rest of the network.
- A network can have multiple STNs
  - at different points in the network, to model incremental transformations at different levels of abstraction.
  - in parallel, to learn to focus on different regions of interest. For example, on the bird classification task, they show that one STN learns to be a head detector, while the other focuses on the central part of the body.
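The attention-style special case mentioned above corresponds to restricting theta to isotropic scale plus translation, [[s, 0, tx], [0, s, ty]], so each STN attends to a window of the input at reduced resolution. A minimal NumPy sketch of this restricted case (my own illustration, using nearest-neighbour read-off to stay short; in the paper the sampler is bilinear):

```python
import numpy as np

def attention_crop(U, s, tx, ty, out_h, out_w):
    """Attention as a restricted affine transform theta = [[s,0,tx],[0,s,ty]]:
    the regular target grid maps to a scaled, shifted window of the input.
    Nearest-neighbour read-off for brevity."""
    H, W = U.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h), np.linspace(-1, 1, out_w),
                         indexing="ij")
    sx, sy = s * xs + tx, s * ys + ty  # source coords in [-1, 1]
    px = np.clip(np.rint((sx + 1) * (W - 1) / 2).astype(int), 0, W - 1)
    py = np.clip(np.rint((sy + 1) * (H - 1) / 2).astype(int), 0, H - 1)
    return U[py, px]

U = np.arange(36, dtype=float).reshape(6, 6)
# s = 0.5 attends to the central half of the image at reduced resolution,
# which is where the computational-efficiency benefit comes from.
V = attention_crop(U, 0.5, 0.0, 0.0, 3, 3)
```

With `s` and the translations produced by a localization network per input, each parallel STN can settle on a different region, as in the head/body detectors on the birds task.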
## Strengths
- Their attention (and by extension transformation) mechanism is differentiable
as opposed to earlier works on non-differentiable attention mechanisms that used
reinforcement learning (REINFORCE). It also supports a richer variety of
transformations as opposed to earlier works on learning transformations, like DRAW.
- State-of-the-art classification performance on distorted MNIST, SVHN, CUB-200-2011.
## Weaknesses / Notes
This is a really nice way to generalize spatial transformations in a differentiable
manner so the model can be trained end-to-end. Classification performance, and more
importantly, qualitative results of the kind of transformations learnt on larger datasets
(like ImageNet) should be evaluated.
