[link]
TL;DR: The authors introduce a new spatial transformation module that can be inserted into any neural network. The module consists of a localization network that predicts transformation parameters, a grid generator that produces a sampling grid over the input, and a sampler that produces the output. Possible learned transformations include cropping, translation, rotation, scaling, and attention. The module can be trained end-to-end using backpropagation. The authors evaluate the module on both CNNs and MLPs, achieving state-of-the-art results on distorted MNIST data, Street View House Numbers, and fine-grained bird classification.

#### Key Points:

- STMs can be inserted between any layers, typically after the input or extracted features. The transform is dynamic and happens based on the input data.
- The module is fast and doesn't adversely impact training speed.
- The actual transformation parameters (output of the localization network) can be fed into higher layers.
- Attention can be seen as a special transformation that increases computational efficiency.
- Can also be applied to RNNs, but more investigation is needed.
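The attention point above can be made concrete: restricting the affine matrix to isotropic scale plus translation and sampling onto a smaller output grid yields a differentiable crop/zoom, which is where the efficiency gain comes from. A minimal numpy sketch (the function name and the nearest-neighbour choice are mine, not the paper's):

```python
import numpy as np

def attention_crop(U, scale, tx, ty, out_h, out_w):
    """Crop/zoom via an attention-restricted affine transform
    theta = [[s, 0, tx], [0, s, ty]], nearest-neighbour sampling.
    Coordinates are normalized to [-1, 1]."""
    H, W = U.shape
    # regular grid over the (smaller) output
    ys = np.linspace(-1, 1, out_h)
    xs = np.linspace(-1, 1, out_w)
    xt, yt = np.meshgrid(xs, ys)
    # map target coords to source coords
    xs_src = scale * xt + tx
    ys_src = scale * yt + ty
    # normalized coords -> pixel indices, nearest neighbour
    col = np.clip(np.round((xs_src + 1) * (W - 1) / 2), 0, W - 1).astype(int)
    row = np.clip(np.round((ys_src + 1) * (H - 1) / 2), 0, H - 1).astype(int)
    return U[row, col]

U = np.arange(16.0).reshape(4, 4)
full = attention_crop(U, 1.0, 0.0, 0.0, 4, 4)   # identity: whole image
crop = attention_crop(U, 0.5, 0.0, 0.0, 2, 2)   # zoom into the centre at half resolution
```

Because the output grid is smaller than the input, downstream layers operate on fewer activations, which is the computational-efficiency argument for attention-style transforms.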
[link]
This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes resampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations, and non-rigid deformations, whose parameters are trained end-to-end with the rest of the model. The resulting resampling grid is then used to create a new representation of the underlying signal through bilinear or nearest-neighbor interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameters localize the attention area explicitly, and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over the previous state of the art on a number of tasks. The layer has one mini neural network that regresses the parameters of a parametric transformation (e.g. affine), then there is a module that applies the transformation to a regular grid, and a third more or less "reads off" the values in the transformed positions and maps them to a regular grid, hence un-deforming the image or previous layer. Gradients for backpropagation are derived for a few cases. The results are mostly of the classic deep-learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases.
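The three-step pipeline described here (regress parameters, transform a regular grid, read off values) can be sketched in a few lines of numpy. This is an illustration under simplifying assumptions (single-channel input, affine transform, clamped edges), not the paper's implementation:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map a regular HxW target grid through the 2x3 affine theta,
    returning source coordinates in the normalized [-1, 1] range."""
    xs = np.linspace(-1, 1, W)
    ys = np.linspace(-1, 1, H)
    xt, yt = np.meshgrid(xs, ys)
    tgt = np.stack([xt, yt, np.ones_like(xt)], axis=-1)  # (H, W, 3)
    src = tgt @ theta.T                                  # (H, W, 2)
    return src[..., 0], src[..., 1]

def bilinear_sample(U, xs_src, ys_src):
    """Read off U at real-valued source coords with bilinear weights."""
    H, W = U.shape
    x = (xs_src + 1) * (W - 1) / 2   # normalized -> pixel coords
    y = (ys_src + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * (1 - wx) * U[y0, x0] + (1 - wy) * wx * U[y0, x0 + 1]
            + wy * (1 - wx) * U[y0 + 1, x0] + wy * wx * U[y0 + 1, x0 + 1])

U = np.arange(9.0).reshape(3, 3)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
out = bilinear_sample(U, *affine_grid(identity, 3, 3))  # reproduces U
```

In the full model, `theta` would come from the mini localization network rather than being fixed; everything downstream of it is plain array arithmetic, which is why gradients pass through.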
[link]
This paper introduces a neural network module that can learn input-dependent spatial transformations and can be inserted into any neural network. It supports transformations like scaling, cropping, rotations, and non-rigid deformations. Main contributions:

- The spatial transformer network consists of the following:
  - Localization network that regresses to the transformation parameters given the input.
  - Grid generator that uses the transformation parameters to produce a grid to sample from the input.
  - Sampler that produces the output feature map sampled from the input at the grid points.
- Differentiable sampling mechanism
  - The sampling is written in a way such that sub-gradients can be defined with respect to the grid coordinates.
  - This enables gradients to be propagated through the grid generator and localization network, and the network to jointly learn the spatial transformer along with the rest of the network.
- A network can have multiple STNs
  - at different points in the network, to model incremental transformations at different levels of abstraction.
  - in parallel, to learn to focus on different regions of interest. For example, on the bird classification task, they show that one STN learns to be a head detector, while the other focuses on the central part of the body.

## Strengths

- Their attention (and by extension transformation) mechanism is differentiable, as opposed to earlier works on non-differentiable attention mechanisms that used reinforcement learning (REINFORCE). It also supports a richer variety of transformations than earlier works on learning transformations, like DRAW.
- State-of-the-art classification performance on distorted MNIST, SVHN, and CUB-200-2011.

## Weaknesses / Notes

- This is a really nice way to generalize spatial transformations in a differentiable manner so the model can be trained end-to-end.
- Classification performance, and more importantly, qualitative results of the kinds of transformations learnt, should also be evaluated on larger datasets (like ImageNet).
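The differentiable-sampling contribution in the summary above can be checked concretely: because the bilinear kernel is piecewise linear, a sub-gradient of the output with respect to a grid coordinate exists almost everywhere, and it agrees with a finite-difference estimate. A numpy sketch in pixel coordinates (illustrative function, not the authors' code):

```python
import numpy as np

def sample_and_grad_x(U, x, y):
    """Bilinear sample at pixel coords (x, y), plus the analytic
    sub-gradient dV/dx of the piecewise-linear bilinear kernel."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    wx, wy = x - x0, y - y0
    V = ((1 - wy) * (1 - wx) * U[y0, x0] + (1 - wy) * wx * U[y0, x0 + 1]
         + wy * (1 - wx) * U[y0 + 1, x0] + wy * wx * U[y0 + 1, x0 + 1])
    # the y-weights are unchanged; the x-kernel has slope +/-1 inside a cell
    dVdx = ((1 - wy) * (U[y0, x0 + 1] - U[y0, x0])
            + wy * (U[y0 + 1, x0 + 1] - U[y0 + 1, x0]))
    return V, dVdx

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 5))
x, y = 2.3, 1.7
_, analytic = sample_and_grad_x(U, x, y)
eps = 1e-6
numeric = (sample_and_grad_x(U, x + eps, y)[0]
           - sample_and_grad_x(U, x - eps, y)[0]) / (2 * eps)
```

Since the grid coordinates are themselves a (differentiable) function of the localization network's parameters, this is the link that lets the whole module train jointly with the rest of the network.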
[link]
#### Problem addressed:

A module to spatially transform feature maps, conditioned on the feature maps themselves. Attempts to improve rotation, scale, and shift invariance in neural nets.

#### Summary:

This paper introduces Spatial Transformer Networks (STN) for rotation, shift, and scale invariance. The module consists of three parts: a localization function, grid point generation, and sampling. Each of these parts is differentiable, and the module can be inserted at any point in a standard neural network architecture. The constraint is that the learnt spatial transform must be parametrized. The localization function learns these parameters by looking at the previous layer's output (typically HxWxC for convolutional layers) and regressing to the parameters using FC layers or convolutional layers. The source grid generator parameters are learnt the same way. Given these two, the output of the STN is constructed by sampling the source grid (using any differentiable kernel) according to the transform parameters.

#### Novelty:

A new module is introduced to increase invariance to rotation, scale, and shift.

#### Drawbacks:

Since only some points in the source feature maps are selected during grid generation, it is unclear how the error is backpropagated to previous layers.

#### Datasets:

Distorted MNIST, CUB-200-2011 Birds, SVHN

#### Resources:

http://arxiv.org/pdf/1506.02025v1.pdf

#### Presenter:

Bhargava U. Kota
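On the drawback raised in the summary above: the error does reach previous layers, because each sampled output value is a weighted sum of input values under the sampling kernel, so the gradient with respect to the input is just the kernel weight, nonzero only at the sampled points' neighbours (four of them for a bilinear kernel). A numpy sketch for a single output sample (illustrative names, not from the paper):

```python
import numpy as np

def bilinear_grad_wrt_input(U_shape, x, y, upstream):
    """Scatter the upstream gradient of one output sample back to the
    four input pixels that produced it; all other entries stay zero."""
    H, W = U_shape
    grad = np.zeros((H, W))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    wx, wy = x - x0, y - y0
    grad[y0, x0]         += (1 - wy) * (1 - wx) * upstream
    grad[y0, x0 + 1]     += (1 - wy) * wx * upstream
    grad[y0 + 1, x0]     += wy * (1 - wx) * upstream
    grad[y0 + 1, x0 + 1] += wy * wx * upstream
    return grad

g = bilinear_grad_wrt_input((4, 4), 1.5, 2.25, 1.0)
```

Summed over all output samples, this gives a sparse but well-defined gradient map for the previous layer; feature-map locations that no grid point lands near simply receive zero gradient.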