Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1519 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Algorithms for Non-negative Matrix Factorization

Lee, Daniel D. and Seung, H. Sebastian

Neural Information Processing Systems Conference - 2000 via Local Bibsonomy

Keywords: dblp

Lee, Daniel D. and Seung, H. Sebastian

Neural Information Processing Systems Conference - 2000 via Local Bibsonomy

Keywords: dblp

[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $W$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So $$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$ Let's consider an example matrix where of three customers (as rows) are associated with three movies (the columns) by a rating value. $$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$ We can decompose this into two matrices with $r = 1$. First lets do this without any non-negative constraint using an SVD reshaping matrices based on removing eigenvalues: $$ W = \left[\begin{array}{c c c} -0.656 \\\ -0.652 \\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20\\\\ \end{array}\right] $$ We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$): $$ W = \left[\begin{array}{c c c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \\\\ \end{array}\right] $$ Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. $$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\ 4.33 & 4.08 & 2.09 \\\ 2.52 & 2.37 & 1.21 \\\ \end{array}\right] $$ If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better` #### Paper Contribution This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. The multiplicative approach is discussed in contrast to an additive gradient decent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$. ### Still a draft |

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Ren, Shaoqing and He, Kaiming and Girshick, Ross B. and Sun, Jian

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

Ren, Shaoqing and He, Kaiming and Girshick, Ross B. and Sun, Jian

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network, replace the last layer with a regression layer which essentially predicts pixel-wise if the object is there or not. An bounding box inference algorithm is added at last to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf)). The paper introduces RPNs (Region Proposal Networks). They are end-to-end trained to generate region proposals.They simoultaneously regress region bounds and bjectness scores at each location on a regular grid. RPNs are one type of fully convolutional networks. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score. ## See also * [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma) * [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17) |

Comparing Rewinding and Fine-tuning in Neural Network Pruning

Renda, Alex and Frankle, Jonathan and Carbin, Michael

International Conference on Learning Representations - 2020 via Local Bibsonomy

Keywords: dblp

Renda, Alex and Frankle, Jonathan and Carbin, Michael

International Conference on Learning Representations - 2020 via Local Bibsonomy

Keywords: dblp

[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. |

Robustness of classifiers: from adversarial to random noise

Alhussein Fawzi and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2016/08/31 (3 years ago)

**Abstract:** Several recent works have shown that state-of-the-art classifiers are
vulnerable to worst-case (i.e., adversarial) perturbations of the datapoints.
On the other hand, it has been empirically observed that these same classifiers
are relatively robust to random noise. In this paper, we propose to study a
\textit{semi-random} noise regime that generalizes both the random and
worst-case noise regimes. We propose the first quantitative analysis of the
robustness of nonlinear classifiers in this general noise regime. We establish
precise theoretical bounds on the robustness of classifiers in this general
regime, which depend on the curvature of the classifier's decision boundary.
Our bounds confirm and quantify the empirical observations that classifiers
satisfying curvature constraints are robust to random noise. Moreover, we
quantify the robustness of classifiers in terms of the subspace dimension in
the semi-random noise regime, and show that our bounds remarkably interpolate
between the worst-case and random noise regimes. We perform experiments and
show that the derived bounds provide very accurate estimates when applied to
various state-of-the-art deep neural networks and datasets. This result
suggests bounds on the curvature of the classifiers' decision boundaries that
we support experimentally, and more generally offers important insights onto
the geometry of high dimensional classification problems.
more
less

Alhussein Fawzi and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

[link]
Fawzi et al. study robustness in the transition from random samples to semi-random and adversarial samples. Specifically they present bounds relating the norm of an adversarial perturbation to the norm of random perturbations – for the exact form I refer to the paper. Personally, I find the definition of semi-random noise most interesting, as it allows to get an intuition for distinguishing random noise from adversarial examples. As in related literature, adversarial examples are defined as $r_S(x_0) = \arg\min_{x_0 \in S} \|r\|_2$ s.t. $f(x_0 + r) \neq f(x_0)$ where $f$ is the classifier to attack and $S$ the set of allowed perturbations (e.g. requiring that the perturbed samples are still images). If $S$ is mostly unconstrained regarding the direction of $r$ in high dimensional space, Fawzi et al. consider $r$ to be an adversarial examples – intuitively, and adversary can choose $r$ arbitrarily to fool the classifier. If, however, the directions considered in $S$ are constrained to an $m$-dimensional subspace, Fawzi et al. consider $r$ to be semi-random noise. In the extreme case, if $m = 1$, $r$ is random noise. In this case, we can intuitively think of $S$ as a randomly chosen one dimensional subspace – i.e. a random direction in multi-dimensional space. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens et al.

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CL, cs.AI, cs.LG

**First published:** 2016/09/26 (3 years ago)

**Abstract:** Neural Machine Translation (NMT) is an end-to-end learning approach for
automated translation, with the potential to overcome many of the weaknesses of
conventional phrase-based translation systems. Unfortunately, NMT systems are
known to be computationally expensive both in training and in translation
inference. Also, most NMT systems have difficulty with rare words. These issues
have hindered NMT's use in practical deployments and services, where both
accuracy and speed are essential. In this work, we present GNMT, Google's
Neural Machine Translation system, which attempts to address many of these
issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder
layers using attention and residual connections. To improve parallelism and
therefore decrease training time, our attention mechanism connects the bottom
layer of the decoder to the top layer of the encoder. To accelerate the final
translation speed, we employ low-precision arithmetic during inference
computations. To improve handling of rare words, we divide words into a limited
set of common sub-word units ("wordpieces") for both input and output. This
method provides a good balance between the flexibility of "character"-delimited
models and the efficiency of "word"-delimited models, naturally handles
translation of rare words, and ultimately improves the overall accuracy of the
system. Our beam search technique employs a length-normalization procedure and
uses a coverage penalty, which encourages generation of an output sentence that
is most likely to cover all the words in the source sentence. On the WMT'14
English-to-French and English-to-German benchmarks, GNMT achieves competitive
results to state-of-the-art. Using a human side-by-side evaluation on a set of
isolated simple sentences, it reduces translation errors by an average of 60%
compared to Google's phrase-based production system.
more
less

Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens et al.

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CL, cs.AI, cs.LG

[link]
This is a very techincal paper and I only covered items that interested me * Model * Encoder * 8 layers LSTM * bi-directional only first encoder layer * top 4 layers add input to output (residual network) * Decoder * same as encoder except all layers are just forward direction * encoder state is not passed as a start point to Decoder state * Attention * energy computed using NN with one hidden layer as appose to dot product or the usual practice of no hidden layer and $\tanh$ activation at the output layer * computed from output of 1st decoder layer * pre-feed to all layers * Training has two steps: ML and RL * ML (cross-entropy) training: * common wisdom, initialize all trainable parameters uniformly between [-0.04, 0.04] * clipping=5, batch=128 * Adam (lr=2e-4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps) * 12 async machines, each machine with 8 GPUs (K80) on which the model is spread X 6days * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets) * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on n-grams of size 1-4 * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$ * mean $r$ computed from $m=15$ samples * SGD, 400K steps, 3 days, no drouput * Prediction (i.e. Decoder) * beam search (3 beams) * A normalized score is computed to every beam that ended (died) * did not normalize beam score by $\text{beam_length}^\alpha , \alpha \in [0.6-0.7]$ * normalized with similar formula in which 5 is add to length and a coverage factor is added, which is the sum-log of attention weight of every input word (i.e. after summing over all output words) * Do a second pruning using normalized scores |

Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much

He, Bryan D. and Sa, Christopher De and Mitliagkas, Ioannis and Ré, Christopher

Neural Information Processing Systems Conference - 2016 via Local Bibsonomy

Keywords: dblp

He, Bryan D. and Sa, Christopher De and Mitliagkas, Ioannis and Ré, Christopher

Neural Information Processing Systems Conference - 2016 via Local Bibsonomy

Keywords: dblp

[link]
A study of how scan orders influence Mixing time in Gibbs sampling. This paper is interested in comparing the mixing rates of Gibbs sampling using either systematic scan or random updates. The basic contributions are two: First, in Section 2, a set of cases where 1) systematic scan is polynomially faster than random updates. Together with a previously known case where it can be slower this contradicts a conjecture that the speeds of systematic and random updates are similar. Secondly, (In Theorem 1) a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other). First, following from a recent paper by Roberts and Rosenthal, the authors construct several examples which do not satisfy the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide a result Theorem 1 which provides weaker bounds, which however they verify at least under some conditions. In fact the Theorem compares random scan to a lazy version of the systematic scan and shows that and obtains bounds in terms of various other quantities, like the minimum probability, or the minimum holding probability. MCMC is at the heart of many applications of modern machine learning and statistics. It is thus important to understand the computational and theoretical performance under various conditions. The present paper focused on examining systematic Gibbs sampling in comparison to random scan Gibbs. They do so first though the construction of several examples which challenge the dominant intuitions about mixing times, and develop theoretical bounds which are much wider than previously conjectured. |

Dynamic Routing Between Capsules.

Sara Sabour and Nicholas Frosst and Geoffrey E. Hinton

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

Sara Sabour and Nicholas Frosst and Geoffrey E. Hinton

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

[link]
An attention-based architecture for combining information from different convolutional layers. The attention values are calculated using an iterative process, making use of a custom squashing function. The evaluations on MNIST show robustness to affine transformations. |

Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

Tompson, Jonathan J. and Jain, Arjun and LeCun, Yann and Bregler, Christoph

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

Tompson, Jonathan J. and Jain, Arjun and LeCun, Yann and Bregler, Christoph

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

[link]
* They describe a model for human pose estimation, i.e. one that finds the joints ("skeleton") of a person in an image. * They argue that part of their model resembles a Markov Random Field (but in reality its implemented as just one big neural network). ### How * They have two components in their network: * Part-Detector: * Finds candidate locations for human joints in an image. * Pretty standard ConvNet. A few convolutional layers with pooling and ReLUs. * They use two branches: A fine and a coarse one. Both branches have practically the same architecture (convolutions, pooling etc.). The coarse one however receives the image downscaled by a factor of 2 (half width/height) and upscales it by a factor of 2 at the end of the branch. * At the end they merge the results of both branches with more convolutions. * The output of this model are 4 heatmaps (one per joint? unclear), each having lower resolution than the original image. * Spatial-Model: * Takes the results of the part detector and tries to remove all detections that were false positives. * They derive their architecture from a fully connected Markov Random Field which would be solved with one step of belief propagation. * They use large convolutions (128x128) to resemble the "fully connected" part. * They initialize the weights of the convolutions with joint positions gathered from the training set. * The convolutions are followed by log(), element-wise additions and exp() to resemble an energy function. * The end result are the input heatmaps, but cleaned up. ### Results * Beats all previous models (with and without spatial model). * Accuracy seems to be around 90% (with enough (16px) tolerance in pixel distance from ground truth). * Adding the spatial model adds a few percentage points of accuracy. * Using two branches instead of one (in the part detector) adds a bit of accuracy. Adding a third branch adds a tiny bit more. ![Results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__results.png?raw=true "Results") *Example results.* ![Part Detector](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__part_detector.png?raw=true "Part Detector") *Part Detector network.* ![Spatial Model](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__spatial_model.png?raw=true "Spatial Model") *Spatial Model (apparently only for two input heatmaps).* ------------------------- # Rough chapter-wise notes * (1) Introduction * Human Pose Estimation (HPE) from RGB images is difficult due to the high dimensionality of the input. * Approaches: * Deformable-part models: Traditionally based on hand-crafted features. * Deep-learning based disciminative models: Recently outperformed other models. However, it is hard to incorporate priors (e.g. possible joint- inter-connectivity) into the model. * They combine: * A part-detector (ConvNet, utilizes multi-resolution feature representation with overlapping receptive fields) * Part-based Spatial-Model (approximates loopy belief propagation) * They backpropagate through the spatial model and then the part-detector. * (3) Model * (3.1) Convolutional Network Part-Detector * This model locates possible positions of human key joints in the image ("part detector"). * Input: RGB image. * Output: 4 heatmaps, one per key joint (per pixel: likelihood). * They use a fully convolutional network. * They argue that applying convolutions to every pixel is similar to moving a sliding window over the image. * They use two receptive field sizes for their "sliding window": A large but coarse/blurry one, a small but fine one. * To implement that, they use two branches. Both branches are mostly identical (convolutions, poolings, ReLU). They simply feed a downscaled (half width/height) version of the input image into the coarser branch. At the end they upscale the coarser branch once and then merge both branches. * After the merge they apply 9x9 convolutions and then 1x1 convolutions to get it down to 4xHxW (H=60, W=90 where expected input was H=320, W=240). * (3.2) Higher-level Spatial-Model * This model takes the detected joint positions (heatmaps) and tries to remove those that are probably false positives. * It is a ConvNet, which tries to emulate (1) a Markov Random Field and (2) solving that MRF approximately via one step of belief propagation. * The raw MRF formula would be something like `<likelihood of joint A per px> = normalize( <product over joint v from joints V> <probability of joint A per px given a> * <probability of joint v at px?> + someBiasTerm)`. * They treat the probabilities as energies and remove from the formula the partition function (`normalize`) for various reasons (e.g. because they are only interested in the maximum value anyways). * They use exp() in combination with log() to replace the product with a sum. * They apply SoftPlus and ReLU so that the energies are always positive (and therefore play well with log). * Apparently `<probability of joint v at px?>` are the input heatmaps of the part detector. * Apparently `<probability of joint A per px given a>` is implemented as the weights of a convolution. * Apparently `someBiasTerm` is implemented as the bias of a convolution. * The convolutions that they use are large (128x128) to emulate a fully connected graph. * They initialize the convolution weights based on histograms gathered from the dataset (empirical distribution of joint displacements). * (3.3) Unified Models * They combine the part-based model and the spatial model to a single one. * They first train only the part-based model, then only the spatial model, then both. * (4) Results * Used datasets: FLIC (4k training images, 1k test, mostly front-facing and standing poses), FLIC-plus (17k, 1k ?), extended-LSP (10k, 1k). * FLIC contains images showing multiple persons with only one being annotated. So for FLIC they add a heatmap of the annotated body torso to the input (i.e. the part-detector does not have to search for the person any more). * The evaluation metric roughly measures, how often predicted joint positions are within a certain radius of the true joint positions. * Their model performs significantly better than competing models (on both FLIC and LSP). * Accuracy seems to be at around 80%-95% per joint (when choosing high enough evaluation tolerance, i.e. 10px+). * Adding the spatial model to the part detector increases the accuracy by around 10-15 percentage points. * Training the part detector and the spatial model jointly adds ~3 percentage points accuracy over training them separately. * Adding the second filter bank (coarser branch in the part detector) adds around 5 percentage points accuracy. Adding a third filter bank adds a tiny bit more accuracy. |

Sequence to Sequence Learning with Neural Networks

Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V.

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V.

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

[link]
#### Introduction * The paper proposes a general and end-to-end approach for sequence learning that uses two deep LSTMs, one to map input sequence to vector space and another to map vector to the output sequence. * For sequence learning, Deep Neural Networks (DNNs) requires the dimensionality of input and output sequences be known and fixed. This limitation is overcome by using the two LSTMs. * [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) #### Model * Recurrent Neural Networks (RNNs) generalizes feed forward neural networks to sequences. * Given a sequence of inputs $(x_{1}, x_{2}...x_{t})$, RNN computes a sequence of outputs $(y_1, y_2...y_t')$ by iterating over the following equation: $$h_t = sigm(W^{hx}x_t + W^{hh} h_{t-1})$$ $$y^{t} = W^{yh}h_{t}$$ * To map variable length sequences, the input is mapped to a fixed size vector using an RNN and this fixed size vector is mapped to output sequence using another RNN. * Given the long-term dependencies between the two sequences, LSTMs are preferred over RNNs. * LSTMs estimate the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed dimensional representation and then computing the probability of output with a standard LST-LM formulation. ##### Differences between the model and standard LSTMs * The model uses two LSTMs (one for input sequence and another for output sequence), thereby increasing the number of model parameters at negligible computing cost. * Model uses Deep LSTMs (4 layers). * The words in the input sequences are reversed to introduce short-term dependencies and to reduce the "minimal time lag". By reversing the word order, the first few words in the source sentence (input sentence) are much closer to first few words in the target sentence (output sentence) thereby making it easier for LSTM to "establish" communication between input and output sentences. #### Experiments * WMT'14 English to French dataset containing 12 million sentences consisting of 348 million French words and 304 million English words. * Model tested on translation task and on the task of re-scoring the n-best results of baseline approach. * Deep LSTMs trained in sentence pairs by maximizing the log probability of a correct translation $T$, given the source sentence $S$ * The training objective is to maximize this log probability, averaged over all the pairs in the training set. * Most likely translation is found by performing a simple, left-to-right beam search for translation. * A hard constraint is enforced on the norm of the gradient to avoid the exploding gradient problem. * Min batches are selected to have sentences of similar lengths to reduce training time. * Model performs better when reversed sentences are used for training. * While the model does not beat the state-of-the-art, it is the first pure neural translation system to outperform a phrase-based SMT baseline. * The model performs well on long sentences as well with only a minor degradation for the largest sentences. * The paper prepares ground for the application of sequence-to-sequence based learning models in other domains by demonstrating how a simple and relatively unoptimised neural model could outperform a mature SMT system on translation tasks. |

Spatial Transformer Networks

Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and Kavukcuoglu, Koray

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and Kavukcuoglu, Koray

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

[link]
This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes re-sampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and non-rigid deformation whose paramerters are trained end-to-end with the rest of the model. The resulting re-sampling grid is then used to create a new representation of the underlying signal through bi-linear or nearest neighbor interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameter localize the attention area explicitly, fine data resolution is restricted to areas important for the task. Furthermore, the model improves over previous state-of-the-art on a number of tasks. The layer has one mini neural network that regresses on the parameters of a parametric transformation, e.g. affine), then there is a module that applies the transformation to a regular grid and a third more or less "reads off" the values in the transformed positions and maps them to a regular grid, hence under-forming the image or previous layer. Gradients for back-propagation in a few cases are derived. The results are mostly of the classic deep learning variety, including mnist and svhn, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases. |

About