The proposed approach consists of corrupting the training targets with noise derived from the task reward while doing maximum likelihood training. This simple but specific smoothing of the target distribution significantly boosts the performance of neural structured output prediction, as showcased on TIMIT phone recognition and machine translation tasks. The paper also makes clear the link between this approach and RL-based expected reward maximization.
Prior work has chosen either maximum likelihood learning, which is relatively tractable but assumes a log-likelihood loss, or reinforcement learning, which can be performed for a task-specific loss function but requires sampling many predictions to estimate gradients. The proposed objective bridges the gap with "reward-augmented maximum likelihood," which is similar to maximum likelihood but estimates the expected loss with samples drawn in proportion to their exponentiated reward, so outputs closer to the ground truth are sampled more often. Empirical results show good improvements with LSTM-based predictors on speech recognition and machine translation benchmarks relative to maximum likelihood training.
This work is inspired by recent advances in reinforcement learning and likelihood learning. The authors propose learning parameters so as to minimize the KL divergence between CRFs and a probability model that is proportional to the reward function (which the authors call the payoff distribution; see Equation 4). They suggest an optimization algorithm for this KL-divergence minimization that relies on sampling from the payoff distribution.
Current methods for learning a structured prediction model include max-margin optimisation and reinforcement learning. However, the max-margin approach only optimises a bound on the true reward and requires loss-augmented inference to obtain gradients, which can be expensive. Reinforcement learning, on the other hand, does not make use of available supervision, can therefore struggle when the reward is sparse, and suffers from high-variance gradients.

The paper proposes a novel approach to learning for structured prediction problems and relates it to maximum likelihood (ML) learning and reinforcement learning (RL): ML optimises the KL divergence of a delta distribution relative to the model distribution, and RL optimises the KL divergence of the model distribution relative to the exponentiated reward distribution. The proposed reward-augmented maximum likelihood (RAML) learning optimises the KL divergence of the exponentiated reward distribution relative to the model distribution. Compared to RL, the arguments of the KL divergence are swapped; compared to ML, the delta distribution is generalised to the exponentiated reward distribution.

Training is cheap in RAML learning: it is only necessary to sample from the output set according to the exponentiated reward distribution. All experiments are performed on speech recognition and machine translation, where the structure over the output set is defined by the edit distance. An improvement is demonstrated over simple ML.
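To make the training recipe concrete, here is a minimal sketch of what RAML training could look like for sequence targets under an edit-distance reward; the sample-by-number-of-edits scheme and all names are illustrative, not the paper's exact algorithm:

```python
import math
import random

def sample_augmented_target(y, vocab, tau=0.9, max_edits=3):
    """Sample a corrupted target y_tilde with probability roughly
    proportional to exp(-edit_distance(y_tilde, y) / tau):
    first draw a number of edits, then apply that many random
    substitutions to the ground truth."""
    weights = [math.exp(-d / tau) for d in range(max_edits + 1)]
    n_edits = random.choices(range(max_edits + 1), weights=weights)[0]
    y_tilde = list(y)
    for pos in random.sample(range(len(y)), min(n_edits, len(y))):
        y_tilde[pos] = random.choice(vocab)  # random substitution
    return y_tilde

# RAML then performs ordinary maximum likelihood on the corrupted target:
#   loss = -log p_model(y_tilde | x)   instead of   -log p_model(y | x)
```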
The paper ([arxiv](https://arxiv.org/abs/1610.09716)) introduces DCNNs (Doubly Convolutional Neural Networks). These are CNNs containing a new layer type that generalizes convolutional layers.
CNNs tend to learn many filters that are similar to other learned filters in the same layer, with the weights only slightly shifted.
The idea of double convolution is to learn groups of filters where the filters within each group are translated versions of each other. To achieve this, a doubly convolutional layer allocates a set of meta filters whose filter sizes are larger than the effective filter size. Effective filters can then be extracted from each meta filter, which corresponds to convolving the meta filters with an identity kernel. All the extracted filters are then concatenated and convolved with the input.
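A minimal sketch of the extraction step, assuming PyTorch and illustrative tensor shapes (not the authors' code):

```python
import torch
import torch.nn.functional as F

def doubly_convolutional(x, meta_filters, s):
    """x: (N, C_in, H, W); meta_filters: (K, C_in, Z, Z) with Z > s.
    Extracts every s-by-s crop of each meta filter (equivalent to
    convolving the meta filters with an identity kernel), concatenates
    the crops into one filter bank, and convolves the input with it."""
    K, C_in, Z, _ = meta_filters.shape
    crops = [meta_filters[:, :, i:i + s, j:j + s]
             for i in range(Z - s + 1) for j in range(Z - s + 1)]
    effective = torch.cat(crops, dim=0)   # ((Z-s+1)**2 * K, C_in, s, s)
    return F.conv2d(x, effective, padding=s // 2)
```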
> We have also confirmed that replacing a convolutional layer with a doubly convolutional layer consistently improves the performance, regardless of the depth of the layer.
* CIFAR-10+: 7.24% error
* CIFAR-100+: 26.53% error
* ImageNet: 8.23% Top-5 error
The k-translation correlation is effectively a [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). I think the authors should have mentioned that.
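As I read the paper, the k-translation correlation of two filters is the maximum cosine similarity between one filter and all (non-identity) translations of the other by up to k pixels. A sketch, using `np.roll` as a stand-in for the paper's zero-padded translation:

```python
import numpy as np

def k_translation_correlation(w1, w2, k=1):
    """Maximum cosine similarity between filter w1 and translations of
    filter w2 by up to k pixels (the identity translation is excluded).
    np.roll wraps around, whereas the paper zero-pads, so this is an
    approximation. w1, w2: 2-D arrays of the same shape."""
    best = -1.0
    for dx in range(-k, k + 1):
        for dy in range(-k, k + 1):
            if dx == 0 and dy == 0:
                continue
            shifted = np.roll(np.roll(w2, dx, axis=0), dy, axis=1)
            cos = np.dot(w1.ravel(), shifted.ravel()) / (
                np.linalg.norm(w1) * np.linalg.norm(shifted) + 1e-12)
            best = max(best, cos)
    return best
```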
A study of how scan orders influence mixing time in Gibbs sampling.
This paper compares the mixing rates of Gibbs sampling under systematic scan versus random updates. There are two basic contributions. First, Section 2 constructs cases where systematic scan is polynomially faster than random updates; together with a previously known case where it can be slower, this contradicts a conjecture that the speeds of systematic and random updates are similar. Second, Theorem 1 gives a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other).
First, following a recent paper by Roberts and Rosenthal, the authors construct several examples that do not satisfy the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide Theorem 1, which gives weaker two-sided bounds that they verify under some conditions. The theorem compares random scan to a lazy version of systematic scan and obtains bounds in terms of various other quantities, such as the minimum probability and the minimum holding probability.
MCMC is at the heart of many applications of modern machine learning and statistics, so it is important to understand its computational and theoretical performance under various conditions. The present paper focuses on comparing systematic-scan Gibbs sampling with random-scan Gibbs sampling. It does so first through the construction of several examples that challenge the dominant intuitions about mixing times, and then by developing theoretical bounds that are much wider than previously conjectured.
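For readers unfamiliar with the terminology, the two schedules under comparison amount to the following (the per-coordinate conditional sampler is a placeholder):

```python
import random

def systematic_scan_sweep(x, sample_conditional):
    """One sweep of systematic-scan Gibbs: update coordinates in a
    fixed order 0, 1, ..., n-1."""
    for i in range(len(x)):
        x[i] = sample_conditional(i, x)
    return x

def random_scan_sweep(x, sample_conditional):
    """n steps of random-scan Gibbs: each step picks a coordinate
    uniformly at random."""
    for _ in range(len(x)):
        i = random.randrange(len(x))
        x[i] = sample_conditional(i, x)
    return x
```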
The paper addresses the problem of compressive sensing MRI (CS-MRI) by proposing a "deep unfolding" approach (cf. http://arxiv.org/abs/1409.2574) with a sparsity-based data prior and inference via ADMM. All layers of the proposed ADMM-Net are based on a generalization of ADMM inference steps and are discriminatively trained to minimize a reconstruction error. In contrast to other methods for CS-MRI, the proposed approach offers both high reconstruction quality and fast run-time.
The basic idea is to convert a conventional optimization-based CS reconstruction algorithm into a fixed neural network trained with back-propagation. Specifically, the ADMM-based CS reconstruction is approximated with a deep neural network. Experimental results show that the resulting network outperforms several existing CS-MRI algorithms while requiring less computation time.
The ADMM algorithm has proven useful for solving problems with both differentiable and non-differentiable terms, and therefore has a clear link with compressed sensing. Experiments show some gain in performance with respect to the state of the art, especially in terms of computational cost at test time.
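A rough sketch of the unfolding idea: one ADMM iteration becomes one network stage whose internal parameters are learned end-to-end. The penalty weight, soft-threshold, and `recon_op` below are illustrative placeholders, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class ADMMStage(nn.Module):
    """One unrolled ADMM iteration for sparsity-regularized CS-MRI.
    The penalty rho and soft-threshold theta are learned per stage;
    recon_op is the data-consistency solve (e.g. FFT-based)."""
    def __init__(self, recon_op):
        super().__init__()
        self.recon_op = recon_op
        self.rho = nn.Parameter(torch.tensor(0.1))
        self.theta = nn.Parameter(torch.tensor(0.01))

    def forward(self, y, x, z, u):
        x = self.recon_op(y, z - u, self.rho)        # x-update (data term)
        z = torch.sign(x + u) * torch.clamp(          # z-update: learned
            (x + u).abs() - self.theta, min=0.0)      # soft-thresholding
        u = u + x - z                                 # dual (multiplier) update
        return x, z, u
```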
This paper proposes several measures of complexity of a recurrent neural network:

1. recurrent depth (degree of multi-layeredness, as a function of time, of recursive connections),
2. feedforward depth (degree of multi-layeredness of input-to-output connections),
3. recurrent skip coefficient (degree of directness of connections, like the inverse of multi-layeredness).

In addition to the definitions themselves, there are two main contributions: the authors show that the measures (which are limits as the number of time steps goes to infinity) are well defined, and they correlate the measures with empirical performance in various ways, showing that every measure of depth can lead to improved performance.
This paper provides three measures of complexity for RNNs. The authors then show experimentally that these complexity measures are meaningful, in the sense that increasing complexity seems to correlate with better performance.
The authors first present a rigorous graph-theoretic framework that describes the connecting architectures of RNNs in general, with which they can easily explain how an RNN is unfolded. They then propose three architectural complexity measures of RNNs, namely the recurrent depth, the feedforward depth, and the recurrent skip coefficient. Experiments on various tasks show the importance of certain measures on certain tasks, which indicates that these three complexity measures can be good guidelines when designing a recurrent neural network for a given task.
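To make the measures concrete, here is a toy estimate of the recurrent depth and the recurrent skip coefficient obtained by unfolding a small architecture graph over n time steps; this is my own simplification of the paper's limit definitions:

```python
import networkx as nx

def unfold(edges, n):
    """edges: (u, v, delay) connections within one period of the RNN
    graph; returns the DAG unfolded over n time steps."""
    g = nx.DiGraph()
    for t in range(n):
        for u, v, delay in edges:
            if t + delay < n:
                g.add_edge((u, t), (v, t + delay))
    return g

# A single recurrent unit with an extra length-3 skip connection:
edges = [("h", "h", 1), ("h", "h", 3)]
n = 60
g = unfold(edges, n)
d_r = nx.dag_longest_path_length(g) / n                     # recurrent depth
s = n / nx.shortest_path_length(g, ("h", 0), ("h", n - 1))  # skip coefficient
print(d_r, s)  # roughly 1 and roughly 3 for this toy graph
```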
This paper has a simple premise: that the LSTM cell, say, works better with multiplicative updates (equation 2) rather than additive ones (equation 1). The multiplicative update is used in lieu of additive ones in various places in the LSTM recurrence equations (the exact formulation is in the supplementary material). A slightly hand-wavy argument is made in favour of the multiplicative update, on the grounds of superior gradient flow (section 2.2). Mainly, however, the authors make a rather thorough empirical investigation which shows remarkably good performance of their new architectures on a range of real problems. Figure 1(a) is nice, showing an apparently greater information flow (as defined by a particular gradient) through time for the new scheme, as well as faster convergence and less saturated hidden unit activations. Overall, the experimental results appear thorough and convincing, although I am not a specialist in this area.
This paper presents a multiplicative alternative (with an additive component) to the additive update at the core of various RNNs (simple RNNs, GRUs, LSTMs). The multiplicative component, without introducing a significant change in the number of parameters, yields better gradient-passing properties, which enables the learning of better models, as shown in the experiments.
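The contrast, in a minimal sketch (the paper's exact gating is in its supplementary material, so this only conveys the flavor of equations 1 and 2):

```python
def additive_update(h_prev, delta):
    # Equation (1) flavor: the state changes by accumulation.
    return h_prev + delta

def multiplicative_update(h_prev, scale, delta):
    # Equation (2) flavor: the state is rescaled element-wise (plus a
    # small additive component); gradients reach h_prev through `scale`.
    return h_prev * scale + delta
```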
The authors propose to replace the notion of 'attention' in neural architectures with the notion of 'active memory', where rather than focusing on a single part of the memory, one operates on the whole of it in parallel.
This paper introduces an extension to Neural GPUs for machine translation. I found the experimental analysis section lacking, both in comparisons to state-of-the-art MT techniques and in thorough evaluation of the proposed method.
This paper proposes active memory, a memory mechanism that operates on all parts of the memory in parallel. Active memory is compared to the attention mechanism on English-French translation, and it is shown to be more effective than attention for long-sentence translation.
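The distinction in a minimal sketch, assuming PyTorch: attention reads one convex combination of memory slots, while active memory transforms every slot in parallel (here with a 1-D convolution, in the spirit of the Neural GPU family):

```python
import torch
import torch.nn.functional as F

def attention_read(memory, query):
    """memory: (L, D); query: (D,). Returns one weighted memory slot."""
    weights = F.softmax(memory @ query, dim=0)    # (L,)
    return weights @ memory                       # (D,)

def active_memory_step(memory, kernel):
    """memory: (L, D); kernel: (D, D, w) with odd w. Transforms every
    slot in parallel with a 1-D convolution over memory positions."""
    m = memory.t().unsqueeze(0)                            # (1, D, L)
    out = F.conv1d(m, kernel, padding=kernel.shape[-1] // 2)
    return out[0].t()                                      # (L, D)
```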
This paper proposes two new models for modeling sequential data in the sequence-to-sequence framework. The first is called the Markovian Neural GPU and the second is called the Extended Neural GPU. Both models are extensions of the Neural GPU model (Kaiser and Sutskever, 2016), but unlike the Neural GPU, the proposed models do not model the outputs independently but instead connect the output token distributions recursively. The paper provides empirical evidence on a machine translation task showing that the two proposed models perform better than the Neural GPU model and that the Extended Neural GPU performs on par with a GRU-based encoder-decoder model with attention.
The paper proposes a "neural transducer" model for sequence-to-sequence tasks that operates in a left-to-right and on-line fashion. In other words, the model produces output as the input is received instead of waiting until the full input is received like most sequence-to-sequence models do. Key ideas used to make the model work include a recurrent attention mechanism, the use of an end-of-block symbol in the output alphabet to indicate when the transducer should move to the next input block, and approximate algorithms based on dynamic programming and beam search for training and inference with the transducer model. Experiments on the TIMIT speech task show that the model works well and explore some of the design parameters of the model.
Like other models of this type, the input is processed by an encoder, and a decoder produces an output sequence using the information provided by the encoder and conditioned on its own previous predictions. The method is evaluated on a toy problem and on the TIMIT phoneme recognition task. The authors also propose some smaller ideas, such as two variations of the attention mechanism.
The map from block input to output is governed by a standard sequence-to-sequence model with additional state carried over from the previous block. Alignment of the two sequences is approximated by a dynamic program using a greedy local search heuristic. Experimental results are presented for phone recognition on TIMIT.
The encoder is a multi-layer LSTM RNN. The decoder is an RNN conditioned on weighted sums of the last layer of the encoder and its previous output. The weighting scheme (attention) varies and can be conditioned on the hidden states or also on previous attention vectors. The decoder produces a sequence of symbols until it outputs a special end-of-block character "e" and is moved to the next block (other mechanisms were explored as well: no end-of-block symbol, and separately predicting the end of a block given the attention vector). It is then fed the weighted sum of the next block of encoder states. The resulting sequence of symbols determines an alignment of the target symbols over the blocks of inputs, where each block may be assigned a variable number of characters. The system is trained by fixing an alignment that approximately resembles the best alignment. Finding this approximately best alignment is akin to a beam search with a beam size of M (line 169), but with a restricted set of symbols conditional on the last symbol in a particular hypothesis (since the target sequence is known). Alignments are computed less frequently than model updates (typically every 100 to 300 sequences). For inference, an unconstrained beam-search procedure is performed with a threshold on sequence length and beam size.
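A minimal sketch of the block-wise decoding loop described above (names are illustrative; the real model adds attention and the alignment search):

```python
def transduce(encoder_states, blocks, decoder_step, end_of_block="e"):
    """encoder_states: per-frame encodings; blocks: list of (start, end)
    spans. Within each block the decoder emits symbols until it produces
    the end-of-block symbol, then it advances to the next block."""
    output, state = [], None
    for start, end in blocks:
        context = encoder_states[start:end]
        while True:
            symbol, state = decoder_step(context, output, state)
            if symbol == end_of_block:
                break                 # move on to the next input block
            output.append(symbol)
    return output
```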
The authors present a new generative model that learns to disentangle the factors of variation of the data. The authors claim that the proposed model is fairly robust with respect to the amount of supervision. This is achieved by combining two of the most successful generative models: the VAE and the GAN. The model is able to resolve analogies in a consistent way on several datasets with minimal parameter/architecture tuning.
This paper presents a way to learn latent codes for data that capture both the information relevant for a given classification task and the remaining irrelevant factors of variation (rather than discarding the latter as a classification model would). This is done by combining a VAE-style generative model and adversarial training. The model proves capable of disentangling style and content in images (without explicit supervision for style information) and proves useful for analogy resolution.
This paper introduces a generative model for learning to disentangle hidden factors of variation. The disentangling separates the code into two parts, where one is claimed to describe the factors relevant to solving a specific task, and the other describes the remaining factors. Experimental results show that the proposed method is promising.
The authors combine the state-of-the-art methods VAE and GAN to generate images with two complementary codes: one relevant and one irrelevant. The major contribution of the paper is the development of a training procedure that exploits triplets of images (two sharing the relevant code, one not sharing it) to regularize the encoder-decoder architecture and avoid trivial solutions. The results are qualitatively good and comparable to previous work that uses more sources of supervision.
The paper seeks to explore the variations among samples that separate multiple classes, using encoders and decoders. Specifically, the authors propose combining generative adversarial networks and variational auto-encoders. The idea mimics a game between two opponents, where one attempts to fool the other into believing a synthetic sample is in fact a natural sample. The paper proposes an iterative training procedure in which the generative model is first trained on a number of samples while keeping the weights of the adversary constant, and the adversary is then trained while keeping the generative model's weights constant. The paper performs experiments on generation of instances between classes, retrieval of instances belonging to a given class, and interpolation of instances between two classes. The experiments are performed on MNIST, a set of 2D character animation sprites, and the NORB toy image dataset.
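A minimal sketch of the alternating procedure described above, in PyTorch style; the optimizers and loss functions are placeholders:

```python
def train_epoch(batches, generator, adversary, g_opt, d_opt,
                g_loss_fn, d_loss_fn):
    """Alternating updates: first train the generative model with the
    adversary's weights held fixed, then train the adversary with the
    generator held fixed."""
    for batch in batches:
        g_opt.zero_grad()                                  # generator step
        g_loss_fn(generator, adversary, batch).backward()
        g_opt.step()
        d_opt.zero_grad()                                  # adversary step
        d_loss_fn(generator, adversary, batch).backward()
        d_opt.step()
```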