The proposed approach consists of corrupting the training targets with noise derived from the task reward during maximum likelihood training. This simple but specific smoothing of the target distribution significantly boosts the performance of neural structured output prediction, as showcased on TIMIT phone recognition and translation tasks. The paper also makes clear the link between this approach and RL-based expected reward maximization.
Prior work has chosen either maximum likelihood learning, which is relatively tractable but assumes a log likelihood loss, or reinforcement learning, which can be performed for a task-specific loss function but requires sampling many predictions to estimate gradients. The proposed objective bridges the gap with "reward-augmented maximum likelihood," which is similar to maximum likelihood but estimates the expected loss with samples that are drawn in proportion to their distance from the ground truth. Empirical results show good improvements with LSTM-based predictors on speech recognition and machine translation benchmarks relative to maximum likelihood training.
This work is inspired by recent advances in reinforcement learning and likelihood learning. The authors propose learning the parameters by minimizing the KL divergence between CRFs and a probability model that is proportional to the reward function (which the authors call the payoff distribution, see Equation 4). The authors suggest an optimization algorithm for the KL-divergence minimization that depends on sampling from the payoff distribution.
Current methods to learn a model for structured prediction include max margin optimisation and reinforcement learning. However, the max margin approach only optimises a bound on the true reward, and requires loss augmented inference to obtain gradients, which can be expensive. On the other hand, reinforcement learning does not make use of available supervision, and can therefore struggle when the reward is sparse, and furthermore the gradients can have high variance. The paper proposes a novel approach to learning for problems that involve structured prediction. They relate their approach to simple maximum likelihood (ML) learning and reinforcement learning (RL): ML optimises the KL divergence of a delta distribution relative to the model distribution, and RL optimises the KL divergence of the model distribution relative to the exponentiated reward distribution. They propose reward-augmented maximum likelihood learning, which optimises the KL divergence of the exponentiated reward distribution relative to the model distribution. Compared to RL, the arguments of the KL divergence are swapped. Compared to ML, the delta distribution is generalised to the exponentiated reward distribution. Training is cheap in RML learning. It is only necessary to sample from the output set according to the exponentiated reward distribution. All experiments are performed in speech recognition and machine translation, where the structure over the output set is defined by the edit distance. An improvement is demonstrated over simple ML.
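
To make the sampling step concrete, here is a minimal sketch of drawing training targets from an exponentiated payoff distribution, q(y | y*) proportional to exp(r(y, y*) / tau), with the reward taken as negative edit distance. The tiny hand-picked candidate set, the temperature `tau`, and all function names are assumptions for illustration only; the paper samples from the full output space rather than enumerating candidates.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def payoff_distribution(target, candidates, tau):
    """Probabilities proportional to exp(-edit_distance / tau) over the candidates."""
    scores = [-edit_distance(y, target) / tau for y in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_augmented_targets(target, candidates, tau, n, rng):
    """Draw n noisy targets; each would then be used in an ordinary log-likelihood loss."""
    probs = payoff_distribution(target, candidates, tau)
    return rng.choices(candidates, weights=probs, k=n)


# Toy usage with a hypothetical candidate set around the ground truth "cat".
target = "cat"
candidates = ["cat", "cot", "coat", "dog"]
probs = payoff_distribution(target, candidates, tau=0.5)
samples = sample_augmented_targets(target, candidates, tau=0.5, n=5,
                                   rng=random.Random(0))
```

As `tau` approaches zero the distribution collapses onto the ground truth, which recovers plain maximum likelihood training.
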
The paper ([arxiv](https://arxiv.org/abs/1610.09716)) introduces DCNNs (Doubly Convolutional Neural Networks). These are CNNs that contain a new layer type which generalizes convolutional layers.
CNNs seem to learn many filters which are similar to other learned filters in the same layer. The weights are only slightly shifted.
The idea of double convolution is to learn groups of filters where the filters within each group are translated versions of each other. To achieve this, a doubly convolutional layer allocates a set of meta filters whose sizes are larger than the effective filter size. Effective filters can then be extracted from each meta filter, which corresponds to convolving the meta filters with an identity kernel. All the extracted filters are then concatenated and convolved with the input.
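
The extraction step can be sketched as follows. This is a minimal illustration, assuming a single square 2D meta filter and stride 1; the function name and the list-of-lists representation are hypothetical, and any pooling or striding the paper applies to the extracted filters is left out.

```python
def extract_effective_filters(meta_filter, size):
    """Return every size x size patch of a square 2D meta filter (stride 1)."""
    z = len(meta_filter)
    patches = []
    for top in range(z - size + 1):
        for left in range(z - size + 1):
            # Slice rows top..top+size-1 and columns left..left+size-1.
            patch = [row[left:left + size] for row in meta_filter[top:top + size]]
            patches.append(patch)
    return patches


# Toy usage: a 4x4 meta filter yields (4 - 3 + 1)^2 = 4 effective 3x3 filters.
meta = [[4 * r + c for c in range(4)] for r in range(4)]
effective = extract_effective_filters(meta, 3)
```

Because neighbouring patches overlap, the extracted filters share weights and are shifted copies of one another, which is the property the layer is designed to exploit.
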
> We have also confirmed that replacing a convolutional layer with a doubly convolutional layer consistently improves the performance, regardless of the depth of the layer.
* CIFAR-10+: 7.24% error
* CIFAR-100+: 26.53% error
* ImageNet: 8.23% Top-5 error
The k-translation correlation is effectively a [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). I think the authors should have mentioned that.
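
A small sketch of that reading: the correlation between two filters is taken as the largest normalized inner product between the first filter and a shifted copy of the second. The zero-padded shift, the normalization by the unshifted norms, and the function names are assumptions made for illustration; the paper's exact definition may differ.

```python
import math


def translate(filt, dx, dy):
    """Shift a 2D filter by (dx, dy), filling uncovered cells with zeros."""
    height, width = len(filt), len(filt[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < height and 0 <= src_c < width:
                out[r][c] = filt[src_r][src_c]
    return out


def flatten(filt):
    return [value for row in filt for value in row]


def k_translation_correlation(f1, f2, k):
    """Max over non-zero shifts of up to k pixels of <f1, shift(f2)> / (|f1| |f2|)."""
    a = flatten(f1)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in flatten(f2)))
    best = -1.0
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            if dx == 0 and dy == 0:
                continue
            b = flatten(translate(f2, dx, dy))
            best = max(best, sum(x * y for x, y in zip(a, b)) / norm)
    return best


# Toy usage: f1 is f2 shifted one pixel to the right.
f2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
f1 = translate(f2, 1, 0)
rho = k_translation_correlation(f1, f2, k=1)
```
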
A study of how scan orders influence mixing time in Gibbs sampling.
This paper compares the mixing rates of Gibbs sampling under systematic scan and under random updates. There are two basic contributions. First, Section 2 gives a set of cases where systematic scan is polynomially faster than random updates. Together with a previously known case where it can be slower, this contradicts a conjecture that the speeds of systematic and random updates are similar. Second, Theorem 1 gives a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other).
First, following a recent paper by Roberts and Rosenthal, the authors construct several examples that violate the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide a result, Theorem 1, which gives weaker bounds that they nevertheless verify at least under some conditions. In fact, the theorem compares random scan to a lazy version of systematic scan and obtains bounds in terms of various other quantities, such as the minimum probability or the minimum holding probability.
MCMC is at the heart of many applications of modern machine learning and statistics. It is thus important to understand its computational and theoretical performance under various conditions. The present paper focuses on examining systematic scan Gibbs sampling in comparison to random scan Gibbs sampling. The authors do so first through the construction of several examples that challenge the dominant intuitions about mixing times, and then develop theoretical bounds which are much wider than previously conjectured.
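
For readers unfamiliar with the two scan orders, the difference can be sketched on a toy model. The example below assumes a 1D Ising chain with a hypothetical coupling strength; the model and function names are illustrative and are not taken from the paper's constructions.

```python
import math
import random


def gibbs_update(state, i, coupling, rng):
    """Resample spin i from its conditional distribution given its neighbours."""
    field = 0
    if i > 0:
        field += state[i - 1]
    if i < len(state) - 1:
        field += state[i + 1]
    # P(s_i = +1 | rest) for p(s) proportional to exp(coupling * sum of s_i * s_j).
    p_up = 1.0 / (1.0 + math.exp(-2.0 * coupling * field))
    state[i] = 1 if rng.random() < p_up else -1


def systematic_scan(state, coupling, sweeps, rng):
    """Visit the variables in a fixed order 0, 1, ..., n-1 on every sweep."""
    for _ in range(sweeps):
        for i in range(len(state)):
            gibbs_update(state, i, coupling, rng)
    return state


def random_scan(state, coupling, sweeps, rng):
    """Pick a uniformly random variable at every step (same number of updates)."""
    n = len(state)
    for _ in range(sweeps * n):
        gibbs_update(state, rng.randrange(n), coupling, rng)
    return state


rng = random.Random(0)
sys_state = systematic_scan([1] * 6, coupling=0.5, sweeps=10, rng=rng)
rnd_state = random_scan([1] * 6, coupling=0.5, sweeps=10, rng=rng)
```

Both samplers perform the same number of single-site updates; the paper's question is how the choice of order changes the number of updates needed to mix.
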
The paper addresses the problem of compressive sensing MRI (CS-MRI) by proposing a "deep unfolding" approach (cf. http://arxiv.org/abs/1409.2574) with a sparsity-based data prior and inference via ADMM. All layers of the proposed ADMM-Net are based on a generalization of ADMM inference steps and are discriminatively trained to minimize a reconstruction error. In contrast to other methods for CS-MRI, the proposed approach offers both high reconstruction quality and fast run-time.
The basic idea is to convert the conventional optimization-based CS reconstruction algorithm into a fixed neural network learned with the back-propagation algorithm. Specifically, the ADMM-based CS reconstruction is approximated with a deep neural network. Experimental results show that the approximated neural network outperforms several existing CS-MRI algorithms with less computational time.
The ADMM algorithm has proven to be useful for solving problems with differentiable and non-differentiable terms, and therefore has a clear link with compressed sensing. Experiments show some gain in performance with respect to the state of the art, especially in terms of computational cost at test time.
This paper proposes several definitions of measures of complexity of a recurrent neural network. They measure:

1) recurrent depth (degree of multi-layeredness as a function of time of recursive connections)
2) feedforward depth (degree of multi-layeredness as a function of input -> output connections)
3) recurrent skip coefficient (degree of directness, like the inverse of multi-layeredness, of connections)

In addition to the actual definitions, there are two main contributions:

- The authors show that the measures (which are limits as the number of time steps -> infinity) are well defined.
- The authors correlate the measures with empirical performance in various ways, showing that all measures of depth can lead to improved performance.

This paper provides 3 measures of complexity for RNNs. The authors then show experimentally that these complexity measures are meaningful, in the sense that increasing complexity seems to correlate with better performance.
The authors first present a rigorous graph-theoretic framework that describes the connecting architectures of RNNs in general, with which they easily explain how an RNN can be unfolded. The authors then go on to propose three architecture complexity measures of RNNs, namely the recurrent depth, the feedforward depth and the recurrent skip coefficient. Experiments on various tasks show the importance of certain measures on certain tasks, which indicates that those three complexity measures might be good guidelines when designing a recurrent neural network for certain tasks.
This paper has a simple premise: that, say, the LSTM cell works better with multiplicative updates (equation 2) rather than additive ones (equation 1). This multiplicative update is used in lieu of additive ones in various places in the LSTM recurrence equations (the exact formulation is in the supplementary material). A slightly hand-wavy argument is made in favour of the multiplicative update, on the grounds of superior gradient flow (section 2.2). Mainly, however, the authors make a rather thorough empirical investigation which shows remarkably good performance of their new architectures on a range of real problems. Figure 1(a) is nice, showing an apparently greater information flow (as defined by a particular gradient) through time for the new scheme, as well as faster convergence and less saturated hidden unit activations. Overall, the experimental results appear thorough and convincing, although I am not a specialist in this area.
This model presents a multiplicative alternative (with an additive component) to the additive update which happens at the core of various RNNs (Simple RNNs, GRUs, LSTMs). The multiplicative component, without introducing a significant change in the number of parameters, yields better gradient passing properties which enable the learning of better models, as shown in experiments.
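
As a rough sketch of the two update rules for a simple RNN cell: the additive rule is taken here as tanh(Wx + Uh + b), and the multiplicative rule with an additive component as tanh(alpha * Wx * Uh + beta1 * Uh + beta2 * Wx + b), with elementwise products. These forms, and the names `alpha`, `beta1`, `beta2`, are assumptions based on the reviews' description; the paper's equations 1 and 2 are not reproduced in this document.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def additive_step(W, U, b, x, h):
    """Standard update: tanh(Wx + Uh + b)."""
    wx, uh = matvec(W, x), matvec(U, h)
    return [math.tanh(a + c + bias) for a, c, bias in zip(wx, uh, b)]


def multiplicative_step(W, U, b, x, h, alpha, beta1, beta2):
    """Assumed general form: tanh(alpha*Wx*Uh + beta1*Uh + beta2*Wx + b), elementwise."""
    wx, uh = matvec(W, x), matvec(U, h)
    return [math.tanh(al * a * c + b1 * c + b2 * a + bias)
            for al, b1, b2, a, c, bias in zip(alpha, beta1, beta2, wx, uh, b)]


# Toy usage with hypothetical 2-unit weights.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [-0.3, 0.2]]
b = [0.0, 0.1]
x, h = [1.0, 2.0], [0.5, -0.5]
h_add = additive_step(W, U, b, x, h)
h_mul = multiplicative_step(W, U, b, x, h, alpha=[1.0, 1.0],
                            beta1=[0.0, 0.0], beta2=[0.0, 0.0])
```

The extra gating vectors add only a few parameters per hidden unit, which matches the observation that the parameter count barely changes. Setting `alpha` to zero and both `beta` vectors to one recovers the additive update.
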
The authors propose to replace the notion of 'attention' in neural architectures with the notion of 'active memory' where rather than focusing on a single part of the memory one would operate on the whole of it in parallel.
This paper introduces an extension to neural GPUs for machine translation. I found the experimental analysis section lacking in both comparisons to state of the art MT techniques as well as thoroughly evaluating the proposed method.
This paper proposes active memory, a memory mechanism that operates on all parts of the memory in parallel. Active memory is compared to the attention mechanism, and it is shown to be more effective than attention for long-sentence translation in English-French translation.
This paper proposes two new models for modeling sequential data in the sequence-to-sequence framework. The first is called the Markovian Neural GPU and the second is called the Extended Neural GPU. Both models are extensions of the Neural GPU model (Kaiser and Sutskever, 2016), but unlike the Neural GPU, the proposed models do not model the outputs independently but instead connect the output token distributions recursively. The paper provides empirical evidence on a machine translation task showing that the two proposed models perform better than the Neural GPU model and that the Extended Neural GPU performs on par with a GRU-based encoder-decoder model with attention.
The paper proposes a "neural transducer" model for sequence-to-sequence tasks that operates in a left-to-right and on-line fashion. In other words, the model produces output as the input is received instead of waiting until the full input is received like most sequence-to-sequence models do. Key ideas used to make the model work include a recurrent attention mechanism, the use of an end-of-block symbol in the output alphabet to indicate when the transducer should move to the next input block, and approximate algorithms based on dynamic programming and beam search for training and inference with the transducer model. Experiments on the TIMIT speech task show that the model works well and explore some of the design parameters of the model.
Like similar models of this type, the input is processed by an encoder and a decoder produces an output sequence using the information provided by the encoder and conditioned on its own previous predictions. The method is evaluated on a toy problem and the TIMIT phoneme recognition task. The authors also propose some smaller ideas like two different attention mechanism variations.
The map from block input to output is governed by a standard sequence-to-sequence model with additional state carried over from the previous block. Alignment of the two sequences is approximated by a dynamic program using a greedy local search heuristic. Experimental results are presented for phone recognition on TIMIT.
The encoder is a multi-layer LSTM RNN. The decoder is an RNN model conditioned on weighted sums of the last layer of the encoder and its previous output. The weighting schemes (attention) vary and can be conditioned on the hidden states or also on previous attention vectors. The decoder model produces a sequence of symbols until it outputs a special end character "e" and is moved to the next block (other mechanisms were explored as well: no end-of-block symbol, and separately predicting the end of a block given the attention vector). It is then fed the weighted sum of the next block of encoder states. The resulting sequence of symbols determines an alignment of the target symbols over the blocks of inputs, where each block may be assigned a variable number of characters. The system is trained by fixing an alignment that approximately resembles the best alignment. Finding this approximately best alignment is akin to a beam search with a beam size of M (line 169), but with a restricted set of symbols conditional on the last symbol in a particular hypothesis (since the target sequence is known). Alignments are computed less frequently than model updates (typically every 100 to 300 sequences). For inference, an unconstrained beam-search procedure is performed with a threshold on sequence length and beam size.
The authors present a new generative model that learns to disentangle the factors of variation of the data. The authors claim that the proposed model is pretty robust to supervision. This is achieved by combining two of the most successful generative models: VAE and GAN. The model is able to resolve analogies in a consistent way on several datasets with minimal parameter/architecture tuning.
This paper presents a way to learn latent codes for data that capture both the information relevant for a given classification task and the remaining irrelevant factors of variation (rather than discarding the latter as a classification model would). This is done by combining a VAE-style generative model with adversarial training. This model proves capable of disentangling style and content in images (without explicit supervision for style information), and proves useful for analogy resolution.
This paper introduces a generative model for learning to disentangle hidden factors of variation. The disentangling separates the code into two parts, where one is claimed to be the code that describes factors relevant to solving a specific task, and the other describes the remaining factors. Experimental results show that the proposed method is promising.
The authors combine the state-of-the-art methods VAE and GAN to generate images with two complementary codes: one relevant and one irrelevant. The major contribution of the paper is the development of a training procedure that exploits triplets of images (two sharing the relevant code, one not sharing it) to regularize the encoder-decoder architecture and avoid trivial solutions. The results are qualitatively good and comparable to previous work that uses more sources of supervision.
The paper seeks to explore the variations amongst samples which separate multiple classes using auto encoders and decoders. Specifically, the authors propose combining generative adversarial networks and variational auto encoders. The idea mimics the game play between two opponents, where one attempts to fool the other into believing a synthetic sample is in fact a natural sample. The paper proposes an iterative training procedure where a generative model is first trained on a number of samples while keeping the weights of the adversary constant, and later the adversary is trained while keeping the generative model weights constant. The paper performs experiments on generation of instances between classes, retrieval of instances belonging to a given class, and interpolation of instances between two classes. The experiments were performed on MNIST, a set of 2D character animation sprites, and the 2D NORB toy image dataset.
This paper developed a semantically rich representation for natural sound using unlabeled videos as a bridge to
transfer discriminative visual knowledge from well-established visual recognition models into the sound modality.
The learned sound representation yields significant performance improvements on standard benchmarks for acoustic
scene classification.
### Key Points
- The natural synchronization between vision and sound can be leveraged as a supervision signal for each other.
- Cross-modal learning can overcome overfitting when the target modality has far less data than the other modalities; having enough data is essential for deep networks to work well.
- In the sound classification task, **pool5** and **conv6** extracted from SoundNet achieve best performance.
- The authors proposed a student-teacher training procedure to transfer discriminative visual knowledge from visual recognition models
trained on ImageNet and Places into the SoundNet by minimizing KL divergence between their predictions.
- Two reasons to use CNN for sound: 1. invariant to translations; 2. stacking layers to detect higher-level concepts.
- Adding a linear SVM on top of the representation learned by SoundNet outperforms other existing methods by 10%.
- Using lots of unlabeled videos as supervision signals enables the deeper SoundNet to work; otherwise the 8-layer network
performs poorly due to overfitting.
- Using Places and ImageNet simultaneously as supervision beats using only one of them by 3%.
- Multi-modal recognition models that use visual and sound data together yield a 2% gain in classification accuracy.
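
The student-teacher transfer described in the key points above can be sketched as a KL-divergence loss between a teacher's class distribution and the student's prediction. The direction KL(teacher || student), the toy logits, and the function names are assumptions for illustration; the actual SoundNet training code is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Toy usage: hypothetical teacher (vision) and student (sound) logits for 3 classes.
teacher = softmax([2.0, 0.5, -1.0])
student = softmax([1.0, 1.0, 0.0])
loss = kl_divergence(teacher, student)
```

With two teachers (for example one trained on ImageNet and one on Places), one such term per teacher could simply be summed.
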
I think this paper is really complete, since it contains good intuition, ablation analysis, representation visualization, hidden unit visualization, and significant performance improvements.
- Although the paper says that "To handle variable-temporal-length of input sound, this model uses a fully convolutional network and produces an output over multiple timesteps in video.", the code seems to fix the length of each excerpt to 5 seconds.
- The data augmentation technique used in training is not clear to me.
This paper performs activation maximization (AM) using a Deep Generator Network (DGN), which serves as a learned natural image prior, to synthesize realistic images as inputs and feed them into the DNN we want to understand.
By visualizing synthesized images that highly activate particular neurons in the DNN, we can interpret what each neuron in the DNN has learned to detect.
### Key Points
- DGN (the natural image prior) generates more coherent images when optimizing fully-connected layer codes instead of low-level codes. However, previous studies showed that low-level features result in better reconstructions because they contain more image details. The difference is that here DGN-AM is trying to synthesize an entire layer code from scratch. Low-level features have only a small, local receptive field, so the optimization process has to tune image regions independently without knowing the global structure. Also, the code space at a convolutional layer is much higher-dimensional, making it harder to optimize.
- The learned prior trained on ImageNet can also generalize to Places.
- It doesn't generalize well if the architecture of the encoder trained with the DGN is different from that of the DNN we wish to inspect.
- The learned prior also generalizes to visualize hidden neurons, producing more realistic textures/colors.
- When visualizing hidden neurons, DGN-AM trained on ImageNet also generalizes to Places and produces similar results as .
- The synthesized images are shown to reflect what the neurons of the inspected DNN prefer, rather than what the prior prefers.
Solid paper with diverse visualizations and thorough analysis.
Object Detectors Emerge in Deep Scene CNNs, B. Zhou et al.
* What it is
  * Weight Normalization (WN) is a normalization technique, similar to Batch Normalization (BN).
  * It normalizes each layer's weights.
* Differences to BN
  * WN normalizes based on each weight vector's orientation and magnitude. BN normalizes based on each neuron's pre-activation mean and variance in a batch.
  * WN works on each example on its own. BN works on whole batches.
  * WN is more deterministic than BN (due to not working on batches).
  * WN is better suited for noisy environments (RNNs, LSTMs, reinforcement learning, generative models), due to being more deterministic.
  * WN is computationally simpler than BN.
* How it's done
  * WN is a module added on top of a linear or convolutional layer.
  * If that layer's weights are `w`, then WN learns two parameters `g` (scalar) and `v` (vector, identical dimension to `w`) so that `w = gv / ||v||` is fulfilled (`||v||` = Euclidean norm of `v`).
  * `g` is the magnitude of the weights, `v` is their orientation.
  * `v` is initialized to zero mean and a standard deviation of 0.05.
  * For networks without recursions (i.e. not RNN/LSTM/GRU):
    * Right after initialization, they feed a single batch through the network.
    * For each neuron/weight, they calculate the mean and standard deviation after the WN layer.
    * They then adjust the bias to `-mean/stdDev` and `g` to `1/stdDev`.
    * That makes the network start with each feature being roughly zero-mean and unit-variance.
    * The same method can also be applied to networks without WN.
* They define BN-MEAN as a variant of BN which only normalizes to zero-mean (not unit-variance).
* CIFAR-10 image classification (no data augmentation, some dropout, some white noise):
  * WN, BN and BN-MEAN all learn similarly fast. The network without normalization learns slower, but catches up towards the end.
  * BN learns "more" per example, but is about 16% slower (time-wise) than WN.
  * WN reaches about the same test error as no normalization (both ~8.4%), BN achieves better results (~8.0%).
  * WN + BN-MEAN achieves the best results with 7.31%.
  * Optimizer: Adam
* Convolutional VAE on MNIST and CIFAR-10:
  * WN learns more per example and plateaus at better values than the network without normalization. (BN was not tested.)
  * Optimizer: Adamax
* DRAW on MNIST (heavy on LSTMs):
  * WN learns significantly more per example than the network without normalization.
  * It also ends up with better results. (The normal network might catch up though if run longer.)
* Deep Reinforcement Learning (Space Invaders):
  * WN seemed to overall acquire a bit more reward per epoch than the network without normalization. Variance (in acquired reward) however also grew.
  * Results are not as clear as in DRAW.
  * Optimizer: Adamax
* They argue that initializing `g` to `exp(cs)` (`c` constant, `s` learned) might be better, but they didn't get better test results with that.
* Due to some gradient effects, `||v||` currently grows monotonically with every weight update. (Not necessarily when using optimizers that use separate learning rates per parameter.)
  * That growth effect makes the network more robust to different learning rates.
  * Setting a small hard limit/constraint for `||v||` can lead to better test set performance (parameter updates are larger, introducing more noise).
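
The reparameterization `w = gv / ||v||` and the data-dependent initialization described in the notes can be sketched for a single neuron as follows. This is a minimal illustration in plain Python; the function names, the toy batch, and the single-neuron setting are assumptions, not the authors' implementation.

```python
import math
import random


def weight_norm(v, g):
    """Return w = g * v / ||v|| for one neuron's weight vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


def pre_activations(w, bias, batch):
    """Linear output w . x + bias for every example in the batch."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + bias for x in batch]


def data_dependent_init(v, batch):
    """Choose g and bias so the neuron starts zero-mean and unit-variance on `batch`."""
    t = pre_activations(weight_norm(v, 1.0), 0.0, batch)  # g = 1, bias = 0
    mean = sum(t) / len(t)
    std = math.sqrt(sum((ti - mean) ** 2 for ti in t) / len(t))
    return 1.0 / std, -mean / std  # g = 1/stdDev, bias = -mean/stdDev


# Toy usage: v drawn with zero mean and standard deviation 0.05, one random batch.
rng = random.Random(0)
v = [rng.gauss(0.0, 0.05) for _ in range(4)]
batch = [[rng.gauss(1.0, 2.0) for _ in range(4)] for _ in range(32)]
g, bias = data_dependent_init(v, batch)
outputs = pre_activations(weight_norm(v, g), bias, batch)
```

After this initialization the neuron's outputs on the initial batch are standardized, which is the starting condition the notes describe.
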
![CIFAR-10 results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Weight_Normalization__cifar10.png?raw=true "CIFAR-10 results")
*Performance of WN on CIFAR-10 compared to BN, BN-MEAN and no normalization.*
![DRAW, DQN results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Weight_Normalization__draw_dqn.png?raw=true "DRAW, DQN results")
*Performance of WN for DRAW (left) and deep reinforcement learning (right).*