[link]
# Main Results (tl;dr) ## Deep *Linear* Networks 1. Loss function is **nonconvex** and nonconcave 2. **Every local minimum is a global minimum** 3. Shallow neural networks *don't* have bad saddle points 4. Deep neural networks *do* have bad saddle points ## Deep *ReLU* Networks * Same results as above by reduction to deep linear networks under strong simplifying assumptions * Strong assumptions: * The probability that a path through the ReLU network is active is the same, agnostic to which path it is. * The activations of the network are independent of the input data and the weights. ## Highlighted Takeaways * Depth *doesn't* create nonglobal minima, but depth *does* create bad saddle points. * This paper moves deep linear networks closer to a good model for deep ReLU networks by discarding 5 of the 7 of the previously used assumptions. This gives more "support" for the conjecture that deep ReLU networks don't have bad local minima. * Deep linear networks don't have bad local minima, so if deep ReLU networks do have bad local minima, it's purely because of the introduction of nonlinear activations. This highlights the importance of the activation function used. * Shallow linear networks don't have bad saddles point while deep linear networks do, indicating that the saddle point problem is introduced with depth beyond the first hidden layer. Bad saddle point : saddle point whose Hessian has no negative eigenvalues (no direction to descend) Shallow neural network : single hidden layer Deep neural network : more than one hidden layer Bad local minima : local minima that aren't global minima # Position in Research Landscape * Conjecture from 1989: For deep linear networks, every local minimum is a global minimum: [Neural networks and principal component analysis: Learning from examples without local minima (Neural networks 1989)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.1839&rep=rep1&type=pdf) * This paper proves that conjecture. * Given 7 strong assumptions, the losses of local minima are concentrated in an exponentially (with dimension) tight band: [The Loss Surfaces of Multilayer Networks (AISTATS 2015)](https://arxiv.org/abs/1412.0233) * Discarding some of the above assumptions is an open problem: [Open Problem: The landscape of the loss surfaces of multilayer networks (COLT 2015)](http://proceedings.mlr.press/v40/Choromanska15.pdf) * This paper discards 5 of those assumptions and proves the result for a strictly more general deep nonlinear model class. # More Details ## Deep *Linear* Networks * Main result is Result 2, which proves the conjecture from 1989: every local minimum is a global minimum. * Not where the strong assumptions come in * Assumptions (realistic and practically easy to satisfy): * $XX^T$ and $XY^T$ are full rank * $d_y \leq d_x$ (output is lower dimension than input) * $\Sigma = YX^T(XX^T )^{−1}XY^T$ has $d_y$ distinct eigenvalues * specific to the squared error loss function * Essentially gives a comprehensive understanding of the loss surface of deep linear networks ## Deep ReLU Networks * Specific to ReLU activation. Makes strong use of its properties * Choromanska et al. (2015) relate the loss function to the Hamiltonian of the spherical spinglass model, using 3 reshaping assumptions. This allows them to apply existing random matrix theory results. This paper drops those reshaping assumptions by performing completely different analysis. * Because Choromanska et al. (2015) used random matrix theory, they analyzed a random Hessian, which means they need to make 2 distributional assumptions. This paper also drops those 2 assumptions and analyzes a deterministic Hessian. * Remaining Unrealistic Assumptions: * The probability that a path through the ReLU network is active is the same, agnostic to which path it is. * The activations of the network are independent of the input data and the weights. # Related Resources * [NIPS Oral Presentation](https://channel9.msdn.com/Events/NeuralInformationProcessingSystemsConference/NeuralInformationProcessingSystemsConferenceNIPS2016/DeepLearningwithoutPoorLocalMinima) 
[link]
* They present a variation of Faster RCNN, i.e. a model that predicts bounding boxes in images and classifies them. * In contrast to Faster RCNN, their model is fully convolutional. * In contrast to Faster RCNN, the computation per bounding box candidate (region proposal) is very low. ### How * The basic architecture is the same as in Faster RCNN: * A base network transforms an image to a feature map. Here they use ResNet101 to do that. * A region proposal network (RPN) uses the feature map to locate bounding box candidates ("region proposals") in the image. * A classifier uses the feature map and the bounding box candidates and classifies each one of them into `C+1` classes, where `C` is the number of object classes to spot (e.g. "person", "chair", "bottle", ...) and `1` is added for the background. * During that process, small subregions of the feature maps (those that match the bounding box candidates) must be extracted and converted to fixedsizes matrices. The method to do that is called "Region of Interest Pooling" (RoIPooling) and is based on max pooling. It is mostly the same as in Faster RCNN. * Visualization of the basic architecture: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__architecture.jpg?raw=true "Architecture") * Positionsensitive classification * Fully convolutional bounding box detectors tend to not work well. * The authors argue, that the problems come from the translationinvariance of convolutions, which is a desirable property in the case of classification but not when precise localization of objects is required. * They tackle that problem by generating multiple heatmaps per object class, each one being slightly shifted ("positionsensitive score maps"). * More precisely: * The classifier generates per object class `c` a total of `k*k` heatmaps. * In the simplest form `k` is equal to `1`. Then only one heatmap is generated, which signals whether a pixel is part of an object of class `c`. * They use `k=3*3`. The first of those heatmaps signals, whether a pixel is part of the *top left* corner of a bounding box of class `c`. The second heatmap signals, whether a pixel is part of the *top center* of a bounding box of class `c` (and so on). * The RoIPooling is applied to these heatmaps. * For `k=3*3`, each bounding box candidate is converted to `3*3` values. The first one resembles the top left corner of the bounding box candidate. Its value is generated by taking the average of the values in that area in the first heatmap. * Once the `3*3` values are generated, the final score of class `c` for that bounding box candidate is computed by averaging the values. * That process is repeated for all classes and a softmax is used to determine the final class. * The graphic below shows examples for that: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__examples.jpg?raw=true "Examples") * The above described RoIPooling uses only averages and hence is almost (computationally) free. * They make use of that during the training by sampling many candidates and only backpropagating on those with high losses (online hard example mining, OHEM). * À trous trick * In order to increase accuracy for small bounding boxes they use the à trous trick. * That means that they use a pretrained base network (here ResNet101), then remove a pooling layer and set the à trous rate (aka dilation) of all convolutions after the removed pooling layer to `2`. * The á trous rate describes the distance of sampling locations of a convolution. Usually that is `1` (sampled locations are right next to each other). If it is set to `2`, there is one value "skipped" between each pair of neighbouring sampling location. * By doing that, the convolutions still behave as if the pooling layer existed (and therefore their weights can be reused). At the same time, they work at an increased resolution, making them more capable of classifying small objects. (Runtime increases though.) * Training of RFCN happens similarly to Faster RCNN. ### Results * Similar accuracy as the most accurate Faster RCNN configurations at a lower runtime of roughly 170ms per image. * Switching to ResNet50 decreases accuracy by about 2 percentage points mAP (at faster runtime). Switching to ResNet152 seems to provide no measureable benefit. * OHEM improves mAP by roughly 2 percentage points. * À trous trick improves mAP by roughly 2 percentage points. * Training on `k=1` (one heatmap per class) results in a failure, i.e. a model that fails to predict bounding boxes. `k=7` is slightly more accurate than `k=3`.
1 Comments

[link]
* Usually GANs transform a noise vector `z` into images. `z` might be sampled from a normal or uniform distribution. * The effect of this is, that the components in `z` are deeply entangled. * Changing single components has hardly any influence on the generated images. One has to change multiple components to affect the image. * The components end up not being interpretable. Ideally one would like to have meaningful components, e.g. for human faces one that controls the hair length and a categorical one that controls the eye color. * They suggest a change to GANs based on Mutual Information, which leads to interpretable components. * E.g. for MNIST a component that controls the stroke thickness and a categorical component that controls the digit identity (1, 2, 3, ...). * These components are learned in a (mostly) unsupervised fashion. ### How * The latent code `c` * "Normal" GANs parameterize the generator as `G(z)`, i.e. G receives a noise vector and transforms it into an image. * This is changed to `G(z, c)`, i.e. G now receives a noise vector `z` and a latent code `c` and transforms both into an image. * `c` can contain multiple variables following different distributions, e.g. in MNIST a categorical variable for the digit identity and a gaussian one for the stroke thickness. * Mutual Information * If using a latent code via `G(z, c)`, nothing forces the generator to actually use `c`. It can easily ignore it and just deteriorate to `G(z)`. * To prevent that, they force G to generate images `x` in a way that `c` must be recoverable. So, if you have an image `x` you must be able to reliable tell which latent code `c` it has, which means that G must use `c` in a meaningful way. * This relationship can be expressed with mutual information, i.e. the mutual information between `x` and `c` must be high. * The mutual information between two variables X and Y is defined as `I(X; Y) = entropy(X)  entropy(XY) = entropy(Y)  entropy(YX)`. * If the mutual information between X and Y is high, then knowing Y helps you to decently predict the value of X (and the other way round). * If the mutual information between X and Y is low, then knowing Y doesn't tell you much about the value of X (and the other way round). * The new GAN loss becomes `old loss  lambda * I(G(z, c); c)`, i.e. the higher the mutual information, the lower the result of the loss function. * Variational Mutual Information Maximization * In order to minimize `I(G(z, c); c)`, one has to know the distribution `P(cx)` (from image to latent code), which however is unknown. * So instead they create `Q(cx)`, which is an approximation of `P(cx)`. * `I(G(z, c); c)` is then computed using a lower bound maximization, similar to the one in variational autoencoders (called "Variational Information Maximization", hence the name "InfoGAN"). * Basic equation: `LowerBoundOfMutualInformation(G, Q) = E[log Q(cx)] + H(c) <= I(G(z, c); c)` * `c` is the latent code. * `x` is the generated image. * `H(c)` is the entropy of the latent codes (constant throughout the optimization). * Optimization w.r.t. Q is done directly. * Optimization w.r.t. G is done via the reparameterization trick. * If `Q(cx)` approximates `P(cx)` *perfectly*, the lower bound becomes the mutual information ("the lower bound becomes tight"). * In practice, `Q(cx)` is implemented as a neural network. Both Q and D have to process the generated images, which means that they can share many convolutional layers, significantly reducing the extra cost of training Q. ### Results * MNIST * They use for `c` one categorical variable (10 values) and two continuous ones (uniform between 1 and +1). * InfoGAN learns to associate the categorical one with the digit identity and the continuous ones with rotation and width. * Applying Q(cx) to an image and then classifying only on the categorical variable (i.e. fully unsupervised) yields 95% accuracy. * Sampling new images with exaggerated continuous variables in the range `[2,+2]` yields sound images (i.e. the network generalizes well). * ![MNIST examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__mnist.png?raw=true "MNIST examples") * 3D face images * InfoGAN learns to represent the faces via pose, elevation, lighting. * They used five uniform variables for `c`. (So two of them apparently weren't associated with anything sensible? They are not mentioned.) * 3D chair images * InfoGAN learns to represent the chairs via identity (categorical) and rotation or width (apparently they did two experiments). * They used one categorical variable (four values) and one continuous variable (uniform `[1, +1]`). * SVHN * InfoGAN learns to represent lighting and to spot the center digit. * They used four categorical variables (10 values each) and two continuous variables (uniform `[1, +1]`). (Again, a few variables were apparently not associated with anything sensible?) * CelebA * InfoGAN learns to represent pose, presence of sunglasses (not perfectly), hair style and emotion (in the sense of "smiling or not smiling"). * They used 10 categorical variables (10 values each). (Again, a few variables were apparently not associated with anything sensible?) * ![CelebA examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__celeba.png?raw=true "CelebA examples") 
[link]
* Weight Normalization (WN) is a normalization technique, similar to Batch Normalization (BN). * It normalizes each layer's weights. ### Differences to BN * WN normalizes based on each weight vector's orientation and magnitude. BN normalizes based on each weight's mean and variance in a batch. * WN works on each example on its own. BN works on whole batches. * WN is more deterministic than BN (due to not working an batches). * WN is better suited for noisy environment (RNNs, LSTMs, reinforcement learning, generative models). (Due to being more deterministic.) * WN is computationally simpler than BN. ### How its done * WN is a module added on top of a linear or convolutional layer. * If that layer's weights are `w` then WN learns two parameters `g` (scalar) and `v` (vector, identical dimension to `w`) so that `w = gv / v` is fullfilled (`v` = euclidean norm of v). * `g` is the magnitude of the weights, `v` are their orientation. * `v` is initialized to zero mean and a standard deviation of 0.05. * For networks without recursions (i.e. not RNN/LSTM/GRU): * Right after initialization, they feed a single batch through the network. * For each neuron/weight, they calculate the mean and standard deviation after the WN layer. * They then adjust the bias to `mean/stdDev` and `g` to `1/stdDev`. * That makes the network start with each feature being roughly zeromean and unitvariance. * The same method can also be applied to networks without WN. ### Results: * They define BNMEAN as a variant of BN which only normalizes to zeromean (not unitvariance). * CIFAR10 image classification (no data augmentation, some dropout, some white noise): * WN, BN, BNMEAN all learn similarly fast. Network without normalization learns slower, but catches up towards the end. * BN learns "more" per example, but is about 16% slower (timewise) than WN. * WN reaches about same test error as no normalization (both ~8.4%), BN achieves better results (~8.0%). * WN + BNMEAN achieves best results with 7.31%. * Optimizer: Adam * Convolutional VAE on MNIST and CIFAR10: * WN learns more per example und plateaus at better values than network without normalization. (BN was not tested.) * Optimizer: Adamax * DRAW on MNIST (heavy on LSTMs): * WN learns significantly more example than network without normalization. * Also ends up with better results. (Normal network might catch up though if run longer.) * Deep Reinforcement Learning (Space Invaders): * WN seemed to overall acquire a bit more reward per epoch than network without normalization. Variance (in acquired reward) however also grew. * Results not as clear as in DRAW. * Optimizer: Adamax ### Extensions * They argue that initializing `g` to `exp(cs)` (`c` constant, `s` learned) might be better, but they didn't get better test results with that. * Due to some gradient effects, `v` currently grows monotonically with every weight update. (Not necessarily when using optimizers that use separate learning rates per parameters.) * That grow effect leads the network to be more robust to different learning rates. * Setting a small hard limit/constraint for `v` can lead to better test set performance (parameter updates are larger, introducing more noise). ![CIFAR10 results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Weight_Normalization__cifar10.png?raw=true "CIFAR10 results") *Performance of WN on CIFAR10 compared to BN, BNMEAN and no normalization.* ![DRAW, DQN results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Weight_Normalization__draw_dqn.png?raw=true "DRAW, DQN results") *Performance of WN for DRAW (left) and deep reinforcement learning (right).* 
[link]
This paper performs activation maximization (AM) using Deep Generator Network (DGN), which served as a learned natural iamge prior, to synthesize realistic images as inputs and feed it into the DNN we want to understand. By visualizing synthesized images that highly activate particular neurons in the DNN, we can interpret what each of neurons in the DNN learned to detect. ### Key Points  DGN (natural image prior) generates more coherent images when optimizing fullyconnected layer codes instead of lowlevel codes. However, previous studies showed that lowlevel features results in better reconstructions beacuse it contains more image details. The difference is that here DGNAM is trying to synthesize an entire layer code from scratch. Features in lowlevel only has a small, local receptive field so that the optimization process has to independently tune image without knowing the global structure. Also, the code space at a convolutional layer is much more highdimensional, making it harder to optimize.  The learned prior trained on ImageNet can also generalize to Places.  It doesn't generalize well if architecture of the encoder trained with DGN is different with the DNN we wish to inspect.  The learned prior also generalizes to visualize hidden neurons, producing more realistic textures/colors.  When visualizing hidden neurons, DGNAM trained on ImageNet also generalize to Places and produce similar results as [1].  The synthesized images are showed to teach us what neurons in DNN we wish to inspect prefer instead of what prior prefer. ### Model ![](https://cloud.githubusercontent.com/assets/7057863/21002626/b094d7aebd6111e68c95fd4931648426.png) ### Thought Solid paper with diverse visualizations and thorough analysis. ### Reference [1] Object Detectors Emerge In Deep Scene CNNs, B.Zhou et. al. 
[link]
This paper developed a semantically rich representation for natural sound using unlabeled videos as a bridge to transfer discriminative visual knowledge from wellestablished visual recognition models into the sound modality. The learned sound representation yields significant performance improvements on standard benchmarks for acoustic scene classification task. ### Key Points  The natural synchronization between vision and sound can be leveraged as a supervision signal for each other.  Crossmodal learning can overcome overfitting if the target modal have much fewer data than other modals, which is essential for deep networks to work well.  In the sound classification task, **pool5** and **conv6** extracted from SoundNet achieve best performance. ### Model  The authors proposed a studentteacher training procedure to transfer discriminative visual knowledge from visual recognition models trained on ImageNet and Places into the SoundNet by minimizing KL divergence between their predictions. ![](https://cloud.githubusercontent.com/assets/7057863/20856609/05fe12d6b94e11e68c92995ee84fe0d7.png)  Two reasons to use CNN for sound: 1. invariant to translations; 2. stacking layers to detect higherlevel concepts. ### Exp  Adding a linear SVM upon representation learned from SoundNet outperforms other existing methods 10%.  Using lots of unlabeled videos as supervision signals enable the deeper SoundNet to work, or otherwise the 8layer networks performs poorly due to overfitting.  Simultaneous Using Places and ImageNet as supervision beats using only one of them 3%.  Multimodal recognition models use visual and sound data together yields 2% gain in classification accuracy. ### Thought I think this paper is really complete since it contains good intuition, ablation analysis, representation visualization, hidden unit visualization, and significent performance imporvements. ### Questions  Although paper said that "To handle variabletemporallength of input sound, this model uses a fully convolutional network and produces an output over multiple timesteps in video.", but the code seems to set the length of each excerpts fixed to 5 seconds.  It looks not clear for me about the data augmentation technique used in training. 
[link]
The authors presented a new generative model that learns to disentangle the factors of variations of the data. The authors claim that the proposed model is pretty robust to supervision. This is achieved by combining two of the most successful generative models: VAE and GAN. The model is able to resolve the analogies in a consistent way on several datasets with minimal parameter/architecture tunning. This paper presents a way to learn latent codes for data, that captures both the information relevant for a given classification task, as well as the remaining irrelevant factors of variation (rather than discarding the latter as a classification model would). This is done by combining a VAEstyle generative model, and adversarial training. This model proves capable of disentangling style and content in images (without explicit supervision for style information), and proves useful for analogy resolution. This paper introduces a generative model for learning to disentangle hidden factors of variation. The disentangling separates the code into two, where one is claimed to be the code that descries factors relevant to solving a specific task, and the other describing the remaining factors. Experimental results show that the proposed method is promising. The authors combine state of the art methods VAE and GAN to generate images with two complementary codes: one relevant and one irrelevant. They major contribution of the paper is the development of a training procedure that exploits triplets of images (two sharing the relevant code, one note sharing) to regularize the encoderdecoder architecture and avoid trivial solutions. The results are qualitatively good and comparable to previous article using more sources of supervision. Paper seeks to explore the variations amongst samples which separate multiple classes using auto encoders and decoders. Specifically, the authors propose combining generative adversarial networks and variational auto encoders. The idea mimics the game play between two opponents, where one attempts to fool the other into believing a synthetic sample is in fact a natural sample. The paper proposes an iterative training procedure where a generative model was first trained on a number of samples while keeping the weights of the adversary constant and later the adversary is trained while keeping the generative model weights constant. The paper performs experiments on generation of instances between classes, retrieval of instances belonging to a given class, and interpolation of instances between two classes. The experiments were performed on MNIST, a set of 2D character animation sprites, and 2D NORB toy image dataset. 
[link]
The paper proposes a "neural transducer" model for sequencetosequence tasks that operates in a lefttoright and online fashion. In other words, the model produces output as the input is received instead of waiting until the full input is received like most sequencetosequence models do. Key ideas used to make the model work include a recurrent attention mechanism, the use of an endofblock symbol in the output alphabet to indicate when the transducer should move to the next input block, and approximate algorithms based on dynamic programming and beam search for training and inference with the transducer model. Experiments on the TIMIT speech task show that the model works well and explore some of the design parameters of the model. Like similar models of this type, the input is processed by an encoder and a decoder produces an output sequence using the information provided by the encoder and conditioned on its own previous predictions. The method is evaluated on a toy problem and the TIMIT phoneme recognition task. The authors also propose some smaller ideas like two different attention mechanism variations. The map from block input to output is governed by a standard sequencetosequence model with additional state carried over from the previous block. Alignment of the two sequences is approximated by a dynamic program using a greedy local search heuristic. Experimental results are presented for phone recognition on TIMIT. The encoder is a multilayer LSTM RNN. The decoder is an RNN model conditioned on weighted sums of the last layer of the encoder and it's previous output. The weighting schemes (attention) varies and can be conditioned on the hidden states or also previous attention vectors. The decoder model produces a sequence of symbols, until it outputs a special end character "e" and is moved to the next block (other mechanisms where explored as well (no endofblocksymbol and separately predicting the end of a block given the attention vector). It is then fed the weighted sum of the next block of encoder states. The resulting sequence of symbols determines an alignment of the target symbols over the blocks of inputs, where each block may be assigned a variable number of characters. The system is trained by fixing an alignment, that approximately resembles the best alignment. Finding this approximately best alignment is akin to a beamsearch with a beam size of M (line 169), but a restricted set of symbols conditional on the last symbol in a particular hypothesis (since the target sequence is known). Alignments are computed less frequently than model updates (typically every 100 to 300 sequences). For inference, an unconstrained beamsearch procedure is performed with a threshold on sequence length and beam size. 
[link]
The authors propose to replace the notion of 'attention' in neural architectures with the notion of 'active memory' where rather than focusing on a single part of the memory one would operate on the whole of it in parallel. This paper introduces an extension to neural GPUs for machine translation. I found the experimental analysis section lacking in both comparisons to state of the art MT techniques as well as thoroughly evaluating the proposed method. This paper proposes active memory, which is a memory mechanism that operates all the part in parallel. The active memory was compared to attention mechanism and it is shown that the active memory is more effective for long sentence translation than the attention mechanism in EnglishFrench translation. This paper proposes two new models for modeling sequential data in the sequencetosequence framework. The first is called the Markovian Neural GPU and the second is called the Extended Neural GPU. Both models are extensions of the Neural GPU model (Kaiser and Sutskever, 2016), but unlike the Neural GPU, the proposed models do not model the outputs independently but instead connect the output token distributions recursively. The paper provides empirical evidence on a machine translation task showing that the two proposed models perform better than the Neural GPU model and that the Extended Neural GPU performs on par with a GRUbased encoderdecoder model with attention. 
[link]
This paper has a simple premise: that the, say, LSTM cell works better with multiplicative updates (equation 2) rather than additive ones (equation 1). This additive update is used in various places in lieu of additive ones, in various places in the LSTM recurrence equations (the exact formulation is in the supplementary material). A slightly hand wavy argument is made in favour of the multiplicative update, on the grounds of superior gradient flow (section 2.2). Mainly however, the authors make a rather thorough empirical investigation which shows remarkably good performance of their new architectures, on a range of real problems. Figure 1(a) is nice, showing an apparent greater information flow (as defined by a particular gradient) through time for the new scheme, as well as faster convergence and less saturated hidden unit activations. Overall, the experimental results appear thorough and convincing, although I am not a specialist in this area. This model presents a multiplicative alternative (with an additive component) to the additive update which happens at the core of various RNNs (Simple RNNs, GRUs, LSTMs). The multiplicative component, without introducing a significant change in the number of parameters, yields better gradient passing properties which enable the learning of better models, as shown in experiments. 
[link]
This paper proposes several definitions of measures of complexity of a recurrent neural network. They measure 1) recurrent depth (degree of multilayeredness as a function of time of recursive connections) 2) feedforward depth (degree of multilayeredness as a function of input > output connections) 3) recurrent skip coefficient (degree of directness, like the inverse of multilayeredness, of connections) In addition to the actual definitions, there are two main contributions:  The authors show that the measures (which are limits as the number of time steps > infinity) are well defined.  The authors correlate the measures with empirical performance in various ways, showing that all measure of depth can lead to improved performance. This paper provides 3 measures of complexity for RNNs. They then show experimentally that these complexity measures are meaningful, in the sense that increasingly complexity seems to correlated with better performance. The authors first present a rigorous graphtheoretic framework that describes the connecting architectures of RNNs in general, with which the authors easily explain how we can unfold an RNN. The authors then go on and propose tree architecture complexity measures of RNNs, namely the recurrent depth, the feedforward depth and the recurrent skip coefficient. Experiments on various tasks show the importance of certain measures on certain tasks, which indicates that those three complexity measures might be good guidelines when designing a recurrent neural network for certain tasks. 
[link]
The proposed approach consists in corrupting the training targets with a noise derived from the task reward while doing maximum likelihood training. This simple but specific smoothing of the target distribution allows to significantly boost the performance of neural structured output prediction as showcased on TIMIT phone and translation tasks. The link between this approach and RLbased expected reward maximization is also made clear by the paper, Prior work has chosen either maximum likelihood learning, which is relatively tractable but assumes a log likelihood loss, or reinforcement learning, which can be performed for a taskspecific loss function but requires sampling many predictions to estimate gradients. The proposed objective bridges the gap with "rewardaugmented maximum likelihood," which is similar to maximum likelihood but estimates the expected loss with samples that are drawn in proportion to their distance from the ground truth. Empirical results show good improvements with LSTMbased predictors on speech recognition and machine translation benchmarks relative to maximum likelihood training. This work is inspired by recent advancement in reinforcement learning and likelihood learning. The authors suggest to learn parameters so as to minimize the KL divergence between CRFs and a probability model that is proportional to the reward function (which the authors call payoff distribution, see Equation 4). The authors suggest an optimization algorithm for the KLdivergence minimization that depends on sampling from the payoff distribution. Current methods to learn a model for structured prediction include max margin optimisation and reinforcement learning. However, the max margin approach only optimises a bound on the true reward, and requires loss augmented inference to obtain gradients, which can be expensive. On the other hand, reinforcement learning does not make use of available supervision, and can therefore struggle when the reward is sparse, and furthermore the gradients can have high variance. The paper proposes a novel approach to learning for problems that involve structured prediction. They relate their approach to simple maximum likelihood (ML) learning and reinforcement learning (RL): ML optimises the KL divergence of a delta distribution relative to the model distribution, and RL optimises the KL divergence of the model distribution relative to the exponentiated reward distribution. They propose rewardaugmented maximum likelihood learning, which optimises the KL divergence of the exponentiated reward distribution relative to the model distribution. Compared to RL, the arguments of the KL divergence are swapped. Compared to ML, the delta distribution is generalised to the exponentiated reward distribution. Training is cheap in RML learning. It is only necessary to sample from the output set according to the exponentiated reward distribution. All experiments are performed in speech recognition and machine translation, where the structure over the output set is defined by the edit distance. An improvement is demonstrated over simple ML. 
[link]
The paper addresses the problem of compressive sensing MRI (CSMRI) by proposing a "deep unfolding" approach (cf. http://arxiv.org/abs/1409.2574) with a sparsitybased data prior and inference via ADMM. All layers of the proposed ADMMNet are based on a generalization of ADMM inference steps and are discriminatively trained to minimize a reconstruction error. In contrast to other methods for CSMRI, the proposed approach offers both high reconstruction quality and fast runtime. The basic idea is to convert the convention optimization based CS reconstruction algorithm into a fixed neural network learned with backpropagation algorithm. Specifically, the ADMMbased CS reconstruction is approximated with a deep neural network. Experimental results show that the approximated neural network outperforms several existing CSMRI algorithms with less computational time. The ADMM algorithm has proven to be useful for solving problems with differentiable and nondifferentiable terms, and therefore has a clear link with compressed sensing. Experiments prove some gain in performance with respect to the state of the art, specially in terms of computational cost at test time. 
[link]
A study of how scan orders influence Mixing time in Gibbs sampling. This paper is interested in comparing the mixing rates of Gibbs sampling using either systematic scan or random updates. The basic contributions are two: First, in Section 2, a set of cases where 1) systematic scan is polynomially faster than random updates. Together with a previously known case where it can be slower this contradicts a conjecture that the speeds of systematic and random updates are similar. Secondly, (In Theorem 1) a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other). First, following from a recent paper by Roberts and Rosenthal, the authors construct several examples which do not satisfy the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide a result Theorem 1 which provides weaker bounds, which however they verify at least under some conditions. In fact the Theorem compares random scan to a lazy version of the systematic scan and shows that and obtains bounds in terms of various other quantities, like the minimum probability, or the minimum holding probability. MCMC is at the heart of many applications of modern machine learning and statistics. It is thus important to understand the computational and theoretical performance under various conditions. The present paper focused on examining systematic Gibbs sampling in comparison to random scan Gibbs. They do so first though the construction of several examples which challenge the dominant intuitions about mixing times, and develop theoretical bounds which are much wider than previously conjectured. 
[link]
The paper ([arxiv](https://arxiv.org/abs/1610.09716)) introduces DCNNs (Doubly Convolutional Neural Networks). Those are CNNs which contain a new layer type which generalized convolutional layers. ## Ideas CNNs seem to learn many filters which are similar to other learned filters in the same layer. The weights are only slightly shifted. The idea of double convolution is to learn groups filters where filters within each group are translated versions of each other. To achieve this, a doubly convolutional layer allocates a set of meta filters which has filter sizes that are larger than the effective filter size. Effective filters can be then extracted from each meta filter, which corresponds to convolving the meta filters with an identity kernel. All the extracted filters are then concatenated, and convolved with the input. > We have also confirmed that replacing a convolutional layer with a doubly convolutional layer consistently improves the performance, regardless of the depth of the layer. ## Evaluation * CIFAR10+: 7.24% error * CIFAR100+: 26.53% error * ImageNet: 8.23% Top5 error ## Critique The ktranslation correlation is effectively a [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). I think the authors should have mentioned that. ## Related TODO 
[link]
TLDR; The authors encourage exploration by adding a pseudoreward of the form $\frac{\beta}{\sqrt{count(state)}}$ for infrequently visited states. State visits are counted using Locality Sensitive Hashing (LSH) based on an environmentspecific feature representation like raw pixels or autoencoder representations. The authors show that this simple technique achieves gains in various classic RL control tasks and several games in the ATARI domain. While the algorithm itself is simple there are now several more hyperaprameters to tune: The bonus coefficient `beta`, the LSH hashing granularity (how many bits to use for hashing) as well as the type of feature representation based on which the hash is computed, which itself may have more parameters. The experiments don't paint a consistent picture and different environments seem to need vastly different hyperparameter settings, which in my opinion will make this technique difficult to use in practice. 
[link]
#### Introduction * Automated Theorem Proving (ATP)  Attempting to prove mathematical theorems automatically. * Bottlenecks in ATP: * **Autoformalization**  Semantic or formal parsing of informal proofs. * **Automated Reasoning**  Reasoning about already formalised proofs. * Paper evaluates the effectiveness of neural sequence models for premise selection (related to automated reasoning) without using hand engineered features. * [Link to the paper](https://arxiv.org/abs/1606.04442) #### Premise Selection * Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A #### Dataset * Mizar Mathematical Library (MML) used as the formal corpus. * The premise length varies from 5 to 84299 characters and over 60% if the words occur fewer than 10 times (rare word problem). #### Approach * The model predicts the probability that a given axiom is useful for proving a given conjecture. * Conjecture and axiom sequences are separately embedded into fixed length real vectors, then concatenated and passed to a third network with few fully connected layers and logistic loss. * The two embedding networks and the joint predictor path are trained jointly. ##### Stage 1: Characterlevel Models * Treat premises on character level using an 80dimensional one hot encoding vector. * Architectures for embedding: * pure recurrent LSTM and GRU Network * CNN (with max pooling) * Recurrentconvolutional network that shortens the sequence using convolutional layer before feeding it to LSTM. ##### Stage 2: Wordlevel Models * MML dataset contains both implicit and explicit definitions. * To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded and the definition embeddings are used as word level embeddings. * This is better than recursively expanding and embedding the word definition as the definition chains can be very deep. * Once word level embeddings are obtained, the architecture from Characterlevel models can be reused. #### Experiments ##### Metrics * For each conjecture, the model ranks the possible premises. * Primary metric is the number of conjectures proved from topk premises. * Average Max Relative Rank (AMMR) is more sophisticated measure based on the motivation that conjectures are easier to prove if all their dependencies occur earlier in ranking. * Since it is very costly to rank all axioms for a conjecture, an approximation is made and a fixed number of random false dependencies are used for evaluating AMMR. ##### Network Training * Asynchronous distributed stochastic gradient descent using Adam optimizer. * Clipped vs Nonclipped Gradients. * Max Sequence length of 2048 for characterlevel sequences and 500 for wordlevel sequences. ##### Results * Best Selection Pipeline  Stage 1 characterlevel CNN which produces wordlevel embeddings for the next stage. * Best models use simple CNNs followed by max pooling and twostage definitionbased defCNN outperforms naive wordCNN (where word embeddings are learnt in a single pass). 