Welcome to ShortScience.org! 
This paper presents a recurrent neural network architecture in which some of the recurrent weights dynamically change during the forward pass, using a hebbianlike rule. They correspond to the matrices $A(t)$ in the figure below: ![Fast weights RNN figure](http://i.imgur.com/DCznSf4.png) These weights $A(t)$ are referred to as *fast weights*. Comparatively, the recurrent weights $W$ are referred to as slow weights, since they are only changing due to normal training and are otherwise kept constant at test time. More specifically, the proposed fast weights RNN compute a series of hidden states $h(t)$ over time steps $t$, but, unlike regular RNNs, the transition from $h(t)$ to $h(t+1)$ consists of multiple ($S$) recurrent layers $h_1(t+1), \dots, h_{S1}(t+1), h_S(t+1)$, defined as follows: $$h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))$$ where $f$ is an elementwise nonlinearity such as the ReLU activation. The next hidden state $h(t+1)$ is simply defined as the last "inner loop" hidden state $h_S(t+1)$, before moving to the next time step. As for the fast weights $A(t)$, they too change between time steps, using the hebbianlike rule: $$A(t+1) = \lambda A(t) + \eta h(t) h(t)^T$$ where $\lambda$ acts as a decay rate (to partially forget some of what's in the past) and $\eta$ as the fast weight's "learning rate" (not to be confused with the learning rate used during backprop). Thus, the role played by the fast weights is to rapidly adjust to the recent hidden states and remember the recent past. In fact, the authors show an explicit relation between these fast weights and memoryaugmented architectures that have recently been popular. Indeed, by recursively applying and expending the equation for the fast weights, one obtains $$A(t) = \eta \sum_{\tau = 1}^{\tau = t1}\lambda^{t\tau1} h(\tau) h(\tau)^T$$ *(note the difference with Equation 3 of the paper... I think there was a typo)* which implies that when computing the $A(t) h_s(t+1)$ term in the expression to go from $h_s(t+1)$ to $h_{s+1}(t+1)$, this term actually corresponds to $$A(t) h_s(t+1) = \eta \sum_{\tau =1}^{\tau = t1} \lambda^{t\tau1} h(\tau) (h(\tau)^T h_s(t+1))$$ i.e. $A(t) h_s(t+1)$ is a weighted sum of all previous hidden states $h(\tau)$, with each hidden states weighted by an "attention weight" $h(\tau)^T h_s(t+1)$. The difference with many recent memoryaugmented architectures is thus that the attention weights aren't computed using a softmax nonlinearity. Experimentally, they find it beneficial to use [layer normalization](https://arxiv.org/abs/1607.06450). Good values for $\eta$ and $\lambda$ seem to be 0.5 and 0.9 respectively. I'm not 100% sure, but I also understand that using $S=1$, i.e. using the fast weights only once per time steps, was usually found to be optimal. Also see Figure 3 for the architecture used on the image classification datasets, which is slightly more involved. The authors present a series 4 experiments, comparing with regular RNNs (IRNNs, which are RNNs with ReLU units and whose recurrent weights are initialized to a scaled identity matrix) and LSTMs (as well as an associative LSTM for a synthetic associative retrieval task and ConvNets for the two image datasets). Generally, experiments illustrate that the fast weights RNN tends to train faster (in number of updates) and better than the other recurrent architectures. Surprisingly, the fast weights RNN can even be competitive with a ConvNet on the two image classification benchmarks, where the RNN traverses glimpses from the image using a fixed policy. **My two cents** This is a very thought provoking paper which, based on the comparison with LSTMs, suggests that fast weights RNNs might be a very good alternative. I'd be quite curious to see what would happen if one was to replace LSTMs with them in the myriad of papers using LSTMs (e.g. all the Seq2Seq work). Intuitively, LSTMs seem to be able to do more than just attending to the recent past. But, for a given task, if one was to observe that fast weights RNNs are competitive to LSTMs, it would suggests that the LSTM isn't doing something that much more complex. So it would be interesting to determine what are the tasks where the extra capacity of an LSTM is actually valuable and exploitable. Hopefully the authors will release some code, to facilitate this exploration. The discussion at the end of Section 3 on how exploiting the "memory augmented" view of fast weights is useful to allow the use of minibatches is interesting. However, it also suggests that computations in the fast weights RNN scales quadratically with the sequence size (since in this view, the RNN technically must attend to all previous hidden states, since the beginning of the sequence). This is something to keep in mind, if one was to consider applying this to very long sequences (i.e. much longer than the hidden state dimensionality). Also, I don't quite get the argument that the "memory augmented" view of fast weights is more amenable to minibatch training. I understand that having an explicit weight matrix $A(t)$ for each minibatch sequence complicates things. However, in the memory augmented view, we also have a "memory matrix" that is different for each sequence, and yet we can handle that fine. The problem I can imagine is that storing a *sequence of arbitrary weight matrices* for each sequence might be storage demanding (and thus perhaps make it impossible to store a forward/backward pass for more than one sequence at a time), while the implicit memory matrix only requires appending a new row at each time step. Perhaps the argument to be made here is more that there's already minibatch compatible code out there for dealing with the use of a memory matrix of stored previous memory states. This work strikes some (partial) resemblance to other recent work, which may serve as food for thought here. The use of possibly multiple computation layers between time steps reminds me of [Adaptive Computation Time (ACT) RNN]( http://www.shortscience.org/paper?bibtexKey=journals/corr/Graves16). Also, expressing a backpropable architecture that involves updates to weights (here, hebbianlike updates) reminds me of recent work that does backprop through the updates of a gradient descent procedure (for instance as in [this work]( http://www.shortscience.org/paper?bibtexKey=conf/icml/MaclaurinDA15)). Finally, while I was familiar with the notion of fast weights from the work on [Using Fast Weights to Improve Persistent Contrastive Divergence](http://people.ee.duke.edu/~lcarin/FastGibbsMixing.pdf), I didn't realize that this concept dated as far back as the late 80s. So, for young researchers out there looking for inspiration for research ideas, this paper confirms that looking at the older neural network literature for inspiration is probably a very good strategy :) To sum up, this is really nice work, and I'm looking forward to the NIPS 2016 oral presentation of it! 
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the 3 following ideas: **1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better. **2) Separate BN for the "hiddens to hiddens preactivation and for the "inputs to hiddens" preactivation.** In other words, 2 separate BN operators are applied on each contributions to the preactivation, before summing and passing through the tanh and sigmoid nonlinearities. **3) Use of largest timestep BN statistics for longer testtime sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses pertimestep mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how do deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the preactivation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics. Among these ideas, I believe the most impactful idea is 1). The papers mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks. Experiments on 4 datasets confirms the method's success. **My two cents** This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice. 
This paper presents an unsupervised generative model, based on the variational autoencoder framework, but where the encoder is a recurrent neural network that sequentially infers the identity, pose and number of objects in some input scene (2D image or 3D scene). In short, this is done by extending the DRAW model to incorporate discrete latent variables that determine whether an additional object is present or not. Since the reparametrization trick cannot be used for discrete variables, the authors estimate the gradient through the sampling operation using a likelihood ratio estimator. Another innovation over DRAW is the application to 3D scenes, in which the decoder is a graphics renderer. Since it is not possible to backpropagate through the renderer, gradients are estimated using finitedifference estimates (which require going through the renderer several times). Experiments are presented where the evaluation is focused on the ability of the model to detect and count the number of objects in the image or scene. **My two cents** This is a nice, natural extension of DRAW. I'm particularly impressed by the results for the 3D scene setting. Despite the fact that setup is obviously synthetic and simplistic, I really surprised that estimating the decoder gradients using finitedifferences worked at all. It's also interesting to see that the proposed model does surprisingly well compared to a CNN supervised approach that directly predicts the objects identity and pose. Quite cool! To see the model in action, see [this cute video][1]. [1]: https://www.youtube.com/watch?v=4tc84kKdpY4 
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as $Y = X + F(X)$ where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately on both terms of a layer's equation: $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$ where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term. The paper shows that this form of dropout is at least as good or superior as other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last. In addition to this observation, I also note the following empirical observations: 1. At test time, averaging the output layers of multiple dropout mask samples (referenced to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard. 2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) and with fewer layers (e.g. 32) than the orignal ResNet work with thin but very deep (more than 1000 layers) ResNets. This would confirm a similar observation from [this paper][2]. Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice. [1]: http://arxiv.org/abs/1603.09382v1 [2]: https://arxiv.org/abs/1605.07146 
This paper derives an algorithm for passing gradients through a sample from a mixture of Gaussians. While the reparameterization trick allows to get the gradients with respect to the Gaussian means and covariances, the same trick cannot be invoked for the mixing proportions parameters (essentially because they are the parameters of a multinomial discrete distribution over the Gaussian components, and the reparameterization trick doesn't extend to discrete distributions). One can think of the derivation as proceeding in 3 steps: 1. Deriving an estimator for gradients a sample from a 1dimensional density $f(x)$ that is such that $f(x)$ is differentiable and its cumulative distribution function (CDF) $F(x)$ is tractable: $\frac{\partial \hat{x}}{\partial \theta} =  \frac{1}{f(\hat{x})}\int_{t=\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta} dt$ where $\hat{x}$ is a sample from density $f(x)$ and $\theta$ is any parameter of $f(x)$ (the above is a simplified version of Equation 6). This is probably the most important result of the paper, and is based on a really clever use of the general form of the Leibniz integral rule. 2. Noticing that one can sample from a $D$dimensional Gaussian mixture by decomposing it with the product rule $f({\bf x}) = \prod_{d=1}^D f(x_d{\bf x}_{<d})$ and using ancestral sampling, where each $f(x_d{\bf x}_{<d})$ are themselves 1dimensional mixtures (i.e. with differentiable densities and tractable CDFs) 3. Using the 1dimensional gradient estimator (of Equation 6) and the chain rule to backpropagate through the ancestral sampling procedure. This requires computing the integral in the expression for $\frac{\partial \hat{x}}{\partial \theta}$ above, where $f(x)$ is one of the 1D conditional Gaussian mixtures and $\theta$ is a mixing proportion parameter $\pi_j$. As it turns out, this integral has an analytical form (see Equation 22). **My two cents** This is a really surprising and neat result. The author mentions it could be applicable to variational autoencoders (to support posteriors that are mixtures of Gaussians), and I'm really looking forward to read about whether that can be successfully done in practice. The paper provides the derivation only for mixtures of Gaussians with diagonal covariance matrices. It is mentioned that extending to nondiagonal covariances is doable. That said, ancestral sampling with nondiagonal covariances would become more computationally expensive, since the conditionals under each Gaussian involves a matrix inverse. Beyond the case of Gaussian mixtures, Equation 6 is super interesting in itself as its application could go beyond that case. This is probably why the paper also derived a samplingbased estimator for Equation 6, in Equation 9. However, that estimator might be inefficient, since it involves sampling from Equation 10 with rejection, and it might take a lot of time to get an accepted sample if $\hat{x}$ is very small. Also, a good estimate of Equation 6 might require *multiple* samples from Equation 10. Finally, while I couldn't find any obvious problem with the mathematical derivation, I'd be curious to see whether using the same approach to derive a gradient on one of the Gaussian mean or standard deviation parameters gave a gradient that is consistent with what the reparameterization trick provides.
3 Comments

This paper proposes a variant of Neural Turing Machine (NTM) for metalearning or "learning to learn", in the specific context of fewshot learning (i.e. learning from few examples). Specifically, the proposed model is trained to ingest as input a training set of examples and improve its output predictions as examples are processed, in a purely feedforward way. This is a form of metalearning because the model is trained so that its forward pass effectively executes a form of "learning" from the examples it is fed as input. During training, the model is fed multiples sequences (referred to as episodes) of labeled examples $({\bf x}_1, {\rm null}), ({\bf x}_2, y_1), \dots, ({\bf x}_T, y_{T1})$, where $T$ is the size of the episode. For instance, if the model is trained to learn how to do 5class classification from 10 examples per class, $T$ would be $5 \times 10 = 50$. Mainly, the paper presents experiments on the Omniglot dataset, which has 1623 classes. In these experiments, classes are separated into 1200 "training classes" and 423 "test classes", and each episode is generated by randomly selecting 5 classes (each assigned some arbitrary vector representation, e.g. a onehot vector that is consistent within the episode, but not across episodes) and constructing a randomly ordered sequence of 50 examples from within the chosen 5 classes. Moreover, the correct label $y_t$ of a given input ${\bf x}_t$ is always provided only at the next time step, but the model is trained to be good at its prediction of the label of ${\bf x}_t$ at the current time step. This is akin to the scenario of online learning on a stream of examples, where the label of an example is revealed only once the model has made a prediction. The proposed NTM is different from the original NTM of Alex Graves, mostly in how it writes into its memory. The authors propose to focus writing to either the least recently used memory location or the most recently used memory location. Moreover, the least recently used memory location is reset to zero before every write (an operation that seems to be ignored when backpropagating gradients). Intuitively, the proposed NTM should learn a strategy by which, given a new input, it looks into its memory for information from other examples earlier in the episode (perhaps similarly to what a nearest neighbor classifier would do) to predict the class of the new input. The paper presents experiments in learning to do multiclass classification on the Omniglot dataset and regression based on functions synthetically generated by a GP. The highlights are that: 1. The proposed model performs much better than an LSTM and better than an NTM with the original write mechanism of Alex Graves (for classification). 2. The proposed model even performs better than a 1st nearest neighbor classifier. 3. The proposed model is even shown to outperform human performance, for the 5class scenario. 4. The proposed model has decent performance on the regression task, compared to GP predictions using the groundtruth kernel. **My two cents** This is probably one of my favorite ICML 2016 papers. I really think metalearning is a problem that deserves more attention, and this paper presents both an interesting proposal for how to do it and an interesting empirical investigation of it. Much like previous work [\[1\]][1] [\[2\]][2], learning is based on automatically generating a metalearning training set. This is clever I think, since a very large number of such "metalearning" examples (the episodes) can be constructed, thus transforming what is normally a "small data problem" (few shot learning) into a "big data problem", for which deep learning is more effective. I'm particularly impressed by how the proposed model outperforms a 1nearest neighbor classifier. That said, the proposed NTM actually performs 4 reads at each time step, which suggests that a fairer comparison might be with a 4nearest neighbor classifier. I do wonder how this baseline would compare. I'm also impressed with the observation that the proposed model surpassed humans. The paper also proposes to use 5letter words to describe classes, instead of onehot vectors. The motivation is that this should make it easier for the model to scale to much more than 5 classes. However, I don't entirely follow the logic as to why onehot vectors are problematic. In fact, I would think that arbitrarily assigning 5letter words to classes would instead imply some similarity between classes that share letters that is arbitrary and doesn't reflect true class similarity. Also, while I find it encouraging that the performance for regression of the proposed model is decent, I'm curious about how it would compare with a GP approach that incrementally learns the kernel's hyperparameter (instead of using the groundtruth values, which makes this baseline unrealistically strong). Finally, I'm still not 100% sure how exactly the NTM is able to implement the type of feedforward inference I'd expect to be required. I would expect it to learn a memory representation of examples that combines information from the input vector ${\bf x}_t$ *and* its label $y_t$. However, since the label of an input is presented at the following time step in an episode, it is not intuitive to me then how the read/write mechanisms are able to deal with this misalignment. My only guess is that since the controller is an LSTM, then it can somehow remember ${\bf x}_t$ until it gets $y_t$ and appropriately include the combined information into the memory. This could be supported by the fact that using a nonrecurrent feedforward controller is much worse than using an LSTM controller. But I'm not 100% sure of this either. All the above being said, this is still a really great paper, which I hope will help stimulate more research on metalearning. Hopefully code for this paper can eventually be released, which would help in popularizing the topic. [1]: http://snowedin.net/tmp/Hochreiter2001.pdf [2]: http://www.thespermwhale.com/jaseweston/ram/papers/paper_16.pdf 
Originally posted on my Github [papernotes](https://github.com/karpathy/papernotes/blob/master/matching_networks.md) repo. # Matching Networks for One Shot Learning By DeepMind crew: **Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra** This is a paper on **oneshot** learning, where we'd like to learn a class based on very few (or indeed, 1) training examples. E.g. it suffices to show a child a single giraffe, not a few hundred thousands before it can recognize more giraffes. This paper falls into a category of *"duh of course"* kind of paper, something very interesting, powerful, but somehow obvious only in retrospect. I like it. Suppose you're given a single example of some class and would like to label it in test images.  **Observation 1**: a standard approach might be to train an Exemplar SVM for this one (or few) examples vs. all the other training examples  i.e. a linear classifier. But this requires optimization.  **Observation 2:** known nonparameteric alternatives (e.g. kNearest Neighbor) don't suffer from this problem. E.g. I could immediately use a Nearest Neighbor to classify the new class without having to do any optimization whatsoever. However, NN is gross because it depends on an (arbitrarilychosen) metric, e.g. L2 distance. Ew.  **Core idea**: lets train a fully endtoend nearest neighbor classifer!![Screen Shot 20160807 at 10.08.44 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2010.08.44%20PM.png) ## The training protocol As the authors amusingly point out in the conclusion (and this is the *duh of course* part), *"oneshot learning is much easier if you train the network to do oneshot learning"*. Therefore, we want the testtime protocol (given N novel classes with only k examples each (e.g. k = 1 or 5), predict new instances to one of N classes) to exactly match the training time protocol. To create each "episode" of training from a dataset of examples then: 1. Sample a task T from the training data, e.g. select 5 labels, and up to 5 examples per label (i.e. 525 examples). 2. To form one episode sample a label set L (e.g. {cats, dogs}) and then use L to sample the support set S and a batch B of examples to evaluate loss on. The idea on high level is clear but the writing here is a bit unclear on details, of exactly how the sampling is done. ## The model I find the paper's model description slightly wordy and unclear, but basically we're building a **differentiable nearest neighbor++**. The output \hat{y} for a test example \hat{x} is computed very similar to what you might see in Nearest Neighbors:![Screen Shot 20160807 at 11.14.26 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2011.14.26%20PM.png) where **a** acts as a kernel, computing the extent to which \hat{x} is similar to a training example x_i, and then the labels from the training examples (y_i) are weightblended together accordingly. The paper doesn't mention this but I assume for classification y_i would presumbly be onehot vectors. Now, we're going to embed both the training examples x_i and the test example \hat{x}, and we'll interpret their inner products (or here a cosine similarity) as the "match", and pass that through a softmax to get normalized mixing weights so they add up to 1. No surprises here, this is quite natural: ![Screen Shot 20160807 at 11.20.29 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2011.20.29%20PM.png) Here **c()** is cosine distance, which I presume is implemented by normalizing the two input vectors to have unit L2 norm and taking a dot product. I assume the authors tried skipping the normalization too and it did worse? Anyway, now all that's left to define is the function **f** (i.e. how do we embed the test example into a vector) and the function **g** (i.e. how do we embed each training example into a vector?). **Embedding the training examples.** This (the function **g**) is a bidirectional LSTM over the examples: ![Screen Shot 20160807 at 11.57.10 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2011.57.10%20PM.png) i.e. encoding of i'th example x_i is a function of its "raw" embedding g'(x_i) and the embedding of its friends, communicated through the bidirectional network's hidden states. i.e. each training example is a function of not just itself but all of its friends in the set. This is part of the ++ above, because in a normal nearest neighbor you wouldn't change the representation of an example as a function of the other data points in the training set. It's odd that the **order** is not mentioned, I assume it's random? This is a bit gross because order matters to a bidirectional LSTM; you'd get different embeddings if you permute the examples. **Embedding the test example.** This (the function **f**) is a an LSTM that processes for a fixed amount (K time steps) and at each point also *attends* over the examples in the training set. The encoding is the last hidden state of the LSTM. Again, this way we're allowing the network to change its encoding of the test example as a function of the training examples. Nifty: ![Screen Shot 20160808 at 12.11.15 AM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160808%20at%2012.11.15%20AM.png) That looks scary at first but it's really just a vanilla LSTM with attention where the input at each time step is constant (f'(\hat{x}), an encoding of the test example all by itself) and the hidden state is a function of previous hidden state but also a concatenated readout vector **r**, which we obtain by attending over the encoded training examples (encoded with **g** from above). Oh and I assume there is a typo in equation (5), it should say r_k = … without the 1 on LHS. ## Experiments **Task**: Nway kshot learning task. i.e. we're given k (e.g. 1 or 5) labelled examples for N classes that we have not previously trained on and asked to classify new instances into he N classes. **Baselines:** an "obvious" strategy of using a pretrained ConvNet and doing nearest neighbor based on the codes. An option of finetuning the network on the new examples as well (requires training and careful and strong regularization!). **MANN** of Santoro et al. [21]: Also a DeepMind paper, a fun NTMlike MetaLearning approach that is fed a sequence of examples and asked to predict their labels. **Siamese network** of Koch et al. [11]: A siamese network that takes two examples and predicts whether they are from the same class or not with logistic regression. A test example is labeled with a nearest neighbor: with the class it matches best according to the siamese net (requires iteration over all training examples one by one). Also, this approach is less endtoend than the one here because it requires the adhoc nearest neighbor matching, while here the *exact* end task is optimized for. It's beautiful. ### Omniglot experiments ### ![Screen Shot 20160808 at 10.21.45 AM](https://github.com/karpathy/papernotes/raw/master/img/matching_networks/Screen%20Shot%2020160808%20at%2010.21.45%20AM.png) Omniglot of [Lake et al. [14]](http://www.cs.toronto.edu/~rsalakhu/papers/LakeEtAl2015Science.pdf) is a MNISTlike scribbles dataset with 1623 characters with 20 examples each. Image encoder is a CNN with 4 modules of [3x3 CONV 64 filters, batchnorm, ReLU, 2x2 max pool]. The original image is claimed to be so resized from original 28x28 to 1x1x64, which doesn't make sense because factor of 2 downsampling 4 times is reduction of 16, and 28/16 is a noninteger >1. I'm assuming they use VALID convs? Results: ![Screen Shot 20160808 at 10.27.46 AM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160808%20at%2010.27.46%20AM.png) Matching nets do best. Fully Conditional Embeddings (FCE) by which I mean they the "Full Context Embeddings" of Section 2.1.2 instead are not used here, mentioned to not work much better. Finetuning helps a bit on baselines but not with Matching nets (weird). The comparisons in this table are somewhat confusing:  I can't find the MANN numbers of 82.8% and 94.9% in their paper [21]; not clear where they come from. E.g. for 5 classes and 5shot they seem to report 88.4% not 94.9% as seen here. I must be missing something.  I also can't find the numbers reported here in the Siamese Net [11] paper. As far as I can tell in their Table 2 they report oneshot accuracy, 20way classification to be 92.0, while here it is listed as 88.1%?  The results of Lake et al. [14] who proposed Omniglot are also missing from the table. If I'm understanding this correctly they report 95.2% on 1shot 20way, while matching nets here show 93.8%, and humans are estimated at 95.5%. That is, the results here appear weaker than those of Lake et al., but one should keep in mind that the method here is significantly more generic and does not make any assumptions about the existence of strokes, etc., and it's a simple, single fullydifferentiable blob of neural stuff. (skipping ImageNet/LM experiments as there are few surprises) ## Conclusions Good paper, effectively develops a differentiable nearest neighbor trained endtoend. It's something new, I like it! A few concerns:  A bidirectional LSTMs (not orderinvariant compute) is applied over sets of training examples to encode them. The authors don't talk about the order actually used, which presumably is random, or mention this potentially unsatisfying feature. This can be solved by using a recurrent attentional mechanism instead, as the authors are certainly aware of and as has been discussed at length in [ORDER MATTERS: SEQUENCE TO SEQUENCE FOR SETS](https://arxiv.org/abs/1511.06391), where Oriol is also the first author. I wish there was a comment on this point in the paper somewhere.  The approach also gets quite a bit slower as the number of training examples grow, but once this number is large one would presumable switch over to a parameteric approach.  It's also potentially concerning that during training the method uses a specific number of examples, e.g. 525, so this is the number of that must also be used at test time. What happens if we want the size of our training set to grow online? It appears that we need to retrain the network because the encoder LSTM for the training data is not "used to" seeing inputs of more examples? That is unless you fall back to iteratively subsampling the training data, doing multiple inference passes and averaging, or something like that. If we don't use FCE it can still be that the attention mechanism LSTM can still not be "used to" attending over many more examples, but it's not clear how much this matters. An interesting experiment would be to not use FCE and try to use 100 or 1000 training examples, while only training on up to 25 (with and fithout FCE). Discussion surrounding this point would be interesting.  Not clear what happened with the Omniglot experiments, with incorrect numbers for [11], [21], and the exclusion of Lake et al. [14] comparison.  A baseline that is missing would in my opinion also include training of an [Exemplar SVM](https://www.cs.cmu.edu/~tmalisie/projects/iccv11/), which is a much more powerful approach than encodewithacnnandnearestneighbor.
4 Comments

This paper can be thought as proposing a variational autoencoder applied to a form of metalearning, i.e. where the input is not a single input but a dataset of inputs. For this, in addition to having to learn an approximate inference network over the latent variable $z_i$ for each input $x_i$ in an input dataset $D$, approximate inference is also learned over a latent variable $c$ that is global to the dataset $D$. By using Gaussian distributions for $z_i$ and $c$, the reparametrization trick can be used to train the variational autoencoder. The generative model factorizes as $p(D=(x_1,\dots,x_N), (z_1,\dots,z_N), c) = p(c) \prod_i p(z_ic) p(x_iz_i,c)$ and learning is based on the following variational posterior decomposition: $q((z_1,\dots,z_N), cD=(x_1,\dots,x_N)) = q(cD) \prod_i q(z_ix_i,c)$. Moreover, latent variable $z_i$ is decomposed into multiple ($L$) layers $z_i = (z_{i,1}, \dots, z_{i,L})$. Each layer in the generative model is directly connected to the input. The layers are generated from $z_{i,L}$ to $z_{i,1}$, each layer being conditioned on the previous (see Figure 1 *Right* for the graphical model), with the approximate posterior following a similar decomposition. The architecture for the approximate inference network $q(cD)$ first maps all inputs $x_i\in D$ into a vector representation, then performs mean pooling of these representations to obtain a single vector, followed by a few more layers to produce the parameters of the Gaussian distribution over $c$. Training is performed by stochastic gradient descent, over minibatches of datasets (i.e. multiple sets $D$). The model has multiple applications, explored in the experiments. One is of summarizing a dataset $D$ into a smaller subset $S\in D$. This is done by initializing $S\leftarrow D$ and greedily removing elements of $S$, each time minimizing the KL divergence between $q(cD)$ and $q(cS)$ (see the experiments on a synthetic Spatial MNIST problem of section 5.3). Another application is fewshot classification, where very few examples of a number of classes are given, and a new test example $x'$ must be assigned to one of these classes. Classification is performed by treating the small set of examples of each class $k$ as its own dataset $D_k$. Then, test example $x$ is classified into class $k$ for which the KL divergence between $q(cx')$ and $q(cD_k)$ is smallest. Positive results are reported when training on OMNIGLOT classes and testing on either the MNIST classes or unseen OMNIGLOT datasets, when compared to a 1nearest neighbor classifier based on the raw input or on a representation learned by a regular autoencoder. Finally, another application is that of generating new samples from an input dataset of examples. The approximate posterior is used to compute $q(cD)$. Then, $c$ is assigned to its posterior mean, from which a value for the hidden layers $z$ and finally a sample $x$ can be generated. It is shown that this procedure produces convincing samples that are visually similar from those in the input set $D$. **My two cents** Another really nice example of deep learning applied to a form of metalearning, i.e. learning a model that is trained to take *new* datasets as input and generalize even if confronted to datasets coming from an unseen data distribution. I'm particularly impressed by the many tasks explored successfully with the same approach: fewshot classification and generative sampling, as well as a form of summarization (though this last probably isn't really metalearning). Overall, the approach is quite elegant and appealing. The very simple, synthetic experiments of section 5.1 and 5.2 are also interesting. Section 5.2 presents the notion of a *priorinterpolation layer*, which is well motivated but seems to be used only in that section. I wonder how important it is, outside of the specific case of section 5.2. Overall, very excited by this work, which further explores the theme of metalearning in an interesting way. 
Fast RCNN is a proposal detection net for object detection tasks. ##### Input & Output The input to a Fast RCNN would be the input image and the region proposals (generated using Selective Search). There are 2 outputs of the net, probability map of all possible objects & background ( e.g. 21 classes for Pascal VOC'12) and corresponding bounding box parameters for each object classes. ##### Architecture The Fast RCNN version of any deep net would need 3 major modifications. For e.g. for VGG'16 1. A ROI pooling layer needs to be added after the final maxpool output before fully connected layers 2. The final FC layer is replaced by 2 sibling branched layers  one for giving a softmax output for probability classes, other one is for predicting an encoding of 4 bounding box parameters (x,y, width,height) w.r.t. region proposals 3. Modifying the input 2 take 2 input. images and corresponding prposals **ROI Pooling layer**  The most notable contribution from the paper is designed to maxpool the features inside a proposed region into a fixed size (for VGG'16 version of FCNN it was 7 x 7) . The intuition behind the layer is make it faster as compared to SPPNets, (which used spatial pyramidal pooling) and RCNN. ##### Results The net is trained with dual loss (log loss on probability output + squared error loss on bounding box parameters) . The results were very impressive, on the VOC '07, '10 & '12 datasets with Fast RCNN outperforming the rest of the nets, in terms of mAp accuracy 
Generates abstractive summaries from news articles. Also see [blog](https://metamind.io/research/yourtldrbyanaiadeepreinforcedmodelforabstractivesummarization) * Input: * vocab size 150K * start with $W_\text{emb}$ Glove 100 * Seq2Seq: * bidirectional LSTM, `size=200` in each direction. Final hidden states are concatenated and feed as initial hidden state of the decoder an LSTM of `size=400`. surprising it's only one layer. * Attention: * Add standard attention mechanism between each new hidden state of the decoder and all the hidden states of the encoder * A new kind of attention mechanism is done between the new hidden state of the decoder and all previous hidden states of the decoder * the new hidden state is concatenated with the two attention outputs and feed to dense+softmax to model next word in summary (output vocab size 50K). The weight matrix $W_h$ is reduced to $W_h = \tanh \left( W_\text{emb} W_\text{proj} \right) $ resulting in faster converges, see [1](arXiv:1611.01462) and [2](https://arxiv.org/abs/1608.05859) * Pointer mechanism: * The concatenated values are also feed to logistic classifier to decide if the softmax output should be used or one of the words in the article should be copied to the output. The article word to be copied is selected using same weights computed in the attention mechanism * Loss * $L_\text{ml}$: NLL of the example summary $y^*$. If only $L_\text{ml}$ is used then 25% of the times use generated instead of given sample as input to next step. * $L_\text{rl}$: sample an entire summary from the model $y^s$ (temperature=1) and the loss is the NLL of the sample multiplied by a reward. The reward is $r(y^s)r(\hat{y})$ where $r$ is ROUGEL and $\hat{y}$ is a generated greedy sequences * $L=\gamma L_\text{rl} + (1\gamma)L_\text{ml}$ where $\gamma=0.9984$ * Training * `batch=50`, Adam, `LR=1e4` for RL/ML+RL training * The training labels are summary examples and an indication if copy was used in the pointer mechanism and which word was copied. This is indicated when the summary word is OOV or if it appears in the article and its NER is one of PERSON, LOCATION, ORGANIZATION or MISC * Generation * 5 beams * force trigrams not to appear twice in the same beam 
Facebook has [released a series of papers](https://research.facebook.com/blog/learningtosegment/) for object segmentation and detection. This paper is the first in that series. This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)): 1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm. 2. A CNN classifier is applied on each of the proposals. The current paper improves the step 1, i.e., region/object proposals. Most object proposals approaches fall into three categories: * Objectness scoring * Seed Segmentation * Superpixel Merging Current method is different from these three. It share similarities with [Faster RCNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN. The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object. ## Model and Training Both mask and score predictions are achieved with a single convolutional network but with multiple outputs. All the convolutional layers except the last few are from VGGA pretrained model. Each training sample is a triplet of RGB input patch, the binary mask corresponding to the input patch, a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints: * the patch contains an object roughly centered in the input patch * the object is fully contained in the patch and in a given scale range Note that the network must output a mask for a single object at the center even when multiple objects are present. Figure 1 shows the architecture and sampling for training. ![figure1](https://i.imgur.com/zSyP0ij.png) Model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation. ## Inference During full image inference, model is applied densely at multiple locations and scales. This can be done efficiently since all computations are convolutional like in a fully convolutional network (FCN). ![figure2](https://i.imgur.com/dQWfy8R.png) This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation. 
#### Idea Reverse Classification Accuracy (RCA) models are aims to answer the question on how to estimate performance of models (semantic segmentation models were explained in the paper) in cases where ground truth is not available. #### Why is it important Before deployment, performance is quantified using different metrics, for which the predicted segmentation is compared to a reference segmentation, often obtained manually by an expert. But little is known about the real performance after deployment when a reference is unavailable. RCA aims to quantify the performance in those deployment scenarios #### Methodology The RCA model pipeline follows a simple enough pipeline for the same: 1. Train a model M on training dataset T containing input images and ground truth {**I**,**G**} 2. Use M to predict segmentation map for an input image II to get segmentation map SS 3. Train a RCA model that uses input image II to predict SS. As it's a single datapoint for the model it would overfit. There's no validation set for the RCA model 4. Test the performance of RCA model on Images which have ground truth G and the best performance of the model is an indicator of the performance (DSC  Dice Similarity Coefficient) of how the original image would perform on a new image whose ground truth is not available to compute segmentation accuracy (DSC) #### Observation For validation of the RCA method, the predicted DSC and the real DSC were compared and the correlation between the 2 was calculated. For all calculations 3 types of methods of segmentation were used and 3 slightly different types methods for RCA were used for comparison. The predicted DSC and real DSC were highly correlated for most of the cases. Here's a snap of the results that they obtained ![](http://i.imgur.com/2ra0wQm.png) 
**Summary** Representation (or feature) learning with unsupervised learning has yet really to yield the type of results that many believe to be achievable. For example, we’d like to unleash an unsupervised learning algorithm on all web images and then obtain a representation that captures the various factors of variation we know to be present (e.g. objects and people). One popular approach for this is to train a model that assumes a highlevel vector representation with independent components. However, despite a large body of literature on such models by now, such socalled disentangling of these factors of variation still seems beyond our reach. In this short paper, the authors propose an alternative to this approach. They propose that disentangling might be achievable by learning a representation whose dimensions are each separately **controllable**, i.e. that each have an associated policy which changes the value of that dimension **while letting other dimensions fixed**. Specifically, the authors propose to minimize the following objective: $\mathop{\mathbb{E}}_s\left[\frac{1}{2}sg(f(s))^2_2 \right]  \lambda \sum_k \mathbb{E}_{a,s}\left[\sum_a \pi_k(as) \log sel(s,a,k)\right]$ where  $s$ is an agent’s state (e.g. frame image) which encoder $f$ and decoder $g$ learn to autoencode  $k$ iterates over all dimensions of the representation space (output of encoder)  $a$ iterates over actions that the agent can take  $\pi_k(as)$ is the policy that is meant to control the $k^{\rm th}$ dimension of the representation space $f(s)_k$  $sel(s,a,k)$ is the selectivity of $f(s)_k$ relative to other dimensions in the representation, at state $s$: $sel(s,a,k) = \mathop{\mathbb{E}}_{s’\sim {\cal P}_{ss’}^a}\left[\frac{f_k(s’)f_k(s)}{\sum_{k’} f_{k’}(s’)f_{k’}(s) }\right]$ ${\cal P}_{ss’}^a$ is the conditional distribution over the next step state $s’$ given that you are at state $s$ and are taking action $a$ (i.e. the environment transition distribution). One can see that selectivity is higher when the change $f_k(s’)f_k(s)$ in dimension $k$ is much larger than the change $f_{k’}(s’)f_{k’}(s)$ in the other dimensions $k’$. A directed version of selectivity is also proposed (and I believe was used in the experiments), where the absolute value function is removed and $\log sel$ is replaced with $\log(1+sel)$ in the objective. The learning objective will thus encourage the discovery of a representation that is informative of the input (in that you can reconstruct it) and for which there exists policies that separately control these dimensions. Algorithm 1 in the paper describes a learning procedure for optimizing this objective. In brief, for every update, a state $s$ is sampled from which an update for the autoencoder part of the loss can be made. Then, iterating over each dimension $k$, REINFORCE is used to get a gradient estimate of the selectivity part of the loss, to update both the policy $\pi_k$ and the encoder $f$ by using the policy to reach a next state $s’$. **My two cents** I find this concept very appealing and thought provoking. Intuitively, I find the idea that valuable features are features which reflect an aspect of our environment that we can control more sensible and possibly less constraining than an assumption of independent features. It also has an interesting analogy of an infant learning about the world by interacting with it. The caveat is that unfortunately, this concept is currently fairly impractical, since it requires an interactive environment where an agent can perform actions, something we can’t easily have short of deploying a robot with sensors. Moreover, the proposed algorithm seems to assume that each state $s$ is sampled independently for each update, whereas a robot would observe a dependent stream of states. Accordingly, the experiments in this short paper are mostly “proof of concept”, on simplistic synthetic environments. Yet they do a good job at illustrating the idea. To me this means that there’s more interesting work worth doing in what seems to be a promising direction!
6 Comments

3D Image Classification is one of the utmost important requirements in healthcare. But due to huge size of such images and less amount of data, it has become difficult to train traditional endtoend models of classification such as AlexNet, Resnet, VGG. Major approaches considered for 3D images use slices of the image and not the image as whole. This approach at times leads to loss of information across third axis. The paper uses the complete 3D image for training, and since the number of images are less we use unsupervised techniques for weights initialization and fine tune the fully connected layers using supervised learning for classification. The major contribution of the paper is the extension of 2D Convolutional Autoencoders to 3D Domain and taking it further for MRI Classification which can be useful in numerous other applications. **Algorithm** 1. Train 3D Convolution Autoencoders using [CADDementia dataset](https://grandchallenge.org/site/caddementia/home/) 2. Extract the encoder layer from the trained autoencoders and add fully connected layers to it, followed by softmax 3. Train and test the classification model using [ADNI data](http://adni.loni.usc.edu/) **3D CAN Model** This model is 3D extension of [2D Convolutional Autoencoders](http://people.idsia.ch/~ciresan/data/icann2011.pdf) (2D CAN) ![3D Convolutional Autoencoders](http://i.imgur.com/66y52uQ.png) Each convolution in encoder is a 3D Convolution followed by ReLU and maxpool, while in decoder, each convolution is a maxunpool followed by 3D Full Convolution and ReLU. In the last decoder block, instead of ReLU, Sigmoid will be used. Here the weights were shared in the decoder and encoder as specified in [2D CAN](http://people.idsia.ch/~ciresan/data/icann2011.pdf) **3D Classification model** ![Classification of MRIs using 3D Convolutional Autoencoders](http://i.imgur.com/3hOeR8o.png) The weights of the classification model are initialized using the trained 3DCAN model as mentioned in algorithm. **Loss Function** Weighted negative log likelihood for Classification and Mean Squared Error for Unsupervised learning **Training algorithm** [Adadelta](https://arxiv.org/abs/1212.5701) **Results** Obtained 89.1% on 10fold cross validation on dataset of 270 patients for classification of MRI into Mild Cognitive Impairment (MCI), Normal Control (NC) and Alzheimer's disease (AD) **Image Source** All images are taken from the [paper](https://arxiv.org/abs/1607.00455) itself. 
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. Advantages: * Learning the identity becomes learning 0 which is simpler * Loss in information flow in the forward pass is not a problem anymore * No vanishing / exploding gradient * Identities don't have parameters to be learned ## Evaluation The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{4}$), momentum of 0.9. They use minibatches of size 128. * ImageNet ILSVRC 2015: 3.57% (ensemble) * CIFAR10: 6.43% * MS COCO: 59.0% mAp@0.5 (ensemble) * PASCAL VOC 2007: 85.6% mAp@0.5 * PASCAL VOC 2012: 83.8% mAp@0.5 ## See also * [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993) 
(See also a more thorough summary in [a LaTeX PDF][1].) This paper has some nice clear theory which bridges maximum likelihood (supervised) learning and standard reinforcement learning. It focuses on *structured prediction* tasks, where we want to learn to predict $p_\theta(y \mid x)$ where $y$ is some object with complex internal structure. We can agree on some deficiencies of maximum likelihood learning:  ML training fails to assign **partial credit**. Models are trained to maximize the likelihood of the groundtruth outputs in the dataset, and all other outputs are equally wrong. This is an increasingly important problem as the space of possible solutions grows.  ML training is potentially disconnected from **downstream task reward**. In machine translation, we usually want to optimize relatively complex metrics like BLEU or TER. Since these metrics are nondifferentiable, we have to settle for optimizing proxy losses that we hope are related to the metric of interest. Reinforcement learning offers an attractive alternative in theory. RL algorithms are designed to optimize nondifferentiable (even stochastic) reward functions, which sounds like just what we want. But RL algorithms have their own problems with this sort of structured output space:  Standard RL algorithms rely on samples from the model we are learning, $p_\theta(y \mid x)$. This becomes intractable when our output space is very complex (e.g. 80token sequences where each word is drawn from a vocabulary of 80,000 words).  The reward spaces for problems of interest are extremely sparse. Our metrics will assign 0 reward to most of the 80^80K possible outputs in the translation problem in the paper.  Vanilla RL doesn't take into account the groundtruth outputs available to us in structured prediction. This paper designs a solution which combines supervised learning with a reinforcement learninginspired smoothing method. Concretely, the authors design an **exponentiated payoff distribution** $q(y \mid y^*; \tau)$ which assigns high mass to highreward outputs $y$ and low mass elsewhere. This distribution is used to effectively smooth the loss function established by the groundtruth outputs in the supervised data. We end up optimizing the following objective: $$\mathcal L_\text{RML} =  \mathbb E_{x, y^* \sim \mathcal D}\left[ \sum_y q(y \mid y^*; \tau) \log p_\theta(y \mid x) \right]$$ This optimization depends on samples from our dataset $\mathcal D$ and, more importantly, the stationary payoff distribution $q$. This contrasts strongly with standard RL training, where the objective depends on samples from the nonstationary model distribution $p_\theta$. To make that clear, we can rewrite the above with another expectation: $$\mathcal L_\text{RML} =  \mathbb E_{x, y^* \sim \mathcal D, y \sim q(y \mid y^*; \tau)}\left[ \log p_\theta(y \mid x) \right]$$ ### Model details If you're interested in the lowlevel details, I wrote up the gist of the math in [this PDF][1]. ### Analysis #### Relationship to label smoothing This training approach is mathematically equivalent to label smoothing, applied here to structured output problems. In nextword prediction language modeling, a popular trick involves smoothing the target distributions by combining the groundtruth output with some simple base model, e.g. a unigram word frequency distribution. (This just means we take a weighted sum of the onehot vector from our supervised data and a normalized frequency vector calculated on some corpus.) Mathematically, the cross entropy with label smoothing is $$\mathcal L_\text{MLsmooth} =  \mathbb E_{x, y^* \sim \mathcal D} \left[ \sum_y p_\text{smooth}(y; y^*) \log p_\theta(y \mid x) \right]$$ (The equation above leaves out a constant entropy term.) The gradient of this objective looks exactly the same as the rewardaugmented ML gradient from the paper: $$\nabla_\theta \mathcal L_\text{MLsmooth} = \mathbb E_{x, y^* \sim \mathcal D, y \sim p_\text{smooth}} \left[ \log p_\theta(y \mid x) \right]$$ So rewardaugmented likelihood is equivalent to label smoothing, where our smoothing distribution is logproportional to our downstream reward function. #### Relationship to distillation Optimizing the rewardaugmented maximum likelihood is equivalent to minimizing the KL divergence $$D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x))$$ This divergence reaches zero iff $q = p$. We can say, then, that the effect of optimizing on $\mathcal L_\text{RML}$ is to **distill** the reward function (which parameterizes $q$) into the model parameters $\theta$ (which parameterize $p_\theta$). It's exciting to think about other sorts of more complex models that we might be able to distill in this framework. The unfortunate (?) restriction is that the "source" model of the distillation ($q$ in this paper) must admit to efficient sampling. #### Relationship to adversarial training We can also view rewardaugmented maximum likelihood training as a data augmentation technique: it synthesizes new "partially correct" examples using the reward function as a guide. We then train on all of the original and synthesized data, again weighting the gradients based on the reward function. Adversarial training is a similar data augmentation technique which generates examples that force the model to be robust to changes in its input space (robust to changes of $x$). Both adversarial training and the RML objective encourage the model to be robust "near" the groundtruth supervised data. A highlevel comparison:  Adversarial training can be seen as data augmentation in the input space; RML training performs data augmentation in the output space.  Adversarial training is a **modelbased data augmentation**: the samples are generated from a process that depends on the current parameters during training. RML training performs **databased augmentation**, which could in theory be done independent of the actual training process.  Thanks to Andrej Karpathy, Alec Radford, and Tim Salimans for interesting discussion which contributed to this summary. [1]: https://drive.google.com/file/d/0B3Rdm_P3VbRDVUQ4SVBRYW82dU0/view 
This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property on the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll note its inverse as $f = g^{1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$: $p_X(x) = p_Z(z) \left{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right^{1}$ Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the Jacobian of the function $g$, we can also write: $p_X(x) = p_Z(f(x)) \left{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right$ where we've replaced $z$ by its deterministically inferred value $f(x)$ from $x$. So, the core of the proposed model is in proposing a design for bijective functions $g$ (actually, they design its inverse $f$, from which $g$ can be derived by inversion), that have the properties of being easily invertible and having an easytocompute determinant of Jacobian. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties and allows to construct highly nonlinear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible to both compute $p_X(x)$ tractably and to sample from that density, by first samples $z\sim p_Z$ and then computing $x=g(z)$. The building blocks for constructing $f$ are the following: **Coupling layers**: This is perhaps the most important piece. It simply computes as its output $b\odot x + (1b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$) it linearly transforms them as $x_i * \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are nonlinear transformations (i.e. the output of neural networks) that only have access to the masked input (i.e. the nontransformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1b_i) l(b\odot x)_i)$. See Section 3.3 for more details on that. **Alternating masks**: One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still a bijection and is invertible (since each layer is individually a bijection) and has a tractable determinant for its Jacobian (since the Jacobian of the composition of functions is simply the multiplication of each function's Jacobian matrix, and the determinant of the product of square matrices is the product of the determinant of each matrix). This is also true, even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layer, by masking a different subset of (half of) the dimensions. For images, they propose using masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because then after at least 2 layers, all dimensions have been transformed at least once. **Squeezing operations**: Squeezing operations corresponds to a reorganization of a 2D spatial layout of dimensions into 4 sets of features maps with spatial resolutions reduced by half (see Figure 3). This allows to expose multiple scales of resolutions to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose to use a per channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration. Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channelwise masks. Since the output of each layerschunk must technically be of the same size as the input image, this could become expensive in terms of computations and space when using a lot of layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each layerschunk, adding another chunk of layers only on the other half. This is illustrated in Figure 4b. Experiments on CIFAR10, and 32x32 and 64x64 versions of ImageNet show that the proposed model (coined the realvalued nonvolume preserving or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN. **My Two Cents** The proposed approach is quite unique and thought provoking. Most interestingly, it is the only powerful generative model I know that combines A) a tractable likelihood, B) an efficient / onepass sampling procedure and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful! I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as realvalued observations. First, it implies that its estimate of bits/dimensions is an upper bound on what it could be if the uniform subpixel noise was integrated out (see Equations 345 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a nonlinear transformation (${\rm logit}(\alpha + (1\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, the Real NVP might be at a disadvantage. I'm also curious to know how easy it would be to do conditional inference with the Real NVP. One could imagine doing approximate MAP conditional inference, by clamping the observed dimensions and doing gradient descent on the loglikelihood with respect to the value of remaining dimensions. This could be interesting for image completion, or for structured output prediction with realvalued outputs in general. I also wonder how expensive that would be. In all cases, I'm looking forward to saying interesting applications and variations of this model in the future! 
This very new paper, is currently receiving quite a bit of attention by the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/). The paper describes a new training approach, which solves the two major practical problems with current GAN training: 1) The training process comes with a meaningful loss. This can be used as a (soft) performance metric and will help debugging, tune parameters and so on. 2) The training process does not suffer from all the instability problems. In particular the paper reduces mode collapse significantly. On top of that, the paper comes with quite a bit mathematical theory, explaining why there approach works and other approachs have failed. This paper is a must read for anyone interested in GANs. 
This paper describes how rank pooling, a very recent approach for pooling representations organized in a sequence $\\{{\bf v}_t\\}_{t=1}^T$, can be used in an endtoend trained neural network architecture. Rank pooling is an alternative to average and max pooling for sequences, but with the distinctive advantage of maintaining some order information from the sequence. Rank pooling first solves a regularized (linear) support vector regression (SVR) problem where the inputs are the vector representations ${\bf v}_t$ in the sequence and the target is the corresponding index $t$ of that representation in the sequence (see Equation 5). The output of rank pooling is then simply the linear regression parameters $\bf{u}$ learned for that sequence. Because of the way ${\bf u}$ is trained, we can see that ${\bf u}$ will capture order information, as successful training would imply that ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'} $ if $t < t'$. See [this paper](https://www.robots.ox.ac.uk/~vgg/rg/papers/videoDarwin.pdf) for more on rank pooling. While previous work has focused on using rank pooling on handdesigned and fixed representations, this paper proposes to use ConvNet features (pretrained on ImageNet) for the representation and backpropagate through rank pooling to finetune the ConvNet features. Since the output of rank pooling corresponds to an argmin operation, passing gradients through this operation is not as straightforward as for average or max pooling. However, it turns out that if the objective being minimized (in our case regularized SVR) is twice differentiable, gradients with respect to its argmin can be computed (see Lemmas 1 and 2). The authors derive the gradient for rank pooling (Equation 21). Finally, since its gradient requires inverting a matrix (corresponding to a hessian), the authors propose to either use an efficient procedure for computing it by exploiting properties of sums of rankone matrices (see Lemma 3) or to simply use an approximation based on using a diagonal hessian. In experiments on two small scale video activity recognition datasets (UCFSports and Hollywood2), the authors show that finetuning the ConvNet features significantly improves the performance of rank pooling and makes it superior to max and average pooling. **My two cents** This paper was eye opening for me, first because I did not realize that one could backpropagate through an operation corresponding to an argmin that doesn't have a closed form solution (though apparently this paper isn't the first to make that observation). Moreover, I did not know about rank pooling, which itself is a really thought provoking approach to pooling representations in a way that preserves some organizational information about the original representations. I wonder how sensitive the results are to the value of the regularization constant of the SVR problem. The authors mention some theoretical guaranties on the stability of the solution found by SVR in general, but intuitively I would expect that the regularization constant would play a large role in the stability. I'll be looking forward to any future attempts to increase the speed of rank pooling (or any similar method). Indeed, as the authors mention, it is currently too slow to be used on the larger video datasets that are currently available. Code for computing rank pooling (though not for computing its gradients) seems to be available [here](https://bitbucket.org/bfernando/videodarwin).
2 Comments

This paper presents a method to train a neural network to make predictions for *counterfactual* questions. In short, such questions are questions about what the result of an intervention would have been, had a different choice for the intervention been made (e.g. *Would this patient have lower blood sugar had she received a different medication?*). One approach to tackle this problem is to collect data of the form $(x_i, t_i, y_i^F)$ where $x_i$ describes a situation (e.g. a patient), $t_i$ describes the intervention made (in this paper $t_i$ is binary, e.g. $t_i = 1$ if a new treatment is used while $t_i = 0$ would correspond to using the current treatment) and $y_i^F$ is the factual outcome of the intervention $t_i$ for $x_i$. From this training data, a predictor $h(x,t)$ taking the pair $(x_i, t_i)$ as input and outputting a prediction for $y_i^F$ could be trained. From this predictor, one could imagine answering counterfactual questions by feeding $(x_i, 1t_i)$ (i.e. a description of the same situation $x_i$ but with the opposite intervention $1t_i$) to our predictor and comparing the prediction $h(x_i, 1t_i)$ with $y_i^F$. This would give us an estimate of the change in the outcome, had a different intervention been made, thus providing an answer to our counterfactual question. The authors point out that this scenario is related to that of domain adaptation (more specifically to the special case of covariate shift) in which the input training distribution (here represented by inputs $(x_i,t_i)$) is different from the distribution of inputs that will be fed at test time to our predictor (corresponding to the inputs $(x_i, 1t_i)$). If the choice of intervention $t_i$ is evenly spread and chosen independently from $x_i$, the distributions become the same. However, in observational studies, the choice of $t_i$ for some given $x_i$ is often not independent of $x_i$ and made according to some unknown policy. This is the situation of interest in this paper. Thus, the authors propose an approach inspired by the domain adaptation literature. Specifically, they propose to have the predictor $h(x,t)$ learn a representation of $x$ that is indiscriminate of the intervention $t$ (see Figure 2 for the proposed neural network architecture). Indeed, this is a notion that is [well established][1] in the domain adaptation literature and has been exploited previously using regularization terms based on [adversarial learning][2] and [maximum mean discrepancy][3]. In this paper, the authors used instead a regularization (noted in the paper as $disc(\Phi_{t=0},\Phi_ {t=1})$) based on the socalled discrepancy distance of [Mansour et al.][4], adapting its use to the case of a neural network. As an example, imagine that in our dataset, a new treatment ($t=1$) was much more frequently used than not ($t=0$) for men. Thus, for men, relatively insufficient evidence for counterfactual inference is expected to be found in our training dataset. Intuitively, we would thus want our predictor to not rely as much on that "feature" of patients when inferring the impact of the treatment. In addition to this term, the authors also propose incorporating an additional regularizer where the prediction $h(x_i,1t_i)$ on counterfactual inputs is pushed to be as close as possible to the target $y_{j}^F$ of the observation $x_j$ that is closest to $x_i$ **and** actually had the counterfactual intervention $t_j = 1t_i$. The paper first shows a bound relating the counterfactual generalization error to the discrepancy distance. Moreover, experiments simulating counterfactual inference tasks are presented, in which performance is measured by comparing the predicted treatment effects (as estimated by the difference between the observed effect $y_i^F$ for the observed treatment and the predicted effect $h(x_i, 1t_i)$ for the opposite treatment) with the real effect (known here because the data is simulated). The paper shows that the proposed approach using neural networks outperforms several baselines on this task. **My two cents** The connection with domain adaptation presented here is really clever and enlightening. This sounds like a very compelling approach to counterfactual inference, which can exploit a lot of previous work on domain adaptation. The paper mentions that selecting the hyperparameters (such as the regularization terms weights) in this scenario is not a trivial task. Indeed, measuring performance here requires knowing the true difference in intervention outcomes, which in practice usually cannot be known (e.g. two treatments usually cannot be given to the same patient once). In the paper, they somewhat "cheat" by using the ground truth difference in outcomes to measure outofsample performance, which the authors admit is unrealistic. Thus, an interesting avenue for future work would be to design practical hyperparameter selection procedures for this scenario. I wonder whether the *reverse crossvalidation* approach we used in our work on our adversarial approach to domain adaptation (see [Section 5.1.2][5]) could successfully be used here. Finally, I command the authors for presenting such a nicely written description of counterfactual inference problem setup in general, I really enjoyed it! [1]: https://papers.nips.cc/paper/2983analysisofrepresentationsfordomainadaptation.pdf [2]: http://arxiv.org/abs/1505.07818 [3]: http://ijcai.org/Proceedings/09/Papers/200.pdf [4]: http://www.cs.nyu.edu/~mohri/pub/nadap.pdf [5]: http://arxiv.org/pdf/1505.07818v4.pdf#page=16 