Welcome to ShortScience.org! 
[link]
This paper presents a recurrent neural network architecture in which some of the recurrent weights dynamically change during the forward pass, using a hebbianlike rule. They correspond to the matrices $A(t)$ in the figure below: ![Fast weights RNN figure](http://i.imgur.com/DCznSf4.png) These weights $A(t)$ are referred to as *fast weights*. Comparatively, the recurrent weights $W$ are referred to as slow weights, since they are only changing due to normal training and are otherwise kept constant at test time. More specifically, the proposed fast weights RNN compute a series of hidden states $h(t)$ over time steps $t$, but, unlike regular RNNs, the transition from $h(t)$ to $h(t+1)$ consists of multiple ($S$) recurrent layers $h_1(t+1), \dots, h_{S1}(t+1), h_S(t+1)$, defined as follows: $$h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))$$ where $f$ is an elementwise nonlinearity such as the ReLU activation. The next hidden state $h(t+1)$ is simply defined as the last "inner loop" hidden state $h_S(t+1)$, before moving to the next time step. As for the fast weights $A(t)$, they too change between time steps, using the hebbianlike rule: $$A(t+1) = \lambda A(t) + \eta h(t) h(t)^T$$ where $\lambda$ acts as a decay rate (to partially forget some of what's in the past) and $\eta$ as the fast weight's "learning rate" (not to be confused with the learning rate used during backprop). Thus, the role played by the fast weights is to rapidly adjust to the recent hidden states and remember the recent past. In fact, the authors show an explicit relation between these fast weights and memoryaugmented architectures that have recently been popular. Indeed, by recursively applying and expending the equation for the fast weights, one obtains $$A(t) = \eta \sum_{\tau = 1}^{\tau = t1}\lambda^{t\tau1} h(\tau) h(\tau)^T$$ *(note the difference with Equation 3 of the paper... I think there was a typo)* which implies that when computing the $A(t) h_s(t+1)$ term in the expression to go from $h_s(t+1)$ to $h_{s+1}(t+1)$, this term actually corresponds to $$A(t) h_s(t+1) = \eta \sum_{\tau =1}^{\tau = t1} \lambda^{t\tau1} h(\tau) (h(\tau)^T h_s(t+1))$$ i.e. $A(t) h_s(t+1)$ is a weighted sum of all previous hidden states $h(\tau)$, with each hidden states weighted by an "attention weight" $h(\tau)^T h_s(t+1)$. The difference with many recent memoryaugmented architectures is thus that the attention weights aren't computed using a softmax nonlinearity. Experimentally, they find it beneficial to use [layer normalization](https://arxiv.org/abs/1607.06450). Good values for $\eta$ and $\lambda$ seem to be 0.5 and 0.9 respectively. I'm not 100% sure, but I also understand that using $S=1$, i.e. using the fast weights only once per time steps, was usually found to be optimal. Also see Figure 3 for the architecture used on the image classification datasets, which is slightly more involved. The authors present a series 4 experiments, comparing with regular RNNs (IRNNs, which are RNNs with ReLU units and whose recurrent weights are initialized to a scaled identity matrix) and LSTMs (as well as an associative LSTM for a synthetic associative retrieval task and ConvNets for the two image datasets). Generally, experiments illustrate that the fast weights RNN tends to train faster (in number of updates) and better than the other recurrent architectures. Surprisingly, the fast weights RNN can even be competitive with a ConvNet on the two image classification benchmarks, where the RNN traverses glimpses from the image using a fixed policy. **My two cents** This is a very thought provoking paper which, based on the comparison with LSTMs, suggests that fast weights RNNs might be a very good alternative. I'd be quite curious to see what would happen if one was to replace LSTMs with them in the myriad of papers using LSTMs (e.g. all the Seq2Seq work). Intuitively, LSTMs seem to be able to do more than just attending to the recent past. But, for a given task, if one was to observe that fast weights RNNs are competitive to LSTMs, it would suggests that the LSTM isn't doing something that much more complex. So it would be interesting to determine what are the tasks where the extra capacity of an LSTM is actually valuable and exploitable. Hopefully the authors will release some code, to facilitate this exploration. The discussion at the end of Section 3 on how exploiting the "memory augmented" view of fast weights is useful to allow the use of minibatches is interesting. However, it also suggests that computations in the fast weights RNN scales quadratically with the sequence size (since in this view, the RNN technically must attend to all previous hidden states, since the beginning of the sequence). This is something to keep in mind, if one was to consider applying this to very long sequences (i.e. much longer than the hidden state dimensionality). Also, I don't quite get the argument that the "memory augmented" view of fast weights is more amenable to minibatch training. I understand that having an explicit weight matrix $A(t)$ for each minibatch sequence complicates things. However, in the memory augmented view, we also have a "memory matrix" that is different for each sequence, and yet we can handle that fine. The problem I can imagine is that storing a *sequence of arbitrary weight matrices* for each sequence might be storage demanding (and thus perhaps make it impossible to store a forward/backward pass for more than one sequence at a time), while the implicit memory matrix only requires appending a new row at each time step. Perhaps the argument to be made here is more that there's already minibatch compatible code out there for dealing with the use of a memory matrix of stored previous memory states. This work strikes some (partial) resemblance to other recent work, which may serve as food for thought here. The use of possibly multiple computation layers between time steps reminds me of [Adaptive Computation Time (ACT) RNN]( http://www.shortscience.org/paper?bibtexKey=journals/corr/Graves16). Also, expressing a backpropable architecture that involves updates to weights (here, hebbianlike updates) reminds me of recent work that does backprop through the updates of a gradient descent procedure (for instance as in [this work]( http://www.shortscience.org/paper?bibtexKey=conf/icml/MaclaurinDA15)). Finally, while I was familiar with the notion of fast weights from the work on [Using Fast Weights to Improve Persistent Contrastive Divergence](http://people.ee.duke.edu/~lcarin/FastGibbsMixing.pdf), I didn't realize that this concept dated as far back as the late 80s. So, for young researchers out there looking for inspiration for research ideas, this paper confirms that looking at the older neural network literature for inspiration is probably a very good strategy :) To sum up, this is really nice work, and I'm looking forward to the NIPS 2016 oral presentation of it! 
[link]
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the 3 following ideas: **1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better. **2) Separate BN for the "hiddens to hiddens preactivation and for the "inputs to hiddens" preactivation.** In other words, 2 separate BN operators are applied on each contributions to the preactivation, before summing and passing through the tanh and sigmoid nonlinearities. **3) Use of largest timestep BN statistics for longer testtime sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses pertimestep mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how do deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the preactivation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics. Among these ideas, I believe the most impactful idea is 1). The papers mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks. Experiments on 4 datasets confirms the method's success. **My two cents** This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice. 
[link]
How can we learn causal relationships that explain data? We can learn from nonstationary distributions. If we experiment with different factorizations of relationships between variables we can observe which ones provide better sample complexity when adapting to distributional shift and therefore are likely to be causal. If we consider the variables A and B we can factor them in two ways: $P(A,B) = P(A)P(BA)$ representing a causal graph like $A\rightarrow B$ $P(A,B) = P(AB)P(B)$ representing a causal graph like $A \leftarrow B$ The idea is if we train a model with one of these structures; when adapting to a new shifted distribution of data it will take longer to adapt if the model does not have the correct inductive bias. For example let's say that the true relationship is $A$=Raining causes $B$=Open Umbrella (and not viceversa). Changing the marginal probability of Raining (say because the weather changed) does not change the mechanism that relates $A$ and $B$ (captured by $P(BA)$), but will have an impact on the marginal $P(B)$. So after this distributional shift the function that modeled $P(BA)$ will not need to change because the relationship is the same. Only the function that modeled $P(A)$ will need to change. Under the incorrect factorization $P(B)P(AB)$, adaptation to the change will be slow because both $P(B)$ and $P(AB)$ need to be modified to account for the change in $P(A)$ (due to Bayes rule). Here a difference in sample complexity can be observed when modeling the joint of the shifted distribution. $B\rightarrow A$ takes longer to adapt: https://i.imgur.com/B9FEmA7.png Here the idea is that sample complexity when adapting to a new distribution of data is a heuristic to inform us which causal graph inductive bias is correct. Experimentally this works and they also observe that when models have more capacity it seems that the difference between the models grows. This summary was written with the help of Yoshua Bengio. 
[link]
This paper proposes a variant of Neural Turing Machine (NTM) for metalearning or "learning to learn", in the specific context of fewshot learning (i.e. learning from few examples). Specifically, the proposed model is trained to ingest as input a training set of examples and improve its output predictions as examples are processed, in a purely feedforward way. This is a form of metalearning because the model is trained so that its forward pass effectively executes a form of "learning" from the examples it is fed as input. During training, the model is fed multiples sequences (referred to as episodes) of labeled examples $({\bf x}_1, {\rm null}), ({\bf x}_2, y_1), \dots, ({\bf x}_T, y_{T1})$, where $T$ is the size of the episode. For instance, if the model is trained to learn how to do 5class classification from 10 examples per class, $T$ would be $5 \times 10 = 50$. Mainly, the paper presents experiments on the Omniglot dataset, which has 1623 classes. In these experiments, classes are separated into 1200 "training classes" and 423 "test classes", and each episode is generated by randomly selecting 5 classes (each assigned some arbitrary vector representation, e.g. a onehot vector that is consistent within the episode, but not across episodes) and constructing a randomly ordered sequence of 50 examples from within the chosen 5 classes. Moreover, the correct label $y_t$ of a given input ${\bf x}_t$ is always provided only at the next time step, but the model is trained to be good at its prediction of the label of ${\bf x}_t$ at the current time step. This is akin to the scenario of online learning on a stream of examples, where the label of an example is revealed only once the model has made a prediction. The proposed NTM is different from the original NTM of Alex Graves, mostly in how it writes into its memory. The authors propose to focus writing to either the least recently used memory location or the most recently used memory location. Moreover, the least recently used memory location is reset to zero before every write (an operation that seems to be ignored when backpropagating gradients). Intuitively, the proposed NTM should learn a strategy by which, given a new input, it looks into its memory for information from other examples earlier in the episode (perhaps similarly to what a nearest neighbor classifier would do) to predict the class of the new input. The paper presents experiments in learning to do multiclass classification on the Omniglot dataset and regression based on functions synthetically generated by a GP. The highlights are that: 1. The proposed model performs much better than an LSTM and better than an NTM with the original write mechanism of Alex Graves (for classification). 2. The proposed model even performs better than a 1st nearest neighbor classifier. 3. The proposed model is even shown to outperform human performance, for the 5class scenario. 4. The proposed model has decent performance on the regression task, compared to GP predictions using the groundtruth kernel. **My two cents** This is probably one of my favorite ICML 2016 papers. I really think metalearning is a problem that deserves more attention, and this paper presents both an interesting proposal for how to do it and an interesting empirical investigation of it. Much like previous work [\[1\]][1] [\[2\]][2], learning is based on automatically generating a metalearning training set. This is clever I think, since a very large number of such "metalearning" examples (the episodes) can be constructed, thus transforming what is normally a "small data problem" (few shot learning) into a "big data problem", for which deep learning is more effective. I'm particularly impressed by how the proposed model outperforms a 1nearest neighbor classifier. That said, the proposed NTM actually performs 4 reads at each time step, which suggests that a fairer comparison might be with a 4nearest neighbor classifier. I do wonder how this baseline would compare. I'm also impressed with the observation that the proposed model surpassed humans. The paper also proposes to use 5letter words to describe classes, instead of onehot vectors. The motivation is that this should make it easier for the model to scale to much more than 5 classes. However, I don't entirely follow the logic as to why onehot vectors are problematic. In fact, I would think that arbitrarily assigning 5letter words to classes would instead imply some similarity between classes that share letters that is arbitrary and doesn't reflect true class similarity. Also, while I find it encouraging that the performance for regression of the proposed model is decent, I'm curious about how it would compare with a GP approach that incrementally learns the kernel's hyperparameter (instead of using the groundtruth values, which makes this baseline unrealistically strong). Finally, I'm still not 100% sure how exactly the NTM is able to implement the type of feedforward inference I'd expect to be required. I would expect it to learn a memory representation of examples that combines information from the input vector ${\bf x}_t$ *and* its label $y_t$. However, since the label of an input is presented at the following time step in an episode, it is not intuitive to me then how the read/write mechanisms are able to deal with this misalignment. My only guess is that since the controller is an LSTM, then it can somehow remember ${\bf x}_t$ until it gets $y_t$ and appropriately include the combined information into the memory. This could be supported by the fact that using a nonrecurrent feedforward controller is much worse than using an LSTM controller. But I'm not 100% sure of this either. All the above being said, this is still a really great paper, which I hope will help stimulate more research on metalearning. Hopefully code for this paper can eventually be released, which would help in popularizing the topic. [1]: http://snowedin.net/tmp/Hochreiter2001.pdf [2]: http://www.thespermwhale.com/jaseweston/ram/papers/paper_16.pdf 
[link]
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as $Y = X + F(X)$ where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately on both terms of a layer's equation: $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$ where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term. The paper shows that this form of dropout is at least as good or superior as other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last. In addition to this observation, I also note the following empirical observations: 1. At test time, averaging the output layers of multiple dropout mask samples (referenced to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard. 2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) and with fewer layers (e.g. 32) than the orignal ResNet work with thin but very deep (more than 1000 layers) ResNets. This would confirm a similar observation from [this paper][2]. Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice. [1]: http://arxiv.org/abs/1603.09382v1 [2]: https://arxiv.org/abs/1605.07146 
[link]
This paper derives an algorithm for passing gradients through a sample from a mixture of Gaussians. While the reparameterization trick allows to get the gradients with respect to the Gaussian means and covariances, the same trick cannot be invoked for the mixing proportions parameters (essentially because they are the parameters of a multinomial discrete distribution over the Gaussian components, and the reparameterization trick doesn't extend to discrete distributions). One can think of the derivation as proceeding in 3 steps: 1. Deriving an estimator for gradients a sample from a 1dimensional density $f(x)$ that is such that $f(x)$ is differentiable and its cumulative distribution function (CDF) $F(x)$ is tractable: $\frac{\partial \hat{x}}{\partial \theta} =  \frac{1}{f(\hat{x})}\int_{t=\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta} dt$ where $\hat{x}$ is a sample from density $f(x)$ and $\theta$ is any parameter of $f(x)$ (the above is a simplified version of Equation 6). This is probably the most important result of the paper, and is based on a really clever use of the general form of the Leibniz integral rule. 2. Noticing that one can sample from a $D$dimensional Gaussian mixture by decomposing it with the product rule $f({\bf x}) = \prod_{d=1}^D f(x_d{\bf x}_{<d})$ and using ancestral sampling, where each $f(x_d{\bf x}_{<d})$ are themselves 1dimensional mixtures (i.e. with differentiable densities and tractable CDFs) 3. Using the 1dimensional gradient estimator (of Equation 6) and the chain rule to backpropagate through the ancestral sampling procedure. This requires computing the integral in the expression for $\frac{\partial \hat{x}}{\partial \theta}$ above, where $f(x)$ is one of the 1D conditional Gaussian mixtures and $\theta$ is a mixing proportion parameter $\pi_j$. As it turns out, this integral has an analytical form (see Equation 22). **My two cents** This is a really surprising and neat result. The author mentions it could be applicable to variational autoencoders (to support posteriors that are mixtures of Gaussians), and I'm really looking forward to read about whether that can be successfully done in practice. The paper provides the derivation only for mixtures of Gaussians with diagonal covariance matrices. It is mentioned that extending to nondiagonal covariances is doable. That said, ancestral sampling with nondiagonal covariances would become more computationally expensive, since the conditionals under each Gaussian involves a matrix inverse. Beyond the case of Gaussian mixtures, Equation 6 is super interesting in itself as its application could go beyond that case. This is probably why the paper also derived a samplingbased estimator for Equation 6, in Equation 9. However, that estimator might be inefficient, since it involves sampling from Equation 10 with rejection, and it might take a lot of time to get an accepted sample if $\hat{x}$ is very small. Also, a good estimate of Equation 6 might require *multiple* samples from Equation 10. Finally, while I couldn't find any obvious problem with the mathematical derivation, I'd be curious to see whether using the same approach to derive a gradient on one of the Gaussian mean or standard deviation parameters gave a gradient that is consistent with what the reparameterization trick provides.
3 Comments

[link]
This paper presents an unsupervised generative model, based on the variational autoencoder framework, but where the encoder is a recurrent neural network that sequentially infers the identity, pose and number of objects in some input scene (2D image or 3D scene). In short, this is done by extending the DRAW model to incorporate discrete latent variables that determine whether an additional object is present or not. Since the reparametrization trick cannot be used for discrete variables, the authors estimate the gradient through the sampling operation using a likelihood ratio estimator. Another innovation over DRAW is the application to 3D scenes, in which the decoder is a graphics renderer. Since it is not possible to backpropagate through the renderer, gradients are estimated using finitedifference estimates (which require going through the renderer several times). Experiments are presented where the evaluation is focused on the ability of the model to detect and count the number of objects in the image or scene. **My two cents** This is a nice, natural extension of DRAW. I'm particularly impressed by the results for the 3D scene setting. Despite the fact that setup is obviously synthetic and simplistic, I really surprised that estimating the decoder gradients using finitedifferences worked at all. It's also interesting to see that the proposed model does surprisingly well compared to a CNN supervised approach that directly predicts the objects identity and pose. Quite cool! To see the model in action, see [this cute video][1]. [1]: https://www.youtube.com/watch?v=4tc84kKdpY4 
[link]
**Summary** Representation (or feature) learning with unsupervised learning has yet really to yield the type of results that many believe to be achievable. For example, we’d like to unleash an unsupervised learning algorithm on all web images and then obtain a representation that captures the various factors of variation we know to be present (e.g. objects and people). One popular approach for this is to train a model that assumes a highlevel vector representation with independent components. However, despite a large body of literature on such models by now, such socalled disentangling of these factors of variation still seems beyond our reach. In this short paper, the authors propose an alternative to this approach. They propose that disentangling might be achievable by learning a representation whose dimensions are each separately **controllable**, i.e. that each have an associated policy which changes the value of that dimension **while letting other dimensions fixed**. Specifically, the authors propose to minimize the following objective: $\mathop{\mathbb{E}}_s\left[\frac{1}{2}sg(f(s))^2_2 \right]  \lambda \sum_k \mathbb{E}_{a,s}\left[\sum_a \pi_k(as) \log sel(s,a,k)\right]$ where  $s$ is an agent’s state (e.g. frame image) which encoder $f$ and decoder $g$ learn to autoencode  $k$ iterates over all dimensions of the representation space (output of encoder)  $a$ iterates over actions that the agent can take  $\pi_k(as)$ is the policy that is meant to control the $k^{\rm th}$ dimension of the representation space $f(s)_k$  $sel(s,a,k)$ is the selectivity of $f(s)_k$ relative to other dimensions in the representation, at state $s$: $sel(s,a,k) = \mathop{\mathbb{E}}_{s’\sim {\cal P}_{ss’}^a}\left[\frac{f_k(s’)f_k(s)}{\sum_{k’} f_{k’}(s’)f_{k’}(s) }\right]$ ${\cal P}_{ss’}^a$ is the conditional distribution over the next step state $s’$ given that you are at state $s$ and are taking action $a$ (i.e. the environment transition distribution). One can see that selectivity is higher when the change $f_k(s’)f_k(s)$ in dimension $k$ is much larger than the change $f_{k’}(s’)f_{k’}(s)$ in the other dimensions $k’$. A directed version of selectivity is also proposed (and I believe was used in the experiments), where the absolute value function is removed and $\log sel$ is replaced with $\log(1+sel)$ in the objective. The learning objective will thus encourage the discovery of a representation that is informative of the input (in that you can reconstruct it) and for which there exists policies that separately control these dimensions. Algorithm 1 in the paper describes a learning procedure for optimizing this objective. In brief, for every update, a state $s$ is sampled from which an update for the autoencoder part of the loss can be made. Then, iterating over each dimension $k$, REINFORCE is used to get a gradient estimate of the selectivity part of the loss, to update both the policy $\pi_k$ and the encoder $f$ by using the policy to reach a next state $s’$. **My two cents** I find this concept very appealing and thought provoking. Intuitively, I find the idea that valuable features are features which reflect an aspect of our environment that we can control more sensible and possibly less constraining than an assumption of independent features. It also has an interesting analogy of an infant learning about the world by interacting with it. The caveat is that unfortunately, this concept is currently fairly impractical, since it requires an interactive environment where an agent can perform actions, something we can’t easily have short of deploying a robot with sensors. Moreover, the proposed algorithm seems to assume that each state $s$ is sampled independently for each update, whereas a robot would observe a dependent stream of states. Accordingly, the experiments in this short paper are mostly “proof of concept”, on simplistic synthetic environments. Yet they do a good job at illustrating the idea. To me this means that there’s more interesting work worth doing in what seems to be a promising direction!
6 Comments

[link]
This paper can be thought as proposing a variational autoencoder applied to a form of metalearning, i.e. where the input is not a single input but a dataset of inputs. For this, in addition to having to learn an approximate inference network over the latent variable $z_i$ for each input $x_i$ in an input dataset $D$, approximate inference is also learned over a latent variable $c$ that is global to the dataset $D$. By using Gaussian distributions for $z_i$ and $c$, the reparametrization trick can be used to train the variational autoencoder. The generative model factorizes as $p(D=(x_1,\dots,x_N), (z_1,\dots,z_N), c) = p(c) \prod_i p(z_ic) p(x_iz_i,c)$ and learning is based on the following variational posterior decomposition: $q((z_1,\dots,z_N), cD=(x_1,\dots,x_N)) = q(cD) \prod_i q(z_ix_i,c)$. Moreover, latent variable $z_i$ is decomposed into multiple ($L$) layers $z_i = (z_{i,1}, \dots, z_{i,L})$. Each layer in the generative model is directly connected to the input. The layers are generated from $z_{i,L}$ to $z_{i,1}$, each layer being conditioned on the previous (see Figure 1 *Right* for the graphical model), with the approximate posterior following a similar decomposition. The architecture for the approximate inference network $q(cD)$ first maps all inputs $x_i\in D$ into a vector representation, then performs mean pooling of these representations to obtain a single vector, followed by a few more layers to produce the parameters of the Gaussian distribution over $c$. Training is performed by stochastic gradient descent, over minibatches of datasets (i.e. multiple sets $D$). The model has multiple applications, explored in the experiments. One is of summarizing a dataset $D$ into a smaller subset $S\in D$. This is done by initializing $S\leftarrow D$ and greedily removing elements of $S$, each time minimizing the KL divergence between $q(cD)$ and $q(cS)$ (see the experiments on a synthetic Spatial MNIST problem of section 5.3). Another application is fewshot classification, where very few examples of a number of classes are given, and a new test example $x'$ must be assigned to one of these classes. Classification is performed by treating the small set of examples of each class $k$ as its own dataset $D_k$. Then, test example $x$ is classified into class $k$ for which the KL divergence between $q(cx')$ and $q(cD_k)$ is smallest. Positive results are reported when training on OMNIGLOT classes and testing on either the MNIST classes or unseen OMNIGLOT datasets, when compared to a 1nearest neighbor classifier based on the raw input or on a representation learned by a regular autoencoder. Finally, another application is that of generating new samples from an input dataset of examples. The approximate posterior is used to compute $q(cD)$. Then, $c$ is assigned to its posterior mean, from which a value for the hidden layers $z$ and finally a sample $x$ can be generated. It is shown that this procedure produces convincing samples that are visually similar from those in the input set $D$. **My two cents** Another really nice example of deep learning applied to a form of metalearning, i.e. learning a model that is trained to take *new* datasets as input and generalize even if confronted to datasets coming from an unseen data distribution. I'm particularly impressed by the many tasks explored successfully with the same approach: fewshot classification and generative sampling, as well as a form of summarization (though this last probably isn't really metalearning). Overall, the approach is quite elegant and appealing. The very simple, synthetic experiments of section 5.1 and 5.2 are also interesting. Section 5.2 presents the notion of a *priorinterpolation layer*, which is well motivated but seems to be used only in that section. I wonder how important it is, outside of the specific case of section 5.2. Overall, very excited by this work, which further explores the theme of metalearning in an interesting way. 
[link]
Originally posted on my Github [papernotes](https://github.com/karpathy/papernotes/blob/master/matching_networks.md) repo. # Matching Networks for One Shot Learning By DeepMind crew: **Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra** This is a paper on **oneshot** learning, where we'd like to learn a class based on very few (or indeed, 1) training examples. E.g. it suffices to show a child a single giraffe, not a few hundred thousands before it can recognize more giraffes. This paper falls into a category of *"duh of course"* kind of paper, something very interesting, powerful, but somehow obvious only in retrospect. I like it. Suppose you're given a single example of some class and would like to label it in test images.  **Observation 1**: a standard approach might be to train an Exemplar SVM for this one (or few) examples vs. all the other training examples  i.e. a linear classifier. But this requires optimization.  **Observation 2:** known nonparameteric alternatives (e.g. kNearest Neighbor) don't suffer from this problem. E.g. I could immediately use a Nearest Neighbor to classify the new class without having to do any optimization whatsoever. However, NN is gross because it depends on an (arbitrarilychosen) metric, e.g. L2 distance. Ew.  **Core idea**: lets train a fully endtoend nearest neighbor classifer!![Screen Shot 20160807 at 10.08.44 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2010.08.44%20PM.png) ## The training protocol As the authors amusingly point out in the conclusion (and this is the *duh of course* part), *"oneshot learning is much easier if you train the network to do oneshot learning"*. Therefore, we want the testtime protocol (given N novel classes with only k examples each (e.g. k = 1 or 5), predict new instances to one of N classes) to exactly match the training time protocol. To create each "episode" of training from a dataset of examples then: 1. Sample a task T from the training data, e.g. select 5 labels, and up to 5 examples per label (i.e. 525 examples). 2. To form one episode sample a label set L (e.g. {cats, dogs}) and then use L to sample the support set S and a batch B of examples to evaluate loss on. The idea on high level is clear but the writing here is a bit unclear on details, of exactly how the sampling is done. ## The model I find the paper's model description slightly wordy and unclear, but basically we're building a **differentiable nearest neighbor++**. The output \hat{y} for a test example \hat{x} is computed very similar to what you might see in Nearest Neighbors:![Screen Shot 20160807 at 11.14.26 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2011.14.26%20PM.png) where **a** acts as a kernel, computing the extent to which \hat{x} is similar to a training example x_i, and then the labels from the training examples (y_i) are weightblended together accordingly. The paper doesn't mention this but I assume for classification y_i would presumbly be onehot vectors. Now, we're going to embed both the training examples x_i and the test example \hat{x}, and we'll interpret their inner products (or here a cosine similarity) as the "match", and pass that through a softmax to get normalized mixing weights so they add up to 1. No surprises here, this is quite natural: ![Screen Shot 20160807 at 11.20.29 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2011.20.29%20PM.png) Here **c()** is cosine distance, which I presume is implemented by normalizing the two input vectors to have unit L2 norm and taking a dot product. I assume the authors tried skipping the normalization too and it did worse? Anyway, now all that's left to define is the function **f** (i.e. how do we embed the test example into a vector) and the function **g** (i.e. how do we embed each training example into a vector?). **Embedding the training examples.** This (the function **g**) is a bidirectional LSTM over the examples: ![Screen Shot 20160807 at 11.57.10 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160807%20at%2011.57.10%20PM.png) i.e. encoding of i'th example x_i is a function of its "raw" embedding g'(x_i) and the embedding of its friends, communicated through the bidirectional network's hidden states. i.e. each training example is a function of not just itself but all of its friends in the set. This is part of the ++ above, because in a normal nearest neighbor you wouldn't change the representation of an example as a function of the other data points in the training set. It's odd that the **order** is not mentioned, I assume it's random? This is a bit gross because order matters to a bidirectional LSTM; you'd get different embeddings if you permute the examples. **Embedding the test example.** This (the function **f**) is a an LSTM that processes for a fixed amount (K time steps) and at each point also *attends* over the examples in the training set. The encoding is the last hidden state of the LSTM. Again, this way we're allowing the network to change its encoding of the test example as a function of the training examples. Nifty: ![Screen Shot 20160808 at 12.11.15 AM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160808%20at%2012.11.15%20AM.png) That looks scary at first but it's really just a vanilla LSTM with attention where the input at each time step is constant (f'(\hat{x}), an encoding of the test example all by itself) and the hidden state is a function of previous hidden state but also a concatenated readout vector **r**, which we obtain by attending over the encoded training examples (encoded with **g** from above). Oh and I assume there is a typo in equation (5), it should say r_k = … without the 1 on LHS. ## Experiments **Task**: Nway kshot learning task. i.e. we're given k (e.g. 1 or 5) labelled examples for N classes that we have not previously trained on and asked to classify new instances into he N classes. **Baselines:** an "obvious" strategy of using a pretrained ConvNet and doing nearest neighbor based on the codes. An option of finetuning the network on the new examples as well (requires training and careful and strong regularization!). **MANN** of Santoro et al. [21]: Also a DeepMind paper, a fun NTMlike MetaLearning approach that is fed a sequence of examples and asked to predict their labels. **Siamese network** of Koch et al. [11]: A siamese network that takes two examples and predicts whether they are from the same class or not with logistic regression. A test example is labeled with a nearest neighbor: with the class it matches best according to the siamese net (requires iteration over all training examples one by one). Also, this approach is less endtoend than the one here because it requires the adhoc nearest neighbor matching, while here the *exact* end task is optimized for. It's beautiful. ### Omniglot experiments ### ![Screen Shot 20160808 at 10.21.45 AM](https://github.com/karpathy/papernotes/raw/master/img/matching_networks/Screen%20Shot%2020160808%20at%2010.21.45%20AM.png) Omniglot of [Lake et al. [14]](http://www.cs.toronto.edu/~rsalakhu/papers/LakeEtAl2015Science.pdf) is a MNISTlike scribbles dataset with 1623 characters with 20 examples each. Image encoder is a CNN with 4 modules of [3x3 CONV 64 filters, batchnorm, ReLU, 2x2 max pool]. The original image is claimed to be so resized from original 28x28 to 1x1x64, which doesn't make sense because factor of 2 downsampling 4 times is reduction of 16, and 28/16 is a noninteger >1. I'm assuming they use VALID convs? Results: ![Screen Shot 20160808 at 10.27.46 AM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/matching_networks/Screen%20Shot%2020160808%20at%2010.27.46%20AM.png) Matching nets do best. Fully Conditional Embeddings (FCE) by which I mean they the "Full Context Embeddings" of Section 2.1.2 instead are not used here, mentioned to not work much better. Finetuning helps a bit on baselines but not with Matching nets (weird). The comparisons in this table are somewhat confusing:  I can't find the MANN numbers of 82.8% and 94.9% in their paper [21]; not clear where they come from. E.g. for 5 classes and 5shot they seem to report 88.4% not 94.9% as seen here. I must be missing something.  I also can't find the numbers reported here in the Siamese Net [11] paper. As far as I can tell in their Table 2 they report oneshot accuracy, 20way classification to be 92.0, while here it is listed as 88.1%?  The results of Lake et al. [14] who proposed Omniglot are also missing from the table. If I'm understanding this correctly they report 95.2% on 1shot 20way, while matching nets here show 93.8%, and humans are estimated at 95.5%. That is, the results here appear weaker than those of Lake et al., but one should keep in mind that the method here is significantly more generic and does not make any assumptions about the existence of strokes, etc., and it's a simple, single fullydifferentiable blob of neural stuff. (skipping ImageNet/LM experiments as there are few surprises) ## Conclusions Good paper, effectively develops a differentiable nearest neighbor trained endtoend. It's something new, I like it! A few concerns:  A bidirectional LSTMs (not orderinvariant compute) is applied over sets of training examples to encode them. The authors don't talk about the order actually used, which presumably is random, or mention this potentially unsatisfying feature. This can be solved by using a recurrent attentional mechanism instead, as the authors are certainly aware of and as has been discussed at length in [ORDER MATTERS: SEQUENCE TO SEQUENCE FOR SETS](https://arxiv.org/abs/1511.06391), where Oriol is also the first author. I wish there was a comment on this point in the paper somewhere.  The approach also gets quite a bit slower as the number of training examples grow, but once this number is large one would presumable switch over to a parameteric approach.  It's also potentially concerning that during training the method uses a specific number of examples, e.g. 525, so this is the number of that must also be used at test time. What happens if we want the size of our training set to grow online? It appears that we need to retrain the network because the encoder LSTM for the training data is not "used to" seeing inputs of more examples? That is unless you fall back to iteratively subsampling the training data, doing multiple inference passes and averaging, or something like that. If we don't use FCE it can still be that the attention mechanism LSTM can still not be "used to" attending over many more examples, but it's not clear how much this matters. An interesting experiment would be to not use FCE and try to use 100 or 1000 training examples, while only training on up to 25 (with and fithout FCE). Discussion surrounding this point would be interesting.  Not clear what happened with the Omniglot experiments, with incorrect numbers for [11], [21], and the exclusion of Lake et al. [14] comparison.  A baseline that is missing would in my opinion also include training of an [Exemplar SVM](https://www.cs.cmu.edu/~tmalisie/projects/iccv11/), which is a much more powerful approach than encodewithacnnandnearestneighbor.
4 Comments
