[link]
Recently, DeepMind released a new paper showing strong performance on board game tasks using a mechanism similar to the Value Prediction Network one in this paper, which inspired me to go back and get a grounding in this earlier work. A goal of this paper is to design a modelbased RL approach that can scale to complex environment spaces, but can still be used to run simulations and do explicit planning. Traditional, modelbased RL has worked by learning a dynamics model of the environment  predicting the next observation state given the current one and an action, and then using that model of the world to learn values and plan with. In addition to the advantages of explicit planning, a hope is that modelbased systems generalize better to new environments, because they predict onestep changes in local dynamics in a way that can be more easily separated from longterm dynamics or reward patterns. However, a downside of MBRL is that it can be hard to train, especially when your observation space is highdimensional, and learning a straight model of your environment will lead to you learning details that aren't actually unimportant for planning or creating policies. The synthesis proposed by this paper is the Value Prediction Network. Rather than predicting observed state at the next step, it learns a transition model in latent space, and then learns to predict nextstep reward and future value from that latent space vector. Because it learns to encode latentspace state from observations, and also learns a transition model from one latent state to another, the model can be used for planning, by simulating multiple transitions between latent state. However, unlike a normal dynamics model, whose training signal comes from a loss against observational prediction, the signal for training both latent → reward/value/discount predictions, and latent → latent transitions comes from using this pipeline to predict reward values. This means that if an aspect of the environment isn't useful for predicting reward, it won't generally be encoded into latent state, meaning you don't waste model capacity predicting irrelevant detail. https://i.imgur.com/4bJylms.png Once this model exists, it can be used for generating a policy through a treesearch planning approach: simulating future trajectories and aggregating the predicted reward along those trajectories, and then taking the highestvalue one. The authors find that their model is able to do better than both modelfree and modelbased methods on the tasks they tested on. In particular, they find that it has many of the benefits of a model that predicts full observations, but that the Value Prediction Network learns more quickly, and is more robust to stochastic environments where there's an inherent ceiling on how well a nextstep observation prediction can work. My main question coming into this paper is: how is this different from simply a value estimator like those used in DQN or A2C, and my impression is that the difference comes from this model's ability to do explicit state simulation in latent space, and then predict a value off of the *latent* state, whereas a value network predicts value from observational state. 
[link]
In modern machine learning, gradient descent has diversified into a zoo of subtly distinct techniques, all designed, analytically, heuristically, or practically, to ease or accelerate our model’s path through multidimensional loss space. A solid contingent of these methods are Adaptive Gradient methods, which scale the size of gradient updates according to variously calculated historical averages or variances of the vector update, which has the effect of scaling down the updates along feature dimensions that have experienced large updates in the past. The intuition behind this is that we may want to effectively reduce the learning rate (by dividing by a larger number) along dimensions where there have been large or highly variable updates. These methods are commonly used because, as the name suggests, they update to the scale of your dataset and particular loss landscape, avoiding what might otherwise be a lengthy process of hyperparameter tuning. But this paper argues that, at least on a simplified problem, adaptive methods can reach overly simplified and overfit solutions that generalize to test data less well than a nonadaptive, more standard gradient descent method. The theoretical core of the paper is a proof showing limitations of the solution reached by adaptive gradient on a simple toy regression problem, on linearly separable data. It’s a little dodgy to try to recapitulate a mathematical proof in verbal form, but I’ll do my best, on the understanding that you should really read the fully thing to fully verify the logic. The goal of the proof is to characterize the solution weight vector learned by different optimization systems. In this simplified environment, a core informational unit of your equations is X^T(y), which (in a world where labels are either 1 or 1), goes through each feature, and for each feature, takes a dot product between that feature vector (across examples) and the label vector, which has the effect of adding up a positive sum of all the feature values attached to positive examples, and then subtracting out (because of the multiply by 1) all the feature values attached to positive examples. When this is summed, we get a per feature value that will be positive if positive values of the feature tend to indicate positive labels, and negative if the opposite is true, in each case with a magnitude relating to the strength of that relationship. The claim made by the paper, supported by a lot of variable transformations, is that the solution learned by Adaptive Gradient methods reduces to a sign() operation on top of that vector, where magnitude information is lost. This happens because the running gradients that you divide out happen to correspond to the absolute value of this vector, and dividing a vector (which would be the core of the solution in the nonadaptive case) by its absolute value gives you a simple sign. The paper then goes on to show that this edge case can lead to cases of pathological overfitting in cases of high feature dimensionality relative to data points. (I wish I could give more deep insight on why this is the case, but I wasn’t really able to translate the math into intuition, outside of this fact of scaling by gradient magnitudes having the effect of losing potentially useful gradient information. The big question from all this is...does this matter? Does it matter, in particular, beyond a toy dataset, and an artificially simple problem? The answer seems to be a pretty strong maybe. The authors test adaptive methods against hyperparameteroptimized SGD and momentum SGD (a variant, but without the adaptive aspects), and find that, while adaptive methods often learn more quickly at first, SGD approaches pick up later in training, first in terms of test set error at a time when adaptive methods’ training set error still seems to be decreasing, and later even in training set error. So there seems to be evidence that solutions learned by adaptive methods generalize worse than ones learned by SGD, at least on some image recognition and languageRNN models. (Though, interestingly, RMSProp comes close to the SGD test set levels, doing the best out of the adaptive methods). Overall, this suggests to me that doing fully hyperparameter optimized SGD might be a stronger design choice, but that adaptive methods retain popularity because of their (very appealing, practically) lack of need for hyperparameter tuning to at least to a *reasonable* job, even if their performance might have more of a ceiling than that of vanilla SGD. 
[link]
The method they use basically tells the robot to reason as follows: 1. The human gave me a reward function $\tilde{r}$, selected in order to get me to behave the way they wanted. 2. So I should favor reward functions which produce that kind of behavior. This amounts to doing RL (step 1) followed by IRL on the learned policy (step 2); see the final paragraph of section 4. 
[link]
Ratner et al. Train an adversarial generative network to learn domainspecific sequences of transformations useful for data augmentation. In particular, as indicated in Figure 1, the generator learns to predict sequences of userspecified transformations and the classifier is intended to distinguish the original images from the transformed ones. For training, the authors use reinforcement learning, because the transformations are not necessarily differentiable – which makes usage of the proposed method very convenient. https://i.imgur.com/hHQkhIk.png Figure 1: Highlevel illustration of the proposed method for learning data augmentation. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Residual Networks (ResNets) have greatly advanced the stateoftheart in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feedforward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a **reversible architecture** based on ResNets, in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage. * **Pros (+): ** The change does not negatively impact model accuracy (for equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm. * **Cons (): ** Increased number of parameters, thus need to change the unit depth to match the "equivalent" ResNet  # Proposed Architecture ## RevNet This paper proposes to incorporate idea from previous reversible architectures, such as NICE [1], into a standard ResNet. The resulting model is called **RevNet** and is composed of reversible blocks, inspired from *additive coupling* [1, 2]: $ \begin{array}{rr} \texttt{RevNet Block} & \texttt{Inverse Transformation}\\ \hline \mathbf{input }\ x & \mathbf{input }\ y \\ x_1, x_2 = \mbox{split}(x) & y1, y2 = \mbox{split}(y)\\ y_1 = x_1 + \mathcal{F}(x_2) & x_2 = y_2  \mathcal{G}(y_1) \\ y_2 = x_2 + \mathcal{G}(y_1) & x_1 = y_1  \mathcal{F}(x_2)\\ \mathbf{output}\ y = (y_1, y_2) & \mathbf{output}\ x = (x_1, x_2) \end{array} $ where $\mathcal F$ and $\mathcal G$ are residual functions, composed of sequences of convolutions, ReLU and Batch Normalization layers, analoguous to the ones in a standard ResNet block, although operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility. Finally, for the `split` operation, the authors consider spliting the input Tensor across the channel dimension as in [1, 2]. Similarly to ResNet, the final RevNet architecture is composed of these invertible residual blocks, as well as nonreversible subsampling operations (e.g., pooling) for which activations have to be stored. However the number of such operations is much smaller than the number of residual blocks in a typical ResNet architecture. ## Backpropagation ### Standard The backpropagaton algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters in a neural network: given a loss function $L$, we want to compute the gradients of $L$ with respect to the parameters of each layer, indexed by $n \in [1, N]$, i.e., the quantities $ \overline{\theta_{n}} = \partial L /\ \partial \theta_n$. (where $\forall x, \bar{x} = \partial L / \partial x$). We roughly summarize the algorithm in the left column of **Table 1**: In order to compute the gradients for the $n$th block, backpropagation requires the input and output activation of this block, $y_{n  1}$ and $y_{n}$, which have been stored, and the derivative of the loss respectively to the output, $\overline{y_{n}}$, which has been computed in the backpropagation iteration of the upper layer; Hence the name backpropagation ### RevNet Since activations are not stored in RevNet, the algorithm needs to be slightly modified, which we describe in the right column of **Table 1**. In summary, we first need to recover the input activations of the RevNet block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. $\overline{y_{n  1}} = (\overline{y_{n 1, 1}}, \overline{y_{n  1, 2}})$, using the fact that: $ \begin{align} \overline{y_{n  1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{y_{n  1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{y_{n  1, i}} \end{align} $ Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities we can obtain the gradients with respect to the parameters of this block, $\theta_n$. $ \begin{array}{cll} \hline & \mathbf{ResNet} & \mathbf{RevNet} \\ \hline \mathbf{Block} & y_{n} = y_{n  1} + \mathcal F(y_{n  1}) & y_{n  1, 1}, y_{n  1, 2} = \mbox{split}(y_{n  1})\\ && y_{n, 1} = y_{n  1, 1} + \mathcal{F}(y_{n  1, 2})\\ && y_{n, 2} = y_{n  1, 2} + \mathcal{G}(y_{n, 1})\\ && y_{n} = (y_{n, 1}, y_{n, 2})\\ \hline \mathbf{Params} & \theta = \theta_{\mathcal F} & \theta = (\theta_{\mathcal F}, \theta_{\mathcal G})\\ \hline \mathbf{Backprop} & \mathbf{in:}\ y_{n  1}, y_{n}, \overline{ y_{n}} & \mathbf{in:}\ y_{n}, \overline{y_{n }}\\ & \overline{\theta_n} =\overline{y_n} \frac{\partial y_n}{\partial \theta_n} &\texttt{# recover activations} \\ &\overline{y_{n  1}} = \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n1}} &y_{n, 1}, y_{n, 2} = \mbox{split}(y_{n}) \\ &\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n 1}} & y_{n  1, 2} = y_{n, 2}  \mathcal{G}(y_{n, 1})\\ &&y_{n  1, 1} = y_{n, 1}  \mathcal{F}(y_{n  1, 2})\\ &&\texttt{# gradients wrt. inputs} \\ &&\overline{y_{n 1, 1}} = \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\ &&\overline{y_{n 1, 2}} = \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \right) \\ &&\texttt{ gradients wrt. parameters} \\ &&\overline{\theta_{n, \mathcal G}} = \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\ &&\overline{\theta_{n, \mathcal F}} = \overline{y_{n,1}} \frac{\partial F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n,1}}\\ &&\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n 1}}, y_{n  1}\\ \hline \end{array} $ **Table 1:** Backpropagation in the standard case and for Reversible blocks  ## Experiments ** Computational Efficiency.** RevNets trade off memory requirements, by avoiding storing activations, against computations. Compared to other methods that focus on improving memory requirements in deep networks, RevNet provides the best tradeoff: no activations have to be stored, the spatial complexity is $O(1)$. For the computation complexity, it is linear in the number of layers, i.e. $O(L)$. One small disadvantage is that RevNets introduces additional parameters, as each block is composed of two residuals, $\mathcal F$ and $\mathcal G$, and their number of channels is also halved as the input is first split into two. **Results.** In the experiments section, the author compare ResNet architectures to their RevNets "counterparts": they build a RevNet with roughly the same number of parameters by halving the number of residual units and doubling the number of channels. Interestingly, RevNets achieve **similar performances** to their ResNet counterparts, both in terms of final accuracy, and in terms of training dynamics. The authors also analyze the impact of floating errors that might occur when reconstructing activations rather than storing them, however it appears these errors are of small magnitude and do not seem to negatively impact the model. To summarize, reversible networks seems like a very promising direction to efficiently train very deep networks with memory budget constraints.  ## References * [1] NICE: Nonlinear Independent Components Estimation, Dinh et al., ICLR 2015 * [2] Density estimation using Real NVP, Dinh et al., ICLR 2017 
[link]
Hein and Andriushchenko give a intuitive bound on the robustness of neural networks based on the local Lipschitz constant. With robustness, the authors refer a small $\epsilon$ball around each sample; this ball is supposed to describe the region where the neural network predicts a constant class. This means that adversarial examples have to compute changes large enough to leave these robust areas. Larger $\epsilon$balls imply higher robustness to adversarial examples. When considering a single example $x$, and a classifier $f = (f_1, \ldots, f_K)^T$ (i.e. in a multiclass setting), the bound can be stated as follows. For $q$ and $p$ such that $\frac{1}{q} + \frac{1}{p} = 1$ and $c$ being the class predicted for $x$, the it holds $x = \arg\max_j f_j(x + \delta)$ for all $\delta$ with $\\delta\_p \leq \max_{R > 0}\min \left\{\min_{j \neq c} \frac{f_c(x) – f_j(x)}{\max_{y \in B_p(x, R)} \\nabla f_c(y)  \nabla f_j(y)\_q}, R\right\}$. Here, $B_p(x, R)$ describes the $R$ball around $x$ measured using the $p$norm. Based on the local Lipschitz constant (in the denominator), the bound essentially measures how far we can deviate from the sample $x$ (measured in the $p$norm) until $f_j(x) > f_c(x)$ for some $j \neq c$. The higher the local Lipschitz constant, the smaller deviations are allowed, i.e. adversarial examples are easier to find. Note that the bound also depends on the confidence, i.e. the edge $f_c(x)$ has in comparison to all other $f_j(x)$. In the remaining paper, the authors also provide bounds for simple classifiers including linear classifiers, kernel methods and twolayer perceptrons (i.e. one hidden layer). For the latter, they also propose a new type of regularization called crossLipschitz regularization: $P(f) = \frac{1}{nK^2} \sum_{i = 1}^n \sum_{l,m = 1}^K \\nabla f_l(x_i)  \nabla f_m(x_i)\_2^2$. This regularization term is intended to reduce the Lipschitz constant locally around training examples. They show experimental results using this regularization on MNIST and CIFAR, see the paper for details. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
This paper describes an architecture designed for generating class predictions based on a set of features in situations where you may only have a few examples per class, or, even where you see entirely new classes at test time. Some prior work has approached this problem in ridiculously complex fashion, up to and including training a network to predict the gradient outputs of a metanetwork that it thinks would best optimize loss, given a new class. The method of Prototypical Networks prides itself on being much simpler, and more intuitive, so I hope I’ll be able to convey that in this explanation. In order to think about this problem properly, it makes sense to take a few steps back, and think about some fundamental assumptions that underly machine learning. https://i.imgur.com/Q45w0QT.png One very basic one is that you need some notion of similarity between observations in your training set, and potential new observations in your test set, in order to properly generalize. To put it very simplistically, if a test example is very similar to examples of class A that we saw in training, we might predict it to be of class A at testing. But what does it *mean* for two observations to be similar to one another? If you’re using a method like K Nearest Neighbors, you calculate a point’s class identity based on the closest trainingset observations to it in Euclidean space, and you assume that nearness in that space corresponds to likelihood of two data points having come the same class. This is useful for the use case of having new classes show up after training, since, well, there isn’t really a training period: the strategy for KNN is just carrying your whole training set around, and, whenever a new test point comes along, calculating it’s closest neighbors among those trainingset points. If you see a new class in the wild, all you need to do is add the examples of that class to your group of training set points, and then after a few examples, if your assumptions hold, you’ll be able to predict that class by (hopefully) finding those two or three points as neighbors. But what if some dimensions of your feature space matter much more than others for differentiating between classes? In a simplistic example, you could have twenty features, but, unbeknownst to you, only one is actually useful for separating out your classes, and the other 19 are random. If you use the naive KNN assumption, you wouldn’t expect to perform well here, because you will have distances in these 19 meaningless directions spreading out your points, due to randomness, more than the meaningful dimension spread them out due to belonging to different classes. And what if you want to be able to learn nonlinear relationships between your features, which the composability of multilayer neural networks lends itself well to? In cases like those, the features you were handed may be a woefully suboptimal metric space in which to calculate a kind of similarity that corresponds to differences in class identity, so you’ll just have to strike out for the territories and create a metric space for yourself. That is, at a very high level, what this paper seeks to do: learn a transformation between input features and some vector space, such that distances in that vector space correspond as well as possible to probabilities of belonging to a given output class. You may notice me using “vector space” and “embedding” similarity; they are the same idea: the result of that learned transformation, which represents your input observations as dense vectors in some pdimensional space, where p is a chosen hyperparameter. What are the concrete learning steps this architecture goes through? 1. During each training episode, sample a subset of classes, and then divide those classes into training examples, and query examples 2. Using a set of weights that are being learned by the network, map the input features of each training example into a vector space. 3. Once all training examples are mapped into the space, calculate a “mean vector” for class A by averaging all of the embeddings of training examples that belong to class A. This is the “prototype” for class A, and once we have it, we can forget the values of the embedded examples that were averaged to create it. This is a nice update on the KNN approach, since the number of parameters we need to carry around to evaluate is only (numdimensions) * (numclasses), rather than (numdimensions) * (numtrainingexamples). 4. Then, for each query example, map it into the embedding space, and use a distance metric in that space to create a softmax over possible classes. (You can just think of a softmax as a network’s predicted probability, it’s a set of floats that add up to 1). 5. Then, you can calculate the (crossentropy) error between the true output and that softmax prediction vector in the same way as you would for any classification network 6. Add up the prediction loss for all the query examples, and then backpropogate through the network to update your weights The overall effect of this process is to incentivize your network to learn, not necessarily a good prediction function, but a good metric space. The idea is that, if the metric space is good enough, and the classes are conceptually similar to each other (i.e. car vs chair, as opposed to car vs themeaningoflife), a space that does well at causing similar observed classes to be close to one another will do the same for classes not seen during training. I admit to not being sufficiently familiar with the datasets used for testing to have a sense for how well this method compares to more fully supervised classification schemes; if anyone does, definitely let me know! But the paper claims to get state of the art results compared to other approaches in this domain of fewshot learning (matching networks, and the aforementioned metalearning). One interesting note is that the authors found that squared Euclidean distance, when applied within the embedded space, worked meaningfully better than cosine distance (which is a more standard way of measuring distances between vectors, since it measures only angle, rather than magnitude). They suspect that this is because Euclidean distance, but not cosine distance belongs to a category of divergence/distance metrics (called Bregman Divergences) that have a special set of properties such that the point closest on aggregate to all points in a cluster is the average of all those points. If you want to dive way deep into the minutia on this point, I found this blog post quite good: http://mark.reid.name/blog/meetthebregmandivergences.html
1 Comments

[link]
This paper’s approach goes a step further away from the traditional word embedding approach  of training embeddings as the lookuptable first layer of an unsupervised monolingual network  and proposes a more holistic form of transfer learning that involves not just transferring over learned knowledge contained in a set of vectors, but a fully trained model. Transfer learning is the general idea of using part or all of a network trained on one task to perform a different task. The most common kind of transfer learning is in the image domain, where models are first trained on the enormous ImageNet dataset, and then several of the lower layers of the network (where more local, smallpixelrange patterns are detected) are transferred, with their weights fixed in place to a new network. The modeler then attaches a few more layers to the top, connects it to a new target, and then is able to much more quickly learn their new target, because the pretraining has gotten them into a useful region of parameterspace. https://i.imgur.com/wjloHdi.png Within NLP, the most common form of transfer learning is initializing the lookup table of vectors that’s used to convert discrete words in to vectors (also known as an embedding) with embeddings pretrained on huge unsupervised datasets, like GloVe, trained on all of English Wikipedia. Again, this makes your overall task easier to train, because you’ve already converted words from their unuseful binary representation (where the word cat is just as far from Peru as it is from kitten) to a meaningful realvalued representation. The approach suggested in this paper goes beyond simply learning the vector input representation of words. Instead, the authors suggest using as word vectors the sequence of encodings produced by an encoderdecoder bidirectional recurrent model. An encoderdecoder model means that you have one part of the network that maps from input sentence to an “encoded” representation of the sentence, and then another part that maps that encoded representation into the proper tokens in the target language. Historically, this encoding had been a single vector for the whole sentence, which tried to conceptually capture all of the words into one vector. More recently, a different approach has grown popular, where the RNN produces a number of encodings equal to the number of input words. Then, when the decoder is producing words in the target sentence, it uses something called “attention” to select a weighted combination of these encodings at each point in time. Under this scheme, the decoder might pull out information about verbs when its own hidden state suggests it needs a verb, and might pull out information about pronoun referents when its own hidden state asks for that. The upshot of all of this is that you end up with a sequence of encoded vectors equal in length to your number of inputs. Because the RNN is bidirectional, which means the encoding is a concatenation of the forward RNN and backward RNN, that means that each of these encodings captures both information about its corresponding word, and contextual information about the rest of the sentence. The proposal of the authors is to train the encoderdecoder outlined above, and, once it is trained, lop off the decoder, and use the encoded sequence of words as your representation of the input sequence of words. An important note in all this is that recurrent encoderdecoder model was itself trained using a lookup table initialized with learned GloVe vectors, so in a sense they’re not substituting for the unsupervised embeddings so much as learning marginal information on top of them. The authors went on to test this approach on a few problems  question answering, logical entailment, and sentiment classification. They compared their use of the RNN encoded word vectors (which they call Context Vectors, or CoVE) with models initialized just using the fixed GloVE word vectors. One important note here is that, because each word vector is learned fully in context, the same word will have a different vector in each sentence it appears in. That’s why you can’t transfer one single vector per word, but instead have to transfer the recurrent model that can produce the vectors. All in all, the authors found that concatenating CoVe vectors to GloVe vectors, and using the concatenated version as input, produced sizable gains on the problems where it was tried. That said, it’s a pretty heavy lift to integrate someone else’s learned weights into your own model, just in terms of getting all the code to play together nicely. I’m not sure if this is a compelling enough result, a la ImageNet pretraining, for practitioners to want to go to the trouble of tacking a nontraining RNN onto the bottom of all their models. If I ever get a chance, I’d be interested to play with the vectors you get out of this model, and look at how much variance you see in the vectors learned for different words across different sentences. Do you see clusters that correspond to sense disambiguation, (a la state of mind, vs a rogue state)? And, how does this contextual approach to the paper I reviewed yesterday, that also learns embeddings on a machine translation task, but does so in terms of training a lookup table, rather than using trained encodings? All in all, I enjoyed this paper: it was a simple idea, and I’m not sure whether it was a compelling one, but it did leave me with some interesting questions. 
[link]
There are mathematicians, still today, who look at deep learning, and get real salty over the lack of convex optimization. That is to say: convex functions are ones where you have an actual guarantees that gradient descent will converge, and mathematicians of olden times (i.e. 2006) spent reams of paper arguing that this or that function had convex properties, and thus could be guaranteed to converge, under this or that set of arcane conditions. And then, Deep Learning came along, with its huge, nonlinear, very much nonconvex objective functions, that it was nonetheless trying to optimize via gradient descent. From the perspective of an optimization theorist, this had the whiff of heresy, but exceptionally effective heresy. And, so, the field of DL has halfexploded, halfstumbled along, showcasing a portfolio of very impressive achievements, but with theory very much a secondary priority relative to performance. Something else that gradient descent isn’t supposed to be able to do is learn models that include discrete (i.e. noncontinuous) operators. Without continuous gradients, the functions don’t have an obvious way to “push” in a certain direction, to modulate the loss at the end of the network. Discrete nodes mean that the value just jumps from being in one state, to being in the other, with no intermediate values. This has historically posed a problem for algorithms fueled by gradient descent. The authors of this paper came up with a solution that is 60% cleverness, and 40% just guessing that “even if we ignore the theory, things will probably work well enough”. But, first, their overall goal: to create a Variational Auto Encoder where the latent states, the compressed internal representation that is typically an array of continuous values, is instead an array of categorical values. The goal of this was 1) to have a representation type that was a better match for the discrete nature of data types like speech (which has distinct phonemes we might like to discretely capture), and, 2) to have a more compressed latent space that would (of necessity) focus on more global information, and leave local pixellevel information to be learned by the expressive PixelCNN decoder. The way they do this is remarkably simple. First, they learn a typical VAE encoder, mapping from the input pixels to a continuous z space. (An interesting sidenote here is that this paper uses spatially organized z; instead of using one single z vector to represent the whole image, they may have 32x32 spatial locations, each of which has its own z vector, to represent at 128x128 image). Then, for each of the spatial regions, they take the continuous vector produced by the network, and compare it to a fixed set of “embedding” vectors, of the same shape. That spatial location is then lumped into the category of the embedding that it’s closest to, meaning that you end up with a compressed layer of 32x32 (in this case) spatial regions, each of which is represented by a categorical number between 0 and maxnumcategories. Then, the network passes forward the embedding that this input vector was just “snapped” to being, Then, the decoder uses the full spatial location set of embeddings to do its decoding. https://i.imgur.com/P8LQRYJ.png The clever thing here comes when you ask how to train the encoder to produce a different embedding, when there was this discrete “jump” that happened. The authors choose to just avoid the problem, more or less. They do that by just taking the gradient signals that come back from the end of the network to the embedding, and just pass those directly to the vector that was used to nearestneighborslookup the embedding. Basically, they pretend that they passed the vector through the rest of the network, rather than the embedding. The embeddings are then trained in a K Means Clustering kind of way; with the embeddings being iteratively updated to be closer to the points that were assigned to their embedding in each round of training. This is the “Vector Quantization” part of VQVAE Overall, this seems to perform quite well: with the low capacity of the latente space meaning that it is incentivized to handle more global structure, while leaving low level pixel details to the decoder. It is also much easier to fit afterthefact distributions over; once we’ve trained a VQVAE, we can easily learn a global model that represents the location by location dependencies between the categories (i.e. a 1 in this corner means at 5 in this other corner is more probable). This gives us the ability to have an analytically specified distribution, in latent space, that actually represents the structure of how these “concept level categories” relate to each other. By contrast, with most continuous latent spaces, it’s intractable to learn an explicit density function after the fact, and thus if we want to be able to sample we need to specify and enforce a prior distribution over z ahead of time. 
[link]
 explore RL systems with (nonexpert) human preferences between pairs of trajectory segments;  run experiments on some RL tasks, namely **Atari** and **MuJoCo**, and show effectiveness of this approach;  advantages mentioned:  no need to access to the reward function;  less than 1% feedback needed > reduce the cost of human oversight;  can learn complex novel behaviors. ## Introduction **Challenges**  goals complex, poolydefined or hard to specify;  reward function > behaviors that optimize reward function without achieving goals; > This difficulty underlines recent concerns about misalignment between reward values and the objectives of RL systems. **Alternatives**  inverse RL: extract a reward function from demonstrations of desired tasks;  imitation learning: clone the demonstrated behavior; *con: not applicable to behaviors that are hard to demonstrate for humans*  use human feedback as a reward function; *con: require thousands of hours of experience, prohibitively expensive* **Basic idea** https://i.imgur.com/3R7tO7R.png **Contributions** 1. solve tasks for which we can only recognize but not demonstrate the desired behaviors; 2. allow nonexpert agent training; 3. scale to larger problems; 4. economical with user feedback. **Related work** two lines of work: (1) RL from human ratings or rankings; (2) general problemn of RL from preferences rather than absolute reward values. *closerelated paper:* (1) [Active preference learningbased reinforcement learning](https://arxiv.org/abs/1208.0984); (2) [Programming by feedback](http://proceedings.mlr.press/v32/schoenauer14.pdf); (3) [A Bayesian approach for policy learning from trajectory preference queries](https://papers.nips.cc/paper/4805abayesianapproachforpolicylearningfromtrajectorypreferencequeries). *diffs with (1)(2)*: a) elicit preferences over whole trajectories rather than short clips; b) change training procedure to cope with nonlinear reward models and modern deep RL. *diffs with (3)*: a) fit reward function by Bayesian inference; b) produce trajectories using MAP estimate of the target policy instead of RL > involve 'synthetic' human feedback drawn from Bayesian model. ## Preliminaries and Method **Agent goal** to produce trajectories which are preferred by the human, while making as few queries as possible to the human. **Work flow** at each point maintains two deep NNs  policy *pi*: O > A; reward estimate *r\_hat*: O x A > R. *Update procedure (asyn):* 1. policy *pi* => env => trajectories *tau* = {*tau^1*,..., *tau^i*}. Then update *pi* by a traditional RL algorithm to maximize the sum of predicted rewards *r\_t* = *r\_hat*(*o\_t*, *a\_t*); 2. select segment pairs *sigma* = (*sigma^1*, *sigma^2*) from *tau*. *sigma* => human comparison => labeled data; 3. update *r\_hat* with labeled data by supervised learning. > step 1 => trajectory *tau* => step 2 => human comparison => step 3 => parameters for *r\_hat* => step 1 => .... **Policy optimization (step 1)** *subtlety:* nonstationary reward function *r\_hat* > methods robust to changes in reward function. A2C => Atari, TRPO => MuJoCo. use parameter settings that work well for traditional RL tasks; only adjust the entropy bonus for TRPO (improve inadequate exploration); normalize rewards to zero mean and constand std. **Preference eliciation (step 2)** clips of trajectory segments for 1 to 2 seconds long. *data struct:* triples (*sigma^1*, *sigma^2*, *mu*), *mu*  distribution over {1, 2}. one preferable over the others > *mu* puts all mass on that choice; equally preferable > *mu* uniform; incomparable > skip saving triples. **Fitting the reward function (step 3)** *assumption:* human’s probability of preferring a segment *sigma^i* depends exponentially on the value of the latent reward summed over the length of the clip. https://i.imgur.com/2ViIWcL.png (no discount of reward < human being indifferent about when things happen in the trajectory segment; could consider discounting.) https://i.imgur.com/cpdTZW6.png *modifications:* 1. ensemble of predictors > independently normalize base predictors and then average results; 2. validation set ratio 1/e; employ *l\_2* regularization, and tune regularization coefficient > validation loss = 1.1~1.5 training loss; dropout in some domains; 3. assume 10% chance that human responds uniformly at random; **Selecting queries** *pipeline:* sample trajectory segments of length k > predict preference by base reward predictor in our ensemble > select trajectories with the highest variance across ensemble members *future work:* query based on the expected value of information of query. *Related articles:* 1. [APRIL: Active Preferencelearning based Reinforcement Learning](https://arxiv.org/abs/1208.0984) 2. [Active reinforcement learning: Observing rewards at a cost](http://www.filmnips.com/wpcontent/uploads/2016/11/FILMNIPS2016_paper_30.pdf) > At each timestep, the agent chooses both an action and whether to observe the reward in the next timestep. If the agent chooses to observe the reward, then it pays the “query cost” c > 0. The agent’s objective is to maximize total reward minus total query cost. 
[link]
An attentionbased architecture for combining information from different convolutional layers. The attention values are calculated using an iterative process, making use of a custom squashing function. The evaluations on MNIST show robustness to affine transformations. 
[link]
* Semisupervised method * There is a teacher net and a student net, with identical architecture. * The teacher makes predictions on unlabeled data, which are used as groundtruth for training the student net. * After each gradient descent update on the student, the teacher's weights are updated so that it becomes an exponential moving average of the weights of the student at previous timesteps. It's called a "mean teacher" because of this moving average. 
[link]
The idea in this paper is to develop a version of attention that will incorporate similarity in neighboring bins. This aligned with the work \cite{conf/icml/BeckhamP17} which presented a different approach to deal with consistency between classes of predictions. In this work the closed form softmax function is replaced by a small optimization problem with this regularizer: $$ +\lambda \sum_{i=1}^{d1} y_{i+1}y_i$$ Because of this, many of the neighboring probabilities are exactly the same resulting in attention that can be seen as blocks. https://i.imgur.com/oue0x4V.png Poster: https://i.imgur.com/gclMjzR.png 
[link]
This paper introduces triangleGAN ($\triangle$GAN) that aims at crossdomain joint distribution matching: The model is shown below https://i.imgur.com/boIDOMu.png Having two domains of data $x$ and $y$, there are two generators: 1 $G_x(y)$ which takes $y$ and generates $\tilde{x}$ 2 $G_y(x)$ which takes $x$ and generates $\tilde{y}$ There are two discriminators in the model: 1 $D_1 (x,y)$ a discriminator that distinguishes between $(x, y)$ and either of $(x, \tilde{y})$ or $(\tilde{x}, y)$. 2 $D_2 (x,y)$ a discriminator that distinguishes between $(x, \tilde{y})$ and $(\tilde{x}, y)$. The second discriminator is ALI and can be used on unpaired sets of data. The first discriminator is equivalent to a conditional discriminator where the true paired data $(x, y)$ is compared to either $(x, \tilde{y})$ or $(\tilde{x}, y)$, where one element in the pair is sampled. This discriminator needs paired $(x, y)$ data for training. This model can be used for semisupervised settings, where a small set of paired data is provided. In this paper it is used for:  semisupervised image classification, where a small subset of CIFAR10 is labelled. $x$ and $y$ are images and class labels here.  image to image translation on edge2shoes dataset, where only a subset of dataset is paired.  attribute conditional image generation where $x$ and $y$ domains are image and attributes. CelebA and COCO datasets are used here. In one experiment testset images are projected to attributes and then given those attributes new images are generated: On celebA: https://i.imgur.com/EX5tDZ0.png On COCO: https://i.imgur.com/GRpvjGx.png In another experiment some attributes are chosen (as samples shown below in the first row with different noise) and then another feature is added (using the same noise) to generate the samples in the second row: https://i.imgur.com/KeHL8Ye.png The triangle gan demonstrates improved performance compared to triple gan in the experiments shown in the paper. It has been also compared with Disco gan (a model that can be trained on unpaired data) and shows improved performance when some percentage of paired data is provided. In an experiment they pair each MNIST digit with its transposed (as $x$, $y$ pairs). DiscoGAN cannot learn correct mapping between them, while triangleGAN can learn correct mapping since it leverages paired data. https://i.imgur.com/Vz9Zfhu.png In general this model is a useful approach for semisupervised crossdomain matching and can leverage unpaired data (using ALI) as well as paired data (using conditional discriminator). 
[link]
This paper aims at changing the attributes of a face, without manipulating other aspects of the image, such as add/remove glasses, make a person young/old, changed the gender, and hence the name Fader Networks, similar to sliders of audio mixing tools that can change a value linearly to increase/decrease a feature. The model is shown below: https://i.imgur.com/fntPmNu.png An image $x$ is passed to the encoder and the output of the encoder $E(x)$ is passed to the discriminator to distinguish whether a feature $y$ is in the latent space or not. The encoded features $E(x)$ and the feature $y$ is passed to the decoder to reconstruct the image $D(E(x))$. The AE therefore has two loss: 1 The reconstruction loss between $x$ and $D(E(x))$, and 2 The gan loss to fool the discriminator on the feature $y$ in the encoded space $E(x)$. The discriminator tries to distinguish whether a feature $y$ is in the encoded space $E(x)$ or not, while the encoder tries to fool the discriminator. This process leads to removal of the feature $y$ from the $E(x)$ by encoder. The encoded feature $E(x)$ therefore does not have any information on $y$. However, since the decoder needs to reconstruct the same input image, $E(x)$ has to maintain all information, except the feature $y$ and the decoder should get the feature $y$ from the input of the decoder. The model is trained on binary $y$ features such as: male/female, young/old, glasses Yes/No, mouth open Yes/No, eyes open Yes/No (some samples from test set below): https://i.imgur.com/bj9wu6B.png At test time, they can change the features continuously and show transition in the features: https://i.imgur.com/XUD3ZTu.png The performance of the model is measured using mechanical turks on two metrics: Naturalness of the images and the accuracy of swapping features on the image. In both FadNet shows better results compared to IcGAN, and FadNet shows very good results on accuracy, however on naturalness the performance drops when some features are swapped. On Flowers dataset, FadNet can change colors of the flowers: https://i.imgur.com/7nvBSEY.png I find the following positive aspects about FadNet: 1 It can change some features while maintaining other features of the image such as identity of the person, background information, etc. 2 The model does not need paired data. In some cases it is impossible to gather paired data (e.g. male/female) or very difficult (young/old). 3 The gan loss is used to remove a feature in the latent space, where that feature can be later specified for reconstruction by decoder. Since GAN is applied to latent space, it can be used to remove features on the data that is discrete (where direct usage of disc on those data is not trivial). I think these aspects need further work for improvement:  When multiple features are changed the blurriness of the image shows up: https://i.imgur.com/LD5cVbg.png When only one feature changes the blurriness affect is much less, despite the fact that they use L2loss for AE reconstruction. I guess also using a high resolution of 256*256 helps make the bluriness of the images less noticeable.  The model should be first trained only on AE (no gan loss) and then the gan loss in AE is linearly increased to remove a feature. So, it requires a bit of care in training it properly. Overall, I find it an interesting paper on how to change a feature in an image when one wants to keep other features unchanged. 
[link]
This paper describes using Relation Networks (RN) for reasoning about relations between objects/entities. RN is a plugandplay module and although expects object representations as input, the semantics of what an object is need not be specified, so object representations can be convolutional layer feature vectors or entity embeddings from text, or something else. And the feedforward network is free to discover relations between objects (as opposed to being handassigned specific relations).  At its core, RN has two parts:  a feedforward network `g` that operates on pairs of object representations, for all possible pairs, all pairwise computations pooled via elementwise addition  a feedforward network `f` that operates on pooled features for downstream task, everything being trained endtoend  When dealing with pixels (as in CLEVR experiment), individual object representations are spatially distinct convolutional layer features (196 512d object representations for VGG conv5 say). The other experiment on CLEVR uses explicit factored object state representations with 3D coordinates, shape, material, color, size.  For bAbI, object representations are LSTM encodings of supporting sentences.  For VQA tasks, `g` conditions its processing on question encoding as well, as relations that are relevant for figuring out the answer would be questiondependent. ## Strengths  Very simple idea, clearly explained, performs well. Somewhat shocked that it hasn't been tried before. ## Weaknesses / Notes Fairly simple idea — let a feedforward network operate on all pairs of object representations and figure out relations necessary for downstream task with endtoend training. And it is fairly general in its design, relations aren't handdesigned and neither are object representations — for RGB images, these are spatially distinct convolutional layer features, for text, these are LSTM encodings of supporting facts, and so on. This module can be dropped in and combined with more sophisticated networks to improve performance at VQA. RNs also offer an alternative design choice to prior works on CLEVR, that have this explicit notion of programs or modules with specialized roles (that need to be predefined), as opposed to letting these relations emerge, reducing dependency on handdesigning modules and adding in inductive biases from an architectural pointofview for the network to reason about relations (earlier endtoend VQA models didn't have the capacity to figure out relations). 
[link]
This paper described an algorithm of parametrically adding noise and applying a variational regulariser similar to that in ["Variational Dropout Sparsifies Deep Neural Networks"][vardrop]. Both have the same goal: make neural networks more efficient by removing parameters (and therefore the computation applied with those parameters). Although, this paper also has the goal of giving a prescription of how many bits to store each parameter with as well. There is a very nice derivation of the hierarchical variational approximation being used here, which I won't try to replicate here. In practice, the difference to prior work is that the stochastic gradient variational method uses hierarchical samples; ie it samples from a prior, then incorporates these samples when sampling over the weights (both applied through local reparameterization tricks). It's a powerful method, which allows them to test two different priors (although they are clearly not limited to just these), and compare both against competing methods. They are comparable, and the choice of prior offers some tradeoffs in terms of sparsity versus quantization. [vardrop]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1701.05369 
[link]
"Using the "SELU" activation function, you get better results than any other activation function, and you don't have to do batch normalization. The "SELU" activation function is: if x<0, 1.051\*(1.673\*e^x1.673) if x>0, 1.051\*x" Source: narfon2, reddit ``` import numpy as np def selu(x): alpha = 1.6732632423543772848170429916717 scale = 1.0507009873554804934193349852946 return scale*np.where(x>=0.0, x, alpha*np.exp(x)alpha) ``` Source: CaseOfTuesday, reddit Discussion here: https://www.reddit.com/r/MachineLearning/comments/6g5tg1/r_selfnormalizing_neural_networks_improved_elu/ 
[link]
[Batch Normalization Ioffe et. al 2015](Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift) is one of the remarkable ideas in the era of deep learning that sits with the likes of Dropout and Residual Connections. Nonetheless, last few years have shown a few shortcomings of the idea, which two years later Ioffe has tried to solve through the concept that he calls Batch Renormalization. Issues with Batch Normalization  Different parameters used to compute normalized output during training and inference  Using Batch Norm with small minibatches  Noni.i.d minibatches can have a detrimental effect on models with batchnorm. For e.g. in a metric learning scenario, for a minibatch of size 32, we may randomly select 16 labels then choose 2 examples for each of these labels, the examples interact at every layer and may cause model to overfit to the specific distribution of minibatches and suffer when used on individual examples. The problem with using moving averages in training, is that it causes gradient optimization and normalization in opposite direction and leads to model blowing up. Idea of Batch Renormalization We know that, ${\frac{x_i  \mu}{\sigma} = \frac{x_i  \mu_B}{\sigma_B}.r + d}$ where, ${r = \frac{\sigma_B}{\sigma}, d = \frac{\mu_B  \mu}{\sigma}}$ So the batch renormalization algorithm is defined as follows ![Batch Renorm Algo](https://fractalanalyticmy.sharepoint.com/personal/shubham_jain_fractalanalytics_com/_layouts/15/guestaccess.aspx?docid=0c2c627424786442f8de65367755e1fd1&authkey=ARSCi3QfpM_uBVuWCYARKNg) Ioffe writes further that for practical purposes, > In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed. In experiments, For Batch Renorm, author used $r_{max}$ = 1, $d_{max}$ = 0 (i.e. simply batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{max}$ = 3 at 40k steps, and $d_{max}$ = 5 at 25k steps. A training step means, an update to the model.
2 Comments
