
Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks

Shiyu Liang and Yixuan Li and R. Srikant

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2017/06/08 (3 years ago)

**Abstract:** We consider the problem of detecting out-of-distribution images in neural
networks. We propose ODIN, a simple and effective method that does not require
any change to a pre-trained neural network. Our method is based on the
observation that using temperature scaling and adding small perturbations to
the input can separate the softmax score distributions between in- and
out-of-distribution images, allowing for more effective detection. We show in a
series of experiments that ODIN is compatible with diverse network
architectures and datasets. It consistently outperforms the baseline approach
by a large margin, establishing a new state-of-the-art performance on this
task. For example, ODIN reduces the false positive rate from the baseline 34.7%
to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is
95%.

Liang et al. propose a perturbation-based approach for detecting out-of-distribution examples using a network’s confidence predictions. In particular, the approach is based on the observation that neural networks make more confident predictions on images from the original data distribution (in-distribution examples) than on examples taken from a different distribution, i.e., a different dataset (out-distribution examples). This effect can be amplified further by using a temperature-scaled softmax, i.e., $S_i(x; T) = \frac{\exp(f_i(x)/T)}{\sum_{j = 1}^N \exp(f_j(x)/T)}$ where $f_i(x)$ are the predicted logits and $T$ is a temperature parameter. Based on these softmax scores, perturbations $\tilde{x}$ are computed using $\tilde{x} = x - \epsilon\,\text{sign}(-\nabla_x \log S_{\hat{y}}(x; T))$ where $\hat{y}$ is the predicted label of $x$. This is similar to “one-step” adversarial examples; however, instead of minimizing the confidence of the true label, the confidence in the predicted label is maximized. Applied to in-distribution and out-distribution examples, this is illustrated in Figure 1 and is meant to emphasize the difference in confidence. Afterwards, in- and out-distribution examples can be distinguished using simple thresholding on the predicted confidence, as shown in various experiments, e.g., on Cifar10 and Cifar100.

https://i.imgur.com/OjDVZ0B.png

Figure 1: Illustration of the proposed perturbation to amplify the difference in confidence between in- and out-distribution examples.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
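
The temperature-scaled softmax and the perturbation step described above can be sketched on a toy linear classifier, where the gradient of $\log S_{\hat{y}}$ has a closed form. This is only an illustration under stated assumptions: the linear model, the weight matrix `W`, the input `x`, and the values of `T`, `eps`, and the threshold are all hypothetical and not taken from the paper.

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature-scaled softmax S_i(x; T) over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_logits(W, x):
    """Logits f(x) = W x of a toy linear classifier (stand-in for a network)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def odin_confidence(W, x, T=1.0, eps=0.1):
    """Max softmax score after x~ = x - eps * sign(-grad_x log S_yhat(x; T))."""
    probs = softmax_t(linear_logits(W, x), T)
    y_hat = max(range(len(probs)), key=lambda i: probs[i])
    # For a linear model: grad_x log S_yhat = (W[y_hat] - sum_j S_j * W[j]) / T.
    grad = [(W[y_hat][d] - sum(p * row[d] for p, row in zip(probs, W))) / T
            for d in range(len(x))]
    sign = lambda g: (g > 0) - (g < 0)
    x_tilde = [xi - eps * sign(-g) for xi, g in zip(x, grad)]
    return max(softmax_t(linear_logits(W, x_tilde), T))

# Hypothetical weights and input, chosen only for illustration.
W = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
x = [1.0, 0.5]
baseline = max(softmax_t(linear_logits(W, x)))
perturbed = odin_confidence(W, x, T=1.0, eps=0.1)
is_in_distribution = perturbed > 0.75  # arbitrary illustrative threshold
```

In this toy case the perturbation raises the confidence in the predicted label; the summary's point is that this increase tends to be larger for in-distribution inputs, which is what makes the final thresholding step more effective.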

Critic Regularized Regression

Ziyu Wang and Alexander Novikov and Konrad Zolna and Jost Tobias Springenberg and Scott Reed and Bobak Shahriari and Noah Siegel and Josh Merel and Caglar Gulcehre and Nicolas Heess and Nando de Freitas

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2021/01/15 (just now)

**Abstract:** Offline reinforcement learning (RL), also known as batch RL, offers the
prospect of policy optimization from large pre-recorded datasets without online
environment interaction. It addresses challenges with regard to the cost of
data collection and safety, both of which are particularly pertinent to
real-world applications of RL. Unfortunately, most off-policy algorithms
perform poorly when learning from a fixed dataset. In this paper, we propose a
novel offline RL algorithm to learn policies from data using a form of
critic-regularized regression (CRR). We find that CRR performs surprisingly
well and scales to tasks with high-dimensional state and action spaces --
outperforming several state-of-the-art offline RL algorithms by a significant
margin on a wide range of benchmark tasks.

Offline reinforcement learning is a potentially high-value thing for the machine learning community to learn to do well, because there are many applications where it'd be useful to generate a learnt policy for responding to a dynamic environment, but where it'd be too unsafe or expensive to learn in an on-policy or online way, where we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data - collected from a human demonstrator, or from some other algorithm - and be able to learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions.

There are two broad strategies for learning a policy from precollected transitions. One is to simply learn to mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and has the advantage of being somewhat more conservative (in terms of not experimenting with possibly-unsafe-or-low-reward actions the demonstrator never took), but this is also a disadvantage, because it's not possible to get higher reward than the demonstrator themselves got if you're simply copying their behavior.

Another approach is to learn a Q function - estimating the value of a given action in a given state - using the reward data from the precollected transitions. This can also have some downsides, mostly in the direction of overconfidence. Q value Temporal Difference learning works by using the current reward added to the max Q value over possible next actions as the target for the current-state Q estimate. This tends to lead to overestimates, because regression-to-the-mean effects mean that the highest-value Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset).
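
The overestimation effect in the TD target can be seen in a tiny worked example. This is a sketch under assumptions: the discount factor `gamma` is standard in Q learning but is not mentioned in the summary, and the reward and Q values are made up for illustration.

```python
def td_target(reward, next_q_values, gamma=0.99):
    """Q-learning target: current reward plus discounted max Q over next actions."""
    return reward + gamma * max(next_q_values)

# Suppose every next action is truly worth 0, but the estimates carry zero-mean noise.
true_next_q = [0.0, 0.0, 0.0]
noisy_next_q = [0.3, -0.2, -0.1]

unbiased = td_target(1.0, true_next_q, gamma=0.9)   # uses the true values
inflated = td_target(1.0, noisy_next_q, gamma=0.9)  # max picks the luckiest noise
```

Even though the noise averages to zero across actions, the `max` selects the most optimistic estimate, so the target built from noisy values is higher than the one built from the true values.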
In on-policy Q learning, this overestimation is less problematic, because the agent can take the action associated with its noisily inaccurate estimate, and as a result get more data for that action, and get an estimate that is less noisy in future. But when we're in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information, but just risky.

The approach suggested by this DeepMind paper - Critic Regularized Regression, or CRR - is essentially a synthesis of these two possible approaches. The method learns a Q function as normal, using temporal difference methods. The distinction in this method comes from how to get a policy, given a learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does.

https://i.imgur.com/jGmhYdd.png

This basically amounts to a form of a behavioral cloning loss (with the part that maximizes the probability under your policy of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated Q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy. Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtration.
The `binary` version is an indicator function for the Advantage of an action (the Q value for that action at that state minus some reference value for the state, describing how much better the action is than other alternatives at that state) being greater than zero. Another, `exp`, uses exponential weightings which do a more "soft" upweighting or downweighting of transitions based on advantage, rather than the sharp binary of whether an action's advantage is above zero.

The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines - either more pure behavioral cloning, or more pure RL - and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, since the trade-off of fewer but higher-quality samples to learn from works there. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights). This introduces some potential mimicking of lower-quality transitions, but in exchange for a larger effective dataset size to learn from.
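
The two filtering functions and the weighted behavioral cloning loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the temperature `beta` in the `exp` variant, and the numbers are assumptions, and the log-probabilities, Q values, and per-state reference values are passed in as plain lists rather than computed by networks.

```python
import math

def binary_weight(advantage):
    """Indicator filter: keep a transition only if its advantage is positive."""
    return 1.0 if advantage > 0 else 0.0

def exp_weight(advantage, beta=1.0):
    """Soft filter: exponential up/down-weighting (beta is an assumed temperature)."""
    return math.exp(advantage / beta)

def crr_policy_loss(log_probs, q_values, baselines, weight_fn):
    """Behavioral cloning loss, with each dataset action weighted by f(advantage)."""
    total = 0.0
    for logp, q, v in zip(log_probs, q_values, baselines):
        advantage = q - v  # Q value minus a reference value for the state
        total += -weight_fn(advantage) * logp
    return total / len(log_probs)

# Two hypothetical transitions: one better than the reference, one worse.
log_probs = [-1.0, -2.0]   # log pi(a | s) for the dataset actions
q_values = [1.0, 0.0]
baselines = [0.5, 0.5]

loss_binary = crr_policy_loss(log_probs, q_values, baselines, binary_weight)
loss_exp = crr_policy_loss(log_probs, q_values, baselines, exp_weight)
```

With `binary_weight`, the second transition contributes nothing to the loss; with `exp_weight`, it still contributes, but at a lower weight than the first, which matches the "larger effective dataset" trade-off described above.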

WAIC, but Why? Generative Ensembles for Robust Anomaly Detection

Hyunsun Choi and Eric Jang and Alexander A. Alemi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/02 (2 years ago)

**Abstract:** Machine learning models encounter Out-of-Distribution (OoD) errors when the
data seen at test time are generated from a different stochastic generator than
the one used to generate the training data. One proposal to scale OoD detection
to high-dimensional data is to learn a tractable likelihood approximation of
the training distribution, and use it to reject unlikely inputs. However,
likelihood models on natural data are themselves susceptible to OoD errors, and
even assign large likelihoods to samples from other datasets. To mitigate this
problem, we propose Generative Ensembles, which robustify density-based OoD
detection by way of estimating epistemic uncertainty of the likelihood model.
We present a puzzling observation in need of an explanation -- although
likelihood measures cannot account for the typical set of a distribution, and
therefore should not be suitable on their own for OoD detection, WAIC performs
surprisingly well in practice.

### Summary

Knowing when a model is qualified to make a prediction is critical to safe deployment of ML technology. Model-independent / unsupervised Out-of-Distribution (OoD) detection is appealing mostly because it doesn't require task-specific labels to train. It is tempting to suggest a simple one-tailed test in which inputs assigned lower likelihoods by a likelihood model are treated as OoD, but the intuition that In-Distribution (ID) inputs should have the highest likelihoods _does not hold in higher dimensions_. The authors propose to use the Watanabe-Akaike Information Criterion (WAIC) to circumvent this problem and empirically show the robustness of the approach.

### Counterintuitive Properties of Likelihood Models:

https://i.imgur.com/4vo0Ff5.png

So a GLOW model with a Gaussian prior maps SVHN closer to the origin than Cifar (but never actually generates SVHN, because Gaussian samples lie on the shell). This is bad news for OoD detection.

### Proposed Methodology:

Use the WAIC criterion for OoD detection, which gives an asymptotically correct estimate of the gap between the training set and test set expectations:

https://i.imgur.com/vasSxuk.png

Basically, the correction term subtracts the variance in likelihoods across independent samples from the posterior. This acts to robustify the estimate, ensuring that points that are sensitive to the particular choice of posterior are penalized. They use an ensemble of generative models as a proxy for posterior samples, i.e., the ensemble members act as approximate posterior samples.
Now, OoD can be detected with a Likelihood Model:

https://i.imgur.com/M3CDKOA.png

### Discussion

Interestingly, GLOW maps Cifar and other datasets INSIDE the Gaussian shell (which is an annulus of radius $\sqrt{dim} = \sqrt{3072} \approx 55.4$):

https://i.imgur.com/ERdgOaz.png

This is in itself quite disturbing, as it suggests that better flow-based generative models (for sampling) can be obtained by encouraging the training distribution to overlap better with the typical set in latent space.
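
The ensemble-based score described under the methodology heading can be sketched as the mean log-likelihood across ensemble members minus the variance across them. This is a minimal illustration: the function names and the log-likelihood values are hypothetical, and the use of the population variance is an assumption.

```python
import math
import statistics

def waic_score(log_likelihoods):
    """Mean log-likelihood across ensemble members minus their variance."""
    mean = statistics.mean(log_likelihoods)
    var = statistics.pvariance(log_likelihoods)  # population variance: an assumption
    return mean - var

def shell_radius(dim):
    """Radius of the shell where samples of a standard Gaussian concentrate."""
    return math.sqrt(dim)

# Hypothetical per-model log-likelihoods for two inputs with the same mean.
agreeing = [-100.0, -100.0, -100.0]     # ensemble members agree
disagreeing = [-90.0, -100.0, -110.0]   # ensemble members disagree

score_agree = waic_score(agreeing)
score_disagree = waic_score(disagreeing)
radius = shell_radius(3072)  # 32 x 32 x 3 inputs, as in the discussion
```

Both inputs have the same average log-likelihood, but the one on which the ensemble disagrees receives a much lower score, which is the penalty on posterior-sensitive points that the summary describes.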

Meta-learners' learning dynamics are unlike learners'

Neil C. Rabinowitz

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2019/05/03 (1 year ago)

**Abstract:** Meta-learning is a tool that allows us to build sample-efficient learning
systems. Here we show that, once meta-trained, LSTM Meta-Learners aren't just
faster learners than their sample-inefficient deep learning (DL) and
reinforcement learning (RL) brethren, but that they actually pursue
fundamentally different learning trajectories. We study their learning dynamics
on three sets of structured tasks for which the corresponding learning dynamics
of DL and RL systems have been previously described: linear regression (Saxe et
al., 2013), nonlinear regression (Rahaman et al., 2018; Xu et al., 2018), and
contextual bandits (Schaul et al., 2019). In each case, while
sample-inefficient DL and RL Learners uncover the task structure in a staggered
manner, meta-trained LSTM Meta-Learners uncover almost all task structure
concurrently, congruent with the patterns expected from Bayes-optimal inference
algorithms. This has implications for research areas wherever the learning
behaviour itself is of interest, such as safety, curriculum design, and
human-in-the-loop machine learning.

Meta learning, or, the idea of training models on some distribution of tasks, with the hope that they can then learn more quickly on new tasks because they have “learned how to learn” similar tasks, has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there’s an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc), and have the outer loop that updates some structural part of your model be based on improving generalization error on each task within the distribution.

It’s been demonstrated that meta-learned systems can in fact learn more quickly (at least when their tasks are “in distribution” relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested in how much better or faster they’re learning, and more interested in whether there are qualitative differences in the way normal learning systems and meta-trained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there’s only the one on this paper) goes about this by studying simple learning tasks where it’s easier for us to introspect into what each model is learning over time.

https://i.imgur.com/ceycq46.png

In the first test, he looks at linear regression in a simple setting: for each individual “task”, data is generated according to a known true weight matrix (sampled from a prior over weight matrices), with some noise added in.
Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where larger singular values on the factors, or “modes”, indicate that those factors represent larger-scale patterns that explain more variance, and smaller singular values are smaller-scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training, and compare, to see how close the network is to having correctly captured each of these different modes. When normal learners (starting from a raw initialization) approach the task, they start by matching the large-scale (larger singular value) factors of variation, and then over the course of training improve performance on the higher-precision factors. By contrast, meta learners, in addition to learning faster, also learn large-scale and small-scale modes at the same rate.

Similar analysis was performed and similar results found for nonlinear regression, where instead of PCA-style components, the data-generating function was decomposed into different Fourier frequencies, and the normal learner learned the broad, low-frequency patterns first, whereas the meta learner learned them all at the same rate.

The paper finds intuition for this by showing that the behavior of the meta learners matches quite well against how a Bayes-optimal learner would update on new data points, in the world where that learner had a prior over the data-generating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior corresponding to the task distribution it was trained on. This is, at a high level, what I think we all sort of thought was happening with meta learning, but it’s pretty neat to see it laid out in a small enough problem where we can actually validate against an analytic model.
A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose to use the specific singular-value-focused metrics of progression on task learning that he did. They seem reasonable, but I’d have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress. (A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel like it wasn’t really that analogous to the other tests.)
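
To make the "Bayes-optimal learner with a matching prior" comparison above more concrete, here is a sketch of the standard conjugate update for linear regression, reduced to a single scalar weight. The scalar reduction, the zero-mean Gaussian prior, and all names and values are simplifying assumptions for illustration; the paper's setting uses weight matrices.

```python
def posterior_over_weight(xs, ys, prior_var=1.0, noise_var=1.0):
    """Posterior mean and variance of w for y = w * x + Gaussian noise,
    with a zero-mean Gaussian prior on w."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision

# Before seeing data, the learner's belief is just the prior.
mean0, var0 = posterior_over_weight([], [])

# After two observations consistent with w = 2, the belief moves toward 2
# but is still pulled toward the prior mean of 0.
mean2, var2 = posterior_over_weight([1.0, 1.0], [2.0, 2.0])
```

A learner whose prior matches the task distribution updates every component of its estimate as soon as data arrives, which gives some intuition for why the meta learners pick up all modes concurrently.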

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Rosanne Liu and Joel Lehman and Piero Molino and Felipe Petroski Such and Eric Frank and Alex Sergeev and Jason Yosinski

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/07/09 (2 years ago)

**Abstract:** Few ideas have enjoyed as large an impact on deep learning as convolution.
For any problem involving pixels or spatial representations, common intuition
holds that convolutional neural networks may be appropriate. In this paper we
show a striking counterexample to this intuition via the seemingly trivial
coordinate transform problem, which simply requires learning a mapping between
coordinates in (x,y) Cartesian space and one-hot pixel space. Although
convolutional networks would seem appropriate for this task, we show that they
fail spectacularly. We demonstrate and carefully analyze the failure first on a
toy problem, at which point a simple fix becomes obvious. We call this solution
CoordConv, which works by giving convolution access to its own input
coordinates through the use of extra coordinate channels. Without sacrificing
the computational and parametric efficiency of ordinary convolution, CoordConv
allows networks to learn either perfect translation invariance or varying
degrees of translation dependence, as required by the task. CoordConv solves
the coordinate transform problem with perfect generalization and 150 times
faster with 10--100 times fewer parameters than convolution. This stark
contrast raises the question: to what extent has this inability of convolution
persisted insidiously inside other tasks, subtly hampering performance from
within? A complete answer to this question will require further investigation,
but we show preliminary evidence that swapping convolution for CoordConv can
improve models on a diverse set of tasks. Using CoordConv in a GAN produced
less mode collapse as the transform between high-level spatial latents and
pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST
detection showed 24% better IOU when using CoordConv, and in the RL domain
agents playing Atari games benefit significantly from the use of CoordConv
layers.

This is a paper where I keep being torn between the response of “this is so simple it’s brilliant; why haven’t people done it before,” and “this is so simple it’s almost tautological, and the results I’m seeing aren’t actually that surprising”. The basic observation this paper makes is one made frequently before, most recently to my memory by Geoff Hinton in his Capsule Net paper: sometimes the translation invariance of convolutional networks can be a bad thing, and lead to worse performance. In a lot of ways, translation invariance is one of the benefits of using a convolutional architecture in the first place: instead of having to learn separate feature detectors for “a frog in this corner” and “a frog in that corner,” we can instead use the same feature detector, and just move it over different areas of the image. However, this paper argues, this makes convolutional networks perform worse than might naively be expected at tasks that require them to remember or act in accordance with coordinates of elements within an image. For example, they find that normal convolutional networks take nearly an hour and 200K worth of parameters to learn to “predict” the one-hot encoding of a pixel, when given the (x,y) coordinates of that pixel as input, and only get up to about 80% accuracy. Similarly, trying to take an input image with only one pixel active, and predict the (x,y) coordinates as output, is something the network is able to do successfully, but only when the test points are sampled from the same spatial region as the training points: if the test points are from a held-out quadrant, the model can’t extrapolate to the (x, y) coordinates there, and totally falls apart. 
https://i.imgur.com/x6phN4p.png

The solution proposed by the authors is a really simple one: at one or more layers within the network, in addition to the feature channels sent up from the prior layer, add two additional channels: one with deterministic values going from -1 (left) to 1 (right), and the other going from top to bottom. This essentially adds two fixed “features” to each pixel, which jointly carry information about where it is in space. Just by adding this small change, we give the network the ability to use spatial information or not, as it sees fit. If these features don’t prove useful, their weights will stay close to their zero-mean initialization values, and the behavior should be much like a normal convolutional net. However, if they prove useful, convolution filters at the next layer can take position information into account.

It’s easy to see how this would be useful for this paper’s toy problems: you can just create a feature detector for “if this pixel is active, pass forward information about its spatial position,” and predict the (x, y) coordinates out easily. You can also imagine this capability helping with more typical image classification problems, by having feature filters that carry with them not only content information, but information about where a pattern was found spatially. The authors do indeed find comparable performance or small benefits on ImageNet, MNIST, and Atari RL when applying their layers in lieu of normal convolutional layers. On GANs in particular, they find less mode collapse, though I don’t yet 100% follow the intuition of why this would be the case.

https://i.imgur.com/wu7wQZr.png
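
The coordinate channels described above can be sketched without any deep learning library by treating a feature map as a list of 2D channels. This is an illustration only: the function names are hypothetical, and the choice of -1 at the top and 1 at the bottom for the vertical channel is an assumption, since the summary only gives the direction.

```python
def coord_channels(height, width):
    """Build two fixed channels holding each pixel's x and y position in [-1, 1]."""
    def lin(i, n):
        # Evenly spaced values from -1 to 1; a single row/column gets 0.
        return 0.0 if n == 1 else -1.0 + 2.0 * i / (n - 1)
    x_chan = [[lin(j, width) for j in range(width)] for _ in range(height)]
    y_chan = [[lin(i, height) for _ in range(width)] for i in range(height)]
    return x_chan, y_chan

def add_coords(feature_map):
    """Append the coordinate channels to a feature map given as [channel][row][col]."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    x_chan, y_chan = coord_channels(height, width)
    return feature_map + [x_chan, y_chan]

# A single-channel 3x3 input with one active pixel, as in the toy problem.
one_hot = [[[0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0],
            [0.0, 0.0, 0.0]]]
augmented = add_coords(one_hot)  # now 3 channels: content, x, y
```

An ordinary convolution applied to `augmented` then sees, at every pixel, both the content value and that pixel's position, so a filter can learn to use or ignore location as the task requires.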

Do Deep Generative Models Know What They Don't Know?

Eric Nalisnick and Akihiro Matsukawa and Yee Whye Teh and Dilan Gorur and Balaji Lakshminarayanan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/22 (2 years ago)

**Abstract:** A neural network deployed in the wild may be asked to make predictions for
inputs that were drawn from a different distribution than that of the training
data. A plethora of work has demonstrated that it is easy to find or synthesize
inputs for which a neural network is highly confident yet wrong. Generative
models are widely viewed to be robust to such mistaken confidence as modeling
the density of the input features can be used to detect novel,
out-of-distribution inputs. In this paper we challenge this assumption. We find
that the density learned by flow-based models, VAEs, and PixelCNNs cannot
distinguish images of common objects such as dogs, trucks, and horses (i.e.
CIFAR-10) from those of house numbers (i.e. SVHN), assigning a higher
likelihood to the latter when the model is trained on the former. Moreover, we
find evidence of this phenomenon when pairing several popular image data sets:
FashionMNIST vs MNIST, CelebA vs SVHN, ImageNet vs CIFAR-10 / CIFAR-100 / SVHN.
To investigate this curious behavior, we focus analysis on flow-based
generative models in particular since they are trained and evaluated via the
exact marginal likelihood. We find such behavior persists even when we restrict
the flow models to constant-volume transformations. These transformations admit
some theoretical analysis, and we show that the difference in likelihoods can
be explained by the location and variances of the data and the model curvature.
Our results caution against using the density estimates from deep generative
models to identify inputs similar to the training distribution until their
behavior for out-of-distribution inputs is better understood.

CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that distributions learned by generative models are not yet robust enough to be employed in this way.

* **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results.
* **Cons (-):** It would be interesting to provide further results for different datasets / domain shifts to observe whether this property can be quantified as a characteristic of the model or of the input data.

---

## Experimental negative result

Three classes of generative models are considered in this paper:

* **Auto-regressive** models such as `PixelCNN` [1]
* **Latent variable** models, such as `VAEs` [2]
* Generative models with **invertible flows** [3], in particular `Glow` [4].

The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is showing that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although the effect is smaller in the `PixelCNN` case.
Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets differ vastly in appearance, and this result is counter-intuitive as it contradicts the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods do not necessarily coincide with better generated samples [5].

---

## Further analysis for invertible flow models

The authors further study this phenomenon in the invertible flow models case, as these models provide a more rigorous analytical framework (exact likelihood inference, unlike VAEs, which only provide a bound on the true likelihood). More specifically, invertible flow models are characterized by a ***diffeomorphism*** (a differentiable invertible function), $f(x; \phi)$, between input space $\mathcal X$ and latent space $\mathcal Z$, and a choice of the latent distribution $p(z; \psi)$. The ***change of variables formula*** links the densities of $x$ and $z$ as follows:

$$ \int_x p_x(x)\,dx = \int_x p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| dx $$

And the training objective under this transformation becomes

$$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \left[ \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| \right] $$

Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p(\mathbf z),\ x = f^{-1}(z)$. And $f_{\phi}$ is built such that computing the log determinant of the Jacobian in the previous equation can be done efficiently.

First, they observe that the contribution of the flow can be decomposed into a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula.
Experimental results with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***.

Secondly, they try to directly analyze the difference in likelihood between two domains $\mathcal X$ and $\tilde{\mathcal X}$, which can be done by a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \sim \mathbb{E}(\tilde{\mathcal X})$). For the constant volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice.

---

## References

* [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016
* [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013
* [3] Density estimation using Real NVP, Dinh et al., ICLR 2017
* [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, 2018
* [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016
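
The decomposition of a flow's log-likelihood into a density term and a volume term, discussed in the flow analysis above, can be checked by hand on the simplest possible flow. This sketch assumes a one-dimensional affine map $z = a x + b$ with a standard Gaussian latent; the function names and the values of `a`, `b`, and `x` are hypothetical and unrelated to Glow's actual layers.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of a standard Gaussian latent p_z."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z

def affine_flow_log_likelihood(x, a, b):
    """Return the (density, volume) terms of log p_x(x) for the flow z = a * x + b."""
    z = a * x + b
    density_term = standard_normal_logpdf(z)  # log p_z(f(x))
    volume_term = math.log(abs(a))            # log |df/dx|
    return density_term, volume_term

# Under this flow, x is Gaussian with mean -b/a and standard deviation 1/|a|,
# so the sum of the two terms can be compared with the closed-form log-density.
x, a, b = 0.3, 2.0, 1.0
density_term, volume_term = affine_flow_log_likelihood(x, a, b)
mu, sigma = -b / a, 1.0 / abs(a)
closed_form = -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)
```

Here the volume term is the same for every input because the map is affine; in a flow with input-dependent Jacobians it varies per input, which is where the summary locates most of the likelihood gap between SVHN and CIFAR-10.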

On the Robustness of Convolutional Neural Networks to Internal Architecture and Weight Perturbations

Cheney, Nicholas and Schrimpf, Martin and Kreiman, Gabriel

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

[link]
Cheney et al. study the robustness of deep neural networks, especially AlexNet, with regard to randomly dropping or perturbing weights. In particular, the authors consider three types of perturbations: synapse knockouts set random weights to zero, node knockouts set all weights corresponding to a set of neurons to zero, and weight perturbations add random Gaussian noise to the weights of a specific layer. These perturbations are studied on AlexNet, considering the top-5 accuracy on ImageNet; perturbations are applied per layer.

For example, Figure 1 (left) shows the influence on accuracy when knocking out synapses. As can be seen, the lower layers, especially the first convolutional layer, are impacted significantly by these perturbations. Similar observations are made for random perturbations of weights (Figure 1, right), although the impact is less significant. High-level features in particular, i.e., the higher layers, seem to be robust to these kinds of perturbations. The authors also provide evidence that these results extend to the top-1 accuracy, as well as to other architectures. For VGG, however, the impact is significantly less pronounced, which may also be due to the employed dropout layers.

https://i.imgur.com/78T6Gg2.png

Figure 1: Left: Influence of setting weights in the corresponding layers to zero. Right: Influence of randomly perturbing weights of specific layers. Experiments are on ImageNet using AlexNet.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
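
The three perturbation types described above could be sketched as follows for a single layer's weight matrix. This is a minimal illustration, not the authors' code: the function names are hypothetical, rows are assumed to correspond to output neurons, and scaling the noise by the layer's own weight standard deviation is an assumption made here for convenience.

```python
import numpy as np

def synapse_knockout(weights, fraction, rng):
    """Set a random fraction of individual weights to zero."""
    keep_mask = rng.random(weights.shape) >= fraction
    return weights * keep_mask

def node_knockout(weights, fraction, rng):
    """Zero all weights of a random subset of neurons (rows, by assumption)."""
    n_units = weights.shape[0]
    n_drop = int(round(fraction * n_units))
    dropped = rng.choice(n_units, size=n_drop, replace=False)
    perturbed = weights.copy()
    perturbed[dropped] = 0.0
    return perturbed

def weight_perturbation(weights, relative_std, rng):
    """Add zero-mean Gaussian noise; the noise std is relative to the
    layer's weight std (an assumption of this sketch)."""
    noise = rng.normal(0.0, relative_std * weights.std(), size=weights.shape)
    return weights + noise

rng = np.random.default_rng(0)
layer_weights = rng.normal(size=(8, 16))  # stand-in for one layer of a network
knocked_synapses = synapse_knockout(layer_weights, 0.3, rng)
knocked_nodes = node_knockout(layer_weights, 0.25, rng)
noisy_weights = weight_perturbation(layer_weights, 0.5, rng)
```

To reproduce the kind of per-layer curves shown in Figure 1, one would apply one of these functions to a single layer at a time, re-evaluate accuracy, and sweep the perturbation strength.
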

Neural Ordinary Differential Equations

Ricky T. Q. Chen and Yulia Rubanova and Jesse Bettencourt and David Duvenaud

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2018/06/19 (2 years ago)

**Abstract:** We introduce a new family of deep neural network models. Instead of
specifying a discrete sequence of hidden layers, we parameterize the derivative
of the hidden state using a neural network. The output of the network is
computed using a black-box differential equation solver. These continuous-depth
models have constant memory cost, adapt their evaluation strategy to each
input, and can explicitly trade numerical precision for speed. We demonstrate
these properties in continuous-depth residual networks and continuous-time
latent variable models. We also construct continuous normalizing flows, a
generative model that can train by maximum likelihood, without partitioning or
ordering the data dimensions. For training, we show how to scalably
backpropagate through any ODE solver, without access to its internal
operations. This allows end-to-end training of ODEs within larger models.

[link]
Summary by senior author [duvenaud on hackernews](https://news.ycombinator.com/item?id=18678078).

A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:

    h1 = f1(x)
    h2 = f2(h1)
    h3 = f3(h2)
    h4 = f4(h3)
    y = f5(h4)

They now build them like this:

    h1 = f1(x) + x
    h2 = f2(h1) + h1
    h3 = f3(h2) + h2
    h4 = f4(h3) + h3
    y = f5(h4) + h4

Where f1, f2, etc. are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.

In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.

We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.

Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
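
The connection between residual updates and Euler's method can be made concrete with a small sketch. Note the assumptions: this uses a fixed-step Euler integrator written by hand, whereas the paper relies on adaptive black-box solvers; the function names and the toy dynamics are hypothetical.

```python
import numpy as np

def residual_forward(h, layers):
    """Residual net: each layer adds a correction to its input."""
    for f in layers:
        h = h + f(h)
    return h

def euler_odeint(dynamics, h0, t0=0.0, t1=1.0, n_steps=100):
    """Fixed-step Euler integration of dh/dt = dynamics(h, t) from t0 to t1."""
    h = np.asarray(h0, dtype=float)
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        h = h + dt * dynamics(h, t)  # same shape as a residual update
        t += dt
    return h

def decay(h, t):
    """Toy dynamics dh/dt = -h, whose exact solution is h0 * exp(-t)."""
    return -h

# Two Euler steps of size 0.5 match two residual layers computing -0.5 * h.
two_layer_resnet = residual_forward(np.array([1.0]), [lambda h: -0.5 * h] * 2)
two_step_euler = euler_odeint(decay, [1.0], n_steps=2)

# More steps ("more layers") approach the continuous-depth limit exp(-1).
fine_euler = euler_odeint(decay, [1.0], n_steps=1000)
```

In an ODE net, `decay` would be replaced by a neural network with learned parameters, and the hand-written loop by an adaptive solver that picks its own step sizes.
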

Fast R-CNN

Girshick, Ross B.

International Conference on Computer Vision - 2015 via Local Bibsonomy

Keywords: dblp

[link]
This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}.

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer, which pools each region of interest to a fixed size so that image regions don't have to be rescaled before being sent as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each incoming gradient to the input element that was the argmax of its sub-window, summing the contributions where RoIs overlap.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}.
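
The quoted description of RoI max pooling can be sketched in a few lines. This is a simplified illustration under stated assumptions: a single-channel feature map, an RoI given as end-exclusive integer coordinates `(y0, x0, y1, x1)`, and floor/ceil rounding of the sub-window edges (so neighbouring sub-windows may overlap by one row or column). The function name is hypothetical and this is not the reference implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size):
    """Max-pool one RoI of a 2-D feature map into a fixed (H, W) grid.

    feature_map: array of shape (height, width)
    roi: (y0, x0, y1, x1), end-exclusive, at least one pixel in each dimension
    output_size: (H, W) of the pooled output
    """
    y0, x0, y1, x1 = roi
    out_h, out_w = output_size
    window = feature_map[y0:y1, x0:x1]
    h, w = window.shape
    # Edges of the H x W grid of sub-windows, each of approximate size h/H x w/W.
    y_edges = np.linspace(0, h, out_h + 1)
    x_edges = np.linspace(0, w, out_w + 1)
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ys = int(np.floor(y_edges[i]))
        ye = max(int(np.ceil(y_edges[i + 1])), ys + 1)  # never an empty bin
        for j in range(out_w):
            xs = int(np.floor(x_edges[j]))
            xe = max(int(np.ceil(x_edges[j + 1])), xs + 1)
            pooled[i, j] = window[ys:ye, xs:xe].max()
    return pooled

feature_map = np.arange(16).reshape(4, 4)
full_roi = roi_max_pool(feature_map, (0, 0, 4, 4), (2, 2))
small_roi = roi_max_pool(feature_map, (1, 0, 4, 3), (2, 2))
```

Both calls return a 2 x 2 output even though the RoIs have different sizes, which is the property that lets differently sized regions feed the same fully connected layers.
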

Algorithms for Non-negative Matrix Factorization

Lee, Daniel D. and Seung, H. Sebastian

Neural Information Processing Systems Conference - 2000 via Local Bibsonomy

Keywords: dblp

[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (as rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using a truncated SVD that keeps only the largest singular value:

$$ W = \left[\begin{array}{c} -0.656 \\\\ -0.652 \\\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.65 & -6.26 & -3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$ W = \left[\begin{array}{c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\\ 4.33 & 4.08 & 2.09 \\\\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that the error is not increased. The multiplicative approach is discussed in contrast to an additive gradient-descent-based approach where small corrections are iteratively applied. The multiplicative rule can be obtained from the additive one by choosing a per-element learning rate: for $H$, setting $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$ turns the additive gradient step into a rescaling of $H_{a\mu}$ (and similarly for $W$).

### Still a draft
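
As a rough illustration, the multiplicative update rule for the squared-error objective can be sketched as below and run on the example matrix $V$ from this summary. The function name, the iteration count, the random seed, and the small `eps` added to the denominators (to avoid division by zero) are all choices made for this sketch rather than details taken from the paper.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative V (m x n) into W (m x r) and H (r x n).

    Uses multiplicative updates for the Frobenius-norm objective ||V - WH||.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps  # random non-negative initialization
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Each element is rescaled by a non-negative ratio, so W and H
        # stay non-negative without any explicit projection step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])
W, H, errors = nmf_multiplicative(V, r=1)
reconstruction = W @ H
```

The individual values of `W` and `H` depend on the initialization, since any rescaling of `W` by a positive constant can be absorbed into `H`; only the product `W @ H` is comparable with the reconstruction shown above.
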
