[link]
Offline reinforcement learning is a potentially high-value capability for the machine learning community to learn to do well, because there are many applications where it would be useful to learn a policy for responding to a dynamic environment, but where it would be too unsafe or expensive to learn in an on-policy or online way, in which we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data - collected from a human demonstrator, or from some other algorithm - and learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions.

There are two broad strategies for learning a policy from pre-collected transitions. One is to simply learn to mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and it has the advantage of being somewhat more conservative (it doesn't experiment with possibly-unsafe-or-low-reward actions the demonstrator never took), but this is also a disadvantage, because it's not possible to get higher reward than the demonstrator themselves got if you're simply copying their behavior. Another approach is to learn a Q function - estimating the value of a given action in a given state - using the reward data from the pre-collected transitions. This can also have downsides, mostly in the direction of overconfidence. Q-value temporal difference learning works by using the current reward, added to the max Q value over possible next actions, as the target for the current-state Q estimate. This tends to lead to overestimates, because regression-to-the-mean effects mean that the highest Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset). In on-policy Q learning, this is less problematic, because the agent can take the action associated with its noisily inaccurate estimate, get more data for that action as a result, and end up with a less noisy estimate in the future. But when we're in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information, just a risk.

The approach suggested by this DeepMind paper - Critic Regularized Regression, or CRR - is essentially a synthesis of these two approaches. The method learns a Q function as normal, using temporal difference methods. The distinction comes in how a policy is derived from the learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular state, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does.

https://i.imgur.com/jGmhYdd.png

This basically amounts to a behavioral cloning loss (the part that maximizes the probability, under your policy, of the actions sampled from the demonstrator dataset), weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated Q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy.
Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtering. The `binary` version is an indicator function for the advantage of an action (the Q value of that action at that state, minus some reference value for the state, describing how much better the action is than the other alternatives available at that state) being greater than zero. Another, `exp`, uses exponential weights that upweight or downweight transitions more softly based on advantage, rather than the sharp binary cutoff of whether an action's advantage is above zero. The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines - either more pure behavioral cloning, or more pure RL - and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, where the trade-off of fewer but higher-quality samples to learn from pays off. On more complex tasks, however, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights); this introduces some potential mimicking of lower-quality transitions, but in exchange for a larger effective dataset to learn from.
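As a rough illustration of this filtered-regression idea, here is a minimal sketch of a CRR-style policy loss, not the paper's implementation: the policy/critic interfaces, the sampled-baseline value estimate, and the `beta` and clipping values are all assumptions.

```python
import torch

def crr_policy_loss(policy_net, q_net, states, actions, filter_type="exp", beta=1.0, n_samples=4):
    # Behavioral-cloning term: log-probability of the demonstrator's actions under the current policy.
    dist = policy_net(states)                     # assumed to return a torch.distributions object
    log_prob = dist.log_prob(actions)

    # Advantage estimate: Q(s, a) minus a baseline built by averaging Q over actions sampled from the policy.
    q_sa = q_net(states, actions)
    with torch.no_grad():
        sampled_q = [q_net(states, dist.sample()) for _ in range(n_samples)]
        v_s = torch.stack(sampled_q, dim=0).mean(dim=0)
    advantage = q_sa - v_s

    # The "filter" f(): binary keeps only positive-advantage transitions, exp softly reweights them.
    if filter_type == "binary":
        weight = (advantage > 0).float()
    else:
        weight = torch.clamp(torch.exp(advantage / beta), max=20.0)

    # Weighted behavioral cloning: high-advantage demonstrator actions get pushed up harder.
    return -(weight.detach() * log_prob).mean()
```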
[link]
### Summary

Knowing when a model is qualified to make a prediction is critical to safe deployment of ML technology. Model-independent / unsupervised Out-of-Distribution (OoD) detection is appealing mostly because it doesn't require task-specific labels to train. It is tempting to suggest a simple one-tailed test in which lower likelihoods (assigned by a likelihood model) are flagged as OoD, but the intuition that In-Distribution (ID) inputs should have the highest likelihoods _does not hold in higher dimensions_. The authors propose to use the Watanabe-Akaike Information Criterion (WAIC) to circumvent this problem and empirically show the robustness of the approach.

### Counterintuitive Properties of Likelihood Models:

https://i.imgur.com/4vo0Ff5.png

A GLOW model with a Gaussian prior maps SVHN closer to the origin than CIFAR (but never actually generates SVHN, because Gaussian samples lie on the shell). This is bad news for OoD detection.

### Proposed Methodology:

Use the WAIC criterion for OoD detection, which gives an asymptotically correct estimate of the gap between the training set and test set expectations:

https://i.imgur.com/vasSxuk.png

Basically, the correction term subtracts the variance in likelihoods across independent samples from the posterior. This robustifies the estimate, ensuring that points which are sensitive to the particular choice of posterior are penalized. They use an ensemble of generative models as a proxy for posterior samples, i.e., the ensemble acts as approximate posterior samples. Now OoD inputs can be detected with a likelihood model:

https://i.imgur.com/M3CDKOA.png

### Discussion

Interestingly, GLOW maps CIFAR and other datasets INSIDE the Gaussian shell (which is an annulus of radius $\sqrt{\text{dim}} = \sqrt{3072} \approx 55.4$).

https://i.imgur.com/ERdgOaz.png

This is in itself quite disturbing, as it suggests that better flow-based generative models (for sampling) can be obtained by encouraging the training distribution to overlap better with the typical set in latent space.
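A minimal sketch of the ensemble-based WAIC score described above, assuming each ensemble member exposes a `log_prob(x)` method; the function name and the threshold handling are illustrative, not from the paper.

```python
import numpy as np

def waic_score(models, x):
    """WAIC-style OoD score: mean log-likelihood across the ensemble minus its variance.

    `models` is a list of independently trained generative models acting as approximate
    posterior samples; lower scores indicate more OoD-like inputs.
    """
    log_probs = np.stack([m.log_prob(x) for m in models], axis=0)  # shape: (n_models, batch)
    return log_probs.mean(axis=0) - log_probs.var(axis=0)

# Usage: flag inputs whose score falls below a threshold chosen on held-out in-distribution data.
# is_ood = waic_score(ensemble, x_batch) < threshold
```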
[link]
Meta learning - the idea of training models on some distribution of tasks, with the hope that they can then learn more quickly on new tasks because they have "learned how to learn" similar tasks - has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there's an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc.), and have the outer loop that updates some structural part of your model be based on improving generalization error on each task within the distribution. It's been demonstrated that meta-learned systems can in fact learn more quickly (at least when their tasks are "in distribution" relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested in how much better or faster they're learning, and more interested in whether there are qualitative differences in the way normal learning systems and meta-trained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there's only the one on this paper) goes about this by studying simple learning tasks where it's easier for us to introspect into what each model is learning over time.

https://i.imgur.com/ceycq46.png

In the first test, he looks at linear regression in a simple setting: for each individual "task", data is generated according to a known true weight matrix (sampled from a prior over weight matrices), with some noise added in. Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where factors, or "modes", with higher eigenvalues represent larger-scale patterns that explain more variance, and lower eigenvalues correspond to smaller-scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training, and compare, to see how close the network is to having correctly captured each of these different modes. When normal learners (starting from a raw initialization) approach the task, they start by matching the large-scale (higher-eigenvalue) factors of variation, and then over the course of training improve performance on the higher-precision factors. By contrast, meta learners, in addition to learning faster, learn the large-scale and small-scale modes at the same rate. Similar analysis was performed, and similar results found, for nonlinear regression: instead of PCA-style components, the functions generating the data were decomposed into different Fourier frequencies, and the normal learner learned the broad, low-frequency patterns first, whereas the meta learner learned them all at the same rate. The paper finds intuition for this by showing that the behavior of the meta learners matches quite well against how a Bayes-optimal learner would update on new data points, in the world where that learner had a prior over the data-generating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior corresponding to the task distribution it was trained on.
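As a rough sketch of the kind of mode-level analysis described above (my own reconstruction, not the author's code; the exact per-mode quantity he tracks is an assumption):

```python
import numpy as np

def mode_recovery(true_W, learned_W):
    """Project the learned weights onto the singular modes of the true weight matrix.

    Returns, for each mode, how much of that mode's strength the learned weights
    have captured (1.0 = fully recovered), so recovery can be tracked over training.
    """
    U, s, Vt = np.linalg.svd(true_W, full_matrices=False)
    # Coefficient of the learned weights along each true mode u_i v_i^T,
    # normalized by the true singular value of that mode.
    coeffs = np.array([U[:, i] @ learned_W @ Vt[i, :] for i in range(len(s))])
    return coeffs / s
```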
This is, at a high level, what I think we all sort of thought was happening with meta learning, but it's pretty neat to see it laid out in a problem small enough that we can actually validate against an analytic model. A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose the specific eigenvalue-focused metrics of progression on task learning that he did. They seem reasonable, but I'd have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress. (A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel like it wasn't really that analogous to the other tests.)
[link]
Liang et al. propose a perturbation-based approach for detecting out-of-distribution examples using a network's confidence predictions. In particular, the approach is based on the observation that neural networks make more confident predictions on images from the original data distribution (in-distribution examples) than on examples taken from a different distribution, i.e., a different dataset (out-of-distribution examples). This effect can further be amplified by using a temperature-scaled softmax, i.e., $S_i(x, T) = \frac{\exp(f_i(x)/T)}{\sum_{j = 1}^N \exp(f_j(x)/T)}$ where $f_i(x)$ are the predicted logits and $T$ is a temperature parameter. Based on these softmax scores, perturbations $\tilde{x}$ are computed using $\tilde{x} = x - \epsilon \text{sign}(-\nabla_x \log S_{\hat{y}}(x;T))$ where $\hat{y}$ is the predicted label of $x$. This is similar to "one-step" adversarial examples; however, instead of minimizing the confidence of the true label, the confidence in the predicted label is maximized. Applied to in-distribution and out-of-distribution examples, this is illustrated in Figure 1 and is meant to emphasize the difference in confidence. Afterwards, in- and out-of-distribution examples can be distinguished by simple thresholding on the predicted confidence, as shown in various experiments, e.g., on Cifar10 and Cifar100.

https://i.imgur.com/OjDVZ0B.png

Figure 1: Illustration of the proposed perturbation to amplify the difference in confidence between in- and out-distribution examples.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
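A minimal PyTorch-style sketch of the two ingredients described above, temperature scaling and the input perturbation; the default hyperparameter values and the helper name `odin_score` are assumptions, and the paper tunes $T$ and $\epsilon$ per setting.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    # Perturb the input so that the softmax score of the *predicted* class increases.
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)
    pred = logits.argmax(dim=1)
    log_softmax = F.log_softmax(logits / temperature, dim=1)
    loss = -log_softmax[torch.arange(x.size(0)), pred].sum()
    loss.backward()
    # x - eps * sign(-grad log S_yhat), matching the formula in the summary.
    x_perturbed = x - epsilon * x.grad.sign()

    # Score = maximum temperature-scaled softmax probability on the perturbed input.
    with torch.no_grad():
        probs = F.softmax(model(x_perturbed) / temperature, dim=1)
    return probs.max(dim=1).values  # threshold this to separate in- from out-of-distribution
```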
[link]
This is a paper where I keep being torn between the response of "this is so simple it's brilliant; why haven't people done it before," and "this is so simple it's almost tautological, and the results I'm seeing aren't actually that surprising". The basic observation this paper makes is one made frequently before, most recently to my memory by Geoff Hinton in his Capsule Net paper: sometimes the translation invariance of convolutional networks can be a bad thing, and lead to worse performance. In a lot of ways, translation invariance is one of the benefits of using a convolutional architecture in the first place: instead of having to learn separate feature detectors for "a frog in this corner" and "a frog in that corner," we can instead use the same feature detector, and just move it over different areas of the image. However, this paper argues, this makes convolutional networks perform worse than might naively be expected at tasks that require them to remember or act in accordance with coordinates of elements within an image. For example, they find that normal convolutional networks take nearly an hour and 200K parameters to learn to "predict" the one-hot encoding of a pixel, when given the (x, y) coordinates of that pixel as input, and even then only get up to about 80% accuracy. Similarly, trying to take an input image with only one pixel active, and predict the (x, y) coordinates as output, is something the network can do successfully, but only when the test points are sampled from the same spatial region as the training points: if the test points are from a held-out quadrant, the model can't extrapolate to the (x, y) coordinates there, and totally falls apart.

https://i.imgur.com/x6phN4p.png

The solution proposed by the authors is a really simple one: at one or more layers within the network, in addition to the feature channels sent up from the prior layer, add two additional channels: one with deterministic values going from -1 (left) to 1 (right), and the other going top to bottom. This essentially adds two fixed "features" to each pixel, which jointly carry information about where it is in space. Just by adding this small change, we give the network the ability to use spatial information or not, as it sees fit. If these features don't prove useful, their weights will stay around their initialization values of expectation zero, and the behavior should be much like a normal convolutional net. However, if they prove useful, convolution filters at the next layer can take position information into account. It's easy to see how this would be useful for this paper's toy problems: you can just create a feature detector for "if this pixel is active, pass forward information about its spatial position," and predict the (x, y) coordinates out easily. You can also imagine this capability helping with more typical image classification problems, by having feature filters that carry with them not only content information, but information about where a pattern was found spatially. The authors do indeed find comparable performance or small benefits on ImageNet, MNIST, and Atari RL when applying their layers in lieu of normal convolutional layers. On GANs in particular, they find less mode collapse, though I don't yet 100% follow the intuition of why this would be the case.

https://i.imgur.com/wu7wQZr.png
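A minimal sketch of the coordinate-channel idea in PyTorch; this illustrates the mechanism rather than reproducing the authors' reference code, and the class names are my own.

```python
import torch
import torch.nn as nn

class AddCoords(nn.Module):
    """Appends two channels holding x- and y-coordinates, each scaled to [-1, 1]."""
    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return torch.cat([x, xs, ys], dim=1)

class CoordConv(nn.Module):
    """A standard conv layer applied after concatenating the two coordinate channels."""
    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        self.add_coords = AddCoords()
        self.conv = nn.Conv2d(in_channels + 2, out_channels, **kwargs)

    def forward(self, x):
        return self.conv(self.add_coords(x))
```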
[link]
CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that distributions learned by generative models are not yet robust enough to be employed in this way.

* **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results.
* **Cons (-):** it would be interesting to provide further results for different datasets / domain shifts, to observe whether this property can be quantified as a characteristic of the model or of the input data.

---

## Experimental negative result

Three classes of generative models are considered in this paper:

* **Auto-regressive** models such as `PixelCNN` [1]
* **Latent variable** models, such as `VAEs` [2]
* Generative models with **invertible flows** [3], in particular `Glow` [4].

The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is showing that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although the effect is weaker in the `PixelCNN` case.

Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets differ vastly in appearance, and this result is counter-intuitive as it goes against the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general point that higher likelihoods do not necessarily coincide with better generated samples [5].

---

## Further analysis for invertible flow models

The authors further study this phenomenon in the invertible flow model case, as it provides a more rigorous analytical framework (exact likelihood inference, unlike VAEs, which only provide a bound on the true likelihood). More specifically, invertible flow models are characterized by a ***diffeomorphism*** (invertible function) $f(x; \phi)$ between input space $\mathcal X$ and latent space $\mathcal Z$, and a choice of latent distribution $p(z; \psi)$. The ***change of variables formula*** links the densities of $x$ and $z$ as follows:

$$ \int p_x(x)\, dx = \int p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| dx $$

And the training objective under this transformation becomes

$$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| $$

Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p(\mathbf z),\ x = f^{-1}(z)$. And $f_{\phi}$ is built such that the log-determinant of the Jacobian in the previous equation can be computed efficiently.
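As a rough sketch of how the flow likelihood above is evaluated in practice, assuming a flow object that exposes a `forward(x) -> (z, log_det_jacobian)` interface (the interface and names are illustrative):

```python
import math
import torch

def flow_log_likelihood(flow, x):
    """Log-likelihood under a flow: log p_z(f(x)) plus the log |det Jacobian| term."""
    z, log_det_jacobian = flow.forward(x)   # assumed interface: latent z and log |det df/dx| per example
    # Standard Gaussian prior on z (the "density" term).
    log_pz = (-0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)).sum(dim=1)
    # Total log-likelihood = density term + volume term, mirroring the decomposition discussed below.
    return log_pz + log_det_jacobian
```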
First, they observe that the contribution of the flow can be decomposed into a ***density*** term (left) and a ***volume*** term (right), following the change of variables formula. Experimental results with Glow [4] show that the higher density on SVHN mostly comes from the ***volume term contribution***.

Secondly, they try to directly analyze the difference in likelihood between the two domains $\mathcal X$ and $\tilde{\mathcal X}$, which can be done via a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \sim \mathbb{E}(\tilde{\mathcal X})$). For the constant-volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice.

---

## References

* [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016
* [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013
* [3] Density Estimation using Real NVP, Dinh et al., ICLR 2017
* [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, 2018
* [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016
[link]
Cheney et al. study the robustness of deep neural networks, especially AlexNet, with regard to randomly dropping or perturbing weights. In particular, the authors consider three types of perturbations: synapse knockouts set random weights to zero, node knockouts set all weights corresponding to a set of neurons to zero, and weight perturbations add random Gaussian noise to the weights of a specific layer. These perturbations are studied on AlexNet, considering the top-5 accuracy on ImageNet; perturbations are applied per layer. For example, Figure 1 (left) shows the influence on accuracy when knocking out synapses. As can be seen, the lower layers, especially the first convolutional layer, are impacted significantly by these perturbations. Similar observations, Figure 1 (right), are made for random perturbations of weights, although the impact is less significant. Especially high-level features, i.e., the corresponding layers, seem to be robust to these kinds of perturbations. The authors also provide evidence that these results extend to the top-1 accuracy, as well as to other architectures. For VGG, however, the impact is significantly less pronounced, which may also be due to the employed dropout layers.

https://i.imgur.com/78T6Gg2.png

Figure 1: Left: Influence of setting weights in the corresponding layers to zero. Right: Influence of randomly perturbing weights of specific layers. Experiments are on ImageNet using AlexNet.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
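A rough numpy sketch of the three perturbation types described above, applied to a single layer's weight matrix; the parameter names and the use of a 2-D weight matrix are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def synapse_knockout(weights, fraction):
    """Set a random fraction of individual weights in a layer to zero."""
    mask = rng.random(weights.shape) >= fraction
    return weights * mask

def node_knockout(weights, fraction):
    """Zero all weights belonging to a random subset of output units (rows of the weight matrix)."""
    keep = rng.random(weights.shape[0]) >= fraction
    return weights * keep[:, None]

def weight_perturbation(weights, sigma):
    """Add Gaussian noise with standard deviation sigma to every weight in a layer."""
    return weights + rng.normal(0.0, sigma, size=weights.shape)
```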
[link]
Summary by senior author [duvenaud on hackernews](https://news.ycombinator.com/item?id=18678078).

A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:

    h1 = f1(x)
    h2 = f2(h1)
    h3 = f3(h2)
    h4 = f4(h3)
    y = f5(h4)

They now build them like this:

    h1 = f1(x) + x
    h2 = f2(h1) + h1
    h3 = f3(h2) + h2
    h4 = f4(h3) + h3
    y = f5(h4) + h4

Where f1, f2, etc. are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.

In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.

We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.

Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
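A minimal sketch of the ODE-net forward pass under these definitions, using `scipy.integrate.solve_ivp` as the off-the-shelf adaptive solver; the tiny hand-rolled dynamics function and the tolerance are illustrative, and this omits the adjoint method the paper uses for memory-efficient training.

```python
import numpy as np
from scipy.integrate import solve_ivp

# A tiny "layer": the derivative of the hidden state with respect to depth t.
W1, b1 = np.random.randn(16, 4), np.zeros(16)
W2, b2 = np.random.randn(4, 16), np.zeros(4)

def dynamics(t, h):
    return W2 @ np.tanh(W1 @ h + b1) + b2

def ode_net_forward(h0, rtol=1e-3):
    # Integrate the hidden state from "depth" t=0 to t=1; the final state is the network output.
    sol = solve_ivp(dynamics, (0.0, 1.0), h0, rtol=rtol)
    return sol.y[:, -1]

output = ode_net_forward(np.random.randn(4))
```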
[link]
This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015}, called the RoI pooling layer, which rescales each region of interest to a fixed size so that images don't have to be scaled before being used as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop is performed through the RoI pooling layer by routing each incoming gradient to the input value that was the argmax of its pooling sub-window.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}
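A rough numpy sketch of the RoI max-pooling step quoted in point 2 above, for a single feature channel; the floor/ceil sub-window boundaries and the function name are illustrative, not the paper's exact implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w RoI window of a 2-D feature map into an out_h x out_w grid."""
    y0, x0, y1, x1 = roi                       # RoI window in feature-map coordinates
    window = feature_map[y0:y1, x0:x1]
    h, w = window.shape
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Sub-window of approximate size (h / out_h) x (w / out_w).
            ys, ye = int(np.floor(i * h / out_h)), int(np.ceil((i + 1) * h / out_h))
            xs, xe = int(np.floor(j * w / out_w)), int(np.ceil((j + 1) * w / out_w))
            out[i, j] = window[ys:ye, xs:xe].max()
    return out
```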
[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using an SVD and keeping only the largest singular value:

$$ W = \left[\begin{array}{c} -0.656 \\\\ -0.652 \\\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$ W = \left[\begin{array}{c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\\ 4.33 & 4.08 & 2.09 \\\\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error, then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations, such as similar customers and similar movies. `TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively creating non-negative $W$ and $H$ from random initial matrices. The paper presents a multiplicative update rule in which the elements of $W$ and $H$ are iteratively rescaled so that the error never increases. The multiplicative approach is discussed in contrast to an additive, gradient-descent-based approach where small corrections are iteratively applied. The multiplicative approach can be recovered from the additive one by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ relative to the scaling factor of $W$ on $H$.

### Still a draft
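A short numpy sketch of multiplicative updates of this kind for the squared-error objective (a minimal illustration under those assumptions; the initialization scale, iteration count, and small `eps` for numerical safety are arbitrary choices):

```python
import numpy as np

def nmf_multiplicative(V, r, n_iters=500, eps=1e-9):
    """Factor non-negative V (m x n) into W (m x r) and H (r x n) with multiplicative updates."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iters):
        # Each element is rescaled by a ratio that equals 1 at a stationary point,
        # so the squared reconstruction error is non-increasing.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Usage on the ratings example above:
V = np.array([[5, 4, 1], [4, 5, 1], [2, 1, 5]], dtype=float)
W, H = nmf_multiplicative(V, r=1)
```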