[link]
# Keypoints - Proposes the HIerarchical Reinforcement learning with Off-policy correction (**HIRO**) algorithm. - Does not require careful task-specific design. - Generic goal representation to make it broadly applicable, without any manual design of goal spaces, primitives, or controllable dimensions. - Use of off-policy experience using a novel off-policy correction. - A two-level hierarchy architecture - A higher-level controller outputs a goal for the lower-level controller every **c** time steps and collects the rewards given by the environment, being the goal the desired change in state space - The lower level controller has the goal given added to its input and acts directly in the environment, the reward received is parametrized from the current state and the goal. # Background This paper adopts a standard continuous control reinforcement learning setting, in which an agent acts on an environment that yields a next state and a reward from unknown functions. This paper utilizes the TD3 learning algorithm. ## General and Efficient Hierarchical Reinforcement Learning https://i.imgur.com/zAHoWWO.png ## Hierarchy of Two Policies The higher-level policy $\mu^{hi}$ outputs a goal $g_t$, which correspond directly to desired relative changes in state that the lower-level policy $\mu^{lo}$ attempts to reach. $\mu^{hi}$ operates at a time abstraction, updating the goal $g_t$ and collecting the environment rewards $R_t$ every $c$ environment steps, the higher-level transition $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ is stored for off-policy training. The lower-level policy $\mu^{lo}$ outputs an action to be applied directly to the environment, having as input the current environment observations $s_t$ and the goal $g_t$. The goal $g_t$ is given by $\mu^{hi}$ every $c$ environment time steps, for the steps in between, the goal $g_t$ used by $\mu^{lo}$ is given by the transition function $g_t=h(s_{t−1},g_{t−1},s_t)$, the lower-level controller reward is provided by the parametrized reward function $r_t=r(s_t,g_t,a_t,s_{t+1})$. The lower-level transition $(s_t,g_t,a_t,r_t,s_{t+1}, g_{t+1})$ is stored for off-policy training. ## Parameterized Rewards The goal $g_t$ indicates a desired relative changes in state observations, the lower-level agent task is to take actions from state $s_t$ that yield it an observation $s_{t+c}$ that is close to $s_t+g_t$. To maintain the same absolute position of the goal regardless of state change, the goal transition model, used between $\mu^{hi}$ updates every $c$ steps, is defined as: $h(s_t,g_t,s_{t+1}) =s_t+g_t−s_{t+1}$ And the reward given to the lower-level controller is defined as to reinforce reaching a state closer to the goal $g_t$, this paper parametrizes it by the function: $r(s_t,g_t,a_t,s_{t+1}) =−||s_t+g_t−s_{t+1}||_2$. ## Off-Policy Corrections for Higher-Level Training The higher-level transitions stored $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ have to be converted to state-action-reward transitions $(s_t,g_t,∑R_{t:t+c−1},s_{t+c})$ as they can be used in standard off-policy RL algorithms, however, since the lower-level controller is evolving, these past transitions do not accurately represent the actions tha would be taken by the current lower-level policy and must be corrected. This paper correction technique used is to change the goal $g_t$ of past transitions using an out of date lower-level controller to a relabeled goal $g ̃_t$ which is likely to induce the same lower-level behavior with the updated $\mu^{lo}$. In other words, we want to find a goal $g ̃_t$ which maximizes the probability $μ_{lo}(a_{t:t+c−1}|s_{t:t+c−1},g ̃_{t:t+c−1})$, in which the $\mu^{lo}$ is the current policy and the actions $a_{t:t+c−1}$ and states $s_{t:t+c−1}$ are from the stored high level transition. To approximately maximize this quantity in practice, the authors calculated the probability for 10 candidates $g ̃_t$, eight candidate goals sampled randomly from a Gaussian centered at $s_{t+c}−s_t$, the original goal $g_t$ and a goal corresponding to the difference $s_{t+c}−s_t$. # Experiments https://i.imgur.com/iko9nCd.png https://i.imgur.com/kGx8fZv.png The authors compared the $HIRO$ method to prior method in 4 different environments: - Ant Gather; - Ant Maze; - Ant Push; - Ant Fall. They also performed an ablative analysis with the following variants: - With lower-level re-labeling; - With pre-training; - No off-policy correction; - No HRL. # Closing Points - The method proposed is interesting in the hierarchical reinforcement learning setting for not needing a specific design, the generic goal representation enables applicability without the need of designing a goal space manually; - The off-policy correction method enables this algorithm to be sample efficient; - The hierarchical structure with intermediate goals on state-space enables to better visualize the agent goals; - The paper Appendix elaborates on possible alternative off-policy corrections. |
[link]
## General Framework The take-home message is that the challenge of Reinforcement Learning for environments with high-dimensional and partial observations is learning a good representation of the environment. This means learning a sensory features extractor V to deal with the highly dimensional observation (pixels for example). But also learning a temporal representation M of the environment dynamics to deal with the partial observability. If provided with such representations, learning a controller so as to maximize a reward is really easy (single linear layer evolved with CMA-ES). Authors call these representations a *World model* since they can use the learned environment's dynamics to simulate roll-outs. They show that policies trained inside the world model transfer well back to the real environment provided that measures are taken to prevent the policy from exploiting the world model's inaccuracies. ## Method **Learning the World Model** ![](https://i.imgur.com/tgV17k4.png =600x) In this work they propose to learn these representations off-line in an unsupervized manner in order to be more efficient. They use a VAE for V that they train exclusively with the reconstruction loss, that way the learned representations are independent of the reward and can be used alongside any reward. They then train M as Mixture-Density-Network-RNN to predict the next sensory features (as extracted by the VAE) --and possibly the done condition and the reward-- and thus learn the dynamics of the environment in the VAE's latent space (which is likely simpler there than in the pixel space). Note that the VAE's latent space is a single Gaussian (adding stochasticity makes it more robust to the "next state" outputs of M), whereas M outputs next states in a mixture of Gaussians. Indeed, an image is likely to have one visual encoding, yet it can have multiple and different future scenarii which are captured by the multimodal output of M. **Training the policy** ![](https://i.imgur.com/H5vpb2H.png) * In the real env: The agent is provided with the visual features and M's hidden state (temporal features). * In the world model: To avoid that the agent exploits this imperfect simulator they increase its dynamics' stochasticity by playing with $\tau$ the sampling temperature of $z_{t+1}$ in M. ## Limitations If exploration is important in the environment the initial random policy might fail to collect data in all the relevant part of the environment and an iterative version of Algorithm 1 might be required (see https://worldmodels.github.io/ for a discussion on the different iterative methods) for the data collection. By training V independently of M it might fail to encode all the information relevant to the task. Another option would be to train V and M concurrently so that the reward and $z_{t+1}$'s prediction loss (or next state reconstruction loss) of M flows through V (that would also be trained with its own reconstruction loss). The trade-off is that now V is tuned to a particular reward and cannot be reused. The authors argue that since $h_t$ is such that it can predict $z_{t+1}$, it contains enough insight about the future for the agent not needing to *plan ahead* and just doing reflexive actions based on $h_t$. This is interesting but the considered tasks (driving, dodging fireball) are still very reflexive and do not require much planning. ## Results When trained on the true env, a simple controller with the V and M representations achieve SOTA on car-racing. V + M is better than V alone. When trained inside the world model, its dynamics' stochasticity must be tuned in order for the policy to transfer well and perform well on the real env: too little stochasticity and the agent overfits to the world model flaws and does not transfer to the real env, too much and the agent becomes risk-averse and robust but suboptimal. ![](https://i.imgur.com/e8ETSjQ.png) ## Additional ressources Thorough interactive blog post with additional experiments and discussions: https://worldmodels.github.io/ |
[link]
Bafna et al. show that iterative hard thresholding results in $L_0$ robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse, and $e$ is assumed to be sparse. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically the top-$k$ coordinates for a $k$-sparse signal, which can subsequently be fed to a neural network classifier. In experiments, the classifier is always trained on sparse signals, and at test time, the sparse signal is reconstructed prior to the forward pass. This way, on MNIST and Fashion-MNIST, the algorithm is able to recover large parts of the original accuracy. https://i.imgur.com/yClXLoo.jpg Algorithm 1 (see paper for details): The iterative hard thresholding algorithm resulting in provable robustness against $L_0$ attack on images and other signals. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Zhang and Sabuncu propose a generalized cross entropy loss for robust learning on noisy labels. The approach is based on the work by Gosh et al. [1] showing that the mean absolute error can be robust to label noise. Specifically, they show that a symmetric loss, under specific assumptions on the label noise, is robust. Here, symmetry corresponds to $\sum_{j=1}^c \mathcal{L}(f(x), j) = C$ for all $x$ and $f$ where $c$ is the number of classes and $C$ some constant. The cross entropy loss is not symmetric, while the mean absolute error is. The mean absolute error however, usually results in slower learning and may reach lower accuracy. As alternative, the authors propose $\mathcal{L}(f(x), e_j) = \frac{(1 – f_j(x)^q)}{q}$. Here, $f$ is the classifier which is assumed to contain a softmax layer at the end. For $q \rightarrow 0$ this reduces to the cross entropy and for $q = 1$ it reduces to the mean absolute error. As shown in Figure 1, this loss (or a slightly adapted version, see paper, respectively) may obtain better performance on noisy labels. To this end, the label noise is assumed to be uniform, meaning that $p(\tilde{y} = k|y = j, x)= 1 - \eta$ where $\tilde{y}$ is the perturbed label. https://i.imgur.com/HRQ84Zv.jpg Figure 1: Performance of the proposed loss for different $q$ and noise rate $\eta$ on Cifar-10. A ResNet-34 is used. [1] Aritra Gosh, Himanshu Kumar, PS Sastry. Robust loss functions under label noise for deep neural networks. AAAI, 2017. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Garg et al. propose adversarially robust features based on a graph interpretation of the training data. In this graph, training points are connected based on their distance in input space. Robust features are obtained using the eigenvectors of the Laplacian of the graph. It is theoretically shown that these features are robust, based on some assumptions on the graph. For example, the bound obtained on robustness depends on the gap between second and third eigenvalue. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Littwin and Wolf propose a activation variance regularizer that is shown to have a similar, even better, effect than batch normalization. The proposed regularizer is based on an analysis of the variance of activation values; the idea is that the measured variance of these variances is low if the activation values come from a distribution with few modes. Thus, the intention of the regularizer is to encourage distributions of activations with only few modes. This is achieved using the regularizers $\mathbb{E}[(1 - \frac{\sigma_s^2}{\sigma^2})^2]$ where $\sigma_s^2$ is the measured variance of activation values and $\sigma^2$ is the true variance of activation values. The estimate $\sigma^2_s$ is mostly influenced by the mini-batch used for training. In practice, the regularizer is replaced by $(1 - \frac{\sigma_{s_1}^2}{\sigma_{s_2}^2 + \beta})^2$ which can be estimated on two different batches, $s_1$ and $s_2$, during training and $\beta$ is a parameter that can be learned and mainly handles the case where the variance is close to zero. In the paper, the authors provide some theoretical bounds and also make a connection to batch normalization and in which cases and why the regularizer might be a better alternative. These claims are supported by experiments on Cifar and Tiny ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
A common problem in phylogenetics is: 1. I have $p(\text{DNA sequences} | \text{tree})$ and $p(\text{tree})$. 2. I've used these to run an MCMC algorithm and generate many (approximate) samples from $p(\text{tree} | \text{DNA sequences})$. 3. I want to evaluate $p(\text{tree} | \text{DNA sequences})$. The first solution you might think of is to add up how many times you saw each *tree topology* and divide by the total number of MCMC samples; referred to in this paper as *simple sample relative frequencies* (SRF). An obvious problem with this method is that if you didn't happen to sample a tree topology you will assign it zero probability. This paper focuses on producing a better solution to this problem, by defining a distribution over trees that's easy to fit and provides support over the entire space of tree topologies. What is a Subsplit Bayesian Network? ============================== Phylogenies are leaf-labeled bifurcating trees, binary trees with labeled leaves. **Clades** are nonempty subsets of the leaf labels; labeled in the following figure as $C_1 \to C_7$: https://i.imgur.com/khS3uSo.png To build a phylogeny, I can just take the full set of leaves and recursively split them into clades as described in the above diagram. This means that the distribution over phylogenies is equivalent to the distribution over clades. To make this distribution tractable, it is typically assumed that clades are conditionally independent given parent clades; called the *Conditional Clade Distribution* (CCD) assumption, attributed here to [Larget][] and [Hohna][]. So, a phylogeny may be described by its clades, this paper proposes that these clades may be described by their **subsplits**; the splitting process placing leaf nodes into one of two clades. Then, the authors note this process is a directed Bayesian network and that this Bayesian network must include all possible clade splits. It is therefore a complete binary tree with depth $N-1$ (where $N$ is the number of leaf nodes). Fitting This Parameterisation to MCMC Samples ===================================== Under this parameterisation the likelihood of observing a given tree is: $$ p(\text{tree}) = p(S_1 = s_{1,k}) \prod_{i>1} p(S_i = s_{i,k} | S_{\pi_i} = s_{\pi_i, k}), $$ assuming a collection of subsplits $T_k = \{ S_i = s_{i,k}, i \geq 1 \}$. In this definition $S_{\pi_i}$ is index set of parent nodes of $S_i$; ie a subsplit can depend on any of the parent nodes. The maximum likelihood probabilities for possible subsplits can be solved in closed form; algorithmically involving counting the occurrences of subsplits and dividing by the number of trees observed. If this seems exactly the same as SRF, that's because it is; I [checked the published code to verify this][sbn_ds1]. The authors then consider the case of unrooted trees, where the log likelihood of observed trees can't be easily factorised. They then present a [simple averaging][sa] (not sure where this variational method is discussed in that paper, appears to be under a different name?) and an EM algorithm to fit SBN models over such distributions. They also discuss conditional probability sharing (parameterising conditional on parent-child relationships) immediately before this and it's not clear if this is used in the distribution fit by SA or EM. Experiments show the distributions fit using the EM algorithm perform well according to KL divergence from a "true" distribution defined by fitting a distribution using SRF to a larger dataset. They also show less dispersion estimating the probability of sampled trees versus that estimated by this "ground truth". [sa]: https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf [sbn_ds1]: https://github.com/morrislab/sbn/blob/master/experiments/DS1.ipynb [larget]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676676/ [hohna]: https://academic.oup.com/sysbio/article/61/1/1/1676649 |
[link]
The paper designs some basic tests to compare saliency methods. It founds that some of the most popular methods are independent of model parameters and the data, meaning they are effectively useless. ## Methods compared The paper compare the following methods: gradient explanation, gradient x input, integrated gradients, guided backprop, guided GradCam and SmoothGrad. They provide a refresher on those methods in the appendix. All those methods can be put in the same framework. They require a classification model and an input (typically an image). The output of the method is an *explanation map* of the shape of the input where a higher value for a feature implies greater relevance in the decision of the model. ## Metrics of comparison The authors argue that visual inspection of the saliency maps can be misleading. They propose to compute the Spearman rank correlation, the structural similarity index (SSMI) and the Pearson correlation of the histogram of gradients. The authors point out that those metrics capture various notions of similarity, but it is an active area of research and those metrics are imperfect. ## First test: model parameters randomization A saliency method must be dependent of model parameters, otherwise it cannot help us understand a model. In this test, the authors randomize the model parameters, layer per layer, starting from the top. Surprisingly, methods such as guided backprop and guided gradcam are completely insensitive to model parameters, as illustrated on this Inception v3 trained on ImageNet: ![image](https://user-images.githubusercontent.com/8659132/61403152-b10b8000-a8a2-11e9-9f6a-cf1ed6a876cc.png) Integrated gradients looks also dubious as the bird is still visible with a mostly fully randomized model, but the quantitative metrics reveal the difference is actually big between the two models. ## Second test: data randomization It is well-known that randomly shuffling the labels of a dataset does not prevent a neural network from getting a high accuracy on the training set, though it does prevent generalization. The model is able to learn by either memorizing the data or finding spurious patterns. As a result, saliency maps obtained from such a network should have no clearly interpretable signal. Here is the result for a ConvNet trained on MNIST and a shuffled MNIST: ![image](https://user-images.githubusercontent.com/8659132/61406757-7efe1c00-a8aa-11e9-9826-a859a373cb4f.png) The results are very damning for most methods. Only gradients and GradCam are very different between both models, as confirmed by the low correlation. ## Discussion - Even though some methods do no depend on model parameters and data, they might still depend on the architecture of the models, which could be of some use in some contexts. - Methods that multiply the input with the gradient are dominated by the input. - Complex saliency methods are just fancy edge detectors. - Only gradient, smooth gradient and GradCam survives the sanity checks. # Comments - Why is their GradCam maps so ugly? They don't look like usual GradCam maps at all. - Their tests are simple enough that it's hard to defend a method that doesn't pass them. - The methods that are left are not very good either. They give fuzzy maps that are difficult to interpret. - In the case of integrated gradients (IG), I'm not convinced this is sufficient to discard the method. IG requires a "baseline input" that represents the absence of features. In the case of images, people usually just set the image to 0, which is not at all the absence of a feature. The authors also use the "set the image to 0" strategy, and I'd say their tests are damning for this strategy, not for IG in general. I'd expect an estimation of the baseline such as done in [this paper](https://arxiv.org/abs/1702.04595) would be a fairer evaluation of IG. Code: [GitHub](https://github.com/adebayoj/sanity_checks_saliency) (not available as of 17/07/19) |
[link]
Zhang et al. propose CROWN, a method for certifying adversarial robustness based on bounding activations functions using linear functions. Informally, the main result can be stated as follows: if the activation functions used in a deep neural network can be bounded above and below by linear functions (the activation function may also be segmented first), the network output can also be bounded by linear functions. These linear functions can be computed explicitly, as stated in the paper. Then, given an input example $x$ and a set of allowed perturbations, usually constrained to a $L_p$ norm, these bounds can be used to obtain a lower bound on the robustness of networks. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Tao et al. propose Attacks Meet Interpretability, an adversarial example detection scheme based on the interpretability of individual neurons. In the context of face recognition, in a first step, the authors identify neurons that correspond to specific face attributes. This is achieved by constructing sets of images were only specific attributes change, and then investigating the firing neurons. In a second step, all other neurons, i.e., neurons not corresponding to any meaningful face attribute, are weakened in order to improve robustness against adversarial examples. The idea is that adversarial examples make use of these non-interpretable neurons to fool the networks. Unfortunately, this defense has been shown not to be effective in [1]. [1] Nicholas Carlini. Is AmI (Attacks Meet Interpretability) Robust to Adversarial Examples? ArXiv.org, abs/1902.02322, 2019. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Alvarez-Melis and Jaakkola propose three requirements for self-explainable models, explicitness, faithfulness and stability, and construct a self-explainable, generalized linear model optimizing for these properties. In particular, the proposed model has the form $f(x) = \theta(x)^T h(x)$ where $\theta(x)$ are features (e.g., from a deep network) and $h(x)$ are interpretable features/concepts. In practice, these concepts are learned using an auto-encoder from the raw input while the latent code, which represents $h(x)$, is regularized to learn concept under weak supervision. Additionally, the classifier is regularized to be locally difference-bounded by the concept function $h(x)$. This means that for each point $x_0$ it holds $\|f(x) – f(x_0)\| \leq L \|h(x) – h(x_0)\|$ for all $\|x – x_0\|_\delta$ for some $\delta$ and $L$. This condition leads to some stability of interpretations with respect to the concepts $h(x)$. In practice, this is enforced through a regularizer. In experiments, the authors argue that this class of models has advantages regarding the following three properties of self-explainable models: explicitness, i.e., whether explanations are actually understandable, faithfulness, i.e. whether estimated importance of features reflects true relevance, and stability, i.e., robustness of interpretations against small perturbations. For some of these conditions, the authors propose quantitative metrics; robustness, for example, can be evaluated using $\arg\max_{\|x’ - x\|\leq\epsilon} \frac{\|f(x) – f(x’)}{\|h(x) – h(x’)\|}$ which is very similar to practically evaluating adversarial robustness. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Song et al. propose generative adversarial examples, crafted using a generative adversarial network (GAN) from scratch. In particular a GAN is trained on the original images in order to approximate the generative data distribution. Then, adversarial examples can be found in the learned latent space by finding a latent code that minimizes a loss consisting of fooling the target classifier, not fooling an auxiliary classifier (to not change the actual class) and (optionally) staying close to some fixed random latent code. These adversarial examples do not correspond ot original images anymore, instead they are unrestricted and computed from scratch. Figure 1 shows examples. https://i.imgur.com/Krr9MuU.png Figure 1: Examples of projected gradient descent (PGD, top) to find adversarial examples in the image space, and found adversarial examples in the latent space, as proposed. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
This paper out of DeepMind used a Google StreetView dataset and set out to train a network capable of navigating to a given goal destination, without knowing where it was on any birds-eye map, and with its only input being photographic viewpoint images of its current location and orientation. This was done through a framework of reinforcement learning, where the model is conditioned on a representation of its goal, and given the image features of its current view of the world, and has to take actions like “turn left,” “turn sharply left”, “go forward”, etc, in order to navigate. Rather than lat-long, goals are specified in city-specific ways, in terms of the distance between the goal position and a reference set of landmarks. I don’t entirely understand the motivation behind this approach; the authors say it’s more scalable, but it wasn’t obvious to me why that would be the case. https://i.imgur.com/V3UATsK.png - The authors construct different architectures that combine these two fundamental pieces of input data - current image and the goal you’re trying to reach - in different ways. In the simplest model, called GoalNav, there’s a single LSTM that combines the goal information with the output of a convolutional encoder processing images of your current viewpoint. - In the next most complex, CityNav, there are two LSTMs: one for processing your goal, and the other for combining the output of the goal network with your convolutional inputs, in order to decide on an action. Notionally, this separation of tasks corresponds to “figure out what absolute to go in, given your goal”, and “figure out how to go in that absolute direction from where you are now”. As a way to support training, the goal network is trained with an auxiliary loss function where it needs to predict how far its current orientation is from North. Note that this does pass some amount of information about current location into the model (since the network gets to know its actual orientation relative to true north), but this is only available during training, with the hope that the model will have gotten good enough at predicting orientation to perform well. - The final model, similar to above, is called MultiCityNav, and is explicitly designed for transfer learning. Instead of training multiple cities on a single shared network, only the convolutional encoder and policy network (the “how do I go in the absolute direction needed to reach my goal” parts) are shared between cities, and the goal processing LSTM (the “which direction should I be going in” part) is re-trained per city. This is designed to allow for transfer in the parts of learning you would expect to generalize, but allow the network to learn a city-specific approach for converting between goal specifications (in terms of city landmarks) and direction. In order to get over the fact that reward in this setting is very sparse (i.e. you only get reward when you reach the goal), the authors (1) train in a curriculum fashion, starting with tasks very nearby the model’s starting point, and gradually getting longer, and (2) add a small amount of reward shaping, where you get rewarded for moving in the direction of the goal, but only if you’re within 200m of it. This last is a bit of a concession on the realism front, and the authors say as much, but it’s just quite hard to train RL with purely dense rewards, and it makes sense that reward shaping would help here. Ultimately, they were able to get performance (in terms of goal-reaching rewards) around ¾ as strong as an Oracle model, who had access to the full map and could calculate the true shortest path. |
[link]
Schmidt et al. theoretically and experimentally show that training adversarially robust models requires a higher sample complexity compared to regular generalization. Theoretically, they analyze two very simple families of datasets, e.g., consisting of two Gaussian distributions corresponding to a two-class problem. On such datasets, they proof that “robust generalization”, i.e., generalization to adversarial examples, required much higher sample complexity compared to regular generlization, i.e., generalization to the test set. These results are interesting because they suggest that the sample complexity might be even worse for more complex and realistic data distributions – as we commonly tackle in computer vision. Experimentally, they show similar result son MNIST, CIFAR-10 and SVHN. Varying the size of the training set and plotting the accuracy on adversarially computed examples results in Figure 1. As can be seen, there seems to be a clear advantage of having larger training sets. Note that these models were trained using adversarial training using an $L_\infty$ adversary constrained by the given $\epsilon$. https://i.imgur.com/SriBAt4.png Figure 1: Training set size plotted against the adversarial test accuracy on MNIST, CIFAR-10 and SVHN. The models were trained using adversarial training. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
This is a paper where I keep being torn between the response of “this is so simple it’s brilliant; why haven’t people done it before,” and “this is so simple it’s almost tautological, and the results I’m seeing aren’t actually that surprising”. The basic observation this paper makes is one made frequently before, most recently to my memory by Geoff Hinton in his Capsule Net paper: sometimes the translation invariance of convolutional networks can be a bad thing, and lead to worse performance. In a lot of ways, translation invariance is one of the benefits of using a convolutional architecture in the first place: instead of having to learn separate feature detectors for “a frog in this corner” and “a frog in that corner,” we can instead use the same feature detector, and just move it over different areas of the image. However, this paper argues, this makes convolutional networks perform worse than might naively be expected at tasks that require them to remember or act in accordance with coordinates of elements within an image. For example, they find that normal convolutional networks take nearly an hour and 200K worth of parameters to learn to “predict” the one-hot encoding of a pixel, when given the (x,y) coordinates of that pixel as input, and only get up to about 80% accuracy. Similarly, trying to take an input image with only one pixel active, and predict the (x,y) coordinates as output, is something the network is able to do successfully, but only when the test points are sampled from the same spatial region as the training points: if the test points are from a held-out quadrant, the model can’t extrapolate to the (x, y) coordinates there, and totally falls apart. https://i.imgur.com/x6phN4p.png The solution proposed by the authors is a really simple one: at one or more layers within the network, in addition to the feature channels sent up from the prior layer, add two addition channels: one with a with deterministic values going from -1 (left) to 1 (right), and the other going top to bottom. This essentially adds two fixed “features” to each pixel, which jointly carry information about where it is in space. Just by adding this small change, we give the network the ability to use spatial information or not, as it sees fit. If these features don’t prove useful, their weights will stay around their initialization values of expectation-zero, and the behavior should be much like a normal convolutional net. However, if it proves useful, convolution filters at the next layer can take position information into account. It’s easy to see how this would be useful for this paper’s toy problems: you can just create a feature detector for “if this pixel is active, pass forward information about it’s spatial position,” and predict the (x, y) coordinates out easily. You can also imagine this capability helping with more typical image classification problems, by having feature filters that carry with them not only content information, but information about where a pattern was found spatially. The authors do indeed find comparable performance or small benefits to ImageNet, MNIST, and Atari RL, when applying their layers in lieu of normal convolutional layer. On GANs in particular, they find less mode collapse, though I don’t yet 100% follow the intuition of why this would be the case. https://i.imgur.com/wu7wQZr.png |
[link]
This paper draws from two strains of recent work: the hierarchical music modeling of MusicVAE - which intentionally model musical structure at both local and more global levels - , and the discrete autoencoder approaches of Vector Quantized VAEs - which seek to maintain the overall structure of a VAE, but apply a less aggressive form of regularization. The goal of this paper is to build a model that can generate music, not from that music’s symbolic representation - lists of notes - but from actual waveform audio. This is a more difficult task because the model now has to learn mappings between waveforms and symbolic notes, but confers the advantage of being able to model expressive dimensions of music that are difficult to capture in a pure symbolic representation. Models of pure waveform data have been used before - Wavenet is a central example - but typically they are learned alongside some kind of text conditioning structure, which is to say, you tell the model to say “Hello there, world” and the model is only responsible for building local mappings between those phonemes and waveforms, not actually modeling coherent words to follow after “Hello”. To try to address this problem, the authors of the paper propose the solution of learning an autoencoded representation over the full music sample, to try to capture global structure. Each predicted value of the global structure sequence then represents some number of timesteps of the generated sequence: say, 20. The idea here is: learn a global model that produces 1/N (1/20, in this case) fewer sequence points, whose job is ensuring long term consistency. Then, the authors also suggest the use of a lower level decoder model that uses the conditioning information from the autoencoder, and, in a similar fashion to a text to speech wavenet, captures a high fidelity mapping between that conditioning and the output waveform. This overall structure has a lot in common with the recently released MusicVAE paper. The most salient architectural change proposed by this paper is that of Argmax VAEs, rather than VQ VAEs. Overall, the reason for training discrete autoencoders is to have a more easily adjustable way of regularizing the bottlenecked representation, to avoid the fact that for some challenging problems, excessively strong VAE regularization can lead to that high level representational space just not being used. To understand the difference, it’s worth understanding that VQ VAEs work by generating a continuous encoding vector (the same as a typical VAE) but then instead of passing that continuous vector itself directly on to the decoder, the VQ VAE instead fits what is basically a K means operation: it maps the continuous vector to one of it’s “prototypical” or “codebook” vectors based on closeness in Euclidean distance (these codebook vectors are learned in a separate trading loop, in a K Means style algorithm). The Argmax VAE is similar, but instead of needing to take that alternating step of learning the codebook vectors via K Means, it performs a much simpler quantization operation: just taking the argmax of indices across the continuous vector, so that the output is the one-hot vector closest to the continuous input. While this reduces the capacity of the model, it also limits the problem of “codebook collapse”, which is a failure mode that can happen during the K Means iteration (I’m actually not entirely clear on the prototypical example of codebook collapse, or exactly why it happens). https://i.imgur.com/H5YqSZG.png Combining these ideas together: this paper’s model works by learning an Argmax VAE over a larger and courser timeframe of the model, and then learning a local, high resolution decoder - similar to Wavenet - over the smaller time scales, conditioned on the output of the Argmax VAE making high level decisions. This combination balances the needs of coherent musical structure and local fidelity, and allows for different weighing of those trade-offs in a fairly flexible way, by changing the frequency at which you produce Argmax VAE conditioning output. |
[link]
The overall goal of the paper is measure how similar different layer activation profiles are to one another, in hopes of being able to quantify the similarity of the representations that different layers are learning. If you had a measure that captured this, you could ask questions like: “how similar are the representations that are learned by different networks on the same task”, and “what is the dynamic of representational change in a given layer throughout training”? Canonical Correlation Analysis is one way of approaching this question, and the way taken by this paper. The premise of CCA is that you have two multidimensional variable sets, where each set is made up of vectors representing dimensions within that variable set. Concretely, in this paper, the sets under examination are the activation profiles of two layers (either the same layer at different points in training, or different layers in the same network, or layers in different networks). An activation profile is thought of in terms of multiple vectors, where each vector represents a given neuron’s activation value, evaluated over some observation set X. Importantly, for the two layers that you’re comparing, the set of observations X needs to be of the same length, but the layers can have different number of neurons (and, consequently, different numbers of vectors making up that layer’s multivariate set). Given this setup, the goal of CCA is to find vectors that are linear combinations of the basis vectors of each set, to satisfy some constraint. In that broad sense, this is similar to the project of PCA, which also constructs linear-combination principal components to better represent the underlying data space. However, in PCA, the constraints that define these combinations are based on one multidimensional feature space, not two. In CCA, instead of generating k principal components, you generate k *pairs* of canonical correlates. Each canonical correlate pair, (U1, V1) is a linear combination of the activation vectors of sets L1 and L2 respectively, and is chosen with the goal of minimizing the the angle (cosine) distance between the correlates in each pair. If you think about L1 and L2 each only having two activations (that is: if you think about them as being two-dimensional spaces) then the goal of CCA is to find the cosine distance between the planes defined by the two activation spaces. An important intuition here is that in this framing, vector sets that are just linear transformations of one another (scalings, rotations, swaps in the arbitrary order of activations) will look the same, which wouldn’t be the case if you just looked at raw correlations between the individual activations. This is connected to the linear algebra idea that, if you have two vectors, and a third that is just a linear combination of the first two, the span of those vectors is still just that two-dimensional space. This property is important for the analysis of neural network representations because it means it will be able to capture similarities between representational spaces that have fundamental geometric similarities, even if they’re different on a more surface level. In prior papers, CCA had been used by calculating the CCA vectors between varying sets of layers, and then taking the mean CCA value over all of the pairs of vectors. This paper argues against that approach, on the theory that network layers are probably not using the full representational capacity of their activation dimensions (think, as analogy: a matrix with three columns, that only actually spans two), and so including in your average very low-order correlations is mostly adding uninformative noise to your similarity measure. Instead, this paper weights the correlation coefficients according to the magnitudes of the correlate vectors in the pair; as best I can tell, this is roughly analogous to weighting according to eigenvalues, in a PCA setting. Using this weighted-average similarity measure, the authors do some really interesting investigations into learning dynamics. These include: * Comparing the intermediate-layer representations learned by networks that achieve low train error via memorization vs via actually-generalizing solutions, and show that, during training, the intermediate representations of generalizing networks are more similar to one another than memorizing networks are to one another. Intuitively, this aligns with the idea that there are many ways to noisily memorize, but a more constrained number of ways to actually learn meaningful information about a dataset. A super interesting implication of this is the idea that representational similarity *on the training set* across multiple bootstrapped or randomized trainings could be used as a proxy for test set performance, which could be particularly valuable in contexts where test data is limited https://i.imgur.com/JwyHFmN.png * Across networks, lower layers tend to be more similar to one another than layers closer to the output; said another way, the very simple (e.g. edge detectors) tend to be quite similar across networks, but the higher level representations are more divergent and influenceable by smaller quirks of the training set. * Within a given dataset, you can cluster learned internal representations across many training sets and recover groups trained with the same learning rate, even though the final layer softmax is inherently similar across models that achieve the same training error. This implies that metrics like this can give us some idea of the different minima that the optimization algorithm finds, as a function of different learning rates. Overall, I found this paper a great example of a straightforward idea used to clearly answer important and interesting questions, which is always refreshing amidst a sea of “tiny hack for an extra 0.05 accuracy”. |