[link]
This paper is an interesting extension of earlier work, in the TransformerXL paper, that sought to give Transformers access to a "memory" beyond the scope of the subsequence where full self-attention was being performed. This was done by caching the activations from prior subsequences, and making them available to the subsequence currently being calculated in a "read-only" way, with gradients not propagated backwards. This had the effect of (1) reducing the maximum memory size compared to simply doubling the subsequence length, and (2) reducing the extent to which gradients had to propagate backward through time. The authors of the Compressive Transformers paper want to build on that set of ideas to construct an even longer accessible memory. So, they take the baseline non-backpropogated memory design of TransformerXL, but instead of having tokens roll out of memory after the end of the previous (cached) subsequence, they create an extra compressed memory. Each token in this compressed memory is a function of C inputs in the normal memory. So, if C=3, you would input 3 memory vectors into your compression function to get one instance of a compressed memory vector. Depending on the scale of your C, you can turn up the temporal distance into the past that your compressed memory had to. https://i.imgur.com/7BaCzoU.png While the gradients from the main loss function didn't, as far as I could tell, pass back into the compression function, they did apply a compression loss to incentivize the compression to be coherent. They considered an autoencoder loss to reconstruct the input tokens from the compressed memory, but decided against that on the principle that memory inherently has to be compressed and lossy to be effective, and an autoencoder loss would promote infeasibly lossless compression. Instead, they take the interesting approach of incentivizing the compressed representations to be able to reconstruct the attention calculation performed on the pre-compressed representations. Basically, any information pulled out of the pre-compressed memories by content-based lookup also needs to be able to be pulled out of the compressed memories. This incentives the network to preferentially keep the information that was being actively used by the attention mechanisms in prior steps, and discard less useful information. One framing from this paper that I enjoyed was them drawing a comparison between the approach of Transformers (of keeping all lower-level activations in memory, and recombining them "in real time," for each downstream use of that information), and the approach of RNNs (of keeping a running compressed representation of everything seen up to this point). In this frame, their method is somewhere in between, with a tunable compression rate C (by contrast, a RNN would have an effectively unlimited compression rate, since all prior tokens would be compressed into a single state representation). |
[link]
I found this paper a bit difficult to fully understand. Its premise, as far as I can follow, is that we may want to use genetic algorithms (GA), where we make modifications to elements in a population, and keep elements around at a rate proportional to some set of their desirable properties. In particular we might want to use this approach for constructing molecules that have properties (or predicted properties) we want. However, a downside of GA is that its easy to end up in local minima, where a single molecule, or small modifications to that molecule, end up dominating your population, because everything else gets removed for having less-high reward. The authors proposed fix for this is by training a discriminator to tell the difference between molecules from the GA population and those from a reference dataset, and then using that discriminator loss, GAN-style, as part of the "fitness" term that's used to determine if elements stay in the population. The rest of the "fitness" term is made up of chemical desiderata - solubility, how easy a molecule is to synthesize, binding efficacy, etc. I think the intuition here is that if the GA produces the same molecule (or similar ones) over and over again, the discriminator will have an easy time telling the difference between the GA molecules and the reference ones. One confusion I had with this paper is that it only really seems to have one experiment supporting its idea of using a discriminator as part of the loss - where the discriminator wasn't used at all unless the chemical fitness terms plateaued for some defined period (shown below). https://i.imgur.com/sTO0Asc.png The other constrained optimization experiments in section 4.4 (generating a molecule with specific properties, improving a desired property while staying similar to a reference molecule, and drug discovery). They also specifically say that they'd like to be the case that the beta parameter - which controls the weight of the discriminator relative to the chemical fitness properties - lets you smoothly interpolate between prioritizing properties and prioritizing diversity/realness of images, but they find that's not the case, and that, in fact, there's a point at which you move beta a small amount and switch sharply to a regime where chemical fitness values are a lot lower. Plots of eventual chemical fitness found over time seem to be the highest for models with beta set to 0, which isn't what you'd expect if the discriminator was in fact useful for getting you out of plateaus and into long-term better solutions. Overall, I found this paper an interesting idea, but, especially since it was accepted into ICLR, found it had confusingly little empirical support behind it. |
[link]
Lee et al. propose a regularizer to increase the size of linear regions of rectified deep networks around training and test points. Specifically, they assume piece-wise linear networks, in its most simplistic form consisting of linear layers (fully connected layers, convolutional layers) and ReLU activation functions. In these networks, linear regions are determined by activation patterns, i.e., a pattern indicating which neurons have value greater than zero. Then, the goal is to compute, and later to increase, the size $\epsilon$ such that the $L_p$-ball of radius $\epsilon$ around a sample $x$, denoted $B_{\epsilon,p}(x)$ is contained within one linear region (corresponding to one activation pattern). Formally, letting $S(x)$ denote the set of feasible inputs $x$ for a given activation pattern, the task is to determine $\hat{\epsilon}_{x,p} = \max_{\epsilon \geq 0, B_{\epsilon,p}(x) \subset S(x)} \epsilon$. For $p = 1, 2, \infty$, the authors show how $\hat{\epsilon}_{x,p}$ can be computed efficiently. For $p = 2$, for example, it results in $\hat{\epsilon}_{x,p} = \min_{(i,j) \in I} \frac{|z_j^i|}{\|\nabla_x z_j^i\|_2}$. Here, $z_j^i$ corresponds to the $j$th neuron in the $i$th layer of a multi-layer perceptron with ReLU activations; and $I$ contains all the indices of hidden neurons. This analytical form can then used to add a regularizer to encourage the network to learn larger linear regions: $\min_\theta \sum_{(x,y) \in D} \left[\mathcal{L}(f_\theta(x), y) - \lambda \min_{(i,j) \in I} \frac{|z_j^i|}{\|\nabla_x z_j^i\|_2}\right]$ where $f_\theta$ is the neural network with paramters $\theta$. In the remainder of the paper, the authors propose a relaxed version of this training procedure that resembles a max-margin formulation and discuss efficient computation of the involved derivatives $\nabla_x z_j^i$ without too many additional forward/backward passes. https://i.imgur.com/jSc9zbw.jpg Figure 1: Visualization of locally linear regions for three different models on toy 2D data. On toy data and datasets such as MNIST and CalTech-256, it is shown that the training procedure is effective in the sense that larger linear regions around training and test points are learned. For example, on a 2D toy dataset, Figure 1 visualizes the linear regions for the optimal regularizer as well as the proposed relaxed version. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Hendrycks and Dietterich propose ImageNet-C and ImageNet-P benchmarks for corruption and perturbation robustness evaluation. Both datasets come in various sizes, and corruptions always come in different difficulties. The used corruptions include many common, realistic noise types such as various types of blur and random noise, brightness changes and compression artifacts. ImageNet-P differs from ImageNet-C in that sequences of perturbations are generated. This means, for a specific perturbation type, 30 different frames are generated; thus, less corruption types in total are used. The remainder of the paper introduces various evaluation metrics; these are usually based on the fact that the label of the corrupted image did not change. Finally, they also highlight some approaches to obtain more “robust” models against these corruptions. The list includes a variant of histogram equalization that is used to normalize the input images, the use of multi-scale or feature aggregation architectures and, surprisingly, adversarial logit pairing. Examples of ImageNet-C images can be found in Figure 1. https://i.imgur.com/YRBOzrH.jpg Figure 1: Examples of images in ImageNet-C. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Summary: An odd thing about machine learning these days is how far you can get in a line of research while only ever testing your method on image classification and image datasets in general. This leads one occasionally to wonder whether a given phenomenon or advance is a discovery of the field generally, or whether it's just a fact about the informatics and learning dynamics inherent in image data. This paper, part of a set of recent papers released by Facebook centering around the Lottery Ticket Hypothesis, exists in the noble tradition of "lets try <thing> on some non-image datasets, and see if it still works". This can feel a bit silly in the cases where the ideas or approaches do transfer, but I think it's still an important impulse for the field to have, lest we become too captured by ImageNet and its various descendants. This paper test the Lottery Ticket Hypothesis - the idea that there are a small subset of weights in a trained network whose lucky initializations promoted learning, such that if you reset those weights to their initializations and train only them you get comparable or near-comparable performance to the full network - on reinforcement learning and NLP datasets. In particular, within RL, they tested on both simple continuous control (where the observation state is a vector of meaningful numbers) and Atari from pixels (where the observation is a full from-pixels image). In NLP, they trained on language modeling and translation, with both a LSTM and a Transformer respectively. (Prior work had found that Transformers didn't exhibit lottery ticket like phenomenon, but this paper found a circumstance where they appear to. ) Some high level interesting results: https://i.imgur.com/kd03bQ4.png https://i.imgur.com/rZTH7FJ.png - So as to not bury the lede: by and large, "winning" tickets retrained at their original initializations outperform random initializations of the same size and configuration on both NLP and Reinforcement Learning problems - There is wide variability in how much pruning in general (a necessary prerequisite operation) impacts reinforcement learning. On some games, pruning at all crashes performance, on others, it actually improves it. This leads to some inherent variability in results https://i.imgur.com/4o71XPt.png - One thing that prior researchers in this area have found is that pruning weights all at once at the end of training tends to crash performance for complex models, and that in order to find pruned models that have Lottery Ticket-esque high-performing properties, you need to do "iterative pruning". This works by training a model for a period, then pruning some proportion of weights, then training again from the beginning, and then pruning again, and so on, until you prune down to the full percentage you want to prune. The idea is that this lets the model adapt gradually to a drop in weights, and "train around" the capacity reduction, rather than it just happening all at once. In this paper, the authors find that this is strongly necessary for Lottery Tickets to be found for either Transformers or many RL problems. On a surface level, this makes sense, since Reinforcement Learning is a notoriously tricky and non-stationary learning problem, and Transformers are complex models with lots of parameters, and so dramatically reducing parameters can handicap the model. A weird wrinkle, though, is that the authors find that lottery tickets found without iterative pruning actually perform worse than "random tickets" (i.e. initialized subnetworks with random topology and weights). This is pretty interesting, since it implies that the topology and weights you get if you prune all at once are actually counterproductive to learning. I don't have a good intuition as to why, but would love to hear if anyone does. https://i.imgur.com/9LnJe6j.png - For the Transformer specifically, there was an interesting divergence in the impact of weight pruning between the weights of the embedding matrix and the weights of the rest of the network machinery. If you include embeddings in the set of weights being pruned, there's essentially no difference in performance between random and winning tickets, whereas if you exclude them, winning tickets exhibit the more typical pattern of outperforming random ones. This implies that whatever phenomenon that makes winning tickets better is more strongly (or perhaps only) present in weights for feature calculation on top of embeddings, and not very present for the embeddings themselves |
[link]
## Introduction Bayesian Neural Networks (BNN): intrinsic importance model based on weight uncertainty; variational inference can approximate posterior distributions using Monte Carlo sampling for gradient estimation; acts like an ensemble method in that they reduce the prediction variance but only uses 2x the number of parameters. The idea is to use BNN's uncertainty to guide gradient descent to not update the important weight when learning new tasks. ## Bayes by Backprop (BBB): https://i.imgur.com/7o4gQMI.png Where $q(w|\theta)$ is our approximation of the posterior $p(w|x)$. $q$ is most probably gaussian with diagonal covariance. We can optimize this via the ELBO: https://i.imgur.com/OwGm20b.png ## Uncertainty-guided CL with BNN (UCB): UCB the regularizing is performed with the learning rate such that the learning rate of each parameter and hence its gradient update becomes a function of its importance. They set the importance to be inversely proportional to the standard deviation $\sigma$ of $q(w|\theta)$ Simply put, the more confident the posterior is about a certain weight, the less is this weight going to be updated. You can also use the importance for weight pruning (sort of a hard version of the first idea) ## Cartoon https://i.imgur.com/6Ld79BS.png |
[link]
Brendel and Bethge show empirically that state-of-the-art deep neural networks on ImageNet rely to a large extent on local features, without any notion of interaction between them. To this end, they propose a bag-of-local-features model by applying a ResNet-like architecture on small patches of ImageNet images. The predictions of these local features are then averaged and a linear classifier is trained on top. Due to the locality, this model allows to inspect which areas in an image contribute to the model’s decision, as shown in Figure 1. Furthermore, these local features are sufficient for good performance on ImageNet. Finally, they show, on scrambled ImageNet images, that regular deep neural networks also rely heavily on local features, without any notion of spatial interaction between them. https://i.imgur.com/8NO1w0d.png Figure 1: Illustration of the heap maps obtained using BagNets, the bag-of-local-features model proposed in the paper. Here, different sizes for the local patches are used. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Zhang et al. combine interval bound propagation and CROWN, both approaches to obtain bounds on a network’s output, to efficiently train robust networks. Both interval bound propagation (IBP) and CROWN allow to bound a network’s output for a specific set of allowed perturbations around clean input examples. These bounds can be used for adversarial training. The motivation to combine BROWN and IBP stems from the fact that training using IBP bounds usually results in instabilities, while training with CROWN bounds usually leads to over-regularization. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Frankle and Carbin discover so-called winning tickets, subset of weights of a neural network that are sufficient to obtain state-of-the-art accuracy. The lottery hypothesis states that dense networks contain subnetworks – the winning tickets – that can reach the same accuracy when trained in isolation, from scratch. The key insight is that these subnetworks seem to have received optimal initialization. Then, given a complex trained network for, e.g., Cifar, weights are pruned based on their absolute value – i.e., weights with small absolute value are pruned first. The remaining network is trained from scratch using the original initialization and reaches competitive performance using less than 10% of the original weights. As soon as the subnetwork is re-initialized, these results cannot be reproduced though. This suggests that these subnetworks obtained some sort of “optimal” initialization for learning. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Zhang et al. search for “blind spots” in the data distribution and show that blind spot test examples can be used to find adversarial examples easily. On MNIST, the data distribution is approximated using kernel density estimation were the distance metric is computed in dimensionality-reduced feature space (of an adversarially trained model). For dimensionality reduction, t-SNE is used. Blind spots are found by slightly shifting pixels or changing the gray value of the background. Based on these blind spots, adversarial examples can easily be found for MNIST and Fashion-MNIST. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
This paper came on my radar after winning Best Paper recently at ICLR, and all in all I found it a clever way of engineering a somewhat complicated inductive bias into a differentiable structure. The empirical results weren’t compelling enough to suggest that this structural shift made a regime-change difference in performing, but it does seem to have some consistently stronger ability to do syntactic evaluation across large gaps in sentences. The core premise of this paper is that, while language is to some extent sequence-like, it is in a more fundamental sense tree-like: a recursive structure of modified words, phrases, and clauses, aggregating up to a fully complete sentence. In practical terms, this cashes out to parse trees, labels akin to the sentence diagrams that you or I perhaps did once upon a time in grade school. https://i.imgur.com/GAJP7ji.png Given this, if you want to effectively model language, it might be useful to have a neural network structure explicitly designed to track where you are in the tree. To do this, the authors of this paper use a clever activation function scheme based on the intuition that you can think of jumping between levels of the tree as adding information to the stack of local context, and then removing that information from the stack when you’ve reached the end of some local phrase. In the framework of a LSTM, which has explicit gating mechanisms for both “forgetting” (removing information from cell memory) and input (adding information to the representation within cell memory) this can be understood as forcing a certain structure of input and forgetting, where you have to sequentially “close out” or add nodes as you move up or down the tree. To represent this mathematically, the authors use a new activation function they developed, termed cumulative max or cumax. In the same way that the softmax is a differentiable (i.e. “soft”) version of an argmax, the cumulative max is a softened version of a vector that has zeros up to some switch point k, and ones thereafter. If you had such a vector as your forget mask, then “closing out” a layer in your tree would be equivalent to shifting the index where you switch from 0 to 1 up by one, so that a layer that previously had a “remember” value of 1.0 now is removing its content from the stack. However, since we need to differentiate, this notional 0/1 vector is instead represented as a cumulative sum of a softmax, which can be thought of as the continuous-valued probability that you’ve reached that switch-point by any given point in the vector. Outside of the abstractions of what we’re imagining this cumax function to represent, in a practical sense, it does strictly enforce that you monotonically remember or input more as you move along the vector. This has the practical fact that the network will be biased towards remembering information at one end of the representation vector for longer, meaning it could be a useful inductive bias around storing information that has a more long-term usefulness to it. One advantage that this system has over a previous system that, for example, had each layer of the LSTM operate on a different forgetting-decay timescale, is that this is a soft approximation, so, up to the number of neurons in the representation, the model can dynamically approximate whatever number of tree nodes it likes, rather than being explicitly correspondent with the number of layers. Beyond being a mathematically clever idea, the question of whether it improves performance is a little mixed. It does consistently worse at tasks that require keeping track of short term dependency information, but seems to do better at more long-term tasks, although not in a perfectly consistent or overly dramatic way. My overall read is that this is a neat idea, and I’m interested to see if it gets built on, as well as interested to see later papers that do some introspective work to validate whether the model is actually using this inductive bias in the tree-like way that we’re hoping and imagining it will. |