[link]
An interesting category of machine learning papers, to which this paper belongs, is papers that use learning systems as a way to explore the incentive structures of problems whose equilibrium properties are difficult to reason about intuitively. In this paper, the authors are trying to better understand how different dynamics of a cooperative communication game between agents, where the speaking agent is trying to describe an object such that the listening agent picks the one the speaker is being shown, influence the communication protocol (or, to slightly anthropomorphize, the language) that the agents end up using. In particular, the authors experiment with what happens when the listening agent is frequently replaced during training with an untrained listener who has no prior experience with the speaker. The idea of this experiment is that if the speaker is in a scenario where listeners need to frequently "relearn" the mapping between communication symbols and objects, this will provide an incentive for that mapping to be easier to quickly learn. https://i.imgur.com/8csqWsY.png The metric of ease of learning that the paper focuses on is "topographic similarity", which is a measure of how compositional the communication protocol is. The objects they're working with have two properties, and the agents use a pair of two discrete symbols (two letters) to communicate about them. A perfectly compositional language would use one of the symbols to represent each of the properties. To measure this property mathematically, the authors calculate the (cosine) similarity between two objects' property vectors and the (edit) distance between the two objects' descriptions under the emergent language, and then calculate the correlation between these quantities across pairs of objects. 
In this experimental setup, if a language is perfectly compositional, the correlation will be perfect, because every time a property is the same, the same symbol will be used, so two objects that share that property will always share that symbol in their linguistic representation. https://i.imgur.com/t5VxEoX.png The premise and the experimental setup of this paper are interesting, but I found the experimental results difficult to gain intuition and confidence from. The authors do show that, in a regime where listeners are reset, topographic similarity rises from a beginning-of-training value of .54 to an end-of-training value of .59, whereas in the baseline, no-reset regime, the value drops to .51. So there definitely is some amount of support for their claim that listener resets lead to higher compositionality. But given that their central quantity is just a correlation between similarities, it's hard to gain intuition for whether the difference is meaningful. It doesn't naively seem particularly dramatic, and it's hard to tell otherwise without more reference points for how topographic similarity would change under a wider range of different training scenarios. 
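As a concrete sketch of the metric (the objects, messages, and encoding here are all made up, and I use Pearson rather than rank correlation for simplicity), a perfectly compositional two-symbol language comes out with a topographic similarity of 1:

```python
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

def edit_distance(s, t):
    # Standard Levenshtein dynamic program over two messages.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def topographic_similarity(objects, messages):
    # Correlation between object similarity and message distance, negated so
    # that more compositional languages score higher.
    pairs = list(combinations(range(len(objects)), 2))
    sims = [cosine(objects[i], objects[j]) for i, j in pairs]
    dists = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    return -pearson(sims, dists)

# Two binary properties encoded as concatenated one-hots; the language uses its
# first symbol for property 1 and its second for property 2 (perfectly compositional).
objects = [(1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1)]
messages = ["aa", "ab", "ba", "bb"]
```

Under this encoding, shared properties translate one-for-one into shared symbols, so similarity and distance are perfectly (negatively) correlated.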
[link]
If you've been at all aware of machine learning in the past five years, you've almost certainly seen the canonical word2vec example demonstrating additive properties of word embeddings: "king - man + woman = queen". This paper has the goal of designing embeddings for agent plans or trajectories that follow similar principles, such that a task composed of multiple subtasks can be represented by adding the vectors corresponding to the subtasks. For example, if a task involved getting an ax and then going to a tree, you'd want to be able to generate an embedding that corresponded to a policy for executing that task by summing the embeddings for "go to ax" and "go to tree". https://i.imgur.com/AHlCt76.png The authors don't assume that they know the discrete boundaries between subtasks in multiple-task trajectories, and instead use a relatively simple and clever training structure to induce the behavior described above. They construct some network g(x) that takes in information describing a trajectory (in this case, start and end state, but presumably this could be more specific transitions), and produces an embedding. Then, they train a model on an imitation learning problem where, given one demonstration of performing a particular task (typically generated by the authors to be composed of multiple subtasks), the agent needs to predict what action will be taken next in a second trajectory of the same composite task. At each point in the sequence of predicting the next action, the agent calculates the embedding of the full reference trajectory and the embedding of the actions it has so far performed in the current predicted trajectory, and calculates the difference between these two values. This embedding difference is used to condition the policy function that predicts the next action. 
At each point, you enforce the constraint that the embedding of what remains to be done in the trajectory be close to the embedding of (full trajectory) - (what has so far been completed), by making the policy that corresponds with that embedding difference map to the remaining part of the trajectory. In addition to this core loss, they also have a few regularization losses, including: 1. A loss that goes through different temporal subdivisions of the reference trajectory and pushes the summed embedding of the two parts to be close to the embedding of the whole 2. A loss that simply pushes the embeddings of the two paired trajectories performing the same task closer together. The authors test mostly on relatively simple tasks (picking up and moving sequences of objects with a robotic arm, moving around and picking up objects in a simplified Minecraft world) but do find that their central partial-conditioning-based loss gives them better performance on demonstration tasks that are made up of many subtasks. Overall, this is an interesting and clever paper: it targets additive composition much more directly than situations like the original word2vec, where additivity came as a side effect of other properties, but it's still an idea that I could imagine having interesting properties, and one I'd be interested to see applied to a wider variety of tasks. 
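A toy sketch of the conditioning signal (the names and the subtask-count featurization are my own invention; a linear embedding makes the additive property hold exactly, whereas the paper's regularizers push a learned g toward it):

```python
import numpy as np

rng = np.random.default_rng(0)
N_SUBTASKS, D_EMB = 6, 4
W = rng.normal(size=(D_EMB, N_SUBTASKS))  # stand-in for the learned embedding network

def g(counts):
    # Embed a trajectory summarized as a vector of subtask counts. A linear
    # map is exactly additive; the paper trains a network toward this property.
    return W @ counts

go_to_ax = np.array([1, 0, 0, 0, 0, 0], dtype=float)
go_to_tree = np.array([0, 1, 0, 0, 0, 0], dtype=float)
full_task = go_to_ax + go_to_tree

# Conditioning signal for the policy after the agent has completed "go to ax":
remaining = g(full_task) - g(go_to_ax)
```

Here the difference recovers the embedding of the remaining subtask exactly, which is the relationship the paper's losses are designed to approximate.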
[link]
Reinforcement learning is notoriously sample-inefficient, and one reason why is that agents learn about the world entirely through experience, and it takes lots of experience to learn useful things. One solution you might imagine to this problem is the one humans by and large use in encountering new environments: instead of learning everything through first-person exploration, acquiring lots of your knowledge by hearing or reading condensed descriptions of the world that can help you take more sensible actions within it. This paper and others like it have the goal of training RL agents that can take in information about the world in the form of text, and use that information to solve a task. This paper is not the first to propose a solution in this general domain, but it claims to be unique by dint of having both the dynamics of the environment and the goal of the agent change on a per-environment basis, and be described in text. The precise details of the architecture used are very much the result of particular engineering done to solve this problem, and as such, it's a bit hard to abstract away generalizable principles that this paper showed, other than the proof-of-concept fact that tasks of the form they describe (where an agent has to learn which objects can kill which enemies, and pursue the goal of killing certain ones) can be solved. Arguably the most central design principle of the paper is aggressive and repeated use of different forms of conditioning architectures, to fully mix the information contained in the textual and visual data streams. This was done in two main ways. First, multiple different attention summaries were created, using the document embedding as input, but with queries conditioned on different things (the task, the inventory, a summarized form of the visual features). 
This is a natural but clever extension of the fact that attention is an easy way to generate conditional aggregated versions of some input. https://i.imgur.com/xIsRu2M.png The architecture also uses FiLM (Feature-wise Linear Modulation), which is essentially a many-generations-generalized version of conditional batch normalization, in which the gamma and beta used to globally scale and shift a feature vector are learned, taking some other data as input. The canonical version of this would be taking in text input, summarizing it into a vector, and then using that vector as input to an MLP that generates gamma and beta parameters for all of the convolutional layers in a vision system. The interesting innovation of this paper is essentially to argue that this conditioning operation is quite neutral, and that there's no essential way in which the vision input is the "true" data and the text simply the auxiliary conditioning data: it's more accurate to say that each form of data should condition the processing of the other one. And so they use Bidirectional FiLM, which does just that, conditioning vision features on text summaries, but also conditioning text features on vision summaries. https://i.imgur.com/qFaH1k3.png Finally, the model overall is composed of multiple layers that perform both this mixing FiLM operation and visually-conditioned attention. The authors did show, not super surprisingly, that these additional forms of conditioning added performance value to the model relative to the cases where they were ablated. 
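The basic FiLM operation is compact enough to sketch directly (shapes and the single-linear-layer conditioning net are my own simplifications; real implementations use an MLP and apply this per convolutional block):

```python
import numpy as np

rng = np.random.default_rng(1)

def film(features, gamma, beta):
    # Feature-wise linear modulation of a (C, H, W) feature map: every channel
    # is scaled by gamma[c] and shifted by beta[c].
    return gamma[:, None, None] * features + beta[:, None, None]

C, H, W_SP, D_TEXT = 8, 5, 5, 16
vision = rng.normal(size=(C, H, W_SP))
text_summary = rng.normal(size=D_TEXT)

# Conditioning net (one linear layer here for brevity) maps the text summary
# to per-channel (gamma, beta) parameters.
W_mod = rng.normal(size=(2 * C, D_TEXT)) * 0.1
gamma, beta = np.split(W_mod @ text_summary, 2)

out = film(vision, gamma, beta)
```

Bidirectional FiLM, as described above, would additionally run the same operation in the other direction, with a vision summary generating modulation parameters for the text features.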
[link]
Reinforcement learning is often broadly separated into two categories of approaches: model-free and model-based. In the former category, networks simply take observations as input and produce predicted best actions (or predicted values of available actions) as output. In order to perform well, the model obviously needs to gain an understanding of how its actions influence the world, but it doesn't explicitly make predictions about what the state of the world will be after an action is taken. In model-based approaches, the agent explicitly builds a dynamics model, that is, a model that takes in (past state, action) and predicts the next state. In theory, learning such a model can lead to both interpretability (because you can "see" what the model thinks the world is like) and robustness to different reward functions (because you're learning about the world in a way not explicitly tied up with the reward). This paper proposes an interesting melding of these two paradigms, where an agent learns a model of the world as part of end-to-end policy learning. This works through something the authors call "observational dropout": the internal model predicts the next state of the world given the prior one and the action, and then with some probability, the state of the world that both the policy and the next iteration of the dynamics model see is replaced with the model's prediction. This incentivizes the network to learn an effective dynamics model, because the farther the predictions of the model are from the true state of the world, the worse the performance of the learned policy will be on the iterations where the only observation it can see is the predicted one. So, this architecture is model-free in the sense that the gradient used to train the system is based on applying policy gradients to the reward, but model-based in the sense that it does have an internal world representation. 
https://i.imgur.com/H0TNfTh.png The authors find that, on a simple task, Swing-Up Cartpole, very low probabilities of seeing the true world (and thus very high probabilities of the policy only seeing the dynamics model's output) lead to world models good enough that a policy trained only on trajectories sampled from that model can perform relatively well. This suggests that at higher probabilities of seeing the true world, there was less value in the dynamics model being accurate, and consequently less training signal for it. (Of course, policies that could often only see the predicted world performed worse during their original training iteration compared to policies that could see the real world more frequently.) On a more complex task, CarRacing, the authors looked at how well a policy trained using the representations of the world model as input could perform, to examine whether the model was learning useful things about the world. https://i.imgur.com/v9etll0.png They found an interesting tradeoff, where at high probabilities (like before) the dynamics model had little incentive to be good, but at low probabilities it didn't have enough contact with the real dynamics of the world to learn a sensible policy. 
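The rollout mechanics of observational dropout can be sketched as follows (all interfaces and the 1-D toy world are my own; note that the environment always advances on the true state, and only what the policy and model *see* gets swapped out):

```python
import random

def rollout(env_step, model_step, policy, obs0, p_real, steps=50, seed=0):
    # Observational dropout: with probability p_real the agent sees the true
    # state; otherwise it sees its own world model's prediction, which is also
    # what the model conditions on at the next step.
    rng = random.Random(seed)
    true_obs = seen_obs = obs0
    total = 0.0
    for _ in range(steps):
        action = policy(seen_obs)
        true_obs, reward = env_step(true_obs, action)
        pred_obs = model_step(seen_obs, action)  # model only sees what the policy saw
        seen_obs = true_obs if rng.random() < p_real else pred_obs
        total += reward
    return total

# Toy 1-D world: the state drifts by the action, and reward pulls it toward zero.
env_step = lambda s, a: (s + a, -abs(s + a))
perfect_model = lambda s, a: s + a  # an already-accurate dynamics model
policy = lambda s: -1 if s > 0 else (1 if s < 0 else 0)
```

With an already-perfect model, the return is identical whether the policy sees the real state or only predictions; the paper's point is that the gap between those two returns is exactly what pressures the model to become accurate.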
[link]
In the last two years, the Transformer architecture has taken over the worlds of language modeling and machine translation. The central idea of Transformers is to use self-attention to aggregate information from variable-length sequences, a task for which Recurrent Neural Networks had previously been the most common choice. Beyond that central structural change, one more nuanced change was from having a single attention mechanism on a given layer (with a single set of query, key, and value weights) to having multiple attention heads, each with their own set of weights. The change was framed as being conceptually analogous to the value of having multiple feature dimensions, each of which focuses on a different aspect of the input; these multiple heads could now specialize and perform different weighted sums over the input based on their specialized function. This paper performs an experimental probe into the value of the various attention heads at test time, and tries a number of different pruning tests across both machine translation and language modeling architectures to see their impact on performance. In their first ablation experiment, they test the effect of removing (that is, zero-masking the contribution of) a single head from a single attention layer, and find that in almost all cases (88 out of 96) there's no statistically significant drop in performance. Pushing beyond this, they ask what happens if, in a given layer, they remove all heads but the one that was seen to be most important in the single-head tests (the head that, if masked, caused the largest performance drop). This definitely leads to more performance degradation than the removal of single heads, but the degradation is less than might be intuitively expected, and is often also not statistically significant. 
https://i.imgur.com/Qqh9fFG.png This also shows an interesting distribution over where performance drops: in machine translation, it seems like decoder-decoder self-attention is the least sensitive to heads being pruned, and encoder-decoder attention is the most sensitive, with a very dramatic performance drop-off observed particularly if the last layer of encoder-decoder attention is stripped to a single head. This is interesting to me insofar as it shows the intuitive roots of attention in these architectures; attention was originally used in the encoder-decoder parts of models to solve the problem of pulling out information in a source sentence at the time it's needed in the target sentence, and this result suggests that a lot of the value of multiple heads in translation came from making that mechanism more expressive. Finally, the authors performed an iterative pruning test, where they ordered all the heads in the network according to their single-head importance, and pruned starting with the least important. Similar to the results above, they find that drops in performance at high rates of pruning eventually happen in all parts of the model, but that encoder-decoder attention suffers more quickly and more dramatically as heads are removed. https://i.imgur.com/oS5H1BU.png Overall, this is a clean and straightforward empirical paper that asks a fairly narrow question and generates some interesting findings through that question. These results seem reminiscent to me of the Lottery Ticket Hypothesis line of work, where it seems that having a network with a lot of weights is useful during training insofar as it gives you more chances at an initialization that allows for learning, but at test time, only a small percentage of the weights have ultimately become important, and the rest can be pruned. 
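The zero-masking used in these ablations can be sketched with a toy multi-head self-attention layer (my own minimal construction; real Transformer layers add residual connections, layer norm, and so on). Because heads are concatenated and linearly projected, the layer output is linear in the per-head masks, which is why masking single heads is such a clean intervention:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, head_mask):
    # Self-attention where each head's output is multiplied by a 0/1 mask;
    # "pruning" a head is just setting its mask entry to 0.
    n_heads, _, d_head = Wq.shape
    outs = []
    for h in range(n_heads):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(d_head))
        outs.append(head_mask[h] * (A @ V))
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
T, d_model, n_heads, d_head = 5, 8, 4, 2
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_head, d_model))

full = multi_head_attention(X, Wq, Wk, Wv, Wo, np.ones(n_heads))
only_head0 = multi_head_attention(X, Wq, Wk, Wv, Wo, np.array([1.0, 0, 0, 0]))
without_head0 = multi_head_attention(X, Wq, Wk, Wv, Wo, np.array([0.0, 1, 1, 1]))
```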
In order to make the comparison more robust, I'd be interested to see work that does more specific testing of the number of heads required for good performance during training and also during testing, divided out by different areas of the network. (Also, possibly this work exists and I haven't found it!) 
[link]
Self-Supervised Learning is a broad category of approaches whose goal is to learn useful representations by asking networks to perform constructed tasks that only use the content of a dataset itself, and not external labels. The idea is to design tasks such that solving them requires the network to have learned useful representations. Some examples of this approach include predicting the rotation of rotated images, reconstructing color from greyscale, and, the topic of this paper, maximizing mutual information between different areas of an image. The hope behind this last approach is that if two areas of an image are generated by the same set of underlying factors (in the case of a human face: they're parts of the same person's face), then a representation that correctly captures those factors for one area will give you a lot of information about the representation of the other area. Historically, this conceptual desire for mutually informative representations has been captured by mutual information: treating the representations of area 1 and area 2 as random variables, the mutual information is the KL divergence between their joint distribution and the product of their marginals. This is an old statistical intuition: the closer the joint is to the product of the marginals, the closer the variables are to independent; the farther away, the closer they are to informationally identical. https://i.imgur.com/2SzD5d5.png This paper argues that the presence of the KL divergence in this mutual information formulation impedes the ability of networks to learn useful representations. This argument is theoretically based on a result from a recent paper (which for the moment I'll just take as foundation, without reading it myself) that empirical lower-bound estimates of mutual information, of the kind used in these settings, are upper bounded by log(n), where n is the number of data points. 
Our hope in maximizing a lower bound on any quantity is that the bound is fairly tight, since that means that optimizing a network to push the lower bound upward actually has the effect of pushing the true value up as well. If the lower bound we can estimate is constrained to be far below the actual mutual information in the data, then pushing it upward doesn't actually require the true value to move upward. The authors identify this as a particular problem in settings where the underlying mutual information of the data is high, such as in videos where one frame is very predictive of the next, since in those cases the log(n) ceiling imposed by the dataset size will be small relative to the maximum mutual information you could in principle push your network to achieve. https://i.imgur.com/wm39mQ8.png Taking a leaf out of the GAN literature, the authors suggest replacing the KL divergence component of mutual information with the Wasserstein distance. Otherwise known as the "earth-mover distance", the Wasserstein distance measures the cost of the least costly way to move probability mass from one distribution to another, assuming you're moving that mass along some metric space. A nice property of the Wasserstein distance (in both GANs and in this application) is that it doesn't saturate as quickly: the value of a KL divergence can shoot up if the distributions are even somewhat different, making it unable to differentiate between distributions that are somewhat and very far apart, whereas a Wasserstein distance continues to provide meaningful signal in that regime. Making this swap for mutual information, the authors come up with the "Wasserstein Dependency Measure", which is just the Wasserstein distance between the joint distribution and the product of the marginals. 
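The log(n) ceiling is easy to see concretely with the naive plug-in estimator (the cited result concerns variational lower-bound estimators, but the flavor is the same; this toy example is my own):

```python
import math
from collections import Counter

def empirical_mi(pairs):
    # Plug-in (maximum likelihood) mutual information estimate, in nats,
    # from a list of paired samples.
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Even when X determines Y exactly (so the "true" mutual information can be
# arbitrarily large), an estimate built from n samples tops out at log(n).
n = 16
perfectly_dependent = [(i, i) for i in range(n)]
estimate = empirical_mi(perfectly_dependent)
```

With n = 16 perfectly dependent samples, the estimate lands exactly at log(16), regardless of how much mutual information the underlying variables actually share.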
https://i.imgur.com/3s2QRRz.png In practice, they use the dual formulation of the Wasserstein distance, which amounts to applying a (neural network) function f(x) to values from both distributions, optimizing f(x) so that the values are far apart, and using that distance as your training signal. Crucially, this function has to be relatively smooth in order for the dual formulation to work: in particular, it has to have a small Lipschitz constant (meaning its rate of change is bounded by some value). Intuitively, this has the effect of restricting the capacity of the network, which is hoped to incentivize it to use its limited capacity to represent true factors of variation, which are assumed to be the most compact way to represent the data. Empirically, the authors found that their proposed Wasserstein Dependency Measure (with a slight variation applied to reduce variance) does have the predicted property of performing better in situations where the native mutual information between two areas is high. I found the theoretical points of this paper interesting, and liked the generalization of the idea of Wasserstein distances from GANs to a new area. That said, I wish I had a better mechanical sense for how it grounds out in actual neural network losses: this is partially just my own lack of familiarity with how e.g. mutual information losses are actually formulated as network objectives, but I would have appreciated an appendix that did a bit more of that mapping between mathematical intuition and practical network reality. 
[link]
In response to increasing calls for ways to explain and interpret the predictions of neural networks, one major genre of explanation has been the construction of saliency maps for image-based tasks. These maps assign a relevance or saliency score to every pixel in the image, according to various criteria by which the value of a pixel can be said to have influenced the final prediction of the network. This paper is an interesting blend of ideas from the saliency mapping literature with ones from adversarial examples: it essentially shows that you can create adversarial examples whose goal isn't to change the output of a classifier, but instead to keep the output of the classifier fixed while radically changing the explanation (their term for the previously-described pixel saliency map that results from various explanation-finding methods) to resemble some desired target explanation. This is basically a targeted adversarial example, but one targeting a different property of the network (the calculated explanation) while keeping additional ones fixed (keeping the output of the perturbed network close to the original output, as well as keeping the input image itself in a norm ball around the original image). This is done in a pretty standard way: by defining a loss function incentivizing closeness to the original output and also closeness of the explanation to the desired target, and performing gradient descent on the pixels until this loss is low. https://i.imgur.com/N9uReoJ.png The authors do a decent job of showing that such targeted perturbations are possible: by my assessment of their results, their strongest successes at inducing an actual targeted explanation are with Layer-wise Relevance Propagation and Pattern Attribution (two of the six tested explanation methods). With the other methods, I definitely buy that they're able to induce an explanation that's very unlike the true/original explanation, but it's not as clear that they can reach an arbitrary target. 
This involves a bit of squinting, but it seems like they have more success in influencing propagation methods (where the effect size of the output is propagated backwards through the network, accounting for ReLUs) than they do with gradient ones (where you're simply looking at the gradient of the output class w.r.t. each pixel). In the theory section of the paper, the authors do a bit of differential geometry that I'll be up front and say I did not have the niche knowledge to follow, but which essentially argues that the manipulability of an explanation has to do with the curvature of the output manifold for a constant output. That is to say: how much can you induce a large change in the gradient of the output while moving a small distance along the manifold of a constant output value? They then go on to argue that ReLU activations, because their gradients are by definition discontinuous at zero, induce sharp changes in gradient for points near one another, and this increases the ability of networks to be manipulated. They propose a softplus activation instead, in which the sharp kink of the ReLU becomes a smooth curve, and show relatively convincingly that at low values of beta (more curved) you can mostly eliminate the ability of a perturbation to induce an adversarially targeted explanation. https://i.imgur.com/Fwu3PXi.png For all that I didn't have a completely solid grasp of some of the theory sections here, I think this is a neat proof-of-concept paper in showing that neural networks can be small-perturbation fragile on a lot of different axes: we've known this for a while in the area of adversarial examples, but this is a neat generalization of that fact to a new area. 
[link]
If your goal is to interpret the predictions of neural networks on images, there are a few different ways you can focus your attention. One approach is to try to understand and attach conceptual tags to learnt features, to form a vocabulary with which models can be understood. However, techniques in this family have to contend with a number of challenges, from the difficulty in attaching clear concepts to features to the sheer number of neurons to interpret. An alternate approach, and the one pursued by this paper, is to frame interpretability as a matter of introspecting on *where in an image* the model is pulling information from to make its decision. This is the question for which hard attention provides an answer: identify where in an image a model is making a decision by learning a meta-model that selects small patches of an image, and then making a classification decision by applying a network to only those patches which were selected. By definition, if only a discrete set of patches were used for prediction, those were the ones driving the model's decision. This central fact of the model only choosing a discrete set of patches is a key complexity, since the choice to use a patch or not is a binary, discontinuous action, and not something through which one can backpropagate gradients. Saccader, the approach put forward by this paper, proposes an architecture which extracts features from locations within an image, and uses those spatially located features to inform a stochastic policy that selects each patch with some probability. Because reinforcement learning is by construction structured to allow discrete actions, the system as a whole can be trained via policy gradient methods. 
https://i.imgur.com/SPK0SLI.png Diving into a bit more detail: while I don't have a deep familiarity with prior work in this area, my impression is that the notion of using policy gradients to learn a hard attention policy isn't a novel contribution of this work, but rather that its novelty comes from clever engineering done to make that policy easier to learn. The authors cite the problem of sparse reward in learning the policy, which I presume to mean that if you start in more traditional RL fashion by just sampling random patches, most patches will be unclear or useless in providing classification signal, so it will be hard to train well. The Saccader architecture works by extracting localized features in an architecture inspired by the 2019 BagNet paper, which essentially applies very tall and narrow convolutional stacks to spatially small areas of the image. This makes it the case that feature vectors for different overlapping patches can be computed efficiently: instead of re-running the network for each patch, it just combines the features from the "tops" of all of the small column networks inside the patch, and uses that aggregation as a patch-level feature. These features from the "representation network" are then used in an "attention network", which uses larger-receptive-field convolutions to create patch-level features that integrate the context of things around them. Once these two sets of features are created, they are fed into the "Saccader cell", which uses them to calculate a distribution over patches from which the policy then samples. The Saccader cell is a simplified memory cell, which sets a value to 1 when a patch has been sampled, and applies a very strong penalization to that patch being sampled on future "glimpses" by the policy (in general, classification is performed by making a number of draws and averaging the logits produced for each sampled patch). 
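The sampling behavior of that cell can be sketched in isolation (the function name and uniform logits are my own; the real cell operates on learned patch features and is trained with policy gradients, whereas this only shows the mark-as-visited mechanics):

```python
import numpy as np

def glimpse_policy(patch_logits, n_glimpses, rng, penalty=1e9):
    # Draw a patch from a softmax over patch logits, then push that patch's
    # logit down hard so that later glimpses (effectively) never revisit it.
    logits = np.asarray(patch_logits, dtype=float).copy()
    chosen = []
    for _ in range(n_glimpses):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        idx = int(rng.choice(len(logits), p=p))
        chosen.append(idx)
        logits[idx] -= penalty  # the "memory cell": mark this patch as visited
    return chosen

rng = np.random.default_rng(0)
picks = glimpse_policy(np.zeros(10), n_glimpses=5, rng=rng)
```

With a penalty this large, the probability of re-sampling a visited patch underflows to zero, so the glimpses end up distinct without any hard constraint in the sampler itself.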
https://i.imgur.com/5pSL0oc.png I found this paper fairly conceptually clever (I hadn't thought much about using a reinforcement learning setup for classification before), though a bit difficult to follow in its terminology and notation. It's able to perform relatively well on ImageNet, though I'm not steeped enough in that benchmark to have an intuitive sense for the paper's claim that their accuracy is meaningfully in the same ballpark as full-image models. One interesting point the paper made was that their system, while limited to small receptive fields for the patch features, can use an entirely different model for mapping patches to logits once the patches are selected, and so can benefit from more powerful generic classification models being tacked onto the end. 
[link]
The fundamental unit of Reinforcement Learning is the reward function, with a core assumption of the area being that actions induce rewards, with some actions being higher reward than others. But reward functions are just artificial objects we design to induce certain behaviors; the universe doesn't hand out "true" rewards we can build off of. Inverse Reinforcement Learning as a field is rooted in the difficulty of designing reward functions, and has the aspiration of, instead of requiring a human to hand-code a reward function, inferring rewards from observing human behavior. The rough idea is that if we imagine a human is (even if they don't know it) operating so as to optimize some set of rewards, we might be able to infer that set of underlying incentives from their actions, and, once we've extracted a reward function, use that to train new agents. This is a mathematically quite tricky problem, for the basic reason that a human's actions are often consistent with a wide range of possible underlying "policy" parameters, and also that a given human policy could be optimal for a wide range of underlying reward functions. This paper proposes putting an adversarial frame on the problem, where you learn a reward function by trying to make reward higher for the human demonstrations you observe, relative to the actions the agent itself is taking. This has the effect of trying to learn an agent that can imitate human actions. However, it specifically designs its model structure to allow it to go beyond just imitation. The problem with learning a purely imitative policy is that it's hard for the model to separate out which actions the human is taking because they are intrinsically high reward (like, perhaps, eating candy), versus actions which are only valuable in a particular environment (perhaps opening a drawer if you're in a room where that's where the candy is kept). 
If you didn't realize that the real reward was contained in the candy, you might keep opening drawers, even if you're in a room where the candy is lying out on the table. In mathematical terms, separating intrinsic vs instrumental (also known as "shaped") rewards is a matter of separating the reward associated with a given state from the value of taking a given action at that state, because the value of your action is only borne out based on assumptions about how states transition into each other, which is a function of the specific state-to-state dynamics of the environment you're in. The authors do this by defining a g(s) function and an h(s) function. They then define their overall reward of an action as (g(s) + h(s')) - h(s), where s' is the new state you end up in if you take the action. https://i.imgur.com/3ENPFVk.png This follows the natural form of a Bellman update, where your expected future value at time T should be equal to the reward you achieve at time T plus your expected future value at time T+1. https://i.imgur.com/Sd9qHCf.png By adopting this structure, and learning a separate neural network to capture the h(s) function representing the value from here to the end, the authors make it the case that the g(s) function is a purer representation of the reward at a state, regardless of what we expect to happen in the future. Using this, they're able to use the learned reward to bootstrap good behavior in new environments, even in contexts where a learned value function would be invalid because its assumptions about instrumental value no longer hold. They compare their method to the baseline of GAIL, which is a purely imitation-learning approach, and show that theirs is better able to transfer to environments with similar states but different state-to-state dynamics. 
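The decomposition can be sketched numerically (all state names and values here are made up; with a discount of 1, the h terms telescope over a trajectory, so only g and the endpoint values of h survive, which is what lets g absorb the intrinsic reward):

```python
def shaped_reward(g, h, s, s_next, gamma=1.0):
    # The reviewed decomposition: state reward g plus a potential-style
    # shaping term built from the value-like function h.
    return g(s) + gamma * h(s_next) - h(s)

g = lambda s: 1.0 if s == "candy" else 0.0           # intrinsic reward: being at the candy
h = {"start": 0.0, "drawer": 2.0, "candy": 5.0}.get  # made-up value estimates

trajectory = ["start", "drawer", "candy"]
total = sum(shaped_reward(g, h, s, sn) for s, sn in zip(trajectory, trajectory[1:]))
```

Because the intermediate h("drawer") term cancels, the drawer's apparent value contributes nothing to the trajectory's total; that cancellation is the mechanical reason the instrumental value of drawer-opening can be kept out of g.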
[link]
In the area of explaining model predictions over images, there are two main strains of technique: methods that look for pixels that have the highest gradient effect on the output class, and assign those as the “reason” for the class, and approaches that ask which pixel regions are most responsible for a given classification, in the sense that the classification would change the most if they were substituted with some uninformative reference value. The tricky thing about the second class of methods is that you need to decide what to use as your uninformative fill-in value. It’s easy enough to conceptually pose the problem of “what would our model predict if it couldn’t see this region of pixels,” but as a practical matter, these models take in full images, and you have to give the classifier *something* in a region, if you’re testing what the score would be if you removed the information contained in the pixels in that region. What should you fill in instead? The simplest answers are things like “zeros”, or “a constant value”, or “white noise”. But all of these are very off-distribution for the model; it wouldn’t typically have seen images that resemble white noise, or all zeros, or all a single value. So if you measure the change in your model score from an off-distribution baseline to your existing pixels, you may not be getting the marginal value of the pixels, so much as the marginal disutility of having something so different from what the model has previously seen. There are other, somewhat more sensible approaches, like blurring out the areas around the pixel region of interest, but these experience less intense forms of the same issue. This paper proposes instead using generative models to fill in the regions conditioned on the surrounding pixels, and using that as a reference. 
The notion here is that a conditioned generative model, like a GAN or VAE, can take into account the surrounding pixels, and “imagine” a fill-in that flows smoothly from the surrounding pixels, and looks generally like an image, but which doesn’t contain the information from the pixels in the region being tested, since it wasn’t conditioned on that. https://i.imgur.com/2fKnY0M.png Using this approach, the authors run two types of test: one where they optimize to find the smallest region they can remove from the image and have it switch class (Smallest Deletion Region, or SDR), and one where they look for the smallest informative region that can be added to an otherwise uninformative image and have the model predict the class connected to that region. They find that regions calculated using their generative-model fill-in, and specifically with GANs, are smaller and more compact as explanations for the prediction, which is consistent with both human intuitions and also with a higher qualitative sensibleness of the explanations found. 
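The scoring logic itself is independent of which fill-in you choose: the saliency of a region is the drop in class score when that region is replaced by the reference. A minimal sketch, where the "classifier" and the infill function are toy stand-ins (a GAN/VAE inpainter would fill the `infill` role in the paper's setup):

```python
import numpy as np

def region_saliency(image, mask, classify, infill):
    """Score drop when the masked region is replaced by the infill.

    mask: boolean array, True over the region being tested.
    classify: image -> class score (here, any scalar function).
    infill: (image, mask) -> full image with the masked region replaced.
    """
    filled = infill(image, mask)
    return classify(image) - classify(filled)

# Toy stand-ins: a "classifier" that only looks at the top-left quadrant,
# and an infill that replaces the region with the image's mean value.
def classify(img):
    return img[:2, :2].sum()

def mean_infill(img, mask):
    out = img.copy()
    out[mask] = img[~mask].mean()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
relevant = np.zeros((4, 4), dtype=bool)
relevant[:2, :2] = True      # region overlapping what the classifier uses
irrelevant = np.zeros((4, 4), dtype=bool)
irrelevant[2:, 2:] = True    # region the classifier ignores

# Masking the relevant region changes the score; the irrelevant one doesn't.
assert region_saliency(img, relevant, classify, mean_infill) != 0.0
assert region_saliency(img, irrelevant, classify, mean_infill) == 0.0
```

The paper's point is entirely about what goes in `mean_infill`'s slot: a conditioned generative model keeps `filled` on-distribution, so the score difference reflects the information removed rather than the strangeness of the replacement.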
[link]
This paper tries to do an exhaustive empirical evaluation of the question of how effectively you can reduce the number of training steps needed to train your model by increasing the batch size. This is an interesting question because it’s becoming increasingly the case that computational costs scale very slowly with additional datapoints added to a batch, and so your per-example cost will be lower, the larger a batch you do your gradient calculation in. In the most ideal world, we might imagine that there’s a perfect tradeoff relationship between batch size and training steps. As a simplistic example, if it were the case that your model only needed to see each observation in the dataset once in order to obtain some threshold of accuracy, and there was an unbounded ability to trade off batch size against training steps, then one might imagine that you could just take one large step based on the whole dataset (in which case you’d then be doing not Stochastic Gradient Descent, but just Batch Gradient Descent). However, there’s reason to suspect that this won’t be possible; for one thing, it seems like having multiple noisier steps is better for optimization than taking one single step of training. https://i.imgur.com/uwCfBJR.png This paper set out to do a large-scale evaluation of what this behavior looks like over a range of datasets. They did so by setting a target test error rate, and then measuring how many training steps were necessary to reach that error rate, for a given batch size. For fairness, they tuned hyperparameters separately for each batch size. They found that, matching some theoretical predictions, at small to medium batch sizes, your increase in batch size pays off 1:1 in fewer needed training steps. As batch size increases more, the tradeoff curve diverges from 1:1, and eventually goes flat, meaning that, even if you increase your batch size more, you can no longer go any lower in terms of training steps. 
This seems to me connected to the idea that having a noisy, multi-step search process is useful for the non-convex environments that neural net optimizers are working in. https://i.imgur.com/ycigYVX.png A few other notes from the paper:
- Different model architectures can extend 1:1 scaling to higher batch sizes, and thus plateau at a lower number of training steps
- Momentum also has the effect of plateauing at a lower number of needed training steps
- It’s been previously suggested that you need to scale the optimal learning rate linearly, or according to the square root of the batch size, in order to maintain best performance. The authors find that there are different optimal learning rates across batch sizes, but that they aren’t well-approximated by a linear or square-root relationship 
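The shape of the curve the paper measures - a 1:1 regime followed by a plateau - is captured by a simple two-parameter model proposed in related work on "critical batch sizes" (not this paper's own analysis): steps(B) = S_min * (1 + B_noise / B). The constants below are made up purely for illustration:

```python
# Two-parameter model of steps-to-target-error as a function of batch size:
# S_MIN is the plateau (minimum achievable steps) and B_NOISE sets where the
# 1:1 regime ends. Both values are hypothetical, for illustration only.
S_MIN = 1_000
B_NOISE = 256

def steps_to_target(batch_size):
    return S_MIN * (1 + B_NOISE / batch_size)

# Small-batch regime: doubling the batch size roughly halves the steps (1:1).
small, small2x = steps_to_target(4), steps_to_target(8)
assert 1.9 < small / small2x < 2.0

# Large-batch regime: more batch size barely buys any fewer steps (plateau).
large, large2x = steps_to_target(16_384), steps_to_target(32_768)
assert 1.0 < large / large2x < 1.01
```

In this model, B_NOISE is exactly the "critical batch size" where the curve bends: below it, batch size pays off nearly 1:1 in steps; above it, you are just burning compute.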
[link]
This was definitely one of the more conceptually nuanced and complicated papers I’ve read recently, and I’ve only got about 60% confidence that I fully grasp all of its intuitions. However, I’m going to try to collect together what I did understand. There is a lot of research into generative models of text or image sequences, and some amount of research into building “models” in the reinforcement learning sense, where your model can predict future observations given current observations and your action. There’s an important underlying distinction here between model-based RL (where you learn a model of how the world evolves, and use that to optimize reward) and model-free RL (where you don’t bother explicitly learning a world model, and just directly try to optimize rewards). However, this paper identifies a few limitations of this research. 1) It’s largely focused on predicting observations, rather than predicting *state*. State is a bit of a fuzzy concept, and corresponds to, roughly, “the true underlying state of the game”. An example I like to use is a game where you walk in one door, and right next to it is a second door, which requires you to traverse the space and find rewards and a key before you can open it. Now, imagine that the observation of your agent is it looking at the door. If the game doesn’t have any on-screen representation of the fact that you’ve found the key, it won’t be present in your observations, and you’ll observe the same thing at the point where you have just entered and at the point after you’ve found the key. However, the state of the game at these two points will be quite different, in that in the latter case, your next states might be “opening the door” rather than “going to collect rewards”. Scenarios like this are referred to broadly as Partially Observable games or environments. 
This paper wants to build a model of how the game evolves into the future, but it wants to build a model of *state-to-state* evolution, rather than observation-to-observation evolution, since observations are typically both higher-dimensional and also more noisy/less informative. 2) Past research has typically focused on predicting each next-step observation, rather than teaching models to directly predict a state many steps in the future, without having to roll out the entire sequence of intermediate predictions. This is arguably quite valuable for making models that can predict the long-term consequences of their decisions. This paper approaches that with an approach inspired by the Temporal Difference framework used in much of RL, in which you update your past estimate of future rewards by forcing it to be consistent with the actual observed rewards you encounter in the future. Except, in this model, we sample a state (z1) and then a state some distance into the future (z2), and try to make our backwards-looking prediction of the state at time 1, taking into account observations that happened in between, match what our prediction was with only the information available at time 1. An important mechanistic nuance here is the idea of a “belief state”, something that captures all of your knowledge about game history up to a certain point. We can then directly sample a state Zt given the belief state Bt at that point. This isn’t actually possible with a model where we predict a state at time T given the state at time T-1, because the state at time T-1 is itself a sample, and so in order to get a full distribution over Zt, you have to sample Zt over the distribution of Zt-1, and in order to get the distribution of Zt-1 you have to sample over the distribution of Zt-2, and so on and so on. Instead, we have a separate non-state variable, Bt, that is created conditional on all our past observations (through an RNN). 
https://i.imgur.com/N0Al42r.png All said and done, the mechanics of this model look like:
1) Pick two points along the sequence trajectory.
2) Calculate the belief state at each point, and, from that, construct a distribution over states at each timestep using p(z|b).
3) Have an additional model that predicts z1 given z2, b1, and b2 (that is, the future beliefs and states), and push the distribution over z1 from this model to be close to the distribution over z1 given only the information available at time t1.
4) Have a model that predicts z2 given z1 and the time interval ahead that we’re jumping, and try to have this model be predictive/have high likelihood over the data.
5) And, have a model that predicts an observation at time t2 given the state z2, and train that so that we can convert our way back to an observation, given a state.
They mostly test it on fairly simple environments, but it’s an interesting idea, and I’d be curious to see other people develop it in the future. (A strange aspect of this model is that, as far as I can tell, it’s non-interventionist, in that we’re not actually conditioning on agent action, or trying to learn a policy for an agent. This is purely trying to learn the long-term transitions between states.) 
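The five pieces above can be sketched structurally as one forward pass. Everything here is a toy stand-in: tiny random linear maps play the role of the paper's neural networks, the time-interval conditioning in the transition model is omitted, and only the overall wiring of the loss is meant to match:

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_B, D_Z = 3, 4, 2

def gauss_logp(x, mu, logvar):
    return -0.5 * np.sum(logvar + (x - mu) ** 2 / np.exp(logvar) + np.log(2 * np.pi))

def linear(d_in, d_out):
    W = rng.normal(scale=0.3, size=(d_out, d_in))
    return lambda x: W @ x

# The five model pieces, as tiny random maps (real versions are neural nets):
belief_rnn = linear(D_OBS + D_B, D_B)            # b_t from obs_t and b_{t-1}
p_B = (linear(D_B, D_Z), linear(D_B, D_Z))       # state given belief: (mu, logvar)
q_smooth = (linear(D_Z + 2 * D_B, D_Z), linear(D_Z + 2 * D_B, D_Z))  # z1 | z2, b1, b2
p_T = (linear(D_Z, D_Z), linear(D_Z, D_Z))       # z2 | z1 (jumpy transition)
p_D = (linear(D_Z, D_OBS), linear(D_Z, D_OBS))   # observation | state

def sample(params, inp):
    mu, logvar = params[0](inp), params[1](inp)
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape), mu, logvar

# 1) Pick two points t1 < t2 on a trajectory; run the belief RNN up to each.
obs = rng.normal(size=(6, D_OBS))
b = np.zeros(D_B)
beliefs = []
for o in obs:
    b = np.tanh(belief_rnn(np.concatenate([o, b])))
    beliefs.append(b)
t1, t2 = 1, 4
b1, b2 = beliefs[t1], beliefs[t2]

# 2) Sample z2 from the belief at t2, then z1 "backwards" from z2, b1, b2.
z2, mu_b2, lv_b2 = sample(p_B, b2)
z1, mu_q, lv_q = sample(q_smooth, np.concatenate([z2, b1, b2]))

# 3-5) Score z1 under the time-t1 belief, z2 under the jumpy transition from
# z1, and the observation at t2 under the decoder; penalize the sampling
# distributions, so minimizing this pushes the pieces toward consistency.
loss = (gauss_logp(z2, mu_b2, lv_b2) + gauss_logp(z1, mu_q, lv_q)
        - gauss_logp(z1, *[f(b1) for f in p_B])
        - gauss_logp(z2, *[f(z1) for f in p_T])
        - gauss_logp(obs[t2], *[f(z2) for f in p_D]))
assert np.isfinite(loss)
```

This is only a skeleton of the objective's structure - in the actual paper the distributions are parametrized by trained networks and the gradient of this loss is what does the work.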
[link]
Current work in image generation (and generative models more broadly) can be split into two broad categories: implicit models, and likelihood-based models. Implicit models are a category that predominantly consists of GANs, which learn how to put pixels in the right places without actually learning a joint probability model over pixels. This is a detriment for applications where you do actually want to be able to calculate probabilities for particular images, in addition to simply sampling new images from your model. Within the class of explicit probability models, the autoencoder and the autoregressive model are the two most central and well-established. An autoencoder works by compressing information about an image into a central lower-dimensional “bottleneck” code, and then trying to reconstruct the original image using the information contained in the code. This structure works well for capturing global structure, but is generally weaker at local structure, because by convention images are generated through stacked convolutional layers, where each pixel in the image is sampled separately, albeit conditioned on the same latent state (the value of the layer below). This is in contrast to an autoregressive decoder, where you apply some ordering to the pixels, and then sample them in sequence: starting with a prior over the first pixel, and then the second conditional on the first, and so on. In this setup, instead of simply expecting your neighboring pixels to coordinate with you because you share latent state, the model actually has visibility into the particular pixel sampled at the prior step, and has the ability to condition on that. This leads to higher-precision generation of local pixel structure with these models. 
If you want a model that can get the best of all of these worlds - high local precision, good global structure, and the ability to calculate probabilities - a sensible approach might be to combine the two: to learn a global, compressed code using an autoencoder, and then, conditioning on that autoencoder code as well as the last sampled values, generate pixels using an autoregressive decoder. However, in practice, this has proved tricky. At a high level, this is because the two systems are hard to balance with one another, and different kinds of imbalance lead to different failure modes. If you constrain the expressive power of your global code too much, your model will just give up on having global information, and just condition pixels on surrounding (past-sampled) pixels. But, by contrast, if you don’t limit the capacity of the code, then the model puts even very local information into the code and ignores the autoregressive part of the model, which keeps the code from playing its desired role as a global specifier of content. This paper suggests a new combination approach, whereby we jointly train an encoder and autoregressive decoder, but instead of training the encoder on the training signal produced by that decoder, we train it on the training signal we would have gotten from decoding the code into pixels using a simpler decoder, like a feed-forward network. The autoregressive network trains on the codes from the encoder as the encoder trains, but it doesn’t actually pass any signal back to it. Basically, we’re training our global code to believe it’s working with a less competent decoder, and then substituting our autoregressive decoder in during testing. https://i.imgur.com/d2vF2IQ.png Some additional technical notes:
- Instead of using a more traditional continuous-valued bottleneck code, this paper uses the VQ-VAE tactic of discretizing code values, to be able to more easily control code capacity.
This essentially amounts to generating code vectors as normal, clustering them, passing their cluster medians forward, and then ignoring the fact that none of this is differentiable and passing back gradients with respect to the median.
- For their auxiliary decoders, the authors use both a simple feed-forward network, and also a more complicated network, where the model needs to guess a pixel using only the pixel values outside of a window around that pixel. The goal of the latter variant is to experiment with a decoder that can’t use local information, and can only use global information. 
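The discretization step is easy to state in code. The standard VQ-VAE variant (which the clustering-and-medians description above is a loose account of) snaps each encoder vector to its nearest codebook entry on the forward pass, then pretends the operation was the identity on the backward pass. A minimal numpy sketch of the forward pass, with a random codebook standing in for a learned one:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 8, 4                          # codebook size, code dimension
codebook = rng.normal(size=(K, D))   # learned in VQ-VAE; random here

def quantize(z_e):
    """Snap each encoder output vector to its nearest codebook entry."""
    # (N, K) matrix of squared distances from encoder outputs to code entries
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

z_e = rng.normal(size=(5, D))        # pretend encoder outputs for 5 positions
z_q, idx = quantize(z_e)

# Every quantized vector is an actual codebook entry...
assert all((z_q[i] == codebook[idx[i]]).all() for i in range(len(z_e)))
# ...and specifically the nearest one to the encoder output.
dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
assert (dists.argmin(axis=1) == idx).all()

# The "ignore the non-differentiability" part: in an autodiff framework the
# backward pass copies gradients straight through, e.g. in pseudocode:
#   z_q = z_e + stop_gradient(quantize(z_e) - z_e)
```

Capacity control then comes for free: with K entries per code position, each position carries at most log2(K) bits, no matter how expressive the encoder is.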
[link]
Unsupervised representation learning is a funny thing: our aspiration in learning representations from data is typically that they’ll be useful for future tasks, but, since we (by definition) don’t have access to labels, our approach has historically been to define heuristics, such as representing the data distribution in a low-dimensional space, and hope that those heuristics translate to useful learned representations. And, to a fair extent, they have. However, this paper’s goal is to attack this problem more directly, by explicitly meta-learning an unsupervised update rule so that it performs well on future tasks. They do this by: https://i.imgur.com/EEkpW9g.png 1) Defining a parametrized weight update function, to slot into the role that Stochastic Gradient Descent on a label-defined loss function would play in a supervised network. This function calculates a “hidden state”, is defined for each neuron in each layer, and takes in the pre- and post-nonlinearity activations for that batch, the hidden state of the next layer, and a set of learned per-layer “backwards weights”. The weight update for that neuron is then calculated using the current hidden state, the last batch’s hidden state, and the current value of the weight. In the traditional way of people in this field who want to define some generic function, they instantiate these functions as MLPs. 2) Using that update rule on the data from a new task, taking the representation resulting from applying the update rule, and using it in a linear regression with a small number of samples. The generalization performance from this k-shot regression, taken in expectation over multiple tasks, acts as the meta training objective. By backpropagating from this objective, to the weight values of the representation, and from there to the parameters of the update rule, they incentivize their updater to learn representations that are useful across some distribution of tasks. 
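One full evaluation of the meta-objective can be sketched with stand-ins. Here a simple Hebbian-style rule with two scalar meta-parameters plays the role of the paper's MLP update rule, and the representation is scored by a small ridge regression on a handful of labeled samples; the meta-training loop would average this score over many tasks and backpropagate into the meta-parameters. Everything below is a toy substitute for the paper's machinery:

```python
import numpy as np

rng = np.random.default_rng(0)

def meta_objective(meta_theta, X_unsup, X_train, y_train, X_test, y_test):
    """Score one candidate update rule on one task (lower is better)."""
    d_in, d_rep = X_unsup.shape[1], 8
    W = rng.normal(scale=0.1, size=(d_rep, d_in))
    # Inner loop: unsupervised updates. A Hebbian-style rule parametrized by
    # meta_theta = (learning rate, decay) stands in for the paper's MLP rule.
    lr, decay = meta_theta
    for x in X_unsup:
        h = np.tanh(W @ x)
        W += lr * np.outer(h, x) - decay * W
    # Outer evaluation: k-shot ridge regression on the learned representation.
    H_train = np.tanh(X_train @ W.T)
    H_test = np.tanh(X_test @ W.T)
    reg = np.linalg.solve(H_train.T @ H_train + 1e-2 * np.eye(d_rep),
                          H_train.T @ y_train)
    return float(np.mean((H_test @ reg - y_test) ** 2))

# One toy task: unlabeled data for the inner loop, a few labeled points for
# the k-shot regression, and held-out points whose error is the meta-score.
X_unsup = rng.normal(size=(50, 5))
w_true = rng.normal(size=5)
X_tr, X_te = rng.normal(size=(10, 5)), rng.normal(size=(20, 5))
y_tr, y_te = X_tr @ w_true, X_te @ w_true
loss = meta_objective((0.01, 0.001), X_unsup, X_tr, y_tr, X_te, y_te)
assert np.isfinite(loss) and loss >= 0.0
```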
A slightly weird thing about this paper is that they train on image datasets, but shuffle the pixels and use a fully connected network rather than a conv net. I presume this has to do with the complexities of defining a weight update rule for a convolution, but it does make it harder to meaningfully compare with other image-based unsupervised systems, which are typically done using convolution. An interesting thing they note is that, early in meta-training on images, their update rules generalize fairly well to text data. However, later in training the update rules seem to have specialized to images, and generalize more poorly to text. 
[link]
In the two years since it was introduced, the Transformer architecture, which uses multi-headed self-attention in lieu of recurrent models to consolidate information about input and output sequences, has more or less taken the world of language processing and understanding by storm. It has become the default choice for language problems like translation and question answering, and was the foundation of OpenAI’s massive trained language model GPT. In this context, I really appreciate this paper’s work to try to build our collective intuitions about the structure, specifically by trying to understand how the multiple heads that make up the aforementioned multi-head attention divide up importance and specialize in function. As a quick orientation: attention works by projecting each element in the sequence into query, key, and value vectors. Then, each element in the sequence creates its next-layer value by calculating a function of its query and every other element’s key (typically a dot product), and putting those scores through a softmax. This weighting distribution is then used as the weights of a weighted average, combining the values together. By default this is a single operation, with a single set of projection matrices, but in the Transformer approach, they use multi-headed attention, which simply means that they learn independent parameters for multiple independent attention “filters”, each of which can then notionally specialize to pull in and prioritize a certain kind of information. https://i.imgur.com/yuC91Ja.png The high-level theses of this paper are:
- Among attention heads, there’s a core of the most crucial and important ones, and then a long tail of heads that can be pruned (or, have their weight in the concatenation of all attention heads brought to nearly zero) with a small effect on performance
- It’s possible and useful to divide up the heads according to the kinds of other tokens they most consistently pull information from. 
The authors identify three: positional, syntactic, and rare-word heads. Positional heads consistently (>90% of the time) put their maximum attention weight on the same position relative to the query word. Syntactic heads recurringly attend to tokens in the same grammatical relation to the query - the subject to its verb, or the adjective to its noun, for example. The rare-word pattern is not a frequently occurring one, but such a head is very valuable within the first layer, and will consistently put its highest weight on the lowest-frequency word in a given sentence. An interesting side note here is that the authors tried, at multiple stages in pruning, to retrain a network from scratch using only the connections between unpruned heads. However, in an effect reminiscent of the Lottery Ticket Hypothesis, retraining from scratch cannot get quite the same performance. 
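The attention mechanics described above, plus the per-head gates the pruning analysis relies on, can be sketched in a few lines (random weights standing in for learned projections; the gate here is a plain scalar, where the paper learns gates under a sparsity penalty):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, gates):
    """X: (seq, d_model). heads: list of (Wq, Wk, Wv). gates: per-head scalars.

    Each head projects to queries/keys/values, softmaxes scaled dot products,
    and takes a weighted average of values; head outputs are concatenated,
    each scaled by its gate (pruning drives unimportant gates toward zero).
    """
    outs = []
    for (Wq, Wk, Wv), g in zip(heads, gates):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # (seq, seq)
        assert np.allclose(weights.sum(axis=1), 1.0)       # rows are distributions
        outs.append(g * (weights @ V))
    return np.concatenate(outs, axis=1)

d_model, d_head, n_heads, seq = 8, 4, 2, 5
X = rng.normal(size=(seq, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

full = multi_head_attention(X, heads, gates=[1.0, 1.0])
pruned = multi_head_attention(X, heads, gates=[1.0, 0.0])
# Gating a head to zero removes exactly its slice of the concatenated output.
assert np.allclose(pruned[:, :d_head], full[:, :d_head])
assert np.allclose(pruned[:, d_head:], 0.0)
```

The head-type analysis then amounts to inspecting each head's `weights` matrix across many sentences: which relative position, or which grammatical relation, its argmax keeps landing on.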
[link]
https://i.imgur.com/JJFljWo.png This paper follows in a recent tradition of results out of Samsung: in the wake of StyleGAN’s very impressive generated images, it uses a lot of similar architectural elements, combined with meta-learning and a new discriminator framework, to generate convincing “talking head” animations based on a small number of frames of a person’s face. Previously, models that generated artificial face videos could only do so by training on a large number of frames of each individual speaker that they wanted to simulate. This system instead is able to generate video in a few-shot way: it only needs one or two frames of a new speaker to do convincing generation. The structure of talking-head video generation as a problem relies on the idea of “landmarks”: explicit parametrizations of where the nose, the eyes, the lips, and the head are oriented in a given shot. The model is trained to generate frames of a specified person (based on an input frame), in a specific pose (based on an input landmark set). While the visual quality of the simulated video generated here is quite stunning, the most centrally impressive fact about this paper is that generation was only conditioned on a few frames of each target person. This is accomplished through a combination of meta-learning (as an overall training procedure/regime) and adaptive instance normalization, a way of dynamically parametrizing models that was earlier used in the StyleGAN work. Meta-learning works by doing simulated few-shot training iterations, where a model is trained for a small number of steps on a given “task” (where here a task is a given target face), and then optimized on the meta-level to be able to get good test-set error rates across many such target faces. 
https://i.imgur.com/RIkO1am.png The mechanics of how this meta-learning approach actually works are quite interesting: largely a new application of existing techniques, but with some extensions and innovations worked in.
- A convolutional model produces an embedding given an input image and a pose. An average embedding is calculated by averaging over different frames, with the hope of capturing information about the video in a pose-independent way. This embedding, along with a goal set of landmarks (i.e. the desired facial expression of your simulation), is used to parametrize the generator; a discriminator is then asked to determine whether the generated image looks like it came from the sequence belonging to the target face, and looks like it corresponds to the target pose
- Adaptive instance normalization works by having certain parameters of the network (typically, per the name, post-normalization rescaling values) be dependent on the properties of some input data instance. This works by training a network to produce an embedding vector of the image, and then multiplying that embedding by per-layer, per-filter projection matrices to obtain new parameters. This is in particular a reasonable thing to do in the context of conditional GANs, where you want the parameters of your generator to be conditioned on the content of the image you’re trying to simulate
- This model structure gives you a natural way to do few-shot generation: you can train your embedding network, your generator, and your projection matrices over a large dataset, where they’ve hopefully learned how to compress information from any given target image and generate convincing frames based on it, so that you can just pass in your new target image, have it transformed into an embedding, and have it contain information the rest of the network can work with
- This model uses a relatively new (~mid 2018) formulation of a conditional GAN, called the projection discriminator. 
I don’t have time to fully explain this here, but at a high level, it frames the problem of a discriminator determining whether a generated image corresponds to a given conditioning class by projecting both the class and the image into vectors, and calculating a similarity-esque dot product.
- During few-shot application of this model, it can get impressively good performance without even training on the new target face at all, simply by projecting the target face into an embedding, and updating the target-specific network parameters that way. However, they do get better performance if they fine-tune to a specific person, which they do by treating the embedding-projection parameters as an initialization, and then taking a few steps of gradient descent from there 
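That dot-product framing of the projection discriminator can be sketched directly. The logit is an unconditional realism term plus a projection term measuring how well the image's features match the conditioning embedding; all the weights below are random stand-ins for the learned feature extractor and projections:

```python
import numpy as np

rng = np.random.default_rng(0)

D_FEAT, D_EMB = 16, 8

# Toy stand-ins for learned pieces: W_phi extracts image features (phi),
# w_psi maps features to an unconditional realism score (psi), and V
# projects the conditioning embedding into feature space.
W_phi = rng.normal(size=(D_FEAT, 32))
w_psi = rng.normal(size=D_FEAT)
V = rng.normal(size=(D_EMB, D_FEAT))

def projection_discriminator(image_vec, cond_embedding):
    """Logit = psi(phi(x)) + y^T V phi(x): realism term + match term."""
    feat = np.tanh(W_phi @ image_vec)        # phi(x)
    uncond = w_psi @ feat                    # psi(phi(x)): "is this real?"
    cond = cond_embedding @ (V @ feat)       # y^T V phi(x): "does it match y?"
    return uncond + cond

x = rng.normal(size=32)                      # stand-in "image"
y_a, y_b = rng.normal(size=D_EMB), rng.normal(size=D_EMB)

# The same image gets different logits under different conditioning vectors,
# and the difference is exactly the projection term.
diff = projection_discriminator(x, y_a) - projection_discriminator(x, y_b)
feat = np.tanh(W_phi @ x)
assert np.isclose(diff, (y_a - y_b) @ (V @ feat))
```

In this paper's setting, the conditioning vector is the target face's embedding, so the projection term is what lets one discriminator ask "does this frame belong to this person?" across many people.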
[link]
[Machine learning is a nuanced, complicated, intellectually serious topic...but sometimes it’s refreshing to let ourselves be a bit less serious, especially when it’s accompanied by clear, cogent writing on a topic. This paper is a particularly delightful example of good-natured silliness - from the dataset name HellaSwag to figures containing cartoons of BERT and ELMO representing language models - combined with interesting science.] https://i.imgur.com/CoSeh51.png This paper tackles the problem of natural language comprehension, which asks: okay, our models can generate plausible-looking text, but do they actually exhibit what we would consider true understanding of language? One natural structure of task for this is to take questions or “contexts” and, given a set of possible endings or completions, pick the correct one. Positive examples are relatively easy to come by: adjacent video captions and question/answer pairs from WikiHow are two datasets used in this paper. However, it’s more difficult to come up with *negative* examples. Even though our incorrect endings won’t be a meaningful continuation of the sentence, we want them to be “close enough” that we can feel comfortable attributing a model’s ability to pick the correct answer as evidence of some meaningful kind of comprehension. As an obvious failure mode, if the alternative multiple-choice options were all the same word repeated ten times, they would be recognizable as not being real language, and it would be easy for a model to select the answer with the distributional statistics of real language, but it wouldn’t prove much. Typically, failure modes aren’t this egregious, but the overall intuition still holds, and will inform the rest of the paper: your ability to test comprehension on a given dataset is a function of how contextually relevant and realistic your negative examples are. 
Previous work (by many of the same authors as this paper) proposed a technique called Adversarial Filtering to try to solve this problem. In Adversarial Filtering, a generative language model is used to generate many possible endings conditioned on the input context, to be used as negative examples. Then, a discriminator is trained to predict the correct ending given the context. The generated samples that the discriminator had the highest confidence classifying as negative are deemed to be not challenging enough comparisons, and they’re thrown out and replaced with others from our pool of initially-generated samples. Eventually, once we’ve iterated through this process, we have a dataset with hopefully realistic negative examples. The negative examples are then given to humans to ensure none of them are actually meaningful endings to the sentence. The dataset released by the earlier paper, which used as its generator a relatively simple LSTM model, was called SWAG. However, the authors came to notice that the performance of new language models (most centrally BERT) on this dataset might not be quite what it appears: its success rate of 86% only goes down to 76% if you don’t give the classifier access to the input context, which means it can get to 76% (off of a random baseline of 25%, with 4 options) simply by evaluating which endings are coherent as standalone bits of natural language, without actually having to understand or even see the context. Also, shuffling the words in the possible endings had a similarly small effect: the authors are able to get BERT to perform at 60% accuracy on the SWAG dataset with no context and with shuffled words in the possible answers, meaning it’s purely selecting based on the distribution of words in the answer, rather than on the meaningfully-ordered sequence of words. 
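The Adversarial Filtering loop described at the top of this paragraph can be sketched as a simple replacement process. The scorer below is a deliberately silly stand-in (a trained discriminator, retrained each round, does this job in the real pipeline), and the pool would come from a generative language model:

```python
def adversarial_filter(context, candidates, n_keep, discriminator, rounds=10):
    """Keep the candidate endings the discriminator finds hardest to reject.

    candidates: pool of generated negative endings. Each round we propose a
    fresh ending from the pool and, among kept + proposal, drop whichever one
    the discriminator most confidently flags as fake ("too easy").
    (The real procedure also retrains the discriminator between rounds.)
    """
    pool = list(candidates)
    kept = [pool.pop() for _ in range(n_keep)]
    for _ in range(rounds):
        if not pool:
            break
        kept.append(pool.pop())
        scores = [discriminator(context, e) for e in kept]
        kept.pop(scores.index(max(scores)))   # discard the easiest-to-detect
    return kept

# Toy stand-in discriminator: pretends longer endings look more plausibly
# real (i.e. get lower "fake" scores). A trained model would go here.
def fake_score(context, ending):
    return -len(ending)

pool = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff", "ggggggg"]
kept = adversarial_filter("ctx", pool, n_keep=3, discriminator=fake_score)
# The surviving negatives are the ones this scorer finds hardest to reject.
assert all(len(e) >= 5 for e in kept)
```

The paper's later critique maps directly onto this sketch: the filtered set can only be as hard as the best endings the generator ever put in `pool`.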
https://i.imgur.com/f6vqJWT.png The authors’ overall conclusion from this is: this adversarial filtering method is only as good as the generator, and, more specifically, the training dynamic between the generator that produces candidate endings and the discriminator that filters them. If the generator is too weak, the negative examples can be easily detected as fake by a stronger model, but if the generator is too strong, then the discriminator can’t get good enough to usefully contribute by weeding samples out. They demonstrate this by creating a new version of SWAG, which they call HellaSwag (for the expected acronym-optimization reasons), with a GPT generator rather than the simpler one used before: on this new dataset, all existing models get relatively poor results (30-40% performance). However, the authors’ overall point wasn’t “we’ve solved it, this new dataset is the end of the line,” but rather a call to be wary in the future, and generally aware that with benchmarks like these, especially ones with generated negative examples, it’s going to be a constantly moving target as generation systems get better. 
[link]
This paper out of DeepMind used a Google StreetView dataset and set out to train a network capable of navigating to a given goal destination, without knowing where it was on any bird’s-eye-view map, and with its only input being photographic viewpoint images of its current location and orientation. This was done through a framework of reinforcement learning, where the model is conditioned on a representation of its goal, and given the image features of its current view of the world, and has to take actions like “turn left,” “turn sharply left,” “go forward,” etc., in order to navigate. Rather than lat-long, goals are specified in city-specific ways, in terms of the distances between the goal position and a reference set of landmarks. I don’t entirely understand the motivation behind this approach; the authors say it’s more scalable, but it wasn’t obvious to me why that would be the case. https://i.imgur.com/V3UATsK.png The authors construct different architectures that combine these two fundamental pieces of input data - the current image and the goal you’re trying to reach - in different ways.
- In the simplest model, called GoalNav, there’s a single LSTM that combines the goal information with the output of a convolutional encoder processing images of your current viewpoint.
- In the next most complex, CityNav, there are two LSTMs: one for processing your goal, and the other for combining the output of the goal network with your convolutional inputs, in order to decide on an action. Notionally, this separation of tasks corresponds to “figure out what absolute direction to go in, given your goal” and “figure out how to go in that absolute direction from where you are now”. As a way to support training, the goal network is trained with an auxiliary loss function where it needs to predict how far its current orientation is from North. 
Note that this does pass some amount of information about current location into the model (since the network gets to know its actual orientation relative to true north), but this is only available during training, with the hope that the model will have gotten good enough at predicting orientation to perform well at test time.
- The final model, similar to the above, is called MultiCityNav, and is explicitly designed for transfer learning. Instead of training a single shared network on multiple cities, only the convolutional encoder and policy network (the “how do I go in the absolute direction needed to reach my goal” parts) are shared between cities, and the goal-processing LSTM (the “which direction should I be going in” part) is retrained per city. This is designed to allow for transfer in the parts of learning you would expect to generalize, while allowing the network to learn a city-specific approach for converting between goal specifications (in terms of city landmarks) and directions.
In order to get over the fact that reward in this setting is very sparse (i.e. you only get reward when you reach the goal), the authors (1) train in a curriculum fashion, starting with goals very near the model’s starting point, and gradually getting farther away, and (2) add a small amount of reward shaping, where you get rewarded for moving in the direction of the goal, but only if you’re within 200m of it. This last is a bit of a concession on the realism front, and the authors say as much, but it’s just quite hard to train RL with purely sparse rewards, and it makes sense that reward shaping would help here. Ultimately, they were able to get performance (in terms of goal-reaching rewards) around ¾ as strong as that of an Oracle model, which had access to the full map and could calculate the true shortest path. 
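The shaped-reward scheme is small enough to write out directly: sparse success reward everywhere, plus a progress bonus that only activates inside the 200m radius. The exact constants below are illustrative, not the paper's:

```python
def shaped_reward(dist_before, dist_after, goal_radius=5.0,
                  shaping_radius=200.0, shaping_scale=0.01):
    """Sparse goal reward, plus shaping only near the goal.

    dist_before/dist_after: distance to the goal (meters) before and after
    the action. All constants here are made-up illustrative values.
    """
    if dist_after <= goal_radius:
        return 1.0                                   # sparse success reward
    if dist_after <= shaping_radius:
        # small bonus for progress toward the goal, only within 200m of it
        return shaping_scale * (dist_before - dist_after)
    return 0.0

assert shaped_reward(300.0, 250.0) == 0.0    # far away: fully sparse
assert shaped_reward(160.0, 150.0) > 0.0     # near the goal, moving closer
assert shaped_reward(150.0, 160.0) < 0.0     # near the goal, moving away
assert shaped_reward(10.0, 3.0) == 1.0       # reached the goal
```

Capping the shaping at a radius is what keeps the concession limited: the agent still has to find its way into the goal's neighborhood on sparse reward alone.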
[link]
The Lottery Ticket Hypothesis is the idea that you can train a deep network, set all but a small percentage of its highest-magnitude weights to zero, and retrain the network using the connection topology of the remaining weights, but only if you reinitialize the unpruned weights to the values they had at the beginning of the first training. This suggests that part of the value of training such big networks is not that we need that many parameters to use their expressive capacity, but that we need many “draws” from the weight and topology distribution to find initial weight patterns that are well-disposed for learning. This paper out of Uber is a refreshingly exploratory experimental work that tries to understand the contours and contingencies of this effect. Their findings included: The pruning criterion used in the original paper, where weights are kept according to which have highest final magnitude, works well. However, an alternate criterion, where you keep the weights that have increased the most in magnitude, works just as well and sometimes better. This makes a decent amount of sense, since it seems like we’re using magnitude as a signal of “did this weight come to play a meaningful role during training,” and so weights whose influence increased during training fall in that category, regardless of their starting point. https://i.imgur.com/wTkNBod.png The authors’ next question was: other than just reinitializing weights to their initial values, are there other things we can do that can capture all or part of the performance effect? The answer seems to be yes; they found that the most important thing seems to be keeping the sign of the weights aligned with what it was at the starting point. 
As long as you do that, redrawing initial weights (but giving them the right sign), or resetting weights to a correctly signed constant value, both work nearly as well as the actual starting values. https://i.imgur.com/JeujUr3.png Turning instead to the weights on the pruning chopping block, the authors find that, instead of just zeroing out all pruned weights, they can get even better performance if they zero the weights that moved towards zero during training, and reinitialize (but freeze) the weights that moved away from zero during training. The logic of the paper is “if the weight was trying to move to zero, bring it to zero, otherwise reinitialize it.” This performance remains high at even lower amounts of training than does the initial zero-masking result. Finally, the authors found that just by performing the masking (i.e. keeping only weights with large final values), bringing those back to their initial values, and zeroing out the rest, *and not training at all*, they were able to get 40% test accuracy on MNIST, much better than chance. If they masked according to “large weights that kept the same sign during training,” they could get a pretty incredible 80% test accuracy on MNIST. That is way below even simple trained models, but, again, this model wasn’t *trained*, and the only information about the data came in the form of a binary weight mask. This paper doesn’t really try to come up with explanations that wrap all of these results up neatly with a bow, and I really respect that. I think it’s good for ML research culture for people to feel an affordance to just run a lot of targeted experiments aimed at explanation, and publish the results even if they don’t quite make sense yet. I feel like on this problem (and to some extent in machine learning generally), we’re the blind men each grabbing at one part of an elephant, trying to describe the whole. Hopefully, papers like this can bring us closer to understanding strange quirks of optimization like this one. 
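The two pruning criteria and the sign-preserving reset can be sketched in a few lines. This is my own minimal rendering of the ideas above (shapes, the `keep_frac` default, and the reinit constant are illustrative assumptions), not the paper's implementation:

```python
import numpy as np

def lottery_ticket_mask(w_init, w_final, keep_frac=0.2, criterion="final_magnitude"):
    """Build a binary keep-mask over same-shape weight arrays from before
    and after training, under either pruning criterion discussed above."""
    if criterion == "final_magnitude":        # original LTH criterion
        scores = np.abs(w_final)
    elif criterion == "magnitude_increase":   # the alternate criterion
        scores = np.abs(w_final) - np.abs(w_init)
    else:
        raise ValueError(criterion)
    k = int(keep_frac * scores.size)
    threshold = np.sort(scores.ravel())[-k]   # keep the top-k scoring weights
    return (scores >= threshold).astype(np.float32)

def sign_preserving_reinit(w_init, mask, constant=0.1):
    """The 'keep the sign' finding: reset kept weights to a constant carrying
    the sign they had at initialization (constant value is illustrative)."""
    return mask * np.sign(w_init) * constant
```

The point of the second function is that it throws away all magnitude information from the original initialization, keeping only the sign, which is the variant the authors found works nearly as well as the true initial values.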
[link]
A few years ago, a paper came out demonstrating that adaptive gradient methods (which dynamically scale gradient updates in a per-parameter way according to the magnitudes of past updates) have a tendency to generalize less well than non-adaptive methods, even though adaptive methods sometimes look more performant in training, and are easier to hyperparameter-tune. The 2017 paper offered a theoretical explanation for this fact based on Adam learning less complex solutions than SGD; this paper offers a different one, namely that Adam performs poorly because it is typically implemented alongside L2 regularization, which has importantly different mechanical consequences than it does in SGD. Specifically, in SGD, L2 regularization, where the loss includes both the actual loss and an L2 norm of the weights, can be made equivalent to weight decay, by choosing the right parameters for each (see proof below). https://i.imgur.com/79jfZg9.png However, for Adam, this equivalence doesn’t hold. This is true because, in SGD, all the scaling factors are just constants, and for each learning rate value and regularization parameter, a certain weight decay parameter is implied. However, since Adam scales its parameter updates not by a constant learning rate but by a matrix, it’s not possible to pick other hyperparameters in a way that could get you something similar to constant-parameter weight decay. To solve this, the authors suggest using an explicit weight decay term, rather than just doing implicit weight decay via L2 regularization. This is salient because the L2 norm is added to the *loss function*, so it makes up part of the gradient update, and thus gets scaled down by Adam by the same adaptive mechanism that scales down historically large gradients. 
When weight decay is moved outside of the form of being a norm calculation inside a loss function, and becomes just something applied to the final update but not actually part of the adaptive scaling calculation, the authors find that (1) Adam is able to get comparable performance on image and sequence tasks (where it has previously had difficulty), and (2) even for SGD, where it was possible to find an optimal parameter setting to reproduce weight decay, having an explicit and decoupled weight decay parameter made parameters that were previously dependent on one another in their optimal values (regularization and learning rate) more independent. 
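The mechanical distinction is easiest to see in the update rule itself. Here is a minimal sketch of a decoupled (AdamW-style) step; note the decay term is applied directly to the weights, outside the adaptive denominator, rather than being folded into the gradient as an L2 penalty would be. Hyperparameter defaults are the usual conventions, not necessarily the paper's:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One decoupled-weight-decay Adam step (sketch). The decay term
    `weight_decay * w` bypasses the adaptive scaling entirely."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment running average
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment running average
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

In the implicit-L2 version, `weight_decay * w` would instead be added to `grad` before the moment updates, and so would get divided by `sqrt(v_hat)` like everything else, which is exactly the coupling the paper objects to.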
[link]
Meta learning, or, the idea of training models on some distribution of tasks, with the hope that they can then learn more quickly on new tasks because they have “learned how to learn” similar tasks, has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there’s an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc.), and have the outer loop that updates some structural part of your model be based on improving generalization error on each task within the distribution. It’s been demonstrated that meta-learned systems can in fact learn more quickly (at least when their tasks are “in distribution” relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested in how much better or faster they’re learning, and more interested in whether there are qualitative differences in the way normal learning systems and meta-trained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there’s only the one on this paper) goes about this by studying simple learning tasks where it’s easier for us to introspect into what each model is learning over time. https://i.imgur.com/ceycq46.png In the first test, he looks at linear regression in a simple setting: where for each individual “task,” data is generated according to a known true weight matrix (sampled from a prior over weight matrices), with some noise added in. 
Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where higher eigenvalues on the factors, or “modes,” indicate that those factors represent larger-scale patterns that explain more variance, and lower eigenvalues are smaller-scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training, and compare, to see how close the network is to having correctly captured each of these different modes. When normal learners (starting from a raw initialization) approach the task, they start by matching the large-scale (higher-eigenvalue) factors of variation, and then over the course of training improve performance on the higher-precision factors. By contrast, meta learners, in addition to learning faster, also learn large-scale and small-scale modes at the same rate. Similar analysis was performed, and similar results found, for nonlinear regression, where instead of PCA-style components, the functions generating data were decomposed into different Fourier frequencies, and the normal learner learned the broad, low-frequency patterns first, whereas the meta learner learned them all at the same rate. The paper finds intuition for this by showing that the behavior of the meta learners matches quite well against how a Bayes-optimal learner would update on new data points, in the world where that learner had a prior over the data-generating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior corresponding to the task distribution it was trained on. This is, at a high level, what I think we all sort of thought was happening with meta learning, but it’s pretty neat to see it laid out in a small enough problem where we can actually validate against an analytic model. 
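To make the mode-tracking idea concrete, here is a sketch of the kind of per-mode progress metric described above: project the learned weight matrix onto each singular mode of the true matrix, and ask what fraction of that mode's true singular value has been recovered. The exact metric in the paper may differ; this is my own illustrative version.

```python
import numpy as np

def mode_recovery(w_true, w_learned):
    """For each singular mode of the true weight matrix, measure how much
    of it is present in the learned weights (1.0 = fully captured)."""
    u, s, vt = np.linalg.svd(w_true)
    # strength of each true mode in the learned weights: u_i^T W_learned v_i
    recovered = np.array([u[:, i] @ w_learned @ vt[i, :] for i in range(len(s))])
    return recovered / s
```

A normal learner's `mode_recovery` trace would show the first (high singular value) entries approaching 1.0 before the later ones; the paper's finding is that a meta-trained learner's entries rise together.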
A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose to use the specific eigenvalue-focused metrics of progression on task learning that he did. They seem reasonable, but I’d have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress. (A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel like it wasn’t really that analogous to the other tests.) 
[link]
As per the “holistic” in the paper title, the goal of this work is to take a suite of existing work within semi-supervised learning, and combine many of its ideas into one training pipeline that can (with really impressive empirical success) leverage the advantages of those different ideas. The core premise of semi-supervised learning is that, given true-label training signal from a small number of labels, you can leverage large amounts of unsupervised data to improve your model. A central intuition of many of these methods is that, even if you don’t know the class of a given sample, you know it *has* a class, and you can develop a loss by pushing your model to predict the same class for an example and a modified or perturbed version of that example, since, if you have a prior belief that that modification should not change your true class label, then your unlabeled data point should have the same class prediction both times. Entropy minimization is built off similar notions: although we don’t know a point’s class, we know it must have one, and so we’d like our model to make a prediction that puts more of its weight on a single class, rather than be spread out, since we know the “correct model” will make a very confident prediction of one class, though we don’t know which it is. These methods give context and a frame of mind for understanding the techniques merged together into the MixMatch approach. At its very highest level, MixMatch’s goal is to take in a dataset of both labeled and unlabeled data, and produce a training set of inputs, predictions, and (occasionally constructed or modified) labels to calculate a model update loss from. https://i.imgur.com/6lHQqMD.png First, for each unlabeled example in the dataset, we produce K different augmented versions of that image (by cropping it, rotating it, flipping it, etc.). 
This is in the spirit of the consistency loss literature, where you want your model to make the same prediction across augmentations. We do the same augmentation for each labeled example, but only once per input, rather than K times. We then run all of the augmented unlabeled examples through the model, and take the average of their predictions. This is based on the idea that the average of the predictions will be a lower-variance, more stable pseudo-target to pull each of the individual predictions towards. Also in the spirit of making something more shaped like a real label, they undertake a sharpening step, turning down the temperature of the averaged distribution. This seems like it would have the effect of more confidently pulling the original predictions towards a single “best guess” label. At this point, we have a set of augmented labeled data, with a true label, and also a set of augmented unlabeled data, with a label based off of an averaged and sharpened best guess from the model over different modifications. At this point, the pipeline uses something called “MixUp” (on which there is a previous paper, so I won’t dive into it too much here), which takes pairs of data points, calculates a convex combination of the inputs, runs it through the model, and uses as the loss-function target a convex combination of the outputs. So, in the simple binary case, if you have a positively and a negatively labeled image and sample a combination parameter of 0.75, you have an image that is 0.75 positive, 0.25 negative, and the new label that you’re calculating cross-entropy loss against is 0.75. MixMatch generates pairs for its MixUp calculation by mixing (heh) labeled and unlabeled data together, and pairing each labeled and unlabeled example with one observation from the merged set. 
At this point, we have combined inputs and combined labels, and we can calculate loss between them. With all of these methods combined, this method takes the previous benchmark of 38% error, for a CIFAR dataset with only 250 labels, and drops that to 11%, which is a pretty astonishing improvement in error rate. After performing an ablation study, they find that MixUp itself, temperature sharpening, and calculating K>1 augmentations of unlabeled data rather than K=1 are the strongest value-adds; it doesn’t appear like there’s that much difference that comes from mixing between unlabeled and labeled examples for the MixUp pairs. 
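The two most mechanical pieces of the pipeline, temperature sharpening and MixUp, can be sketched compactly. This is my own minimal rendering of the steps described above; default parameter values are illustrative rather than the paper's tuned settings:

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature-sharpen an averaged class distribution: lower T pushes
    probability mass toward the most likely class."""
    p = p ** (1.0 / T)
    return p / p.sum()

def mixup(x1, y1, x2, y2, alpha=0.75):
    """MixUp on a pair of (input, label) examples, with the mixing weight
    drawn from a Beta(alpha, alpha) distribution."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1 - lam)   # keep the mix closer to the first example
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

The `max(lam, 1 - lam)` line reflects MixMatch's tweak of keeping each mixed example closer to its "own" input, which matters because labeled and unlabeled examples are mixed together and feed different loss terms.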
[link]
This paper blends concepts from variational inference and hierarchical reinforcement learning, learning skills or “options” out of which master policies can be constructed, in a way that allows for both information transfer across tasks and specialization on any given task. The idea of hierarchical reinforcement learning is that instead of maintaining one single policy distribution (a learned mapping between world states and actions), a learning system will maintain multiple simpler policies, and then learn a meta-policy for transitioning between these object-level policies. The hope is that this setup leads to both greater transparency and compactness (because skills are compartmentalized), and also greater ability to transfer across tasks (because if skills are granular enough, different combinations of the same skills can be used to solve quite different tasks). The differentiating proposal of this paper is that, instead of learning skills that will be fixed with respect to the master, task-specific policy, we instead learn cross-task priors over different skills, which can then be fine-tuned for a given specific task. Mathematically, this looks like a reward function that is a combination of (1) actual rewards on a trajectory, and (2) the difference in the log probability of a given trajectory under the task-specific posterior and under the prior. https://i.imgur.com/OCvmGSQ.png This framing works in two directions: it allows a general prior to be pulled towards task-specific rewards, to get more specialized value, but it also pulls the per-task skill towards the global prior. This is both a source of transfer knowledge and general regularization, and also an incentive for skills to be relatively consistent across tasks, because consistent posteriors will be more locally clustered around their prior. 
The paper argues that one advantage of this is a symmetry-breaking effect, avoiding a local minimum where two skills are both being used to solve subtask A, and it would be better for one of them to specialize on subtask B, but in order to do so the local effect would be worse performance of that skill on subtask A, which would be to the overall policy’s detriment because that skill was being actively used to solve that task. Under a prior-driven system, the model would have an incentive to pick one or the other of the options and use that for a given subtask, based on whichever’s prior was closest in trajectory space. https://i.imgur.com/CeFQ9PZ.png On a mechanical level, this set of priors is divided into a few structural parts: (1) a termination distribution, which chooses whether to keep drawing actions from the skill/sub-policy you’re currently on, or trade it in for a new one; this one has a prior set at a Bernoulli distribution with some learned alpha. (2) A skill transition distribution, which chooses, conditional on sampling a “terminate,” which skill to switch to next; this has a prior of a uniform distribution over skills, which incentivizes the learning system to not put all its sampling focus on one policy too early. (3) A distribution of actions given a skill choice, which, as mentioned before, has both a cross-task prior and a per-task learned posterior. 
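The shape of the objective, environment reward minus a weighted divergence between the task-specific posterior and the shared prior over a trajectory, can be sketched like this. The variable names and the trade-off weight `alpha` are my own illustrative choices, not the paper's notation:

```python
import numpy as np

def regularized_return(rewards, log_prob_posterior, log_prob_prior, alpha=0.1):
    """Sketch of the objective on one trajectory: summed environment reward,
    minus alpha times the gap between the trajectory's log probability under
    the task-specific posterior and under the cross-task prior."""
    kl_term = np.sum(log_prob_posterior) - np.sum(log_prob_prior)
    return np.sum(rewards) - alpha * kl_term
```

When the posterior matches the prior the penalty vanishes, so a skill only pays a cost for diverging from its shared prior when the task reward justifies it, which is exactly the two-way pull described above.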
[link]
This paper came on my radar after winning Best Paper recently at ICLR, and all in all I found it a clever way of engineering a somewhat complicated inductive bias into a differentiable structure. The empirical results weren’t compelling enough to suggest that this structural shift made a regime-change difference in performance, but it does seem to have some consistently stronger ability to do syntactic evaluation across large gaps in sentences. The core premise of this paper is that, while language is to some extent sequence-like, it is in a more fundamental sense tree-like: a recursive structure of modified words, phrases, and clauses, aggregating up to a fully complete sentence. In practical terms, this cashes out to parse trees, labels akin to the sentence diagrams that you or I perhaps did once upon a time in grade school. https://i.imgur.com/GAJP7ji.png Given this, if you want to effectively model language, it might be useful to have a neural network structure explicitly designed to track where you are in the tree. To do this, the authors of this paper use a clever activation function scheme based on the intuition that you can think of jumping between levels of the tree as adding information to the stack of local context, and then removing that information from the stack when you’ve reached the end of some local phrase. In the framework of an LSTM, which has explicit gating mechanisms for both “forgetting” (removing information from cell memory) and input (adding information to the representation within cell memory), this can be understood as forcing a certain structure of input and forgetting, where you have to sequentially “close out” or add nodes as you move up or down the tree. To represent this mathematically, the authors use a new activation function they developed, termed cumulative max or cumax. In the same way that the softmax is a differentiable (i.e. 
“soft”) version of an argmax, the cumulative max is a softened version of a vector that has zeros up to some switch point k, and ones thereafter. If you had such a vector as your forget mask, then “closing out” a layer in your tree would be equivalent to shifting the index where you switch from 0 to 1 up by one, so that a layer that previously had a “remember” value of 1.0 is now removing its content from the stack. However, since we need to differentiate, this notional 0/1 vector is instead represented as a cumulative sum of a softmax, which can be thought of as the continuous-valued probability that you’ve reached the switch point by any given point in the vector. Outside of the abstractions of what we’re imagining this cumax function to represent, in a practical sense, it does strictly enforce that you monotonically remember or input more as you move along the vector. This has the practical effect that the network will be biased towards remembering information at one end of the representation vector for longer, meaning it could be a useful inductive bias for storing information that has more long-term usefulness to it. One advantage that this system has over a previous system that, for example, had each layer of the LSTM operate on a different forgetting-decay timescale, is that this is a soft approximation, so, up to the number of neurons in the representation, the model can dynamically approximate whatever number of tree nodes it likes, rather than being explicitly tied to the number of layers. Beyond being a mathematically clever idea, the question of whether it improves performance is a little mixed. It does consistently worse at tasks that require keeping track of short-term dependency information, but seems to do better at more long-term tasks, although not in a perfectly consistent or overly dramatic way. 
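Since cumax is just a cumulative sum over a softmax, as described above, it is tiny to write down. Here is a numpy sketch of it:

```python
import numpy as np

def cumax(x):
    """Cumulative max: cumsum over a softmax. Produces a soft, monotonically
    non-decreasing approximation of a step vector of zeros then ones."""
    e = np.exp(x - np.max(x))        # numerically stable softmax
    return np.cumsum(e / e.sum())
```

If the softmax puts nearly all its mass at index k, the output is close to zero before k and close to one from k onward, which is exactly the soft 0/1 switch vector the gating scheme needs.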
My overall read is that this is a neat idea, and I’m interested to see if it gets built on, as well as interested to see later papers that do some introspective work to validate whether the model is actually using this inductive bias in the tree-like way that we’re hoping and imagining it will. 
[link]
It’s possible I’m missing something here, but my primary response to reading this paper is just a sense of confusion: the paper implicitly presents its approach as novel, when there doesn’t seem to me to be a clear mechanism that is changed from prior work. The premise of this paper is that self-supervised learning techniques (a subcategory of unsupervised learning, where losses are constructed based on reconstruction or perturbation of the original image) can be turned into semi-supervised methods by training on a loss that is a weighted combination of the self-supervised loss and the supervised loss. The self-supervision techniques that they identify as candidates for integration into their semi-supervised framework are: rotation prediction, where an image is rotated to one of four rotation angles, and a classifier is applied to guess the angle; and exemplar representation invariance, where an image is cropped, mirrored, and color-randomized in order to provide inputs, whose representations are then pushed to be closer to the representation for the unmodified image. My confusion is due to the fact that I know I’ve read several semi-supervised learning papers that do things of this ilk (insofar as combining unsupervised and supervised losses together), and it seems strange to identify it as a novel contribution. That said, this paper does give an interesting overview of self-supervision techniques, and I found it valuable to read for that purpose. 
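The weighted-combination idea is simple enough to sketch directly, using the rotation-prediction task as the self-supervised term. This is my own minimal illustration of the loss shape the paper describes (the mixing weight and four-way cross-entropy setup are assumptions), not the authors' code:

```python
import numpy as np

def combined_loss(supervised_loss, rotation_logits, rotation_label, weight=1.0):
    """Supervised loss plus a weighted self-supervised term: cross-entropy
    over the four possible rotation angles (0/90/180/270 degrees)."""
    logits = rotation_logits - np.max(rotation_logits)   # stable log-softmax
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    rotation_loss = -log_probs[rotation_label]
    return supervised_loss + weight * rotation_loss
```

The supervised term only exists for labeled examples; the rotation term can be computed for every image, labeled or not, which is what makes the combined objective semi-supervised.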
[link]
It didn’t hit me how much this paper was a pun until I finished it, and in retrospect, I say, bravo. This paper focuses on adversarial examples, and argues that, at least in some cases, adversarial perturbations aren’t purely overfitting failures on the part of the model, but actual features that generalize to the test set. This conclusion comes from a set of two experiments. In one, the authors create a dataset that only contains what they call “robust features.” They do this by taking a classifier trained to be robust using adversarial training (training on adversarial examples), and doing gradient descent to modify the input pixels until the final-layer robust-model activations of the modified inputs match the final-layer activations when the unmodified inputs are passed in. Operating under the premise that features identified by a robust model are themselves robust, because by definition they don’t change in the presence of an adversarial perturbation, creating a training set that matches these features means that you’ve created some kind of platonic, robust version of the training set, with only robust features present. They then take this dataset, train a new model on it, and show that it has strong test set performance, in both normal settings and adversarial ones. This is not enormously surprising, since the original robust classifier performed well, but still interesting. The most interesting and perhaps surprising experiment is where the authors create a dataset by taking normal images, and layering on top an adversarial perturbation. They then label these perturbed images with the label corresponding to the perturbation class, and train a model off of that. They then find that this model, which is trained on images which correspond to their labeled class only in their perturbation features, and not in the underlying visual features a human would recognize, achieves good test set performance under normal conditions. 
However, it performs poorly on adversarial perturbations of the test set. https://i.imgur.com/eJQXb0i.png Overall, the authors claim that the perturbations that are “tricking” models are features that can genuinely provide some amount of test set generalization, due to real but unintuitive regularities in the data, but that these features are non-robust, in that small amounts of noise can cause them to switch sign. 
[link]
In modern machine learning, gradient descent has diversified into a zoo of subtly distinct techniques, all designed, analytically, heuristically, or practically, to ease or accelerate our model’s path through multidimensional loss space. A solid contingent of these methods are adaptive gradient methods, which scale the size of gradient updates according to variously calculated historical averages or variances of the vector update, which has the effect of scaling down the updates along feature dimensions that have experienced large updates in the past. The intuition behind this is that we may want to effectively reduce the learning rate (by dividing by a larger number) along dimensions where there have been large or highly variable updates. These methods are commonly used because, as the name suggests, they adapt to the scale of your dataset and particular loss landscape, avoiding what might otherwise be a lengthy process of hyperparameter tuning. But this paper argues that, at least on a simplified problem, adaptive methods can reach overly simplified and overfit solutions that generalize to test data less well than a non-adaptive, more standard gradient descent method. The theoretical core of the paper is a proof showing limitations of the solution reached by adaptive gradient methods on a simple toy regression problem, on linearly separable data. It’s a little dodgy to try to recapitulate a mathematical proof in verbal form, but I’ll do my best, on the understanding that you should really read the full thing to fully verify the logic. The goal of the proof is to characterize the solution weight vector learned by different optimization systems. 
In this simplified environment, a core informational unit of your equations is X^T(y), which (in a world where labels are either 1 or -1) goes through each feature, and for each feature, takes a dot product between that feature vector (across examples) and the label vector, which has the effect of adding up a positive sum of all the feature values attached to positive examples, and then subtracting out (because of the multiply by -1) all the feature values attached to negative examples. When this is summed, we get a per-feature value that will be positive if positive values of the feature tend to indicate positive labels, and negative if the opposite is true, in each case with a magnitude relating to the strength of that relationship. The claim made by the paper, supported by a lot of variable transformations, is that the solution learned by adaptive gradient methods reduces to a sign() operation on top of that vector, where magnitude information is lost. This happens because the running gradients that you divide out happen to correspond to the absolute value of this vector, and dividing a vector (which would be the core of the solution in the non-adaptive case) by its absolute value gives you a simple sign. The paper then goes on to show that this edge case can lead to pathological overfitting in cases of high feature dimensionality relative to the number of data points. (I wish I could give more deep insight on why this is the case, but I wasn’t really able to translate the math into intuition, outside of this fact of scaling by gradient magnitudes having the effect of losing potentially useful gradient information.) The big question from all this is: does this matter? Does it matter, in particular, beyond a toy dataset and an artificially simple problem? The answer seems to be a pretty strong maybe. 
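The quantity at the heart of the proof is easy to compute on a toy example. Here is a small illustration of X^T(y) and of the sign-only solution the paper says adaptive methods collapse to (the specific numbers are mine, chosen just to show the magnitude loss):

```python
import numpy as np

# Three examples, two features; labels in {+1, -1}.
X = np.array([[ 2.0,  0.1],
              [ 1.0, -0.3],
              [-1.5,  0.2]])
y = np.array([1.0, 1.0, -1.0])

# Per-feature correlation with the labels: sums feature values on positive
# examples, subtracts them on negative examples.
correlation = X.T @ y

# The sign-only solution the proof points to: dividing by the accumulated
# gradient magnitudes discards how strongly each feature predicts the label.
adaptive_like = np.sign(correlation)
```

Feature 1 here is a strong predictor (correlation 4.5) and feature 2 a weak, slightly negative one (-0.4), but the sign-only solution weights them equally in magnitude, which is exactly the information loss the paper blames for poor generalization when there are many weak features.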
The authors test adaptive methods against hyperparameter-optimized SGD and momentum SGD (a variant, but without the adaptive aspects), and find that, while adaptive methods often learn more quickly at first, SGD approaches catch up later in training, first in terms of test set error at a time when adaptive methods’ training set error still seems to be decreasing, and later even in training set error. So there seems to be evidence that solutions learned by adaptive methods generalize worse than ones learned by SGD, at least on some image recognition and language RNN models. (Though, interestingly, RMSProp comes close to the SGD test set levels, doing the best out of the adaptive methods.) Overall, this suggests to me that doing fully hyperparameter-optimized SGD might be a stronger design choice, but that adaptive methods retain popularity because of their (very appealing, practically) lack of need for hyperparameter tuning to do at least a *reasonable* job, even if their performance might have more of a ceiling than that of vanilla SGD. 
[link]
[I do occasionally wonder if people will look back on the “Is All You Need” naming convention with genuine confusion in a few years. “Really… all you need?”] This paper merges the ideas of curiosity-based learning and hierarchical reinforcement learning, to propose an architecture for learning distinctive skills based solely on an incentive to make those skills distinguishable from one another and relatively internally random, rather than because they’re directly useful in achieving some reward. The notion of hierarchical reinforcement learning is that, instead of learning a single joint policy, we learn some discrete number of sub-policies, and then treat the distribution over those sub-policies as you would a distribution over actions in a baseline RL policy. In order to achieve a reward, a model jointly optimizes the action distribution of the sub-policies, and also the distribution over sub-policies. One issue with this approach, which is raised by this paper (though I don’t really have strong enough domain background here to know how much of a problem this is in practice), is that this joint optimization process means that, early in the process, we choose the sub-policies that are doing the best, and sample more from and thus improve those. This “early exploitation” problem (in the explore-vs-exploit frame) means that we might not learn skills that would be valuable to know later on, but that don’t give us any reward until we’ve developed them further. To address this, this paper proposes DIAYN, an algorithm which (1) samples discrete latent skill vectors according to a uniform, high-entropy prior, rather than according to how useful we think they are now, and (2) doesn’t even have a direct notion of usefulness, but instead incentivizes shaping of skills to be more distinct from one another, in terms of the states that are visited by each skill’s policy. The network then learns policies conditioned on each skill vector, and at each point operates according to whichever has been sampled. 
This idea of distinctiveness is encapsulated by saying “we want to have high mutual information between the states visited by a skill, and the discrete ID of that skill,” or, in more practical terms, “we want to be able to train a discriminator to do a good job predicting which skill we’re sampling from, based on the states it sees.” (I swear, every time I read a paper where someone uses mutual information these days, it’s actually a discriminator under the hood.) https://i.imgur.com/2a378Bo.png This incentivizes the model to take actions associated with each skill that will get it to states that are unlikely to occur in any of the existing skills. Depending on what set of observations you give the discriminator to work with, you can shape what axes your skills are incentivized to vary on; if you try to discriminate skills based solely on an agent’s center of mass, you’ll end up with policies that vary their center of mass more wildly. The paper shows that, at least on simple environments, agents can learn distinctive clusters of skills based on this objective. An interesting analogy here is to unsupervised pretraining of e.g. large language models and other similar settings, where we first train a model without (potentially costly) explicit reward, and this gives us a starting-point set of representations that allow us to reach good performance more quickly once we start training on a supervised reward signal. There is some evidence that this pretraining effect could be captured by this kind of purely exploratory approach, as suggested by experiments that take the learned skills or subpolicies, hold them fixed, and train a meta-controller to pick subpolicies according to an external reward; the “pretrained” policy reaches high reward more quickly. 
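The “discriminator under the hood” view of the DIAYN objective can be made concrete with a minimal sketch. Everything here is a stand-in of my own construction: a random linear map plays the role of the trained discriminator, and the state and skill dimensions are arbitrary; only the reward shape, log q(z|s) minus the log of the uniform prior, reflects the paper’s idea.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SKILLS = 4
STATE_DIM = 2

# Hypothetical linear "discriminator" standing in for a trained network:
# it maps a state to logits over skill IDs.
W = rng.normal(size=(STATE_DIM, N_SKILLS))

def log_q_skill(state):
    """Discriminator's log-posterior log q(z|s) over skills, via log-softmax."""
    logits = state @ W
    return logits - np.log(np.exp(logits).sum())

def diayn_reward(state, skill_id):
    """DIAYN-style intrinsic reward: log q(z|s) - log p(z). The agent is paid
    for reaching states the discriminator can attribute to its current skill."""
    log_p = -np.log(N_SKILLS)  # skills are sampled from a fixed uniform prior
    return log_q_skill(state)[skill_id] - log_p

# Sample a skill uniformly (not by current usefulness), then score a state.
z = int(rng.integers(N_SKILLS))
s = rng.normal(size=STATE_DIM)
r = diayn_reward(s, z)
```

Note that the reward is positive exactly when the discriminator assigns the current skill more probability than chance, which is what pushes each skill’s policy toward states the other skills don’t visit.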
[link]
Reward functions are a funny part of modern reinforcement learning: enormously salient from the inside, if you’re coding or working with RL systems, yet not as clearly visible from the outside perspective, where we just see agents playing games in what seem to be humanlike ways. Just seeing things from this angle, it can be easy to imagine that the mechanisms being used to learn are humanlike as well. And, it’s true that some of the Atari games being examined are cases where there is in fact a clear, explicit reward in the form of points, that human players would also be trying to optimize. But in most cases, the world isn’t really in the habit of producing clear reward signals, and it definitely doesn’t typically do so on time scales that account for most of the learning humans do. So, it’s generally hypothesized that in addition to updating on (sparse) environmental rewards, humans also operate according to certain pre-coded, possibly evolutionarily engineered heuristics, of which one is curiosity. The intuition is: it sure seems like, especially early in life, humans learn by interacting with objects purely driven by curiosity, and we’d love to somehow harness that same drive to allow our learning systems to function in environments lacking dense, informative reward signals. One such environment is the video game Montezuma’s Revenge, which in addition to being amusingly difficult to search for, is a game with sparse, long-range rewards, on which typical reward-based agents have historically performed poorly, and on which this current paper focuses. A strong existing tradition of curiosity objectives focuses on incentivizing agents to be able to better predict the next observation, given the current observation and their action within it. This works by training such a prediction network on historical observations, and giving agents a bonus proportional to that prediction’s error on a given observation. 
The theory behind this is that if an agent isn’t able to predict the observation-transition dynamics at a given state, that probably means it hasn’t visited many nearby states, and so we want to incentivize it doing so to gain information. If this sounds similar to the classic “explore vs exploit” tradeoff, that’s because it’s very much a related idea: in cases of clear reward, we should take the reward, but in cases of low or uncertain reward, there’s value to exploration. One difficulty of systems like the one described above is that they reward the agent for being in environments where the next observation is difficult to predict from the current one. And while that could describe novel states about which the agent needs to gain information, it can also describe states that are inherently stochastic; the canonical example being random static on a TV screen. The agent has a lot of trouble predicting the next observation because it’s fundamentally nondeterministic to a greater degree than even the random-but-causal dynamics of most games. The proposed alternative of this paper is a little strange, but makes more sense in the context of responding to this stochasticity problem. The authors propose to create a random mapping, in the form of an initialized but untrained neural network, taking in observations and spitting out embedding vectors. Then, they incentivize their agent to go to places that have high prediction error on a network designed to predict these random embeddings. Since the output is just a function mapping, it’s deterministic with respect to observations. The idea here is that if you’ve seen observations similar to your current observation, you’ll be more able to predict the corresponding embedding, even if there’s no meaningful relationship that you’re learning. https://i.imgur.com/Ds5gHDE.png The authors found that this performed well on Montezuma’s Revenge and Private Eye, but only middlingly well on other environments. I’m a bit torn on this paper overall. 
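The random-embedding idea can be sketched in a few lines. This is a toy version under my own assumptions (tiny one-layer networks, a single observation, plain gradient descent), not the paper’s actual architecture or training setup; it just shows the mechanism: a fixed random target network, a predictor trained to match it, and an intrinsic bonus equal to the prediction error, which shrinks as an observation becomes familiar.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, EMB_DIM = 8, 4

# Fixed, randomly-initialized target network: a deterministic map from
# observation to embedding that is never trained.
W_target = rng.normal(size=(OBS_DIM, EMB_DIM))
target = lambda obs: np.tanh(obs @ W_target)

# Predictor network, trained to match the target's outputs.
W_pred = 0.1 * rng.normal(size=(OBS_DIM, EMB_DIM))

def intrinsic_reward(obs):
    """RND-style bonus: the predictor's error on the random embedding. Large
    for observations unlike anything trained on; deterministic, so a noisy-TV
    state earns no permanent bonus once the predictor has fit it."""
    err = target(obs) - np.tanh(obs @ W_pred)
    return float((err ** 2).mean())

def train_step(obs, lr=0.02):
    """One gradient step fitting the predictor to the fixed target."""
    global W_pred
    pred = np.tanh(obs @ W_pred)
    grad = np.outer(obs, (pred - target(obs)) * (1 - pred ** 2))
    W_pred -= lr * grad

obs = rng.normal(size=OBS_DIM)
before = intrinsic_reward(obs)
for _ in range(500):
    train_step(obs)
after = intrinsic_reward(obs)  # the bonus shrinks for familiar observations
```

The key property, visible even in this toy: the target is a deterministic function of the observation, so unlike next-state prediction error, the bonus can always be driven down by more experience with similar observations.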
On one hand, it seems like a clever idea, and I’m in general interested in seeing more work on curiosity. It does clearly seem to be capturing something that corresponds to novelty-seeking, and the agent trained using it explores a higher number of rooms than alternative options. On the other, I’m a little skeptical of the fact that it only has consistent performance in two environments, and wish there had been more comparisons to simpler forms of observation similarity, since this really does just seem like a metric of “how similar are the observations you’ve seen before to this one”. I find myself wondering if some sort of density modeling could even be effective here, especially if (as may be the case, I’m unsure) the input observations are metadata rather than pixels. 
[link]
Language seems obviously useful to humans in coordinating on complicated tasks, and, the logic goes, you might expect that if you gave agents in a multi-agent RL system some amount of shared interest, and the capacity to communicate, they would use that communication channel to coordinate actions. This is particularly true in cases where some part of the environment is only visible to one of the agents. A number of papers in the field have set up such scenarios, and argued that meaningful communication strategies developed, mostly in the form of one agent sending a message to signal its planned action to the other agent before both act. This paper tries to tease apart the various quantitative metrics used to evaluate whether informative messages are being sent, and tries to explain why they can diverge from each other in unintuitive ways. The experiments in the paper are done in quite simple environments, where there are simple one-shot actions and a payoff matrix, as well as an ability for the agents to send messages before acting. Some metrics identified by the paper are:
- Speaker Consistency: there’s high mutual information between the message a speaker sends and the action it takes. Said another way, you could use a speaker’s message to predict its action at a rate higher than random, because the message contains information about the action.
- Heightened reward/task completion under communication: fairly straightforward; this metric argues that informative communication happened when pairs of agents do better in the presence of communication channels than when they aren’t available.
- Instantaneous Coordination: measures the mutual information between the message sent by agent A and the action of agent B, in a similar way to Speaker Consistency. 
This work agrees that it’s important to measure the causal impact of messages on other-agent actions, but argues that instantaneous coordination is flawed because the mutual information metric between messages and response actions doesn’t properly condition on the state of the game under which the message is being sent. Even if you successfully communicate your planned action to me, the action I actually take in response will be conditioned on my personal payoff matrix, and may average out to seeming unrelated or random if you take an expectation over every possible state the message could be received in. Instead, they suggest an explicitly causal approach, where for each configuration of the game (different payoff matrix), they sample different messages, and calculate whether you see messages driving more consistent actions when you condition on other factors in the game. An interesting finding of this paper is that, at least in these simple environments, you’re able to find cases where there is Speaker Consistency (SC; messages that contain information about the speaker’s next action), but no substantial Causal Influence of Communication (CIC). This may seem counterintuitive, since, why would you as an agent send a message containing information about your action, if not because you’re incentivized to communicate with the other agent? It seems like the answer is that it’s possible to have this kind of shared information *by accident,* as a result of the shared infrastructure between the action network and the messaging network. Because both use a shared set of early-layer representations, you end up having one contain information about the other as an incidental fact; if the networks are fully separated with no shared weights, the Speaker Consistency values drop. An important caveat to make here is that this paper isn’t, or at least shouldn’t be, arguing that agents in multi-agent systems don’t actually learn communication. 
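Both Speaker Consistency and Instantaneous Coordination bottom out in the same computation: empirical mutual information between two discrete variables. A minimal sketch, with toy message/action data of my own invention, shows why the two metrics can disagree: a speaker whose message perfectly predicts its own action has high MI, while a listener who ignores the message yields MI near zero.

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """Empirical mutual information (in nats) between two discrete variables,
    given a list of (x, y) observations: the computation underlying the
    paper's speaker-consistency and instantaneous-coordination metrics."""
    n = len(pairs)
    joint = Counter(pairs)                # counts of (x, y) co-occurrences
    px = Counter(x for x, _ in pairs)     # marginal counts of x
    py = Counter(y for _, y in pairs)     # marginal counts of y
    # MI = sum over (x,y) of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

# Speaker consistency: the message perfectly predicts the speaker's own action.
consistent = [("m0", "a0")] * 50 + [("m1", "a1")] * 50
# No coordination: the listener's action is independent of the message.
independent = [("m0", "a0"), ("m0", "a1"), ("m1", "a0"), ("m1", "a1")] * 25
```

Here `mutual_information(consistent)` is log 2 (one full bit) while `mutual_information(independent)` is zero, which is the SC-without-CIC pattern the paper dissects. The paper’s further point is that even the listener-side MI needs to be conditioned on the game state to count as causal.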
The environments used here are quite simple, and might just not plausibly be difficult enough to incentivize communication. However, it is a fair point that it’s valuable to be precise in what exactly we’re measuring, and test how that squares with what we actually care about in a system, to try to avoid cases like these where we may be liable to be led astray by our belief about how the system *should* be learning, rather than how it actually is. 
[link]
In 2018, a group including many of the authors of this updated paper argued for a theory of deep neural network optimization that they called the “Lottery Ticket Hypothesis”. It framed itself as a solution to what was otherwise a confusing seeming contradiction: that you could prune or compress trained networks to contain a small percentage of their trained weights without loss of performance, but also that if you tried to train a comparably small network (comparable to the post-training pruned network) from initialization, you wouldn’t be able to achieve similar performance. They showed that, at least for some set of tasks, you could in fact train a smaller network to equivalent performance, but only if you kept the same connectivity patterns as in the pruned network, and if you re-initialized those weights to the same values they were initialized at during the initial training. These lucky initialization patterns are the lottery tickets being referred to in the eponymous hypothesis: small subnetworks of well-initialized weights that are set up to be able to train well. This paper assesses whether and under what conditions the LTH holds on larger problems, and does a bit of a meta-analysis over different alternate theories in this space. One such alternate theory, from Liu et al, proposes that, in fact, there is no value in re-initializing to the specific initial values, and that you can actually get away with random initialization if you keep the connectivity patterns of the pruned weights. The “At Scale” paper compares the two methods over a wide range of pruning percentages, and convincingly shows that while random initialization with the same connectivity can perform well up to 80% of the weights being removed, after 80%, the performance of the random initialization drops, whereas the performance of the “winning ticket” approach remains comparable with full network training up to 99% of the weights being pruned. 
This seems to provide support for the theory that there is value in re-initializing the weights to how they were, especially when you prune to very small subnetworks. https://i.imgur.com/9O2aAIT.png The core of the current paper focuses on a difficulty in the original LTH paper: that the procedure of iterative pruning (train, then prune some weights, then train again) wasn’t able to reliably find “winning tickets” for deep networks of the type needed to solve ImageNet or CIFAR. To be precise, re-initializing pruned networks to their original values did no better than initializing them randomly in these networks. In order to actually get these winning tickets to perform well, the original authors had to do a somewhat arcane process of starting the learning rate very small and scaling it up, called “warmup”. Neither paper gave a clear intuition as to why this would be the case, but the updated paper found that they could avoid the need for this approach if, instead of re-initializing weights to their original value, they set them to the values they were at after some small number of iterations into training. They justify this by showing that performance under this new initialization is related to something they call “stability to pruning,” which measures how close the learned weights after re-initialization are to the original learned weights in the full model training. And, while the weights of deeper networks are unstable (by this metric) when first initialized, they become stable fairly early on. 
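The train / prune / rewind loop is easy to misread from prose, so here is a skeletal sketch. Everything concrete is a stand-in of mine: “training” is a trivial scaling function, the network is a flat weight vector, and the early-training snapshot is simulated; only the loop structure (train the current subnetwork, prune low-magnitude weights, rewind survivors to an early-iteration snapshot rather than to initialization) reflects the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_mask(weights, mask, frac=0.2):
    """Magnitude pruning: zero out the smallest `frac` of currently-alive weights."""
    alive = np.abs(weights[mask])
    threshold = np.sort(alive)[int(len(alive) * frac)]
    return mask & (np.abs(weights) >= threshold)

def lottery_ticket(init_weights, train, rewind_weights, rounds=3):
    """Iterative pruning with 'late reset': train, prune, then rewind the
    surviving weights to their early-training values (the updated paper's fix)
    instead of to initialization. `train` stands in for a full training run."""
    mask = np.ones_like(init_weights, dtype=bool)
    weights = init_weights.copy()
    for _ in range(rounds):
        weights = train(weights * mask)   # train the current subnetwork
        mask = prune_mask(weights, mask)  # drop low-magnitude weights
        weights = rewind_weights.copy()   # reset survivors to the snapshot
    return mask, rewind_weights * mask    # the "winning ticket"

# Toy stand-ins: "training" just scales weights; the rewind point is a
# hypothetical snapshot taken a few iterations into training.
w0 = rng.normal(size=100)
w_rewind = w0 + 0.01 * rng.normal(size=100)
mask, ticket = lottery_ticket(w0, lambda w: w * 1.1, w_rewind)
```

With 20% pruned per round over three rounds, the surviving subnetwork is 100 → 80 → 64 → 52 weights, which is how iterative pruning reaches very high sparsities gradually rather than in one shot.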
I was a little confused by this framing: it seemed fairly tautological to me, since you’re using “how stably close are the weights to the original weights” as a way to explain “when can you recover performance comparable to original performance.” This was framed as being a mechanistic explanation of why you can see a lottery ticket phenomenon to some extent, but only if you do a “late reset” to several iterations after initialization, but it didn’t feel quite mechanistically satisfying enough to me. That said, I think this is overall an intriguing idea, and I’d love to see more papers discuss it. In particular, I’d love to read more qualitative analysis about whether there are any notable patterns shared by “winning tickets”. 
[link]
The Transformer, a somewhat confusingly-named model structure that uses attention mechanisms to aggregate information for understanding or generating data, has been having a real moment in the last year or so, with GPT-2 being only the most well-publicized tip of that iceberg. It has lots of advantages: the obvious attractions of strong performance, as well as the ability to train in parallel across parts of a sequence, which RNNs can’t do because of the need to build up and maintain state. However, a problematic fact about the Transformer approach is how it scales to large sequences of input data. Because attention is based on performing pairwise queries between each point in the data sequence and each other point, to allow for aggregation of information from places throughout the sequence, it scales as O(N^2), because every new element in the sequence needs to be queried by N other ones. This makes it resource-intensive to run transformer models on long sequences. The Sparse Transformer design proposed in this OpenAI paper tries to cut down on this resource cost by loosening the requirement that, in every attention operation, each element directly pulls information from every other element. In this new system, each point doesn’t get information about every other point in a single operation; instead, chaining two such limited operations in a row provides that global visibility. This is done in one of two ways. (1) The first, called the “strided” version, performs two operations in a row, one masked attention that only looks at the last k timesteps (for example, the last 7), and then a second masked attention that only looks at every kth timestep. 
So, at the end of the second operation, each point has pulled information from checkpoints 7, 14, 21 steps ago, and each of these has pulled information from the window between it and its preceding checkpoint, giving visibility into a full global receptive frame in the course of two operations. (2) The second, called the “fixed” version, uses a similar sort of logic, but instead of having the “window accumulation points” be defined in reference to the point doing the querying, you instead have fixed accumulation points responsible for gathering information from the windows around them. So, using the example given in the paper, if you imagine a window of size 128, and an “accumulation points per window” of 8, then the points in indices 120-128 (say) would have visibility into points 0-128. That represents the first operation, and in the second one, all other points in the sequence pull in information by querying the designated accumulation points for all the windows that aren’t masked for it. The paper argues that, between these two systems, the strided system should work best when the data has some inherent periodicity, but I don’t know that I particularly follow that intuition. I have some sense that the important distinction here is that in the strided case you have many points of accumulation, each with not much context, whereas in the fixed case you have very few accumulation points each with a larger window, but I don’t know what performance differences exactly I’d expect these mechanical differences to predict. This whole project of reducing access to global information seems initially a little counterintuitive, since the whole point of a transformer design, in some sense, was its ability to gain global context in a single layer, as opposed to a convnet needing multiple layers to build receptive field, or a RNN needing to maintain state throughout the sequence. 
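The two chained masks of the strided version can be built explicitly, which also makes the “full receptive field in two operations” claim checkable. The sequence length and stride below are arbitrary choices of mine, and this is attention *connectivity* only, with no actual attention weights.

```python
import numpy as np

def strided_masks(n, stride):
    """The two chained attention patterns of the 'strided' Sparse Transformer:
    one head attends to the previous `stride` positions, the other to every
    stride-th earlier position. Entry [i, j] = True means i may attend to j."""
    local = np.zeros((n, n), dtype=bool)
    summary = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):            # causal: only the past and self
            if i - j < stride:
                local[i, j] = True        # recent local window
            if (i - j) % stride == 0:
                summary[i, j] = True      # every stride-th step back
    return local, summary

local, summary = strided_masks(n=16, stride=4)
# Chaining the two operations: position i reaches position j if i attends to
# some checkpoint k (summary) which in turn attends to j (local).
reach = (summary.astype(int) @ local.astype(int)) > 0
```

After the composition, every position reaches its entire causal past even though each individual operation only touches O(N * stride) pairs rather than O(N^2).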
However, I think this paper makes the most sense as a way of interpolating the space between something like a CNN and a full attention design, for the sake of efficiency. With a CNN, you have a fixed kernel, and so as your sequence gets longer, you need to add more and more layers in order for any given point to be able to incorporate into its representation information from the complete other side of the sequence. With a RNN, as your sequence gets longer, you pay the cost of needing to backpropagate state farther. So, by contrast, even though the Sparse Transformer seems to be giving up its signature advantage, it’s instead just trading one constant number of steps to achieve global visibility (1) for another (2 in this paper, though conceptually it could be more), but still in a way that’s constant with respect to the length of the data. And, in exchange for this trade, they get very sparse, very masked operations, where many of the multiplications involved in these big query calculations can be ignored, making for faster computation speeds. On the datasets tried, the Sparse Transformer increased speed, and in fact, I think in all cases, increased performance. The performance gain by itself isn’t that dramatic, but in the context of expecting, if anything, worse performance as a result of limiting model structure, it’s encouraging and interesting that performance either stays about the same or possibly improves. 
[link]
This paper focuses on taking advances from the (mostly heuristic, practical) world of data augmentation for supervised learning, and applying those to the unsupervised setting, as a way of inducing better performance in a semi-supervised environment (with many unlabeled points, and few labeled ones). Data augmentation has been a mostly behind-the-scenes implementation detail in modern deep learning: minor modifications like shifting a dataset by a few pixels, rotating it slightly, or flipping it horizontally, to generate additional pseudo-examples for the model to train on. The core idea motivating such approaches is that the tactics of data augmentation are modifications that should not change the class label in a platonic, ground truth sense, which allows us to use them as training examples of the same class label as the original image from which the perturbations were made. In addition to just giving the network generically more data, this approach communicates to the network the specific kinds of modifications that can be made to an image and have it still be of the same class. The Unsupervised Data Augmentation (UDA) tactic from this paper notices two things: (1) Within the sphere of supervised learning, there has been dataset-specific innovation in generating augmented data that will be particularly helpful to a given dataset. In language modeling, an example of this is having a sentence go into another language and back again through two well-trained translation networks, and using the resulting sentence as another example of the same class. For ImageNet, there’s an approach called AutoAugment that uses reinforcement learning on a validation set to learn a policy of image operations (rotate, shear, change color) in order to increase validation accuracy. [I am a little confused about this as an approach, since I worry about essentially overfitting to the validation set. 
That said, I don’t have time to delve specifically into the AutoAugment paper today, so I’ll just leave this here as a caveat] (2) Within semi-supervised learning, there’s a growing tendency to use consistency loss as a way of making use of unlabeled data. The basic idea of consistency loss is that, even if you don’t know the class of a given datapoint, if you modify it in some small way, you can be confident that the model’s prediction should be consistent between the point and its perturbation, even if you don’t have knowledge of what the actual ground truth is. Often, systems like this have been designed using simple Gaussian noise on top of original unlabeled images. The key proposal of this paper is to substitute this more simplified perturbation procedure with the augmentation approaches being iterated on in supervised learning, since the goal of both, to modify inputs so as to capture ways in which they might differ but still be notionally of the same class, is nearly identical. On top of this core idea, the UDA paper also proposes an additional clever training tactic: in cases where you have many unlabeled examples and few labeled ones, you may need a large model to capture the information from the unlabeled examples, but this may result in overfitting on the labeled ones. To avoid this, they use an approach called “Training Signal Annealing,” where at each point in training they remove from the loss calculation any examples that the model is particularly confident about: where the prediction of the true class is above some threshold eta. As training goes on, the network is gradually allowed to see more of the training signal. In this kind of a framework, the model can’t overfit as easily, because once it starts getting the right answer on supervised examples, they drop out of the loss calculation. 
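The Training Signal Annealing mechanism is simple enough to sketch directly. I’m assuming the simplest (linear) annealing schedule for the threshold eta, growing from 1/K (chance level, so almost everything is masked out early) to 1 (everything contributes by the end); the specific numbers below are illustrative.

```python
import numpy as np

def tsa_keep_mask(probs_true_class, step, total_steps, n_classes):
    """Training Signal Annealing sketch (linear schedule assumed): drop labeled
    examples the model already predicts with probability above a threshold eta,
    which anneals from 1/K at the start of training up to 1 at the end."""
    eta = (step / total_steps) * (1 - 1 / n_classes) + 1 / n_classes
    return probs_true_class < eta  # True = example still contributes to the loss

probs = np.array([0.05, 0.30, 0.60, 0.95])  # model's probability of the true class
early = tsa_keep_mask(probs, step=100, total_steps=10_000, n_classes=10)
late = tsa_keep_mask(probs, step=9_900, total_steps=10_000, n_classes=10)
```

Early in training (eta just above 1/10), only the example the model is still unsure about survives the mask; near the end (eta close to 1), all four contribute, which is the gradual release of supervised signal described above.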
In terms of empirical results, the authors find that with UDA they’re able to improve on many semi-supervised benchmarks with quite small numbers of labeled examples. At one point, they use as a baseline a BERT model that was fine-tuned in an unsupervised way prior to its semi-supervised training, and show that their augmentation method can add value even on top of the value that the unsupervised pretraining adds. 
[link]
The Magenta group at Google is a consistent source of really interesting problems for machine learning to solve, in the vein of creative generation of art and music, as well as mathematically creative ways to solve those problems. In this paper, they tackle a new problem with some interesting model-structural implications: generating Bach chorales composed of polyphonic multi-instrument arrangements. On one layer, this is similar to music generation problems that have been studied before, in that generating a musically coherent sequence requires learning both local and larger-scale structure between time steps in the music sequence. However, an additional element here is that there’s dependence of multiple instruments’ notes on one another at a given time step, so, in addition to generating time steps conditional on one another, you ideally want to learn how to model certain notes in a given harmony conditional on the other notes already present there. Understanding the specifics of the approach was one of those scenarios where the mathematical arguments were somewhat opaque, but the actual mechanical description of the model gave a lot of clarity. I find this frequently the case with machine learning, where there’s this strange set of dual incentives between the engineering impulse towards designing effective systems, and the academic need to connect the approach to a more theoretical mathematical foundation. The approach taken here has a lot in common with the autoregressive model structures used in PixelCNN or WaveNet. These are all based, theoretically speaking, on the autoregressive property of joint probability distributions: that they can be sampled from by sampling first from the prior over the first variable (or pixel, or wave value), and then the second conditional on the first, then the third conditional on the first two, and so on. 
In practice, autoregressive models don’t necessarily condition on the *entire* previous rest of the input in generating a conditional distribution for a new point (for example, because they use a convolutional structure that doesn’t have a receptive field big enough to reach back through the entire previous sequence), but they are based on that idea. A unique aspect of this model is that, instead of defining one specific conditional dependence relationship (where pixel J is conditioned on wave values J-5 through J, or some such), they argue that they instead learn conditional relationships over any possible autoregressive ordering of both time steps and instrument IDs. This is a bit of a strange idea, that, like I mentioned, is simplified by going through the mechanics. The model works in a way strikingly similar to recent large scale language modeling: by, for each sample, masking some random subset of the tokens, and asking the model to predict the masked values given the unmasked ones. In this case, an interesting nuance is that the values to be masked are randomly sampled across both time step and instrument, such that in some cases you’ll have a prior time step but no other instruments at your time step, or other instruments at your time step but no prior time steps to work from, and so on. The model needs to flexibly use various kinds of local context to predict the notes that are masked. (As an aside, in addition to the actual values, the network is given the actual 0/1 mask, so it can better distinguish between “0, no information” and “0 because in the actual data sample there wasn’t a pitch here”.) The model refers to these unmasked points as “context points”. An interesting capacity that this gives the model, and which the authors use as their sampling technique, is to create songs that are hybrids of existing chorales by randomly keeping some chunks and dropping out others, and using the model to interpolate through the missing bits. 
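The masking-over-both-axes idea can be sketched with a toy piano-roll tensor. The shapes here (4 instruments, 32 time steps, 46 pitches) and the uniform choice of how many cells to mask are illustrative assumptions of mine, not the paper’s exact setup; the point is that masking happens over the (instrument, time) grid, and the mask itself travels with the input.

```python
import numpy as np

rng = np.random.default_rng(0)

N_INSTRUMENTS, N_STEPS, N_PITCHES = 4, 32, 46

def make_training_example(pianoroll):
    """Orderless-training sketch: mask a random subset of (instrument, time)
    cells and ask the model to predict the masked notes from the unmasked
    'context points'. The 0/1 mask is also fed to the model, so it can tell
    'masked out' apart from 'genuinely no note here'."""
    n_masked = int(rng.integers(1, N_INSTRUMENTS * N_STEPS + 1))
    flat = rng.choice(N_INSTRUMENTS * N_STEPS, size=n_masked, replace=False)
    mask = np.zeros((N_INSTRUMENTS, N_STEPS), dtype=bool)
    mask[np.unravel_index(flat, mask.shape)] = True
    masked_input = pianoroll * ~mask[..., None]  # zero out the masked cells
    return masked_input, mask                    # model input, prediction targets

roll = rng.integers(0, 2, size=(N_INSTRUMENTS, N_STEPS, N_PITCHES))
x, mask = make_training_example(roll)
```

Because the masked set is sampled freely over both axes, any autoregressive ordering of (instrument, time) cells shows up as some mask pattern during training, which is the mechanical meaning of learning “all orderings” at once.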
[link]
Attention mechanisms are a common subcomponent within language models, initially as a part of recurrent models, and more recently as their own form of aggregating information over sequences, independent from the recurrence structure. Attention works by taking as input some sequence of inputs, in the most typical case embedded representations of words in a sentence, and learning a distribution of weights over those representations, which allows the network to aggregate the representations, typically by taking a weighted sum. One effect of using an attention mechanism is that, for each instance being predicted, the network produces this weight distribution over inputs, which intuitively feels like it’s the network demonstrating which input words were most important in constructing its decision. As a result, uses of attention have often been accompanied by examples that show attention distributions for individual examples, implicitly using them as a form of interpretability or model explanation. This paper has the goal of understanding whether attention distributions can be seen as a valid form of feature importance, and takes the position that they shouldn’t be. At a high level, I think the paper makes some valid criticisms, but ultimately I didn’t find the evidence it presented quite as strong as I would have liked. The paper performs two primary analyses of the attention distributions produced by a trained LSTM model: (1) It calculates the level of correlation between the importance that would be implied by attention weights and the importance as calculated using more canonical gradient-based methods (generally things in the shape of “which words contributed the most towards the prediction being what it was”). It finds correlation values that range across random seeds, but are generally centered around 0.5. The paper frames this as a negative result, implying that, in the case where attention was a valid form of importance, the correlation with existing metrics would be higher. 
I definitely follow the intuition that you would expect there to be a significant and positive correlation between methods in this class, but it’s unclear to me what a priori reasoning would draw the threshold for “significant” in a way that makes 0.5 fall below it. It just feels like one of those cases where I could imagine someone showing the same plots and coming to a different interpretation, and it’s not clear to me what criteria support one threshold of magnitude vs another. (2) It measures how much it can permute the weights of an attention distribution, and have the prediction made by the network not change in a meaningful way. It does this both by random tests, and also by measuring the maximum adversarial perturbation: the farthest-away distribution (in terms of Jensen-Shannon distance) that still produces a prediction within some epsilon of the original prediction. There are a few concerns I have about this as an analysis. First off, it makes a bit of an assumption that attention can only be a valid form of explanation if it’s a causal mechanism within the model. You could imagine that attention distributions still give you information about the internal state of the model, even if they are just reporting that state rather than directly influencing it. Secondly, it seems possible to me that you could get a relatively high Jensen-Shannon distance from an initial distribution just by permuting the indexes of the low-value weights, and shifting distributional weight between them in a way that doesn’t fundamentally change what the network is primarily attending to. Even if this is not the case in this paper, I’d love to see an example or some kind of quantitative measure showing that the Jensen-Shannon distances they demonstrate require a substantive change in weight priorities, rather than a trivial one. 
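The low-value-tail concern is easy to illustrate numerically. The two toy attention distributions below are my own construction: both put the same 0.6 of their mass on the same dominant word, and differ only in how the remaining 0.4 is spread over the tail, yet their Jensen-Shannon divergence is a substantial fraction of the maximum possible (log 2 nats).

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two attention distributions."""
    m = (p + q) / 2
    def kl(a, b):
        nz = a > 0                          # 0 * log 0 contributes nothing
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# One dominant attended word, plus a long low-weight tail.
p = np.array([0.60] + [0.40 / 8] * 8)
# Same dominant word; the tail mass concentrated on a single low-value position.
q = np.array([0.60, 0.40, 0, 0, 0, 0, 0, 0, 0])
d = js_divergence(p, q)
```

Here `d` is around 0.2 nats against a ceiling of log 2 ≈ 0.693, even though a reader eyeballing either distribution would say the model is “attending to” the same word. This is exactly the kind of check I’d want to see applied to the paper’s adversarial perturbations.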
Another general critique is that the experiments in this paper only focused on attention within an LSTM structure, where the embedding associated with each word isn’t really strictly an embedding of that specific word, but also contains a lot of information about things before and after, because of the nature of a BiLSTM. So, there is some specificity in the embedding corresponding to just that word, but a lot less than in a pure attention model, like some being used in NLP these days, where you’re learning an attention distribution over the raw, non-LSTMed representations. In this case, it makes sense that attention would be blurry, and not map exactly to our notions of which words are more important, since the word-level representations are themselves already aggregations. I think it’s totally fair to only focus on the LSTM case, but I would prefer that the paper scope its claims in better accordance with its empirical results. I feel a bit bad: overall, I really approve of papers like this being done to put a critical empirical frame on ML’s tendency to get conceptually ahead of itself. And, I do think that the evidentiary standard for “prove that X metric isn’t a form of interpretability” shouldn’t be that high, because on priors, I would expect most things not to be. I think they may well be right in their assessment; I would just like a more sure-footed set of analyses and interpretation behind it. 
[link]
One of the dominant narratives of the deep learning renaissance has been the value of well-designed inductive bias: structural choices that shape what a model learns. The biggest example of this can be found in convolutional networks, where models achieve a dramatic parameter reduction by having feature maps learn local patterns, which can then be reused across the whole image. This is based on the prior belief that patterns in images are generally locally contiguous, and so having feature maps that focus only on small (and gradually larger) local areas is a good fit for that prior. This paper operates in a similar spirit, except its input data isn’t in the form of an image, but a graph: the social graph of multiple agents operating within a multi-agent RL setting. In some sense, a graph is just a more general form of a pixel image: where a pixel within an image has a fixed number of neighbors, which have fixed discrete relationships to it (up, down, left, right), a node within a graph has an arbitrary number of neighbors, which can have arbitrary numbers and types of attributes attached to each relationship. The authors of this paper use graph networks as a sort of auxiliary information processing system alongside a more typical policy learning framework, on tasks that require group coordination and knowledge sharing to complete successfully. For example, each agent might be rewarded based on the aggregate reward of all agents together, and, in the stag hunt, it might require collaborative effort by multiple agents to successfully “capture” a reward. Because of this, you might imagine that it would be valuable to be able to predict what other agents within the game are going to do under certain circumstances, so that you can shape your strategy accordingly. 
The graph network used in this model represents both agents and objects in the environment as nodes, which have attributes including their position, whether they’re available or not (for capturable objects), and what their last action was. As best I can tell, all agents start out with directed connections going both ways to all other agents, and to all objects in the environment, with the only edge attribute being whether the players are on the same team, for competitive environments. Given this setup, the graph network works through a sort of “diffusion” of information, analogous to a message passing algorithm. At each iteration (analogous to a layer), the edge features pull in information from their past value and their sender and receiver nodes, as well as from a “global feature”. Then, all of the nodes pull in information from their edges, and their own past value. Finally, this “global attribute” gets updated based on summations over the newly-updated node and edge information. (If you were predicting graph-level attributes, this global attribute might be where you’d do that prediction. However, in this case, we’re just interested in predicting agent-level actions). https://i.imgur.com/luFlhfJ.png All of this has the effect of explicitly modeling agents as entities that both have information, and have connections to other entities. One benefit the authors claim of this structure is that it allows them more interpretability: when they “play out” the values of their graph network, which they call a Relational Forward Model or RFM, they observe edge values for two agents go up if those agents are about to collaborate on an action, and observe edge values for an agent and an object go up before that object is captured. Because this information is carefully shaped and structured, it makes it easier for humans to understand, and, in the tests the authors ran, it appears to also help agents do better in collaborative games. 
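The edge-then-node-then-global update order described above can be sketched in a few lines. This is a deliberately simplified stand-in, not the paper’s model: the learned update networks are replaced by plain averages, and the shapes, variable names, and tiny two-agent graph are all my own assumptions:

```python
import numpy as np

def gn_step(nodes, edges, senders, receivers, u):
    """One message-passing round: update edges, then nodes, then the global.

    nodes: (N, d) node features; edges: (E, d) edge features.
    senders/receivers: length-E index arrays; u: (d,) global feature.
    Learned update functions are replaced by simple averaging stand-ins.
    """
    # 1) Edge update: each edge pools its own state, its endpoints, and the global.
    new_edges = (edges + nodes[senders] + nodes[receivers] + u) / 4.0
    # 2) Node update: each node aggregates its incoming (newly-updated) edges.
    agg = np.zeros_like(nodes)
    np.add.at(agg, receivers, new_edges)
    new_nodes = (nodes + agg + u) / 3.0
    # 3) Global update: summarize the newly-updated nodes and edges.
    new_u = (u + new_nodes.mean(axis=0) + new_edges.mean(axis=0)) / 3.0
    return new_nodes, new_edges, new_u

# Tiny fully-connected two-agent graph (edges in both directions), 4-dim features.
nodes = np.ones((2, 4)); edges = np.zeros((2, 4)); u = np.zeros(4)
senders = np.array([0, 1]); receivers = np.array([1, 0])
nodes2, edges2, u2 = gn_step(nodes, edges, senders, receivers, u)
```

Stacking several such rounds is what lets information “diffuse” across the graph, one hop per round.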
https://i.imgur.com/BCKSmIb.png While I find graph networks quite interesting, and multi-agent learning quite interesting, I’m a little more uncertain about the inherent “graphiness” of this problem, since there aren’t really meaningful inherent edges between agents. One thing I am curious about here is how methods like these would work on sparser graphs, or in settings where a node’s connectivity to its neighbors is more distinct from its connectivity to the average other node in the graph. Here, every node is connected to every other node, so the explicit information localization function of graph networks is less pronounced. I might naively think that, to whatever extent the graph is designed in a way that captures information meaningful to the task, explicit graph methods would have an even greater comparative advantage in such a setting. 
[link]
It is a fact universally acknowledged that a reinforcement learning algorithm not in possession of a model must be in want of more data. Because they generally are. Joking aside, it is broadly understood that model-free RL takes a lot of data to train, and, even when you can design such algorithms to use off-policy trajectories, collecting data in the real environment might still be too costly. Under those conditions, we might want to learn a model of the environment, generate synthesized trajectories, and train on those. This has the advantage of not requiring us to run the actual environment, but the obvious disadvantage that any model will be a simplification of the true environment, and potentially an inaccurate one. These authors seek to answer the question: “is there a way to generate trajectories that have higher fidelity to the true environment?” As you might infer from the fact that they published a paper, and that I’m now writing about it, they argue that, yes, there is, and it’s through explicit causal/counterfactual modeling. Causal modeling is one of those areas of statistics that seems straightforward at its highest level of abstraction, but tends to get mathematically messy and unintuitive when you dive into the math. So, rather than starting with equations, I’m going to try to verbally give some intuitions for the way causal modeling is framed here. Imagine you’re trying to understand what would happen if a person had gone to college. There’s some set of information you know about them, and some set of information you don’t: just random true facts about them and about the universe. If, in the real world, they did go to college, and you want to simulate what would have happened if they didn’t, it’s not enough to just know the observed facts about them; you want to actually isolate all of the random other facts (about them, about the world) that weren’t specifically “the choice to go to college”, and condition on those as well. 
Obviously, in the example given here, it isn’t really practically possible to isolate all the specific unseen factors that influence someone’s outcome. But, conceptually, this quantity is what we’re going to focus on in this paper. Now, imagine a situation where an RL agent has been dropped into a maze-like puzzle. It has some set of dynamics, not immediately visible to the player, that make it difficult, but ultimately solvable. The best kind of simulated data, the paper argues, would keep that state of the world (which is partially unobservable) fixed, and sample different sets of actions the agent might take in that space. Thus, “counterfactual modeling”: for a given configuration of random states in the world, sampling different actions within it. To do this, you first have to infer the random state the agent is experiencing. In the normal model-based case, you’d have some prior over world states, and just sample from it. However, if you use the experience of the agent’s trajectory, you can make a better guess as to what world configuration it was dropped into. If you can do this, which is, technically speaking, sampling from the posterior over unseen context conditional on an agent’s experience, then the paper suggests you’ll be able to generate data that’s more realistic, because the trajectories will be direct counterfactuals of “real world” scenarios, rather than potentially-unsolvable or unrealistic draws from the prior. This is, essentially, the approach proposed by the paper: during training, they make this “world state” visible to the agent, and let it learn a model predicting what state it started with, given some trajectory of experience. They also learn a model that predicts the outcome, and ultimately the value, of actions taken, conditioned on this random context (as well as visible context and the agent’s prior actions). 
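The prior-versus-posterior distinction above can be made concrete with a toy sketch. Everything here is a stand-in of my own invention, far simpler than the paper’s setup: a binary hidden context observed through noise, a one-step “environment”, and exact Bayesian inference in place of a learned inference model. The point is only the mechanism: score actions under contexts drawn from the posterior given observed experience, not from the prior:

```python
import random

def simulate(context, action):
    """Toy environment: reward depends on the hidden world context."""
    return 1.0 if action == context else 0.0

def posterior_over_context(observations, noise=0.1):
    """P(context | observations) for a binary context seen through i.i.d. noise."""
    like = {c: 1.0 for c in (0, 1)}
    for obs in observations:
        for c in (0, 1):
            like[c] *= (1 - noise) if obs == c else noise
    z = like[0] + like[1]
    return {c: like[c] / z for c in (0, 1)}

def counterfactual_rollouts(observations, actions, n=1000, seed=0):
    """Estimate each action's value under posterior draws of the hidden context."""
    rng = random.Random(seed)
    post = posterior_over_context(observations)
    returns = {a: 0.0 for a in actions}
    for _ in range(n):
        c = 0 if rng.random() < post[0] else 1   # sample context from the posterior
        for a in actions:
            returns[a] += simulate(c, a) / n
    return returns

# Three noisy glimpses strongly suggest the hidden context is 1, so action 1
# should score far better here than under the uniform prior (0.5 for each).
est = counterfactual_rollouts(observations=[1, 1, 0], actions=[0, 1])
```

Under the prior, both actions would look equally good; conditioning on experience is what makes the simulated data informative.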
They start out by using this as a tool for policy evaluation, which is a nice problem setup because you can actually check how well you’re doing against a baseline: if you want to know how good your simulated data is at replicating the policy reward on real data, you can just try it out on real data and see. The authors find that they reduce policy reward estimation error pretty substantially by adding steps of experience (in Bayesian terms, bits of evidence moving them from the prior towards the posterior). https://i.imgur.com/sNAcGjZ.png They also experiment with using this for actual policy search, but, honestly, I didn’t quite follow the intuitions behind Guided Policy Search, so I’m just going to not dive into that for now, since I think a lot of the key contributions of the paper are wrapped up in the idea of “estimate the reward of a policy by simulating data from a counterfactual trajectory”. 
[link]
This paper feels a bit like watching a 90’s show, where everyone’s in denim and miniskirts, except it’s a 2017 ML paper, and everything uses attention. (I’ll say it again: ML years are like dog years, but more so). That said, that’s not a critique of the paper: finding clever ways to cobble together techniques for your application can be an important and valuable contribution. This paper addresses the problem of text-to-image generation: how to take a description of an image and generate an image that matches it. It makes two main contributions: 1) a GAN structure that seems to merge insights from attention and Progressive GANs in order to select areas of the sentence to inform details in specific image regions, and 2) a novel discriminator structure to evaluate whether a sentence matches an image. https://i.imgur.com/JLuuhJF.png Focusing on the first of these: their generation system works by an iterative process that gradually builds up image resolution, and also pulls specific information from the sentence to inform details in each region. The first layer of the network generates a first “hidden state” based on a compressed representation of the sentence as a whole (the final hidden state of an LSTM text encoder, I believe), as well as random noise (the typical input to a GAN). Subsequent “hidden states” are calculated by computing attention weightings between each region of the image and each word in the sentence, and pulling together a per-region context vector based on that attention map. (As far as I understand it, “region” here refers to the fact that when you’re at lower spatial scales of what is essentially a progressive generation process, 64x64 rather than 256x256, for example, each “pixel” actually represents a larger region of the image). 
I’m using quotes around “hidden state” in the above paragraph because I think it’s actually pretty confusing terminology, since it suggests a recurrent structure, but this model isn’t actually recurrent: there’s a specific set of weights for resolution blocks 0, 1, and 2. This whole approach, of calculating a specific attention-weighted context vector over input words based on where you are in the generation process, is very conceptually similar to the original domain of attention, where the attention query would be driven by the hidden state of the LSTM generating the translated version of some input sentence, except, here, instead of translating between languages, you’re translating across mediums. The loss for this model is a combination of per-layer losses and a final, special, full-resolution loss. At each level of resolution, there exists a separate discriminator, which seems to be able to take in both 1) only an image, and judge whether it thinks that image looks realistic on its own, and 2) an image and a global sentence vector, and judge whether the image matches the sentence. It’s not fully clear from the paper, but it seems like this is based on just feeding in the sentence vector as additional input. https://i.imgur.com/B6qPFax.png For each non-final layer’s discriminator, the loss is a combination of both of these unconditional and conditional losses. The final contribution of this paper is something they call the DAMSM loss: the Deep Attention Multimodal Similarity Model. This is a fairly complex model structure, whose ultimate goal is to assess how closely a final generated image matches a sentence. 
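The generation-side attention step, where each image region forms a query against every word and pulls back a weighted word-context vector, can be sketched as below. The shapes, the bare dot-product scoring, and the shared embedding dimension are simplifying assumptions of mine, not the paper’s exact projections:

```python
import numpy as np

def region_word_context(regions, words):
    """For each image region, build an attention-weighted word context vector.

    regions: (R, d) region features acting as queries.
    words:   (T, d) word embeddings acting as keys/values.
    Returns the (R, d) context vectors and the (R, T) attention map.
    """
    scores = regions @ words.T                    # (R, T) dot-product scores
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # each region's weights sum to 1
    context = attn @ words                        # (R, d) per-region contexts
    return context, attn

rng = np.random.default_rng(0)
regions = rng.normal(size=(16, 8))   # e.g. a 4x4 grid of region features
words = rng.normal(size=(5, 8))      # a 5-word sentence
context, attn = region_word_context(regions, words)
```

The context vectors then join the image features as input to the next resolution block, which is how word-level detail reaches specific regions.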
The whole structure of this loss is based on projecting region-level image features (from an intermediate, 17x17 layer of a pretrained Inception Net) and word features into the same space, and then calculating dot product similarities between them, which are then used to build “visual context vectors” for each word (for each word, create a weighted sum of visual vectors, based on how similar each is to the word). Then, we take each word’s context vector, and see how close it is to the original word vector. If we, again, imagine image and word vectors as being in a conceptually shared space, then this is basically saying “if I take a weighted average of all the things that are most similar to me, how similar is that weighted average to me, ultimately?”. This allows a “concept representation” match to be found when, for example, a particular word’s concept, like “beak”, is only present in one region, but present there very strongly: the context vector will be strongly weighted towards that region, and will end up being very close, in cosine similarity terms, to the word itself. By contrast, if none of the regions are a particularly good match for the word’s concept, this value will be low. DAMSM then aggregates up to an overall “relevance” score between a sentence and an image, which is simply a sum over each word’s “concept representation” score. It then calculates conditional probabilities in two directions: what’s the probability of the sentence, given the image (the relevance score of (Sent, Img), divided by that image’s summed relevance with all possible sentences in the batch), and, also, what’s the probability of the image, given the sentence (the relevance score of the pair, divided by the sentence’s summed relevance with all possible images in the batch). 
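The word-side computation above, in miniature. This is a simplified DAMSM-style score of my own construction: it omits the temperature/sharpening factors the actual loss uses, and fakes the projected features with random vectors, but it shows the mechanism, with each word attending over regions, taking the cosine between the word and its visual context, and summing over words:

```python
import numpy as np

def word_region_relevance(words, regions):
    """Relevance of an image to a sentence via per-word visual context vectors.

    words:   (T, d) word features; regions: (R, d) region features,
    assumed already projected into a shared space.
    """
    sims = words @ regions.T                      # (T, R) word-region similarity
    sims -= sims.max(axis=1, keepdims=True)
    attn = np.exp(sims)
    attn /= attn.sum(axis=1, keepdims=True)       # each word attends over regions
    contexts = attn @ regions                     # (T, d) visual context per word
    # Cosine similarity between each word and its visual context vector.
    cos = (words * contexts).sum(axis=1) / (
        np.linalg.norm(words, axis=1) * np.linalg.norm(contexts, axis=1))
    return cos.sum()                              # aggregate over words

rng = np.random.default_rng(1)
words = rng.normal(size=(4, 8))
# An image whose regions literally contain the word concepts should score
# higher than an unrelated image.
matched = np.vstack([words, rng.normal(size=(12, 8))])
unmatched = rng.normal(size=(16, 8))
s_match = word_region_relevance(words, matched)
s_rand = word_region_relevance(words, unmatched)
```

A “beak”-like word present strongly in one region pulls its context vector toward that region, pushing its cosine term toward 1, which is exactly the behavior described above.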
In addition to this word-level concept modeling, DAMSM also has full sentence-level versions, where it simply calculates the relevance of each (sentence, image) pair by taking the cosine similarity between the global sentence and global image features (the final hidden state of an encoder RNN, and the final aggregated Inception Net features, respectively). All these losses are aggregated together, to get one that uses both global information, and information as to whether specific words in a sentence are represented well in an image. 
[link]
This is a paper where I keep being torn between the responses of “this is so simple it’s brilliant; why hasn’t it been done before?” and “this is so simple it’s almost tautological, and the results I’m seeing aren’t actually that surprising”. The basic observation this paper makes is one made frequently before, most recently to my memory by Geoff Hinton in his Capsule Net paper: sometimes the translation invariance of convolutional networks can be a bad thing, and lead to worse performance. In a lot of ways, translation invariance is one of the benefits of using a convolutional architecture in the first place: instead of having to learn separate feature detectors for “a frog in this corner” and “a frog in that corner,” we can instead use the same feature detector, and just move it over different areas of the image. However, this paper argues, this makes convolutional networks perform worse than might naively be expected at tasks that require them to remember, or act in accordance with, coordinates of elements within an image. For example, the authors find that normal convolutional networks take nearly an hour and 200K parameters to learn to “predict” the one-hot encoding of a pixel, when given the (x, y) coordinates of that pixel as input, and even then only get up to about 80% accuracy. Similarly, trying to take an input image with only one pixel active, and predict the (x, y) coordinates as output, is something the network is able to do successfully, but only when the test points are sampled from the same spatial region as the training points: if the test points are from a held-out quadrant, the model can’t extrapolate to the (x, y) coordinates there, and totally falls apart. 
https://i.imgur.com/x6phN4p.png The solution proposed by the authors is a really simple one: at one or more layers within the network, in addition to the feature channels sent up from the prior layer, add two additional channels: one with deterministic values going from -1 (left) to 1 (right), and the other going from -1 (top) to 1 (bottom). This essentially adds two fixed “features” to each pixel, which jointly carry information about where it is in space. Just by adding this small change, we give the network the ability to use spatial information or not, as it sees fit. If these features don’t prove useful, their weights will stay around their initialization values of zero in expectation, and the behavior should be much like a normal convolutional net’s. However, if they prove useful, convolution filters at the next layer can take position information into account. It’s easy to see how this would be useful for this paper’s toy problems: you can just create a feature detector for “if this pixel is active, pass forward information about its spatial position,” and predict the (x, y) coordinates out easily. You can also imagine this capability helping with more typical image classification problems, by having feature filters that carry with them not only content information, but information about where a pattern was found spatially. The authors do indeed find comparable performance or small benefits on ImageNet, MNIST, and Atari RL, when applying their layers in lieu of normal convolutional layers. On GANs in particular, they find less mode collapse, though I don’t yet 100% follow the intuition of why this would be the case. https://i.imgur.com/wu7wQZr.png 
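The coordinate-channel trick itself fits in a few lines. A minimal sketch, assuming NCHW-layout numpy arrays (the function name and layout choice are mine); a real CoordConv layer would just run an ordinary convolution over the augmented tensor:

```python
import numpy as np

def add_coord_channels(x):
    """Append fixed x/y coordinate channels to a batch of feature maps.

    x: (N, C, H, W) array. Returns (N, C + 2, H, W): one channel sweeping
    -1..1 left to right, and one sweeping -1..1 top to bottom.
    """
    n, _, h, w = x.shape
    xs = np.linspace(-1.0, 1.0, w)                   # left -> right
    ys = np.linspace(-1.0, 1.0, h)                   # top -> bottom
    x_chan = np.broadcast_to(xs, (n, 1, h, w))       # constant down columns
    y_chan = np.broadcast_to(ys[:, None], (n, 1, h, w))  # constant across rows
    return np.concatenate([x, x_chan, y_chan], axis=1)

feats = np.zeros((2, 3, 4, 4))
out = add_coord_channels(feats)
```

Because the channels are deterministic, they add no parameters; only the next layer’s filters grow by two input channels.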
[link]
This paper was a real delight to read, and even though I’m summarizing it here, I’d really encourage you, if you’re reading this, to read the paper itself, since I found it to be unusually clearly written. It tackles the problem of understanding how features of loss functions (these integral, yet arcane, objects defined over millions of parameter dimensions) impact model performance. Loss function analysis is generally a difficult area, since the number of dimensions, and the number of points needed to evaluate the loss, are both so high. The latter presents computational challenges, the former ones of understanding: human brains and many-dimensional spaces are not a good fit. Overall, this paper contributes by 1) arguing for a new way of visualizing loss functions, 2) demonstrating how and in what cases “flatness” of the loss function contributes to performance and trainability, and 3) devising a technique for visualizing the trajectories optimizers take across the loss surface. The authors review a few historically common ways of visualizing loss functions, before introducing their variant. The simplest, one-dimensional visualization technique, 1D Linear Interpolation, works by taking two parameter settings (say, a random initialization, and the final network minimum), and smoothly interpolating between the two, by taking a convex combination mediated by alpha. Then, you can plot the value of the loss at all of these parameter configurations as a function of alpha. If you want to plot in 2D, with a contour plot, you can do so in a pretty similar manner, by picking two random “direction vectors” of the same size as the parameter vector, and then adding amounts of those directions, weighted by alpha and beta, to your starting point. These random directions become your axes, and you get a snapshot of the change in your loss function as you move along them. The authors then make the observation that these techniques can’t natively be used to compare two different models, if the parameters of those models are on different scales. 
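The two baseline recipes above can be sketched in a few lines; `loss` here is any callable on a flat parameter vector, and the toy quadratic, function names, and grid sizes are my stand-ins, not the paper’s code:

```python
import numpy as np

def interpolate_1d(loss, theta_a, theta_b, steps=50):
    """Loss along the line (1 - alpha) * theta_a + alpha * theta_b."""
    alphas = np.linspace(0.0, 1.0, steps)
    values = np.array([loss((1 - a) * theta_a + a * theta_b) for a in alphas])
    return alphas, values

def surface_2d(loss, theta, d1, d2, extent=1.0, steps=25):
    """Loss over the plane theta + alpha * d1 + beta * d2 (random directions)."""
    grid = np.linspace(-extent, extent, steps)
    return np.array([[loss(theta + a * d1 + b * d2) for b in grid] for a in grid])

# Toy quadratic "loss" with its minimum at the origin.
loss = lambda w: float(np.dot(w, w))
theta_init = np.ones(10)
theta_min = np.zeros(10)
alphas, curve = interpolate_1d(loss, theta_init, theta_min)
d1 = np.random.default_rng(2).normal(size=10)
d2 = np.random.default_rng(3).normal(size=10)
surface = surface_2d(loss, theta_min, d1, d2)
```

Plotting `curve` against `alphas`, or `surface` as a contour plot, gives exactly the figures the paper starts from.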
If you take a neural net, multiply one layer by 10, and then divide the next layer by 10, you’ve essentially done a no-op that won’t impact the outcome. However, if you’re moving by a fixed amount along your random direction in parameter space, you’ll have to move much farther to cover the commensurate amount of distance in the network that’s been multiplied by 10. To address this problem, they suggest a simple fix: after you’ve selected each of your random directions, scale the value in each direction vector by the norm of the filter that corresponds to that value. This gets rid of the sensitivity of your plots to the scale of the weights. (One thing I admit I’m a little confused by here is the fact that each value in the direction vector corresponds to a filter, rather than to a weight; I would have natively thought theta, and all the direction vectors, are of length number-of-model-parameters, and each value is a single weight. I think I still broadly grasp the intuition, but I’d value having a better sense of this). To demonstrate the value of their normalization system, they compare the interpolation plots for models with small and large batch sizes, with and without weight decay. Small batches are known to increase flatness of the loss function around the eventual minimum, which seems co-occurrent with good generalization results. And that bears out in the original model’s linear interpolation (figs a, b, c), where the small-batch model has the wider solution basin, and also better performance. However, once weight decay is applied (figs d, e, f), the small-batch basin appears to shrink to be very narrow, although small-batch still has dominant performance. At first glance, this would seem to be a contradiction of the “flatter solutions mean more generalization” rule. https://i.imgur.com/V0H13kK.png But this is just because weight decay hits small-batch models more strongly, because they make more distinct updates per epoch, during each of which they apply the weight decay penalty. 
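The normalization fix itself is simple to sketch. Here I treat the parameters as a list of per-filter arrays (my simplification of the paper’s filter-wise scheme): each random direction filter is rescaled to match the norm of the corresponding trained-weight filter, so a layer that happens to live on a 10x scale gets a 10x-scale direction to match:

```python
import numpy as np

def filter_normalize(direction, weights):
    """Rescale a random direction, filter by filter, to the trained weights' norms.

    direction, weights: lists of arrays, one per filter. Each direction filter
    is scaled to have the same norm as its weight filter, which removes the
    plot's sensitivity to per-layer weight scale.
    """
    out = []
    for d, w in zip(direction, weights):
        out.append(d * (np.linalg.norm(w) / (np.linalg.norm(d) + 1e-10)))
    return out

rng = np.random.default_rng(0)
# Two "filters" on wildly different scales, as if one layer were multiplied
# by 10 and another divided by 10 (the no-op rescaling described above).
weights = [np.ones(9) * 10.0, np.ones(9) * 0.1]
direction = [rng.normal(size=9), rng.normal(size=9)]
normed = filter_normalize(direction, weights)
```

Moving a unit of alpha along `normed` now means the same relative perturbation in both filters, regardless of their absolute scale.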
This means that when weight decay is applied, the overall scale of the weights in the small-batch network is lower, and so its solution looked “sharp” when plotted on the same weight scale as the large-batch network. When normalization is used, this effect by and large goes away, and you once again see higher performance with flatter loss functions. (Batch size and performance, with and without weight decay, shown normalized below.) https://i.imgur.com/vEUIgo0.png A few other, more scattered observations from the paper: - I’ve heard explanations of skip connections in terms of “giving the model shorter gradient paths between parameters and output,” but haven’t really seen an argument for why skip connections lead to smoother loss functions, even though they empirically seem to. https://i.imgur.com/g3QqRzh.png - The authors also devise a technique for visualizing the change in the loss function along the trajectory taken by the optimization algorithm, so that different algorithms can be compared. The main problem with previous methods for this has been that optimization trajectories happen in a low-dimensional manifold within parameter space, so if you just randomly select directions, you won’t see any interesting movement along the trajectory. To fix this, they choose as their axes the principal components you get from making a matrix out of the parameter values at each epoch: this prioritizes the directions that had the most variance throughout training. 
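That trajectory-axes idea can be sketched as follows: stack the per-epoch parameter snapshots (centered on the final point) into a matrix, and take its top two principal directions as the plotting axes. A plain SVD stands in here for whatever PCA routine the authors used, and the fake one-direction trajectory is my own toy:

```python
import numpy as np

def trajectory_axes(snapshots, final):
    """Top-2 PCA directions of an optimization trajectory.

    snapshots: (epochs, n_params) parameter values over training.
    final: (n_params,) final parameters. Returns two orthonormal direction
    vectors capturing the most variance in the path, to use as plot axes.
    """
    deltas = snapshots - final                    # center on the final point
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    return vt[0], vt[1]

rng = np.random.default_rng(0)
final = rng.normal(size=50)
main = rng.normal(size=50)
# A fake trajectory that mostly moves along one fixed direction, plus noise:
# random axes would almost surely miss it, but PCA recovers it.
snaps = np.array([final + (5 - t) * main + 0.01 * rng.normal(size=50)
                  for t in range(6)])
d1, d2 = trajectory_axes(snaps, final)
```

Projecting each snapshot onto `d1` and `d2` then gives a 2D path you can overlay on the loss contours.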
[link]
This paper focuses on the well-known fact that adversarial examples are often transferable: that is, an adversarial example created by optimizing loss on a surrogate model trained on similar data can often still induce increased loss on the true target model, though typically not to the same magnitude as an example optimized against the target itself. Its goal is to come up with a clearer theoretical formulation for transferred examples, and to more clearly understand what kinds of models transfer better than others. The authors define their two scenarios of interest as white box (where the parameters of the target model are known), and limited knowledge, or black box, where only the data type and feature representation are known, and both the exact training dataset and the parameters of the target model are unknown. Most of the mathematics of this paper revolve around this equation, which characterizes how to find a delta that maximizes loss on the surrogate model: https://i.imgur.com/Y0mD35x.png In words: you’re finding a delta (a perturbation of each input value) such that the p-norm of delta is less than some radius epsilon, and such that delta maximizes the dot product between delta and the model’s gradient with respect to the inputs. The closer two vectors are to one another, the higher their dot product. So, having your delta just *be* the model gradient w.r.t. inputs maximizes that quantity. However, we also need to meet the requirement of having our perturbation’s norm be less than epsilon, so in order to find the actual optimal value, we divide by the norm of the gradient (to get ourselves a norm of 1), and multiply by epsilon (to get ourselves a norm of epsilon). This leads to the optimal value of delta being, for a norm of 2: https://i.imgur.com/Op0H7KL.png An important thing to remember is that all of the above has been using the surrogate model’s gradient, meaning it’s been an examination of what the optimal delta is when we’re calculating against the surrogate model. 
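For p = 2, that closed form is just the gradient rescaled to norm epsilon. A minimal numpy sketch (the function name is mine, and a random vector stands in for a real model’s input gradient):

```python
import numpy as np

def optimal_delta_l2(grad, eps):
    """The perturbation of L2 norm eps that maximizes the dot product with grad."""
    return eps * grad / np.linalg.norm(grad)

rng = np.random.default_rng(0)
grad = rng.normal(size=20)            # stand-in for the surrogate's input gradient
delta = optimal_delta_l2(grad, eps=0.5)
# For comparison: any other perturbation of the same norm gets a smaller
# dot product with the gradient (Cauchy-Schwarz).
other = rng.normal(size=20)
other *= 0.5 / np.linalg.norm(other)
```

The achieved first-order loss increase is `delta @ grad`, which equals `eps * ||grad||`, i.e. the upper bound discussed next.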
But, if we plug in the optimal transfer value of delta we found above, how does this compare to the increase in loss we’d get if we were able to optimize against the true model? https://i.imgur.com/RHILZK1.png Loss on the true model is, as above, calculated as the dot product of the delta perturbation with the gradient w.r.t. inputs of the true model. Using the same logic as above, this quantity is maximized when our perturbation is as close as possible to the target model’s gradient vector. So, the authors show, the degree to which adversarial examples calculated on one model transfer to another is mediated by the cosine distance between the surrogate model’s gradient vector and the target model’s. The more similar these gradients w.r.t. the input are to one another, the closer the surrogate-model loss increase will be to the target-model loss increase. This is one of those things that makes sense once it’s laid out, but it’s still useful to have a specific conceptual quantity to point to when predicting whether adversarial examples will transfer, rather than just knowing that they do, at least some of the time, to at least some extent. Another interesting thing to notice from the above equation, though not directly related to transferred examples, is the right hand side of the equation, the upper bound on loss increase, which is the p-norm of the gradient vector of the target model. In clearer words, this means that the amount of loss that it’s possible to induce on a model using a given epsilon of perturbation is directly dependent on the norm of that model’s gradient w.r.t. inputs. This suggests that more highly regularized models, which are by definition smoother and have smaller gradients with respect to inputs, will be harder to attack. This hypothesis is borne out by the authors’ experiments. However, they also find, consistent with my understanding of prior work, that linear models are harder to attack than nonlinear ones. 
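The alignment claim can be illustrated directly: to first order, a surrogate-crafted delta achieves a fraction of the best-possible target loss increase equal to the cosine between the two gradients. A toy sketch with random vectors standing in for real model gradients (all names mine):

```python
import numpy as np

def transfer_ratio(g_surrogate, g_target, eps=1.0):
    """First-order loss increase on the target from a surrogate-crafted delta,
    divided by the best achievable increase (crafting against the target).
    For the L2 case, this ratio is the cosine between the two gradients."""
    delta = eps * g_surrogate / np.linalg.norm(g_surrogate)
    achieved = float(delta @ g_target)
    best = eps * float(np.linalg.norm(g_target))
    return achieved / best

rng = np.random.default_rng(0)
g_t = rng.normal(size=100)                    # "target model" input gradient
g_close = g_t + 0.1 * rng.normal(size=100)    # well-aligned surrogate
g_far = rng.normal(size=100)                  # unrelated surrogate

r_same = transfer_ratio(g_t, g_t)       # crafting on the target itself: ratio 1
r_close = transfer_ratio(g_close, g_t)  # aligned surrogate: ratio near 1
r_far = transfer_ratio(g_far, g_t)      # random surrogate: ratio near 0
```

In high dimensions two random gradients are nearly orthogonal, which is why transfer between unrelated models is the surprising case that needs explaining.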
This draws a line between two ways we’re used to thinking about model complexity/simplicity: having a less-smooth function with bigger gradients increases your vulnerability, but having nonlinear model structure seems to decrease it. https://i.imgur.com/mw9exLU.png One final intriguing empirical finding of this paper is that, in addition to being the hardest models to attack when they are the target, highly regularized models work the best as surrogate models. There’s a simplistic way in which this makes sense, in that if you create your examples against a “harder” adversary to begin with, they’ll be in some sense stronger, and transfer better. However, I’m not sure that intuition is a correct one here. 
[link]
In the literature on adversarial examples, there’s this (to me) constant question: is it the case that adversarial examples are causing the model to objectively make a mistake, or just displaying behavior that is deeply weird and unintuitive relative to our sense of what these models “should” be doing? A lot of the former question seems to come down to arguing about what’s technically “out of distribution”, which has an occasional angels-dancing-on-a-pin quality, but it’s pretty unambiguously clear that the behavior displayed in this paper is weird, and beyond what I naively expected a network to be able to be manipulated into doing. The goal these authors set for themselves is what they call “reprogramming” of a network: they want the ability to essentially hijack the network’s computational engine to perform a different task, predicting a different set of labels, on a different set of inputs than the ones the model was trained on. For example, one task they perform is feeding in MNIST images at the center of a bunch of (what appear to be random, but are actually carefully optimized) pixels, and getting a network that can predict MNIST labels out the other end. Obviously, it’s not literally possible to change the number of outputs that a network produces once it’s trained, so the authors arbitrarily map ImageNet outputs to MNIST categories (like, “when this model predicts Husky, that actually means the digit 7”) and then judge how well this mapped output performs as an MNIST classifier. I enjoyed the authors’ wry commentary here about the arbitrariness of the mapping, remarking that “a ‘White Shark’ has nothing to do with counting 3 squares in an image, and an ‘Ostrich’ does not at all resemble 10 squares”. https://i.imgur.com/K02cwK0.png This paper assumes a white box attack model, which implies visibility of all of the parameters, and the ability to directly calculate gradients through the model. 
So, given this setup of an input surrounded by modifiable pixel weights, and a desire to assign your “MNIST labels” correctly, this becomes a straightforward optimization problem: modify the values of your input weights so as to maximize your MNIST accuracy. An important point to note here is that the same input mask of pixel values is applied for every new-task image, and so these values are optimized over a full training set of inserted images, the way that normal weights would be. One interesting observation the authors make is that, counter to the typical setup of adversarial examples, this attack would not work with a fully linear model, since you actually need your “weights” to interact with your “input”, which is different each time, even though both are just different areas of your true input. This need to have different regions of the input determine how other areas of the input are processed isn’t possible in a linear model, where each input has a distinct impact on the output, regardless of other input values. By contrast, when you just need to optimize a single perturbation to get the network to jack up the prediction for one class, that can be accomplished by just applying a strong enough bias everywhere in the input, all pointing in the same direction, which can be added together linearly and still get the job done. The authors are able to perform MNIST, and the task of “count the squares in this small input”, to relatively high levels of accuracy. They perform reasonably on CIFAR (as well as a fully connected network would, but not as well as a convnet). They found that performance was higher when using a pretrained ImageNet network, relative to just random weights. There’s some suggestion made that this implies there’s a kind of transfer learning going on, but honestly, this is weird enough that it’s hard to say. 
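The optimization loop itself is easy to sketch in miniature. This is emphatically not the paper’s setup: the frozen "network" here is a tiny random one-hidden-layer ReLU model (small enough to backprop by hand in numpy), the label remapping is taken to be the identity, and all sizes and names are mine. The one structural point it preserves is the paper’s: a single program `W`, confined to the border pixels by a mask, is shared across every inserted image and trained like an ordinary weight:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reprogram(A, M, images, labels, mask, steps=200, lr=0.1):
    """Learn one shared additive program W so that the frozen network
    x -> M @ relu(A @ x), with its classes remapped one-to-one, predicts
    the new-task labels. Identity class remapping is assumed."""
    W = np.zeros(A.shape[1])
    for _ in range(steps):
        g = np.zeros_like(W)
        for x, y in zip(images, labels):
            h = A @ (x + mask * W)                    # frozen first layer
            p = softmax(M @ np.maximum(h, 0.0))       # frozen output layer
            err = M.T @ (p - np.eye(M.shape[0])[y])   # dL/dh before the ReLU
            g += mask * (A.T @ (err * (h > 0)))       # backprop to the program
        W -= lr * g / len(images)                     # gradient-descend W only
    return W

rng = np.random.default_rng(0)
n_pix, n_hid, n_cls = 20, 32, 3
A = rng.normal(size=(n_hid, n_pix)) / np.sqrt(n_pix)   # frozen "network" weights
M = rng.normal(size=(n_cls, n_hid)) / np.sqrt(n_hid)
mask = (np.arange(n_pix) >= 5).astype(float)           # "border": pixels 5..19
images = [rng.normal(size=n_pix) * (1 - mask) for _ in range(6)]  # content in 0..4
labels = [0, 1, 2, 0, 1, 2]
W = reprogram(A, M, images, labels, mask)
```

Note the ReLU is doing real work here: with a purely linear frozen model, `M @ (x + mask * W)` splits into a per-image term plus a constant logit offset, so a shared `W` could never classify different images differently, which is exactly the linearity observation above.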
https://i.imgur.com/bj2MUnk.png They were able to get this reprogramming to work on different model structures but, fascinatingly, saw distinctive patterns in the "weight pixels" they needed to add for each model structure, with ResNet easily differentiable from Inception. One minor quibble I have with the framing of this paper (which I overall found impressive, creative, and well-written) is that I feel like it’s stretching the original frame of “adversarial example” a bit too far, to the point of possibly provoking confusion. It’s not obvious that the network is making a mistake, per se, when it classifies this very out-of-distribution input as something silly. I suppose, in an ideal world, we may want our models to return to something like a uniform-over-outputs state of low confidence when predicting out of distribution, but that’s a bit different from seeing a gibbon in a picture of a panda. I don’t dispute the authors’ claim that the behavior they’re demonstrating is a vulnerability in terms of its ability to let outside actors “hijack” networks’ compute, but I worry we might be overloading the term “adversarial example” to cover too many types of network failure modes. 
[link]
This paper tries to solve the problem of how to learn systems that, given a starting state and a desired target, can learn the set of actions necessary to reach that target. The strong version of this problem requires a planning algorithm to learn a full set of actions to take the agent from state A to B. However, this is a difficult and complex task, and so this paper addresses a relaxed version of it: generating a set of “waypoint” observations between A and B, such that each successive observation is relatively close to the next in terms of possible actions (the paper calls this ‘h-reachable’, if observations are reachable from one another in h timesteps). With these checkpoint observations in hand, the planning system can then solve many iterations of a much shorter-timescale version of the problem. However, the paper asserts, applying pre-designed planning algorithms in observation space (sparse, high-dimensional) is difficult, because planning algorithms apparently do better with denser representations. (I don’t really understand, based on just reading this paper, *why* this is the case, other than the general fact that high-dimensional, sparse data is just hard for most things.) Historically, a typical workflow for applying planning algorithms to an environment would have been to hand-design feature representations where nearby representations were close in causal decision space (i.e. could be easily reached from one another). This paper’s goal is to derive such representations from data, rather than hand-designing them. The system they design to do this is a little unwieldy to follow, and I only have about 80% confidence that I fully understand all the mechanisms. One basic way you might compress high-dimensional space into a low-dimensional code is by training a Variational Autoencoder, and pulling the latent code out of the bottleneck in the middle. 
However, we also want to be able to map between our low-dimensional code and a realistic observation space, once we’re done planning and have our trajectory of codes, and VAEs typically have difficulty generating high-dimensional observations with high fidelity. If what you want is image-generation fidelity, the natural step would be to use a GAN. However, GANs aren’t really natively designed to learn an informative representation; their main goal is generation, and there’s no real incentive for the noise variables used to seed generation to encode any useful information. One GAN design that tries to get around this is the InfoGAN, which gets its name from the requirement that there be high mutual information between (some subset of) the noise variables used to seed the generator, and the actual observation produced. I’m not going to get into the math of the variational approximation, but what this mechanically shakes out to is: in addition to generating an observation from a code, an InfoGAN also tries to predict the original code subset given the observation. Intuitively, this requirement, for the observation to contain information about the code, also means the code is forced to contain meaningful information about the image generated from it. However, even with this system, even if each code separately corresponds to a realistic observation, there’s no guarantee that closeness in state space corresponds to closeness in “causality space”. Such a correspondence is valuable for planning, because it means that if you chart out a trajectory through state space, it actually corresponds to a reasonable trajectory through observation space. In order to solve this problem, the authors added their final, and most novel, modification to the InfoGAN framework: instead of giving the GAN one latent code and having it predict one observation, they would give it two at a time, and have the GAN try to generate a pair of temporally nearby (i.e. less than h actions away) observations. 
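Mechanically, the mutual-information term reduces to an auxiliary reconstruction loss: a Q head predicts the categorical code subset back from the generated observation, and the generator is penalized by the cross-entropy. A minimal numpy sketch (the function names and the softmax head are my own illustration, not the paper’s exact implementation):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def info_loss(code_onehot, q_logits):
    """Variational lower bound on mutual information: in practice, the
    cross-entropy between the true categorical code fed to the generator
    and the Q network's prediction of that code from the generated
    observation. Low loss means the observation carries the code."""
    probs = softmax(q_logits)
    return -float(np.sum(code_onehot * np.log(probs + 1e-12)))
```

Minimizing this term with respect to both the generator and Q is what forces the code to carry meaningful information about the image generated from it.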
Importantly, they’d also define some transition or sampling function within state space, so that there would be a structured or predictable way that adjacent pairs of states looked. So, if the GAN were able to learn to map adjacent points in state space to adjacent points in observation space, then you’d be able to plan out trajectories in state space, and have them be realistic in observation space. https://i.imgur.com/oVlVc0x.png They do some experiments and show that both adding the “Info” structure of the InfoGAN, and adding the paired causal structure, lead to states with improved planning properties. They also compared the clusters derived from their Causal InfoGAN states to the clusters you’d get from just naively assuming that nearness in observation space meant nearness in causality space. https://i.imgur.com/ddQpIdH.png They specifically tested this on an environment divided into two “rooms”, where there were many pairs of points nearby in Euclidean space, but far away (or mutually inaccessible) in action space. They showed that the Causal InfoGAN (b) was successfully able to learn representations such that points nearby in action space clustered together, whereas a Euclidean representation (c) didn't have this property. 
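One hypothetical instantiation of that transition/sampling structure (the local-Gaussian step here is my own illustrative choice, not necessarily the paper’s): adjacent codes differ by a small random step, so the generator is forced to map nearby points in state space to h-reachable observation pairs.

```python
import numpy as np

def sample_code_pair(dim=8, step_scale=0.1, rng=None):
    """Sample a pair of latent codes intended to correspond to two
    temporally nearby (h-reachable) observations: the second code is a
    small, structured perturbation of the first."""
    rng = np.random.default_rng() if rng is None else rng
    s = rng.normal(size=dim)
    s_next = s + step_scale * rng.normal(size=dim)
    return s, s_next
```

The generator then receives both codes and must produce a plausible adjacent pair of observations, which is what ties locality in code space to locality in action space.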
[link]
This paper builds very directly on the idea of “empowerment” as an intrinsic reward for RL agents. Where empowerment incentivizes agents to increase the amount of influence they’re able to have over the environment, “social influence,” this paper’s metric, is based on the degree to which the actions of one agent influence the actions of other agents, within a multi-agent setting. The goals of the two frameworks are a little different. The notion of “empowerment” is built around a singular agent trying to figure out a short-term proxy for likelihood of long-term survival (the actual feedback point, death, being one no individual wants to hit). By contrast, the problems that the authors of this paper seek to solve are more explicitly multi-agent coordination problems: prisoner’s dilemma-style situations where collective reward requires cooperation. However, they share a mathematical basis: the idea that an agent’s influence on some other element of its environment (be it the external state, or another agent’s actions) is well modeled by calculating the mutual information between its actions and that element. While this is initially a bit of an odd conceptual jump, it does make sense: if an action can give statistical information to help you predict an outcome, it’s likely (obviously not certain, but likely) that that action influenced that outcome. In a multi-agent problem, where cooperation and potentially even communication can help solve the task, being able to influence other agents amounts to “finding ways to make oneself useful to other agents”, because other agents aren’t going to change behavior based on your actions, or “listen” to your “messages” (in the experiment where a communication channel was available between agents) if these signals don’t help them achieve *their* goals. So, this incentive, to influence the behavior of other (self-interested) agents, amounts to a good proxy for incentivizing useful cooperation. 
Zooming in on the exact mathematical formulations (which differ slightly from, though they’re in a shared spirit with, the empowerment math): agent A’s Causal Influence reward is calculated by taking a KL divergence between the action distribution of the other agent (B) conditional on the action A took, and B’s distribution marginalized over the other actions A might have taken (see below; connecting back to empowerment: mutual information is just the expected value of this quantity, taken over A’s action distribution). https://i.imgur.com/oxXCbdK.png One thing you may notice from the above equation is that, because we’re working in KL divergences, we expect agent A to have access to the full distribution of agent B’s policy conditional on A’s action, not just the action B actually took. We also require the ability to sample “counterfactuals,” i.e. what agent B would have done if agent A had done something differently. Between these two requirements, it becomes clear that, in a realistic model of two agents interacting with each other in only one timeline, each having access only to the external and not the internal parameters of the other, these quantities can’t be pulled from direct experience. Instead, they are calculated using an internal model: each agent builds its own MOA (Model of Other Agents), a predictive model of what an agent will do at a given time, conditional on the environment and the actions of all other agents. It’s this model that is used to sample the aforementioned counterfactuals, since that just involves passing in a different input. I’m not entirely sure, in each experiment, whether the MOAs are trained concurrently with agent policies, or in a separate prior step. 
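A minimal sketch of that computation with tabular policies (array shapes and names are my own; in the paper, the conditional distributions would come from each agent’s learned MOA rather than a given table):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

def influence_rewards(b_policy_given_a, p_a):
    """Causal influence of A on B, per possible A-action: KL between B's
    policy conditioned on that action and B's marginal policy with A's
    action averaged out. Mutual information is this quantity's
    expectation under p_a.

    b_policy_given_a: (n_actions_A, n_actions_B) matrix, each row B's
                      action distribution given one A-action.
    p_a:              A's own action distribution."""
    marginal_b = p_a @ b_policy_given_a
    return np.array([kl(row, marginal_b) for row in b_policy_given_a])
```

If B’s behavior is independent of A’s action, every reward is zero; the more B’s conditional policy shifts with A’s choice, the larger the reward for the action that caused the shift.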
https://i.imgur.com/dn2cBg4.png Testing on, again, Prisoner’s Dilemma-style problems requiring agents to take risky collaborative actions, the authors did find higher performance using their method, compared to approaches where each agent just maximizes its own external reward (which, it should be said, does depend on other agents’ actions), with no explicit incentive towards collaboration. Interestingly, when they specifically tested giving agents access to a “communication channel” (the ability to output discrete signals or “words” visible to other agents), they found that it was able to train just as effectively with only an influence reward as it was with both an influence and an external reward. 
[link]
This paper performs a fascinating toy experiment, to see if something language-like in structure can be effectively induced in a population of agents, if they are given incentives that promote it. In some sense, a lot of what they find “just makes sense,” but it’s still a useful proof of concept to show that it can be done. The experiment takes place in a simple, two-dimensional world, with a fixed number of landmarks (locations where goals need to take place), agents, and actions. In this construction, each agent has a set of internal goals, which can either be actions (like “go to green landmark”) they themselves need to perform, or actions that they want another agent to perform. Agents’ goals are not visible to other agents, but all agents’ reward is defined to be the aggregated reward of all agents together, so if agent A has a goal involving an action of agent B’s, it’s in B’s “interest” to do that action, if it can be communicated to them. In order to facilitate other agents performing goals, at each step, each agent both takes an action and emits an “utterance”, which is just a discrete symbolic “word” out of some fixed vocabulary of words. (Note that applying “word” here is a bit fuzzy; the agents do not pronounce or spell a character-based word, they just pick a discrete symbol that plays the role of a word.) Even though other agents cannot see a given agent’s goals, they can see its public utterances, and so agents learn that communication is a way to induce other agents to perform desired actions. As a mathematically interesting aside: this setup, of allowing each agent to sample a single discrete word out of a small vocabulary at each step, requires the deployment of some interesting computational tricks. 
First off, in general, sampling a single discrete symbol out of a set of possible symbols is not differentiable, since it’s a discrete rather than continuous action, and derivatives require continuous functions. However, a paper from 2016 proposed a (heuristic) solution to this problem by means of the Gumbel-Softmax trick. This derives from the older “Gumbel-Max trick”, which is the mathematical fact that if you want to sample from a categorical distribution, a computationally easy way to do so is to add a variable sampled from a Gumbel(0, 1) distribution to the log probability of each category, and then take the argmax of this as the index of the sampled category (I’m not going to go another level down into why this is true, since I think it’s too far afield of the scope of this summary). Generally, argmax functions are also not differentiable. However, they can be approximated with softmaxes, which interpolate between a totally uniform and a very nearly discrete-sample distribution based on a temperature parameter. In practice, or, at least, if this paper does what the original Gumbel-Softmax paper did, during training a discrete sample is taken, but a low-temperature continuous approximation is used for the actual gradient calculation (i.e. for gradients, the model pretends that it used the continuous approximation rather than the discrete sample). https://i.imgur.com/0RpRJG2.png Coming back to the actual communication problem, the authors do find that under these (admittedly fairly sanitized and contrived) circumstances, agents use series of discrete symbols to communicate goals to other agents, which ends up looking a lot like a very simple language. https://i.imgur.com/ZF0EbN4.png As one might expect, in environments where there were only two agents, there was no symbol that ended up corresponding to “red agent” or “blue agent”, since each could realize that the other was speaking to it. 
However, in three-agent environments, the agents did develop symbols that clearly mapped to these categories, to specify who directions were being given to. The authors also tried cutting off verbal communication; in these situations, the agents used gaze and movement to try to signal what they wanted other agents to do. Probably most entertainingly, when neither verbal nor visual communication was allowed, agents would move to and “physically” push other agents to the location where their action needed to be performed. 
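The Gumbel-Softmax trick described above can be sketched in a few lines (a forward-pass-only illustration; in a real model, the straight-through variant would pair the hard sample with this soft relaxation for gradients):

```python
import numpy as np

def gumbel_softmax(logits, temperature=0.5, rng=None):
    """Gumbel-Max says argmax(logits + Gumbel noise) is an exact
    categorical sample; swapping the argmax for a temperature-controlled
    softmax yields a differentiable, nearly one-hot approximation when
    the temperature is low."""
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    e = np.exp(y - y.max())
    return e / e.sum()
```

As the temperature goes to zero, the output approaches a one-hot sample from the categorical distribution defined by the logits; as it grows, the output flattens toward uniform.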
[link]
This paper continues in the tradition of curiosity-based models, which try to reward models for exploring novel parts of their environment, in the hope that this can intrinsically motivate learning. However, this paper argues that it’s insufficient to just treat novelty as an occasional bonus on top of a normal reward function, and that instead you should figure out a process that’s more specifically designed to increase novelty. Specifically: you should design a policy whose goal is to experience transitions and world-states that are high-novelty. In this setup, as in other curiosity-based papers, “high novelty” is defined in terms of a state being unpredictable given a prior state, history, and action. However, where other papers saw novelty reward as something only applied when the agent arrived somewhere novel, here the authors build a model (technically, an ensemble of models) to predict the state at various future points. The ensemble is important here because it’s (quasi) bootstrapped, and thus gives us a measure of uncertainty. States where the predictions of the ensemble diverge represent places of uncertainty, and thus of high value to explore. I don’t 100% follow the analytic specification of this idea (even though the heuristic/algorithmic description makes sense). The authors frame the utility function of a state and action as being equivalent to the Jensen-Shannon divergence (a distance between probability distributions) shown below. https://i.imgur.com/YIuomuP.png Here, P(S′ | S, a, T) is the probability of a next state given prior state and action under a given model of the environment (transition model), and P(gamma) is the distribution over the space of possible transition models one might learn. A “model” here is one network out of the ensemble of networks that makes up our bootstrapped (trained on different sets) distribution over models. 
Conceptually, I think this calculation is measuring “how different is each sampled model’s state distribution from all the other models in the distribution over possible models”. If the models within the distribution diverge from one another, that indicates a location of higher uncertainty. What’s important about this is that, by building a full transition model, the authors can calculate the expected novelty or “utility” of future transitions the agent might take, because it can make a best guess based on this transition model (which, while called a “prior”, is really something trained on all data up to the current iteration). My understanding is that these kinds of models function similarly to a Q(s,a) or V(s) in a pure-reward case: they estimate the “utility reward” of different states and actions, and then the policy is updated to increase that expected reward. I’ve recently read papers on ICM, and I was a little disappointed that this paper didn’t appear to benchmark against it, but rather against Bootstrapped DQN and Exploration Bonus DQN, which I know less well and can speak to less in terms of conceptual differences from this approach. Another difficulty in actually getting a good sense of the results was that the task being tested on is fairly specific, and different from RL results coming out of the world of e.g. Atari and DeepMind Lab. All of that said, this is a cautiously interesting idea, if the results generalize to beat more baselines on more environments.
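For the discrete-state case, that utility can be sketched as the Jensen-Shannon divergence among the ensemble’s predicted next-state distributions (function names are my own):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -float(np.sum(p * np.log(p + 1e-12)))

def js_disagreement(ensemble_preds):
    """Utility of a (state, action) pair: JS divergence across the
    ensemble's next-state distributions, i.e. entropy of the averaged
    prediction minus the average of the individual entropies. Zero when
    the bootstrapped models agree; large where they diverge, marking a
    high-value region to explore."""
    mean_pred = np.mean(ensemble_preds, axis=0)
    return entropy(mean_pred) - float(np.mean([entropy(p) for p in ensemble_preds]))
```

The policy is then trained to seek out transitions where this disagreement is high, rather than receiving a bonus only upon arrival somewhere novel.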
[link]
This paper proposes a new curiosity-based intrinsic reward technique that seeks to address one of the failure modes of previous curiosity methods. The basic idea of curiosity is that, often, exploring novel areas of an environment can be correlated with gaining reward within that environment, and that we can find ways to incentivize the former that don’t require a hand-designed reward function. This is appealing because many useful-to-learn environments either lack inherent reward altogether, or have reward that is very sparse (i.e. no signal until you reach the end, at which point you get a reward of 1). In both of these cases, supplementing with some kind of intrinsic incentive towards exploration might improve performance. The existing baseline curiosity technique is called ICM, and works based on “surprisal”: asking the agent to predict the next state as a function of its current state, and incentivizing exploration of areas where the gap between these two quantities is high, to promote exploration of harder-to-predict (and presumably more poorly sampled) locations. However, one failure mode of this approach is something called the “noisy TV” problem: if the environment contains something analogous to a television where one can press a button and go to a random channel, that is highly unpredictable, and thus a source of easy rewards, and thus liable to distract the agent from any other actions. As an alternative, the authors here suggest a different way of defining novelty: rather than something that is unpredictable, novelty should be seen as something far away from what I as an agent have seen before. This is more direct than the prior approach, which takes “hard to predict” as a proxy for “somewhere I haven’t explored”, which may not necessarily be a reasonable assumption. 
https://i.imgur.com/EfcAOoI.png They implement this idea by keeping a memory of past (embedded) observations that the agent has seen during the episode, and, at each step, checking whether the current observation is predicted to be more than k steps away from all of the observations in memory (more on that in a moment). If so, a bonus reward is added, and this observation is added to the aforementioned memory. (Which, waving hands vigorously, kind of ends up functioning as a spanning set of prior experience.) https://i.imgur.com/gmHE11s.png The question of “how many steps is observation A from observation B” is answered by a separate Comparator network, which is trained in pretty straightforward fashion: a random-sampling policy is used to collect trajectories, which are then turned into pairs of observations as input, labeled 1 if they occurred > k + p steps apart, and 0 if they occurred < k steps apart. These paired states are passed into a shared-weight convolutional network, which creates an embedding, and, from that embedding, a prediction is made as to whether they’re closer than the thresholds or farther away. This network is pretrained before the actual RL training starts. (Minor sidenote: at RL-training time, the network is chopped in two; the embeddings are read out and stored, and then input as a pair with each current observation to make the prediction.) https://i.imgur.com/1oUWKyb.png Overall, the authors find that their method works better than both ICM and a no-intrinsic-reward baseline on VizDoom (a maze + shooting game), and the advantage is stronger in sparser settings of the external reward. https://i.imgur.com/4AURZbX.png On DeepMind Lab tasks, they saw no advantage on tasks with already-dense extrinsic rewards, and little advantage on the “normally sparse” ones, which they suggest may be due to the task actually being easier than expected. 
They added doors to a maze navigation task, to ensure the agent couldn’t find the target right away, and in this situation their method performed better. They also tried a fully no-extrinsic-reward situation, and their method strongly outperformed both the ICM baseline and (obviously) the only-extrinsic-reward baseline, which was basically an untrained random policy in this setting. Regarding the poor performance of the ICM baseline in this environment: “we hypothesise that the agent can most significantly change its current view when it is close to the wall — thus increasing one-step prediction error — so it tends to get stuck near ‘interesting’ diverse textures on the walls.” 
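The Comparator’s training-data construction described above can be sketched as follows (the function name and concrete threshold values are placeholders; the paper’s thresholds are its k and the gap hyperparameter p):

```python
def label_pairs(trajectory, k=5, gap=2):
    """From one random-policy trajectory, build (obs_i, obs_j, label)
    training triples for the reachability Comparator: label 0 for pairs
    fewer than k steps apart (reachable), label 1 for pairs more than
    k + gap steps apart (unreachable). Pairs in the ambiguous middle
    band are discarded."""
    triples = []
    for i in range(len(trajectory)):
        for j in range(i + 1, len(trajectory)):
            d = j - i
            if d < k:
                triples.append((trajectory[i], trajectory[j], 0))
            elif d > k + gap:
                triples.append((trajectory[i], trajectory[j], 1))
    return triples
```

These labeled pairs are what the shared-weight convolutional network is pretrained on before RL training begins.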
[link]
I really enjoyed this paper: in addition to being a clean, fundamentally empirical work, it was also clearly written, and had some pretty delightful moments of quotable zen, which I’ll reference at the end. The paper’s goal is to figure out how far curiosity-driven learning alone can take reinforcement learning systems, without the presence of an external reward signal. “Intrinsic” reward learning is when you construct a reward out of internal, inherent features of the environment, rather than using an explicit reward function. In some ways, intrinsic learning in RL can be thought of as analogous to unsupervised learning in classification problems, since reward functions are not inherent to most useful environments, and (outside of game environments that inherently provide rewards) frequently need to be hand-designed. Curiosity-driven learning is a subset of intrinsic learning, which uses as a reward signal the difference between a prediction made by the dynamics model (predicting next state, given action) and the true observed next state. Situations where this prediction error is high generate high reward for the agent, which incentivizes it to reach those states, which allows the dynamics model to then make ever-better predictions about them. Two key questions this paper raises are: 1) Does this approach even work when used on its own? Curiosity had previously most often been used as a supplement to extrinsic rewards, and the authors wanted to know how far it could go separately. 2) What is the best feature space to do this “surprisal difference” calculation in? Predicting raw pixels is a high-dimensional and noisy process, so naively we might want something with fewer, more informationally dense dimensions, but it’s not obvious which methods satisfying these criteria will work best, so the paper tried them empirically. The answer to (1) seems to be: yes, at least in the video games tested. 
Impressively, when you track against extrinsic reward (which, again, these games have, but which we’re ignoring in a curiosity-only setting), the agents manage to increase it despite not optimizing against it directly. There were some Atari games where this effect was stronger than others, but overall performance was stronger than might have been naively expected. One note the authors made, worth keeping in mind, is that it’s unclear how much of this is an artifact of the constraints and incentives surrounding game design, which might reflect back a preference for gradually increasing novelty because humans find it pleasant. https://i.imgur.com/zhl39vo.png As for (2), another interesting result of this paper is that random features performed consistently well as a feature space to do these prediction/reality comparisons in. Random features here is really as simple as: design a convolutional net that compresses down to some dimension, randomly initialize it, and then use those randomly initialized weights to run forward passes of the network to get your lower-dimensional state. This has the strong disadvantage of (presumably) not capturing any meaningful information about the state, but also has the advantage of being stable: the other techniques tried, like pulling out the center of a VAE bottleneck, changed over time as they were trained on new states, so they were informative, but non-stationary. My two favorite quotable moments from this paper were: 1) When the authors noted that they had removed the “done” signal associated with an agent “dying,” because it is itself a sort of intrinsic reward. However, “in practice, we do find that the agent avoids dying in the games since that brings it back to the beginning of the game, an area it has already seen many times and where it can predict the dynamics well.” 
Short and sweet: “Avoiding death, because it’s really boring” https://i.imgur.com/SOfML8d.png 2) When they noted that an easy way to hack the motivation structure of a curiositydriven agent was through a “noisy tv”, which, every time you pressed the button, jumped to a random channel. As expected, when they put this distraction inside a maze, the agent spent more time jacking up reward through that avenue, rather than exploring. Any resemblance to one’s Facebook feed is entirely coincidental. 
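The random-features variant of the curiosity reward from this paper can be sketched compactly (sizes and names are my own; a real implementation uses a randomly initialized, frozen convnet rather than a single random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# Fixed random projection: never trained, so the feature space is stable
# (stationary), at the cost of encoding nothing task-meaningful.
W_random = rng.normal(size=(64, 16))

def embed(obs):
    """Project an observation into the fixed random-feature space."""
    return obs @ W_random

def curiosity_reward(predicted_next_obs, true_next_obs):
    """Intrinsic reward: squared error between the dynamics model's
    predicted next observation and the true one, measured in the
    random-feature space rather than raw pixel space."""
    diff = embed(predicted_next_obs) - embed(true_next_obs)
    return float(np.mean(diff ** 2))
```

Because `W_random` never changes, the reward scale doesn’t drift the way it would with a feature extractor (like a VAE encoder) that keeps training on new states.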
[link]
This paper posits that one of the central problems stopping multi-task RL (that is, single models trained to perform multiple tasks well) from reaching better performance is the inability to balance model resources and capacity between the different tasks the model is being asked to learn. Empirically, prior to this paper, multi-task RL could reach ~50% of human accuracy on Atari and DeepMind Lab tasks. The fact that this is lower than human accuracy is actually somewhat less salient than the fact that it’s quite a lot lower than single-task RL: what a single model trained to perform only that task can do. When learning an RL model across multiple tasks, the reward structures of the different tasks can vary dramatically. Some can have high-magnitude, sparse rewards; some can have low-magnitude rewards throughout. If a model learns it can gain what it thinks is legitimately more reward by getting better at a game with an average reward of 2500 than at one with an average reward of 15, it will put more capacity into solving the former task. Even if you apply normalization strategies like reward clipping (which treats all rewards as a binary signal, regardless of magnitude, and just seeks to increase the frequency of rewards), that doesn’t deal with some environments having more frequent rewards than others, and thus more total reward when summed over timesteps. The authors here try to solve this problem by applying a specific kind of normalization, called PopArt normalization, to the problem. PopArt normalization (don’t worry about the name) works by adaptively normalizing both the target and the model’s estimate of the target, at every step. In the Actor-Critic case that this paper works with, the target and estimate being normalized are, respectively, 1) the aggregated rewards of the trajectories from state S onward, and 2) the value estimate at state S. 
If your value function is perfect, these two things should be equivalent, and so you optimize your value function to be closer to the true rewards under your policy. Then, you update your policy to increase the probability of actions with higher advantage (expected reward with that action, relative to the baseline Value(S) of that state). The “adaptive” part refers to correcting for the fact that, when you’re estimating, say, a value function to predict the total future reward of following a policy at a state, V(S) will be strongly non-stationary, since by improving your policy you are directly optimizing to increase that value. This is done by calculating “scale” and “shift” parameters off of recent data. The other part of the PopArt algorithm works by actually updating the estimate our model is producing, so it stays normalized alongside the continually-being-renormalized target. https://i.imgur.com/FedXTfB.png It does this by taking the new and old versions of the scale (sigma) and shift (mu) parameters (which will be used to normalize the target) and updating the weights and biases of the last layer, such that the movement of the estimator matches the movement in the target. Using this toolkit, this paper proposes learning one *policy* that’s shared over all tasks, but keeping separate value estimation functions for each task. Then, it normalizes each task’s values independently, meaning that each task ends up contributing equal weight to the gradient updates of the model (for both the value and policy updates). In doing this, the authors find dramatically improved performance on both Atari and DeepMind Lab, relative to prior IMPALA work. https://i.imgur.com/nnDcjNm.png https://i.imgur.com/Z6JClo3.png 
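A minimal sketch of that preserve-outputs-under-renormalization step, for a scalar value head (the update rate and class layout are my own simplifications of the paper’s method):

```python
import numpy as np

class PopArt:
    """Track a running scale (sigma) and shift (mu) of the value target,
    and adjust the output layer (w, b) so that the *unnormalized* value
    estimate is unchanged when the statistics move."""

    def __init__(self, beta=0.02):
        self.mu, self.nu, self.beta = 0.0, 1.0, beta  # mean, 2nd moment
        self.w, self.b = 1.0, 0.0                     # last-layer params

    @property
    def sigma(self):
        return float(np.sqrt(max(self.nu - self.mu ** 2, 1e-8)))

    def update(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        # Rescale the output layer so sigma*(w*h + b) + mu is invariant.
        self.w *= old_sigma / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def normalize(self, target):
        return (target - self.mu) / self.sigma
```

The invariance check (unnormalized output before vs. after an update) is exactly the property that lets each task be renormalized independently without jolting the shared network.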
[link]
This reinforcement learning paper starts with the constraints imposed by an engineering problem (the need to scale up learning across many GPUs) and ends up, as a result, needing to solve an algorithmic problem along with it. In order to massively scale up their training to be able to train multiple problem domains in a single model, the authors of this paper implemented a system whereby many “worker” nodes execute trajectories (series of actions, states, and rewards) and then send those trajectories back to a “learner” node, which calculates gradients and updates a central policy model. However, because these updates are queued up to be incorporated into the central learner, it can frequently happen that the policy that was used to collect the trajectories is a few steps behind the policy on the central learner to which its gradients will be applied (since other workers have updated the learner since this worker last got a policy download). This results in a need to modify the policy network model design accordingly. IMPALA (Importance Weighted Actor-Learner Architectures) uses an Actor-Critic model design, which means you learn both a policy function and a value function. The policy function’s job is to choose which actions to take at a given state, by making some higher-probability than others. The value function’s job is to estimate the reward from a given state onward, if a certain policy p is followed. The value function is used to calculate the “advantage” of each action at a given state, by taking the reward you receive through action a (and the reward you expect in the future), and subtracting out the value function for that state, which represents the average future reward you’d get if you just sampled randomly from the policy from that point onward. The policy network is then updated to prioritize actions which are higher-advantage. 
If you're on-policy, you can calculate a value function without needing to explicitly calculate the probabilities of each action, because, by definition, if you take actions according to your policy probabilities, then you're sampling each action with a weight proportional to its probability. However, if your actions are calculated off-policy, you need to correct for this, typically by calculating an "importance sampling" ratio, which weights each action by its probability under the desired policy divided by its probability under the policy used for sampling. This cancels out the implicit probability under the sampling policy, and leaves you with your actions scaled in proportion to their probability under the policy you're actually updating. IMPALA shares the basic structure of this solution, but with a few additional parameters to dynamically trade off between the bias and variance of the model. The first parameter, rho, controls how much bias you allow into your model, where bias here comes from your model not being fully corrected to "pretend" that you were sampling from the policy to which gradients are being applied. The tradeoff is that, without a cap, if your policies are far apart you might downweight some actions so aggressively that you don't get a strong enough signal to learn quickly; with a cap, you learn faster, but the policy you learn might be statistically biased. Rho does this by weighting each value function update by: https://i.imgur.com/4jKVhCe.png where rho-bar is a hyperparameter. If rho-bar is high, then we allow stronger weighting effects, whereas if it's low, we put a cap on those weights. The other parameter is c, and instead of weighting each value function update based on policy drift at that state, it weights each timestep based on how likely or unlikely the action taken at that timestep was under the true policy. 
https://i.imgur.com/8wCcAoE.png Timesteps that are much likelier under the true policy are upweighted, and, once again, a hyperparameter, c-bar, puts a cap on the amount of allowed upweighting. Where the prior parameter controlled how much bias there is in the policy we learn, this parameter helps control the variance: the higher c-bar, the higher the variance there will be in the updates used to train the model, and the longer it'll take to converge. 
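A minimal sketch of those two truncated weights (my own illustration; the full V-trace target also chains the c weights across timesteps, which I omit here):

```python
def truncated_is_weights(pi_probs, mu_probs, rho_bar=1.0, c_bar=1.0):
    # pi_probs: probability of each taken action under the learner's policy
    # mu_probs: probability under the (possibly stale) behaviour policy
    ratios = [p / m for p, m in zip(pi_probs, mu_probs)]
    rhos = [min(rho_bar, r) for r in ratios]  # caps the bias of the value target
    cs = [min(c_bar, r) for r in ratios]      # caps the variance of the trace
    return rhos, cs

# Timestep 0: the learner likes this action twice as much as the behaviour
# policy did (ratio 2.0, capped at rho_bar); timestep 1: far less (ratio 0.2).
rhos, cs = truncated_is_weights([0.5, 0.1], [0.25, 0.5])
print(rhos, cs)  # [1.0, 0.2] [1.0, 0.2]
```

With both caps at 1.0 the two weights coincide; they differ only when rho-bar and c-bar are set to different values.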
[link]
This paper's high-level goal is to evaluate how well GAN-type structures for generating text are performing, compared to more traditional maximum likelihood methods. In the process, it zooms in on the ways that the current set of metrics for comparing text generation fails to give a well-rounded picture of how models are performing. In the old paradigm of maximum likelihood estimation, models were both trained and evaluated on maximizing the likelihood of each word, given the prior words in a sequence. That is, models were good when they assigned high probability to true tokens, conditioned on past tokens. However, GANs work in a fundamentally different framework, in that they aren't trained to increase the likelihood of the next (ground truth) word in a sequence, but to generate a word that will make a discriminator more likely to see the sentence as realistic. Since GANs don't directly model the probability of token t given prior tokens, you can't evaluate them using this maximum likelihood framework. This paper surveys a range of prior work that has evaluated GANs and MLE models on two broad categories of metrics, occasionally showing GANs to perform better on one or the other, but not really giving a way to trade off between the two. The first type of metric, shorthanded as "quality", measures how aligned the generated text is with some reference corpus of text: to what extent your generated text seems to "come from the same distribution" as the original. BLEU, a heuristic frequently used in translation and also leveraged here, measures how frequently certain n-grams in the generated text also occur in the reference text. N typically goes up to 4, so in addition to comparing the distributions of single tokens in the reference and generated text, BLEU also compares shared bigrams, trigrams, and 4-grams, to measure more precise similarity of text.  
The second metric, shorthanded as "diversity", measures how different generated sentences are from one another. If you want to design a model to generate text, you presumably want it to be able to generate a diverse range of text: in probability terms, you want to fully sample from the distribution, rather than just taking the expected or mean value. Linguistically, a failure here would show up as a generator that just produces the same sentence over and over again. That sentence can be highly representative of the original text, but it lacks diversity. One metric used for this is the same kind of BLEU score, but calculated for each generated sentence against a corpus of previously generated sentences (sometimes called Self-BLEU), and, here, the goal is for the overlap to be as low as possible. The trouble with these two metrics is that, in their raw state, they're pretty incommensurable, and hard to trade off against one another. The authors of this paper try to address this by observing that all models trade off diversity and quality to some extent, just by modifying the entropy of the conditional token distribution they learn. If a distribution is high-entropy, that is, if it spreads probability out over more tokens, it's likelier to bounce off into a random place, which increases diversity, but can also make the sentence more incoherent. By contrast, if a distribution is too low-entropy, only ever putting probability on one or two words, it will only ever be capable of carving out a small number of distinct paths through word space. The table below shows a good example of what language generation can look like at high and low levels of entropy. https://i.imgur.com/YWGXDaJ.png The entropy of a softmax distribution can be modified, without changing the underlying model, by changing the *temperature* of the softmax calculation. So the authors do this, and, as a result, they can chart out each model's curve on the quality/diversity axes. 
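A quick sketch of that temperature knob, showing that it directly moves the entropy of the token distribution (illustrative code, not the paper's):

```python
import math

def softmax_t(logits, temperature=1.0):
    # Dividing logits by a temperature > 1 flattens the distribution
    # (more entropy, more diversity); < 1 sharpens it (less entropy).
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

logits = [2.0, 1.0, 0.1]
low_t, high_t = softmax_t(logits, 0.5), softmax_t(logits, 2.0)
print(entropy(low_t) < entropy(high_t))  # True
```

Sweeping the temperature over a range and recording quality/diversity at each setting traces out exactly the kind of curve the paper compares models on.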
Conceptually, this is asking "at a range of different quality thresholds, how good is this model's diversity," and vice versa. I mentally analogize this to an ROC curve, where it's not really possible to compare, say, the precision of models that use different thresholds, so you instead compare the curves over a range of thresholds. https://i.imgur.com/C3zdEjm.png When they do this for GANs and MLE models, they find that, while GANs might dominate on a single metric at a time, when you modulate the temperature of MLE models, those models are able to achieve superior quality when tuned to commensurate levels of diversity. 
[link]
GANs for images have made impressive progress in recent years, reaching ever-higher levels of subjective realism. It's also interesting to think about domains where the GAN architecture is less of a good fit. An example of one such domain is natural language. As opposed to images, which are made of continuous pixel values, sentences are fundamentally sequences of discrete values: that is, words. In a GAN, when the discriminator makes its assessment of the realness of an image, the gradient of that assessment can be backpropagated through to the pixel level. The discriminator can say "move that pixel just a bit, and this other pixel just a bit, and then I'll find the image more realistic". However, there is no smoothly flowing continuous space of words, and, even if you use continuous embeddings of words, it's still the case that if you tried to apply a small change to an embedding vector, you almost certainly wouldn't end up with another word; you'd just be somewhere in the middle of nowhere in word space. In short: the discrete nature of language sequences doesn't allow gradients to propagate backwards through to the generator. The authors of this paper propose a solution: instead of trying to treat their GAN as one big differentiable system, they frame the problem of "generate a sequence that will seem realistic to the discriminator" as a reinforcement learning problem. After all, this property (your reward just being generated *somewhere* in the environment, not something analytic, not something you can backprop through) is one of the key constraints of reinforcement learning. Here, the more real the discriminator finds your sequence, the higher the reward. One approach to RL, and the one this paper uses, is that of a policy network, where your parametrized network produces a distribution over actions. 
You can't update your model to deterministically increase reward, but you can shift probability around in your policy such that the expected reward of following that policy is higher. This key kernel of an idea (GANs for language, but using a policy network framework to get around not having a backprop-able loss/reward) gets you most of the way to understanding what these authors did, but it's still useful to mechanically walk through the specifics. https://i.imgur.com/CIFuGCG.png
- At each step, the "state" is the existing words in the sequence, and the agent's "action" is the choice of its next word.
- The Discriminator can only be applied to completed sequences, since it's difficult to determine whether an incoherent half-sentence is realistic language. So, when the agent is trying to calculate the reward of an action at a state, it uses Monte Carlo search: randomly "rolling out" many possible futures by sampling from the policy, and then taking the average Discriminator judgment of all the futures resulting from each action as that action's expected reward.
- The Generator is an LSTM that produces a softmax over words, which can be interpreted as a policy if it's sampled from randomly.
- One of the nice benefits of this approach is that it can work well for cases where we don't have a handcrafted quality-assessment metric, the way we have BLEU score for translation.
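The rollout-based reward estimate can be sketched as follows; `sample_next` and `discriminator` are hypothetical stand-ins for the learned generator policy and discriminator:

```python
import random

def rollout_reward(prefix, sample_next, discriminator, max_len, n_rollouts=16):
    # Monte Carlo reward estimate for a partial sequence: complete the prefix
    # n_rollouts times by sampling from the policy, then average the
    # discriminator's realism score over the completed sequences.
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < max_len:
            seq.append(sample_next(seq))
        total += discriminator(seq)
    return total / n_rollouts

# Toy example: a random "policy" over a 2-token vocabulary, and a toy
# discriminator that calls a sequence realistic if it is non-decreasing.
random.seed(0)
policy = lambda seq: random.choice([0, 1])
disc = lambda seq: 1.0 if seq == sorted(seq) else 0.0
print(0.0 <= rollout_reward([0], policy, disc, max_len=4) <= 1.0)  # True
```

The average over rollouts is what stands in for the per-action reward signal that the policy-gradient update then pushes probability toward.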
[link]
I should say from the outset: I have a lot of fondness for this paper. It goes upstream of a lot of research-community incentives: it's not methodologically flashy, and it's not about beating the state of the art with a bigger, better model (though those papers certainly also have their place). The goal of this paper was, instead, to dive into a test set used to evaluate the performance of models, and try to understand to what extent it's really providing a rigorous test of what we want out of model behavior. Test sets are the often-invisible foundation upon which ML research is based, but, like real-world foundations, if there are weaknesses, the research edifice built on top can suffer. Specifically, this paper discusses the Winograd Schema, a clever test set used to test what the NLP community calls "common sense reasoning". An example Winograd Schema sentence is: "The delivery truck zoomed by the school bus because it was going so fast." A model is given this sentence, and asked to predict which token the pronoun "it" refers to. These cases are specifically chosen for their syntactic ambiguity: nothing structural about the order of the sentence requires "it" to refer to the delivery truck here. However, the underlying meaning of the sentence is only coherent under that parsing. This is what is meant by "common-sense" reasoning: the ability to understand the meaning of a sentence in a way deeper than that allowed by simple syntactic parsing and word co-occurrence statistics. Taking the existing Winograd examples (and the set is tiny: there are literally 273 of them), the authors of this paper surface some concerns about ways these examples might not be as difficult, or as representative of "common sense" abilities, as we might like. First off, there is the basic fact that there are so few examples that it's possible to perform well simply by random chance, especially over combinatorially large hyperparameter optimization spaces. 
This isn't so much an indictment of the set itself as it is indicative of the work involved in creating it. The first of the two distinct problems the paper raises is "associativity". This refers to situations where simple co-occurrence counts between the description and the correct entity can lead the model to the correct term, without it actually having to parse the sentence. An example here is: "I'm sure that my map will show this building; it is very famous." Treasure maps aside, "famous buildings" are much more common than "famous maps", so associating "it" with the building in this case doesn't actually require the model to understand what's going on in this specific sentence. The authors test this by creating a threshold for co-occurrence, and, using that threshold, call about 40% of the examples "associative". The second problem is that of predictable structure: the fact that the "hinge" adjective is so often the last word in the sentence, making it possible that the model is brittle, and just attending to that rather than to the sentence as a whole. The authors perform a few tests: examining results on associative vs non-associative examples, and examining results when you switch the ordering (in cases like "Emma did not pass the ball to Janie although she saw that she was open," where it's syntactically possible), to ensure the model is not just anchoring on the identity of the correct entity, regardless of its place in the sentence. Overall, they found evidence that some state-of-the-art language models perform well on the Winograd Schema as a whole, but do less well (and in some cases even worse than the baselines they otherwise outperform) on these more rigorous subsets. Unfortunately, these tests don't lead us automatically to a better solution (designing examples like this is still tricky and hard to scale), but they do provide valuable caution and food for thought. 
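A toy version of that associativity test might look like the following; the counts, the smoothing, and the margin are all invented for illustration, and the paper's actual criterion differs in detail:

```python
def is_associative(candidates, attribute, cooccur, margin=2.0):
    # Flag a Winograd example as "associative" if the pivotal attribute word
    # co-occurs with one candidate antecedent far more often than the other,
    # so raw corpus statistics could resolve it without parsing the sentence.
    a, b = candidates
    ca = cooccur.get((a, attribute), 0) + 1  # add-one smoothing
    cb = cooccur.get((b, attribute), 0) + 1
    return max(ca, cb) / min(ca, cb) >= margin

# invented corpus counts for the "famous map/building" example
counts = {("building", "famous"): 120, ("map", "famous"): 10}
print(is_associative(("building", "map"), "famous", counts))  # True
```

Examples that trip a check like this are exactly the ones a purely statistical model could answer without any common-sense reasoning.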
[link]
For solving sequence modeling problems, recurrent architectures have historically been the most commonly used solution, but, recently, temporal convolution networks, especially with dilations to help capture longer-term dependencies, have gained prominence. RNNs have theoretically much larger capacity to learn long sequences, but also have a lot of difficulty propagating signal forward through long chains of recurrent operations. This paper, which proposes the Trellis Network architecture, places itself squarely in the middle of the debate between these two paradigms. TrellisNets are designed to be a theoretical bridge between temporal convolutions and RNNs: more specialized than the former, but more generalized than the latter. https://i.imgur.com/J2xHYPx.png The architecture of TrellisNets is very particular, and, unfortunately, somewhat hard to internalize without squinting at diagrams and equations for a while. Fundamentally:
- At each layer in a TrellisNet, the network creates a "candidate pre-activation" by combining information from the input and the layer below, for both the current and prior time step.
- This candidate pre-activation is then nonlinearly combined with the prior-layer, prior-timestep hidden state.
- This process continues for some desired number of layers.
https://i.imgur.com/f96QgT8.png At first glance, this structure seems pretty arbitrary: a lot of quantities connected together, but without a clear mechanic for what's happening. However, there are a few interesting things to note here, which help connect the dots and let us view TrellisNet as either a kind of RNN or a kind of CNN:
- TrellisNet uses the same weight matrices to process prior- and current-timestep inputs/hidden states, no matter which timestep or layer it's on. This is strongly reminiscent of a recurrent architecture, which uses the same calculation loop at each timestep.
- TrellisNets also re-insert the model's input at each layer.
This also gives it more of an RNN-like structure, where the prior layer's values act as a kind of "hidden state", which is then combined with an input value. Also, at a given layer, each timestep only needs access to two elements of the prior layer (in addition to the input); it does not require access to all the prior-timestep values of its own layer. This is important because it means that you can calculate an entire layer's values at once, given the values of the prior layer: these models can therefore be more easily parallelized for training. Seeing TrellisNets as a kind of temporal CNN is fairly straightforward: each timestep's value, at a given layer, is based on a "filter" of the lower-layer values at the current and prior timestep, and this filter is shared across the whole sequence. Framing them as an RNN is certainly trickier, and anyone wanting to understand it in full depth is probably best served by returning to the paper's equations. At a high level, the authors show that TrellisNets can represent a specific kind of RNN: a truncated RNN, where each timestep only uses history from the prior M time steps, rather than the full sequence. This works by sort of imagining the RNN chains as existing along the diagonals of a TrellisNet architecture diagram: as you reach higher layers, you can also reach farther back in time. Specifically, a TrellisNet that wants to represent a depth-K truncated RNN, which is allowed to unroll through M steps of history, can do so using M + K - 1 layers. Essentially, by using a fixed operation across layers and timesteps, the TrellisNet authors blur the line between layer and timestep: any chain of operations across layers is fundamentally a series of the same operation, performed many times, and is in that way RNN-like. 
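A heavily simplified sketch of one TrellisNet-style layer, just to make the "whole layer at once, same weights everywhere" structure concrete (the paper's actual activation is an LSTM-like gated cell, not a plain tanh, so treat this as an illustration of the wiring only):

```python
import numpy as np

def trellis_layer(z_below, x, W1, W2, Wx):
    # z_below: (T, d) hidden states from the layer below; x: (T, d) the input
    # sequence, re-injected at every layer. Each timestep mixes the lower
    # layer at t-1 and t plus the input, then combines nonlinearly with the
    # prior-layer, prior-timestep state. The whole layer is computed in one
    # shot, like a temporal convolution with a width-2 filter.
    T, d = z_below.shape
    z_shift = np.vstack([np.zeros((1, d)), z_below[:-1]])  # lower layer at t-1
    candidate = z_shift @ W1 + z_below @ W2 + x @ Wx       # candidate pre-activation
    return np.tanh(candidate + z_shift)

rng = np.random.default_rng(0)
T, d = 5, 3
x = rng.normal(size=(T, d))
weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
z = x
for _ in range(4):   # the same weights reused at every layer: the RNN-like part
    z = trellis_layer(z, x, *weights)
print(z.shape)  # (5, 3)
```

The stacked loop at the bottom is the point: depth and time are traversed by repeating one fixed operation, which is exactly the layer/timestep blurring described above.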
The authors have not yet taken a stab at translation, but they tested their model on a number of word- and character-level language modeling tasks (predicting the next word or character, given prior ones), and were able to beat SOTA on many of them. I'd be curious to see more work broadly in this domain, and to gain a better understanding of the areas in which a fixed, recurrently used layer operation, like the ones used in RNNs and this paper, is valuable, and the areas (like a "normal" CNN) where having specific weights for different levels of the hierarchy is valuable. 
[link]
This paper is, on the whole, a refreshing jaunt into the applied side of the research world. It isn't looking to solve a fundamental machine learning problem in some new way, but it does highlight and explore one potentially beneficial application of a common and widely used technique: specifically, combining word embeddings with context-free grammars (such as regular expressions) to make the latter less rigid. Regular expressions work by specifying hardcoded patterns of symbols, and matching against any strings in some search set that fit those patterns. They don't need to specify literal characters (they can work at higher levels of generality, like "uppercase alphabetic character" or "any character"), but they're still fundamentally hardcoded, in that the designer of the expression needs to create a specification that will affirmatively catch all the desired cases. This can be a particularly challenging task when you're trying to find, for example, all sentences that match the pattern of someone giving someone else a compliment. You might want to match against "I think you're smart" and also "I think you're clever". However, in the normal use of regular expressions, something like this would be nearly impossible to specify, short of writing out every synonym for "intelligent" that you can think of. The "Embedding Grammars" paper proposes a solution to this problem: instead of enumerating a list of synonyms, simply provide one example term, or, even better, a few examples, and use those terms' word embedding representations to define a "synonym bubble" (my phrase, not theirs) in continuous space around those examples. This is based on the oft-remarked-upon fact that, because word embedding systems are generally trained to push together words that can be used in similar contexts, closeness in word vector space frequently corresponds to words being synonyms, or close in some other sense. 
So, if you "match" any term that is sufficiently near your exemplar terms, you are performing something similar to the task of enumerating all of a term's synonyms. Once this general intuition is in hand, the details of the approach are fairly straightforward: the authors try a few approaches, and find that constructing a bubble of some epsilon around each example's word vector, and matching to anything inside that bubble, works best. https://i.imgur.com/j9OSNuE.png Overall, this seems like a clever idea; one imagines that word embeddings will keep branching out into ever more far-flung applications as time goes on. There are reasons to be skeptical of this paper, though. Fundamentally, word embedding space is a "here there be dragons" kind of place: we may be able to observe broad patterns, and might be able to say that "nearby words tend to be synonyms," but we can't give any kind of guarantee of that being the case. As an example of this problem, often the nearest things to a word, after its direct synonyms, are its direct antonyms, so if you set too generous a threshold, you'll potentially match to words exactly opposite to what you expect. We are probably still a ways away from systems like this one being broadly useful, for this and other reasons, but I do think it's valuable to try to understand what questions we'd want answered, and what features of embedding space we'd want elucidated, before applications like these would be more stably usable. 
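A minimal sketch of the bubble-matching idea, using toy 2-d vectors in place of trained embeddings and an assumed cosine-distance threshold:

```python
import numpy as np

def bubble_match(vec, exemplar_vecs, epsilon=0.4):
    # Match a token if its embedding falls within a cosine-distance "bubble"
    # of radius epsilon around any exemplar term. The vectors and threshold
    # below are toy stand-ins for trained embeddings.
    def cos_dist(u, v):
        return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return any(cos_dist(vec, e) <= epsilon for e in exemplar_vecs)

# Toy 2-d "embeddings": 'clever' sits near 'smart'; 'banana' does not.
smart = np.array([1.0, 0.1])
clever = np.array([0.9, 0.2])
banana = np.array([-0.5, 1.0])
print(bubble_match(clever, [smart]), bubble_match(banana, [smart]))  # True False
```

The epsilon here is exactly the knob the skepticism above is about: set it too wide and the bubble starts swallowing antonyms and unrelated neighbors.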
[link]
I admit it: the title of the paper pulled me in, existing as it does in the chain of weirdly insider-meme papers starting with Vaswani et al.'s 2017 "Attention Is All You Need". That paper has been hugely influential, and the domain of machine translation as a whole has begun to move away from processing (or encoding) source sentences with recurrent architectures, towards processing them with self-attention architectures. (Self-attention is a little too nuanced to go into in full depth here, but the basic idea is: instead of summarizing varying-length sequences by feeding each timestep into a recurrent loop and building up hidden states, generate a query, and weight the contribution of each timestep to each "hidden state" based on the dot product between that query and each timestep's representation.) There has been an overall move in recent years away from recurrence as the accepted default for sequence data, with attention and (often dilated) convolution taking up more space. I find this an interesting set of developments, and had hoped that this paper would address that arc. Unfortunately, the title is quite out of sync with the actual focus of the paper: instead of addressing the contribution of attention mechanisms vs recurrence, or even directly addressing any of the particular ideas posed in the "Attention Is All You Need" paper, this YMNNA paper instead takes aim at a more fundamental structural feature of translation models: the encoder/decoder structure. The basic idea of an encoder/decoder approach, in a translation paradigm, is that you process the entire source sentence before you start generating the tokens of the predicted, other-language target sentence. Initially, this worked by running an RNN over the full sentence, and using the final hidden state of that RNN as a compressed representation of the full sentence. 
More recently, the norm has been to use multiple layers of RNN, and to represent the source sentence via the hidden states at each timestep (so: as many hidden states as you have input tokens), and then at each step in the decoding process, calculate an attentionweighted average over all of those hidden states. But, fundamentally, both of these structures share the fact that some kind of global representation is calculated and made available to the decoder before it starts predicting words in the output sentence. This makes sense for a few reasons. First, and most obviously, languages aren’t naturally aligned with one another, in the sense of one word in language X corresponding to one word in language Y. It’s not possible for you to predict a word in the target sentence if its corresponding source sentence token has not yet been processed. For another, there can be contextual information from the sentence as a whole that can disambiguate between different senses of a word, which may have different translations  think Teddy Bear vs Teddy Roosevelt. However, this paper poses the question: how well can you do if you throw away this structure, and build a model that continually emits tokens of the target sequence as it reads in the source sentence? Using a recurrent model, the YMNNA model takes, at each timestep, the new source token, the previous target token, and the prior hidden state from the last time step of the RNN, and uses that to predict a token. However, that problem mentioned earlier  of languages not natively being aligned such that you have the necessary information to predict a word by the time you get to its point in the target sequence  hasn’t gone away, and is still alive and kicking. 
This paper solves it in a pretty unsatisfying way: by relying on an external tool, fast_align, that does the work of guessing which source tokens correspond to which target tokens, and inserting buffer tokens into the target, so that you don't need to predict a word until it's already been seen by the source-reading RNN; until then, you just predict the buffer token. This is fine and clever as a practical heuristic, but it really does make their comparisons against models that do alignment and translation jointly feel a little weak. https://i.imgur.com/Gitpxi7.png An additional heuristic that makes the overall narrative of the paper less compelling is the fact that, in order to get comparable performance to their baselines, they padded the target sequences with between 3 and 5 buffer tokens, meaning that the models learned that they could process the first 3-5 tokens of the sentence before they needed to start emitting the target. Again, there's nothing necessarily wrong with this, but, since the model consumes a portion of the sentence before it starts emitting translations, it does make for a less stark comparison with the "read the whole sentence" encoder/decoder framework. A few other frustrations, and notes from the paper's results section: as earlier mentioned, the authors don't actually compare their work against the "Attention Is All You Need" paper, but instead to a 2014 paper. 
This is confusing both in terms of using an old baseline rather than the current SOTA, and in terms of their title implicitly arguing that they are refuting a paper they didn't compare to. Compared against that old baseline, their eager translation model performs worse on all sentences of fewer than 60 tokens (which make up the vast majority of all sentences), and only beats the baseline on sentences of more than 60 tokens. Additionally, they note, as a sort of throwaway line, that their model took almost three times as long to train as the baseline, with the same number of parameters, simply because it took so much longer to converge. Being charitable, there seems to be some argument that an eager translation framework performs well on long sentences, and can do so while only keeping a single hidden state in memory, rather than having to keep the hidden states of each source sequence element around, as attention-based decoders require. However, overall, I found this paper to be a frustrating letdown that used too many heuristics and hacks to be a compelling comparison to prior work.
[link]
The last two years have seen a number of improvements in the field of language model pretraining, and BERT (Bidirectional Encoder Representations from Transformers) is the most recent entry into this canon. The general problem posed by language model pretraining is: can we leverage huge amounts of raw text, which aren't labeled for any specific classification task, to help us train better models for supervised language tasks (like translation, question answering, logical entailment, etc.)? Mechanically, this works either by 1) training word embeddings and then using those embeddings as input feature representations for supervised models, or 2) treating the problem as a transfer learning problem and fine-tuning on a supervised task, similar to how you'd fine-tune a model trained on ImageNet by carrying over parameters and then training on your new task. Even though the text we're learning on is, strictly speaking, unsupervised (lacking a supervised label), we need to design a task on which we calculate gradients in order to train our representations. For unsupervised language modeling, that task is typically structured as predicting a word in a sequence given the prior words in that sequence. Intuitively, in order for a model to do a good job of predicting the word that comes next in a sentence, it needs to have learned patterns about language, on both grammatical and semantic levels. A notable recent change has been the shift from learning static word vectors (where a word's representation is the same globally) to contextualized ones, where the representation of the word depends on the sentence context it's found in. All the baselines discussed here are of this second type. The two main baselines that the BERT model compares itself to are OpenAI's GPT, and Peters et al.'s ELMo. 
The GPT model uses a self-attention-based Transformer architecture, going through each word in the sequence and predicting the next word by calculating an attention-weighted representation of all prior words. (For those who aren't familiar, attention works by multiplying a "query" vector with every word in a variable-length sequence, and then putting the outputs of those multiplications into a softmax operator, which inherently gives you a weighting scheme that sums to one.) ELMo uses models that gather context in both directions, but in a fairly simple way: it learns one deep LSTM that goes from left to right, predicting word k using words 0 through k-1, and a second LSTM that goes from right to left, predicting word k using words k+1 onward. These two representations are combined (concatenated) to get a representation for the word at position k. https://i.imgur.com/2329e3L.png BERT differs from prior work in this area in several small ways, but one primary one: instead of representing a word using only information from the words before it, or a simple combination of prior and subsequent information, it uses the full context from before and after the word in each of its multiple layers. It also uses an attention-based Transformer structure, but instead of incorporating just prior context, it pulls in information from the full sentence. To allow for a model that actually uses both directions of context at a time in its unsupervised prediction task, the authors of BERT slightly changed the nature of that task: they replace the word being predicted with a "mask" token, so that even with multiple layers of context aggregation on both sides, the model doesn't have any way of knowing what the token is. 
By contrast, if it weren't masked, then after the first layer of context aggregation, the representations of the other words in the sequence would incorporate information about the predicted word k, making it trivial, if another layer were applied on top of that first one, for the model to directly access the value it's trying to predict. This problem can be solved by using multiple layers, each of which can only see prior context (like GPT), by learning fully separate left-to-right and right-to-left models and combining them at the final layer (like ELMo), or by masking tokens and predicting the values of the masked tokens using the full remainder of the context. This task design crucially allows for a multi-layered bidirectional architecture, and consequently a much richer representation of context in each word's pretrained representation. BERT demonstrates dramatic improvements over prior work when fine-tuned on small amounts of supervised data, suggesting that this change added substantial value. 
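A bare-bones sketch of the masking step (the real BERT also sometimes keeps or randomly replaces a chosen token rather than always masking it, which I omit here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    # Hide a fraction of tokens and record their positions and true values
    # as prediction targets for the masked-LM objective.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok   # must be predicted from context on *both* sides
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split(), 0.3, seed=0)
print(masked)
```

Because the model never sees the true token at a masked position, it's free to aggregate context from the left and right through every layer without leaking the answer.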
[link]
This recent paper, a collaboration involving some of the authors of MAML, proposes an intriguing application of techniques developed in the field of meta-learning to the problem of unsupervised learning - specifically, the problem of developing representations without labeled data, which can then be used to learn quickly from a small amount of labeled data. As a reminder, the idea behind meta-learning is that you train models on multiple different tasks, using only a small amount of data from each task, and update the model based on its test-set performance across those tasks. The conceptual advance proposed by this paper is to adopt the broad strokes of the meta-learning framework, but apply it to unsupervised data, i.e. data with no predefined supervised tasks. The goal of such a project is, as so often is the case with unsupervised learning, to learn representations - specifically, representations we believe might be useful over a whole distribution of supervised tasks. However, to apply traditional meta-learning techniques, we need that aforementioned distribution of tasks, and we’ve defined our problem as being over unsupervised data. How exactly are we supposed to construct the former out of the latter? This may seem a little circular, or strange, or definitionally impossible: how can we generate supervised tasks without supervised labels? https://i.imgur.com/YaU1y1k.png The artificial tasks created by this paper are rooted in mechanically straightforward operations, but conceptually interesting ones all the same: it uses an off-the-shelf unsupervised learning algorithm to generate a fixed-width vector embedding of your input data (say, images), then generates multiple different clusterings of the embedded data, and then uses those cluster IDs as labels in a faux-supervised task. 
It manages to get multiple different tasks, rather than just one - remember, the premise of meta-learning is models learned over multiple tasks - by randomly up- and down-scaling dimensions of the embedding before clustering is applied. Different scalings of the dimensions mean different points end up close to one another, which means a different partition of the dataset into clusters. With this distribution of “supervised” tasks in hand, the paper simply applies previously proposed meta-learning techniques - like MAML, which learns a model that can be quickly fine-tuned on a new task, or prototypical networks, which learn an embedding space in which observations from the same class, across many possible class definitions, are close to one another. https://i.imgur.com/BRcg6n7.png An interesting note from the evaluation is that this method - which is somewhat amusingly dubbed “CACTUs” - performs best relative to alternative baselines in cases where the true underlying class distribution on which the model is meta-trained is the most different from the underlying class distribution on which the model is tested. Intuitively, this makes reasonable sense: meta-learning is designed to trade off knowledge of any given specific task against the flexibility to be performant on a new class division, and so it gets the most value from that tradeoff when a genuinely dissimilar class split is seen during testing. One other quick thing I’d like to note is the set of implicit assumptions this model builds on, in the way it creates its unsupervised tasks. First, it leverages the smoothness assumptions of classes - that is, it assumes that the kinds of classes we might want our model to eventually perform on consist of points that are close together in some idealized conceptual space. While not a perfect assumption (there’s a reason we don’t use KNN over embeddings for all of our ML tasks), it does have a general reasonableness behind it, since rarely are the kinds of classes we care about very conceptually heterogeneous. 
Second, it assumes that a truly unsupervised learning method can learn a representation that, despite being itself suboptimal as a basis for supervised tasks, is a well-enough-designed feature space for the general heuristic of “nearby things are likely of the same class” to at least approximately hold. I find this set of assumptions interesting because they are so simplifying that it’s a bit of a surprise that they actually work: even if the “classes” we meta-train our model on are defined with simple Euclidean rules, optimizing to be able to perform that separation using little data does indeed seem to transfer to the general problem of “separating real-world, messier-in-embedding-space classes using little data”. 
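The task-construction pipeline - embed, randomly rescale dimensions, cluster, use cluster IDs as labels - can be sketched roughly as follows (my own minimal version, with a tiny hand-rolled k-means standing in for whatever clustering method the paper uses; function and parameter names are made up):

```python
import numpy as np

def make_unsupervised_tasks(embeddings, n_tasks=4, n_clusters=3,
                            n_iters=20, rng=None):
    """Generate faux-supervised tasks from precomputed unsupervised
    embeddings by clustering under random per-dimension rescalings."""
    rng = np.random.default_rng(rng)
    tasks = []
    for _ in range(n_tasks):
        # Random up/down-scaling of dimensions changes which points are
        # close together, giving a different partition per task.
        scales = rng.uniform(0.1, 10.0, size=embeddings.shape[1])
        X = embeddings * scales
        # Minimal k-means (Lloyd's algorithm) on the rescaled embedding.
        centers = X[rng.choice(len(X), n_clusters, replace=False)]
        for _ in range(n_iters):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            for k in range(n_clusters):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(axis=0)
        tasks.append(labels)  # cluster IDs serve as this task's labels
    return tasks
```

Each returned label array defines one "supervised" classification task, and the collection of tasks is what a meta-learner like MAML would then train across.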
[link]
This paper argues for the use of normalizing flows - a way of building up new probability distributions by applying multiple sets of invertible transformations to existing distributions - as a way of building more flexible variational inference models. The central premise of a variational autoencoder is that of learning an approximation to the posterior distribution of latent variables - p(z|x) - and parameterizing that distribution according to values produced by a neural network. In typical practice, this has meant that VAEs are limited in terms of the complexity of latent variable distributions they can encode, since using an analytically specified distribution tends to limit you to simpler distributional shapes - Gaussians, uniforms, and the like. Normalizing flows are here proposed as a way to allow the model to learn more complex forms of posterior distribution. Normalizing flows work off of a fairly simple intuition: if you take samples from a distribution p(x), and then apply a function f(x) to each x in that sample, you can calculate the expected value of the transformed quantity f(x) by taking its expectation under the old distribution p(x). That is to say: https://i.imgur.com/NStm7zN.png This mathematical transformation has a pretty delightful name - the Law of the Unconscious Statistician - that came from the fact that so many statisticians just treated this identity as a definitional fact, rather than something actually in need of proving (I very much fall into this bucket as well). The implication of this is that if you apply many transformations in sequence to the draws from some simple distribution, you can work with that distribution without explicitly knowing its analytical formulation, just by being able to evaluate - and, importantly, invert - the function. 
The ability to invert the function is key, because of the way you calculate the new density: you multiply the original density by the inverse of the determinant of the derivative of your function f(z) with respect to z. (Note here that q(z) is the original distribution you sampled under, and q’(z) is the implicit density you’re trying to estimate, after your function has been applied.) https://i.imgur.com/8LmA0rc.png Combining these ideas together: a variational flow autoencoder works by having an encoder network define the parameters of a simple distribution (Gaussian or uniform), and then running the samples from that distribution through a series of k transformation layers. This final transformed density over z is then given to the decoder to work with. Some important limitations are in place here, the most salient of which is that in order to calculate the density update, you have to be able to calculate the determinant of the derivative of a given transformation. Due to this constraint, the paper only tests a few transformations where this is easy to calculate analytically - the planar transformation and the radial transformation. If you think about transformations of density functions as fundamentally stretching or compressing regions of density, the planar transformation works by stretching along an axis perpendicular to some parametrically defined plane, and the radial transformation works by stretching outward in a radial way around some parametrically defined point. Even though these transformations are individually fairly simple, when combined, they can give you a lot more flexibility in distributional space than a simple Gaussian or uniform could. https://i.imgur.com/Xf8HgHl.png 
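As a sketch of one such transformation, here is the planar flow f(z) = z + u·tanh(w·z + b) together with its log-determinant term (shapes and naming are my assumptions, not the paper's code):

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b) to a batch of samples z (n, d),
    returning the transformed samples and the per-sample log |det Jacobian|,
    which the density update log q'(f(z)) = log q(z) - log|det df/dz| needs."""
    a = z @ w + b                             # (n,) pre-activations
    f_z = z + np.outer(np.tanh(a), u)         # transformed samples
    psi = np.outer(1.0 - np.tanh(a) ** 2, w)  # tanh'(a) * w, shape (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))   # det(I + u psi^T) = 1 + u.psi
    return f_z, log_det
```

Stacking k such layers and summing the log-determinants gives the log-density of the final transformed samples, which is exactly what makes the transformed posterior tractable.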
[link]
This paper draws from two strains of recent work: the hierarchical music modeling of MusicVAE - which intentionally models musical structure at both local and more global levels - and the discrete autoencoder approaches of Vector Quantized VAEs - which seek to maintain the overall structure of a VAE, but apply a less aggressive form of regularization. The goal of this paper is to build a model that can generate music, not from that music’s symbolic representation - lists of notes - but from actual waveform audio. This is a more difficult task, because the model now has to learn mappings between waveforms and symbolic notes, but it confers the advantage of being able to model expressive dimensions of music that are difficult to capture in a pure symbolic representation. Models of pure waveform data have been used before - Wavenet is a central example - but typically they are learned alongside some kind of text conditioning structure, which is to say, you tell the model to say “Hello there, world” and the model is only responsible for building local mappings between those phonemes and waveforms, not actually modeling coherent words to follow after “Hello”. To try to address this problem, the authors of the paper propose learning an autoencoded representation over the full music sample, to try to capture global structure. Each predicted value of the global structure sequence then represents some number of timesteps of the generated sequence: say, 20. The idea here is: learn a global model that produces 1/N as many sequence points (1/20, in this case), whose job is ensuring long-term consistency. Then, the authors also suggest the use of a lower-level decoder model that uses the conditioning information from the autoencoder and, in a similar fashion to a text-to-speech Wavenet, captures a high-fidelity mapping between that conditioning and the output waveform. This overall structure has a lot in common with the recently released MusicVAE paper. 
The most salient architectural change proposed by this paper is the use of Argmax VAEs, rather than VQ-VAEs. Overall, the reason for training discrete autoencoders is to have a more easily adjustable way of regularizing the bottlenecked representation, to avoid the fact that for some challenging problems, excessively strong VAE regularization can lead to that high-level representational space just not being used. To understand the difference, it’s worth understanding that VQ-VAEs work by generating a continuous encoding vector (the same as a typical VAE), but then instead of passing that continuous vector itself directly on to the decoder, the VQ-VAE fits what is basically a K-means operation: it maps the continuous vector to one of its “prototypical” or “codebook” vectors based on closeness in Euclidean distance (these codebook vectors are learned in a separate training loop, in a K-means-style algorithm). The Argmax VAE is similar, but instead of needing to take that alternating step of learning the codebook vectors via K-means, it performs a much simpler quantization operation: just taking the argmax of indices across the continuous vector, so that the output is the one-hot vector closest to the continuous input. While this reduces the capacity of the model, it also limits the problem of “codebook collapse”, which is a failure mode that can happen during the K-means iteration (I’m actually not entirely clear on the prototypical example of codebook collapse, or exactly why it happens). https://i.imgur.com/H5YqSZG.png Combining these ideas together: this paper’s model works by learning an Argmax VAE over a longer and coarser timeframe of the music, and then learning a local, high-resolution decoder - similar to Wavenet - over the smaller time scales, conditioned on the output of the Argmax VAE making high-level decisions. 
This combination balances the needs of coherent musical structure and local fidelity, and allows for different weightings of those tradeoffs in a fairly flexible way, by changing the frequency at which you produce Argmax VAE conditioning output. 
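The argmax quantization at the heart of this bottleneck is a very small operation; a minimal sketch (omitting the straight-through gradient estimator that training a bottleneck like this typically requires):

```python
import numpy as np

def argmax_quantize(z):
    """Map each continuous encoding vector (row of z) to the one-hot
    vector selected by its largest component."""
    one_hot = np.zeros_like(z)
    one_hot[np.arange(len(z)), z.argmax(axis=1)] = 1.0
    return one_hot
```

Unlike VQ quantization, no codebook is learned: the "codebook" is implicitly the fixed set of one-hot basis vectors, which is what removes the alternating K-means step.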
[link]
A central question of this paper is: under what circumstances will agents that have been trained to optimize their own reward implement strategies - like tit-for-tat - that are more sophisticated, and higher in overall reward, than each agent simply pursuing its dominant strategy? The games under consideration here are “general sum” games like the Iterated Prisoner’s Dilemma, where each agent’s dominant strategy is to defect, but with some amount of coordination or reciprocity, better overall outcomes are possible. Previously, models have achieved this via explicit hardcoding, but this paper strove to use a simpler, more general approach: allowing each agent A to optimize its reward not only with regard to a fixed opponent, but with regard to an opponent that will make a predictable update move in response to the action A is about to take. Specifically, this model - shorthanded as LOLA, Learning with Opponent-Learning Awareness - maximizes a given agent’s expected discounted reward, but looks at reward *conditional on* the ways the opponent will update in response to a given action. In a simplified world where the explicit reward function is known, it’s possible to literally take the derivative through the opponent’s expected update step, taking into account the ways your expected reward is changed by the response you expect of your opponent. Outside of this simplified framework, in the world of policy gradients, there’s no analytic loss function; you can no longer directly differentiate your reward function with respect to your opponent’s actions, but you can differentiate your expected-reward estimator with respect to them. This concept is quite similar to a 2016 paper by Metz et al., which used this idea to train a more effective GAN, by allowing each network in the adversarial pair to “look ahead” to its opponent’s expected response as a way to avoid getting stuck in repetitive action/response cycles. 
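To make the differentiate-through-the-opponent's-update idea concrete, here is a toy version in a hypothetical bilinear game (not one of the paper's environments; the closed-form gradients are specific to this made-up game):

```python
def naive_grad(theta1, theta2):
    """d R1 / d theta1 with the opponent held fixed, for the toy game
    R1 = theta1 * theta2, R2 = -theta1 * theta2."""
    return theta2

def lola_grad(theta1, theta2, eta=0.1):
    """Differentiate R1 through the opponent's anticipated gradient step.
    Opponent's step: theta2 -> theta2 + eta * dR2/dtheta2 = theta2 - eta*theta1,
    so R1' = theta1 * (theta2 - eta * theta1), and
    dR1'/dtheta1 = theta2 - 2 * eta * theta1."""
    return theta2 - 2.0 * eta * theta1
```

The extra -2*eta*theta1 term is the "opponent-learning awareness": agent 1's update now accounts for how its own parameter shapes the response it will face.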
In circumstances where the parameters of the opponent are not known - obviously closer to realistic for an adversarial scenario - the paper demonstrates proof-of-concept ability to model an opponent’s strategy based on their past actions, and use that to conduct response-step estimates. https://i.imgur.com/5xddJRj.png It should of course be said in all this: even though this setup did produce results closer to what we would expect in rational reciprocity, it’s still very simplified. In most of the experiments, each agent had perfect knowledge of the opponent’s priorities and likely responses; in most game theory scenarios, constructing a model of your opponent is a nontrivial part of the difficulty. Nonetheless, I found it an interesting proof of concept. 
[link]
The overall goal of the paper is to measure how similar different layers’ activation profiles are to one another, in hopes of being able to quantify the similarity of the representations that different layers are learning. If you had a measure that captured this, you could ask questions like: “how similar are the representations learned by different networks on the same task?” and “what is the dynamic of representational change in a given layer throughout training?” Canonical Correlation Analysis (CCA) is one way of approaching this question, and the way taken by this paper. The premise of CCA is that you have two multidimensional variable sets, where each set is made up of vectors representing dimensions within that variable set. Concretely, in this paper, the sets under examination are the activation profiles of two layers (either the same layer at different points in training, or different layers in the same network, or layers in different networks). An activation profile is thought of in terms of multiple vectors, where each vector represents a given neuron’s activation values, evaluated over some observation set X. Importantly, for the two layers that you’re comparing, the set of observations X needs to be of the same length, but the layers can have different numbers of neurons (and, consequently, different numbers of vectors making up that layer’s multivariate set). Given this setup, the goal of CCA is to find vectors that are linear combinations of the basis vectors of each set, chosen to satisfy some constraint. In that broad sense, this is similar to the project of PCA, which also constructs linear-combination principal components to better represent the underlying data space. However, in PCA, the constraints that define these combinations are based on one multidimensional feature space, not two. In CCA, instead of generating k principal components, you generate k *pairs* of canonical correlates. 
Each canonical correlate pair, (U1, V1), is a linear combination of the activation vectors of sets L1 and L2 respectively, and is chosen with the goal of minimizing the angle (cosine) distance between the correlates in each pair. If you think about L1 and L2 as each having only two activations (that is: if you think about them as being two-dimensional spaces), then the goal of CCA is to find the cosine distance between the planes defined by the two activation spaces. An important intuition here is that, in this framing, vector sets that are just linear transformations of one another (scalings, rotations, swaps in the arbitrary order of activations) will look the same, which wouldn’t be the case if you just looked at raw correlations between the individual activations. This is connected to the linear algebra idea that, if you have two vectors, and a third that is just a linear combination of the first two, the span of those vectors is still just that two-dimensional space. This property is important for the analysis of neural network representations because it means the measure will be able to capture similarities between representational spaces that have fundamental geometric similarities, even if they’re different on a more surface level. In prior papers, CCA had been used by calculating the CCA vectors between varying sets of layers, and then taking the mean correlation value over all of the pairs of correlates. This paper argues against that approach, on the theory that network layers are probably not using the full representational capacity of their activation dimensions (think, as analogy: a matrix with three columns that only actually spans two), and so including very low-order correlations in your average mostly adds uninformative noise to your similarity measure. 
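The core CCA computation can be sketched by whitening each activation matrix and taking the SVD of the product of the whitened bases (one standard numerical route; this is my own minimal version, not the paper's implementation):

```python
import numpy as np

def cca_correlations(L1, L2, eps=1e-8):
    """Canonical correlations between two activation matrices.

    L1: (n_observations, n_neurons_1), L2: (n_observations, n_neurons_2);
    the observation sets must match, but the neuron counts need not."""
    # Center each neuron's activations over the observation set
    X = L1 - L1.mean(axis=0)
    Y = L2 - L2.mean(axis=0)
    # Whiten each set via its SVD; the U factors are orthonormal bases
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    # Drop near-zero directions so the whitening is well conditioned
    Ux = Ux[:, Sx > eps * Sx.max()]
    Uy = Uy[:, Sy > eps * Sy.max()]
    # Singular values of the product of the bases are the correlations
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)
```

Because whitening discards scalings and rotations, two activation matrices that are invertible linear transforms of one another come out with all correlations equal to one, which is exactly the invariance property described above.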
Instead, this paper weights the correlation coefficients according to the magnitudes of the correlate vectors in each pair; as best I can tell, this is roughly analogous to weighting according to eigenvalues in a PCA setting. Using this weighted-average similarity measure, the authors do some really interesting investigations into learning dynamics. These include:

* Comparing the intermediate-layer representations learned by networks that achieve low train error via memorization vs. via actually-generalizing solutions, showing that, during training, the intermediate representations of generalizing networks are more similar to one another than those of memorizing networks are. Intuitively, this aligns with the idea that there are many ways to noisily memorize, but a more constrained number of ways to actually learn meaningful information about a dataset. A super interesting implication of this is the idea that representational similarity *on the training set*, across multiple bootstrapped or randomized trainings, could be used as a proxy for test-set performance, which could be particularly valuable in contexts where test data is limited. https://i.imgur.com/JwyHFmN.png
* Across networks, lower layers tend to be more similar to one another than layers closer to the output; said another way, the very simple features (e.g. edge detectors) tend to be quite similar across networks, but the higher-level representations are more divergent and influenceable by smaller quirks of the training set.
* Within a given dataset, you can cluster learned internal representations across many training runs and recover groups trained with the same learning rate, even though the final-layer softmax is inherently similar across models that achieve the same training error. This implies that metrics like this can give us some idea of the different minima that the optimization algorithm finds, as a function of different learning rates. 
Overall, I found this paper a great example of a straightforward idea used to clearly answer important and interesting questions, which is always refreshing amidst a sea of “tiny hack for an extra 0.05 accuracy”. 