ShortScience.org Latest Summaries
http://www.shortscience.org/
Tue, 04 Aug 2020 11:31:01 +0000

# Recurrent World Models Facilitate Policy Evolution (arXiv:1809.01999)
Summary by Paul Barde

## General Framework
The take-home message is that the challenge of Reinforcement Learning in environments with high-dimensional and partial observations is learning a good representation of the environment. This means learning a sensory feature extractor V to deal with the high-dimensional observations (pixels, for example), but also a temporal representation M of the environment dynamics to deal with the partial observability. If provided with such representations, learning a contr...
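The V/M/C decomposition described above can be sketched in a few lines; everything below (sizes, random linear maps standing in for the learned components) is illustrative, not the paper's code:

```python
# Hypothetical sketch of the V/M/C split (not the paper's code): V compresses
# a high-dimensional observation, M predicts the next latent state, and a
# small controller C acts on the combined [latent, memory] features.
import numpy as np

rng = np.random.default_rng(0)

# V: a fixed random linear "encoder" standing in for the learned model.
W_v = rng.normal(size=(8, 64))          # 64-dim "pixels" -> 8-dim latent z
def V(obs):
    return np.tanh(W_v @ obs)

# M: a linear latent dynamics model standing in for the learned temporal model.
W_m = rng.normal(size=(8, 8)) * 0.1     # predicts z_{t+1} from z_t
def M(z):
    return W_m @ z

# C: a tiny linear controller over [z, h] features; in the paper only C's few
# parameters need to be optimized once V and M are trained.
W_c = rng.normal(size=(2, 16)) * 0.1    # 2 actions from 16 features
def C(z, h):
    return W_c @ np.concatenate([z, h])

obs = rng.normal(size=64)               # one fake high-dimensional observation
z = V(obs)
h = M(z)                                # "memory": predicted next latent state
action = C(z, h)
```

In the paper, V is a VAE, M is a mixture-density RNN, and C is a linear controller small enough to train with evolution strategies; the sketch only shows how the three pieces compose.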
http://www.shortscience.org/paper?bibtexKey=journals/corr/1809.01999#muntermulehitch
Mon, 27 Jul 2020 13:05:14 +0000

# Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations (arXiv:1907.03976)
Summary by Paul Barde

## General Framework
Extends T-REX (see [summary]()) so that preferences (rankings) over demonstrations are generated automatically (back to the common IL/IRL setting where we only have access to a set of unlabeled demonstrations). Also derives some theoretical requirements and guarantees for better-than-demonstrator performance.
## Motivations
* Preferences over demonstrations may be difficult to obtain in practice.
* There is no theoretical understanding of the requirements that lead to out...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1907.03976#muntermulehitch
Mon, 27 Jul 2020 02:22:27 +0000

# Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations (arXiv:1904.06387)
Summary by Paul Barde

## General Framework
Only access to a finite set of **ranked demonstrations**. The demonstrations only contain **observations** and **do not need to be optimal**, but must be (approximately) ranked from worst to best.
The **reward learning part is off-line** but not the policy learning part (requires interactions with the environment).
In a nutshell: learns a reward model that looks at observations. The reward model is trained to predict whether one demonstration's ranking is greater than another on...
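The ranking objective can be sketched as a Bradley-Terry-style loss on summed predicted rewards; the linear reward model, feature vectors, and demonstration generator below are all made up for illustration:

```python
# Hedged sketch of the ranking idea: train a reward model r(obs) so that the
# summed reward of a higher-ranked demonstration beats that of a lower-ranked
# one, via a Bradley-Terry / cross-entropy loss. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(4)                                   # linear reward r(o) = w . o

def demo_return(demo, w):
    return sum(w @ o for o in demo)

# Fake ranked demos: "better" demos have a larger first feature on average.
def make_demo(quality):
    return [np.array([quality, 1.0, 0, 0]) + 0.1 * rng.normal(size=4)
            for _ in range(10)]

pairs = [(make_demo(0.2), make_demo(0.8)) for _ in range(50)]  # (worse, better)

lr = 0.05
for _ in range(200):
    grad = np.zeros_like(w)
    for worse, better in pairs:
        # P(better preferred over worse) = sigmoid(R_better - R_worse)
        diff = demo_return(better, w) - demo_return(worse, w)
        p = 1.0 / (1.0 + np.exp(-diff))
        feat_diff = sum(better) - sum(worse)      # grad of the return difference
        grad += (1 - p) * feat_diff               # ascent on the log-likelihood
    w += lr * grad / len(pairs)
```

After training, the learned reward scores "good-looking" observations above "bad-looking" ones, which is exactly the signal later used for policy learning.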
http://www.shortscience.org/paper?bibtexKey=journals/corr/1904.06387#muntermulehitch
Mon, 27 Jul 2020 02:18:47 +0000

# Planning for Autonomous Cars that Leverage Effects on Human Actions (10.15607/rss.2016.xii.029)
Summary by Paul Barde

## General Framework
*wording: car = the autonomous car, driver = the other car it is interacting with*
Builds a model of an **autonomous car's influence over the behavior of an interacting driver** (human or simulated) that the autonomous car can leverage to plan more efficiently. The driver is modeled by the policy that maximizes his defined objective. In brief, a **linear reward function is learned off-line with IRL on human demonstrations** and the modeled policy takes the actions that max...
http://www.shortscience.org/paper?bibtexKey=10.15607/rss.2016.xii.029#muntermulehitch
Mon, 27 Jul 2020 02:14:17 +0000

# Reinforcement and Imitation Learning via Interactive No-Regret Learning (arXiv:1406.5979)
Summary by Paul Barde

## General Framework
Really **similar to DAgger** (see [summary]()) but considers **cost-sensitive classification** ("some mistakes are worse than others": you should be more careful in imitating a particular expert action if failing to do so incurs a large cost-to-go). By doing so they improve on DAgger's bound of $\epsilon_{class}uT$, where $u$ is the difference in cost-to-go (between the expert and one error followed by the expert policy), to $\epsilon_{class}T$ where $\epsilon_{cla...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1406.5979#muntermulehitch
Mon, 27 Jul 2020 02:08:30 +0000

# A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (arXiv:1011.0686)
Summary by Paul Barde

## General Framework
The imitation learning problem is here cast as a classification problem: label the state with the corresponding expert action. With this view, structured prediction (predict the next label knowing your previous prediction) is a degenerate IL problem. They make the **reduction assumption** that you can make the probability of mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in the cost-to-g...
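For intuition, a DAgger-style interaction loop looks roughly like this (a toy 1-D task with a one-parameter threshold "classifier"; the mixing of learner and expert policies used in the paper is omitted):

```python
# Minimal DAgger-style loop (illustrative, not the paper's setup): roll out the
# CURRENT policy, query the expert on the states actually visited, aggregate the
# data, and retrain the classifier on everything seen so far.
import random

random.seed(0)

def expert(s):
    return int(s >= 0.5)                  # expert action: a binary label of the state

dataset = []                              # aggregated (state, expert_action) pairs
threshold = 0.0                           # our 1-parameter "classifier"

def policy(s):
    return int(s >= threshold)

for it in range(5):
    # Roll out the current policy; states here are just random floats in [0, 1).
    visited = [random.random() for _ in range(100)]
    # Query the expert on the states the learner visits (the key DAgger step).
    dataset += [(s, expert(s)) for s in visited]
    # "Retrain": pick the threshold minimizing mistakes on the aggregated set.
    candidates = sorted(s for s, _ in dataset)
    threshold = min(candidates,
                    key=lambda t: sum(int(s >= t) != a for s, a in dataset))

mistakes = sum(policy(s) != expert(s) for s in [i / 1000 for i in range(1000)])
```

Because training happens on the distribution of states the learner itself induces, errors do not compound the way they do under plain behavioral cloning.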
http://www.shortscience.org/paper?bibtexKey=journals/corr/1011.0686#muntermulehitch
Mon, 27 Jul 2020 01:53:35 +0000

# Understanding deep learning requires rethinking generalization (arXiv:1611.03530)
Summary by ANIRUDH NJ

## Summary
The broad goal of this paper is to understand how a neural network learns the underlying distribution of the input data, and the properties of the network that describe its generalization power.
Previous literature tries to use statistical measures like Rademacher complexity, uniform stability and VC dimension to explain the generalization error of the model. These methods explain generalization in terms of the number of parameters in the model along with the applied regularizat...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.03530#anirudhnj
Fri, 26 Jun 2020 15:33:03 +0000

# Markets are efficient if and only if P = NP (journals/af/Maymin11)
Summary by quaxton

Is the market efficient? This is perhaps the most prevalent question in all of finance. While this paper does not aim to answer that question, it does frame it in an information-theoretic context. Mainly, Maymin shows that at least the weak form of the efficient market hypothesis (EMH) holds if and only if P = NP.
First, he defines what an efficient market means:
"The weakest form of the EMH states that future prices cannot be predicted by analyzing prices from the past. Therefore, technical ana...
http://www.shortscience.org/paper?bibtexKey=journals/af/Maymin11#jyang772
Thu, 04 Jun 2020 02:53:53 +0000

# Comparing Rewinding and Fine-tuning in Neural Network Pruning (conf/iclr/RendaFC20)
Summary by CodyWild

This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve performance comparable to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has bee...
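The prune-rewind-retrain recipe can be sketched as follows (toy weights and magnitude pruning; the "training" step is a stand-in for real optimization):

```python
# Toy sketch of "rewind and retrain" magnitude pruning (illustrative): save an
# early checkpoint, train, prune the smallest-magnitude weights, then reset the
# surviving weights to their early values before retraining with the mask fixed.
import random

random.seed(0)
n = 20
weights = [random.gauss(0, 1) for _ in range(n)]
early_checkpoint = list(weights)          # snapshot from early in training

# "Train": pretend training rescales the weights (stand-in for real training).
weights = [w * 1.5 for w in weights]

# Prune: mask out the 50% of weights with the smallest final magnitude.
order = sorted(range(n), key=lambda i: abs(weights[i]))
pruned = set(order[: n // 2])
mask = [0 if i in pruned else 1 for i in range(n)]

# Rewind: surviving weights are reset to their early-checkpoint values.
rewound = [m * w0 for m, w0 in zip(mask, early_checkpoint)]
# (Retraining from `rewound` with `mask` held fixed would follow here.)
```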
http://www.shortscience.org/paper?bibtexKey=conf/iclr/RendaFC20#decodyng
Fri, 15 May 2020 03:18:21 +0000

# Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels (arXiv:2004.13649)
Summary by CodyWild

One of the most notable flaws of modern model-free reinforcement learning is its sample inefficiency; where humans can learn a new task with relatively few examples, models that learn policies or value functions directly from raw data need huge amounts of data to train properly. Because the model isn't given any semantic features, it has to learn a meaningful representation from raw pixels using only the (often sparse, often noisy) signal of reward. Some past approaches have tried learning repres...
http://www.shortscience.org/paper?bibtexKey=journals/corr/2004.13649#decodyng
Sun, 10 May 2020 05:46:18 +0000

# Regularizing Trajectory Optimization with Denoising Autoencoders (arXiv:1903.11981)
Summary by Robert Müller

The typical model-based reinforcement learning (RL) loop consists of collecting data, training a model of the environment, and using the model to do model predictive control (MPC). If, however, the model is wrong (for example, for state-action pairs that have barely been visited), MPC fails as the imagined model and reality no longer align. Boney et al. propose to tackle this with a denoising autoencoder for trajectory regularization according to the fam...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1903-11981#robertmueller
Thu, 07 May 2020 08:08:00 +0000

# What Can Learned Intrinsic Rewards Capture? (arXiv:1912.05500)
Summary by CodyWild

This paper out of DeepMind is an interesting synthesis of ideas out of the research areas of meta learning and intrinsic rewards. The hope for intrinsic reward structures in reinforcement learning - things like uncertainty reduction or curiosity - is that they can incentivize behavior like information-gathering and exploration, which aren't incentivized by the explicit reward in the short run, but which can lead to higher total reward in the long run. So far, intrinsic rewards have mostly been ...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1912-05500#decodyng
Tue, 05 May 2020 06:22:03 +0000

# Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (conf/icml/FinnAL17)
Summary by Andrea Walter Ruggerini

## TL;DR
The paper presents a model-agnostic strategy to perform few-shot learning by taking advantage of prior knowledge acquired during multitask learning. Such prior knowledge derives from priors acquired about generalized model parameters (e.g. weights or hyperparameters) during the Model-Agnostic Meta-Learning (MAML) algorithm. The strategy can be applied to any algorithm trained with gradient descent (not only neural networks), making it more general and perhaps more effective than transfer learnin...
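A minimal MAML loop can be written for a one-parameter linear model, where the gradient through the inner update is computable by hand (the task distribution and learning rates below are made up, and for brevity the same samples serve as both support and query set):

```python
# Scalar MAML sketch (illustrative): the model is f(x) = w * x, each task is
# y = a * x for a task-specific slope a, and the meta-gradient differentiates
# through a single inner gradient step exactly.
import random

random.seed(0)
alpha, beta = 0.05, 0.05                  # inner / outer learning rates
w = 0.0                                   # meta-parameter initialization

def task_grad(w, xs, ys):
    """Gradient of the mean squared error of f(x) = w * x."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def meta_loss(w):
    """Average post-adaptation loss over a batch of fresh tasks."""
    total = 0.0
    for _ in range(20):
        a = random.uniform(1.0, 3.0)
        xs = [random.uniform(-1, 1) for _ in range(10)]
        ys = [a * x for x in xs]
        w_adapted = w - alpha * task_grad(w, xs, ys)
        total += sum((w_adapted * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return total / 20

loss_before = meta_loss(w)
for _ in range(300):
    a = random.uniform(1.0, 3.0)          # sample a task
    xs = [random.uniform(-1, 1) for _ in range(10)]
    ys = [a * x for x in xs]
    m2 = sum(x * x for x in xs) / len(xs)
    w_inner = w - alpha * task_grad(w, xs, ys)
    # Outer gradient: derivative of the post-adaptation loss w.r.t. w, through
    # the inner step (d w_inner / d w = 1 - 2 * alpha * m2 for this model).
    w -= beta * task_grad(w_inner, xs, ys) * (1 - 2 * alpha * m2)
loss_after = meta_loss(w)
```

The meta-trained initialization sits near the center of the task distribution, so a single inner gradient step adapts much further than it would from an arbitrary starting point.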
http://www.shortscience.org/paper?bibtexKey=conf/icml/FinnAL17#andreaw
Sun, 03 May 2020 14:29:05 +0000

# Reformer: The Efficient Transformer (arXiv:2001.04451)
Summary by CodyWild

The Transformer architecture - which uses a structure entirely based on key-value attention mechanisms to process sequences such as text - has taken over the worlds of language modeling and NLP in the past three years. However, Transformers at the scale used for large language models have huge computational and memory requirements.
This is largely driven by the fact that information at every step in the sequence (or, in the so-far-generated sequence during generation) is used to inform the rep...
http://www.shortscience.org/paper?bibtexKey=journals/corr/2001.04451#decodyng
Sun, 03 May 2020 05:14:23 +0000

# Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space (arXiv:1909.11655)
Summary by CodyWild

I found this paper a bit difficult to fully understand. Its premise, as far as I can follow, is that we may want to use genetic algorithms (GA), where we make modifications to elements in a population, and keep elements around at a rate proportional to some set of their desirable properties. In particular we might want to use this approach for constructing molecules that have properties (or predicted properties) we want. However, a downside of GA is that it's easy to end up in local minima, where...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1909-11655#decodyng
Fri, 01 May 2020 05:38:46 +0000

# Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction (conf/nips/KumarFSTL19)
Summary by Robert Müller

Kumar et al. propose an algorithm to learn in batch reinforcement learning (RL), a setting where an agent learns purely from a fixed batch of data, $B$, without any interactions with the environment. The data in the batch is collected according to a batch policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which constrains the newly learned policy to pl...
http://www.shortscience.org/paper?bibtexKey=conf/nips/KumarFSTL19#robertmueller
Thu, 30 Apr 2020 13:31:29 +0000

# AI-aided design of novel targeted covalent inhibitors against SARS-CoV-2 (10.1101/2020.03.03.972133)
Summary by CodyWild

This preprint is a bit rambling, and I don't know that I fully followed what it was doing, but here's my best guess:
- We think it's probably the case that SARS-COV2 (COVID19) uses a protease (enzyme involved in its reproduction) that isn't available and co-optable in the human body, and is also quite similar to the comparable protease protein in the original SARS virus. Therefore, it is hoped that we might be able to take inhibitors that bind to SARS, and modify them in small ways to make t...
http://www.shortscience.org/paper?bibtexKey=10.1101/2020.03.03.972133#decodyng
Thu, 30 Apr 2020 04:36:33 +0000

# Directional Message Passing for Molecular Graphs (arXiv:2003.03123)
Summary by CodyWild

This paper, presented this week at ICLR 2020, builds on existing applications of message-passing Graph Neural Networks (GNN) for molecular modeling (specifically: for predicting quantum properties of molecules), and extends them by introducing a way to represent angles between atoms, rather than just distances between them, as current methods are limited to.
The basic version of GNNs on molecule data works by creating features attached to atoms at each level (starting at level 0 with the eleme...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-2003-03123#decodyng
Wed, 29 Apr 2020 03:42:52 +0000

# Behavior Regularized Offline Reinforcement Learning (arXiv:1911.11361)
Summary by Robert Müller

Wu et al. provide a framework (behavior regularized actor critic (BRAC)) which they use to empirically study the impact of different design choices in batch reinforcement learning (RL). Specific instantiations of the framework include BCQ, KL-Control and BEAR.
Pure off-policy RL describes the problem of learning a policy purely from a batch $B$ of one-step transitions collected with a behavior policy $\pi_b$. The setting allows for no further interactions with the environment. This learning re...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1911-11361#robertmueller
Mon, 27 Apr 2020 13:02:23 +0000

# Self-Attention Based Molecule Representation for Predicting Drug-Target Interaction (arXiv:1908.06760)
Summary by CodyWild

In the last three years, Transformers, or models based entirely on attention for aggregating information from across multiple places in a sequence, have taken over the world of NLP. In this paper, the authors propose using a Transformer to learn a molecular representation, and then building a model to predict drug/target interaction on top of that learned representation. A drug/target interaction model takes in two inputs - a protein involved in a disease pathway, and a (typically small) molecul...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1908-06760#decodyng
Sun, 26 Apr 2020 06:39:30 +0000

# Improved protein structure prediction using potentials from deep learning (10.1038/s41586-019-1923-7)
Summary by CodyWild

In January of this year (2020), DeepMind released a model called AlphaFold, which uses convolutional networks atop sequence-based and evolutionary features to predict protein folding structure. In particular, their model was designed to predict a distribution for how far away each pair of amino acids will be from one another in the final folded structure. Given such a trained model, you can score a candidate structure according to how likely it is under the model, and - if your process for gener...
http://www.shortscience.org/paper?bibtexKey=10.1038/s41586-019-1923-7#decodyng
Fri, 24 Apr 2020 04:38:20 +0000

# Format-Preserving Encryption (journals/iacr/BellareRRS09)
Summary by quaxton

Format-preserving encryption is a deterministic encryption scheme that encrypts plaintext of some specified format into ciphertext of the same format. This has a lot of practical use cases, such as storing SSNs or credit card information without having to change the underlying schematics of the database or application that stores the data. The protected data is indistinguishable from unprotected data, and still enables some analytics over it, such as with masking (i.e., displaying the last four digits ...
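One classic construction in this space is cycle walking: apply a permutation on a slightly larger domain repeatedly until the result lands back in the target format. A toy, deliberately insecure sketch for 4-digit decimal strings:

```python
# Illustrative cycle-walking construction (one standard FPE technique, not the
# paper's full scheme): permute a superset of the domain and re-apply until the
# output is again a valid 4-digit string. The permutation below is a toy
# modular map and is NOT cryptographically secure.
def toy_permutation(n, key=12345):
    """A fixed bijection on {0, ..., 16383} (8191 is coprime to 16384)."""
    return (n * 8191 + key) % 16384

def fpe_encrypt(plaintext):               # plaintext: 4-digit decimal string
    n = int(plaintext)
    n = toy_permutation(n)
    while n >= 10000:                     # cycle-walk until output is 4 digits
        n = toy_permutation(n)
    return f"{n:04d}"

c = fpe_encrypt("4111")
```

Because the underlying map is a bijection, restricting it to the 4-digit domain via cycle walking is still a bijection, so the scheme is deterministic and decryptable (walk the cycle with the inverse permutation); real schemes replace the toy map with a keyed cipher.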
http://www.shortscience.org/paper?bibtexKey=journals/iacr/BellareRRS09#jyang772
Thu, 23 Apr 2020 22:05:16 +0000

# Gaussian Processes in Machine Learning (conf/ac/Rasmussen03)
Summary by Friedrich-Maximilian Weberling

In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions.
A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covari...
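Given those two ingredients, GP regression is just Gaussian conditioning; a minimal sketch with a zero mean function and an RBF kernel (the inputs and noise level below are made up):

```python
# Minimal GP regression in the notation above: zero mean function m(x) = 0 and
# an RBF covariance k(x, x') = exp(-|x - x'|^2 / 2); the posterior mean and
# variance follow the standard Gaussian conditioning formulas.
import numpy as np

def k(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.array([-2.0, 0.0, 1.5])            # training inputs
y = np.sin(X)                             # training targets
noise = 1e-6                              # tiny jitter for numerical stability

K = k(X, X) + noise * np.eye(len(X))      # train covariance matrix
Xs = np.linspace(-3, 3, 61)               # test inputs
Ks = k(Xs, X)                             # cross-covariance test/train

alpha = np.linalg.solve(K, y)
mean = Ks @ alpha                         # posterior mean at the test inputs
cov = k(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
var = np.diag(cov)                        # posterior variance per test point
```

Near the training inputs the posterior mean interpolates the targets and the variance collapses; far from them the variance reverts to the prior, which is the behavior the tutorial emphasizes.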
http://www.shortscience.org/paper?bibtexKey=conf/ac/Rasmussen03#fweberling1995
Tue, 21 Apr 2020 20:05:41 +0000

# Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (arXiv:1903.08254)
Summary by Robert Müller

Rakelly et al. propose a method to do off-policy meta reinforcement learning (RL). The method achieves a 20-100x improvement in sample efficiency compared to on-policy meta RL like MAML+TRPO.
The key difficulty for offline meta RL arises from the meta-learning assumption that meta-training and meta-test time match. However, during test time the policy has to explore and thus sees on-policy data, in contrast to the off-policy data that should be used at meta-training. The key contrib...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1903-08254#robertmueller
Tue, 21 Apr 2020 08:39:21 +0000

# Predicting protein–protein interactions through sequence-based deep learning (10.1093/bioinformatics/bty573)
Summary by CodyWild

Most of the interesting mechanics within living things are mediated by interactions between proteins, making it important and useful to have good predictive models of whether proteins will interact with one another, for validating possible interaction graph structures.
Prior methods for this problem - which takes as its input sequence representations of two proteins, and outputs a probability of interaction - have pursued different ideas for how to combine information from the two proteins. On...
http://www.shortscience.org/paper?bibtexKey=10.1093/bioinformatics/bty573#decodyng
Tue, 21 Apr 2020 06:36:31 +0000

# Meta-Learning via Learned Loss (arXiv:1906.05374)
Summary by Robert Müller

Bechtle et al. propose meta learning via learned loss ($ML^3$) and derive and empirically evaluate the framework on classification, regression, model-based and model-free reinforcement learning tasks.
The problem is formalized as learning the parameters $\Phi$ of a meta loss function $M_\Phi$ that computes loss values $L_{learned} = M_{\Phi}(y, f_{\theta}(x))$. Following the outer-inner loop meta algorithm design, the learned loss $L_{learned}$ is used to update the parameters of the learner in the...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1906.05374#robertmueller
Mon, 20 Apr 2020 16:28:20 +0000

# Junction Tree Variational Autoencoder for Molecular Graph Generation (arXiv:1802.04364)
Summary by CodyWild

Prior to this paper, most methods that used machine learning to generate molecular blueprints did so using SMILES representations - a string format with characters representing different atoms and bond types. This preference came about because ML had existing methods for generating strings that could be built on for generating SMILES (a particular syntax of string). However, an arguably more accurate and fundamental way of representing molecules is as graphs (with atoms as nodes and bonds as edg...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1802-04364#decodyng
Mon, 20 Apr 2020 04:48:28 +0000

# Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models (arXiv:1705.10843)
Summary by CodyWild

This paper's proposed method, the cleverly named ORGAN, combines techniques from GANs and reinforcement learning to generate candidate molecular sequences that incentivize desirable properties while still remaining plausibly on-distribution.
Prior papers I've read on molecular generation have by and large used approaches based in maximum likelihood estimation (MLE) - where you construct some distribution over molecular representations, and maximize the probability of your true data under that ...
http://www.shortscience.org/paper?bibtexKey=journals/corr/GuimaraesSFA17#decodyng
Sat, 18 Apr 2020 04:57:12 +0000

# Molecular de-novo design through deep reinforcement learning (journals/jcheminf/OlivecronaBEC17)
Summary by CodyWild

Over the past few days, I've been reading about different generative neural networks being tried out for molecular generation. So far this has mostly focused on latent variable space models like autoencoders, but today I shifted attention to a different approach rooted in reinforcement learning. The goal of most of these methods is 1) to build a generative model that can sample plausible molecular structures, but more saliently 2) specifically generate molecules optimized to exhibit some propert...
http://www.shortscience.org/paper?bibtexKey=journals/jcheminf/OlivecronaBEC17#decodyng
Fri, 17 Apr 2020 06:00:27 +0000

# Once for All: Train One Network and Specialize it for Efficient Deployment (arXiv:1908.09791)
Summary by ameroyer

**Summary**: The goal of this work is to propose a "Once-for-all" (OFA) network: a large network which is trained such that its subnetworks (subsets of the network with smaller width, convolutional kernel sizes, shallower units) are also trained towards the target task. This allows adapting the architecture to a given budget at inference time while preserving performance.
**Elastic Parameters.**
The goal is to train a large architecture that contains several well-trained subnetworks with dif...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1908-09791#ameroyer
Thu, 16 Apr 2020 17:48:55 +0000

# Automatic chemical design using a data-driven continuous representation of molecules (arXiv:1610.02415)
Summary by CodyWild

I'll admit that I found this paper a bit of a letdown to read, relative to expectations rooted in its high citation count, and my general excitement and interest to see how deep learning could be brought to bear on molecular design. But before a critique, let's first walk through the mechanics of how the authors' approach works.
The method proposed is basically a very straightforward Variational Auto Encoder, or VAE. It takes in a textual SMILES string representation of a molecular structure,...
http://www.shortscience.org/paper?bibtexKey=journals/corr/Gomez-Bombarelli16#decodyng
Wed, 15 Apr 2020 03:11:44 +0000

# Efficient Fully Homomorphic Encryption from (Standard) LWE (journals/iacr/BrakerskiV11)
Summary by quaxton

Brakerski and Vaikuntanathan introduce a fully homomorphic encryption (FHE) scheme based solely on the decisional learning with errors (LWE) security assumption, moving away from the relatively obscure mathematics of ideal lattices. They introduce relinearization and modulus switching techniques for dimensionality reduction and for removing the “squashing” step of Craig Gentry’s FHE scheme. BV11 and other similar schemes are commonly referred to as “second generation FHE” schemes.
R...
http://www.shortscience.org/paper?bibtexKey=journals/iacr/BrakerskiV11#jyang772
Mon, 13 Apr 2020 02:16:23 +0000

# Neural Message Passing for Quantum Chemistry (arXiv:1704.01212)
Summary by CodyWild

In the years before this paper came out in 2017, a number of different graph convolution architectures - which use weight-sharing and order-invariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph - had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing ...
http://www.shortscience.org/paper?bibtexKey=journals/corr/GilmerSRVD17#decodyng
Fri, 10 Apr 2020 06:05:16 +0000

# Efficient Convolutional Network Learning using Parametric Log based Dual-Tree Wavelet ScatterNet (arXiv:1708.09259)
Summary by hanoch kremer

ScatterNets incorporate geometric knowledge of images to produce discriminative and invariant (translation and rotation) features, i.e. edge information - the same outcome that a CNN's first layers produce. So why not replace those first layers with an equivalent, fixed structure, and let the optimizer find the best weights for the CNN with its leading edge removed?
The main motivations of the idea of replacing the first convolutional, ReLU and pooling layers of the CNN with a two-layer parametric log-b...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1708.09259#hanochkremer
Thu, 09 Apr 2020 12:05:38 +0000

# Universal Portfolios (10.1111/j.1467-9965.1991.tb00002.x)
Summary by quaxton

Cover's Universal Portfolio is an information-theoretic portfolio optimization algorithm that utilizes constantly rebalanced portfolios (CRP). A CRP is one in which the distribution of wealth among stocks in the portfolio remains the same from period to period. Universal Portfolio strictly performs rebalancing based on historical pricing, making no assumptions about the underlying distribution of the prices.
The wealth achieved by a CRP over n periods is:
$S_n(b,x^n) = \displaystyle \prod_{n}...
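Using Cover's definitions (the wealth of a CRP with portfolio vector $b$ over price-relative vectors $x_i$ is the product of the per-period returns $b^\top x_i$), both the CRP wealth and a coarse grid version of the universal portfolio fit in a few lines; the price relatives below are made up:

```python
# Sketch of CRP wealth and a coarse universal portfolio for two assets, using
# the standard definition S_n(b, x^n) = prod_i (b . x_i) over price relatives
# x_i. The four periods of data below are invented for illustration.

# Price relatives per period: (asset A, asset B).
xs = [(1.1, 0.9), (0.8, 1.2), (1.2, 0.9), (0.9, 1.15)]

def crp_wealth(b, xs):
    """Wealth of a constantly rebalanced portfolio with fixed weights b."""
    wealth = 1.0
    for x in xs:
        wealth *= sum(bi * xi for bi, xi in zip(b, x))
    return wealth

# Universal portfolio (coarse grid approximation): average the wealth achieved
# by all CRPs on the simplex, weighting each uniformly. Its wealth provably
# tracks the best CRP chosen in hindsight up to a polynomial factor.
grid = [(i / 100, 1 - i / 100) for i in range(101)]
universal_wealth = sum(crp_wealth(b, xs) for b in grid) / len(grid)
best_crp_wealth = max(crp_wealth(b, xs) for b in grid)
```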
http://www.shortscience.org/paper?bibtexKey=10.1111/j.1467-9965.1991.tb00002.x#jyang772
Wed, 08 Apr 2020 23:17:22 +0000

# Low Data Drug Discovery with One-shot Learning (arXiv:1611.03199)
Summary by CodyWild

The goal of one-shot learning tasks is to design a learning structure that can perform a new task (or, more canonically, add a new class to an existing task) using only a small number of examples of the new task or class. So, as an example: you'd want to be able to take one positive and one negative example of a given task and correctly classify subsequent points as either positive or negative. A common way of achieving this, and the way that the paper builds on, is to learn a parametrized f...
http://www.shortscience.org/paper?bibtexKey=journals/corr/Altae-TranRPP16#decodyng
Wed, 08 Apr 2020 05:11:54 +0000

# MoleculeNet: A Benchmark for Molecular Machine Learning (arXiv:1703.00564)
Summary by CodyWild

This is a paper released by the creators of the DeepChem library/framework, explaining the efforts they've put into facilitating straightforward and reproducible testing of new methods. They advocate for consistency between tests on three main axes.
1. On the most basic level, that methods evaluate on the same datasets
2. That they use canonical train/test splits
3. That they use canonical metrics.
To that end, they've integrated a framework they call "MoleculeNet" into DeepChem, containing ...
http://www.shortscience.org/paper?bibtexKey=journals/corr/WuRFGGPLP17#decodyng
Tue, 07 Apr 2020 04:15:48 +0000

# Convolutional Networks on Graphs for Learning Molecular Fingerprints (arXiv:1509.09292)
Summary by CodyWild

If you read modern (that is, 2018-2020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of the lineage of the technique within this domain.
This 2015 paper, by Duvenaud et al., is the earliest one I can find. The entire paper focuses on comparing differentiable, message-passing networks to the ...
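The flavor of such a differentiable fingerprint layer can be sketched as follows (toy sizes and random untrained weights; the soft "hash" into fingerprint bins loosely mimics the paper's softmax write, but details differ from the actual architecture):

```python
# Rough sketch of a message-passing / neural-fingerprint layer in the spirit
# described above: each atom's feature vector is updated from its own features
# plus the sum of its neighbors' features, then softly "hashed" into a
# fingerprint vector via a softmax over bins.
import numpy as np

rng = np.random.default_rng(0)

# Toy "molecule": 4 atoms in a chain, bonds as an adjacency matrix.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 6))               # initial per-atom features

W_self = rng.normal(size=(6, 6))          # transform of an atom's own features
W_nbr = rng.normal(size=(6, 6))           # transform of aggregated neighbors
W_out = rng.normal(size=(6, 16))          # maps atom features to 16 bins

fingerprint = np.zeros(16)
for _ in range(3):                        # 3 message-passing levels
    h = np.tanh(h @ W_self + adj @ h @ W_nbr)   # aggregate neighbor messages
    logits = h @ W_out
    soft = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    fingerprint += soft.sum(axis=0)       # soft "write" into fingerprint bins
```

In contrast to circular fingerprints, every step here is differentiable, so the weights can be trained end-to-end against a downstream property-prediction loss.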
http://www.shortscience.org/paper?bibtexKey=journals/corr/DuvenaudMAGHAA15#decodyng
Mon, 06 Apr 2020 16:05:21 +0000

# Molecular Graph Convolutions: Moving Beyond Fingerprints (arXiv:1603.00856)
Summary by CodyWild

This paper was published after the 2015 Duvenaud et al. paper proposing a differentiable alternative to circular fingerprints of molecules: substituting out exact-match random hash functions to identify molecular structures with learned convolutional-esque kernels. As far as I can tell, the Duvenaud paper was the first to propose something we might today recognize as graph convolutions on atoms. I hoped this paper would build on that one, but it seems to be coming from a conceptually different di...
http://www.shortscience.org/paper?bibtexKey=journals/corr/KearnesMBPR16#decodyng
Mon, 06 Apr 2020 06:30:03 +0000

# Boosting Docking-based Virtual Screening with Deep Learning (arXiv:1608.04844)
Summary by CodyWild

My objective in reading this paper was to gain another perspective on, and thus a more well-grounded view of, machine learning scoring functions for docking-based prediction of ligand/protein binding affinity. As quick background context, these models are useful because many therapeutic compounds act by binding to a target protein, and it can be valuable to prioritize doing wet lab testing on compounds that are predicted to have a stronger binding affinity. Docking systems work by predicting the...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.04844#decodyng
http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.04844#decodyngSat, 04 Apr 2020 05:03:25 +00001910.02845journals/corr/1910.028453Combining docking pose rank and structure with deep learning improves protein-ligand binding mode predictionCodyWildThis paper focuses on the application of deep learning to the docking problem within rational drug design. The overall objective of drug design or discovery is to build predictive models of how well a candidate compound (or "ligand") will bind with a target protein, to help inform the decision of what compounds are promising enough to be worth testing in a wet lab. Protein binding prediction is important because many small-molecule drugs, which are designed to be small enough to get through cell...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1910.02845#decodyng
http://www.shortscience.org/paper?bibtexKey=journals/corr/1910.02845#decodyngFri, 03 Apr 2020 05:28:05 +00001910.01708journals/corr/1910.017083Benchmarking Batch Deep Reinforcement Learning AlgorithmsRobert MüllerThe authors propose a unified setting to evaluate the performance of batch reinforcement learning algorithms. The proposed benchmark is discrete and based on the popular Atari Domain. The authors review and benchmark several current batch RL algorithms against a newly introduced version of BCQ (Batch Constrained Deep Q Learning) for discrete environments.
Note in line 5 that the policy chooses actions with a restricted argmax operation, eliminating actions that do not have enough support in the...
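The restricted argmax can be sketched in a few lines of numpy (the function name, `tau`, and the thresholding rule are my own illustrative choices, not taken verbatim from the paper): actions whose probability under the learned behavior model falls too far below that of the most likely action are eliminated before maximizing over Q-values.

```python
import numpy as np

def bcq_select_action(q_values, behavior_probs, tau=0.3):
    """Sketch of a restricted argmax (names and threshold rule are mine):
    actions with too little support under the behavior model are masked out
    before taking the argmax over Q-values."""
    allowed = behavior_probs / behavior_probs.max() >= tau
    return int(np.argmax(np.where(allowed, q_values, -np.inf)))

# Action 2 has the highest Q-value but almost no support in the batch,
# so it is eliminated and action 0 is chosen instead.
q = np.array([1.0, 0.5, 2.0])
p = np.array([0.60, 0.39, 0.01])
```

With `tau=0.0` the restriction disappears and the usual greedy argmax is recovered.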
http://www.shortscience.org/paper?bibtexKey=journals/corr/1910.01708#robertmueller
http://www.shortscience.org/paper?bibtexKey=journals/corr/1910.01708#robertmuellerFri, 27 Mar 2020 14:40:38 +0000conf/icml/FujimotoMP193Off-Policy Deep Reinforcement Learning without ExplorationRobert MüllerInteracting with the environment sometimes comes at a high cost, for example in high-stakes scenarios like health care or teaching. Thus, instead of learning online, we might want to learn from a fixed buffer $B$ of transitions, which is filled in advance from a behavior policy.
The authors show that several so-called off-policy algorithms, like DQN and DDPG, fail dramatically in this pure off-policy setting.
They attribute this to the extrapolation error, which occurs in the update of a value es...
http://www.shortscience.org/paper?bibtexKey=conf/icml/FujimotoMP19#robertmueller
http://www.shortscience.org/paper?bibtexKey=conf/icml/FujimotoMP19#robertmuellerWed, 25 Mar 2020 10:07:55 +00002003.05856journals/corr/2003.058565Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual LearningMassimo Cacciadisclaimer: I'm the first author of the paper
## TL;DR
We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol,
i.e. sequentially learning a stream of tasks and testing our models' capacity to remember them all.
We think it's time for a new approach to Continual Learning (CL), coined OSAKA, which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning.
The main modifications we propose:
- bring CL cl...
http://www.shortscience.org/paper?bibtexKey=journals/corr/2003.05856#mcaccia
http://www.shortscience.org/paper?bibtexKey=journals/corr/2003.05856#mcacciaThu, 19 Mar 2020 16:41:59 +00001905.12558journals/corr/1905.125583Limitations of the Empirical Fisher Approximation for Natural Gradient DescentRobert MüllerIn this very well written paper, the authors analyse the relation between the Fisher $F(\theta) = \sum_n \mathbb{E}_{p_{\theta}(y \vert x_n)}[\nabla_{\theta} \log(p_{\theta}(y \vert x_n))\nabla_{\theta} \log(p_{\theta}(y \vert x_n))^T] $ and the empirical Fisher $\bar{F}(\theta) = \sum_n [\nabla_{\theta} \log(p_{\theta}(y_n \vert x_n))\nabla_{\theta} \log(p_{\theta}(y_n \vert x_n))^T] $, which has recently seen a surge in interest. The definitions differ in that $y_n$ is a training label instead of a samp...
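For intuition, here is a toy numpy comparison of the two quantities for binary logistic regression (my own illustrative example, not from the paper): since $\nabla_\theta \log p(y \vert x) = (y - p)x$, the Fisher weights each outer product $x x^T$ by the model's own predictive variance $p(1-p)$, while the empirical Fisher plugs in the observed label $y_n$ instead.

```python
import numpy as np

def fisher_matrices(theta, X, y):
    """Toy comparison for binary logistic regression: grad log p(y|x) = (y - p) x,
    so the Fisher sums p(1-p) x x^T (expectation over the model's own predictive
    distribution), while the empirical Fisher uses the observed labels y_n."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # sigmoid predictions
    F = sum(pn * (1 - pn) * np.outer(xn, xn) for pn, xn in zip(p, X))
    F_emp = sum((yn - pn) ** 2 * np.outer(xn, xn) for pn, yn, xn in zip(p, y, X))
    return F, F_emp

theta = np.array([0.5, -0.3])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
F, F_emp = fisher_matrices(theta, X, y)
```

Even in this tiny example the two matrices differ, which is exactly the gap the paper analyses.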
http://www.shortscience.org/paper?bibtexKey=journals/corr/1905.12558#robertmueller
http://www.shortscience.org/paper?bibtexKey=journals/corr/1905.12558#robertmuellerThu, 19 Mar 2020 08:59:52 +0000conf/nips/BafnaMV183Thwarting Adversarial Examples: An L_0-Robust Sparse Fourier TransformDavid StutzBafna et al. show that iterative hard thresholding results in $L_0$ robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where both the signal $x$ and the noise $e$ are assumed to be sparse. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically...
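A minimal sketch of the underlying idea (not the paper's exact Algorithm 1; `k`, `t`, and the iteration count are illustrative): alternate between keeping the $k$ largest Fourier coefficients as the signal estimate and the $t$ largest residual entries as the $L_0$-bounded noise estimate.

```python
import numpy as np

def iht_fourier_denoise(y, k, t, iterations=15):
    """Simplified alternating hard thresholding: x is assumed k-sparse in the
    Fourier domain, the noise e is assumed t-sparse (L_0-bounded) in the time
    domain. A sketch of the idea, not the paper's exact algorithm."""
    n = len(y)
    e_hat = np.zeros(n)
    for _ in range(iterations):
        X = np.fft.fft(y - e_hat)
        X[np.argsort(np.abs(X))[:-k]] = 0.0    # hard-threshold: keep top-k frequencies
        x_hat = np.fft.ifft(X).real
        r = y - x_hat
        e_hat = np.zeros(n)
        support = np.argsort(np.abs(r))[-t:]   # hard-threshold: keep top-t residuals
        e_hat[support] = r[support]
    return x_hat, e_hat

# Toy example: a single cosine (2-sparse in Fourier) corrupted by 3 large spikes.
n = 64
x_true = np.cos(2 * np.pi * 3 * np.arange(n) / n)
e_true = np.zeros(n)
e_true[[5, 20, 40]] = 5.0
x_rec, e_rec = iht_fourier_denoise(x_true + e_true, k=2, t=3)
```

On this toy signal the spikes are isolated exactly and the cosine is recovered, illustrating the provable-reconstruction claim in a best-case setting.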
http://www.shortscience.org/paper?bibtexKey=conf/nips/BafnaMV18#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/nips/BafnaMV18#davidstutzSat, 14 Mar 2020 23:31:48 +00001809.08758journals/corr/1809.087582Low Frequency Adversarial PerturbationDavid StutzGuo et al. propose to augment black-box adversarial attacks with low-frequency noise to obtain low-frequency adversarial examples as shown in Figure 1. To this end, the boundary attack as well as the NES attack are modified to sample from a low-frequency Gaussian distribution instead of directly from Gaussian noise. This is achieved through an inverse discrete cosine transform as detailed in the paper.
Figure 1: Example of a low-frequency adversarial example.
Also find this summary at [davidstut...
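The sampling step can be sketched with a hand-rolled orthonormal DCT (the grid size, frequency cutoff, and scaling below are my illustrative choices; the paper details its own parameterization): Gaussian noise is drawn only for the lowest cosine frequencies and mapped to pixel space with the inverse transform.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis matrix; rows are the cosine basis vectors."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    B = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    B[0] *= 1.0 / np.sqrt(n)
    B[1:] *= np.sqrt(2.0 / n)
    return B

def low_frequency_noise(n, r, scale=1.0, rng=None):
    """Sample Gaussian noise for the r x r lowest 2D DCT frequencies only and
    map it to pixel space via the inverse DCT (illustrative frequency mask)."""
    rng = np.random.default_rng(rng)
    Z = np.zeros((n, n))
    Z[:r, :r] = scale * rng.standard_normal((r, r))
    B = dct_basis(n)
    return B.T @ Z @ B  # inverse DCT of the masked coefficient grid

noise = low_frequency_noise(16, r=4, rng=0)
```

By construction, the DCT of the resulting noise image is exactly zero outside the low-frequency block, which is what makes the perturbations smooth in pixel space.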
http://www.shortscience.org/paper?bibtexKey=journals/corr/1809.08758#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/1809.08758#davidstutzSat, 14 Mar 2020 23:27:21 +000010.1109/cvprw.2018.002123Semantic Adversarial ExamplesDavid StutzHosseini and Poovendran propose semantic adversarial examples by randomly manipulating hue and saturation of images. In particular, in an iterative algorithm, hue and saturation are randomly perturbed and projected back to their valid range. If this results in mis-classification, the perturbed image is returned as the adversarial example and the algorithm is finished; if not, another iteration is run. The result is shown in Figure 1. As can be seen, the structure of the images is retained while h...
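The iterative random search can be sketched as follows (function names, step sizes, and the stand-in classifier are mine, purely for illustration): shift hue modulo 1, shift-and-clip saturation, and stop as soon as the classifier is fooled.

```python
import numpy as np

def semantic_attack(hsv_image, classifier, true_label, max_trials=100, rng=None):
    """Sketch of the random hue/saturation search (names and step sizes are
    mine). The image is assumed to be HSV with channels in [0, 1]."""
    rng = np.random.default_rng(rng)
    for _ in range(max_trials):
        perturbed = hsv_image.copy()
        perturbed[..., 0] = (perturbed[..., 0] + rng.uniform(-0.5, 0.5)) % 1.0
        perturbed[..., 1] = np.clip(perturbed[..., 1] + rng.uniform(-0.3, 0.3), 0.0, 1.0)
        if classifier(perturbed) != true_label:
            return perturbed  # mis-classified: semantic adversarial example found
    return None

def toy_classifier(im):
    # stand-in classifier that only looks at the mean hue
    return int(im[..., 0].mean() > 0.5)

img = np.zeros((4, 4, 3))
img[..., 0], img[..., 1], img[..., 2] = 0.2, 0.5, 0.7
adv = semantic_attack(img, toy_classifier, true_label=toy_classifier(img), rng=0)
```

Note that the value channel is never touched, which is why brightness and structure are preserved.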
http://www.shortscience.org/paper?bibtexKey=10.1109/cvprw.2018.00212#davidstutz
http://www.shortscience.org/paper?bibtexKey=10.1109/cvprw.2018.00212#davidstutzSat, 14 Mar 2020 23:17:20 +0000conf/icml/KarmonZG182LaVAN: Localized and Visible Adversarial NoiseDavid StutzKarmon et al. propose a gradient-descent based method for obtaining adversarial patch like localized adversarial examples. In particular, after selecting a region of the image to be modified, several iterations of gradient descent are run in order to maximize the probability of the target class and simultaneously minimize the probability in the true class. After each iteration, the perturbation is masked to the patch and projected onto the valid range of [0,1] for images. On ImageNet, the author...
http://www.shortscience.org/paper?bibtexKey=conf/icml/KarmonZG18#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/icml/KarmonZG18#davidstutzSat, 14 Mar 2020 23:13:00 +00001904.00759journals/corr/abs-1904-007592Adversarial camera stickers: A physical camera-based attack on deep learning systemsDavid StutzLi et al. propose camera stickers that, when computed adversarially and physically attached to the camera, lead to mis-classification. As illustrated in Figure 1, these stickers are realized using circular patches of uniform color. These individual circular stickers are computed in a gradient-descent fashion by optimizing their location, color and radius. The influence of the camera on these stickers is modeled realistically in order to guarantee success.
Figure 1: Illustration of adversarial s...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1904-00759#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1904-00759#davidstutzSat, 14 Mar 2020 22:54:51 +000010.1109/wacv.2019.001432Local Gradients Smoothing: Defense Against Localized Adversarial AttacksDavid StutzNaseer et al. propose to smooth local gradients as defense against adversarial patches. In particular, as illustrated in Figure 1, the local image gradient is computed through convolution. Then, in local, overlapping windows, the gradients are set to zero if the total sum of absolute gradient values exceeds a specific threshold. The remaining gradient map is supposed to indicate regions where it is likely that adversarial patches can be found. Using this gradient map, the image is smoothed, i.e....
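As a rough numpy sketch of this defense (window size, threshold, and the box blur are my illustrative stand-ins for the paper's choices): estimate the gradient magnitude via finite differences, flag overlapping windows whose total gradient mass is suspiciously high, and smooth the flagged pixels.

```python
import numpy as np

def local_gradients_smoothing(image, window=8, threshold=4.0):
    """Rough sketch (illustrative parameters): flag overlapping windows with
    high total gradient magnitude and replace the flagged pixels with a crude
    box-blurred version of the image."""
    gy, gx = np.gradient(image)
    mag = np.abs(gx) + np.abs(gy)
    mask = np.zeros(image.shape, dtype=bool)
    h, w = image.shape
    step = window // 2  # overlapping windows
    for i in range(0, h - window + 1, step):
        for j in range(0, w - window + 1, step):
            if mag[i:i + window, j:j + window].sum() > threshold:
                mask[i:i + window, j:j + window] = True
    blur = image.copy()
    blur[1:-1, 1:-1] = sum(image[1 + di:h - 1 + di, 1 + dj:w - 1 + dj]
                           for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 9.0
    smoothed = np.where(mask, blur, image)
    return smoothed, mask

# A high-frequency patch on a flat background is flagged; flat regions are not.
img = np.zeros((32, 32))
img[8:16, 8:16] = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
out, mask = local_gradients_smoothing(img)
```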
http://www.shortscience.org/paper?bibtexKey=10.1109/wacv.2019.00143#davidstutz
http://www.shortscience.org/paper?bibtexKey=10.1109/wacv.2019.00143#davidstutzSat, 14 Mar 2020 22:51:20 +0000conf/raid/ZuoYL0192Exploiting the Inherent Limitation of L0 Adversarial ExamplesDavid StutzZuo et al. propose a two-stage system for detecting $L_0$ adversarial examples. Their system is based on the following two observations: (a) $L_0$ adversarial examples often result in very drastic changes of individual pixels and (b) these pixels are usually isolated and scattered over the image. Thus, they propose to train a siamese network to detect adversarial examples. To this end, they use a pre-processor and train the network to detect adversarial examples by taking the input and the pre-p...
http://www.shortscience.org/paper?bibtexKey=conf/raid/ZuoYL019#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/raid/ZuoYL019#davidstutzSat, 14 Mar 2020 22:48:50 +0000conf/iclr/LeeAJ193Towards Robust, Locally Linear Deep NetworksDavid StutzLee et al. propose a regularizer to increase the size of linear regions of rectified deep networks around training and test points. Specifically, they assume piece-wise linear networks, in its most simplistic form consisting of linear layers (fully connected layers, convolutional layers) and ReLU activation functions. In these networks, linear regions are determined by activation patterns, i.e., a pattern indicating which neurons have value greater than zero. Then, the goal is to compute, and la...
http://www.shortscience.org/paper?bibtexKey=conf/iclr/LeeAJ19#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/iclr/LeeAJ19#davidstutzFri, 13 Mar 2020 22:25:08 +0000conf/aaai/LiuYLSCL192DPATCH: An Adversarial Patch Attack on Object DetectorsDavid StutzLiu et al. propose DPatch, adversarial patches against state-of-the-art object detectors. Similar to existing adversarial patches, where a patch with fixed pixels is placed in an image in order to evade (or change) classification, the authors compute their DPatch using an optimization procedure. During optimization, the patch to be optimized is placed in random locations on all images of, e.g., PASCAL VOC 2007, and the pixels are updated in order to maximize the loss of the classifier (either ...
http://www.shortscience.org/paper?bibtexKey=conf/aaai/LiuYLSCL19#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/aaai/LiuYLSCL19#davidstutzFri, 13 Mar 2020 22:16:25 +0000conf/nips/SalmanLRZZBY192Provably Robust Deep Learning via Adversarially Trained Smoothed ClassifiersDavid StutzSalman et al. combined randomized smoothing with adversarial training based on an attack specifically designed against smoothed classifiers. Specifically, they consider the formulation of randomized smoothing by Cohen et al. [1]; here, Gaussian noise around the input (adversarial or clean) is sampled and the classifier takes a simple majority vote. In [1], Cohen et al. show that this results in good bounds on robustness. In this paper, Salman et al. propose an adaptive attack against randomized ...
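The majority-vote prediction of Cohen et al.'s smoothed classifier is easy to sketch (the stand-in base classifier and sample count are mine; the statistical certification test and radius computation from the papers are omitted):

```python
import numpy as np

def smoothed_predict(classifier, x, sigma=0.25, n_samples=500, rng=None):
    """Majority vote of the base classifier under Gaussian input noise,
    sketching the randomized-smoothing prediction rule."""
    rng = np.random.default_rng(rng)
    votes = {}
    for _ in range(n_samples):
        label = classifier(x + sigma * rng.standard_normal(x.shape))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def base_classifier(v):
    # stand-in base classifier: a linear decision on the sum of the inputs
    return int(v.sum() > 0)

x = np.array([0.1, 0.1])
pred = smoothed_predict(base_classifier, x, rng=0)
```

The attack proposed in the paper targets exactly this smoothed decision rule rather than the base classifier alone.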
http://www.shortscience.org/paper?bibtexKey=conf/nips/SalmanLRZZBY19#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/nips/SalmanLRZZBY19#davidstutzFri, 13 Mar 2020 22:07:15 +0000conf/ccs/LambVKB192Interpolated Adversarial Training: Achieving Robust Neural Networks Without Sacrificing Too Much AccuracyDavid StutzLamb et al. propose interpolated adversarial training to increase robustness against adversarial examples. Particularly, a $50\%/50\%$ variant of adversarial training is used, i.e., in each iteration the batch consists of $50\%$ clean and $50\%$ adversarial examples. The loss is then computed on these both parts, encouraging the network to predict the correct labels on the adversarial examples, and averaged afterwards. In interpolated adversarial training, the loss is adapted according to the Mi...
http://www.shortscience.org/paper?bibtexKey=conf/ccs/LambVKB19#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/ccs/LambVKB19#davidstutzFri, 13 Mar 2020 21:59:51 +0000conf/nips/Bartlett962For Valid Generalization the Size of the Weights is More Important than the Size of the NetworkDavid StutzBartlett shows that lower generalization bounds for multi-layer perceptrons with limited sizes of the weights can be found using the so-called fat-shattering dimension. Similar to the classical VC dimension, the fat-shattering dimension quantifies the expressiveness of hypothesis classes in machine learning. Specifically, considering a sequence of points $x_1, \ldots, x_d$, a hypothesis class $H$ is said to shatter this sequence if, for any label assignment $b_1, \ldots, b_d \in \{-1,1\}$, a fu...
http://www.shortscience.org/paper?bibtexKey=conf/nips/Bartlett96#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/nips/Bartlett96#davidstutzFri, 13 Mar 2020 21:55:39 +00001905.03837duesterwald2019exploring3Exploring the Hyperparameter Landscape of Adversarial RobustnessDavid StutzDuesterwald et al. study the influence of hyperparameters on adversarial training and its robustness as well as accuracy. As shown in Figure 1, the chosen parameters, the ratio of adversarial examples per batch and the allowed perturbation $\epsilon$, make it possible to control the trade-off between adversarial robustness and accuracy. Even for larger $\epsilon$, at least on MNIST and SVHN, using only a few adversarial examples per batch increases robustness significantly while only incurring a small loss i...
http://www.shortscience.org/paper?bibtexKey=duesterwald2019exploring#davidstutz
http://www.shortscience.org/paper?bibtexKey=duesterwald2019exploring#davidstutzThu, 12 Mar 2020 22:07:26 +00001901.09878journals/corr/abs-1901-098782CapsAttacks: Robust and Imperceptible Adversarial Attacks on Capsule NetworksDavid StutzMarchisio et al. propose a black-box adversarial attack on Capsule Networks. The main idea of the attack is to select pixels based on their local standard deviation. Given a window of allowed pixels to be manipulated, these are sorted based on standard deviation and possible impact on the predicted probability (i.e., gap between target class probability and maximum other class probability). A subset of these pixels is then manipulated by a fixed noise value $\delta$. In experiments, the attack i...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1901-09878#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1901-09878#davidstutzThu, 12 Mar 2020 22:00:51 +00001704.03453journals/corr/TramerPGBM172The Space of Transferable Adversarial ExamplesDavid StutzTramer et al. study adversarial subspaces, subspaces of the input space that are spanned by multiple, orthogonal adversarial examples. This is achieved by iteratively searching for orthogonal adversarial examples, relative to a specific test example. This can, for example, be done using classical second- or first-order optimization methods for finding adversarial examples with the additional constraint of finding orthogonal adversarial examples. However, the authors also consider different attac...
http://www.shortscience.org/paper?bibtexKey=journals/corr/TramerPGBM17#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/TramerPGBM17#davidstutzThu, 12 Mar 2020 21:50:49 +00001906.05419journals/corr/abs-1906-054192Efficient Evaluation-Time Uncertainty Estimation by Improved DistillationDavid StutzEnglesson and Azizpour propose an adapted knowledge distillation scheme to improve confidence calibration on out-of-distribution examples including adversarial examples. In contrast to vanilla distillation, they make the following changes: First, high-capacity student networks are used, for example, by increasing depth or width. Then, the target distribution is “sharpened” using the true label by reducing the distribution's overall entropy. Finally, for wrong predictions of the teacher model,...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-05419#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-05419#davidstutzMon, 09 Mar 2020 21:59:38 +0000conf/iclr/HendrycksD192Benchmarking Neural Network Robustness to Common Corruptions and PerturbationsDavid StutzHendrycks and Dietterich propose ImageNet-C and ImageNet-P benchmarks for corruption and perturbation robustness evaluation. Both datasets come in various sizes, and corruptions always come in different difficulties. The used corruptions include many common, realistic noise types such as various types of blur and random noise, brightness changes and compression artifacts. ImageNet-P differs from ImageNet-C in that sequences of perturbations are generated. This means, for a specific perturbation ...
http://www.shortscience.org/paper?bibtexKey=conf/iclr/HendrycksD19#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/iclr/HendrycksD19#davidstutzMon, 09 Mar 2020 21:57:45 +00001905.06455journals/corr/abs-1905-064552On Norm-Agnostic Robustness of Adversarial TrainingDavid StutzLi et al. evaluate adversarial training using both $L_2$ and $L_\infty$ attacks and proposes a second-order attack. The main motivation of the paper is to show that adversarial training cannot increase robustness against both $L_2$ and $L_\infty$ attacks. To this end, they propose a second-order adversarial attack and experimentally show that ensemble adversarial training can partly solve the problem.
Also find this summary at [davidstutz.de]().
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1905-06455#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1905-06455#davidstutzMon, 09 Mar 2020 21:41:28 +00001906.02611journals/corr/abs-1906-026112Improving Robustness Without Sacrificing Accuracy with Patch Gaussian AugmentationDavid StutzLopes et al. propose patch-based Gaussian data augmentation to improve accuracy and robustness against common corruptions. Their approach is intended to be an interpolation between Gaussian noise data augmentation and CutOut. During training, random patches on images are selected and random Gaussian noise is added to these patches. With increasing noise level (i.e., its standard deviation) this results in CutOut; with increasing patch size, this results in regular Gaussian noise data augmentatio...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02611#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02611#davidstutzMon, 09 Mar 2020 21:33:59 +00001906.02337journals/corr/abs-1906-023373MNIST-C: A Robustness Benchmark for Computer VisionDavid StutzMu and Gilmer introduce MNIST-C, an MNIST-based corruption benchmark for out-of-distribution evaluation. The benchmark includes various corruption types including random noise (shot and impulse noise), blur (glass and motion blur), (affine) transformations, “striping” or occluding parts of the image, using Canny images or simulating fog. These corruptions are also shown in Figure 1. The transformations have been chosen to be semantically invariant, meaning that the true class of the image do...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02337#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02337#davidstutzMon, 09 Mar 2020 21:27:36 +0000conf/icml/TeyeAS183Bayesian Uncertainty Estimation for Batch Normalized Deep NetworksDavid StutzTeye et al. show that neural networks with batch normalization can be used to give uncertainty estimates through Monte Carlo sampling. In particular, instead of using the test mode of batch normalization, where the statistics (mean and variance) of each batch normalization layer are fixed, these statistics are computed per batch, as in training mode. To this end, for a specific query image, random batches from the training set are sampled, and prediction uncertainty is estimated using Monte Carl...
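A toy illustration of the idea (the 1D "network", a batch-normalization step followed by a fixed linear layer, and all parameters are purely my own stand-ins): keep batch normalization in training mode, normalize the query with the statistics of randomly drawn training batches, and read off uncertainty from the spread of the predictions.

```python
import numpy as np

def mc_batchnorm_predict(query, train_data, n_batches=200, batch_size=32, rng=None):
    """Toy Monte Carlo batch normalization: the query is normalized with the
    *batch* statistics of random training batches (training-mode BN), and the
    spread of the resulting predictions serves as an uncertainty estimate."""
    rng = np.random.default_rng(rng)
    w, b = 2.0, 0.5  # fixed post-normalization linear layer (illustrative)
    preds = []
    for _ in range(n_batches):
        batch = rng.choice(train_data, size=batch_size, replace=True)
        mu, sigma = batch.mean(), batch.std() + 1e-5
        preds.append(w * (query - mu) / sigma + b)
    preds = np.array(preds)
    return preds.mean(), preds.std()  # predictive mean and uncertainty

train_data = np.random.default_rng(0).normal(0.0, 1.0, size=1000)
mean_pred, uncertainty = mc_batchnorm_predict(0.0, train_data, rng=1)
```

In test-mode batch normalization the statistics would be fixed, so every pass would give the same prediction and no uncertainty estimate.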
http://www.shortscience.org/paper?bibtexKey=conf/icml/TeyeAS18#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/icml/TeyeAS18#davidstutzMon, 09 Mar 2020 21:19:42 +00001607.06450journals/corr/1607.064502Layer NormalizationDavid StutzBa et al. propose layer normalization, normalizing the activations of a layer by its mean and standard deviation. In contrast to batch normalization, this scheme does not depend on the current batch; thus, it performs the same computation at training and test time. The general scheme, however, is very similar. Given the $l$-th layer of a multi-layer perceptron,
$a_i^l = (w_i^l)^T h^l$ and $h_i^{l + 1} = f(a_i^l + b_i^l)$
with $W^l$ being the weight matrix, the activations $a_i^l$ are normalize...
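The computation above is simple to sketch in numpy (scalar gain/bias and `eps` are illustrative simplifications of the learnable per-unit parameters):

```python
import numpy as np

def layer_norm(a, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize pre-activations over the units of a layer, per example,
    following the formulas above; gain and bias play the role of the usual
    learnable rescaling parameters (scalars here for simplicity)."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return gain * (a - mu) / (sigma + eps) + bias

# Unlike batch normalization, the result for one example does not depend on
# the rest of the batch, so training and test computations are identical.
a = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(a)
```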
http://www.shortscience.org/paper?bibtexKey=journals/corr/1607.06450#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/1607.06450#davidstutzSun, 08 Mar 2020 19:20:46 +00001802.08760journals/corr/1802.087602Sensitivity and Generalization in Neural Networks: an Empirical StudyDavid StutzNovak et al. study the relationship between neural network sensitivity and generalization. Here, sensitivity is measured in terms of the Frobenius norm of the Jacobian of the network’s probabilities (which does not depend on the true label) or based on a coding scheme of activations. The latter is intended to quantify transitions between linear regions of the piece-wise linear model. To this end, all activations are assigned either $0$ or $1$ depending on their ReLU output. Based o...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1802.08760#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/1802.08760#davidstutzSun, 08 Mar 2020 18:34:58 +00001607.08022journals/corr/1607.080222Instance Normalization: The Missing Ingredient for Fast StylizationDavid StutzIn the context of stylization, Ulyanov et al. propose to use instance normalization instead of batch normalization. In detail, instance normalization does not compute the mean and standard deviation used for normalization over the current mini-batch in training. Instead, these statistics are computed per instance individually. This also has the benefit of having the same training and test procedure, meaning that normalization is the same in both cases – in contrast to batch normalization.
Als...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1607.08022#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/1607.08022#davidstutzSun, 08 Mar 2020 18:21:50 +00001803.08494journals/corr/1803.084942Group NormalizationDavid StutzWu and He propose group normalization as alternative to batch normalization. Instead of computing the statistics used for normalization based on the current mini-batch, group normalization computes these statistics per instance but in groups of channels (for convolutional layers). Specifically, given activations $x_i$ with $i = (i_N, i_C, i_H, i_W)$ indexing along batch size, channels, height and width, batch normalization computes
$\mu_i = \frac{1}{|S|}\sum_{k \in S} x_k$ and $\sigma_i = \sqrt...
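The grouping can be sketched in numpy for NCHW activations (the group count and `eps` are illustrative): the statistics $\mu_i$ and $\sigma_i$ are computed per instance over the set $S$ of positions sharing the same group of channels.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization for NCHW activations, following the formula above:
    statistics are computed per instance over each group of channels and all
    spatial positions, with no dependence on the batch."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.default_rng(0).normal(size=(2, 4, 3, 3))
y = group_norm(x, num_groups=2)
```

Because nothing is averaged over the batch axis, the output for one instance is unchanged when the rest of the batch changes.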
http://www.shortscience.org/paper?bibtexKey=journals/corr/1803.08494#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/1803.08494#davidstutzSun, 08 Mar 2020 18:10:53 +0000conf/nips/ZhangS182Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy LabelsDavid StutzZhang and Sabuncu propose a generalized cross entropy loss for robust learning on noisy labels. The approach is based on the work by Ghosh et al. [1] showing that the mean absolute error can be robust to label noise. Specifically, they show that a symmetric loss, under specific assumptions on the label noise, is robust. Here, symmetry corresponds to
$\sum_{j=1}^c \mathcal{L}(f(x), j) = C$ for all $x$ and $f$
where $c$ is the number of classes and $C$ some constant. The cross entropy loss is not...
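A quick numerical check of this condition (my own illustration): for MAE against one-hot targets, the sum over all $c$ labels is always $2(c-1)$ regardless of the prediction, since each label $j$ contributes $(1 - f_j)$ on the target entry and $1 - f_j$ off-target. MAE is therefore symmetric in the above sense.

```python
import numpy as np

def mae_loss(probs, label):
    """MAE between a probability vector and the one-hot target for `label`."""
    return np.abs(probs - np.eye(len(probs))[label]).sum()

def loss_sum_over_labels(loss, probs):
    """Left-hand side of the symmetry condition above."""
    return sum(loss(probs, j) for j in range(len(probs)))

# For any probability vector the MAE sum over labels equals 2(c - 1);
# the same sum for cross entropy depends on the prediction, so it fails.
probs = np.array([0.7, 0.2, 0.1])
```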
http://www.shortscience.org/paper?bibtexKey=conf/nips/ZhangS18#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/nips/ZhangS18#davidstutzSun, 08 Mar 2020 18:03:29 +00001903.06293journals/corr/abs-1903-062932A Research Agenda: Dynamic Models to Defend Against Correlated AttacksDavid StutzGoodfellow motivates the use of dynamical models as a “defense” against adversarial attacks that violate both the identical and independent assumptions in machine learning. Specifically, he argues that machine learning is mostly based on the assumption that the data is sampled identically and independently from a data distribution. Evasion attacks, meaning adversarial examples, mainly violate the assumption that they come from the same distribution. Adversarial examples computed within an $\ep...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1903-06293#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1903-06293#davidstutzSun, 08 Mar 2020 18:00:51 +000010.1109/ijcnn.2019.88522962On Correlation of Features Extracted by Deep Neural NetworksDavid StutzAyinde et al. study the impact of network architecture and weight initialization on learning redundant features. To empirically estimate the number of redundant features, the authors use an agglomerative clustering approach to cluster features based on their cosine similarity. Essentially, given a set of features, these are merged as long as their (average) cosine similarity is within some threshold $\tau$. Then, this number is compared across network architectures. Figure 1, for example, shows ...
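A greedy sketch of this redundancy estimate (a simplification of proper agglomerative clustering; the function name and threshold behavior are mine): merge a feature into an existing cluster when its cosine similarity with the cluster's mean direction exceeds $\tau$, and count the clusters that remain.

```python
import numpy as np

def count_feature_clusters(features, tau=0.9):
    """Greedy cosine-similarity clustering of feature vectors; fewer
    clusters than features indicates redundant (near-collinear) features."""
    centers, sizes = [], []
    for f in features:
        f = f / np.linalg.norm(f)
        for idx, ctr in enumerate(centers):
            if f @ (ctr / np.linalg.norm(ctr)) >= tau:
                centers[idx] = (ctr * sizes[idx] + f) / (sizes[idx] + 1)
                sizes[idx] += 1
                break
        else:
            centers.append(f)
            sizes.append(1)
    return len(centers)

# Two pairs of nearly-collinear features collapse into two clusters.
feats = np.array([[1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.1, 0.95]])
```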
http://www.shortscience.org/paper?bibtexKey=10.1109/ijcnn.2019.8852296#davidstutz
http://www.shortscience.org/paper?bibtexKey=10.1109/ijcnn.2019.8852296#davidstutzSun, 08 Mar 2020 17:55:03 +0000conf/icml/DinhPBB173Sharp Minima Can Generalize For Deep NetsDavid StutzDinh et al. show that it is unclear whether flat minima necessarily generalize better than sharp ones. In particular, they study several notions of flatness, both based on the local curvature and based on the notion of “low change in error”. The authors show that the parameterization of the network has a significant impact on the flatness; this means that functions leading to the same prediction function (i.e., being indistinguishable based on their test performance) might have largely varyi...
http://www.shortscience.org/paper?bibtexKey=conf/icml/DinhPBB17#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/icml/DinhPBB17#davidstutzSun, 08 Mar 2020 17:50:08 +00001901.10513journals/corr/abs-1901-105132Adversarial Examples Are a Natural Consequence of Test Error in NoiseDavid StutzFord et al. show that the existence of adversarial examples can be directly linked to test error on noise and other types of random corruption. Additionally, obtaining models robust against random corruptions is difficult, and even adversarially robust models might not be entirely robust against these corruptions. Furthermore, many “defenses” against adversarial examples show poor performance on random corruption – showing that some defenses do not result in robust models, but make attacking t...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1901-10513#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1901-10513#davidstutzSun, 08 Mar 2020 17:44:47 +00001905.09747journals/corr/abs-1905-097472Adversarially Robust DistillationDavid StutzGoldblum et al. show that distilling robustness is possible; however, it depends on the teacher model and the considered dataset. Specifically, while classical knowledge distillation does not convey robustness against adversarial examples, distillation with a robust teacher model might increase robustness of the student model – even if trained on clean examples only. However, this seems to depend on both the dataset as well as the teacher model, as pointed out in experiments on Cifar100. Unfortun...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1905-09747#davidstutz
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1905-09747#davidstutzSun, 08 Mar 2020 17:42:15 +0000conf/nips/GargSZV182A Spectral View of Adversarially Robust FeaturesDavid StutzGarg et al. propose adversarially robust features based on a graph interpretation of the training data. In this graph, training points are connected based on their distance in input space. Robust features are obtained using the eigenvectors of the Laplacian of the graph. It is theoretically shown that these features are robust, based on some assumptions on the graph. For example, the bound obtained on robustness depends on the gap between second and third eigenvalue.
Also find this summary at [...
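The construction can be sketched in numpy (the connection radius and which eigenvectors to use are illustrative choices; the paper's graph construction and guarantees are more careful):

```python
import numpy as np

def spectral_robust_features(X, radius):
    """Connect training points closer than `radius`, build the unnormalized
    graph Laplacian L = D - A, and return its eigendecomposition; columns of
    the eigenvector matrix (sorted by eigenvalue) serve as feature values."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    A = ((d < radius) & (d > 0)).astype(float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigh(L)

# Two well-separated clusters: the zero eigenvalue has multiplicity two, and
# the gap to the third eigenvalue is what the robustness bound depends on.
X_toy = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
eigvals, eigvecs = spectral_robust_features(X_toy, radius=2.0)
```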
http://www.shortscience.org/paper?bibtexKey=conf/nips/GargSZV18#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/nips/GargSZV18#davidstutzSun, 08 Mar 2020 17:39:15 +0000conf/nips/LittwinW182Regularizing by the Variance of the Activations' Sample-VariancesDavid StutzLittwin and Wolf propose an activation variance regularizer that is shown to have a similar, or even better, effect than batch normalization. The proposed regularizer is based on an analysis of the variance of activation values; the idea is that the measured variance of these variances is low if the activation values come from a distribution with few modes. Thus, the intention of the regularizer is to encourage distributions of activations with only a few modes. This is achieved using the regularizers...
http://www.shortscience.org/paper?bibtexKey=conf/nips/LittwinW18#davidstutz
http://www.shortscience.org/paper?bibtexKey=conf/nips/LittwinW18#davidstutzSun, 08 Mar 2020 17:33:21 +00002002.05616grathwohl2020cutting3Cutting out the Middle-Man: Training and Evaluating Energy-Based Models without SamplingAdamoThe authors introduce a new, sampling-free method for training and evaluating energy-based models (aka EBMs, aka unnormalized density models). There are two broad approaches for training EBMs. Sampling-based approaches like contrastive divergence try to estimate the likelihood with MCMC, but can be biased if the chain is not sufficiently long. The speed of training also greatly depends on the sampling parameters. Other approaches, like score matching, avoid sampling by solving a surrogate objectiv...
http://www.shortscience.org/paper?bibtexKey=grathwohl2020cutting#adamoyoung
http://www.shortscience.org/paper?bibtexKey=grathwohl2020cutting#adamoyoungThu, 05 Mar 2020 15:24:32 +00001702.08591journals/corr/1702.085912The Shattered Gradients Problem: If resnets are the answer, then what is the question?Gavin GrayImagine you make a neural network mapping a scalar to a scalar. After you initialise this network in the traditional way, randomly with some given variance, you could take the gradient of the output with respect to the input for all reasonable values (between about -3 and 3 because networks typically assume standardised inputs). As the value increases, different rectified linear units in the network will randomly switch on, drawing a random walk in the gradients; another name for which is brown ...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1702.08591#gngdb
http://www.shortscience.org/paper?bibtexKey=journals/corr/1702.08591#gngdbWed, 26 Feb 2020 22:21:42 +00001810.00597journals/corr/1810.005972Taming VAEsGavin GrayThe paper provides derivations and intuitions about the learning dynamics for VAEs based on observations about [$\beta$-VAEs][beta]. Using this they derive an alternative way to constrain the training of VAEs that doesn't require typical heuristics, such as warmup or adding noise to the data.
How exactly would this change a typical implementation? Typically, SGD is used to [optimize the ELBO directly](). Using GECO, I keep a moving average of my constraint $C$ (chosen based on what I want the V...
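A minimal sketch of what such a constrained update could look like (the multiplicative rule, `alpha`, and `lr` below are assumptions for illustration, not the paper's exact recipe): track an exponential moving average of the constraint $C$ and move the Lagrange multiplier so the constraint is driven toward zero.

```python
import numpy as np

def geco_step(lmbda, c_ma, c_t, alpha=0.99, lr=1e-2):
    """One GECO-style multiplier update (a sketch, not the paper's recipe):
    average the constraint over time, then adjust the multiplier."""
    c_ma = alpha * c_ma + (1 - alpha) * c_t   # moving average of constraint C
    lmbda = lmbda * np.exp(lr * c_ma)         # grow lambda while C is violated (> 0)
    return lmbda, c_ma

# toy usage: a persistently violated constraint (C > 0) inflates lambda
lmbda, c_ma = 1.0, 0.0
for _ in range(100):
    lmbda, c_ma = geco_step(lmbda, c_ma, c_t=0.5)
```

The moving average is what makes the update robust to the per-batch noise in $C$; the multiplicative step keeps the multiplier positive by construction.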
http://www.shortscience.org/paper?bibtexKey=journals/corr/1810.00597#gngdb
http://www.shortscience.org/paper?bibtexKey=journals/corr/1810.00597#gngdbMon, 24 Feb 2020 21:54:36 +0000conf/iclr/LuoSumo20203SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable ModelsChin-WeiIn this note, I'll implement the [Stochastically Unbiased Marginalization Objective (SUMO)]() to estimate the log-partition function of an energy function.
Estimation of the log-partition function has many important applications in machine learning. Take latent variable models or Bayesian inference, for example. The log-partition function of the posterior distribution $$p(z|x)=\frac{1}{Z}p(x|z)p(z)$$ is the log-marginal likelihood of the data $$\log Z = \log \int p(x|z)p(z)dz = \log p(x)$$.
More generally, l...
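To see why unbiasedness is the hard part, here is a toy sketch (the Gaussian model and all names are my own, not from the note): the naive estimator takes the log of a $k$-sample Monte Carlo average, which Jensen's inequality biases below the true $\log Z$ for any finite $k$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model where log Z is known in closed form:
# p(x|z) = N(x; z, 1), p(z) = N(0, 1), hence the marginal p(x) = N(x; 0, 2).
x = 1.0
true_log_z = -0.25 * x ** 2 - 0.5 * np.log(2 * np.pi * 2.0)

def log_lik(z):
    """log p(x|z) for the toy model."""
    return -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)

def naive_log_z(k):
    """log of a k-sample average over z ~ p(z): biased downward (Jensen)."""
    z = rng.standard_normal(k)
    return np.log(np.mean(np.exp(log_lik(z))))

# average many replications of the k=5 estimator to expose the bias
est = np.mean([naive_log_z(5) for _ in range(2000)])
```

Averaged over replications, `est` sits systematically below `true_log_z`; unbiased estimators like SUMO are designed to remove exactly this gap.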
http://www.shortscience.org/paper?bibtexKey=conf/iclr/LuoSumo2020#cw
http://www.shortscience.org/paper?bibtexKey=conf/iclr/LuoSumo2020#cwMon, 17 Feb 2020 05:27:33 +000010.1109/tvcg.2019.28932472Interaction-based Human Activity ComparisonOleksandr BailoThis paper proposes an approach to measure motion similarity between human-human and human-object interactions. The authors claim that human activities are usually defined by the interaction between individual characters, such as a high-five interaction.
As interaction datasets are not available, the authors provide multiple small-scale interaction datasets:
- 2C = a Character-Character (2C) database using kick-boxing motions
- CRC = Character-Retargeted Character where the size of charac...
http://www.shortscience.org/paper?bibtexKey=10.1109/tvcg.2019.2893247#ukrdailo
http://www.shortscience.org/paper?bibtexKey=10.1109/tvcg.2019.2893247#ukrdailoTue, 04 Feb 2020 08:51:20 +0000Kool2020Estimating2Estimating Gradients for Discrete Random Variables by Sampling without ReplacementGavin GrayIt's a shame that the authors weren't able to continue their series of [great][reinforce] [paper][attention] [titles][beams], although it looks like they thought about calling this paper **"Put Replacement In Your Basement"**. Also, although they don't say it in the title or abstract, this paper introduces an estimator the authors call the **"unordered set estimator"** which, as a name, is not the best. However, this is one of the most exciting estimators for gradients of non-differentiable expe...
http://www.shortscience.org/paper?bibtexKey=Kool2020Estimating#gngdb
http://www.shortscience.org/paper?bibtexKey=Kool2020Estimating#gngdbMon, 03 Feb 2020 14:53:31 +0000sammon1969mapping2A Nonlinear Mapping for Data Structure AnalysisJoseph Paul CohenThis paper presents what is known as `Sammon's mapping`. This method produces points in any $\mathbb{R}^n$ space using only a distance function between points. You can define any distance function $d^*$ that represents relationships between points. This function can even be non-symmetric. The power is that any relationship encoded into a distance function or distance matrix can be visualized.
To map $n$ points from one dimensionality to another, the algorithm starts by generating $n$ random poi...
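The iteration can be sketched as plain gradient descent on Sammon's stress (the original algorithm uses a second-order step; this simplified variant and its hyperparameters are assumptions):

```python
import numpy as np

def sammon(D, n_components=2, iters=2000, lr=0.1, seed=0):
    """Minimal Sammon-mapping sketch: given only an (n x n) symmetric
    distance matrix D, gradient-descend on Sammon's stress
    E = (1/c) * sum_{i<j} (D_ij - d_ij)^2 / D_ij,  c = sum_{i<j} D_ij."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = 0.1 * rng.standard_normal((n, n_components))  # random initial layout
    c = D[np.triu_indices(n, 1)].sum()                # normalising constant
    Dm = D + np.eye(n)                                # dodge divide-by-zero on diagonal
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1) + 1e-12) + np.eye(n)  # low-dim distances
        coef = (d - Dm) / (Dm * d)                    # per-pair gradient coefficient
        Y = Y - lr * (2.0 / c) * (coef[:, :, None] * diff).sum(axis=1)
    return Y
```

The sketch shows the core of the method: only `D` is ever needed, never the original coordinates, so any (even non-metric) relationship encoded as distances can be laid out.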
http://www.shortscience.org/paper?bibtexKey=sammon1969mapping#joecohen
http://www.shortscience.org/paper?bibtexKey=sammon1969mapping#joecohenTue, 21 Jan 2020 05:22:57 +0000conf/nips/ZhangM182Generalizing Tree Probability Estimation via Bayesian NetworksGavin GrayA common problem in phylogenetics is:
1. I have $p(\text{DNA sequences} | \text{tree})$ and $p(\text{tree})$.
2. I've used these to run an MCMC algorithm and generate many (approximate) samples from $p(\text{tree} | \text{DNA sequences})$.
3. I want to evaluate $p(\text{tree} | \text{DNA sequences})$.
The first solution you might think of is to add up how many times you saw each *tree topology* and divide by the total number of MCMC samples; referred to in this paper as *simple sample relative...
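That baseline estimator is a one-liner; a sketch with a hypothetical topology encoding:

```python
from collections import Counter

def simple_relative_frequency(topology_samples):
    """'Simple sample relative frequency': approximate p(tree | DNA sequences)
    by each topology's share of the MCMC samples."""
    counts = Counter(topology_samples)
    n = len(topology_samples)
    return {topology: c / n for topology, c in counts.items()}

# toy usage with topologies encoded as (hypothetical) Newick strings
samples = ["((A,B),C);", "((A,B),C);", "((A,C),B);", "((A,B),C);"]
probs = simple_relative_frequency(samples)
# probs == {'((A,B),C);': 0.75, '((A,C),B);': 0.25}
```

Its weakness, which motivates the paper, is that any topology never seen among the samples gets probability exactly zero.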
http://www.shortscience.org/paper?bibtexKey=conf/nips/ZhangM18#gngdb
http://www.shortscience.org/paper?bibtexKey=conf/nips/ZhangM18#gngdbTue, 14 Jan 2020 16:36:42 +000010.1145/3178876.31861542Latent Relational Metric Learning via Memory-based Attention for Collaborative RankingDarelThis work is a direct improvement of Collaborative Metric Learning. While CML learns user and item embeddings directly, placing them in a metric space and adjusting them with a triplet loss, this paper focuses on the introduction of latent relational vectors.
A relational vector $r$ must describe the relation between user $p$ and item $q$ such that $s(p,q)=\parallel p + r - q \parallel \approx 0$.
Vectors $r$ are introduced as a softmax-weighted linear combination of vectors from La...
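The scoring relation can be sketched as follows (the element-wise-product query and the `keys`/`memory` names are assumptions for illustration, not the paper's exact parameterisation):

```python
import numpy as np

def relation_vector(p, q, keys, memory):
    """Memory-based relational vector (a sketch): attend over M memory
    slots with a user-item query, return the softmax-weighted combination."""
    logits = keys @ (p * q)          # (M,) attention scores from the pair
    w = np.exp(logits - logits.max())
    w /= w.sum()                     # softmax weights, sum to 1
    return w @ memory                # r = sum_m w_m * memory[m]

def score(p, q, keys, memory):
    """s(p, q) = || p + r - q ||, ideally near 0 for a true user-item pair."""
    return np.linalg.norm(p + relation_vector(p, q, keys, memory) - q)
```

If every memory slot happened to equal $q - p$, the attention weights would be irrelevant and the score would collapse to zero, which is exactly the "translation" intuition behind $p + r \approx q$.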
http://www.shortscience.org/paper?bibtexKey=10.1145/3178876.3186154#darel
http://www.shortscience.org/paper?bibtexKey=10.1145/3178876.3186154#darelFri, 10 Jan 2020 14:26:12 +0000conf/asunam/JamshidiRL183Trojan Horses in Amazon's Castle: Understanding the Incentivized Online ReviewsSOJADuring the past few years, sellers have increasingly offered discounted or free products to selected reviewers of e-commerce platforms in exchange for their reviews. Such incentivized (and often very positive) reviews can improve the rating of a product, which in turn sways other users’ opinions about the product.
Here, we examine the problem of detecting and characterizing incentivized reviews in two primary categories of Amazon products. We show that the key features of EIRs and normal revi...
http://www.shortscience.org/paper?bibtexKey=conf/asunam/JamshidiRL18#soja
http://www.shortscience.org/paper?bibtexKey=conf/asunam/JamshidiRL18#sojaThu, 09 Jan 2020 23:47:17 +00001811.11804journals/corr/1811.11804219 dubious ways to compute the marginal likelihood of a phylogenetic tree topologyGavin GrayThis paper compares methods to calculate the marginal likelihood, $p(D | \tau)$, when you have a tree topology $\tau$ and some data $D$ and you need to marginalise over the possible branch lengths $\mathbf{\theta}$ in the process of Bayesian inference. In other words, solving the following integral:
$$
\int_{ [ 0, \infty ]^{2S - 3} } p(D | \mathbf{\theta}, \tau ) p( \mathbf{\theta} | \tau) d \mathbf{\theta}
$$
There are some details about this problem that are common to phylogenetic problems, ...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1811.11804#gngdb
http://www.shortscience.org/paper?bibtexKey=journals/corr/1811.11804#gngdbFri, 27 Dec 2019 16:32:04 +0000conf/www/HsiehYCLBE172Collaborative Metric LearningDarel## Idea
Use implicit feedback and item features to project users and items into the same latent space to use with kNN later. Learned metric encodes user-item, user-user and item-item relationships.
## Loss
Users and items are represented by vectors $u_i \in \mathbb{R}^r$ and $v_j \in \mathbb{R}^r$.
We define the Euclidean distance as $d(i,j)= \parallel u_i-v_j \parallel$
Loss function consists of 3 parts:
$$\mathcal{L}=\mathcal{L}_m + \lambda_f\mathcal{L}_f + \lambda_c\mathcal{L}_c$$
### Weighted ...
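A sketch of the pull-push hinge term $\mathcal{L}_m$ under the usual triplet form with squared distances (the rank weight $w$ is taken as given rather than computed):

```python
import numpy as np

def cml_hinge(u, v_pos, v_neg, margin=0.5, w=1.0):
    """Weighted hinge term of L_m (a sketch): pull the positive item
    within `margin` of the user, push an impostor negative away."""
    d_pos = np.sum((u - v_pos) ** 2)   # squared distance to positive item
    d_neg = np.sum((u - v_neg) ** 2)   # squared distance to sampled negative
    return w * max(margin + d_pos - d_neg, 0.0)

u = np.zeros(3)
far = cml_hinge(u, v_pos=np.zeros(3), v_neg=2 * np.ones(3))          # negative far away: 0.0
tied = cml_hinge(u, v_pos=np.array([1.0, 0, 0]), v_neg=np.array([1.0, 0, 0]))  # tie: 0.5
```

The loss is zero once the negative is more than `margin` further from the user than the positive, so gradients only flow for violating ("impostor") negatives.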
http://www.shortscience.org/paper?bibtexKey=conf/www/HsiehYCLBE17#darel
http://www.shortscience.org/paper?bibtexKey=conf/www/HsiehYCLBE17#darelFri, 27 Dec 2019 15:46:33 +00001911.13299ramanujan2019whats2What's Hidden in a Randomly Weighted Neural Network?devin132The paper: "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask" by Zhou et al., 2019 found that by just learning binary masks one can find random subnetworks that do much better than chance on a task. This new paper builds on this method by proposing a stronger algorithm than Zhou et al.'s for finding these high-performing subnetworks.
The intuition follows: "If a neural network with random weights (center) is sufficiently overparameterized, it will contain a subnetwork (right) that pe...
http://www.shortscience.org/paper?bibtexKey=ramanujan2019whats#devin132
http://www.shortscience.org/paper?bibtexKey=ramanujan2019whats#devin132Wed, 25 Dec 2019 16:45:12 +00001805.06370journals/corr/1805.063704Progress & Compress: A scalable framework for continual learningdevin132Proposes a two-stage approach for continual learning: an active learning phase and a consolidation phase. The active learning stage optimizes for a specific task that is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phase uses a separate network from the knowledge base, but it is not always trained from scratch - the authors suggest a heuristic based on task similarity. Improves EWC by deriving a new online method so ...
http://www.shortscience.org/paper?bibtexKey=journals/corr/1805.06370#devin132
http://www.shortscience.org/paper?bibtexKey=journals/corr/1805.06370#devin132Wed, 25 Dec 2019 16:10:54 +0000conf/recsys/XinMPLA172Folding: Why Good Models Sometimes Make Spurious RecommendationsDarelOne bad item can reduce the perceived quality of a recommendation list. Sometimes this is particularly undesirable, such as recommending horror movies to children. The authors argue that this happens when missing-not-at-random data is handled improperly and separate groups of users and items overlap during dimensionality reduction and the computation of embeddings. Folding is a metric that measures the severity of the described effect in a recommendation model.
To calculate folding we must intr...
http://www.shortscience.org/paper?bibtexKey=conf/recsys/XinMPLA17#darel
http://www.shortscience.org/paper?bibtexKey=conf/recsys/XinMPLA17#darelTue, 24 Dec 2019 22:13:20 +0000conf/um/FrumermanSSS193Are All Rejected Recommendations Equally Bad?: Towards Analysing Rejected RecommendationsDarel## Idea
When we recommend items to users, some of them are not chosen by the user. These rejected recommendations are usually treated as hard mistakes.
The authors argue that these rejected recommendations may still influence the user's choice even though they were not picked. For example, a user didn't click on "Die Hard" but watched another Bruce Willis movie instead. That recommendation seems not so bad after all, and maybe we should not penalize it as hard as we usually do.
The ultimate goal is to invent a me...
http://www.shortscience.org/paper?bibtexKey=conf/um/FrumermanSSS19#darel
http://www.shortscience.org/paper?bibtexKey=conf/um/FrumermanSSS19#darelFri, 20 Dec 2019 15:47:44 +00001906.05243journals/corr/abs-1906-052433When to use parametric models in reinforcement learning?CodyWildThis paper is a bit provocative (especially in the light of the recent DeepMind MuZero paper), and poses some interesting questions about the value of model-based planning. I'm not sure I agree with the overall argument it's making, but I think the experience of reading it made me hone my intuitions around why and when model-based planning should be useful.
The overall argument of the paper is: rather than learning a dynamics model of the environment and then using that model to plan and learn...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-05243#decodyng
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-05243#decodyngFri, 29 Nov 2019 17:48:19 +00001905.12506journals/corr/abs-1905-125065Are Disentangled Representations Helpful for Abstract Visual Reasoning?CodyWildArguably, the central achievement of the deep learning era is multi-layer neural networks' ability to learn useful intermediate feature representations using a supervised learning signal. In a supervised task, it's easy to define what makes a feature representation useful: how easy it is for a subsequent layer to use in making the final class prediction. When we want to learn features in an unsupervised way, things get a bit trickier. There's the obvious problem of what kinds of problem st...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1905-12506#decodyng
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1905-12506#decodyngFri, 29 Nov 2019 07:38:52 +00001906.02768journals/corr/abs-1906-027682Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLPCodyWildSummary: An odd thing about machine learning these days is how far you can get in a line of research while only ever testing your method on image classification and image datasets in general. This leads one occasionally to wonder whether a given phenomenon or advance is a discovery of the field generally, or whether it's just a fact about the informatics and learning dynamics inherent in image data.
This paper, part of a set of recent papers released by Facebook centering around the Lottery Ti...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02768#decodyng
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02768#decodyngThu, 28 Nov 2019 18:44:16 +00001906.02425journals/corr/abs-1906-024252Uncertainty-guided Continual Learning with Bayesian Neural NetworksMassimo Caccia## Introduction
Bayesian Neural Networks (BNNs): an intrinsic importance model based on weight uncertainty; variational inference can approximate the posterior distribution using Monte Carlo sampling for gradient estimation; BNNs act like an ensemble method in that they reduce the prediction variance, but use only 2x the number of parameters.
The idea is to use the BNN's uncertainty to guide gradient descent so that it does not update the important weights when learning new tasks.
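That guidance might be sketched as a per-weight learning-rate scaling (the exact scaling rule below is an assumption for illustration, not necessarily the paper's):

```python
import numpy as np

def uncertainty_guided_step(mu, sigma, grad_mu, base_lr=0.1):
    """Sketch: scale each weight's step by its posterior std-dev, so
    low-uncertainty (important) weights barely move on a new task."""
    lr = base_lr * sigma / sigma.max()    # normalised per-weight learning rates
    return mu - lr * grad_mu

mu = np.zeros(2)
sigma = np.array([0.01, 1.0])             # weight 0 is "important" (low uncertainty)
new_mu = uncertainty_guided_step(mu, sigma, grad_mu=np.ones(2))
# new_mu ≈ [-0.001, -0.1]: the certain weight moved 100x less
```

The same gradients thus barely disturb weights the posterior is confident about, which is the continual-learning point.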
## Bayes by Backprop (BBB):
Where $q...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02425#mcaccia
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02425#mcacciaWed, 27 Nov 2019 23:18:04 +00001906.02773journals/corr/abs-1906-027732One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizersCodyWildIn my view, the Lottery Ticket Hypothesis is one of the weirder and more mysterious phenomena of the last few years of Machine Learning. We've known for a while that we can take trained networks and prune them down to a small fraction of their weights (keeping those weights with the highest magnitudes) and maintain test performance using only those learned weights. That seemed somewhat surprising, in that there were a lot of weights that weren't actually necessary for encoding the learned function...
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02773#decodyng
http://www.shortscience.org/paper?bibtexKey=journals/corr/abs-1906-02773#decodyngWed, 27 Nov 2019 01:41:31 +0000