Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1521 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Tumor Phylogeny Topology Inference via Deep Learning

Erfan Sadeqi Azer and Mohammad Haghir Ebrahimabadi and Salem Malikić and Roni Khardon and S. Cenk Sahinalp

bioRxiv: The preprint server for biology - 0 via Local CrossRef

Keywords:

Erfan Sadeqi Azer and Mohammad Haghir Ebrahimabadi and Salem Malikić and Roni Khardon and S. Cenk Sahinalp

bioRxiv: The preprint server for biology - 0 via Local CrossRef

Keywords:

[link]
A very simple (but impractical) discrete model of subclonal evolution would include the following events: * Division of a cell to create two cells: * **Mutation** at a location in the genome of the new cells * Cell death at a new timestep * Cell survival at a new timestep Because measurements of mutations are usually taken at one time point, this is taken to be at the end of a time series of these events, where a tiny of subset of cells are observed and a **genotype matrix** $A$ is produced, in which mutations and cells are arbitrarily indexed such that $A_{i,j} = 1$ if mutation $j$ exists in cell $i$. What this matrix allows us to see is the proportion of cells which *both have mutation $j$*. Unfortunately, I don't get to observe $A$, in practice $A$ has been corrupted by IID binary noise to produce $A'$. This paper focuses on difference inference problems given $A'$, including *inferring $A$*, which is referred to as **`noise_elimination`**. The other problems involve inferring only properties of the matrix $A$, which are referred to as: * **`noise_inference`**: predict whether matrix $A$ would satisfy the *three gametes rule*, which asks if a given genotype matrix *does not describe a branching phylogeny* because a cell has inherited mutations from two different cells (which is usually assumed to be impossible under the infinite sites assumption). This can be computed exactly from $A$. * **Branching Inference**: it's possible that all mutations are inherited between the cells observed; in which case there are *no branching events*. The paper states that this can be computed by searching over permutations of the rows and columns of $A$. The problem is to predict from $A'$ if this is the case. In both problems inferring properties of $A$, the authors use fully connected networks with two hidden layers on simulated datasets of matrices. For **`noise_elimination`**, computing $A$ given $A'$, the authors use a network developed for neural machine translation called a [pointer network][pointer]. They also find it necessary to map $A'$ to a matrix $A''$, turning every element in $A'$ to a fixed length row containing the location, mutation status and false positive/false negative rate. Unfortunately, reported results on real datasets are reported only for branching inference and are limited by the restriction on input dimension. The inferred branching probability reportedly matches that reported in the literature. [pointer]: https://arxiv.org/abs/1409.0473 |

Data-Efficient Hierarchical Reinforcement Learning

Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2020/10/28 (just now)

**Abstract:** Hierarchical reinforcement learning (HRL) is a promising approach to extend
traditional reinforcement learning (RL) methods to solve more complex tasks.
Yet, the majority of current HRL methods require careful task-specific design
and on-policy training, making them difficult to apply in real-world scenarios.
In this paper, we study how we can develop HRL algorithms that are general, in
that they do not make onerous additional assumptions beyond standard RL
algorithms, and efficient, in the sense that they can be used with modest
numbers of interaction samples, making them suitable for real-world problems
such as robotic control. For generality, we develop a scheme where lower-level
controllers are supervised with goals that are learned and proposed
automatically by the higher-level controllers. To address efficiency, we
propose to use off-policy experience for both higher and lower-level training.
This poses a considerable challenge, since changes to the lower-level behaviors
change the action space for the higher-level policy, and we introduce an
off-policy correction to remedy this challenge. This allows us to take
advantage of recent advances in off-policy model-free RL to learn both higher-
and lower-level policies using substantially fewer environment interactions
than on-policy algorithms. We term the resulting HRL agent HIRO and find that
it is generally applicable and highly sample-efficient. Our experiments show
that HIRO can be used to learn highly complex behaviors for simulated robots,
such as pushing objects and utilizing them to reach target locations, learning
from only a few million samples, equivalent to a few days of real-time
interaction. In comparisons with a number of prior HRL methods, we find that
our approach substantially outperforms previous state-of-the-art techniques.
more
less

Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
# Keypoints - Proposes the HIerarchical Reinforcement learning with Off-policy correction (**HIRO**) algorithm. - Does not require careful task-specific design. - Generic goal representation to make it broadly applicable, without any manual design of goal spaces, primitives, or controllable dimensions. - Use of off-policy experience using a novel off-policy correction. - A two-level hierarchy architecture - A higher-level controller outputs a goal for the lower-level controller every **c** time steps and collects the rewards given by the environment, being the goal the desired change in state space - The lower level controller has the goal given added to its input and acts directly in the environment, the reward received is parametrized from the current state and the goal. # Background This paper adopts a standard continuous control reinforcement learning setting, in which an agent acts on an environment that yields a next state and a reward from unknown functions. This paper utilizes the TD3 learning algorithm. ## General and Efficient Hierarchical Reinforcement Learning https://i.imgur.com/zAHoWWO.png ## Hierarchy of Two Policies The higher-level policy $\mu^{hi}$ outputs a goal $g_t$, which correspond directly to desired relative changes in state that the lower-level policy $\mu^{lo}$ attempts to reach. $\mu^{hi}$ operates at a time abstraction, updating the goal $g_t$ and collecting the environment rewards $R_t$ every $c$ environment steps, the higher-level transition $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ is stored for off-policy training. The lower-level policy $\mu^{lo}$ outputs an action to be applied directly to the environment, having as input the current environment observations $s_t$ and the goal $g_t$. The goal $g_t$ is given by $\mu^{hi}$ every $c$ environment time steps, for the steps in between, the goal $g_t$ used by $\mu^{lo}$ is given by the transition function $g_t=h(s_{t−1},g_{t−1},s_t)$, the lower-level controller reward is provided by the parametrized reward function $r_t=r(s_t,g_t,a_t,s_{t+1})$. The lower-level transition $(s_t,g_t,a_t,r_t,s_{t+1}, g_{t+1})$ is stored for off-policy training. ## Parameterized Rewards The goal $g_t$ indicates a desired relative changes in state observations, the lower-level agent task is to take actions from state $s_t$ that yield it an observation $s_{t+c}$ that is close to $s_t+g_t$. To maintain the same absolute position of the goal regardless of state change, the goal transition model, used between $\mu^{hi}$ updates every $c$ steps, is defined as: $h(s_t,g_t,s_{t+1}) =s_t+g_t−s_{t+1}$ And the reward given to the lower-level controller is defined as to reinforce reaching a state closer to the goal $g_t$, this paper parametrizes it by the function: $r(s_t,g_t,a_t,s_{t+1}) =−||s_t+g_t−s_{t+1}||_2$. ## Off-Policy Corrections for Higher-Level Training The higher-level transitions stored $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ have to be converted to state-action-reward transitions $(s_t,g_t,∑R_{t:t+c−1},s_{t+c})$ as they can be used in standard off-policy RL algorithms, however, since the lower-level controller is evolving, these past transitions do not accurately represent the actions tha would be taken by the current lower-level policy and must be corrected. This paper correction technique used is to change the goal $g_t$ of past transitions using an out of date lower-level controller to a relabeled goal $g ̃_t$ which is likely to induce the same lower-level behavior with the updated $\mu^{lo}$. In other words, we want to find a goal $g ̃_t$ which maximizes the probability $μ_{lo}(a_{t:t+c−1}|s_{t:t+c−1},g ̃_{t:t+c−1})$, in which the $\mu^{lo}$ is the current policy and the actions $a_{t:t+c−1}$ and states $s_{t:t+c−1}$ are from the stored high level transition. To approximately maximize this quantity in practice, the authors calculated the probability for 10 candidates $g ̃_t$, eight candidate goals sampled randomly from a Gaussian centered at $s_{t+c}−s_t$, the original goal $g_t$ and a goal corresponding to the difference $s_{t+c}−s_t$. # Experiments https://i.imgur.com/iko9nCd.png https://i.imgur.com/kGx8fZv.png The authors compared the $HIRO$ method to prior method in 4 different environments: - Ant Gather; - Ant Maze; - Ant Push; - Ant Fall. They also performed an ablative analysis with the following variants: - With lower-level re-labeling; - With pre-training; - No off-policy correction; - No HRL. # Closing Points - The method proposed is interesting in the hierarchical reinforcement learning setting for not needing a specific design, the generic goal representation enables applicability without the need of designing a goal space manually; - The off-policy correction method enables this algorithm to be sample efficient; - The hierarchical structure with intermediate goals on state-space enables to better visualize the agent goals; - The paper Appendix elaborates on possible alternative off-policy corrections. |

Bayesian Skip-Autoencoders for Unsupervised Hyperintense Anomaly Detection in High Resolution Brain Mri

Christoph Baur and Benedikt Wiestler and Shadi Albarqouni and Nassir Navab

2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI) - 2020 via Local CrossRef

Keywords:

Christoph Baur and Benedikt Wiestler and Shadi Albarqouni and Nassir Navab

2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI) - 2020 via Local CrossRef

Keywords:

[link]
The reconstruction of high-fidelity resolution brain MR images is especially challenging because of the highly complex brain structure. Most promising approaches for this task are autoencoders and generative models such as Variational Autoencoders (VAE) or Generative Adversarial Networks (GAN). In Unsupervised Anomaly Detection (UAD), these architectures are only trained with images of healthy brain anatomy and not with images containing anomalies such as lesions. Therefore, processing an anomalous input image $\pmb{x}$ with an architecture trained to reconstruct healthy MR images should produce a high reconstruction error $r$ indicating that the input MR image is likely to exhibit anomalies. So far developed models for this task are either limited to low resolution reconstructions so that quite a lot of information is lost or to only small image regions. Baur et al. hypothesize that these restrictions are caused by the low dimensionality of the latent space of the models even though a high capacity would be necessary to reconstruct highly complex brain MR images. Thus, Baur et al. introduce skip connections in the autoencoder to enable detailed high resolution reconstructions due to the enhanced gradient flow. In addition, the application of dropout on the skip connections shall prevent the model from learning the identity of the input image since only the corresponding healthy anatomy of the possibly anomalous input image shall be reconstructed. Sampling the networks parameter $\pmb{\theta}$ from a Bernoulli distribution $\pmb{\theta} \sim \mathcal{B}(p)$ with $p$ being the dropout rate also turns the Skip-Autoencoder into a Bayesian neural network by using a Monte Carlo dropout at test time. The proposed Skip Autoencoder and the Bayesian Skip-Autoencoder are tested on Mutiple Sclerosis and Glioblastoma datasets and evaluated using the Precision-Recall-Curve (PRC), the AUPRC, and the Dice Score. An investigation of the Bernoulli dropout yields that no input identity is learnt independent on the dropout. A random weight initialization seems to be sufficient. Considering the skip-connections, the best performance is achieved by a single skip-connection close to the bottleneck layer. Applying more skip connections improves the performance on the Glioblastoma test set but reduces the performance on the Multiple Sclerosis test set. In general, the non-Bayesian Skip-Autoencoder exceeds the performance of the Bayesian Skip-Autoencoder because in the calculation of the reconstruction residual $r = \rvert \pmb{x} - \pmb{\hat{x}} \rvert$ for the Bayesian Skip-Autoencoder the predictive mean of $n$ Monte Carlo samples is applied as reconstructed image $\pmb{\hat{x}}$ which causes a blurriness in the reconstructions $\pmb{\hat{x}}$. However, the Bayesian Skip-Autoencoder enables to quantify the epistemic uncertainty because the pixels of hyperintense anomalies have a low variance compared to normal tissue. |

Recurrent World Models Facilitate Policy Evolution

David Ha and Jürgen Schmidhuber

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2018/09/04 (2 years ago)

**Abstract:** A generative recurrent neural network is quickly trained in an unsupervised
manner to model popular reinforcement learning environments through compressed
spatio-temporal representations. The world model's extracted features are fed
into compact and simple policies trained by evolution, achieving state of the
art results in various environments. We also train our agent entirely inside of
an environment generated by its own internal world model, and transfer this
policy back into the actual environment. Interactive version of paper at
https://worldmodels.github.io
more
less

David Ha and Jürgen Schmidhuber

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
## General Framework The take-home message is that the challenge of Reinforcement Learning for environments with high-dimensional and partial observations is learning a good representation of the environment. This means learning a sensory features extractor V to deal with the highly dimensional observation (pixels for example). But also learning a temporal representation M of the environment dynamics to deal with the partial observability. If provided with such representations, learning a controller so as to maximize a reward is really easy (single linear layer evolved with CMA-ES). Authors call these representations a *World model* since they can use the learned environment's dynamics to simulate roll-outs. They show that policies trained inside the world model transfer well back to the real environment provided that measures are taken to prevent the policy from exploiting the world model's inaccuracies. ## Method **Learning the World Model** ![](https://i.imgur.com/tgV17k4.png =600x) In this work they propose to learn these representations off-line in an unsupervized manner in order to be more efficient. They use a VAE for V that they train exclusively with the reconstruction loss, that way the learned representations are independent of the reward and can be used alongside any reward. They then train M as Mixture-Density-Network-RNN to predict the next sensory features (as extracted by the VAE) --and possibly the done condition and the reward-- and thus learn the dynamics of the environment in the VAE's latent space (which is likely simpler there than in the pixel space). Note that the VAE's latent space is a single Gaussian (adding stochasticity makes it more robust to the "next state" outputs of M), whereas M outputs next states in a mixture of Gaussians. Indeed, an image is likely to have one visual encoding, yet it can have multiple and different future scenarii which are captured by the multimodal output of M. **Training the policy** ![](https://i.imgur.com/H5vpb2H.png) * In the real env: The agent is provided with the visual features and M's hidden state (temporal features). * In the world model: To avoid that the agent exploits this imperfect simulator they increase its dynamics' stochasticity by playing with $\tau$ the sampling temperature of $z_{t+1}$ in M. ## Limitations If exploration is important in the environment the initial random policy might fail to collect data in all the relevant part of the environment and an iterative version of Algorithm 1 might be required (see https://worldmodels.github.io/ for a discussion on the different iterative methods) for the data collection. By training V independently of M it might fail to encode all the information relevant to the task. Another option would be to train V and M concurrently so that the reward and $z_{t+1}$'s prediction loss (or next state reconstruction loss) of M flows through V (that would also be trained with its own reconstruction loss). The trade-off is that now V is tuned to a particular reward and cannot be reused. The authors argue that since $h_t$ is such that it can predict $z_{t+1}$, it contains enough insight about the future for the agent not needing to *plan ahead* and just doing reflexive actions based on $h_t$. This is interesting but the considered tasks (driving, dodging fireball) are still very reflexive and do not require much planning. ## Results When trained on the true env, a simple controller with the V and M representations achieve SOTA on car-racing. V + M is better than V alone. When trained inside the world model, its dynamics' stochasticity must be tuned in order for the policy to transfer well and perform well on the real env: too little stochasticity and the agent overfits to the world model flaws and does not transfer to the real env, too much and the agent becomes risk-averse and robust but suboptimal. ![](https://i.imgur.com/e8ETSjQ.png) ## Additional ressources Thorough interactive blog post with additional experiments and discussions: https://worldmodels.github.io/ |

Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations

Daniel S. Brown and Wonjoon Goo and Scott Niekum

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2019/07/09 (1 year ago)

**Abstract:** The performance of imitation learning is typically upper-bounded by the
performance of the demonstrator. While recent empirical results demonstrate
that ranked demonstrations allow for better-than-demonstrator performance,
preferences over demonstrations may be difficult to obtain, and little is known
theoretically about when such methods can be expected to successfully
extrapolate beyond the performance of the demonstrator. To address these
issues, we first contribute a sufficient condition for better-than-demonstrator
imitation learning and provide theoretical results showing why preferences over
demonstrations can better reduce reward function ambiguity when performing
inverse reinforcement learning. Building on this theory, we introduce
Disturbance-based Reward Extrapolation (D-REX), a ranking-based imitation
learning method that injects noise into a policy learned through behavioral
cloning to automatically generate ranked demonstrations. These ranked
demonstrations are used to efficiently learn a reward function that can then be
optimized using reinforcement learning. We empirically validate our approach on
simulated robot and Atari imitation learning benchmarks and show that D-REX
outperforms standard imitation learning approaches and can significantly
surpass the performance of the demonstrator. D-REX is the first imitation
learning approach to achieve significant extrapolation beyond the
demonstrator's performance without additional side-information or supervision,
such as rewards or human preferences. By generating rankings automatically, we
show that preference-based inverse reinforcement learning can be applied in
traditional imitation learning settings where only unlabeled demonstrations are
available.
more
less

Daniel S. Brown and Wonjoon Goo and Scott Niekum

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
## General Framework Extends T-REX (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1904.06387&a=muntermulehitch)) so that preferences (rankings) over demonstrations are generated automatically (back to the common IL/IRL setting where we only have access to a set of unlabeled demonstrations). Also derives some theoretical requirements and guarantees for better-than-demonstrator performance. ## Motivations * Preferences over demonstrations may be difficult to obtain in practice. * There is no theoretical understanding of the requirements that lead to outperforming demonstrator. ## Contributions * Theoretical results (with linear reward function) on when better-than-demonstrator performance is possible: 1- the demonstrator must be suboptimal (room for improvement, obviously), 2- the learned reward must be close enough to the reward that the demonstrator is suboptimally optimizing for (be able to accurately capture the intent of the demonstrator), 3- the learned policy (optimal wrt the learned reward) must be close enough to the optimal policy (wrt to the ground truth reward). Obviously if we have 2- and a good enough RL algorithm we should have 3-, so it might be interesting to see if one can derive a requirement from only 1- and 2- (and possibly a good enough RL algo). * Theoretical results (with linear reward function) showing that pairwise preferences over demonstrations reduce the error and ambiguity of the reward learning. They show that without rankings two policies might have equal performance under a learned reward (that makes expert's demonstrations optimal) but very different performance under the true reward (that makes the expert optimal everywhere). Indeed, the expert's demonstration may reveal very little information about the reward of (suboptimal or not) unseen regions which may hurt very much the generalizations (even with RL as it would try to generalize to new states under a totally wrong reward). They also show that pairwise preferences over trajectories effectively give half-space constraints on the feasible reward function domain and thus may decrease exponentially the reward function ambiguity. * Propose a practical way to generate as many ranked demos as desired. ## Additional Assumption Very mild, assumes that a Behavioral Cloning (BC) policy trained on the provided demonstrations is better than a uniform random policy. ## Disturbance-based Reward Extrapolation (D-REX) ![](https://i.imgur.com/9g6tOrF.png) ![](https://i.imgur.com/zSRlDcr.png) They also show that the more noise added to the BC policy the lower the performance of the generated trajs. ## Results Pretty much like T-REX. |

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Daniel S. Brown and Wonjoon Goo and Prabhat Nagarajan and Scott Niekum

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2019/04/12 (1 year ago)

**Abstract:** A critical flaw of existing inverse reinforcement learning (IRL) methods is
their inability to significantly outperform the demonstrator. This is because
IRL typically seeks a reward function that makes the demonstrator appear
near-optimal, rather than inferring the underlying intentions of the
demonstrator that may have been poorly executed in practice. In this paper, we
introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked
Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately)
ranked demonstrations in order to infer high-quality reward functions from a
set of potentially poor demonstrations. When combined with deep reinforcement
learning, T-REX outperforms state-of-the-art imitation learning and IRL methods
on multiple Atari and MuJoCo benchmark tasks and achieves performance that is
often more than twice the performance of the best demonstration. We also
demonstrate that T-REX is robust to ranking noise and can accurately
extrapolate intention by simply watching a learner noisily improve at a task
over time.
more
less

Daniel S. Brown and Wonjoon Goo and Prabhat Nagarajan and Scott Niekum

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
## General Framework Only access to a finite set of **ranked demonstrations**. The demonstrations only contains **observations** and **do not need to be optimal** but must be (approximately) ranked from worst to best. The **reward learning part is off-line** but not the policy learning part (requires interactions with the environment). In a nutshell: learns a reward models that looks at observations. The reward model is trained to predict if a demonstration's ranking is greater than another one's. Then, once the reward model is learned, one simply uses RL to learn a policy. This latter outperform the demonstrations' performance. ## Motivations Current IRL methods cannot outperform the demonstrations because they seek a reward function that makes the demonstrator optimal and thus do not infer the underlying intentions of the demonstrator that may have been poorly executed. In practice, high quality demonstrations may be difficult to provide and it is often easier to provide demonstrations with a ranking of their relative performance (desirableness). ## Trajectory-ranked Reward EXtrapolation (T-REX) ![](https://i.imgur.com/cuL8ZFJ.png =400x) Uses ranked demonstrations to extrapolate a user's underlying intent beyond the best demonstrations by learning a reward that assigns greater return to higher-ranked trajectories. While standard IRL seeks a reward that **justifies** the demonstrations, T-REX tries learns a reward that **explains** the ranking over demonstrations. ![](https://i.imgur.com/4IQ13TC.png =500x) Having rankings over demonstrations may remove the reward ambiguity problem (always 0 reward cannot explain the ranking) as well as provide some "data-augmentation" since from a few ranked demonstrations you can define many pair-wise comparisons. Additionally, suboptimal demonstrations may provide more diverse data by exploring a larger area of the state space (but may miss the parts relevant to solving the task...) ## Tricks and Tips Authors used ground truth reward to rank trajectories, but they also show that approximate ranking does not hurt the performance much. To avoid overfitting they used an ensemble of 5 Neural Networks to predict the reward. For episodic tasks, they compare subtrajectories that correspond to similar timestep (better trajectory is a bit later in the episode than the one it is compared against so that reward increases as the episode progresses). At RL training time, the learned reward goes through a sigmoid to avoid large changes in the reward scale across time-steps. ## Results ![](https://i.imgur.com/7ysYZKd.png) ![](https://i.imgur.com/CHO9aVT.png) ![](https://i.imgur.com/OzVD9sf.png =600x) Results are quite positive and performance can be good even when the learned reward is not really correlated with the ground truth (cf. HalfCheetah). They also show that T-REX is robust to different ranking-noises: random-swapping of pair-wise ranking, ranking by humans that only have access to a description of the task and not the ground truth reward. **They also automatically rank the demonstrations using the number of learning steps of a learning expert: therefore T-REX could be used as an intrinsic reward alongside the ground-truth to accelerate training.** ![](https://i.imgur.com/IfOeLY6.png =500x) **Limitations** They do not show than T-REX can match an optimal expert, maybe ranking demonstrations hurt when all the demos are close to optimality? |

Planning for Autonomous Cars that Leverage Effects on Human Actions

Dorsa Sadigh and Shankar Sastry and Sanjit A. Seshia and Anca D. Dragan

Robotics: Science and Systems XII - 0 via Local CrossRef

Keywords:

Dorsa Sadigh and Shankar Sastry and Sanjit A. Seshia and Anca D. Dragan

Robotics: Science and Systems XII - 0 via Local CrossRef

Keywords:

[link]
## General Framework *wording: car = the autonomous car, driver = the other car it is interacting with* Builds a model of an **autonomous car's influence over the behavior of an interacting driver** (human or simulated) that the autonomous car can leverage to plan more efficiently. The driver is modeled by the policy that maximizes his defined objective. In brief, a **linear reward function is learned off-line with IRL on human demonstrations** and the modeled policy takes the actions that maximize the driver's return under this reward function (computed with **Model Predictive Control (MPC)**). They show that a car planning while using this driver learns ways to **influence the driver either towards specified behaviors** or in ways that **achieve higher payoff**. These results also hold when interacting with human **drivers which are loosely approximated by this driver model**. I believe the **key to this success lies in a learned reward that generalizes well** (linear function with few, clever hand-designed features) to new states as well as the **use of MPC**: by **focusing on modeling goals it captures a reasonable driver's policy** (since the learned reward is accurate) and does not have to deal with concurrent learning dynamics. On the other hand, IL would try to match behavior instead of goals and might not generalize as well to new (unseen) situations. Additionally, MPC enables to coordinate the car-driver interactions over **extended time horizons** (through planning). **This shows that leveraging an influence model is promising for communication emergence.** *Parallel with: Promoting influence like Jaques et al. is more effective than hoping agents figure it out by themselves like MADDPG* ## Motivations Previous methods use simplistic influence models ("will keep constant velocity"), yet the car behavior influences the driver whether the car is aware of it or not. This simplistic model only leads to "defensive behaviors" where the car avoids disturbing the driver and therefore does not interact with it and yields suboptimal strategies. Additionally, a simplistic interaction model seems to lead to exclusively to functional actions instead of communicative ones. ## Assumptions * **Sequential decision making**: the car acts first which forces a driver response which makes the world transition to a new state (could be alleviated by considering influence over time steps: car's action a time t influences driver's action at time t+1) * Approximates the **human as an optimal planner** (but with limited recursive induction) without uncertainty (deterministic model) etc. * The car uses **1st-order recursive induction** ("my action influences the driver but the driver believes that its action won't influence me (that my action is constant no matter what it does)"): *I see this as assuming "defensive" driver and "aggressive" car.* * **Fully observable** environment (not sure how difficult it would be to wave this) * Model Predictive Control (MPC) with L-BFGS requires access to the **differentiable transition function**. ## Method Model the interaction as a dynamical system where the car's action influences both the next state and the driver's action. Uses Theano to backprop through the transition function, and implicit function theorem to derive a symbolic expression of the gradient. ## Results * Car leverages driver's model to **influence** the simulated drivers (car has **perfect model**) in **specified ways** (modifying the car's reward function). * Car leverages driver's model to **influence** the simulated drivers (car has **perfect model**) in order to be **more efficient wrt its own objective**. * Car leverages driver's model to **influence** the human drivers (car has **imperfect model**) in **specified ways**. *colors: autonomous car is yellow, driver's car is red* ![](https://i.imgur.com/RK1Gx2P.png) ![](https://i.imgur.com/F77hSfp.png) ![](https://i.imgur.com/3elOG9O.png) ![](https://i.imgur.com/yeSsjiP.png) |

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Stephane Ross and J. Andrew Bagnell

arXiv e-Print archive - 2014 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2014/06/23 (6 years ago)

**Abstract:** Recent work has demonstrated that problems-- particularly imitation learning
and structured prediction-- where a learner's predictions influence the
input-distribution it is tested on can be naturally addressed by an interactive
approach and analyzed using no-regret online learning. These approaches to
imitation learning, however, neither require nor benefit from information about
the cost of actions. We extend existing results in two directions: first, we
develop an interactive imitation learning approach that leverages cost
information; second, we extend the technique to address reinforcement learning.
The results provide theoretical support to the commonly observed successes of
online approximate policy iteration. Our approach suggests a broad new family
of algorithms and provides a unifying view of existing techniques for imitation
and reinforcement learning.
more
less

Stephane Ross and J. Andrew Bagnell

arXiv e-Print archive - 2014 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
## General Framework Really **similar to DAgger** (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1011.0686&a=muntermulehitch)) but considers **cost-sensitive classification** ("some mistakes are worst than others": you should be more careful in imitating that particular action of the expert if failing in doing so incurs a large cost-to-go). By doing so they improve from DAgger's bound of $\epsilon_{class}uT$ where $u$ is the difference in cost-to-go (between the expert and one error followed by expert policy) to $\epsilon_{class}T$ where $\epsilon_{class}$ is the error due to the lack of expressiveness of the policy class. In brief, by accounting for the effect of a mistake on the cost-to-go they remove the cost-to-go contribution to the bound (difference in the performance of the learned policy vs. expert policy) and thus have a tighter bound. In the paper they use the word "regret" for two distinct concepts which is confusing to me: one for the no-regret online learning meta-approach to IL (similar to DAgger) and another one because Aggrevate aims at minimizing the cost-to-go difference with the expert (cost-to-go difference: the sum of the cost I endured because I did not behave like the expert once = *regret*) compared to DAgger that aims at minimizing the error rate wrt. the expert. Additionally, the paper extends the view of Imitation learning as an online learning procedure to Reinforcement Learning. ## Assumptions **Interactive**: you can re-query the expert and thus reach $\epsilon T$ bounds instead of $\epsilon T^2$ like with non-interactive methods (Behavioral Cloning) due to compounding error. Additionally, one also needs a **reward/cost** that **cannot** be defined relative to the expert (no 0-1 loss wrt expert for ex.) since the cost-to-go is computed under the expert policy (would always yield 0 cost). ## Other methods **SEARN**: does also reason about **cost-to-go but under the current policy** instead of the expert's (even if you can use the expert's in practice and thus becomes really similar to Aggrevate). SEARN uses **stochastic policies** and can be seen as an Aggrevate variant where stochastic mixing is used to solve the online learning problem instead of **Regularized-Follow-The-Leader (R-FTL)**. ## Aggrevate - IL with cost-to-go ![](https://i.imgur.com/I1otJwV.png) Pretty much like DAgger but one has to use a no-regret online learning algo to do **cost-sensitive** instead of regular classification. In the paper, they use the R-FTL algorithm and train the policy on all previous iterations. Indeed, using R-FTL with strongly convex loss (like the squared error) with stable batch leaner (like stochastic gradient descent) ensures the no-regret property. In practice (to deal with infinite policy classes and knowing the cost of only a few actions per state) they reduce cost-sensitive classification to an **argmax regression problem** where they train a model to match the cost given state-action (and time if we want nonstationary policies) using the collected datapoints and the (strongly convex) squared error loss. Then, they argmin this model to know which action minimizes the cost-to-go (cost-sensitive classification). This is close to what we do for **Q-learning** (DQN or DDPG): fit a critic (Q-values) with the TD-error (instead of full rollouts cost-to-go of expert), argmax your critic to get your policy. Similarly to DQN, the way you explore the actions of which you compute the cost-to-go is important (in this paper they do uniform exploration). **Limitations** If the policy class is not expressive enough and cannot match the expert policy performance this algo may fail to learn a reasonable policy. Example: the task is to go for point A to point B, there exist a narrow shortcut and a safer but longer road. The expert can handle both roads so it prefers taking the shortcut. Even if the learned policy class can handle the safer road it will keep trying to use the narrow one and fail to reach the goal. This is because all the costs-to-go are computed under the expert's policy, thus ignoring the fact that they cannot be achieved by any policy of the learned policy class. ## RL via No-Regrety Policy Iteration -- NRPI ![](https://i.imgur.com/X4ckv1u.png) NRPI does not require an expert policy anymore but only a **state exploration distribution**. NRPI can also be preferred when no policy in the policy class can match the expert's since it allows for more exploration by considering the **cost-to-go of the current policy**. Here, the argmax regression equivalent problem is really similar to Q-learning (where we use sampled cost-to-go from rollouts instead of Bellman errors) but where **the cost-to-go** of the aggregate dataset corresponds to **outdated policies!** (in contrast, DQN's data is comprised of rewards instead of costs-to-go). Yet, since R-FTL is a no-regret online learning method, the learned policy performs well under all the costs-to-go of previous iterations and the policies as well as the costs-to-go converge. The performance of NRPI is strongly limited to the quality of the exploration distribution. Yet if the exploration distribution is optimal, then NRPI is also optimal (the bound $T\epsilon_{regret} \rightarrow 0$ with enough online iterations). This may be a promising method for not interactive, state-only IL (if you have access to a reward). ## General limitations Both methods are much less sample efficient than DAgger as they require costs-to-go: one full rollout for one data-point. ## Broad contribution Seeing iterative learning methods such as Q-learning in the light of online learning methods is insightful and yields better bounds and understanding of why some methods might work. It presents a good tool to analyze the dynamics that interleaves learning and execution (optimizing and collecting data) for the purpose of generalization. For example, the bound for NRPI can seem quite counter-intuitive to someone familiar with on-policy/off-policy distinction, indeed NRPI optimizes a policy wrt to **costs-to-go of other policies**, yet R-FTL tells us that it converges towards what we want. Additionally, it may give a practical advantage for stability as the policy is optimized with larger batches and thus as to be good across many states and many cost-to-go formulations. |

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Stephane Ross and Geoffrey J. Gordon and J. Andrew Bagnell

arXiv e-Print archive - 2010 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2010/11/02 (9 years ago)

**Abstract:** Sequential prediction problems such as imitation learning, where future
observations depend on previous predictions (actions), violate the common
i.i.d. assumptions made in statistical learning. This leads to poor performance
in theory and often in practice. Some recent approaches provide stronger
guarantees in this setting, but remain somewhat unsatisfactory as they train
either non-stationary or stochastic policies and require a large number of
iterations. In this paper, we propose a new iterative algorithm, which trains a
stationary deterministic policy, that can be seen as a no regret algorithm in
an online learning setting. We show that any such no regret algorithm, combined
with additional reduction assumptions, must find a policy with good performance
under the distribution of observations it induces in such sequential settings.
We demonstrate that this new approach outperforms previous approaches on two
challenging imitation learning problems and a benchmark sequence labeling
problem.
more
less

Stephane Ross and Geoffrey J. Gordon and J. Andrew Bagnell

arXiv e-Print archive - 2010 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
## General Framework The imitation learning problem is here cast into a classification problem: label the state with the corresponding expert action. With this, you can see structured prediction (predict next label knowing your previous prediction) as a degenerated IL problem. They make the **reduction assumption** that you can make the probability of mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in the cost-to-go (Q-value) between expert-action followed by expert policy and any action followed by expert policy is small (which does not hold if one mistake makes you for fall from a cliff for example!). ## Problem definition There are three problems that are highlighted in this paper **1-interaction between policy and distribution**, **2-violation of the i.i.d assumption**, and **3-distributional shift and resulting compounding error of BC**. The paper implies that **1** implies **2** that implies **3** but they eventually only address **3** with the algorithm they propose (DAgger). 1. **interaction between policy and distribution**: the fact that the learned policy is present in the expectation (through the state-action distribution) as well as in the loss makes the **objective non-convex** and this even for convex loss functions. **Yet this also implies non i.i.d. samples!** Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update). 2. **violation of the i.i.d. assumption**: So we just discussed that 1. implies non-iid samples across updates (batches) yet there is another form of dependency (inside a batch this time) as in practice we collect samples using rollouts and in a trajectory s' depends on (s,a) so we have a dependency because we collected the **samples sequentially**. 3. **distribution-shift and compounding error of Behavior cloning (supervised learning)**: During training, you assume iid data and you do not model that the next state you will have to classify (act upon) is influenced by your previous action. But at execution samples are not iid: if you do a mistake you reach a state that you may have never seen (distribution shift) and you cannot recover(compounding error). If the classifier's probability of making a mistake is $\epsilon$ and the episode length is $T$, then BC can make as many as $\epsilon T^2$ mistakes (CF Levine's class example with the funambulist and the worst-case scenario: as soon as you make a mistake, you cannot recover and you make mistakes until the end of the episode). **This is what this paper (and the papers it compares to) addresses, they propose a worst-case scenario linear in both** $\epsilon$ and $T$. **Additionnal assumptions compared to BC:**During training you are allowed to query the expert as much as you want. ## Other methods * **Forward Training**: trains a **non-stationary policy** (one policy for each time step). Each policy is trained to match the expert on the distribution of states encountered at that time-step when doing rollouts with the previously learned policies (there is no distribution shift since each policy is trained on the actual distribution of states it will see). **Problem**: you have to train $T$ different policies, you cannot stop until you trained all the policies. * **Stochastic Mixing Iterative Learning (SMILe)**: iteratively trains a **stochastic mixture of policies**. Each new policy is trained to match the expert on the state distribution of the current mixture, therefore iteratively accounting for the distribution drift, eventually, the distribution should converge. This yields near-linear regret on $T$ and $\epsilon$. **Problem**: At execution time, we sample a policy from the mixture, which can perform poorly on the given state, especially if it corresponded to a different distribution (early iteration for example). ## DAgger -- Dataset Aggregation **Algo** ![](https://i.imgur.com/FpQAyI2.png =400x) $\beta_i$ is related to the proportion of state distribution that we want to collect similar to the expert distribution. This may help in the early iterations, where the policy is bad and may spend a lot of time collecting datapoints in states that are irrelevant. In practice using $\beta_i=I(i=0)$ works well. Intuitively, by training on generated transitions, DAgger reduces the distribution shift. **Link to no Regret Online Learning** DAgger can be seen as a *Follow-the-Leader* (FTL) algorithm where at iteration $n$ we pick the best policy in hindsight, i.e. with respect to all the transitions seen so far (in previous iterations). The use of $\beta_i$ is an "extra trick" that doesn't exist as is in FTL algos. *Regret*: the difference between what I did and the best thing that I could have done. Each iteration of DAgger can be seen as a step of Online Learning (treat mini-batches of trajectories under a single policy as a single online-learning example) where the online algo used is FTL but could actually be any no-regret algo. Indeed, by using the no-regret algo view, the authors show that using no-regret algos (regret decays with $1/N$) as a meta-algorithm for IL (just like FTL for DAgger) yield that, for N big enough (in the order of T), we have a worst-case performance linear in $\epsilon$, $u$ and $T$. Where it is **assumed** that we can reach an error probability of $\epsilon$ on any training distribution and a difference in the cost-to-go (-Q-value) of $u$. In other words, if we can guarantee a classification error rate of $\epsilon$ **on the training distribution** and that the expert is able to recover from a mistake ($u$ is bounded) then the use of a **no-regret online learning algorithm** will yield a policy that will make at most (proportional to) $\epsilon T u$ errors **on its own state distribution (test distribution)**. **Requirements** * An error rate of $\epsilon$ on the training distribution * A strongly convex loss function between expert and learner labels (0-1 loss is not but binary cross-entropy that is usually used for classification is) so that FTL is indeed a no-regret algo. If the loss in not strongly convex, FTL has no guarantees and we must use another no-regret method * The difference in the cost-to-go (Q-value) between expert-action followed by expert policy and any action followed by the expert policy must be small (expert must be able to recover: this does not account for falling from a cliff for example!). **Results** ![](https://i.imgur.com/DZTK2nu.png =500x) ![](https://i.imgur.com/rbtyUbY.png =500x) ![](https://i.imgur.com/ZViGENA.png =500x) ![](https://i.imgur.com/fExmxeb.png =500x) ![](https://i.imgur.com/kv1X01w.png =500x) |

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10 (3 years ago)

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.
more
less

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

[link]
## Summary The broad goal of this paper is to understand how a neural network learns the underlying distribution of the input data and the properties of the network that describes its generalization power. Previous literature tries to use statistical measures like Rademacher complexity, uniform stability and VC dimension to explain the generalization error of the model. These methods explain generalization in terms of the number of parameters in the model along with the applied regularization. The experiments performed in the [Section 2] of the paper show that the learning capacity of a CNN cannot be sufficiently explained by traditional statistical learning theory. Even the effect of different regularization strategies in CNN is shown to be potentially unrelated to the generalization error, which contradicts the theory behind VC dimension. The experiments of the paper show that the model is able to learn some underlying patterns for random labels and input with different amounts of gaussian noise. When the authors gradually increase the noise in the inputs the generalization error gradually increases while the training error is still able to reach zero. The authors have concluded that big networks are able to completely memorise the complete dataset. ## Personal Thoughts 1) Firstly we need a new theory to explain why and how CNN memorizes the inputs and generalizes itself to new data. Since the paper shows that regularization doesn't have too much effect on the generalization for big networks, maybe the network is actually memorizing the whole input space. But the memorization is very strategic in the sense that only the inputs (eg. noise) where no underlying simple features are found, are completely memorized unlike inputs with a stronger signal where patterns can be found. This may explain the discrepancy in number of training steps between ‘true labels’ and noisy inputs in [Figure 1 a.]. My very general understanding of Information Bottleneck Hypothesis [4] is that networks compresses noisy input data as much as possible while preserving important information. For a network more time is taken to compress noise compared to strong signals in images. This may give some intuision behind the learning process taking place. 2) CNN is highly non-linear with millions of parameters and has a very complex loss landscape. There might be multiple minima and we need a theory to explain which of these minima gives the highest generalization. Unfortunately the working of SGD is still a black box and is very difficult to characterize. There are many interesting phenomena like adversarial attacks, effect of optimizer used on the weights found (Daniel Jiwoong et al., 2016) and the actual understanding of non-linearity in CNN (Ian J. Goodfellow et al., 2015) that all point to lapses in our overall understanding of very high dimensional manifolds. This requires rigorous experimentation to study and understand the effect of the network architecture, optimizer and the actual input (Nitish Shirish et al.,2017) to the network independently on generalization. ## References 1. Im, Daniel Jiwoong et al. “An empirical analysis of the optimization of deep network loss surfaces.” arXiv: Learning (2016): n. pag. 2. Goodfellow, Ian J. and Oriol Vinyals. “Qualitatively characterizing neural network optimization problems.” CoRR abs/1412.6544 (2015): n. pag. 3. Keskar, Nitish Shirish et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.” ArXiv abs/1609.04836 (2017): n. pag. 4. https://www.youtube.com/watch?v=XL07WEc2TRI |

About