First published: 2018/06/19 (1 year ago) Abstract: We introduce a new family of deep neural network models. Instead of
specifying a discrete sequence of hidden layers, we parameterize the derivative
of the hidden state using a neural network. The output of the network is
computed using a black-box differential equation solver. These continuous-depth
models have constant memory cost, adapt their evaluation strategy to each
input, and can explicitly trade numerical precision for speed. We demonstrate
these properties in continuous-depth residual networks and continuous-time
latent variable models. We also construct continuous normalizing flows, a
generative model that can train by maximum likelihood, without partitioning or
ordering the data dimensions. For training, we show how to scalably
backpropagate through any ODE solver, without access to its internal
operations. This allows end-to-end training of ODEs within larger models.
Summary by senior author [duvenaud on hackernews](https://news.ycombinator.com/item?id=18678078).
A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:
h1 = f1(x)
h2 = f2(h1)
h3 = f3(h2)
h4 = f4(h3)
y = f5(h4)
They now build them like this:
h1 = f1(x) + x
h2 = f2(h1) + h1
h3 = f3(h2) + h2
h4 = f4(h3) + h3
y = f5(h4) + h4
Where f1, f2, etc. are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.
In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.
We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.
Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
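To make this concrete, here is a minimal sketch of an ODE net's forward pass, assuming a toy numpy MLP for the dynamics and SciPy's adaptive solver; this is my illustration, not the authors' implementation (which is in PyTorch and backpropagates with the adjoint method), and the hidden size, weights, and tolerances are arbitrary assumptions:
```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical weights for a tiny MLP that defines the dynamics dh/dt = f(h, t).
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((32, 17))   # input = 16 hidden units + the "depth" t
b1 = np.zeros(32)
W2 = 0.1 * rng.standard_normal((16, 32))
b2 = np.zeros(16)

def dynamics(t, h):
    """The continuous 'layer': derivative of the hidden state with respect to depth t."""
    inp = np.concatenate([h, [t]])
    return W2 @ np.tanh(W1 @ inp + b1) + b2

h0 = rng.standard_normal(16)  # hidden state at depth 0, e.g. an embedded input
# Integrate from depth 0 to 1 with an adaptive solver; loosening rtol/atol
# trades numerical precision for speed, as described in the abstract.
sol = solve_ivp(dynamics, t_span=(0.0, 1.0), y0=h0, rtol=1e-3, atol=1e-4)
h_final = sol.y[:, -1]        # the "output" of the ODE net
```
Compare this to the residual stack above: each hand-written Euler-like step `h + f(h)` is replaced by whatever sequence of adaptive steps the solver decides to take.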
First published: 2018/11/12 (9 months ago) Abstract: Planning has been very successful for control tasks with known environment
dynamics. To leverage planning in unknown environments, the agent needs to
learn the dynamics from interactions with the world. However, learning dynamics
models that are accurate enough for planning has been a long-standing
challenge, especially in image-based domains. We propose the Deep Planning
Network (PlaNet), a purely model-based agent that learns the environment
dynamics from pixels and chooses actions through online planning in latent
space. To achieve high performance, the dynamics model must accurately predict
the rewards ahead for multiple time steps. We approach this problem using a
latent dynamics model with both deterministic and stochastic transition
function and a generalized variational inference objective that we name latent
overshooting. Using only pixel observations, our agent solves continuous
control tasks with contact dynamics, partial observability, and sparse rewards.
PlaNet uses significantly fewer episodes and reaches final performance close to
and sometimes higher than top model-free algorithms.
**Summary**: This paper presents three tricks that make model-based reinforcement learning more reliable when tested on tasks that require walking and balancing. The tricks are 1) planning in latent feature space, 2) using a recurrent network that mixes probabilistic and deterministic information, and 3) looking forward multiple steps.
Imagine playing pool, armed with a tablet that can predict exactly where the ball will bounce, then the next bounce, and so on. That would be a huge advantage to someone learning pool; however, small inaccuracies in the model could mislead you, especially when thinking ahead to the second and third bounce.
The tablet is analogous to the dynamics model in model-based reinforcement learning (RL). Model-based RL promises to solve a lot of the open problems in RL, letting the agent learn from less experience, transfer well, dream, and many other advantages. Despite the promise, dynamics models are hard to get working: they often suffer from even small inaccuracies, and need to be redesigned for specific tasks.
Enter PlaNet, a clever name, and a net that plans well in a range of environments. To increase the challenge, the model must predict directly from pixels in fairly difficult tasks such as teaching a cheetah to run or balancing a ball in a cup.
How do they do this? Three main tricks.
- Planning in latent space: the planner doesn't need to look at the raw image, but at a summary of it represented by a feature vector.
- Recurrent state-space models: they found that probabilistic information helps describe the space of possibilities but makes it harder for their RNN-based model to look back multiple steps, while mixing probabilistic and deterministic information gives the best of both worlds; their results show a clear performance increase when using both compared to just one (a minimal sketch of such a mixed transition follows this list).
- Latent overshooting: they train the model to look more than one step ahead, which helps prevent errors that build up over time.
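Here is a rough sketch of that mixed deterministic/stochastic transition; the sizes, the simplified tanh update, and the function name are my assumptions (PlaNet's actual recurrent state-space model uses a GRU and is trained with a variational objective):
```python
import numpy as np

rng = np.random.default_rng(0)

def dense(shape):
    return 0.1 * rng.standard_normal(shape)

# Hypothetical sizes: 64-d deterministic path, 16-d stochastic path, 4-d action.
W_det = dense((64, 64 + 16 + 4))
W_mu, W_logstd = dense((16, 64)), dense((16, 64))

def transition(h_det, z_stoch, action):
    """One step of a mixed transition: the deterministic path reliably carries
    information across many steps, while the stochastic path captures the
    space of possible futures."""
    h_det = np.tanh(W_det @ np.concatenate([h_det, z_stoch, action]))
    mu = W_mu @ h_det
    std = np.exp(np.clip(W_logstd @ h_det, -5.0, 2.0))   # keep std positive and bounded
    z_stoch = mu + std * rng.standard_normal(16)          # sample the next stochastic latent
    return h_det, z_stoch
```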
Overall this paper shows great results that tackle the shortfalls of model-based RL. I hope the results hold up when tested on different and more complex environments.
First published: 2018/10/29 (9 months ago) Abstract: Lack of performance when it comes to continual learning over non-stationary
distributions of data remains a major challenge in scaling neural network
learning to more human realistic settings. In this work we propose a new
conceptualization of the continual learning problem in terms of a trade-off
between transfer and interference. We then propose a new algorithm,
Meta-Experience Replay (MER), that directly exploits this view by combining
experience replay with optimization based meta-learning. This method learns
parameters that make interference based on future gradients less likely and
transfer based on future gradients more likely. We conduct experiments across
continual lifelong supervised learning benchmarks and non-stationary
reinforcement learning environments demonstrating that our approach
consistently outperforms recently proposed baselines for continual learning.
Our experiments show that the gap between the performance of MER and baseline
algorithms grows both as the environment gets more non-stationary and as the
fraction of the total experiences stored gets smaller.
Catastrophic forgetting is the tendency of a neural network to forget previously learned information when learning new information. This paper combats that by keeping a buffer of experience and applying meta-learning to it. They call their new method Meta-Experience Replay (MER).
How does this work? At each update they compute multiple candidate updates to the model weights: one for the new batch of data and several more for batches of previous experience. Then they apply meta-learning with the Reptile algorithm: the meta-update combines these candidate updates so as to cause the least interference, favoring update directions that maximize the dot product between the new and old update vectors. That way the model transfers as much learning as possible from the new update without interfering with the old ones. https://i.imgur.com/TG4mZOn.png
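A minimal sketch of that Reptile-style meta step, under my own simplifications (the real MER interleaves the current example within each replay batch and uses separate within-batch and across-batch rates; `grad_fn` is a hypothetical function returning the loss gradient for a batch):
```python
import numpy as np

def mer_meta_step(theta, batches, grad_fn, inner_lr=0.01, meta_lr=0.1):
    """Take several inner SGD steps across the new batch and batches sampled
    from the replay buffer, then move the original weights part of the way
    toward the result (the Reptile update). Interpolating like this implicitly
    favors updates whose gradients align (positive dot products), i.e. transfer
    rather than interference."""
    theta = np.asarray(theta, dtype=float)
    theta_inner = theta.copy()
    for batch in batches:                     # new data plus sampled old experience
        theta_inner -= inner_lr * grad_fn(theta_inner, batch)
    return theta + meta_lr * (theta_inner - theta)
```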
Does it work? Yes: while it may take longer to train, the results show that it generalizes better and needs a much smaller buffer of experience than standard experience-replay approaches.
First published: 2018/11/14 (9 months ago) Abstract: While current benchmark reinforcement learning (RL) tasks have been useful to
drive progress in the field, they are in many ways poor substitutes for
learning with real-world data. By testing increasingly complex RL algorithms on
low-complexity simulation environments, we often end up with brittle RL
policies that generalize poorly beyond the very specific domain. To combat
this, we propose three new families of benchmark RL domains that contain some
of the complexity of the natural world, while still supporting fast and
extensive data acquisition. The proposed domains also permit a characterization
of generalization through fair train/test separation, and easy comparison and
replication of results. Through this work, we challenge the RL research
community to develop more robust algorithms that meet high standards of
This paper proposes three new reinforcement learning tasks that involve dealing with images.
- Task 1: An agent crawls across a hidden image, revealing portions of it at each step. It must classify the image in the minimum number of steps, for example classifying the image as a cat after choosing to travel across the ears.
- Task 2: The agent crawls across a visible image to sit on its target, for example a cat in a scene of pets.
- Task 3: The agent plays an Atari game where the background has been replaced with a distracting video.
These tasks are easy to construct, but solving them requires large-scale visual processing or attention, which typically calls for deep networks. To address these new tasks, popular RL agents (PPO, A2C, and ACKTR) were augmented with a deep image-processing network (ResNet-18), but they still performed poorly.
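For intuition, here is a minimal sketch of the first task as a toy environment; the class name, action encoding, and reward values are my assumptions, not the benchmark's actual API:
```python
import numpy as np

class HiddenImageClassification:
    """Toy version of task 1: the agent moves a small window over an otherwise
    hidden image and must eventually emit a class guess in as few steps as possible."""

    def __init__(self, image, label, window=8):
        self.image, self.label, self.window = image, label, window
        self.pos = np.array([0, 0])

    def step(self, action):
        # Actions 0-3 move the window up/down/left/right; actions >= 4 are class guesses.
        if action < 4:
            moves = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]]) * self.window
            limit = np.array(self.image.shape[:2]) - self.window
            self.pos = np.clip(self.pos + moves[action], 0, limit)
            r, c = self.pos
            obs = self.image[r:r + self.window, c:c + self.window]
            return obs, -0.01, False          # small step penalty rewards classifying quickly
        guess = action - 4
        return None, float(guess == self.label), True
```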
First published: 2018/11/15 (9 months ago) Abstract: To solve complex real-world problems with reinforcement learning, we cannot
rely on manually specified reward functions. Instead, we can have humans
communicate an objective to the agent directly. In this work, we combine two
approaches to learning from human feedback: expert demonstrations and
trajectory preferences. We train a deep neural network to model the reward
function and use its predicted reward to train a DQN-based deep reinforcement
learning agent on 9 Atari games. Our approach beats the imitation learning
baseline in 7 games and achieves strictly superhuman performance on 2 games
without using game rewards. Additionally, we investigate the goodness of fit of
the reward model, present some reward hacking problems, and study the effects
of noise in the human labels.
How can humans help an agent perform a task that has no clear reward? Imitation, demonstration, and preferences. This paper asks which combinations of imitation, demonstration, and preferences best guide an agent in Atari games.
For example, consider an agent playing Pong on the Atari that can't access the score. You might help it by demonstrating your play style for a few hours. To help the agent further, you are shown two short clips of it playing and asked to indicate which one, if any, you prefer.
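The usual way to turn such pairwise clip comparisons into a learned reward is a Bradley-Terry style objective; this is my sketch of that idea, and `r_hat` and the data layout are assumptions rather than the paper's code:
```python
import numpy as np

def preference_loss(r_hat, clip_a, clip_b, prefer_a):
    """Cross-entropy loss for one labeled pair: the probability that clip A is
    preferred is a logistic function of the difference in summed predicted rewards."""
    sum_a = sum(r_hat(obs, act) for obs, act in clip_a)   # clips are lists of (obs, action)
    sum_b = sum(r_hat(obs, act) for obs, act in clip_b)
    p_a = 1.0 / (1.0 + np.exp(sum_b - sum_a))
    return -(np.log(p_a) if prefer_a else np.log(1.0 - p_a))
```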
To avoid spending many hours rating videos, the authors sometimes used an automated approach in which the game's score decides which clip is preferred, and compared this approach to human preferences. It turns out that human preferences are often worse because of reward traps. These happen, for example, when the human tries to encourage the agent to explore ladders, and the agent ends up obsessing over ladders instead of continuing the game.
They also observed that the agent often misunderstood the preferences it was given, causing unexpected behavior known as reward hacking. The only solution they mention is to have someone keep an eye on the agent and keep giving it preferences, but this isn't always feasible. This is the alignment problem, a hard open problem in AGI research.
Results: adding merely a few thousand preferences can help in most games, unless they have sparse rewards. Demonstrations, on the other hand, tend to help those games with sparse rewards but only if the demonstrator is good at the game.
First published: 2018/10/15 (10 months ago) Abstract: Humans spend a remarkable fraction of waking life engaged in acts of "mental
time travel". We dwell on our actions in the past and experience satisfaction
or regret. More than merely autobiographical storytelling, we use these event
recollections to change how we will act in similar scenarios in the future.
This process endows us with a computationally important ability to link actions
and consequences across long spans of time, which figures prominently in
addressing the problem of long-term temporal credit assignment; in artificial
intelligence (AI) this is the question of how to evaluate the utility of the
actions within a long-duration behavioral sequence leading to success or
failure in a task. Existing approaches to shorter-term credit assignment in AI
cannot solve tasks with long delays between actions and consequences. Here, we
introduce a new paradigm for reinforcement learning where agents use recall of
specific memories to credit actions from the past, allowing them to solve
problems that are intractable for existing algorithms. This paradigm broadens
the scope of problems that can be investigated in AI and offers a mechanistic
account of behaviors that may inspire computational models in neuroscience,
psychology, and behavioral economics.
This builds on the previous ["MERLIN"](https://arxiv.org/abs/1803.10760) paper. First they introduce the RMA agent, a simplified version of MERLIN that uses model-based RL and long-term memory. They give the agent long-term memory by letting it choose to save and load its working memory (represented by the LSTM's hidden state).
Then they add credit assignment, similar to the RUDDER paper, to get the "Temporal Value Transport" (TVT) agent that can plan long term in the face of distractions. **The critical insight here is that they use the agent's memory access to decide on credit assignment**. So if the model uses a memory from 512 steps ago, that action from 512 steps ago gets lots of credit for the current reward.
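A minimal sketch of that insight; the function shape and the `alpha` discount are my assumptions about the mechanism, not the authors' implementation:
```python
import numpy as np

def temporal_value_transport(rewards, read_events, values, alpha=0.9):
    """If the agent at step t_read strongly attends to a memory written at step
    t_write, transport a fraction of the value predicted at the read step back
    onto the reward at the write step, so the old action gets credit now."""
    augmented = np.array(rewards, dtype=float)
    for t_read, t_write, strength in read_events:   # strength = memory attention weight
        augmented[t_write] += alpha * strength * values[t_read]
    return augmented
```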
They use various tasks, for example a maze that combines a distractor task with a memory-retrieval task: the agent starts in a maze with, say, a yellow wall, then has to collect apples (a distraction that tests whether it can still recall memories afterwards), and at the end of the maze it must remember that initial color (e.g. yellow) in order to choose the exit of the matching color.
They include performance graphs showing that memory, and even more so memory plus credit assignment, significantly help on this and similar tasks.
First published: 2018/06/20 (1 year ago) Abstract: We propose a novel reinforcement learning approach for finite Markov decision
processes (MDPs) with delayed rewards. In this work, biases of temporal
difference (TD) estimates are proved to be corrected only exponentially slowly
in the number of delay steps. Furthermore, variances of Monte Carlo (MC)
estimates are proved to increase the variance of other estimates, the number of
which can exponentially grow in the number of delay steps. We introduce RUDDER,
a return decomposition method, which creates a new MDP with same optimal
policies as the original MDP but with redistributed rewards that have largely
reduced delays. If the return decomposition is optimal, then the new MDP does
not have delayed rewards and TD estimates are unbiased. In this case, the
rewards track Q-values so that the future expected reward is always zero. We
experimentally confirm our theoretical results on bias and variance of TD and
MC estimates. On artificial tasks with different lengths of reward delays, we
show that RUDDER is exponentially faster than TD, MC, and MC Tree Search
(MCTS). RUDDER outperforms rainbow, A3C, DDQN, Distributional DQN, Dueling
DDQN, Noisy DQN, and Prioritized DDQN on the delayed reward Atari game Venture
in only a fraction of the learning time. RUDDER considerably improves the
state-of-the-art on the delayed reward Atari game Bowling in much less learning
time. Source code is available at https://github.com/ml-jku/baselines-rudder,
with demonstration videos at https://goo.gl/EQerZV.
[Summary by author /u/SirJAM_armedi](https://www.reddit.com/r/MachineLearning/comments/8sq0jy/rudder_reinforcement_learning_algorithm_that_is/e11swv8/).
Math aside, the "big idea" of RUDDER is the following: We use an LSTM to predict the return of an episode. To do this, the LSTM will have to recognize what actually causes the reward (e.g. "shooting the gun in the right direction causes the reward, even if we get the reward only once the bullet hits the enemy after travelling along the screen"). We then use a salience method (e.g. LRP or integrated gradients) to get that information out of the LSTM, and redistribute the reward accordingly (i.e., we then give reward already once the gun is shot in the right direction). Once the reward is redistributed this way, solving/learning the actual Reinforcement Learning problem is much, much easier and as we prove in the paper, the optimal policy does not change with this redistribution.
**TL;DR:** There are 'place cells' in the hippocampus that fire when an animal passes through a location. You can take a rat, measure how these cells activate in a maze, and then monitor the same neurons during planning, rest, or sleep. You'll see patterns showing that it's thinking of locations in order and focusing on interesting locations. This paper looks at how RL agents do 'prioritized experience replay' and compares it to place-cell activity in animals: the authors run an RL simulation and *qualitatively* compare the results to the activity observed in place cells.
> Neural activity recorded from hippocampal place cells during spatial navigation typically represents the animal's spatial position, though it can sometimes represent locations ahead of the animal. For instance, during "sharp wave ripple" events, activity might progress sequentially from the animal's current location towards a goal location. These "forward replay" sequences predict subsequent behavior and have been suggested to support a planning mechanism that links actions to their deferred consequences along a spatial trajectory. However, analogously to the human evidence, remote activity in the hippocampus can also represent locations behind the animal, and even altogether disjoint, remote locations (especially during rest or sleep) (Fig. 1a).
> we develop a normative theory to predict not just whether but which memories should be accessed at each time to enable the most rewarding future decisions.
> To test the implications of our theory, we simulate a spatial navigation task where an agent generates and stores experiences which can be later retrieved. We show that an agent that accesses memories sequentially and in order of utility produces patterns of sequential state consideration that resemble place cell replay, and reproduces qualitatively and with no parameter fitting a wealth of empirical findings including (i) the existence and balance between forward and reverse replay; (ii) the content of replay; and (iii) effects of experience.
> we propose the unifying view that all patterns of replay during behavior, rest, and sleep reflect different instances of a more general state retrieval operation that integrates experiences across space and time to propagate value and guide decisions.
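The prioritization in their theory can be sketched in a couple of lines (my notation; the paper derives the gain and need terms formally from the agent's policy improvement and expected future state occupancy):
```python
import numpy as np

def next_memory_to_replay(gain, need):
    """The utility ('expected value of backup') of replaying a stored experience
    is its gain (how much the policy improves if that memory is replayed now)
    times its need (how likely the agent is to visit that state soon); the
    agent replays memories in order of this product."""
    evb = np.asarray(gain) * np.asarray(need)
    return int(np.argmax(evb))
```
Roughly, this is how the theory accounts for both forward replay ahead of the animal (driven by need) and reverse replay after surprising rewards (driven by gain).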
**My 2 cents**: I like this paper because prioritized experience replay reminds me of how we often dream or daydream of novel good or bad events that happened or that we anticipate. This paper drills much deeper into this connection.