First published: 2018/10/15. Abstract: Humans spend a remarkable fraction of waking life engaged in acts of "mental
time travel". We dwell on our actions in the past and experience satisfaction
or regret. More than merely autobiographical storytelling, we use these event
recollections to change how we will act in similar scenarios in the future.
This process endows us with a computationally important ability to link actions
and consequences across long spans of time, which figures prominently in
addressing the problem of long-term temporal credit assignment; in artificial
intelligence (AI) this is the question of how to evaluate the utility of the
actions within a long-duration behavioral sequence leading to success or
failure in a task. Existing approaches to shorter-term credit assignment in AI
cannot solve tasks with long delays between actions and consequences. Here, we
introduce a new paradigm for reinforcement learning where agents use recall of
specific memories to credit actions from the past, allowing them to solve
problems that are intractable for existing algorithms. This paradigm broadens
the scope of problems that can be investigated in AI and offers a mechanistic
account of behaviors that may inspire computational models in neuroscience,
psychology, and behavioral economics.
This builds on the earlier ["MERLIN"](https://arxiv.org/abs/1803.10760) paper. The authors first introduce the RMA agent, a simplified version of MERLIN that combines model-based RL with long-term memory. The long-term memory works by letting the agent choose when to save and load its working memory (represented by the LSTM's hidden state).
They then add credit assignment, similar to the RUDDER paper, to get the "Temporal Value Transport" (TVT) agent, which can plan over long horizons despite distractions. **The critical insight is that the agent's memory accesses decide the credit assignment**: if the model reads a memory written 512 steps ago, the action taken 512 steps ago receives much of the credit for the current reward.
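
To make the memory-driven credit idea concrete, here is a minimal sketch, not the paper's implementation. It assumes we already have per-step rewards, a critic's value estimates, and a matrix of memory read weights; the function name `transport_value` and the parameters `alpha`, `threshold`, and `min_gap` are hypothetical choices for illustration.

```python
def transport_value(rewards, values, read_weights,
                    alpha=0.9, threshold=0.5, min_gap=2):
    """Return a copy of `rewards` with value sent back to strongly read steps.

    rewards[t]         : reward received at step t
    values[t]          : critic's value estimate at step t (assumed given)
    read_weights[t][s] : how strongly the memory read at step t attends
                         to the memory written at step s (assumed given)
    alpha, threshold, min_gap are illustrative hyperparameters.
    """
    augmented = list(rewards)  # leave the caller's list untouched
    for t in range(len(rewards)):
        for s in range(t):
            weight = read_weights[t][s]
            # Only strong reads of sufficiently old memories transport value.
            if weight > threshold and t - s >= min_gap:
                augmented[s] += alpha * weight * values[t]
    return augmented


# Toy episode: the final step reads the memory written at step 0.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0, 1.0]
reads = [[0.0] * 5 for _ in range(5)]
reads[4][0] = 1.0
print(transport_value(rewards, values, reads, alpha=0.5))
```

In this toy episode, step 0 receives extra reward because the final step read its memory, while the distractor steps in between are left unchanged.
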
They evaluate on various tasks, for example a maze that combines a distractor task with a memory retrieval task. The agent starts in a maze with, say, a yellow wall, and then has to collect apples. The apple collection serves as a distraction, testing whether the agent can still recall the memory afterwards. At the end of the maze it must remember the initial color (e.g. yellow) to choose the exit of the matching color.
The performance graphs show that memory, and even more so memory plus credit assignment, helps significantly on this and similar tasks.
First published: 2018/06/20. Abstract: We propose a novel reinforcement learning approach for finite Markov decision
processes (MDPs) with delayed rewards. In this work, biases of temporal
difference (TD) estimates are proved to be corrected only exponentially slowly
in the number of delay steps. Furthermore, variances of Monte Carlo (MC)
estimates are proved to increase the variance of other estimates, the number of
which can exponentially grow in the number of delay steps. We introduce RUDDER,
a return decomposition method, which creates a new MDP with same optimal
policies as the original MDP but with redistributed rewards that have largely
reduced delays. If the return decomposition is optimal, then the new MDP does
not have delayed rewards and TD estimates are unbiased. In this case, the
rewards track Q-values so that the future expected reward is always zero. We
experimentally confirm our theoretical results on bias and variance of TD and
MC estimates. On artificial tasks with different lengths of reward delays, we
show that RUDDER is exponentially faster than TD, MC, and MC Tree Search
(MCTS). RUDDER outperforms rainbow, A3C, DDQN, Distributional DQN, Dueling
DDQN, Noisy DQN, and Prioritized DDQN on the delayed reward Atari game Venture
in only a fraction of the learning time. RUDDER considerably improves the
state-of-the-art on the delayed reward Atari game Bowling in much less learning
time. Source code is available at https://github.com/ml-jku/baselines-rudder,
with demonstration videos at https://goo.gl/EQerZV.
[Summary by author /u/SirJAM_armedi](https://www.reddit.com/r/MachineLearning/comments/8sq0jy/rudder_reinforcement_learning_algorithm_that_is/e11swv8/).
Math aside, the "big idea" of RUDDER is the following: We use an LSTM to predict the return of an episode. To do this, the LSTM will have to recognize what actually causes the reward (e.g. "shooting the gun in the right direction causes the reward, even if we get the reward only once the bullet hits the enemy after travelling along the screen"). We then use a salience method (e.g. LRP or integrated gradients) to get that information out of the LSTM, and redistribute the reward accordingly (i.e., we then give reward already once the gun is shot in the right direction). Once the reward is redistributed this way, solving/learning the actual Reinforcement Learning problem is much, much easier and as we prove in the paper, the optimal policy does not change with this redistribution.
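
The redistribution step can be sketched in a few lines. This is a simplified illustration rather than the released RUDDER code: instead of LRP or integrated gradients, it uses differences between consecutive return predictions as the per-step contribution, and it assumes the predictions (here the hypothetical list `predicted_returns`) come from an already trained return predictor such as the LSTM described above.

```python
def redistribute_reward(predicted_returns, actual_return):
    """Spread an episode's return over its steps.

    predicted_returns[t] : the predictor's estimate of the final episode
                           return after seeing steps 0..t (assumed given)
    actual_return        : the return the episode actually produced
    """
    if not predicted_returns:
        return []
    redistributed = []
    previous = 0.0
    for prediction in predicted_returns:
        # A step is credited with however much it changed the prediction.
        redistributed.append(prediction - previous)
        previous = prediction
    # Put any leftover prediction error on the last step so the
    # redistributed rewards still sum to the true episode return.
    redistributed[-1] += actual_return - predicted_returns[-1]
    return redistributed


# The prediction jumps at step 1 (the "gun is fired" moment), although
# the environment only pays out at the end of the episode.
print(redistribute_reward([0.0, 1.0, 1.0, 1.0], actual_return=1.0))
# -> [0.0, 1.0, 0.0, 0.0]
```

The reward now arrives at the step that made the return predictable, and the total return of the episode is unchanged.
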
**TL;DR:** There are 'place cells' in the hippocampus that fire when an animal passes through a particular location. You can take a rat, measure how its cells are activated in a maze, then monitor the same neurons during planning, rest or sleep. You'll see patterns showing that it is thinking of locations in order and focusing on interesting locations. This paper looks at how RL agents do 'prioritized experience replay' and compares it to place cell activity in animals. The authors run an RL simulation and *qualitatively* compare the results to the activity observed in place cells.
> Neural activity recorded from hippocampal place cells during spatial navigation typically represents the animal’s spatial position, though it can sometimes represent locations ahead of the animal. For instance, during “sharp wave ripple” events, activity might progress sequentially from the animal’s current location towards a goal location. These “forward replay” sequences predict subsequent behavior and have been suggested to support a planning mechanism that links actions to their deferred consequences along a spatial trajectory. However, analogously to the human evidence, remote activity in the hippocampus can also represent locations behind the animal, and even altogether disjoint, remote locations (especially during rest or sleep) (Fig. 1a).
> we develop a normative theory to predict not just whether but which memories should be accessed at each time to enable the most rewarding future decisions.
> To test the implications of our theory, we simulate a spatial navigation task where an agent generates and stores experiences which can be later retrieved. We show that an agent that accesses memories sequentially and in order of utility produces patterns of sequential state consideration that resemble place cell replay, and reproduces qualitatively and with no parameter fitting a wealth of empirical findings including (i) the existence and balance between forward and reverse replay; (ii) the content of replay; and (iii) effects of experience.
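
A toy sketch of replaying memories in priority order may help here. It is not the authors' model: as a stand-in for the paper's utility measure it ranks stored transitions by the size of their TD error, in the spirit of prioritized sweeping, and the function name, the three-state corridor, and the parameters `gamma`, `lr`, and `n_replays` are all assumptions made for illustration.

```python
def replay_in_priority_order(transitions, values,
                             gamma=0.9, lr=1.0, n_replays=3):
    """Replay stored transitions, most surprising first.

    transitions : list of (state, reward, next_state) memories
    values      : dict mapping state -> current value estimate (updated
                  in place)
    Returns the states in the order they were replayed.
    """
    order = []
    for _ in range(n_replays):
        scored = []
        for index, (state, reward, next_state) in enumerate(transitions):
            td_error = (reward + gamma * values.get(next_state, 0.0)
                        - values.get(state, 0.0))
            # Negated magnitude so that min() picks the largest error.
            scored.append((-abs(td_error), index, td_error))
        priority, index, td_error = min(scored)
        if priority == 0:
            break  # nothing left to learn from the stored memories
        state = transitions[index][0]
        values[state] = values.get(state, 0.0) + lr * td_error
        order.append(state)
    return order


# A corridor A -> B -> C -> goal, with reward only on the last step.
memories = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "goal")]
print(replay_in_priority_order(memories, {}))  # -> ['C', 'B', 'A']
```

In this corridor the replay order runs backward from the rewarded step, which gives a rough picture of how value propagation can produce reverse-replay-like sequences.
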
> we propose the unifying view that all patterns of replay during behavior, rest, and sleep reflect different instances of a more general state retrieval operation that integrates experiences across space and time to propagate value and guide decisions.
**My 2 cents**: I like this paper because prioritized experience replay reminds me of how we often dream or daydream of novel good or bad events that happened or that we anticipate. This paper drills much deeper into this connection.