Episodic Curiosity through Reachability
Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap and Sylvain Gelly
2018
Paper summary by decodyng
This paper proposes a new curiosity-based intrinsic reward technique that seeks to address one of the failure modes of previous curiosity methods. The basic idea of curiosity is that exploring novel areas of an environment is often correlated with gaining reward within that environment, and that we can find ways to incentivize the former that don’t require a hand-designed reward function. This is appealing because many useful-to-learn environments either lack inherent reward altogether or have reward that is very sparse (i.e. no signal until you reach the end, at which point you get a reward of 1). In both of these cases, supplementing with some kind of intrinsic incentive towards exploration might improve performance. The existing baseline curiosity technique is called ICM, and works based on “surprisal”: the agent is asked to predict the next state as a function of its current state and action, and is rewarded for visiting areas where the gap between the predicted and the actual next state is high, which promotes exploration of harder-to-predict (and presumably more poorly sampled) locations. However, one failure mode of this approach is the “noisy TV” problem: if the environment contains something analogous to a television where one can press a button and go to a random channel, the result is highly unpredictable, and thus a source of easy reward, and thus liable to distract the agent from any other actions.
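To make the noisy TV failure mode concrete, here is a minimal sketch of a surprisal-style reward. It is not ICM itself: the function name `surprisal_reward`, the two-dimensional toy states, and the uniform "channel" distribution are all assumptions made for illustration, and the forward model is replaced by hand-written predictions.

```python
import numpy as np

def surprisal_reward(predicted_next, actual_next):
    """Intrinsic reward as half the squared prediction error (illustrative)."""
    diff = np.asarray(predicted_next, dtype=float) - np.asarray(actual_next, dtype=float)
    return 0.5 * float(np.sum(diff ** 2))

rng = np.random.default_rng(0)

# Predictable transition: a well-trained model predicts it exactly,
# so the curiosity reward vanishes.
corridor = surprisal_reward([1.0, 0.0], [1.0, 0.0])

# "Noisy TV": each button press shows a random channel. Even the best
# constant prediction (the mean channel) leaves a large average error,
# so the reward never decays no matter how long the agent watches.
channels = rng.uniform(-1.0, 1.0, size=(1000, 2))
best_guess = channels.mean(axis=0)
tv = float(np.mean([surprisal_reward(best_guess, c) for c in channels]))
```

Here `corridor` is exactly zero while `tv` stays well above zero, which is why an agent maximizing this reward can prefer the television to the rest of the maze.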
As an alternative, the authors here suggest a different way of defining novelty: rather than something that is unpredictable, novelty should be seen as something far away from what I as an agent have seen before. This is more direct than the prior approach, which takes ‘hard to predict’ as a proxy for ‘somewhere I haven’t explored’, which may not necessarily be a reasonable assumption.
They implement this idea by keeping a memory of past (embedded) observations that the agent has seen during the current episode and, at each step, checking whether the current observation is predicted to be more than k steps away from every observation in memory (more on that in a moment). If so, a bonus reward is added, and this observation is added to the aforementioned memory. (Waving hands vigorously, the memory ends up functioning as a kind of spanning set of prior experience.)
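The memory loop described above can be sketched as follows. This is a simplified reading of the summary, not the authors' implementation: the class name `EpisodicMemory`, the fixed bonus of 1.0, and the stand-in `is_far_by_distance` (which uses Euclidean distance between embeddings in place of the learned step-distance prediction) are all hypothetical.

```python
import numpy as np

def is_far_by_distance(a, b, threshold=1.0):
    # Hypothetical stand-in for the learned comparator: Euclidean distance
    # between embeddings is used as a proxy for "more than k steps apart".
    return float(np.linalg.norm(a - b)) > threshold

class EpisodicMemory:
    """Stores embeddings seen this episode and pays a bonus for novel ones."""

    def __init__(self, is_far=is_far_by_distance, bonus=1.0):
        self.is_far = is_far
        self.bonus = bonus
        self.items = []

    def reset(self):
        # Called at the start of every episode: the memory is episodic.
        self.items.clear()

    def step(self, embedding):
        embedding = np.asarray(embedding, dtype=float)
        # Novel only if far from *every* stored embedding. With an empty
        # memory this is trivially true, so the first observation is stored.
        novel = all(self.is_far(embedding, m) for m in self.items)
        if novel:
            self.items.append(embedding)
            return self.bonus
        return 0.0
```

For example, feeding the embeddings `[0, 0]`, `[0.5, 0]` and `[2, 0]` in that order yields a bonus for the first and third only, and only those two are stored.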
The question of “how many steps is observation A from observation B” is answered by a separate Comparator network, which is trained in a fairly straightforward fashion: a random-sampling policy is used to collect trajectories, which are then turned into pairs of observations as input, labeled 1 if the two observations occurred more than k + p steps apart and 0 if they occurred fewer than k steps apart. These paired observations are passed into a shared-weight convolutional network that embeds each one and, from the two embeddings, a prediction is made as to whether the pair is closer than the threshold or farther away. This network is pre-trained before the actual RL training starts. (Minor side note: at RL-training time, the network is split in two: the embedding part produces the embeddings that are stored in memory, and the comparison part takes each stored embedding paired with the current observation’s embedding to make the prediction.)
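The pair-labeling step can be sketched as below, following the labeling convention used in this summary (1 for far apart, 0 for close). The function name `make_pairs`, the default values of `k` and `p`, and the uniform sampling of index pairs are assumptions for illustration; the summary does not specify how pairs are drawn.

```python
import random

def make_pairs(trajectory, k=5, p=5, n_pairs=100, seed=0):
    """Sample (obs_i, obs_j, label) training triples from one trajectory."""
    if not trajectory or k < 1:
        raise ValueError("need a non-empty trajectory and k >= 1")
    rng = random.Random(seed)
    n = len(trajectory)
    pairs = []
    while len(pairs) < n_pairs:
        i, j = rng.randrange(n), rng.randrange(n)
        gap = abs(i - j)
        if gap > k + p:
            pairs.append((trajectory[i], trajectory[j], 1))  # far apart
        elif gap < k:
            pairs.append((trajectory[i], trajectory[j], 0))  # close together
        # Gaps between k and k + p are ambiguous, so they are skipped;
        # the margin p keeps the two classes cleanly separated.
    return pairs
```

In a real setup the trajectory entries would be image observations and the triples would feed the shared-weight network; here any list works, which makes the labeling easy to check by using the time indices themselves as observations.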
Overall, the authors find that their method works better than both ICM and no-intrinsic-reward for VizDoom (a maze + shooting game), and the advantage is larger in settings where the external reward is sparser.
On DeepMind Lab tasks, they saw no advantage on tasks with already-dense extrinsic rewards, and little advantage on the “normally sparse” task, which they suggest may be because it is actually easier than expected. They added doors to a maze navigation task to ensure the agent couldn’t find the target right away, and in this setting their method performed better. They also tried a fully no-extrinsic-reward setting, where their method strongly outperformed both the ICM baseline and (obviously) the only-extrinsic-reward baseline, which was basically an untrained random policy in this setting. Regarding the poor performance of the ICM baseline in this environment, the authors write: “we hypothesise that the agent can most significantly change its current view when it is close to the wall - thus increasing one-step prediction error - so it tends to get stuck near “interesting” diverse textures on the walls.”
First published: 2018/10/04
Abstract: Rewards are sparse in the real world and most today's reinforcement learning
algorithms struggle with such sparsity. One solution to this problem is to
allow the agent to create rewards for itself - thus making rewards dense and
more suitable for learning. In particular, inspired by curious behaviour in
animals, observing something novel could be rewarded with a bonus. Such bonus
is summed up with the real task reward - making it possible for RL algorithms
to learn from the combined reward. We propose a new curiosity method which uses
episodic memory to form the novelty bonus. To determine the bonus, the current
observation is compared with the observations in memory. Crucially, the
comparison is done based on how many environment steps it takes to reach the
current observation from those in memory - which incorporates rich information
about environment dynamics. This allows us to overcome the known "couch-potato"
issues of prior work - when the agent finds a way to instantly gratify itself
by exploiting actions which lead to unpredictable consequences. We test our
approach in visually rich 3D environments in ViZDoom and DMLab. In ViZDoom, our
agent learns to successfully navigate to a distant goal at least 2 times faster
than the state-of-the-art curiosity method ICM. In DMLab, our agent generalizes
well to new procedurally generated levels of the game - reaching the goal at
least 2 times more frequently than ICM on test mazes with very sparse reward.