# #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel, 2016
Paper summary by dennybritz

TLDR; The authors encourage exploration by adding a pseudo-reward of the form $\frac{\beta}{\sqrt{count(state)}}$ for infrequently visited states. State visits are counted using Locality-Sensitive Hashing (LSH) over an environment-specific feature representation, such as raw pixels or autoencoder embeddings. The authors show that this simple technique achieves gains on various classic RL control tasks and several games in the Atari domain.
While the algorithm itself is simple, it introduces several new hyperparameters to tune: the bonus coefficient `beta`, the LSH granularity (how many bits to use for hashing), and the type of feature representation the hash is computed from, which may itself have more parameters. The experiments don't paint a consistent picture, and different environments seem to need vastly different hyperparameter settings, which in my opinion will make this technique difficult to use in practice.
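The core mechanism is compact enough to sketch. Below is a minimal SimHash-style version of the count-based bonus: states are projected through a fixed random matrix, the sign pattern serves as the hash code, and the bonus decays as $\beta/\sqrt{n(\phi(s))}$. Class and parameter names are illustrative, not from the paper's released code; `k` and `beta` correspond to the granularity and bonus-coefficient hyperparameters discussed above.

```python
import numpy as np

class CountExplorationBonus:
    """Minimal sketch of a SimHash-style count-based exploration bonus.

    Hashes a state feature vector to k sign bits via a fixed random
    projection, counts hash occurrences in a dict, and returns
    beta / sqrt(count). Names and defaults are illustrative.
    """

    def __init__(self, state_dim, k=16, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed Gaussian projection; k controls hashing granularity
        # (more bits -> finer state distinctions -> lower counts).
        self.A = rng.standard_normal((k, state_dim))
        self.beta = beta
        self.counts = {}

    def _hash(self, state):
        # SimHash: sign pattern of the projected state vector.
        return tuple((self.A @ np.asarray(state, dtype=float)) > 0)

    def bonus(self, state):
        # Increment the visit count for this hash bucket and
        # return the decaying pseudo-reward beta / sqrt(n).
        h = self._hash(state)
        self.counts[h] = self.counts.get(h, 0) + 1
        return self.beta / np.sqrt(self.counts[h])
```

Repeated visits to states that fall into the same hash bucket yield a shrinking bonus, so the agent is nudged toward rarely visited regions; in practice the feature vector fed to `_hash` would come from raw pixels or a learned (e.g. autoencoder) representation, as the paper describes.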
First published: 2016/11/15

Abstract: Count-based exploration algorithms are known to perform near-optimally when
used in conjunction with tabular reinforcement learning (RL) methods for
solving small discrete Markov decision processes (MDPs). It is generally
thought that count-based methods cannot be applied in high-dimensional state
spaces, since most states will only occur once. Recent deep RL exploration
strategies are able to deal with high-dimensional continuous state spaces
through complex heuristics, often relying on optimism in the face of
uncertainty or intrinsic motivation. In this work, we describe a surprising
finding: a simple generalization of the classic count-based approach can reach
near state-of-the-art performance on various high-dimensional and/or continuous
deep RL benchmarks. States are mapped to hash codes, which allows their
occurrences to be counted with a hash table. These counts are then used to compute a
reward bonus according to the classic count-based exploration theory. We find
that simple hash functions can achieve surprisingly good results on many
challenging tasks. Furthermore, we show that a domain-dependent learned hash
code may further improve these results. Detailed analysis reveals important
aspects of a good hash function: 1) having appropriate granularity and 2)
encoding information relevant to solving the MDP. This exploration strategy
achieves near state-of-the-art performance on both continuous control tasks and
Atari 2600 games, hence providing a simple yet powerful baseline for solving
MDPs that require considerable exploration.