RUDDER: Return Decomposition for Delayed Rewards on ShortScience.org

RUDDER: Return Decomposition for Delayed Rewards
Jose A. Arjona-Medina and Michael Gillhofer and Michael Widrich and Thomas Unterthiner and Sepp Hochreiter
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, cs.AI, math.OC, stat.ML
Summary by wassname 5 years ago

[Summary by author /u/SirJAM_armedi](https://www.reddit.com/r/MachineLearning/comments/8sq0jy/rudder_reinforcement_learning_algorithm_that_is/e11swv8/).

Math aside, the "big idea" of RUDDER is the following: we use an LSTM to predict the return of an episode. To do this, the LSTM has to recognize what actually causes the reward (e.g. "shooting the gun in the right direction causes the reward, even if we only get the reward once the bullet hits the enemy after travelling across the screen"). We then use a saliency method (e.g. LRP or integrated gradients) to extract that information from the LSTM and redistribute the reward accordingly (i.e. we give the reward as soon as the gun is shot in the right direction). Once the reward is redistributed this way, solving the actual reinforcement learning problem is much, much easier, and, as we prove in the paper, the optimal policy does not change with this redistribution.
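
The redistribution step can be illustrated with a minimal sketch. This is not the paper's implementation: instead of LRP or integrated gradients it uses a simpler stand-in, taking the difference between consecutive return predictions as the per-step reward. The function name `redistribute_reward`, the hard-coded prediction list, and the even spreading of the leftover prediction error are all assumptions made for illustration; in practice the predictions would come from a trained LSTM.

```python
def redistribute_reward(predicted_returns, actual_return):
    """Turn per-step return predictions into per-step rewards.

    predicted_returns[t] is assumed to be a model's estimate of the final
    episode return after seeing steps 0..t (hypothetical input; a trained
    LSTM would produce it). The reward for step t is the change in that
    estimate, so steps that raise the predicted return get the credit.
    """
    if not predicted_returns:
        return []

    redistributed = []
    previous = 0.0  # assumed prediction before any step has been seen
    for prediction in predicted_returns:
        redistributed.append(prediction - previous)
        previous = prediction

    # Assumption for this sketch: spread any remaining prediction error
    # evenly so the redistributed rewards still sum to the true return.
    error = actual_return - predicted_returns[-1]
    share = error / len(redistributed)
    return [reward + share for reward in redistributed]


# Toy episode: the gun is fired at step 2, the enemy is hit at step 4.
# The (made-up) predictor already becomes confident at step 2.
predictions = [0.0, 0.0, 1.0, 1.0, 1.0]
rewards = redistribute_reward(predictions, actual_return=1.0)
print(rewards)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

In this toy episode the whole reward moves from the final step to step 2, where the decisive action happens, while the total return of the episode stays the same.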