RUDDER: Return Decomposition for Delayed Rewards
Paper summary
Summary by author /u/SirJAM_armedi:
Math aside, the "big idea" of RUDDER is the following: we use an LSTM to predict the return of an episode. To do this, the LSTM has to recognize what actually causes the reward (e.g. "shooting the gun in the right direction causes the reward, even if we only receive the reward once the bullet hits the enemy after travelling across the screen"). We then use a saliency method (e.g. LRP or integrated gradients) to extract that information from the LSTM and redistribute the reward accordingly (i.e. the reward is now given as soon as the gun is fired in the right direction). Once the reward has been redistributed this way, learning the actual reinforcement learning problem becomes much easier, and, as we prove in the paper, the optimal policy does not change under this redistribution.
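The redistribution step can be illustrated with a minimal sketch. It is not the authors' implementation: instead of LRP or integrated gradients, it assumes the simpler alternative of crediting each step with the change in the predicted return, and it assumes the per-step predictions are already available from some return predictor (here they are hard-coded toy numbers). The function name `redistribute_reward` and the even spreading of the leftover prediction error are choices made for this illustration only.

```python
def redistribute_reward(predicted_returns, actual_return):
    """Turn per-step return predictions into a redistributed reward sequence.

    predicted_returns[t] is a (hypothetical) model's estimate of the final
    episode return after seeing the first t + 1 steps of the episode.
    """
    if not predicted_returns:
        return []
    # Credit step t with the change in the predicted return that it causes.
    # Assumption: before any step is seen, the prediction is 0.
    rewards = []
    previous = 0.0
    for g in predicted_returns:
        rewards.append(g - previous)
        previous = g
    # The differences sum to the last prediction. Spread the remaining
    # prediction error evenly (an assumption of this sketch) so that the
    # redistributed rewards sum to the actual episode return.
    correction = (actual_return - predicted_returns[-1]) / len(rewards)
    return [r + correction for r in rewards]


# Toy episode: the decisive action happens at step 1, but the environment
# only pays out a reward of 1.0 at the final step (step 4).
predictions = [0.1, 0.9, 0.9, 0.9, 0.9]
new_rewards = redistribute_reward(predictions, actual_return=1.0)
```

In this toy episode most of the reward moves to step 1, where the prediction jumps, while the sum over the episode still equals the original return of 1.0.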
RUDDER: Return Decomposition for Delayed Rewards
Jose A. Arjona-Medina and Michael Gillhofer and Michael Widrich and Thomas Unterthiner and Sepp Hochreiter
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, cs.AI, math.OC, stat.ML


Summary by Anonymous 1 month ago