Addressing Function Approximation Error in Actor-Critic Methods on ShortScience.org


Addressing Function Approximation Error in Actor-Critic Methods
Scott Fujimoto and Herke van Hoof and Dave Meger
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.AI, cs.LG, stat.ML

Summaries/Notes 1

Summary by Roman Ring 6 years ago

As in Q-learning, modern actor-critic methods suffer from value estimation errors caused by overestimation bias and high variance. While there have been many attempts to address this in Q-learning (such as Double DQN), little had been done for actor-critic methods.

The authors propose three modifications to DDPG and show empirically that they help with both the bias and the variance issues:

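* 1.) Clipped Double Q-Learning:  
  Add a second critic together with its own target network (so four critic networks in total: $Q_{\theta_1}$, $Q_{\theta_2}$ and their targets $Q_{\theta_{\text{target},1}}$, $Q_{\theta_{\text{target},2}}$) and use the smaller of the two target estimates as an upper bound when forming the update target: $y = r + \gamma \min\limits_{i=1,2} Q_{\theta_{\text{target},i}}(s', \pi_{\phi_1}(s'))$
* 2.) Delayed policy updates:  
  Update the policy and the target networks less often than the critics, and keep each target network update small by using a soft update with a small $\tau$: $\theta_{\text{target}} \leftarrow \tau\theta + (1-\tau)\theta_{\text{target}}$
* 3.) Target policy smoothing:  
  Add clipped random noise to the action chosen by the target policy at the next state: $\hat{a} \leftarrow \pi_{\phi_{\text{target}}}(s') + \text{clip}(\mathcal{N}(0,\sigma), -c, c)$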
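
As a rough illustration of how modifications 1 and 3 combine when the critic target is formed, here is a minimal NumPy sketch. It is not the authors' implementation (see the linked source code for that). The function name `td3_target`, the callables `target_actor`, `target_q1`, `target_q2`, and the default values of `gamma`, `sigma` and `c` are assumptions made for this example, and the toy linear actor and critics exist only to make the snippet runnable.

```python
import numpy as np

def td3_target(reward, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, sigma=0.2, c=0.5, rng=None):
    """Sketch of y = r + gamma * min_i Q_target_i(s', a_hat) with a smoothed a_hat.

    All argument names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    action = target_actor(next_state)
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = np.clip(rng.normal(0.0, sigma, size=action.shape), -c, c)
    a_hat = action + noise
    # Clipped Double Q-learning: keep the smaller of the two target estimates.
    q_min = np.minimum(target_q1(next_state, a_hat),
                       target_q2(next_state, a_hat))
    return reward + gamma * q_min

# Toy stand-ins (hypothetical): a linear actor and two critics that differ
# by a constant offset, so the second critic always gives the minimum.
toy_actor = lambda s: 0.5 * s
toy_q1 = lambda s, a: (s + a).sum(axis=1)
toy_q2 = lambda s, a: (s + a).sum(axis=1) - 1.0

batch_next_states = np.ones((4, 2))
batch_rewards = np.zeros(4)
y = td3_target(batch_rewards, batch_next_states, toy_actor, toy_q1, toy_q2,
               rng=np.random.default_rng(0))
print(y.shape)
```

Both critics would then be regressed towards the same target `y`, so an overestimate in one critic cannot by itself inflate the target.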

With these modifications, the authors report significant improvements on seven continuous control tasks, outperforming not only the reference DDPG algorithm but also PPO, TRPO and ACKTR.

Full algorithm from the paper:

https://i.imgur.com/rRjwDyT.png
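
The update schedule behind the delayed updates (critics every step, policy and soft target updates only every few steps) can be sketched as below. This is a simplified skeleton under stated assumptions rather than the algorithm in the figure: `run_updates`, `critic_step`, `actor_step`, `policy_delay` and the default `tau` are hypothetical names and values chosen for illustration, and a single parameter list stands in for all networks.

```python
import numpy as np

def soft_update(target_params, params, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]

def run_updates(num_steps, critic_step, actor_step, params, target_params,
                policy_delay=2, tau=0.005):
    """Call critic_step every iteration; call actor_step and move the targets
    only every `policy_delay` iterations. Returns the final target parameters."""
    for t in range(1, num_steps + 1):
        critic_step()  # critics are updated on every iteration
        if t % policy_delay == 0:
            actor_step()  # delayed policy update
            target_params = soft_update(target_params, params, tau)
    return target_params

# Toy usage: count how often each update runs over six iterations.
critic_calls, actor_calls = [], []
final_targets = run_updates(
    6,
    lambda: critic_calls.append(1),
    lambda: actor_calls.append(1),
    params=[np.ones(2)],
    target_params=[np.zeros(2)],
    policy_delay=2,
    tau=0.5,
)
print(len(critic_calls), len(actor_calls))
```

In a real training loop `params` would change after every gradient step; they are held fixed here only to keep the example short.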

Source code: https://github.com/sfujim/TD3
