As in Q-learning, modern actor-critic methods suffer from value estimation errors caused by bias and variance. While there have been many attempts to address this in Q-learning (such as Double DQN), little had been done for actor-critic methods. The authors propose three modifications to DDPG and empirically show that they help address both the bias and the variance issues:

1. Clipped Double Q-Learning: Add a second pair of critics, $Q_{\theta_2}$ and $Q_{\theta_{\text{target},2}}$ (so four critic networks in total), and take the minimum of the two target critics' estimates to curb overestimation in the target update: $y = r + \gamma \min\limits_{i=1,2} Q_{\theta_{\text{target},i}}(s', \pi_{\phi_1}(s'))$
2. Delayed updates: Reduce the frequency of policy and target network updates, and keep the target network updates small (soft/Polyak averaging): $\theta_{\text{target}} \leftarrow \tau\theta + (1-\tau)\theta_{\text{target}}$
3. Target policy smoothing: Inject clipped random noise into the target policy's action: $\hat{a} \leftarrow \pi_{\phi_{\text{target}}}(s) + \text{clip}(\mathcal{N}(0,\sigma), -c, c)$

Implementing these changes, the authors show significant improvements on seven continuous control tasks, beating not only the reference DDPG algorithm but also PPO, TRPO and ACKTR.

Full algorithm from the paper: https://i.imgur.com/rRjwDyT.png

Source code: https://github.com/sfujim/TD3
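The three modifications can be sketched in a few lines. This is a minimal NumPy illustration of the target computation, not the authors' implementation; function names, signatures, and default hyperparameters ($\tau$, $\sigma$, $c$) are my own choices for the sketch:

```python
import numpy as np

def td3_target(r, gamma, q1_target, q2_target, done=0.0):
    # Clipped Double Q-Learning: take the minimum of the two
    # target critics' estimates to curb overestimation bias.
    return r + gamma * (1.0 - done) * min(q1_target, q2_target)

def polyak_update(theta, theta_target, tau=0.005):
    # Soft target update: theta_target <- tau*theta + (1-tau)*theta_target,
    # applied only every d-th iteration in the delayed-update scheme.
    return tau * theta + (1.0 - tau) * theta_target

def smoothed_target_action(pi_target_s, sigma=0.2, c=0.5,
                           low=-1.0, high=1.0, rng=None):
    # Target policy smoothing: add Gaussian noise clipped to [-c, c]
    # to the target policy's action, then clip to the action range.
    rng = np.random.default_rng(0) if rng is None else rng
    noise = np.clip(rng.normal(0.0, sigma, size=np.shape(pi_target_s)), -c, c)
    return np.clip(pi_target_s + noise, low, high)
```

For example, `td3_target(1.0, 0.99, 2.0, 3.0)` uses the smaller estimate (2.0) and returns 2.98, even though one critic thinks the value is higher.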