Benchmarking Batch Deep Reinforcement Learning Algorithms on ShortScience.org

Benchmarking Batch Deep Reinforcement Learning Algorithms
Scott Fujimoto and Edoardo Conti and Mohammad Ghavamzadeh and Joelle Pineau
arXiv e-Print archive - 2019 via Local arXiv
Keywords: cs.LG, cs.AI, stat.ML

Summary by Robert Müller 4 years ago

The authors propose a unified setting for evaluating batch reinforcement learning algorithms. The proposed benchmark uses discrete action spaces and is based on the popular Atari domain. The authors review and benchmark several current batch RL algorithms against a newly introduced version of BCQ (Batch-Constrained deep Q-learning) for discrete environments.

https://i.imgur.com/zrCZ173.png

Note in line 5 of the algorithm above that the policy chooses actions with a restricted argmax operation, which excludes actions that do not have enough support in the batch.
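The restricted argmax can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "support" is measured by an estimate of the behavioral policy's action probabilities, and that an action is kept when its probability relative to the most likely action exceeds a threshold. The function name, the argument names, and the default threshold of 0.3 are assumptions made for this example.

```python
import numpy as np

def restricted_argmax(q_values, behavior_probs, threshold=0.3):
    """Pick the highest-valued action among those with enough batch support.

    q_values       : Q(s, a) for every discrete action a
    behavior_probs : estimated probability of each action under the
                     behavioral policy (hypothetical input for this sketch)
    threshold      : minimum probability relative to the most likely action
    """
    q_values = np.asarray(q_values, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)

    # Scale so that the most likely action has relative probability 1.
    relative = behavior_probs / behavior_probs.max()

    # Actions below the threshold are treated as unsupported and masked out.
    allowed = relative > threshold
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked_q))

# Action 1 has the largest Q-value but almost no support in the batch,
# so the restricted argmax falls back to action 2.
best = restricted_argmax([1.0, 5.0, 2.0], [0.60, 0.05, 0.35])
```

Because the most likely action always has relative probability 1, at least one action survives the mask for any threshold below 1.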

One of the key difficulties in batch RL is the divergence of value estimates. To limit it, the authors use Double DQN: the next action is selected with the value network $Q_{\theta}$ and its value is evaluated with a target network $Q_{\theta'}$ (line 6).
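To make the split between selection and evaluation concrete, here is a small sketch of a standard Double DQN target for a single transition. It is not taken from the paper's code; the function name, the argument names, and the discount factor are assumptions. In the discrete BCQ setting, the plain `argmax` below would be replaced by the restricted argmax over supported actions.

```python
import numpy as np

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target for one transition (s, a, r, s').

    q_online_next : Q_theta(s', .)   used only to select the next action
    q_target_next : Q_theta'(s', .)  used only to evaluate that action
    """
    q_online_next = np.asarray(q_online_next, dtype=float)
    q_target_next = np.asarray(q_target_next, dtype=float)

    # Selection with the online network.
    next_action = int(np.argmax(q_online_next))

    # Evaluation with the target network; no bootstrap on terminal states.
    bootstrap = (1.0 - float(done)) * q_target_next[next_action]
    return reward + gamma * bootstrap

# The online network prefers action 1, so the target uses Q_theta'(s', 1) = 1.0
# even though the target network's own maximum is 2.0 for action 0.
y = double_dqn_target(1.0, False, [0.1, 0.9], [2.0, 1.0], gamma=0.5)
```

Decoupling the two networks avoids taking a maximum over the same noisy estimates that are then used as the value, which is the usual source of overestimation in plain DQN.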

**How is the batch created?**
A partially trained DQN agent (trained online for 10 million steps, i.e. 40 million frames) is used as the behavioral policy to collect a batch $B$ containing 10 million transitions. The behavioral policy is $\epsilon$-greedy, using $\epsilon = 0.2$ with probability 0.8 and $\epsilon = 0.001$ with probability 0.2. The batch RL agents are trained on this batch for 10 million steps and evaluated every 50k time steps for 10 episodes. This process of batch creation differs from the settings used in other papers in i) having only a single behavioral policy, ii) the batch size and iii) the proficiency level of the behavioral policy.
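The mixed exploration scheme of the behavioral policy can be sketched as below. This is an illustration under assumptions, not the authors' data-collection code: the helper names are made up, and the summary does not say how often $\epsilon$ is resampled, so the sketch leaves that choice to the caller.

```python
import numpy as np

def sample_epsilon(rng, eps_high=0.2, eps_low=0.001, p_high=0.8):
    """Draw the exploration rate: eps_high with probability p_high, else eps_low."""
    return eps_high if rng.random() < p_high else eps_low

def behavioral_action(q_values, epsilon, rng):
    """Epsilon-greedy action from the partially trained DQN's Q-values."""
    if rng.random() < epsilon:
        # Explore: uniform random action.
        return int(rng.integers(len(q_values)))
    # Exploit: greedy action.
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
epsilon = sample_epsilon(rng)  # how often to resample is an assumption left open
action = behavioral_action([0.1, 0.7, 0.2], epsilon, rng)
```

Mixing a noisy and a nearly greedy setting gives the batch both exploratory transitions and transitions close to the behavioral policy's best play.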

The experiments, performed on the Arcade Learning Environment, include DQN, REM, QR-DQN, KL-Control, BCQ, OnlineDQN and Behavioral Cloning, and show that:
- among conventional RL algorithms, the distributional algorithm (QR-DQN) outperforms the plain algorithm (DQN)
- batch RL algorithms perform better than conventional algorithms, with BCQ outperforming every other algorithm in every tested game

In addition to the return, the authors plot the value estimates of the Q-networks. In all cases, a drop in performance corresponds to a divergence (up or down) in the value estimates.

The paper is an important contribution to the debate about the right setting for evaluating batch RL algorithms. It remains to be seen, however, whether the proposed choice of i) a single behavioral policy, ii) the batch size and iii) the quality level of the behavioral policy will be accepted as standard. In any case, further work is required to decide upon a benchmark for continuous domains.
