Behavior Regularized Offline Reinforcement Learning on ShortScience.org

Behavior Regularized Offline Reinforcement Learning
Wu, Yifan and Tucker, George and Nachum, Ofir
arXiv e-Print archive - 2019 via Local Bibsonomy

Summary by Robert Müller 3 years ago

Wu et al. propose behavior regularized actor critic (BRAC), a framework they use to study empirically how different design choices affect batch reinforcement learning (RL). Specific instantiations of the framework include BCQ, KL-Control and BEAR.

Pure off-policy RL is the problem of learning a policy solely from a batch $B$ of one-step transitions collected with a behavior policy $\pi_b$. The setting allows no further interaction with the environment. This learning regime is desirable in high-stakes scenarios such as education or health care.

The core principle of batch RL algorithms is to stay, in some sense, close to the behavior policy. The paper first proposes to incorporate this via a regularization term in the value function, called the **value penalty**. In this case the value function of BRAC takes the following form:

$
V_D^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{s_t \sim P_t^{\pi}(s)}\left[R^{\pi}(s_t) - \alpha D\left(\pi(\cdot \vert s_t) \Vert \pi_b(\cdot \vert s_t)\right)\right],
$

where $\pi_b$ is the maximum likelihood estimate of the behavior policy based upon $B$, $D$ is a divergence between action distributions and $\alpha$ is the regularization weight.
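As a rough numerical illustration of the value penalty, the sketch below evaluates the penalized discounted sum for a single sampled trajectory, a one-sample Monte Carlo stand-in for the expectation in the formula above. The function name `penalized_return`, the example numbers and the assumption that per-step divergence values are already available are illustrative and not taken from the paper.

```python
def penalized_return(rewards, divergences, gamma=0.99, alpha=0.1):
    """Discounted sum of (reward - alpha * divergence) along one trajectory.

    rewards[t] plays the role of R(s_t); divergences[t] is an assumed,
    precomputed value of D(pi(.|s_t) || pi_b(.|s_t)).
    """
    if len(rewards) != len(divergences):
        raise ValueError("rewards and divergences must have the same length")
    total = 0.0
    for t, (r, d) in enumerate(zip(rewards, divergences)):
        total += (gamma ** t) * (r - alpha * d)
    return total


# Three steps with reward 1 each; the policy drifts away from pi_b over time.
rewards = [1.0, 1.0, 1.0]
divergences = [0.0, 0.5, 2.0]

print(penalized_return(rewards, divergences, gamma=0.5, alpha=0.0))  # 1.75
print(penalized_return(rewards, divergences, gamma=0.5, alpha=1.0))  # 1.0
```

With $\alpha = 0$ the function reduces to the ordinary discounted return; a positive $\alpha$ lowers the value of trajectories on which the policy moves away from $\pi_b$.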
This results in a Q-function objective, where $\bar{Q}$ denotes a target Q-function:
$
\min_{Q} \mathbb{E}_{\substack{(s,a,r,s') \sim B \\ a' \sim \pi_{\theta}(\cdot \vert s')}}\left[\left(r + \gamma \left(\bar{Q}(s',a') - \alpha D\left(\pi_{\theta}(\cdot \vert s') \Vert \pi_b(\cdot \vert s')\right)\right) - Q(s,a)\right)^2\right]
$

and the corresponding policy update:
$
\max_{\pi_{\theta}} \mathbb{E}_{(s,a,r,s') \sim B} \left[ \mathbb{E}_{a'' \sim \pi_{\theta}(\cdot \vert s)}\left[Q(s,a'')\right] - \alpha D\left(\pi_{\theta}(\cdot \vert s) \Vert \pi_b(\cdot \vert s)\right) \right]
$
 
The second approach is **policy regularization**. Here the regularization weight $\alpha$ is set to zero in the value objectives (V and Q) and is non-zero only in the policy objective.
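The difference between the two variants can be sketched in plain Python for a single transition. All names here (`td_target`, `critic_loss`, `actor_objective`, the `value_penalty` flag) are hypothetical, and Q-values and divergence estimates are passed in as plain numbers rather than produced by networks. The sketch assumes the divergence inside the bootstrapped target is evaluated at the next state $s'$.

```python
def td_target(r, q_next, div_next, gamma, alpha, value_penalty):
    """Bootstrapped target r + gamma * (Q_bar(s', a') - alpha_q * D(s')).

    value_penalty=True  -> alpha_q = alpha (value penalty variant)
    value_penalty=False -> alpha_q = 0     (policy regularization variant)
    """
    alpha_q = alpha if value_penalty else 0.0
    return r + gamma * (q_next - alpha_q * div_next)


def critic_loss(q, r, q_next, div_next, gamma, alpha, value_penalty):
    """Squared TD error for one transition."""
    target = td_target(r, q_next, div_next, gamma, alpha, value_penalty)
    return (target - q) ** 2


def actor_objective(q_pi, div, alpha):
    """Per-state quantity the policy maximizes: E_a[Q(s, a)] - alpha * D(s).

    The penalty is present here in both variants.
    """
    return q_pi - alpha * div


# Illustrative numbers only.
print(td_target(1.0, 10.0, 2.0, gamma=0.5, alpha=0.5, value_penalty=True))   # 5.5
print(td_target(1.0, 10.0, 2.0, gamma=0.5, alpha=0.5, value_penalty=False))  # 6.0
print(actor_objective(q_pi=10.0, div=2.0, alpha=0.5))                        # 9.0
```

The only switch between the two variants is whether the penalty enters the target; the policy objective is the same in both.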

For example, the following batch RL algorithms can be instantiated in this framework:
- BEAR: policy regularization with sample-based kernel MMD as $D$ and a min-max mixture of the two ensemble elements for $\bar{Q}$
- BCQ: no regularization, but policy optimization over a restricted space
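The two ingredients named for BEAR can be sketched as follows: a sample-based squared MMD between actions drawn from two policies, and a min-max mixture over ensemble Q-values. This is a minimal stdlib sketch under stated assumptions: the Gaussian kernel, the bandwidth `sigma`, the biased (V-statistic) estimator and the mixture weight `lam` are illustrative choices, and the function names are hypothetical rather than taken from any released code.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two action vectors given as lists."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased sample estimate of squared MMD between two sets of actions."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


def min_max_mixture(q_values, lam=0.75):
    """Target value lam * min(Q_i) + (1 - lam) * max(Q_i) over an ensemble."""
    return lam * min(q_values) + (1.0 - lam) * max(q_values)


# Actions sampled from the learned policy and from the behavior policy
# (made-up one-dimensional values).
policy_actions = [[0.0], [0.1]]
behavior_actions = [[2.0], [2.1]]

print(mmd_squared(policy_actions, policy_actions))    # 0.0 for identical samples
print(mmd_squared(policy_actions, behavior_actions))  # positive
print(min_max_mixture([1.0, 3.0], lam=0.75))          # 1.5
```

Setting `lam=1.0` recovers the plain minimum over the ensemble, which is the alternative compared against the mixture in the experiments.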

Extensive experiments on the four MuJoCo tasks Ant, HalfCheetah, Hopper and Walker show:
1. for a BEAR-like instantiation there is a modest advantage to keeping $\alpha$ fixed rather than training it adaptively
2. using a mixture over an ensemble of two or four Q-networks as the target value yields better returns than using one Q-network
3. taking the minimum of the ensemble Q-functions is slightly better than taking a mixture (for Ant, HalfCheetah and Walker, but not for Hopper)
4. the value penalty yields higher returns than policy regularization
5. no choice for $D$ (MMD, KL (primal), KL (dual) or Wasserstein (dual)) significantly outperforms the others (note that this contradicts the BEAR paper, where MMD was better than KL)
6. the value penalty version consistently outperforms BEAR, which in turn outperforms BCQ, which improves upon a partially trained baseline.
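To make one of the divergence choices in finding 5 concrete, the primal KL form can be estimated from actions sampled from the learned policy, as the average of $\log \pi_{\theta}(a \vert s) - \log \pi_b(a \vert s)$. The sketch below assumes one-dimensional Gaussian policies at a single state so that the estimate can be checked against the closed form; the function names, sample count and seed are illustrative.

```python
import math
import random


def gaussian_log_prob(a, mean, std):
    """Log density of a 1-D Gaussian at action a."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_primal_estimate(mean_pi, std_pi, mean_b, std_b, n_samples=20000, seed=0):
    """Monte Carlo estimate of KL(pi || pi_b) using samples a ~ pi."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mean_pi, std_pi)
        total += gaussian_log_prob(a, mean_pi, std_pi) - gaussian_log_prob(a, mean_b, std_b)
    return total / n_samples


def kl_gaussian_closed_form(mean_pi, std_pi, mean_b, std_b):
    """Exact KL divergence between two 1-D Gaussians, for comparison."""
    return (math.log(std_b / std_pi)
            + (std_pi ** 2 + (mean_pi - mean_b) ** 2) / (2.0 * std_b ** 2)
            - 0.5)


print(kl_gaussian_closed_form(0.0, 1.0, 1.0, 1.0))  # 0.5
print(kl_primal_estimate(0.0, 1.0, 1.0, 1.0))       # close to 0.5
```

In an actual agent the log-probabilities would come from the learned policy and the fitted behavior policy rather than from fixed Gaussians.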

This large-scale study of different design choices helps in developing new methods. It is, however, surprising that most design choices in current methods are shown empirically not to be crucial. This points to the importance of agreeing upon common test scenarios within a community to prevent overfitting new algorithms to a particular setting.
