Summary by Robert Müller 3 months ago
Interacting with the environment can be costly, for example in high-stakes settings such as health care or teaching. Instead of learning online, we might therefore want to learn from a fixed buffer $B$ of transitions that is filled in advance by a behavior policy.
The authors show that several so-called off-policy algorithms, such as DQN and DDPG, fail dramatically in this purely off-policy setting.
They attribute this to extrapolation error, which occurs when updating a value estimate $Q(s,a)$ if the target policy selects an unfamiliar action $\pi(s')$ such that $(s', \pi(s'))$ is unlikely or absent in $B$. Extrapolation error is caused by the mismatch between the true state-action visitation distribution of the current policy and the state-action distribution in $B$ due to:
- state-action pairs $(s,a)$ missing in $B$, resulting in arbitrarily bad estimates of $Q_{\theta}(s, a)$ when there is not enough data close to $(s,a)$.
- the finiteness of the batch of transition tuples $B$, leading to a biased estimate of the transition dynamics in the Bellman operator $T^{\pi}Q(s,a) \approx \mathbb{E}_{\boldsymbol{s' \sim B}}\left[r + \gamma Q(s', \pi(s')) \right]$
- transitions being sampled uniformly from $B$, resulting in a loss weighted by the frequency of data in the batch: $\frac{1}{\vert B \vert} \sum_{\boldsymbol{(s, a, r, s') \sim B}} \Vert r + \gamma Q(s', \pi(s')) - Q(s, a)\Vert^2$
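To make the first point concrete, here is a minimal tabular sketch. It is not taken from the paper: the toy batch, the Q-table values and the names `batch_loss`, `greedy_policy` and `constrained_policy` are all made up for illustration. Action 2 never appears in the batch, so its Q-values are never corrected by data, yet an unconstrained greedy policy bootstraps from them.

```python
import numpy as np

# Hypothetical toy problem: 2 states, 3 actions.
# The batch only contains actions 0 and 1; action 2 is never observed.
batch = [
    # (s, a, r, s_next)
    (0, 0, 1.0, 1),
    (0, 1, 0.0, 1),
    (1, 0, 0.5, 0),
    (1, 1, 0.2, 0),
]
gamma = 0.9

# Assumed Q-table: column 2 holds an arbitrary (here optimistic) value
# that no transition in the batch can ever correct.
Q = np.array([[1.0, 0.5, 10.0],
              [0.8, 0.4, 10.0]])

def batch_loss(Q, policy, batch, gamma):
    """Mean squared TD error over the batch, i.e. the loss from the list above."""
    errors = [r + gamma * Q[s_next, policy[s_next]] - Q[s, a]
              for s, a, r, s_next in batch]
    return float(np.mean(np.square(errors)))

# Unconstrained greedy policy: picks the unseen action 2 in both states.
greedy_policy = Q.argmax(axis=1)

# Batch-constrained greedy policy: only considers (s, a) pairs present in the batch.
seen = np.zeros_like(Q, dtype=bool)
for s, a, _, _ in batch:
    seen[s, a] = True
constrained_policy = np.where(seen, Q, -np.inf).argmax(axis=1)

print("greedy:", greedy_policy, "loss:", batch_loss(Q, greedy_policy, batch, gamma))
print("constrained:", constrained_policy, "loss:", batch_loss(Q, constrained_policy, batch, gamma))
```

In this toy case the unconstrained targets are dominated by the unverifiable value of action 2, while the constrained policy only bootstraps from pairs the batch can actually inform.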
The proposed algorithm, Batch-Constrained deep Q-learning (BCQ), aims to choose actions that:
1. minimize the distance of the taken actions to the actions in the batch,
2. lead to states contained in the buffer,
3. maximize the value function,
where 1. is prioritized over the other two goals to mitigate the extrapolation error.
Their proposed algorithm (for continuous environments) informally consists of the following steps, which are repeated at each time $t$:
1. update the generator model of the state-conditional marginal likelihood $P_B^G(a \vert s)$
2. sample $n$ actions from the generator model
3. perturb each of the sampled actions by an adjustment restricted to the range $\left[-\Phi, \Phi \right]$
4. act according to the argmax of the respective Q-values of the perturbed actions
5. update the value function
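Steps 2 to 4 can be sketched as follows. This is only an illustration under strong assumptions: `generator`, `perturbation` and `q_value` are hypothetical hand-written stand-ins for the learned networks, the values of `PHI`, `N_SAMPLES` and `MAX_ACTION` are arbitrary, and state and action are given the same dimension purely to keep the toy functions short. The training updates of steps 1 and 5 are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
PHI = 0.05         # assumed bound on the perturbation
N_SAMPLES = 10     # assumed number of candidate actions per state
MAX_ACTION = 1.0   # assumed action bound of the environment

def generator(state, n):
    """Hypothetical stand-in for the generative model: n actions close to batch actions."""
    behavior_action = np.tanh(state)  # placeholder for "what the batch would do"
    return behavior_action + 0.1 * rng.standard_normal((n, state.shape[0]))

def perturbation(state, actions):
    """Hypothetical stand-in for the perturbation model; output lies in [-PHI, PHI]."""
    return PHI * np.tanh(state - actions)

def q_value(state, actions):
    """Hypothetical stand-in for the Q-network: one value per candidate action."""
    return -np.sum((actions - 0.5 * state) ** 2, axis=1)

def select_action(state):
    # Step 2: sample candidate actions from the generator.
    candidates = generator(state, N_SAMPLES)
    # Step 3: adjust each candidate by a bounded perturbation.
    perturbed = np.clip(candidates + perturbation(state, candidates),
                        -MAX_ACTION, MAX_ACTION)
    # Step 4: act greedily among the perturbed candidates only.
    best = np.argmax(q_value(state, perturbed))
    return perturbed[best], candidates, perturbed

state = np.array([0.3, -0.2])
action, candidates, perturbed = select_action(state)
print(action)
```

The bound `PHI` controls the trade-off: with `PHI = 0` and a single sample the agent simply imitates the generator, while a large `PHI` and many samples move it towards unconstrained Q-learning.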
The experiments consider MuJoCo tasks with four scenarios of batch data creation:
- 1 million time steps from training a DDPG agent with exploration noise $\mathcal{N}(0,0.5)$ added to the action. This aims for a diverse set of states and actions.
- 1 million time steps from training a DDPG agent with exploration noise $\mathcal{N}(0,0.1)$ added to the actions as behavior policy. The batch-RL agent and the behavior DDPG are trained concurrently from the same buffer.
- 1 million transitions from rolling out an already trained DDPG agent
- 100k transitions from a behavior policy that acts randomly with probability 0.3 and otherwise follows an expert demonstration with added exploration noise $\mathcal{N}(0,0.3)$
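The fourth scenario can be sketched as a small sampling function. The summary does not say how the random action is drawn or whether 0.3 is a standard deviation or a variance, so the sketch assumes a uniform random action, treats 0.3 as the standard deviation, and clips to an assumed action bound; `imperfect_behavior_action` and its parameters are hypothetical names.

```python
import numpy as np

def imperfect_behavior_action(expert_action, rng, p_random=0.3,
                              noise_std=0.3, max_action=1.0):
    """Return one action of the noisy, partly random behavior policy (assumed form)."""
    if rng.random() < p_random:
        # With probability p_random: ignore the expert and act randomly
        # (uniform sampling is an assumption).
        return rng.uniform(-max_action, max_action, size=expert_action.shape)
    # Otherwise: follow the expert action with Gaussian exploration noise.
    noisy = expert_action + rng.normal(0.0, noise_std, size=expert_action.shape)
    return np.clip(noisy, -max_action, max_action)

rng = np.random.default_rng(0)
expert_action = np.array([0.8, -0.1])  # made-up expert action
actions = np.stack([imperfect_behavior_action(expert_action, rng)
                    for _ in range(1000)])
print(actions.mean(axis=0))
```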
I like the fourth choice of behavior policy the most, as it comes closest to high-stakes scenarios like education or medicine, in which the training data would be acquired from human experts who are, being human, not optimal but still significantly better than a policy learned from scratch.
The proposed BCQ algorithm is the only algorithm that is successful across all experiments. It matches or outperforms the behavior policy. An evaluation of the value estimates shows unstable and diverging estimates for all algorithms except BCQ, which exhibits a stable value function.
The paper outlines a very important issue that needs to be tackled in order to use reinforcement learning in real-world applications.
