[link]
The typical model based reinforcement learning (RL) loop consists of collecting data, training a model of the environment, using the model to do model predictive control (MPC). If however the model is wrong, for example for state-action pairs that have been barely visited, the dynamics model might be very wrong and the MPC fails as the imagined model and the reality align to longer. Boney et a. propose to tackle this with a denoising autoencoder for trajectory regularization according to the familiarity of a trajectory. MPC uses at each time t the learned model $s_{t+1} = f_{\theta}(s_t, a_t)$ to select a plan of actions, that is maximizing the sum of expected future reward: $ G(a_t, \dots, a_{t+h}) = \mathbb{E}[\sum_{k=t}^{t+H}r(o_t, a_t)] ,$ where $r(o_t, a_t)$ is the observation and action dependent reward. The plan obtained by trajectory optimization is subsequently unrolled for H steps. Boney et al. propose to regularize trajectories by the familiarity of the visited states leading to the regularized objective: $G_{reg} = G + \alpha \log p(o_k, a_k, \dots, o_{t+H}, a_{t+H}) $ Instead of regularizing over the whole trajectory they propose to regularize over marginal probabilities of windows of length w: $G_{reg} = G + \alpha \sum_{k = t}^{t+H-w} \log p(x_k), \text{ where } x_k = (o_k, a_k, \dots, o_{t+w}, a_{t+w}).$ Instead of explicitly learning a generative model of the familiarity $p(x_k)$ a denoising auto-encoder is used that approximates instead the derivative of the log probability density $\frac{\delta}{\delta x} \log p(x)$. This allows the following back-propagation rule: $ \frac{\delta G_{reg}}{\delta_i} = \frac{\delta G}{\delta a_i} + \alpha \sum_{k = i}^{i+w} \frac{\delta x_k}{\delta a_i} \frac{\delta}{\delta x} \log p(x).$ The experiments show that the proposed method has competitive sample-efficiency. For example on Halfcheetah the asymptotic performance of PETS is not matched. This is due to the biggest limitation of this approach, the hindering of exploration. Penalizing the unfamiliarity of states is in contrast to approaches like optimism in the face of uncertainty, which is a core principle of exploration. Aiming to avoid states of high unfamiliarity, the proposed method is the precise opposite of curiosity driven exploration. The appendix proposes preliminary experiments to account for exploration. I would expect, that the pure penalization of unfamiliarity works best in a batch RL setting, which would be an interesting extension of this work. |
[link]
Kumar et al. propose an algorithm to learn in batch reinforcement learning (RL), a setting where an agent learns purely form a fixed batch of data, $B$, without any interactions with the environments. The data in the batch is collected according to a batch policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which constrains the newly learned policy to place some probability mass on every non negligible action. The difference is illustrated in the picture from the BEAR blog post: https://i.imgur.com/zUw7XNt.png The behavior policy is in both images the dotted red line, the left image shows the policy matching where the algorithm is constrained to the purple choices, while the right image shows the support matching. **Theoretical Contribution:** The paper analysis formally how the use of out-of-distribution actions to compute the target in the Bellman equation influences the back-propagated error. Firstly a distribution constrained backup operator is defined as $T^{\Pi}Q(s,a) = \mathbb{E}[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s' \vert s,a)} V(s')]$ and $V(s) = \max_{\pi \in \Pi} \mathbb{E}_{\pi}[Q(s,a)]$ which considers only policies $\pi \in \Pi$. It is possible that the optimal policy $\pi^*$ is not contained in the policy set $\Pi$, thus there is a suboptimallity constant $\alpha (\Pi) = \max_{s,a} \vert \mathcal{T}^{\Pi}Q^{*}(s,a) - \mathcal{T}Q^{*}(s,a) ]\vert $ which captures how far $\pi^{*}$ is from $\Pi$. Letting $P^{\pi_i}$ be the transition-matrix when following policy $\pi_i$, $\rho_0$ the state marginal distribution of the training data in the batch and $\pi_1, \dots, \pi_k \in \Pi $. The error analysis relies upon a concentrability assumption $\rho_0 P^{\pi_1} \dots P^{\pi_k} \leq c(k)\mu(s)$, with $\mu(s)$ the state marginal. Note that $c(k)$ might be infinite if the support of $\Pi$ is not contained in the state marginal of the batch. Using the coefficients $c(k)$ a concentrability coefficient is defined as: $C(\Pi) = (1-\gamma)^2\sum_{k=1}^{\infty}k \gamma^{k-1}c(k).$ The concentrability takes values between 1 und $\infty$, where 1 corresponds to the case that the batch data were collected by $\pi$ and $\Pi = \{\pi\}$ and $\infty$ to cases where $\Pi$ has support outside of $\pi$. Combining this Kumar et a. get a bound of the Bellman error for distribution constrained value iteration with the constrained Bellman operator $T^{\Pi}$: $\lim_{k \rightarrow \infty} \mathbb{E}_{\rho_0}[\vert V^{\pi_k}(s)- V^{*}(s)] \leq \frac{\gamma}{(1-\gamma^2)} [C(\Pi) \mathbb{E}_{\mu}[\max_{\pi \in \Pi}\mathbb{E}_{\pi}[\delta(s,a)] + \frac{1-\gamma}{\gamma}\alpha(\Pi) ] ]$, where $\delta(s,a)$ is the Bellman error. This presents the inherent batch RL trade-off between keeping policies close to the behavior policy of the batch (captured by $C(\Pi)$ and keeping $\Pi$ sufficiently large (captured by $\alpha(\Pi)$). It is finally proposed to use support sets to construct $\Pi$, that is $\Pi_{\epsilon} = \{\pi \vert \pi(a \vert s)=0 \text{ whenever } \beta(a \vert s) < \epsilon \}$. This amounts to the set of all policies that place probability on all non-negligible actions of the behavior policy. For this particular choice of $\Pi = \Pi_{\epsilon}$ the concentrability coefficient can be bounded. **Algorithm**: The algorithm has an actor critic style, where the Q-value to update the policy is taken to be the minimum over the ensemble. The support constraint to place at least some probability mass on every non negligible action from the batch is enforced via sampled MMD. The proposed algorithm is a member of the policy regularized algorithms as the policy is updated to optimize: $\pi_{\Phi} = \max_{\pi} \mathbb{E}_{s \sim B} \mathbb{E}_{a \sim \pi(\cdot \vert s)} [min_{j = 1 \dots, k} Q_j(s,a)] s.t. \mathbb{E}_{s \sim B}[MMD(D(s), \pi(\cdot \vert s))] \leq \epsilon$ The Bellman target to update the Q-functions is computed as the convex combination of minimum and maximum of the ensemble. **Experiments** The experiments use the Mujoco environments Halfcheetah, Walker, Hopper and Ant. Three scenarios of batch collection, always consisting of 1Mio. samples, are considered: - completely random behavior policy - partially trained behavior policy - optimal policy as behavior policy The experiments confirm that BEAR outperforms other off-policy methods like BCQ or KL-control. The ablations show further that the choice of MMD is crucial as it is sometimes on par and sometimes substantially better than choosing KL-divergence. |
[link]
Wu et al. provide a framework (behavior regularized actor critic (BRAC)) which they use to empirically study the impact of different design choices in batch reinforcement learning (RL). Specific instantiations of the framework include BCQ, KL-Control and BEAR. Pure off-policy rl describes the problem of learning a policy purely from a batch $B$ of one step transitions collected with a behavior policy $\pi_b$. The setting allows for no further interactions with the environment. This learning regime is for example in high stake scenarios, like education or heath care, desirable. The core principle of batch RL-algorithms in to stay in some sense close to the behavior policy. The paper proposes to incorporate this firstly via a regularization term in the value function, which is denoted as **value penalty**. In this case the value function of BRAC takes the following form: $ V_D^{\pi}(s) = \sum_{t=0}^{\infty} \gamma ^t \mathbb{E}_{s_t \sim P_t^{\pi}(s)}[R^{pi}(s_t)- \alpha D(\pi(\cdot\vert s_t) \Vert \pi_b(\cdot \vert s_t)))], $ where $\pi_b$ is the maximum likelihood estimate of the behavior policy based upon $B$. This results in a Q-function objective: $\min_{Q} = \mathbb{E}_{\substack{(s,a,r,s') \sim D \\ a' \sim \pi_{\theta}(\cdot \vert s)}}\left[(r + \gamma \left(\bar{Q}(s',a')-\alpha D(\pi(\cdot\vert s) \Vert \pi_b(\cdot \vert s) \right) - Q(s,a) \right] $ and the corresponding policy update: $ \max_{\pi_{\theta}} \mathbb{E}_{(s,a,r,s') \sim D} \left[ \mathbb{E}_{a^{''} \sim \pi_{\theta}(\cdot \vert s)}[Q(s,a^{''})] - \alpha D(\pi(\cdot\vert s) \Vert \pi_b(\cdot \vert s) \right] $ The second approach is **policy regularization** . Here the regularization weight $\alpha$ is set for value-objectives (V- and Q) to zero and is non-zero for the policy objective. It is possible to instantiate for example the following batch RL algorithms in this setting: - BEAR: policy regularization with sample-based kernel MMD as D and min-max mixture of the two ensemble elements for $\bar{Q}$ - BCQ: no regularization but policy optimization over restricted space Extensive Experiments over the four Mujoco tasks Ant, HalfCheetah,Hopper Walker show: 1. for a BEAR like instantiation there is a modest advantage of keeping $\alpha$ fixed 2. using a mixture of a two or four Q-networks ensemble as target value yields better returns that using one Q-network 3. taking the minimum of ensemble Q-functions is slightly better than taking a mixture (for Ant, HalfCeetah & Walker, but not for Hooper 4. the use of value-penalty yields higher return than the policy-penalty 5. no choice for D (MMD, KL (primal), KL(dual) or Wasserstein (dual)) significantly outperforms the other (note that his contradicts the BEAR paper where MMD was better than KL) 6. the value penalty version consistently outperforms BEAR which in turn outperforms BCQ with improves upon a partially trained baseline. This large scale study of different design choices helps in developing new methods. It is however surprising to see, that most design choices in current methods are shown empirically to be non crucial. This points to the importance of agreeing upon common test scenarios within a community to prevent over-fitting new algorithms to a particular setting. |
[link]
Rakelly et al. propose a method to do off-policy meta reinforcement learning (rl). The method achieves a 20-100x improvement on sample efficiency compared to on-policy meta rl like MAML+TRPO. The key difficulty for offline meta rl arises from the meta-learning assumption, that meta-training and meta-test time match. However during test time the policy has to explore and sees as such on-policy data which is in contrast to the off-policy data that should be used at meta-training. The key contribution of PEARL is an algorithm that allows for online task inference in a latent variable at train and test time, which is used to train a Soft Actor Critic, a very sample efficient off-policy algorithm, with additional dependence of the latent variable. The implementation of Rakelly et al. proposes to capture knowledge about the current task in a latent stochastic variable Z. A inference network $q_{\Phi}(z \vert c)$ is used to predict the posterior over latents given context c of the current task in from of transition tuples $(s,a,r,s')$ and trained with an information bottleneck. Note that the task inference is done on samples according to a sampling strategy sampling more recent transitions. The latent z is used as an additional input to policy $\pi(a \vert s, z)$ and Q-function $Q(a,s,z)$ of a soft actor critic algorithm which is trained with offline data of the full replay buffer. https://i.imgur.com/wzlmlxU.png So the challenge of differing conditions at test and train times is resolved by sampling the content for the latent context variable at train time only from very recent transitions (which is almost on-policy) and at test time by construction on-policy. Sampling $z \sim q(z \vert c)$ at test time allows for posterior sampling of the latent variable, yielding efficient exploration. The experiments are performed across 6 Mujoco tasks with ProMP, MAML+TRPO and $RL^2$ with PPO as baselines. They show: - PEARL is 20-100x more sample-efficient - the posterior sampling of the latent context variable enables deep exploration that is crucial for sparse reward settings - the inference network could be also a RNN, however it is crucial to train it with uncorrelated transitions instead of trajectories that have high correlated transitions - using a deterministic latent variable, i.e. reducing $q_{\Phi}(z \vert c)$ to a point estimate, leaves the algorithm unable to solve sparse reward navigation tasks which is attributed to the lack of temporally extended exploration. The paper introduces an algorithm that allows to combine meta learning with an off-policy algorithm that dramatically increases the sample-efficiency compared to on-policy meta learning approaches. This increases the chance of seeing meta rl in any sort of real world applications. |
[link]
Bechtle et al. propose meta learning via learned loss ($ML^3$) and derive and empirically evaluate the framework on classification, regression, model-based and model-free reinforcement learning tasks. The problem is formalized as learning parameters $\Phi$ of a meta loss function $M_\phi$ that computes loss values $L_{learned} = M_{\Phi}(y, f_{\theta}(x))$. Following the outer-inner loop meta algorithm design the learned loss $L_{learned}$ is used to update the parameters of the learner in the inner loop via gradient descent: $\theta_{new} = \theta - \alpha \nabla_{\theta}L_{learned} $. The key contribution of the paper is the way to construct a differentiable learning signal for the loss parameters $\Phi$. The framework requires to specify a task loss $L_T$ during meta train time, which can be for example the mean squared error for regression tasks. After updating the model parameters to $\theta_{new}$ the task loss is used to measure how much learning progress has been made with loss parameters $\Phi$. The key insight is the decomposition via chain-rule of $\nabla_{\Phi} L_T(y, f_{\theta_{new}})$: $\nabla_{\Phi} L_T(y, f_{\theta_{new}}) = \nabla_f L_t \nabla_{\theta_{new}}f_{\theta_{new}} \nabla_{\Phi} \theta_{new} = \nabla_f L_t \nabla_{\theta_{new}}f_{\theta_{new}} [\theta - \alpha \nabla_{\theta} \mathbb{E}[M_{\Phi}(y, f_{\theta}(x))]]$. This allows to update the loss parameters with gradient descent as: $\Phi_{new} = \Phi - \eta \nabla_{\Phi} L_T(y, f_{\theta_{new}})$. This update rules yield the following $ML^3$ algorithm for supervised learning tasks: https://i.imgur.com/tSaTbg8.png For reinforcement learning the task loss is the expected future reward of policies induced by the policy $\pi_{\theta}$, for model-based rl with respect to the approximate dynamics model and for the model free case a system independent surrogate: $L_T(\pi_{\theta_{new}}) = -\mathbb{E}_{\pi_{\theta_{new}}} \left[ R(\tau_{\theta_{new}}) \log \pi_{\theta_{new}}(\tau_{new})\right] $. The allows further to incorporate extra information via an additional loss term $L_{extra}$ and to consider the augmented task loss $\beta L_T + \gamma L_{extra} $, with weights $\beta, \gamma$ at train time. Possible extra loss terms are used to add physics priors, encouragement of exploratory behavior or to incorporate expert demonstrations. The experiments show that this, at test time unavailable information, is retained in the shape of the loss landscape. The paper is packed with insightful experiments and shows that the learned loss function: - yields in regression and classification better accuracies at train and test tasks - generalizes well and speeds up learning in model based rl tasks - yields better generalization and faster learning in model free rl - is agnostic across a bunch of evaluated architectures (2,3,4,5 layers) - with incorporated extra knowledge yields better performance than without and is superior to alternative approaches like iLQR in a MountainCar task. The paper introduces a promising alternative, by learning the loss parameters, to MAML like approaches that learn the model parameters. It would be interesting to see if the learned loss function generalizes better than learned model parameters to a broader distribution of tasks like the meta-world tasks. |
[link]
The authors propose a unified setting to evaluate the performance of batch reinforcement learning algorithms. The proposed benchmark is discrete and based on the popular Atari Domain. The authors review and benchmark several current batch RL algorithms against a newly introduced version of BCQ (Batch Constrained Deep Q Learning) for discrete environments. https://i.imgur.com/zrCZ173.png Note in line 5 that the policy chooses actions with a restricted argmax operation, eliminating actions that have not enough support in the batch. One of the key difficulties in batch-RL is the divergence of value estimates. In this paper the authors use Double DQN, which means actions are selected with a value net $Q_{\theta}$ and the policy evaluation is done with a target network $Q_{\theta'}$ (line 6). **How is the batch created?** A partially trained DQN-agent (trained online for 10mio steps, aka 40mio frames) is used as behavioral policy to collect a batch $B$ containing 10mio transitions. The DQN agent uses either with probability 0.8 an $\epsilon=0.2$ and with probability 0.2 an $\epsilon = 0.001$. The batch RL agents are trained on this batch for 10mio steps and evaluated every 50k time steps for 10 episodes. This process of batch creation differs from the settings used in other papers in i) having only a single behavioral policy, ii) the batch size and iii) the proficiency level of the batch policy. The experiments, performed on the arcade learning environment include DQN, REM, QR-DQN, KL-Control, BCQ, OnlineDQN and Behavioral Cloning and show that: - for conventional RL algorithms distributional algorithms (QR-DQN) outperform the plain algorithms (DQN) - batch RL algorithms perform better than conventional algorithms with BCQ outperforming every other algorithm in every tested game In addition to the return the authors plot the value estimates for the Q-networks. A drop in performance corresponds in all cases to a divergence (up or down) in value estimates. The paper is an important contribution to the debate about what is the right setting to evaluate batch RL algorithms. It remains however to be seen if the proposed choice of i) a single behavior policy, ii) the batch size and iii) quality level of the behavior policy will be accepted as standard. Further work is in any case required to decide upon a benchmark for continuous domains. |
[link]
Interacting with the environment comes sometimes at a high cost, for example in high stake scenarios like health care or teaching. Thus instead of learning online, we might want to learn from a fixed buffer $B$ of transitions, which is filled in advance from a behavior policy. The authors show that several so called off-policy algorithms, like DQN and DDPG fail dramatically in this pure off-policy setting. They attribute this to the extrapolation error, which occurs in the update of a value estimate $Q(s,a)$, where the target policy selects an unfamiliar action $\pi(s')$ such that $(s', \pi(s'))$ is unlikely or not present in $B$. Extrapolation error is caused by the mismatch between the true state-action visitation distribution of the current policy and the state-action distribution in $B$ due to: - state-action pairs (s,a) missing in $B$, resulting in arbitrarily bad estimates of $Q_{\theta}(s, a)$ without sufficient data close to (s,a). - the finiteness of the batch of transition tuples $B$, leading to a biased estimate of the transition dynamics in the Bellman operator $T^{\pi}Q(s,a) \approx \mathbb{E}_{\boldsymbol{s' \sim B}}\left[r + \gamma Q(s', \pi(s')) \right]$ - transitions are sampled uniformly from $B$, resulting in a loss weighted w.r.t the frequency of data in the batch: $\frac{1}{\vert B \vert} \sum_{\boldsymbol{(s, a, r, s') \sim B}} \Vert r + \gamma Q(s', \pi(s')) - Q(s, a)\Vert^2$ The proposed algorithm Batch-Constrained deep Q-learning (BCQ) aims to choose actions that: 1. minimize distance of taken actions to actions in the batch 2. lead to states contained in the buffer 3. maximizes the value function, where 1. is prioritized over the other two goals to mitigate the extrapolation error. Their proposed algorithm (for continuous environments) consists informally of the following steps that are repeated at each time $t$: 1. update generator model of the state conditional marginal likelihood $P_B^G(a \vert s)$ 2. sample n actions form the generator model 3. perturb each of the sampled actions to lie in a range $\left[-\Phi, \Phi \right]$ 4. act according to the argmax of respective Q-values of perturbed actions 5. update value function The experiments considers Mujoco tasks with four scenarios of batch data creation: - 1 million time steps from training a DDPG agent with exploration noise $\mathcal{N}(0,0.5)$ added to the action.This aims for a diverse set of states and actions. - 1 million time steps from training a DDPG agent with an exploration noise $\mathcal{N}(0,0.1)$ added to the actions as behavior policy. The batch-RL agent and the behavior DDPG are trained concurrently from the same buffer. - 1 million transitions from rolling out a already trained DDPG agent - 100k transitions from a behavior policy that acts with probability 0.3 randomly and follows otherwise an expert demonstration with added exploration noise $\mathcal{N}(0,0.3)$ I like the fourth choice of behavior policy the most as this captures high stake scenarios like education or medicine the closest, in which training data would be acquired by human experts that are by the nature of humans not optimal but significantly better than learning from scratch. The proposed BCQ algorithm is the only algorithm that is successful across all experiments. It matches or outperforms the behavior policy. Evaluation of the value estimates showcases unstable and diverging value estimates for all algorithms but BCQ that exhibits a stable value function. The paper outlines a very important issue that needs to be tackled in order to use reinforcement learning in real world applications. |
[link]
The authors analyse in the very well written paper the relation between Fisher $F(\theta) = \sum_n \mathbb{E}_{p_{\theta}(y \vert x)}[\nabla_{\theta} \log(p_{\theta}(y \vert x_n))\nabla_{\theta} \log(p_{\theta}(y \vert x_n))^T] $ and empirical Fisher $\bar{F}(\theta) = \sum_n [\nabla_{\theta} \log(p_{\theta}(y_n \vert x_n))\nabla_{\theta} \log(p_{\theta}(y_n \vert x_n))^T] $, which has recently seen a surge in interest. . The definitions differ in that $y_n$ is a training label instead of a sample of the model $p_{\theta}(y \vert x_n)$, thus even so the name suggests otherwise $\bar{F}$ is not a empirical, for example Monte Carlo, estimate of the Fisher. The authors rebuff common arguments used to justify the use of the empirical fisher by an amendment to the generalized Gauss-Newton, give conditions when the empirical Fisher does indeed approach the Fisher and give an argument why the empirical fisher might work in practice nonetheless. The Fisher, capturing the curvature of the parameter space, provides information about the geometry of the parameters pace, the empirical Fisher might however fail so capture the curvature as the striking plot from the paper shows: https://i.imgur.com/c5iCqXW.png The authors rebuff the two major justifications for the use of empirical Fisher: 1. "the empirical Fisher matches the construction of a generalized Gauss-Newton" * for the log-likelihood $log(p(y \vert f) = \log \exp(-\frac{1}{2}(y-f)^2))$ the generalized Gauss-Newton intuition that small residuals $f(x_n, \theta) - y_n$ lead to a good approximation of the Hessian is not satisfied. Whereas the Fisher approaches the Hessian, the empirical Fisher approaches 0 2. "the empirical Fisher converges to the true Fisher when the model is a good fit for the data" * the authors sharpen the argument to "the empirical Fisher converges at the minimum to the Fisher as the number of samples grows", which is unlikely to be satisfied in practice. The authors provide an alternative perspective on why the empirical Fisher might be successful, namely to adapt the gradient to the gradient noise in stochastic optimization. The empirical Fisher coincides with the second moment of the stochastic gradient estimate and encodes as such covariance information about the gradient noise. This allows to reduce the effects of gradient noise by scaling back the updates in high variance aka noise directions. |