In many policy gradient algorithms, we update the parameters in an online fashion: we collect trajectories from a policy, use them to compute the gradient of the long-term cumulative reward with respect to the policy parameters, and update the parameters using this gradient. Note that these samples are not reused after the update. The main reason is that reusing them requires **importance sampling**, and **importance sampling** suffers from high variance that can make learning unstable. This paper proposes an extension of **Asynchronous Advantage Actor Critic (A3C)** that incorporates offline data (trajectories collected using previous policies).

**Incorporating offline data in policy gradient**

The offline data is incorporated using importance sampling. Let $J(\theta)$ denote the total reward under policy $\pi_\theta$. By the Policy Gradient Theorem,

$$
\nabla J(\theta) \propto \mathbb{E}_{x_t \sim \beta_\mu,\, a_t \sim \mu}\left[\rho_t \nabla_{\theta} \log \pi_\theta(a_t \mid x_t)\, Q^{\pi}(x_t, a_t)\right]
$$

where $\rho_t = \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}$ is the importance sampling weight and $\beta_\mu$ is the stationary distribution of states under the behaviour policy $\mu$.

**Estimating $Q^{\pi}(x_t, a_t)$ in the above equation**

The authors use a *Retrace($\lambda$)* approach to estimate $Q^{\pi}$. The action-values are computed with the recursion

$$
Q^{\text{ret}}(x_t, a_t) = r_t + \gamma \bar{\rho}_{t+1}\left[Q^{\text{ret}}(x_{t+1}, a_{t+1}) - Q(x_{t+1}, a_{t+1})\right] + \gamma V(x_{t+1})
$$

where $\bar{\rho}_t = \min\{c, \rho_t\}$ and $\rho_t$ is the importance sampling weight. $Q$ and $V$ in the above equation are the estimates of the action-value and state-value respectively. To estimate $Q$, the authors use the same architecture as A3C, except that the final layer outputs $Q$-values instead of the state-value $V$.
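The $Q^{\text{ret}}$ recursion can be computed with a single backward pass over a stored trajectory. A minimal NumPy sketch (function name and array layout are my own assumptions, not from the paper; `v_bootstrap` stands for the value estimate $V(x_T)$ at the trajectory boundary, where the correction term vanishes):

```python
import numpy as np

def retrace_targets(rewards, q, v, rho, v_bootstrap, gamma=0.99, c=1.0):
    """Backward recursion for Q^ret over one trajectory of length T.

    rewards[t] = r_t, q[t] = Q(x_t, a_t), v[t] = V(x_t),
    rho[t] = pi(a_t | x_t) / mu(a_t | x_t); v_bootstrap = V(x_T).
    """
    T = len(rewards)
    rho_bar = np.minimum(c, np.asarray(rho, dtype=float))  # truncated weights
    targets = np.empty(T)
    next_correction = 0.0   # at the boundary the correction term vanishes
    next_v = v_bootstrap    # bootstrap with V(x_T)
    for t in reversed(range(T)):
        # Q^ret_t = r_t + gamma * [ rho_bar_{t+1} (Q^ret_{t+1} - Q_{t+1}) + V(x_{t+1}) ]
        targets[t] = rewards[t] + gamma * (next_correction + next_v)
        next_correction = rho_bar[t] * (targets[t] - q[t])
        next_v = v[t]
    return targets
```

Two sanity checks on the recursion: with $\bar{\rho}_t = 1$ everywhere (on-policy) and $Q \equiv 0$, the targets reduce to Monte-Carlo returns; with $\rho_t = 0$ everywhere, they reduce to one-step bootstrapped targets $r_t + \gamma V(x_{t+1})$.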
To train $Q$, the authors use $Q^{\text{ret}}$ as the regression target.

**Reducing the variance caused by importance sampling in the above equation**

The authors use a technique called *importance weight truncation with bias correction* to keep the variance of the policy gradient bounded. Specifically, they use the following identity:

$$
\begin{aligned}
\mathbb{E}_{x_t \sim \beta_\mu,\, a_t \sim \mu}&\left[\rho_t \nabla_{\theta} \log \pi_\theta(a_t \mid x_t)\, Q^{\pi}(x_t, a_t)\right] \\
&= \mathbb{E}_{x_t \sim \beta_\mu}\bigg[ \mathbb{E}_{a_t \sim \mu}\left[\bar{\rho}_t \nabla_{\theta} \log \pi_\theta(a_t \mid x_t)\, Q^{\pi}(x_t, a_t)\right] \\
&\qquad\quad + \mathbb{E}_{a \sim \pi}\left[\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_{+} \nabla_{\theta} \log \pi_{\theta}(a \mid x_t)\, Q^{\pi}(x_t, a)\right] \bigg]
\end{aligned}
$$

Both terms on the right-hand side have bounded variance: the first because its importance weight is truncated at $c$, and the second because the correction factor $\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_{+}$ lies in $[0, 1)$.

**Results**

The authors show that, by using the offline data, they can match the performance of the best DQN agent with less data and the same amount of computation.

**Continuous tasks**

For tasks with continuous action spaces, the authors use a stochastic dueling network architecture, while reusing the innovations of the discrete case.
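Returning to the truncation-with-bias-correction identity above: for a discrete action space, both terms reduce to per-action scalar coefficients multiplying $\nabla_\theta \log \pi_\theta(a \mid x_t)$. A minimal NumPy sketch of those coefficients (function name and argument layout are my own assumptions; $a_t$ is the action actually sampled from $\mu$):

```python
import numpy as np

def truncated_is_coefficients(a_t, pi_probs, mu_probs, q_vals, c=10.0):
    """Per-action coefficients for the two-term gradient estimate:
    g = sum_a coeff[a] * grad log pi(a | x_t)."""
    pi_probs = np.asarray(pi_probs, dtype=float)
    rho = pi_probs / np.asarray(mu_probs, dtype=float)  # importance weights
    coeff = np.zeros_like(pi_probs)
    # Term 1: truncated weight on the action actually sampled from mu.
    coeff[a_t] += min(c, rho[a_t]) * q_vals[a_t]
    # Term 2: bias correction, an expectation under pi that is non-zero
    # only for actions whose importance weight exceeds c.
    correction = np.maximum((rho - c) / rho, 0.0)  # the [.]_+ factor, in [0, 1)
    coeff += pi_probs * correction * np.asarray(q_vals, dtype=float)
    return coeff
```

When $c$ exceeds every $\rho_t(a)$, the correction term is zero and only the (untruncated) sampled-action term remains; as $c \to 0$, the correction term alone recovers the exact on-policy expectation, which is how the identity keeps the estimate unbiased while bounding the weights.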