Learning Continuous Control Policies by Stochastic Value Gradients
Paper summary

This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and to handle stochastic environment models. Value gradient methods are a type of policy gradient algorithm that represents the value function in one of two ways:

* A learned Q-function (a critic)
* A policy, an environment model and a reward function linked together into a recursive function that simulates the trajectory and the total return from a given state.

By backpropagating through these functions, value gradient methods can calculate a policy gradient. This backpropagation sets them apart from other policy gradient methods (REINFORCE, for example), which are model-free and sample returns from the real environment.

Applying value gradients to stochastic problems requires differentiating the stochastic Bellman equation:

\begin{equation} V^t (s) = \int \left[ r^t + \gamma \int V^{t+1} (s') p(s' | s, a) ds' \right] p(a|s; \theta) da \end{equation}

To do that, the authors use a trick called re-parameterisation, which expresses the stochastic Bellman equation as a deterministic function that takes a noise variable as an input. To differentiate a re-parameterised function, one samples the noise variable and then computes the derivative as if the function were deterministic. Repeating this $ M $ times and averaging the results gives a Monte Carlo estimate of the derivative of the stochastic function.
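The sampling-and-averaging step can be illustrated with a toy example that is not from the paper. The sketch below assumes a scalar Gaussian "policy" $ a = \mu + \sigma \eta $ with $ \eta \sim \mathcal{N}(0, 1) $ and a hypothetical objective $ f(a) = a^2 $, chosen only because the exact answer is known: $ \mathbb{E}[a^2] = \mu^2 + \sigma^2 $, so the gradients are $ 2\mu $ and $ 2\sigma $. The function name and arguments are made up for illustration.

```python
import random


def reparam_gradient(mu, sigma, num_samples=20000, seed=0):
    """Monte Carlo estimate of the gradient of E[a^2] w.r.t. (mu, sigma).

    Assumption for illustration: a = mu + sigma * eta with eta ~ N(0, 1),
    and the objective f(a) = a^2 is a stand-in, not the paper's return.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eta = rng.gauss(0.0, 1.0)   # noise sample, independent of the parameters
        a = mu + sigma * eta        # deterministic function of (mu, sigma, eta)
        df_da = 2.0 * a             # derivative of f(a) = a^2
        grad_mu += df_da * 1.0      # chain rule: da/dmu = 1
        grad_sigma += df_da * eta   # chain rule: da/dsigma = eta
    # Averaging the per-sample derivatives gives the Monte Carlo estimate.
    return grad_mu / num_samples, grad_sigma / num_samples


if __name__ == "__main__":
    g_mu, g_sigma = reparam_gradient(mu=1.5, sigma=0.5)
    print(g_mu, g_sigma)  # should be close to 2*mu = 3.0 and 2*sigma = 1.0
```

Because the noise is sampled first and then held fixed, each per-sample derivative is an ordinary chain-rule computation, which is what makes backpropagation applicable.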
The re-parameterised Bellman equation is:

$ V (s) = \mathbb{E}_{ \rho(\eta) } \left[ r(s, \pi(s, \eta; \theta)) + \gamma \mathbb{E}_{\rho(\xi) } \left[ V' (f(s, \pi(s, \eta; \theta), \xi)) \right] \right] $

Its derivatives with respect to the current state and the policy parameters are:

$ V_\mathbf{s} = \mathbb{E}_{\rho(\eta)} \left[ r_\mathbf{s} + r_\mathbf{a} \pi_\mathbf{s} + \gamma \mathbb{E}_{\rho(\xi)} \left[ V'_{\mathbf{s'}} (\mathbf{f}_\mathbf{s} + \mathbf{f}_\mathbf{a} \pi_\mathbf{s}) \right] \right] $

$ V_\theta = \mathbb{E}_{\rho(\eta)} \left[ r_\mathbf{a} \pi_\theta + \gamma \mathbb{E}_{\rho(\xi)} \left[ V'_{\mathbf{s'}} \mathbf{f}_\mathbf{a} \pi_\theta + V'_\theta \right] \right] $

Based on these relationships the authors define two algorithms, SVG(∞) and SVG(1):

* SVG(∞) takes the trajectory from an entire episode and, starting at the terminal state, accumulates the gradients $ V_\mathbf{s} $ and $ V_\theta $ backwards in time using the expressions above to arrive at a policy gradient. SVG(∞) is on-policy and only works with finite-horizon environments.
* SVG(1) trains a value function, then uses its gradient as an estimate for $ V'_{\mathbf{s'}} $ above. SVG(1) also uses importance weighting, so it can learn off-policy, and it can work with infinite-horizon environments.

Both algorithms use an environment model which is trained using an experience replay database. The paper also introduces SVG(0), which is similar to SVG(1) but is model-free.

SVG was analysed using several MuJoCo environments and it was found that:

* SVG(∞) outperformed a BPTT planner on a control problem with a stochastic model, indicating that gradient evaluation using real trajectories is more effective than planning for stochastic environments
* SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞)
* SVG(1) was able to solve several complex environments
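The backward accumulation of $ V_\mathbf{s} $ and $ V_\theta $ that SVG(∞) performs can be sketched on a toy problem. Everything specific below is an assumption made for illustration and is not taken from the paper: a scalar state, a linear policy $ a = \theta s + \sigma_\pi \eta $, known linear dynamics $ s' = A s + B a + \xi $ (the paper learns the model instead), a quadratic reward, and the constants and function names. The noise is recorded while sampling here, whereas the paper infers it from observed trajectories.

```python
import random

# Hypothetical toy constants (not from the paper).
A, B, C, GAMMA, SIGMA_PI = 0.9, 0.5, 0.1, 0.95, 0.1


def rollout(theta, s0, etas, xis):
    """Simulate one episode with fixed noise; return states, actions, return."""
    s, states, actions, ret = s0, [], [], 0.0
    for t, (eta, xi) in enumerate(zip(etas, xis)):
        a = theta * s + SIGMA_PI * eta            # re-parameterised policy
        states.append(s)
        actions.append(a)
        ret += (GAMMA ** t) * -(s * s + C * a * a)  # reward r = -(s^2 + C a^2)
        s = A * s + B * a + xi                    # re-parameterised dynamics
    return states, actions, ret


def svg_inf_gradient(theta, s0, etas, xis):
    """Accumulate V_s and V_theta backwards along one sampled trajectory."""
    states, actions, _ = rollout(theta, s0, etas, xis)
    v_s, v_theta = 0.0, 0.0                       # gradients after the final step
    for s, a in zip(reversed(states), reversed(actions)):
        r_s, r_a = -2.0 * s, -2.0 * C * a         # reward partials
        pi_s, pi_theta = theta, s                 # policy partials
        f_s, f_a = A, B                           # model partials
        # Both updates use the *next* step's v_s and v_theta.
        new_v_theta = r_a * pi_theta + GAMMA * (v_s * f_a * pi_theta + v_theta)
        new_v_s = r_s + r_a * pi_s + GAMMA * v_s * (f_s + f_a * pi_s)
        v_s, v_theta = new_v_s, new_v_theta
    return v_theta                                # single-sample policy gradient


if __name__ == "__main__":
    rng = random.Random(0)
    etas = [rng.gauss(0.0, 1.0) for _ in range(20)]
    xis = [rng.gauss(0.0, 0.05) for _ in range(20)]
    print(svg_inf_gradient(-0.3, 1.0, etas, xis))
```

With the noise held fixed, the result should agree with a finite-difference derivative of the return with respect to $ \theta $, which is a convenient sanity check. Averaging over many sampled trajectories would give the Monte Carlo estimate of the expectations in the equations above.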
Learning Continuous Control Policies by Stochastic Value Gradients
Nicolas Heess and Greg Wayne and David Silver and Timothy Lillicrap and Yuval Tassa and Tom Erez
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.NE


Summary by tom89 2 months ago