Learning Continuous Control Policies by Stochastic Value Gradients
Paper summary

This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and to work with stochastic environment models. Value gradient methods are a type of policy gradient algorithm that represent the value function by either:

* A learned Q-function (a critic)
* Chaining together a policy, an environment model and a reward function to define a recursive function that simulates the trajectory and the total return from a given state.

By backpropagating through these functions, value gradient methods can calculate a policy gradient. This backpropagation sets them apart from other policy gradient methods (REINFORCE, for example), which are model-free and sample returns from the real environment.

Applying value gradients to stochastic problems requires differentiating the stochastic Bellman equation:

\begin{equation} V^t (s) = \int \left[ r^t + \gamma \int V^{t+1} (s') p(s' | s, a) ds' \right] p(a|s; \theta) da \end{equation}

To do that, the authors use a trick called re-parameterisation to express the stochastic Bellman equation as a deterministic function that takes a noise variable as an input. To differentiate a re-parameterised function, one samples the noise variable and then computes the derivative as if the function were deterministic. Repeating this $ M $ times and averaging gives a Monte Carlo estimate of the derivative of the stochastic function.
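As a rough illustration of the re-parameterisation trick described above, the following sketch estimates a one-step policy gradient for a scalar problem. The linear-Gaussian policy $ a = \theta s + \sigma \eta $, the quadratic reward and all function and parameter names are assumptions made for this example; they are not taken from the paper.

```python
import random


def reward_grad_a(a, target):
    """Derivative of the toy reward r(a) = -(a - target)^2 with respect to a."""
    return -2.0 * (a - target)


def reparam_gradient(theta, s, sigma, target, num_samples, rng):
    """Monte Carlo estimate of d/dtheta E_eta[ r(pi(s, eta; theta)) ].

    The policy is assumed to be a = theta * s + sigma * eta with eta ~ N(0, 1).
    """
    total = 0.0
    for _ in range(num_samples):
        eta = rng.gauss(0.0, 1.0)      # sample the noise variable
        a = theta * s + sigma * eta    # the action is deterministic given eta
        pi_theta = s                   # d a / d theta with eta held fixed
        total += reward_grad_a(a, target) * pi_theta  # chain rule: r_a * pi_theta
    return total / num_samples         # average over the M = num_samples draws


if __name__ == "__main__":
    rng = random.Random(0)
    print(reparam_gradient(0.5, 2.0, 0.1, 3.0, 10000, rng))
```

For this toy problem the exact gradient is available in closed form, $ -2 (\theta s - \text{target}) s $, so the estimate can be checked directly; with the values used in the usage lines it should land close to 8.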
The re-parameterised Bellman equation is:

$ V (s) = \mathbb{E}_{ \rho(\eta) } \left[ r(s, \pi(s, \eta; \theta)) + \gamma \mathbb{E}_{\rho(\xi) } \left[ V' (f(s, \pi(s, \eta; \theta), \xi)) \right] \right] $

Here $ \pi $ is the policy, $ f $ is the environment model, and $ \eta $ and $ \xi $ are the noise variables of the policy and the model respectively. Its derivatives with respect to the current state and the policy parameters are:

$ V_\textbf{s} = \mathbb{E}_{\rho(\eta)} \left[ r_\textbf{s} + r_\textbf{a} \pi_\textbf{s} + \gamma \mathbb{E}_{\rho(\xi)} \left[ V'_{\textbf{s}'} (\textbf{f}_\textbf{s} + \textbf{f}_\textbf{a} \pi_\textbf{s}) \right] \right] $

$ V_\theta = \mathbb{E}_{\rho(\eta)} \left[ r_\textbf{a} \pi_\theta + \gamma \mathbb{E}_{\rho(\xi)} \left[ V'_{\textbf{s}'} \textbf{f}_\textbf{a} \pi_\theta + V'_\theta \right] \right] $

Based on these relationships the authors define two algorithms, SVG(∞) and SVG(1):

* SVG(∞) takes the trajectory from an entire episode and, starting at the terminal state, accumulates the gradients $ V_{\textbf{s}} $ and $ V_{\theta} $ backwards in time using the expressions above to arrive at a policy gradient. SVG(∞) is on-policy and only works with finite-horizon environments.
* SVG(1) trains a value function, then uses its gradient as an estimate for $ V'_{\textbf{s}'} $ above. SVG(1) also uses importance weighting so that it can learn off-policy, and it can work with infinite-horizon environments.

Both algorithms use an environment model which is trained using an experience replay database. The paper also introduces SVG(0), which is similar to SVG(1) but is model-free.

SVG was analysed using several MuJoCo environments and it was found that:

* SVG(∞) outperformed a BBPT planner on a control problem with a stochastic model, indicating that gradient evaluation using real trajectories is more effective than planning for stochastic environments
* SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞)
* SVG(1) was able to solve several complex environments
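The backward accumulation used by SVG(∞) can be sketched on a toy scalar problem as follows. This is only a minimal illustration under strong assumptions: the dynamics are linear and known (the paper learns the model from data), the policy is $ a = \theta s + \eta $, the reward is quadratic, and the noise values are simply recorded during the rollout. All constants and function names are hypothetical.

```python
import random

# Toy 1-D problem; every constant below is an illustrative assumption.
A, B = 0.9, 0.5   # assumed known linear dynamics: s' = A*s + B*a + xi
C = 0.1           # action cost weight in r(s, a) = -(s^2 + C*a^2)
GAMMA = 0.95      # discount factor


def rollout(theta, s0, etas, xis):
    """Simulate the policy a = theta*s + eta; return states, actions and the return."""
    states, actions, ret, s = [], [], 0.0, s0
    for t, (eta, xi) in enumerate(zip(etas, xis)):
        a = theta * s + eta
        states.append(s)
        actions.append(a)
        ret += GAMMA ** t * -(s * s + C * a * a)
        s = A * s + B * a + xi
    return states, actions, ret


def svg_inf_gradient(theta, states, actions):
    """Accumulate V_s and V_theta backwards along one recorded trajectory."""
    v_s, v_theta = 0.0, 0.0                  # gradients beyond the last step are zero
    for s, a in zip(reversed(states), reversed(actions)):
        r_s, r_a = -2.0 * s, -2.0 * C * a    # reward derivatives
        pi_s, pi_theta = theta, s            # derivatives of a = theta*s + eta
        f_s, f_a = A, B                      # derivatives of the linear dynamics
        # V_theta is updated first because it needs V'_{s'} from the later step.
        v_theta = r_a * pi_theta + GAMMA * (v_s * f_a * pi_theta + v_theta)
        v_s = r_s + r_a * pi_s + GAMMA * v_s * (f_s + f_a * pi_s)
    return v_theta


if __name__ == "__main__":
    rng = random.Random(0)
    etas = [rng.gauss(0.0, 0.1) for _ in range(20)]
    xis = [rng.gauss(0.0, 0.05) for _ in range(20)]
    states, actions, ret = rollout(-0.3, 1.0, etas, xis)
    print(ret, svg_inf_gradient(-0.3, states, actions))
```

Because the noise is held fixed, the return is a deterministic function of $ \theta $, so the result can be sanity-checked against a finite-difference estimate computed with the same noise values. Averaging the gradient over several trajectories would give the Monte Carlo estimate of the expectations in the equations above.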
Learning Continuous Control Policies by Stochastic Value Gradients
Nicolas Heess and Greg Wayne and David Silver and Timothy Lillicrap and Yuval Tassa and Tom Erez
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.NE


Summary by tom89 8 months ago