[link]
Summary by tom89 1 year ago
This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and deal with stochastic environment models.
Value gradients are a type of policy gradient algorithm which represent a value function either by:
* A learned Q-function (a critic)
* Linking together a policy, an environment model and reward function to define a recursive function to simulate the trajectory and the total return from a given state.
By backpropagating though these functions, value gradient methods can calculate a policy gradient. This backpropagation sets them apart from other policy gradient methods (like REINFORCE for example) which are model-free and sample returns from the real environment.
Applying value gradients to stochastic problems requires differentiating the stochastic bellman equation:
\begin{equation}
V ^t (s) = \int \left[ r^t + γ \int V^{t+1} (s) p(s' | s, a) ds' \right] p(a|s; θ) da
\end{equation}
To do that, the authors use a trick called re-parameterisation to express the stochastic bellman equation as a deterministic function which takes a noise variable as an input. To differentiate a re-parameterised function, one simply samples the noise variable then computes the derivative as if the function were deterministic. This can then be repeated $ M $ times and averaged to arrive at a Monte Carlo estimate for the derivative of the stochastic function.
The re-parameterised bellman equation is:
$ V (s) = \mathbb{E}_{ \rho(\eta) } \left[ r(s, \pi(s, \eta; \theta)) + \gamma \mathbb{E}_{\rho(\xi) } \left[ V' (f(s, \pi(s, \eta; \theta), \xi)) \right] \right] $
It's derivative with respect to the current state and the policy parameters is:
$ V_s = \mathbb{E}_{\rho(\eta)} \[ r_\textbf{s} + r_\textbf{a} \pi_\textbf{s} + \gamma \mathbb{E}_{\rho(\xi)} V'_{s'} (\textbf{f}_\textbf{s} + \textbf{f}_\textbf{a} \pi_\textbf{s}) \] $
$ V_\theta = \mathbb{E}_{\rho(\eta)} \[ r_\textbf{a} \pi_\theta + \gamma \mathbb{E}_{\rho(\xi)} \[ V'_{\textbf{s'}} \textbf{f}_\textbf{a} \pi_\textbf{s} + V'_\theta\] \] $
Based on these relationships the authors define two algorithms; SVG(∞), SVG(1)
* SVG(∞) takes the trajectory from an entire episode and starting at the terminal state accumulates a gradients $V_{\textbf{s}} $ and $ V_{\theta} $ using the expressions above to arrive at a policy gradient. SVG(∞) is on-policy and only works with finite-horizon environments
* SVG(1) trains a value function then uses its gradient as an estimate for $ V_{\textbf{s}} $ above. SVG(1) also uses importance weighting so as to be off-policy and can work with infinite-horizon environments.
Both algorithms use an environment model which is trained using an experience replay database. The paper also introduces SVG(0) which is a similar to SVG(1), but is model-free.
SVG was analysed using several MuJoCo environments and it was found that:
* SVG(∞) outperformed a BBPT planner on a control problem with a stochastic model, indicating that gradient evaluation using real trajectories is more effective than planning for stochastic environments
* SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞)
* SVG(1) was able to solve several complex environments

more
less