Rakelly et al. propose PEARL, a method for off-policy meta-reinforcement learning (meta-RL). The method achieves a 20-100x improvement in sample efficiency compared to on-policy meta-RL approaches such as MAML+TRPO. The key difficulty for off-policy meta-RL arises from the meta-learning assumption that meta-training and meta-test conditions match. At test time, however, the policy has to explore and therefore sees on-policy data, in contrast to the off-policy data that should be used during meta-training. The key contribution of PEARL is an algorithm that performs online task inference in a latent variable at both train and test time; this latent variable is an additional input to Soft Actor-Critic (SAC), a very sample-efficient off-policy algorithm.

Rakelly et al. capture knowledge about the current task in a latent stochastic variable $z$. An inference network $q_{\Phi}(z \vert c)$ predicts the posterior over latents given the context $c$ of the current task, in the form of transition tuples $(s,a,r,s')$, and is trained with an information bottleneck. Note that task inference is done on context sampled with a strategy that favors more recent transitions. The latent $z$ is used as an additional input to the policy $\pi(a \vert s, z)$ and the Q-function $Q(a,s,z)$ of a Soft Actor-Critic algorithm, which is trained with off-policy data from the full replay buffer.

https://i.imgur.com/wzlmlxU.png

The challenge of differing conditions at train and test time is thus resolved by sampling the context for the latent variable only from very recent transitions at train time (which is almost on-policy), while at test time the context is on-policy by construction. Sampling $z \sim q(z \vert c)$ at test time allows for posterior sampling of the latent variable, yielding efficient exploration.

The experiments are performed across 6 MuJoCo tasks, with ProMP, MAML+TRPO, and $RL^2$ with PPO as baselines.
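
The two sampling regimes described above (recent transitions for task inference, the full buffer for the actor-critic update) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: `TaskReplayBuffer`, `sample_context`, `sample_rl_batch`, `sample_latent`, `policy_input`, and the `recent` window size are all hypothetical names and parameters, and the diagonal Gaussian form of the posterior is an assumption made for the example.

```python
import math
import random


class TaskReplayBuffer:
    """Toy per-task buffer of (s, a, r, s') tuples (hypothetical helper)."""

    def __init__(self):
        self.transitions = []

    def add(self, s, a, r, s_next):
        self.transitions.append((s, a, r, s_next))

    def sample_rl_batch(self, rng, batch_size):
        # Off-policy batch for the actor-critic update:
        # drawn uniformly from the whole buffer.
        return [rng.choice(self.transitions) for _ in range(batch_size)]

    def sample_context(self, rng, batch_size, recent=100):
        # Context batch for task inference: drawn only from the most
        # recent transitions, so it is close to on-policy data.
        window = self.transitions[-recent:]
        return [rng.choice(window) for _ in range(batch_size)]


def sample_latent(rng, mean, log_var):
    """Draw z = mean + std * eps, assuming a diagonal Gaussian posterior."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def policy_input(state, z):
    # The latent is simply appended to the state before it is fed to
    # the policy pi(a | s, z) and the Q-function.
    return list(state) + list(z)
```

In a real system the `mean` and `log_var` passed to `sample_latent` would come from the inference network applied to the context batch; here they are left as plain arguments so the sketch stays self-contained.
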
The experiments show:

- PEARL is 20-100x more sample-efficient.
- Posterior sampling of the latent context variable enables deep exploration, which is crucial in sparse-reward settings.
- The inference network could also be an RNN; however, it is crucial to train it on decorrelated transitions rather than on trajectories, whose transitions are highly correlated.
- Using a deterministic latent variable, i.e. reducing $q_{\Phi}(z \vert c)$ to a point estimate, leaves the algorithm unable to solve sparse-reward navigation tasks, which is attributed to the lack of temporally extended exploration.

The paper introduces an algorithm that combines meta-learning with an off-policy algorithm, dramatically increasing sample efficiency compared to on-policy meta-learning approaches. This increases the chance of seeing meta-RL in real-world applications.
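
The difference between a stochastic and a deterministic latent, noted in the results above, can be illustrated with a toy 1-D example. This is a sketch under stated assumptions rather than the paper's method: the belief over $z$ is assumed to be a 1-D Gaussian, updated by multiplying a unit-variance prior with one Gaussian factor per context transition, and all function names (`product_of_gaussians`, `infer_posterior`, `choose_latent`) are hypothetical.

```python
import math
import random


def product_of_gaussians(means, variances):
    """Combine 1-D Gaussian factors N(mean_i, var_i) into a single Gaussian."""
    precision = sum(1.0 / v for v in variances)
    var = 1.0 / precision
    mean = var * sum(m / v for m, v in zip(means, variances))
    return mean, var


def infer_posterior(factors, prior=(0.0, 1.0)):
    """Belief over z: the prior times one (mean, var) factor per transition."""
    means = [prior[0]] + [m for m, _ in factors]
    variances = [prior[1]] + [v for _, v in factors]
    return product_of_gaussians(means, variances)


def choose_latent(rng, factors, stochastic=True):
    """Pick the z that the policy is conditioned on for a whole episode."""
    mean, var = infer_posterior(factors)
    if not stochastic:
        # Point estimate: with no new context, every episode gets the same z.
        return mean
    # Posterior sampling: each episode commits to a different task hypothesis.
    return rng.gauss(mean, math.sqrt(var))
```

With an empty context the deterministic variant always returns the prior mean, so every episode behaves the same way. The stochastic variant draws a different $z$ per episode and holds it fixed within the episode, which is one way to read the "temporally extended exploration" mentioned above. As context accumulates, the variance shrinks and the samples concentrate.
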
