This paper shows exciting results on using model-based RL for Atari. Model-based RL has shown impressive improvements in sample efficiency on MuJoCo tasks ([Chua et al., 2018](https://arxiv.org/abs/1805.12114)), so it's nice to see that these gains carry over to pixel-based environments like Atari too. Specifically, the authors show that their model-based method does well on several Atari games after training on only 100K environment steps (400K frames with a frame skip of 4), which corresponds to roughly 2 hours of gameplay. They compare against state-of-the-art model-free methods (Rainbow, PPO) given a similar number of frames and show that the model-based version achieves much better scores.

The overall training procedure has a very Dyna-like flavor. The algorithm, termed SimPLe, follows an iterative scheme:

* Collect experience from the real environment using a policy (initialized to random).
* Use this experience to train the world model (a next-step frame prediction model and a reward prediction model). This amounts to supervised learning on `{(s, a) -> s'}` and `{(s, a) -> r}` pairs.
* Generate rollouts using the world model, and learn a policy from these rollouts using PPO.

https://i.imgur.com/SZLmdME.png

**Countering distributional shift:** A key issue when training models is that errors compound over multi-step rollouts. This is similar to the problem of making predictions with RNNs trained via teacher forcing, so it's natural to leverage existing techniques from that literature. This paper uses one such technique, scheduled sampling: during training, some of the input frames are randomly replaced by the model's prediction from the previous step. This seems like a natural way to make the model robust to slight distributional changes.

**Commentary / possible future work:**

* The paper evaluates on only 26 of the 60 Atari games in ALE. I would have really liked the authors to report performance numbers on all the games, even where they weren't good.
* Related: I suspect the method would not work well when the initial diversity of frames produced by the random policy is insufficient (e.g. sparse-reward games like Montezuma's Revenge or Pitfall). Using sample-efficient exploration algorithms to augment model learning would be really interesting.
* The trained world model can only roll out for 50 timesteps (compounding errors don't allow for longer rollouts), so it might be worthwhile to explore models that can make long-horizon predictions [(TD-VAE?)](https://openreview.net/forum?id=S1x4ghC9tQ).
* Apart from sample-efficiency gains, one reason I am excited about models is their potential ability to generalize to different tasks in the same environment. Benchmarking their generalization capability should thus be an exciting next step.

Finally, props to the authors for open-sourcing the code: [tensor2tensor/tensor2tensor/rl at master · tensorflow/tensor2tensor · GitHub](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/rl) and providing detailed instructions to run it.
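
For readers unfamiliar with scheduled sampling, the idea from the distributional-shift paragraph above can be sketched in a few lines. This is a toy illustration and not the authors' implementation: `scheduled_sampling_inputs`, `predict_next`, `mix_prob`, and `biased_model` are hypothetical names, the "frames" are 1-element arrays rather than images, and the model sees a single frame rather than a stack of frames plus an action.

```python
import numpy as np


def scheduled_sampling_inputs(frames, predict_next, mix_prob, rng):
    """Choose the inputs a next-frame model would be trained on.

    frames:       array of shape (T, ...) holding ground-truth frames.
    predict_next: hypothetical callable mapping one frame to a predicted next frame.
    mix_prob:     probability of feeding the model its own previous prediction
                  instead of the ground-truth frame.
    rng:          a numpy.random.Generator.
    """
    inputs = [frames[0]]        # the first input is always a real frame
    used_prediction = [False]
    for t in range(1, len(frames)):
        # The model's guess for frame t, computed from whatever it saw at t - 1.
        prediction = predict_next(inputs[-1])
        use_pred = bool(rng.random() < mix_prob)
        inputs.append(prediction if use_pred else frames[t])
        used_prediction.append(use_pred)
    return np.stack(inputs), np.array(used_prediction)


# Toy "video": frame t has value t. The stand-in model is slightly wrong
# on purpose (it adds 1.1 instead of 1.0), so its errors compound.
frames = np.arange(5, dtype=float).reshape(5, 1)


def biased_model(frame):
    return frame + 1.1


rng = np.random.default_rng(0)
teacher_forced, _ = scheduled_sampling_inputs(frames, biased_model, 0.0, rng)
free_running, _ = scheduled_sampling_inputs(frames, biased_model, 1.0, rng)
mixed, replaced = scheduled_sampling_inputs(frames, biased_model, 0.5, rng)

print("teacher forced:", teacher_forced.ravel())
print("free running:  ", free_running.ravel())
print("mixed:         ", mixed.ravel(), "replaced:", replaced)
```

With `mix_prob=0.0` this reduces to plain teacher forcing, and with `mix_prob=1.0` the model only ever sees its own drifting predictions. In a training loop, the input at step `t` would be paired with the real frame at step `t + 1` as the target, so the model learns to recover from its own slightly-off inputs; ramping `mix_prob` up over the course of training is one common choice, assumed here rather than taken from the paper.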