Model-Based Reinforcement Learning via Meta-Policy Optimization
Clavera, Ignasi and Rothfuss, Jonas and Schulman, John and Fujita, Yasuhiro and Asfour, Tamim and Abbeel, Pieter
2018
Paper summary by joecohen
In model-based RL, learned dynamics models are imperfect. As a result, the policy often overfits to the learned dynamics model: it does well in the learned simulator but not in the real world.
Key solution idea: there is no need to learn a single accurate simulator. Instead, we can learn an ensemble of models that together sufficiently represent the space. Used as several learned simulators, the ensemble lets us denoise estimates of performance. In a meta-learning sense, each learned simulator becomes a task, and the real world is then just one more task to which the policy can adapt quickly. One experimental observation is that the learned simulators vary widely at the start of training and then converge as training proceeds, which might also indicate that this approach improves exploration.
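To make the ensemble idea concrete, here is a minimal sketch rather than the paper's implementation. It assumes a hypothetical 1-D linear "real world", linear models fit by least squares on bootstrap resamples, and a fixed linear policy; all function names, parameters, and constants are illustrative. It shows two things from the summary: the ensemble disagrees more when data is scarce, and averaging simulated returns over the ensemble gives a less noisy performance estimate. The meta-learning step, in which the policy adapts to each simulator as a task, is left out.

```python
import numpy as np

def collect_transitions(n, rng, a=0.9, b=0.5, noise=0.1):
    """Sample n transitions from a hypothetical 1-D real world:
    s_next = a * s + b * u + Gaussian noise (an assumption for illustration)."""
    s = rng.uniform(-1.0, 1.0, size=n)
    u = rng.uniform(-1.0, 1.0, size=n)
    s_next = a * s + b * u + noise * rng.standard_normal(n)
    return np.stack([s, u], axis=1), s_next

def fit_ensemble(inputs, targets, num_models, rng):
    """Fit num_models linear dynamics models, each on its own bootstrap resample."""
    n = len(targets)
    weights = []
    for _ in range(num_models):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        w, *_ = np.linalg.lstsq(inputs[idx], targets[idx], rcond=None)
        weights.append(w)
    return np.array(weights)  # shape (num_models, 2)

def disagreement(weights, test_inputs):
    """Mean standard deviation of next-state predictions across the ensemble."""
    preds = test_inputs @ weights.T  # shape (num_test, num_models)
    return float(preds.std(axis=1).mean())

def simulated_return(w, gain, horizon=20, s0=1.0):
    """Roll out the policy u = -gain * s inside one learned model; reward is -s^2."""
    s, total = s0, 0.0
    for _ in range(horizon):
        u = -gain * s
        s = w[0] * s + w[1] * u
        total += -s ** 2
    return total

rng = np.random.default_rng(0)
test_inputs = rng.uniform(-1.0, 1.0, size=(100, 2))

# Early in training: few real transitions, so the learned simulators differ.
small = fit_ensemble(*collect_transitions(10, rng), num_models=5, rng=rng)
# Later in training: many real transitions, so the simulators come together.
large = fit_ensemble(*collect_transitions(1000, rng), num_models=5, rng=rng)

disagreement_small = disagreement(small, test_inputs)
disagreement_large = disagreement(large, test_inputs)

# Each model is one learned simulator; averaging over them denoises the estimate.
returns = np.array([simulated_return(w, gain=1.0) for w in large])
mean_return = float(returns.mean())

print("disagreement with 10 transitions:  ", disagreement_small)
print("disagreement with 1000 transitions:", disagreement_large)
print("per-model returns:", returns, "ensemble mean:", mean_return)
```

In this toy setting the spread between models acts as a rough signal of where the dynamics are still uncertain, which is one way to read the summary's remark about exploration.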
This summary was written with the help of Pieter Abbeel.