Recently, DeepMind released a new paper showing strong performance on board game tasks using a mechanism similar to the Value Prediction Network one in this paper, which inspired me to go back and get a grounding in this earlier work. A goal of this paper is to design a model-based RL approach that can scale to complex environment spaces, but can still be used to run simulations and do explicit planning. Traditional, model-based RL has worked by learning a dynamics model of the environment - predicting the next observation state given the current one and an action, and then using that model of the world to learn values and plan with. In addition to the advantages of explicit planning, a hope is that model-based systems generalize better to new environments, because they predict one-step changes in local dynamics in a way that can be more easily separated from long-term dynamics or reward patterns. However, a downside of MBRL is that it can be hard to train, especially when your observation space is high-dimensional, and learning a straight model of your environment will lead to you learning details that aren't actually unimportant for planning or creating policies. The synthesis proposed by this paper is the Value Prediction Network. Rather than predicting observed state at the next step, it learns a transition model in latent space, and then learns to predict next-step reward and future value from that latent space vector. Because it learns to encode latent-space state from observations, and also learns a transition model from one latent state to another, the model can be used for planning, by simulating multiple transitions between latent state. However, unlike a normal dynamics model, whose training signal comes from a loss against observational prediction, the signal for training both latent → reward/value/discount predictions, and latent → latent transitions comes from using this pipeline to predict reward values. This means that if an aspect of the environment isn't useful for predicting reward, it won't generally be encoded into latent state, meaning you don't waste model capacity predicting irrelevant detail. https://i.imgur.com/4bJylms.png Once this model exists, it can be used for generating a policy through a tree-search planning approach: simulating future trajectories and aggregating the predicted reward along those trajectories, and then taking the highest-value one. The authors find that their model is able to do better than both model-free and model-based methods on the tasks they tested on. In particular, they find that it has many of the benefits of a model that predicts full observations, but that the Value Prediction Network learns more quickly, and is more robust to stochastic environments where there's an inherent ceiling on how well a next-step observation prediction can work. My main question coming into this paper is: how is this different from simply a value estimator like those used in DQN or A2C, and my impression is that the difference comes from this model's ability to do explicit state simulation in latent space, and then predict a value off of the *latent* state, whereas a value network predicts value from observational state.