The successes of deep learning on complex strategic games like Chess and Go have been largely driven by the ability to do tree search: that is, simulating sequences of actions in the environment, and then training policy and value functions to more speedily approximate the results that more exhaustive search reveals. However, this relies on having a good simulator that can predict the next state of the world, given your action. In some games, with straightforward rules, this is easy to explicitly code, but in many RL tasks like Atari, and in many contexts in the real world, having a good model of how the world responds to your actions is in fact a major part of the difficulty of RL. A response to this within the literature has been systems that learn models of the world from trajectories, and then use those models to do this kind of simulated planning. Historically these have been done by designing models that predict the next observation, given past observations and a passed-in action. This lets you "roll out" observations from actions in a way similar to how a simulator could. However, in high-dimensional observation spaces it takes a lot of model capacity to accurately model the full observation, and many parts of a given observation space will often be irrelevant. https://i.imgur.com/wKK8cnj.png To address this difficulty, the MuZero architecture uses an approach from Value Prediction Networks, and learns an internal model that can predict transitions between abstract states (which don't need to match the actual observation state of the world) and then predict a policy, value, and next-step reward from the abstract state. So, we can plan in latent space, by simulating transitions from state to state through actions, and the training signal for that space representation and transition model comes from being able to accurately predict the reward, the empirical future value at a state (discovered through Monte Carlo rollouts) and the policy action that the rollout search would have taken at that point. If two observations are identical in terms of their implications for these quantities, the transition model doesn't need to differentiate them, making it more straightforward to learn. (Apologies for the long caption in above screenshot; I feel like it's quite useful to gain intuition, especially if you're less recently familiar with the MCTS deep learning architectures DeepMind typically uses) https://i.imgur.com/4nepG6o.png The most impressive empirical aspect of this paper is the fact that it claims (from what I can tell credibly) to be able to perform as well as planning algorithms with access to a real simulator in games like Chess and Go, and as well as model-free models in games like Atari where MFRL has typically been the state of the art (because world models have been difficult to learn). I feel like I've read a lot recently that suggests to me that the distinction between model-free and model-based RL is becoming increasingly blurred, and I'm really curious to see how that trajectory evolves in future.