One of the most notable flaws of modern model-free reinforcement learning is its sample inefficiency: where humans can learn a new task from relatively few examples, models that learn policies or value functions directly from raw data need huge amounts of data to train properly. Because the model isn't given any semantic features, it has to learn a meaningful representation from raw pixels using only the (often sparse, often noisy) signal of reward. Some past approaches have tried learning representations separately from the RL task (where you're not bottlenecked by agent actions), or adding more informative auxiliary objectives to the RL task. Instead, the authors of this paper, quippily titled "Image Augmentation Is All You Need", suggest augmenting input observations through image modification (in particular, by taking different random crops of an observation stack), and integrating that augmentation into the native structure of an RL loss function (in particular, the loss term for Q learning).

There are two main reasons why you might expect image augmentation to be useful:

1. On the most simplistic level, it's just additional data for your network.
2. More specifically, it's additional data designed to exhibit ways an image observation can differ on a pixel level while still not being meaningfully different in terms of its state within the game. You'd expect this kind of information to make your model less prone to overfitting.

The authors describe three different ways to add image augmentation to a Q learning model, and show that each one provides additional marginal value. The first, and most basic, is simply to add augmented versions of observations to your training dataset. The base method, Soft Actor Critic, uses a replay buffer of old observations, and this augmentation works by applying a different crop transformation each time an observation is sampled from the buffer.
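To make the first technique concrete, here is a minimal NumPy sketch of a replay buffer that re-crops each observation stack every time it is sampled. The names (`random_crop`, `ReplayBuffer`) and the crop geometry are illustrative assumptions, not the paper's actual implementation (which uses padded random shifts in PyTorch).

```python
import numpy as np

def random_crop(obs_stack, out_size, rng):
    """Randomly crop a (C, H, W) observation stack to (C, out_size, out_size)."""
    _, h, w = obs_stack.shape
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    return obs_stack[:, top:top + out_size, left:left + out_size]

class ReplayBuffer:
    """Minimal buffer that applies a fresh random crop on every sample."""
    def __init__(self, out_size=84, seed=0):
        self.storage = []
        self.out_size = out_size
        self.rng = np.random.default_rng(seed)

    def add(self, obs, action, reward, next_obs):
        self.storage.append((obs, action, reward, next_obs))

    def sample(self):
        idx = self.rng.integers(len(self.storage))
        obs, action, reward, next_obs = self.storage[idx]
        # A different crop each time the same transition is drawn, so the
        # network effectively sees many pixel-level variants of one state.
        return (random_crop(obs, self.out_size, self.rng), action, reward,
                random_crop(next_obs, self.out_size, self.rng))
```

The key point is that the crop is drawn at sampling time, not at storage time, so a single stored transition yields a different augmented view on every pass.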
This is a neat and simple trick that effectively multiplies the number of distinct observations your network sees by the number of possible crops, making it less prone to overfitting.

The next two ways involve integrating transformed versions of an observation into the structure of the Q function itself. As quick background, Q learning is trained with a Bellman consistency loss: Q tries to estimate the value of a (state, action) pair, assuming you do the best possible thing at every point after taking that action in that state. The consistency loss pushes your Q estimate for a (state, action) pair closer to the sum of the reward you got by taking that action and your current max Q estimate for the next state. This second quantity, the combined reward and next-step Q value, is called the target, since it's what you push your current-step Q value towards. This paper suggests both:

- Averaging your current-step Q estimate over multiple different crops of the observation stack at the current state
- Averaging the next-step Q estimate used in the target over multiple different crops (distinct from the ones used in the current-step averaging)

This has the nice side effect that, in addition to telling your network about image transformations (like small crops) that shouldn't impact its strategic assessment, it also makes your Q learning process lower variance overall, because both the current-step and next-step quantities are averages rather than single-sample values.

https://i.imgur.com/LactlFq.png

Operating in a lower data regime, the authors found that simply adding augmentations to their replay buffer sampling (without the two averaging losses) gave large gains in learning efficiency, but all three techniques combined gave the best performance.
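The two averaging ideas can be sketched in a few lines. This is a simplified single-transition version, assuming hypothetical `q_fn`/`target_q_fn` callables that map an observation to a vector of per-action Q values and a `crop_fn` that returns a fresh random crop on each call; the paper averages over K crops in the target and M crops in the current-step estimate (both 2 in their experiments).

```python
import numpy as np

def drq_q_loss(q_fn, target_q_fn, transition, crop_fn, gamma=0.99, K=2, M=2):
    """Squared TD error with crop-averaged target and crop-averaged
    current-step estimate, in the spirit of the paper's regularized loss."""
    obs, action, reward, next_obs = transition
    # Target: reward plus discounted max-Q at the next state, averaged
    # over K independent random crops of the next observation.
    target = reward + gamma * np.mean(
        [np.max(target_q_fn(crop_fn(next_obs))) for _ in range(K)])
    # Current-step Q estimate, averaged over M different random crops
    # (drawn separately from the crops used in the target above).
    q_est = np.mean([q_fn(crop_fn(obs))[action] for _ in range(M)])
    # The target is treated as a constant, as usual in Q learning.
    return (q_est - target) ** 2
```

Because both sides of the TD error are averages over crops rather than single samples, the loss is lower variance in addition to encoding the crop-invariance prior.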