VQ-VAE is a Variational AutoEncoder that uses as its information bottleneck a discrete set of codes, rather than a continuous vector. That is: the encoder creates a downsampled spatial representation of the image, where in each grid cell of the downsampled image, the cell is represented by a vector. But, before that vector is passed to the decoder, it's discretized, by (effectively) clustering the vectors the network has historically seen, and substituting each vector with the center of the vector it's closest to. This has the effect of reducing the capacity of your information bottleneck, but without just pushing your encoded representation closer to an uninformed prior. (If you're wondering how the gradient survives this very much not continuous operation, the answer is: we just pretend that operation didn't exist, and imagine that the encoder produced the cluster-center "codebook" vector that the decoder sees). The part of the model that got a (small) upgrade in this paper is the prior distribution model that's learned on top of these latent representations. The goal of this prior is to be able to just sample images, unprompted, from the distribution of latent codes. Once we have a trained decoder, if we give it a grid of such codes, it can produce an image. But these codes aren't one-per-image, but rather a grid of many codes representing features in different part of the image. In order to generate a set of codes corresponding to a reasonable image, we can either generate them all at once, or else (as this paper does) use an autoregressive approach, where some parts of the code grid are generated, and then subsequent ones conditioned on those. In the original version of the paper, the autoregressive model used was a PixelCNN (don't have the space to fully explain that here, but, at a high level: a model that uses convolutions over previously generated regions to generate a new region). In this paper, the authors took inspiration from the huge rise of self-attention in recent years, and swapped that operation in in lieu of the convolutions. Self-attention has the nice benefit that you can easily have a global receptive range (each region being generated can see all other regions) which you'd otherwise need multiple layers of convolutions to accomplish. In addition, the authors add an additional layer of granularity: generating both a 32x32 and 64x64 grid, and using both to generate the decoded reconstruction. They argue that this allows one representation to focus on more global details, and the other on more precise ones. https://i.imgur.com/zD78Pp4.png The final result is the ability to generate quite realistic looking images, that at least are being claimed to be more diverse than those generated by GANs (examples above). I'm always a bit cautious of claims of better performance in the image-generation area, because it's all squinting at pixels and making up somewhat-reasonable but still arbitrary metrics. That said, it seems interesting and useful to be aware of the current relative capabilities of two of the main forms of generative modeling, and so I'd recommend this paper on that front, even if it's hard for me personally to confidently assess the improvements on prior art.