These days, a bulk of recent work in Variational AutoEncoders - a type of generative model - focuses on the question of how to add recently designed, powerful decoders (the part that maps from the compressed information bottleneck to the reconstruction) to VAEs, but still cause them to capture high level, conceptual information within the aforementioned information bottleneck (also know as a latent code). In the status quo, it’s the case that the decoder can do well enough even without conditioning on conceptual variables stored in the latent codes, that it’s not worth storing information there. The reason why VAEs typically make it costly to store information in latent codes is the typical inclusion of a term that measures the KL divergence (distributional distance, more or less) between an uninformative unit Gaussian (the prior) and distribution of latent z codes produced for each individual input x (the posterior). Intuitively, if the distribution for each input x just maps to the prior, then that gives the decoder no information about what x was initially passed in: this means the encoder has learned to ignore the latent code. The question of why this penalty term is included in the VAE has two answers, depending on whether you’re asking from a theoretical or practical standpoint. Theoretically, it’s because the original VAE objective function could be interpreted as a lower bound on the true p(x) distribution. Practically, pulling the individual distributions closer to that prior often has a regularizing effect, that causes z codes for individual files to be closer together, and also for closeness in z space to translate more to closeness in recreation concept. That happens because the encoder is disincentivized from making each individual z distribution that far from a prior. The upshot of this is that there’s a lot of overlap between the distributions learned for various input x values, and so it’s in the model’s interest to make the reconstruction of those nearby elements similar as well. The argument of this paper starts from the compression cost side. If you look at the KL divergence term with the prior from an information theory, you can see it as the “cost of encoding your posterior, using a codebook developed from your prior”. This is a bit of an opaque framing, but the right mental image is the morse code tree, the way that the most common character in the English language corresponds to the shortest morse symbol, and so on. This tree was optimized to make messages as short as possible, and was done so by mapping common letters to short symbols. But, if you were to encode a message in, say, Russian, you’d no longer be well optimized for the letter distribution in Russian, and your messages would generally be longer. So, in the typical VAE setting, we’re imagining a receiver who has no idea what message he’ll be sent yes, and so uses the global prior to inform their codebook. By contrast, the authors suggest a world in which we meaningfully order the entries sent to the receiver in terms of similarity. Then, if you use the heuristic “each message provides a good prior for the next message I’ll receive, you incur a lot less coding cost than, because the “prior” is designed to be a good distribution to use to encode this sample, which will hopefully be quite similar to the next one. On a practical level, this translates to: 1. Encoding a z distribution 2. Choosing one of that z code’s K closest neighbors 3. Putting that as input into a “prior network” that takes in the randomly chosen nearby c, and spits out distributional parameters for another distribution over zs, which we’ll call the “prior”. Intuitively, a lot of the trouble with the constraint that all z encodings be close to the same global prior is that that was just too restrictive. This paper tries to impose a local prior instead, that’s basically enforcing local smoothness, by pulling the z value closer to others already nearby it,but without forcing everything to look like a global prior.