This paper outlines (yet another) variation on a variational autoencoder (VAE), which is, at a high level, a model that seeks to 1) learn to construct realistic samples from the data distribution, and 2) capture meaningful information about the data within its latent space. The “latent space” is a way of referring to the information bottleneck that happens when you compress the input (typically for these examples: an image) into a lowdimensional vector, before trying to predict that input out again using that lowdimensional vector as a seed or conditional input. In a typical VAE, the objective function is composed of two terms: a reconstruction loss that captures how well your Decoder distribution captures the X that was passed in as input, and a regularization loss that pushes the latent z code you create to be close to some input prior distribution. Pushing your learned z codes to be closer to a prior is useful because you can then sample using that prior, and have those draws map to the coheret regions of the space, where you’ve trained in the past. The Implicit Autoencoder proposal changes both elements of this objective function, but since one  the modification of the regularization term  is actually drawn from another (Adversarial Autoencoders), I’m primarily going to be focusing on the changes to the reconstruction term. In a typical variational autoencoder, the model is incentivized to perform an exact reconstruction of the input X, by using the latent code as input. Since this distance is calculated on a pixelwise basis, this puts a lot of pressure on the latent z code to learn ways of encoding this detailed local information, rather than what we’d like it to be capturing, which is broader, global structure of the data. In the IAE approach, instead of incentivizing the input x to be high probability in the distribution conditioned by the z that the encoder embedded off of x, we instead try to match the joint distributions of (x, z) and (reconstructedx, z). This is done by taking these two pairs, and running them through a GAN, which needs to tell which pair represents the reconstructed x, and which the input x. Here, the GAN takes as input a concatenation of z (the embedded code for this image), and n, which is a random vector. Since a GAN is a deterministic mapping, this random vector n is what allows for sampling from this model, rather than just pulling the same output every time. Under this system, the model is under less pressure to recreate the details from the particular image that was input. Instead, it just needs to synchronize the use of z between the encoder and the decoder. To understand why this is true, imagine if you had an MNIST set of 1 and 2s, and a binary number for your z distribution. If you encode a 2, you can do so by setting that binary float to 0. Now, as long as your decoder realizes what the encoder was trying to do, and reconstructs a 2, then the joint distribution will be similar between the encoder and decoder, and our new objective function will be happy. An important fact here is: this doesn’t require that the decoder reconstruct the *exact* 2 that was passed in, as long as it matches, in distribution, the set of images that the encoder is choosing to map to the same z code, the decoder can do well. A consequence of this approach is an ability to modulate how much information you actually want to pull out into your latent vector, and how much you just want to be represented by your random noise vector, which will control randomness in the GAN and, to continue the example above, allow you to draw more than one distinct 2 off of the ‘2” latent code. If you have a limited set of z dimensionality, the they will represent high level concepts (for example: MNIST digits) and the rest of the variability in images will be modeled through the native GAN framework. If you have a high dimensional z, then more and more detaillevel information will get encoded into the z vector, rather than just being left to the noise.
Your comment:
