Despite their difficulties in training, Generative Adversarial Networks are still one of the most exciting recent ideas in machine learning; a way to generate data without the fuzziness and averaging of earlier methods. However, up until recently, there had been major way in which the GAN’s primary competitor in the field, the Variational Autoencoder, was superior: it could do inference. Intuitively, inference is the inverse of generation. Whereas generation works by taking some source of randomness - a random vector, the setting of some latent code - and transforming that recipe into an observation, an inference process tries to work in reverse, taking in the observation as input and trying to guess what “recipe” was used to generate it. (As a note: in real world data, it’s generally not the case that there were explicit numerical factors used to generate data; this framing is a simplified model meant to represent the way a small set of latent settings of an object jointly cause a lot of that object’s feature values). The authors of this paper proposed the BiGAN to fix that deficiency in GAN literature. https://i.imgur.com/vZZzWH5.png The BiGAN - short for Bidirectional GAN - works by having two generators, not one. One generator works in the typical fashion of a GAN: taking in a random vector z, and transforming that into G(z) = x. The second generator works in reverse, taking in as input data from the underlying dataset, and transforming it into a code z, E(x) = z. Once these generators are in place, the discriminators work, not by trying to differentiate the x and z values separately, but all together. That works by giving the discriminator a pair, (x, z), and asking the discriminator to decide whether that pair came from the z -> x decoder, or the x -> z encoder. If this model fully converges, it becomes the case that G(z) and E(x) are inverse transformations, giving us a way to take in a new input x, and infer its underlying factors z. This is valuable because it’s been shown that, in typical GANs, changes in z often correspond to latent values we care about, and it would be useful to be able to generate z from x for purposes of representation learning. The authors offer quite a nice intuitive proof for why the model learns this inverse mapping. For each pair of (x, z), it’s either the case that E(x) = z (if the pair came from the encoder), or that G(z) = x (if the pair came from the decoder). But if only one of those is the case, then it’s easy for the discriminator to tell which generation process produced the pair. So, in order to fool the discriminator, G(z) and E(x) need to synchronize their decoding and encoding processes. The authors also tried a method where, instead of having this bidirectional GAN structure, they instead simply built a network on top of the generated samples, that tries to predict the original z used, taking the generated x as input. They show that this performs less well on subjective quality measures of the learned representation, which they attribute to the fact that GANs notoriously only learn some modes of the data, and thus a x -> z encoder that only takes the generated z as input will not have good coverage over the full distribution of x.