This paper proposes to train a neural network generative model by optimizing an importance sampling (IS) weighted estimate of the log probability under the model. The authors show that the case of an estimate based on a single sample actually corresponds to the learning objective of variational autoencoders (VAE). Importantly, they exploit this connection by showing that, similarly to VAE, a gradient can be passed through the approximate posterior (the IS proposal) samples, thus yielding an importance weighted autoencoder (IWAE). The authors also show that, by using more samples, this objective, which is a lower bound of the actual loglikelihood, becomes an increasingly tighter approximation to the loglikelihood. In other words, the IWAE is expected to better optimize the real loglikelihood of the neural network, compared to VAE. The experiments presented show that the model achieves competitive performance on a version of the binarized MNIST benchmark and on the Omniglot dataset. #### My two cents This is a really neat contribution! While simple (both conceptually and algorithmically), it really seems to be an important step forward for the VAE framework. I really like the theoretical result showing that IWAE provides a better approximation to the real loglikelihood, it's quite neat and provides an excellent motivation for the method. The results on binarized MNIST are certainly impressive. Unfortunately, it appears that the training setup isn't actually comparable to the majority of published results on this dataset. Indeed, it seems that they didn't use the stochastic but *fixed* binarization of the inputs that other publications on this benchmark have used (since my paper on NADE with Iain Murray, we've made available that fixed training set for everyone to use, along with fixed validation and test sets as well). I believe instead they've resampled the binarization for each minibatch, effectively creating a setup with a somewhat larger training set than usual. It's unfortunate that this is the case, since it makes this result effectively impossible to compare directly with previous work. I'm being picky on this issue only because I'm super interested in this problem (that is of generative modeling with neural networks) and this little issue is pretty much the only thing that stops this paper from being a slam dunk. Hopefully the authors (or perhaps someone interested in reimplementing IWAE) can clarify this question eventually. Otherwise, it seems quite clear to me that IWAE is an improvement over VAE. The experiments of section 5.2, showing that finetuning a VAE model with IWAE training improves performance, while finetuning a IWAE model using VAE actually makes things worse, is further demonstration that IWAE is indeed a good idea.
Your comment:
