[link]
This very new paper, is currently receiving quite a bit of attention by the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/). The paper describes a new training approach, which solves the two major practical problems with current GAN training: 1) The training process comes with a meaningful loss. This can be used as a (soft) performance metric and will help debugging, tune parameters and so on. 2) The training process does not suffer from all the instability problems. In particular the paper reduces mode collapse significantly. On top of that, the paper comes with quite a bit mathematical theory, explaining why there approach works and other approachs have failed. This paper is a must read for anyone interested in GANs.
Your comment:

[link]
_Objective:_ Robust unsupervised learning of a probability distribution using a new module called the `critic` and the `Earthmover distance`. _Dataset:_ [LSUNBedrooms](http://lsun.cs.princeton.edu/2016/) ## Inner working: Basically train a `critic` until convergence to retrieve the Wasserstein1 distance, see pseudoalgorithm below: [![screen shot 20170503 at 5 05 09 pm](https://cloud.githubusercontent.com/assets/17261080/25667162/003c9330302311e79081c181011f4e6f.png)](https://cloud.githubusercontent.com/assets/17261080/25667162/003c9330302311e79081c181011f4e6f.png) ## Results: * Easier training: no need for batch normalization and no need to finetune generator/discriminator balance. * Less sensitivity to network architecture. * Very good proxy that correlates very well with sample quality. * Nonvanishing gradients. 
[link]
* They suggest a slightly altered algorithm for GANs. * The new algorithm is more stable than previous ones. ### How * Each GAN contains a Generator that generates (fake)examples and a Discriminator that discriminates between fake and real examples. * Both fake and real examples can be interpreted as coming from a probability distribution. * The basis of each GAN algorithm is to somehow measure the difference between these probability distributions and change the network parameters of G so that the fakedistribution becomes more and more similar to the real distribution. * There are multiple distance measures to do that: * Total Variation (TV) * KLDivergence (KL) * JensenShannon divergence (JS) * This one is based on the KLDivergence and is the basis of the original GAN, as well as LAPGAN and DCGAN. * EarthMover distance (EM), aka Wasserstein1 * Intuitively, one can imagine both probability distributions as hilly surfaces. EM then reflects, how much mass has to be moved to convert the fake distribution to the real one. * Ideally, a distance measure has everywhere nice values and gradients (e.g. no +/ infinity values; no binary 0 or 1 gradients; gradients that get continously smaller when the generator produces good outputs). * In that regard, EM beats JS and JS beats TV and KL (roughly speaking). So they use EM. * EM * EM is defined as * ![EM](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__EM.jpg?raw=true "EM") * (inf = infinum, more or less a minimum) * which is intractable, but following the KantorovichRubinstein duality it can also be calculated via * ![EM tractable](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__EM_tractable.jpg?raw=true "EM tractable") * (sup = supremum, more or less a maximum) * However, the second formula is here only valid if the network is a KLipschitz function (under every set of parameters). * This can be guaranteed by simply clipping the discriminator's weights to the range `[0.01, 0.01]`. * Then in practice the following version of the tractable EM is used, where `w` are the parameters of the discriminator: * ![EM tractable in practice](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__EM_tractable_practice.jpg?raw=true "EM tractable in practice") * The full algorithm is mostly the same as for DCGAN: * ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__algorithm.jpg?raw=true "Algorithm") * Line 2 leads to training the discriminator multiple times per batch (i.e. more often than the generator). * This is similar to the `max w in W` in the third formula (above). * This was already part of the original GAN algorithm, but is here more actively used. * Because of the EM distance, even a "perfect" discriminator still gives good gradient (in contrast to e.g. JS, where the discriminator should not be too far ahead). So the discriminator can be safely trained more often than the generator. * Line 5 and 10 are derived from EM. Note that there is no more Sigmoid at the end of the discriminator! * Line 7 is derived from the KLipschitz requirement (clipping of weights). * High learning rates or using momentumbased optimizers (e.g. Adam) made the training unstable, which is why they use a small learning rate with RMSprop. ### Results * Improved stability. The method converges to decent images with models which failed completely when using JSdivergence (like in DCGAN). * For example, WGAN worked with generators that did not have batch normalization or only consisted of fully connected layers. * Apparently no more mode collapse. (Mode collapse in GANs = the generator starts to generate often/always the practically same image, independent of the noise input.) * There is a relationship between loss and image quality. Lower loss (at the generator) indicates higher image quality. Such a relationship did not exist for JS divergence. * Example images: * ![Example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__examples.jpg?raw=true "Example images") 