[link]
The paper provides derivations and intuitions about the learning dynamics for VAEs based on observations about [$\beta$VAEs][beta]. Using this they derive an alternative way to constrain the training of VAEs that doesn't require typical heuristics, such as warmup or adding noise to the data. How exactly would this change a typical implementation? Typically, SGD is used to [optimize the ELBO directly](https://github.com/pytorch/examples/blob/master/vae/main.py#L91L95). Using GECO, I keep a moving average of my constraint $C$ (chosen based on what I want the VAE to do, but it can be just the likelihood plus a tolerance parameter) and use that to calculate Lagrange multipliers, which control the weighting of the constraint to the loss. [This implementation](https://github.com/denproc/TamingVAEs/blob/master/train.py#L83L97) from a class project appears to be correct. With the stabilization of training, I can't help but think of this as batchnorm for VAEs. [beta]: https://openreview.net/forum?id=Sy2fzU9gl
Your comment:
