Joint Stochastic Approximation learning of Helmholtz Machines
Xu, Haotian and Ou, Zhijian, 2016
Paper summary (openreview)
The authors present a new method to perform maximum likelihood training for Helmholtz machines. This paper follows up on recent work that jointly trains a directed generative model $p(h)p(x|h)$ and an approximate inference model $q(h|x)$. The authors provide a concise summary of previous work and of how the earlier methods differ from one another (e.g. Table 1).
Their new method maintains a (persistent) MCMC chain of latent configurations per training datapoint and uses $q(h|x)$ as a proposal distribution in a Metropolis-Hastings-style sampling algorithm. The proposed algorithm looks promising, although the authors do not provide any in-depth analysis that highlights its potential strengths and weaknesses. For example, it seems plausible that the persistent Markov chain could deal with more complex posterior distributions $p(h|x)$ than RWS or NVIL, because these have to find high-probability configurations under $p(h|x)$ by drawing only a few samples from (a typically factorial) $q(h|x)$. It would therefore be interesting to measure the distance between the intractable $p(h|x)$ and the approximate inference distribution $q(h|x)$ by estimating $KL(q \| p)$, by estimating the effective sample size for samples $h \sim q(h|x)$, or by showing the final test-set NLL estimates as a function of the number of samples $h$ drawn from $q$ (compared to other methods). It would also be interesting to see how this method compares to the others when deeper models are trained.
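To make the sampling step and the suggested diagnostic concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes a toy model with a single binary latent $h$ and a fixed $x$, and the names `mh_step`, `run_chain` and `effective_sample_size`, as well as the toy probabilities, are hypothetical. It shows a Metropolis-Hastings update that uses $q(h|x)$ as an independence proposal, and the standard effective-sample-size estimate computed from importance weights $w = p(x,h)/q(h|x)$.

```python
import math
import random


def mh_step(h, log_joint, log_q, sample_q, rng):
    """One independence Metropolis-Hastings step with q(h|x) as the proposal."""
    h_new = sample_q(rng)
    # Log importance weights: log w(h) = log p(x, h) - log q(h | x).
    log_w_old = log_joint(h) - log_q(h)
    log_w_new = log_joint(h_new) - log_q(h_new)
    # Accept with probability min(1, w_new / w_old).
    if log_w_new >= log_w_old or rng.random() < math.exp(log_w_new - log_w_old):
        return h_new
    return h


def run_chain(n_steps, seed=0):
    """Run a chain on a toy model (assumed values) and return the frequency of h = 1."""
    rng = random.Random(seed)
    joint = {0: 0.1, 1: 0.3}  # toy p(x, h); exact posterior p(h=1 | x) = 0.75
    q = {0: 0.5, 1: 0.5}      # deliberately mismatched proposal q(h | x)

    def log_joint(s):
        return math.log(joint[s])

    def log_q(s):
        return math.log(q[s])

    def sample_q(r):
        return 0 if r.random() < q[0] else 1

    h = 0  # persistent state; in training it would be stored per datapoint
    ones = 0
    for _ in range(n_steps):
        h = mh_step(h, log_joint, log_q, sample_q, rng)
        ones += h
    return ones / n_steps


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed stably from log importance weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(wi * wi for wi in w)


if __name__ == "__main__":
    print("estimated p(h=1 | x):", run_chain(20000))
    print("ESS for equal weights:", effective_sample_size([0.0, 0.0, 0.0, 0.0]))
```

In this toy setting the chain's long-run frequency of $h = 1$ should approach the exact posterior even though $q$ is uniform, which is the property the review speculates could help with complex posteriors. An effective sample size close to the number of samples would indicate that $q(h|x)$ is close to $p(h|x)$; a value near 1 would indicate a poor match.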
In summary: I think the paper presents an interesting method and provides sufficient experimental results for a workshop contribution. For a full conference or journal publication it would need to be extended.