The authors present a new method to perform maximum-likelihood training of Helmholtz machines. This paper follows up on recent work that jointly trains a directed generative model $p(h)p(x|h)$ and an approximate inference model $q(h|x)$. The authors provide a concise summary of previous work and their mutual differences (e.g. Table 1). Their new method maintains a (persistent) MCMC chain of latent configurations per training datapoint and uses $q(h|x)$ as a proposal distribution in a Metropolis-Hastings-style sampling algorithm.

The proposed algorithm looks promising, although the authors do not provide any in-depth analysis that highlights its potential strengths and weaknesses. For example: it seems plausible that the persistent Markov chain could deal with more complex posterior distributions $p(h|x)$ than RWS or NVIL, because those methods have to find high-probability configurations under $p(h|x)$ by drawing only a few samples from (a typically factorial) $q(h|x)$. It would therefore be interesting to measure the distance between the intractable posterior $p(h|x)$ and the approximate inference distribution $q(h|x)$, e.g. by estimating $KL(q\,\|\,p)$, by estimating the effective sample size for samples $h \sim q(h|x)$, or by showing the final test-set NLL estimates as a function of the number of samples $h$ drawn from $q$ (compared to other methods). It would also be interesting to see how this method compares to the others when deeper models are trained.

In summary: I think the paper presents an interesting method and provides sufficient experimental results for a workshop contribution. For a full conference or journal publication it would need to be extended.
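To make the two ideas above concrete, here is a minimal sketch (not the authors' implementation; the toy model, its parameters, and all function names are my own assumptions for illustration): a persistent per-datapoint chain updated by an independence-type Metropolis-Hastings step that proposes from $q(h|x)$ and accepts with probability $\min\!\left(1, \frac{p(x,h')\,q(h|x)}{p(x,h)\,q(h'|x)}\right)$, plus the importance-weighted effective-sample-size diagnostic $\mathrm{ESS} = (\sum_j w_j)^2 / \sum_j w_j^2$ with $w_j = p(x,h_j)/q(h_j|x)$ that I suggest for quantifying the $q$-to-posterior mismatch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model with one binary latent h:
#   p(h) = Bernoulli(0.5),  p(x|h) = Normal(mu_h, 1),  q(h=1|x) = sigmoid(w*x)
def log_p_joint(x, h, mu=(-2.0, 2.0)):
    log_prior = np.log(0.5)
    log_lik = -0.5 * (x - mu[h]) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_prior + log_lik

def q_prob(x, w=1.0):
    # probability that h = 1 under the (factorial) inference model
    return 1.0 / (1.0 + np.exp(-w * x))

def mh_step(x, h_current):
    # Independence sampler: propose h' ~ q(h|x), accept with
    # min(1, p(x,h') q(h|x) / (p(x,h) q(h'|x)))
    p1 = q_prob(x)
    h_prop = int(rng.random() < p1)
    log_q = lambda h: np.log(p1 if h == 1 else 1.0 - p1)
    log_alpha = (log_p_joint(x, h_prop) + log_q(h_current)
                 - log_p_joint(x, h_current) - log_q(h_prop))
    return h_prop if np.log(rng.random()) < log_alpha else h_current

def ess(x, n=1000):
    # Effective sample size of n samples h ~ q(h|x), weighted by p(x,h)/q(h|x)
    p1 = q_prob(x)
    hs = (rng.random(n) < p1).astype(int)
    log_w = np.array([log_p_joint(x, h) - np.log(p1 if h else 1.0 - p1)
                      for h in hs])
    w = np.exp(log_w - log_w.max())          # stabilized weights
    return w.sum() ** 2 / (w ** 2).sum()

# Persistent chain: one latent state per datapoint, carried across sweeps
data = np.array([-2.1, 1.8, 2.3])
chains = np.zeros(len(data), dtype=int)
for sweep in range(100):
    for i, x in enumerate(data):
        chains[i] = mh_step(x, chains[i])
# chains should concentrate on the latent state matching each x's sign
```

An ESS close to $n$ would indicate that $q(h|x)$ matches the posterior well (and that few-sample methods like RWS should suffice), while a small ESS would support the hypothesis that the persistent chain has an advantage.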