Estimating Gradients for Discrete Random Variables by Sampling without Replacement Estimating Gradients for Discrete Random Variables by Sampling without Replacement
Paper summary It's a shame that the authors weren't able to continue their series of [great][reinforce] [paper][attention] [titles][beams], although it looks like they thought about calling this paper **"Put Replacement In Your Basement"**. Also, although they don't say it in the title or abstract, this paper introduces an estimator the authors call the **"unordered set estimator"** which, as a name, is not the best. However, this is one of the most exciting estimators for gradients of non-differentiable expectations over structured discrete distributions (that sounds specific but includes a vast number of important problems, for example see [the first paragraph of this paper][mcts] and, less important but fun, adversarial examples or GANS on sequences). For a distribution $p_\theta(s)$, where $s$ can take some set of discrete states, this paper is concerned with how to compute the gradient of the expectation of a function, $f$, over $p_\theta(s)$: $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] $$ To jump directly to the definition of the estimator I've got to define the *leave-one-out* ratio. Given a distribution $p$ over unordered sets $S^k$, the leave-one-out ratio is $$ R(S^k, s) = \frac{p^{D \backslash \{ s \}} (S^k \backslash \{ s \} )}{p(S^k)}. $$ Where $p^{D \backslash \{ s \}}$ is the distribution over sets that don't contain $s$, which is the value of the first sampled element. For a given element $s$, it's the ratio of probability of sampling the other elements *given $s$* in $S^k$ divided by the probability of sampling $S^k$. Then the *unordered set estimator* is $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] = \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ \sum_{s \in S^k} p(s) R(S^k, s) f(s) \right] \\ = \mathbb{E}_{s \sim p_\theta(s)} \left[ \nabla_\theta p_\theta (s) R(S^k, s) f(s) \right], $$ and they proceed to show that it's a Rao-Blackwellization (so guaranteed as low or lower variance) than a single sample estimator, importance weighted estimators and the [the RB discrete estimator][rb]. The authors also show that they can use the multiple samples to compute a built-in baseline and further reduce variance. If we define the leave-one-out ratio on a distribution with one element already left out as $R^{D \backslash \{ s \} } (S^k, s')$ (the *second-order* leave-one-out ratio): $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] \\ = \mathbb{E}_{s \sim p_\theta(s)} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) \left( f(s) - \sum_{s' \in S^k} p_{\theta} (s') R^{D \backslash \{ s \} } (S^k, s') f(s') \right) \right] $$ Thanks to [stochastic beam search][beams] these sets can be sampled quickly and easily, so it all adds up to a powerful generic method. If I need gradients in an SGD setting of an expectation involving discrete variables, and I can take multiple samples, this is the most promising method I know about at the moment. [reinforce]: [attention]: [beams]: [mcts]: [rb]:
Estimating Gradients for Discrete Random Variables by Sampling without Replacement
Kool, Wouter and van Hoof, Herke and Welling, Max
International Conference on Learning Representations - 2020 via Local Bibsonomy
Keywords: readings, optimization, sampling

Summary by Gavin Gray 2 weeks ago
Your comment: allows researchers to publish paper summaries that are voted on and ranked!

Sponsored by: and