Categorical Reparameterization with Gumbel-Softmax

Jang, Eric and Gu, Shixiang and Poole, Ben, 2016

Paper summary by gngdb

In [stochastic computation graphs][scg], like [variational autoencoders][vae], using discrete variables is hard because sampling them is not differentiable, so we can't just backpropagate through Monte Carlo estimates. This paper introduces a distribution that is a smoothed version of the [categorical distribution][cat], with a temperature parameter that, as it goes to zero, makes it approach the categorical distribution. The smoothed distribution is continuous and can be reparameterised.
To see how, start from the Gumbel-max trick, which samples a categorical $z$ like this (the $g_i$ are i.i.d. standard Gumbel samples and $\boldsymbol{\pi}/\sum_j \pi_j$ are the categorical probabilities):
$$
z = \text{one_hot} \left( \underset{i}{\text{arg max}} [ g_i + \log \pi_i ] \right)
$$
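A minimal sketch of the Gumbel-max sampler above in plain Python, assuming strictly positive weights. The function names `sample_gumbel` and `gumbel_max_sample` and the example weights are mine, not the paper's. If the trick works as described, the empirical frequencies in `freqs` should come out close to $\boldsymbol{\pi}/\sum_j \pi_j$.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    # rng.random() can return exactly 0.0, where log(u) is undefined.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(pi, rng):
    """Return a one-hot list for a categorical with unnormalised weights pi.

    Assumes every weight in pi is strictly positive (log(0) is undefined).
    """
    scores = [sample_gumbel(rng) + math.log(p) for p in pi]
    k = max(range(len(pi)), key=lambda i: scores[i])  # arg max over i
    return [1 if i == k else 0 for i in range(len(pi))]


rng = random.Random(0)
pi = [1.0, 2.0, 7.0]  # unnormalised; the probabilities are 0.1, 0.2, 0.7
n = 20000
counts = [0] * len(pi)
for _ in range(n):
    z = gumbel_max_sample(pi, rng)
    counts[z.index(1)] += 1
freqs = [c / n for c in counts]  # empirical class frequencies
```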
This paper replaces the one-hot argmax with a [softmax][] and introduces a temperature $\tau$ to control the "discreteness":
$$
z = \text{softmax} \left( \frac{ g_i + \log \pi_i}{\tau} \right)
$$
I made a [notebook that illustrates this][nb] while looking at another paper that came out at the same time, which I should probably compare against here.
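
The relaxed sample can also be sketched in a few lines of plain Python. This is an illustration under my own assumptions (strictly positive weights, and the hypothetical names `sample_gumbel_noise` and `gumbel_softmax`), not the authors' code. The noise `g` is drawn separately from the parameters, which is what makes this a reparameterisation: for a fixed `g` the output is a deterministic, smooth function of `pi` and `tau`. Reusing the same `g` at two temperatures shows the colder sample sitting closer to one-hot.

```python
import math
import random


def sample_gumbel_noise(k, rng):
    """Draw k independent standard Gumbel(0, 1) samples."""
    noise = []
    for _ in range(k):
        u = rng.random()
        # rng.random() can return exactly 0.0, where log(u) is undefined.
        while u == 0.0:
            u = rng.random()
        noise.append(-math.log(-math.log(u)))
    return noise


def gumbel_softmax(pi, g, tau):
    """Compute softmax((g_i + log pi_i) / tau) for fixed noise g.

    Assumes every weight in pi is strictly positive and tau > 0.
    """
    scaled = [(g_i + math.log(p_i)) / tau for g_i, p_i in zip(g, pi)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
pi = [1.0, 2.0, 7.0]  # unnormalised categorical weights
g = sample_gumbel_noise(len(pi), rng)

z_warm = gumbel_softmax(pi, g, tau=1.0)  # smoother, further from one-hot
z_cold = gumbel_softmax(pi, g, tau=0.1)  # same noise, closer to one-hot
```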
Comparison with [Concrete Distribution][concrete]
---------------------------------------------------------------
The Concrete and Gumbel-softmax distributions are exactly the same (notation switch: $\tau \to \lambda$, $\pi_i \to \alpha_k$, $g_i \to G_k$). Both papers have structured output prediction experiments (predict one half of MNIST digits from the other half). This paper shows Gumbel-softmax always being better than the estimators it is compared against, but it doesn't compare to VIMCO, which is sometimes better at test time in the Concrete distribution paper.
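
For reference, here is the relaxed sample written out in the Concrete paper's symbols, as far as I recall its notation (the category count $n$ is my own label). It is term for term the softmax expression from the summary above:

$$
X_k = \frac{\exp \left( (\log \alpha_k + G_k) / \lambda \right)}{\sum_{i=1}^{n} \exp \left( (\log \alpha_i + G_i) / \lambda \right)}
$$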
Sidenote - blog post
----------------------------
The authors posted a [nice blog post][blog] that is also a good short summary and explanation.
[blog]: http://blog.evjang.com/2016/11/tutorial-categorical-variational.html
[scg]: https://arxiv.org/abs/1506.05254
[vae]: https://arxiv.org/abs/1312.6114
[cat]: https://en.wikipedia.org/wiki/Categorical_distribution
[softmax]: https://en.wikipedia.org/wiki/Softmax_function
[concrete]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.00712
[nb]: https://gist.github.com/gngdb/ef1999ce3a8e0c5cc2ed35f488e19748
