There has been recently two concurrent works on using a relaxed version of the Gumbel-Max Trick to train deep probabilistic models (The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables and Categorical Reparameterization with Gumbel-Softmax). I like to compare the Gumbel-Max trick, as described in this blogpost, to the Reparametrization Trick (previously referred to as Backpropagating through Random Number Generators) as it decomposes sampling from a parametrized distribution into two parts: sampling from a standard distribution and a deterministic parametrized function, possibly resulting in accelerated sampling.
The Gumbel-Max Trick is a method to sample from a categorical distribution , where category has probability to be sampled among K categories, and relies on the Gumbel distribution defined by the Cumulative Distribution Function:
This trick is based on the clever observation that, if , then follows the desired categorical distribution (see this blogpost from Ryan Adams for proof). In practice, we can sample from this Gumbel distrbution using auxiliary random uniform variables in using inverse transform sampling:
Last summer, Luke Vilnis asked me whether it was possible to practically infer those Gumbel random variables given an observed category . That is: what is the expression of the posterior distribution and how to sample from it ? As a thought experiment, I will describe in this blogpost how to do it.
If we choose to be interested in the associated uniform random variables , inference of those latent variables looks conceptually easier. Indeed, given the observed category , can be more easily derived through Bayes Rule. If we consider the set
- because follow a priori the uniform distribution ;
- by definition of .
Therefore, we have
Meaning that is uniform in the set defined by the nonlinear constraints
of volume .
2-D example of how and partition the space of configurations with and .
Usually, when we sample from a uniform distribution, it is usually from an axis-aligned box from which it’s very easy to sample like . But in general, sampling from an arbitrary uniform distribution can be non-trivial and requires more complex sampling procedure like Markov Chain Monte Carlo algorithms, especially since define a partition with nonlinear boundaries. In this case, you will see a derivation of an exact Monte Carlo procedure from the described constraints.
Meaning that . Finally, we derive the expression of in :
Once again, we can efficiently sample from this posterior distribution using auxiliary random uniform variables through inverse transform sampling:
If you were to infer the Gumbel variables , then .
I’m very grateful to Luke Vilnis, Vincent Dumoulin, Kyle Kastner, Harm de Vries and my supervisors, Yoshua Bengio and Samy Bengio, for interesting discussions that were helpful in writing this post. I discovered the Gumbel-Max Trick through the NIPS 2014 paper A* sampling.