It is well known from classical literature in variational inference that Variational Auto-Encoders optimize the Evidence Lower Bound (ELBo), a lower bound on the log-likelihood of the data. I will point out in this blog post that they also optimize a joint likelihood.

Reminder on Variational Auto-Encoders

Given a data space $\mathcal{X}$ and a continuous latent space $\mathcal{Z}$, the standard Variational Auto-Encoder aims at learning a directed generative model defined by a Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$ and a generator network defining a distribution $p_\theta(x \mid z)$ on a data point $x \in \mathcal{X}$ conditioned on a latent variable $z \in \mathcal{Z}$.

Since optimizing the marginal likelihood $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$ is in general intractable, the inventors of the method chose to rely on an auxiliary distribution $q_\phi(z \mid x)$, called the approximate posterior, to optimize the ELBo, a lower bound on the log-likelihood:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$
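As a concrete illustration, here is a minimal NumPy sketch that estimates the ELBo by Monte Carlo for a fully Gaussian toy model; the model, encoder, and all numerical parameters below are hypothetical choices for the sake of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, std):
    # log density of a diagonal Gaussian, summed over the last axis
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(std)
                  - 0.5 * ((v - mean) / std) ** 2, axis=-1)

# Toy 1-D model (all parameters hypothetical):
# prior p(z) = N(0, 1), decoder p(x|z) = N(x; 2z, 1),
# encoder q(z|x) = N(z; 0.4x, 0.8^2)
x = np.array([1.5])
mu, sigma = 0.4 * x, np.array([0.8])

# Monte Carlo estimate of the ELBo:
# E_q [ log p(x|z) + log p(z) - log q(z|x) ]
z = mu + sigma * rng.standard_normal((100_000, 1))
elbo = np.mean(log_normal(x, 2 * z, 1.0)
               + log_normal(z, 0.0, 1.0)
               - log_normal(z, mu, sigma))
print(elbo)
```

Since everything here is Gaussian, the true log-likelihood is available in closed form ($x$ is marginally $\mathcal{N}(0, 5)$ in this toy model), and the estimate indeed stays below it, with the gap given by the KL divergence between the approximate and true posteriors.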

Building a reasonable approximate posterior is called doing approximate inference. Instead of doing approximate inference at every gradient iteration, the standard Variational Auto-Encoder defines $q_\phi(z \mid x)$ as a Gaussian $\mathcal{N}\big(z; \mu_\phi(x), \sigma_\phi^2(x)\big)$ through the functions $\mu_\phi$ and $\sigma_\phi$ used to amortize inference.

Since the expectation is over a parametrized distribution $q_\phi(z \mid x)$, its optimization would normally rely on a high-variance gradient estimator as derived in REINFORCE. However, the standard Variational Auto-Encoder is able to fully differentiate the cost function via the Reparametrization Trick, by defining an auxiliary standard Gaussian random variable $\epsilon \sim \mathcal{N}(0, I)$ and, given a data point $x$, redefining the latent variable as $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$.
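A small NumPy sketch of this trick, with toy affine functions standing in for the encoder networks $\mu_\phi$ and $\sigma_\phi$ (both are hypothetical stand-ins, not the papers' architectures):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical amortized "encoder": two affine maps standing in for
# the networks mu_phi and sigma_phi of the blog post.
def mu_phi(x):
    return 0.5 * x + 0.1

def sigma_phi(x):
    return np.exp(0.2 * x - 1.0)  # positive by construction

x = np.array([0.3, -1.2])

# Reparametrization Trick: z = mu_phi(x) + sigma_phi(x) * eps,
# with eps drawn from a standard Gaussian independent of phi.
eps = rng.standard_normal((200_000, 2))
z = mu_phi(x) + sigma_phi(x) * eps

print(z.mean(axis=0), z.std(axis=0))
```

The empirical mean and standard deviation of the samples match $\mu_\phi(x)$ and $\sigma_\phi(x)$, while all the stochasticity is isolated in $\epsilon$; this is what makes the objective differentiable with respect to $\phi$.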

For more information, please read the original papers, Auto-Encoding Variational Bayes and Stochastic Backpropagation and Approximate Inference in Deep Generative Models.

Optimizing a joint

After some discussions following my talk at Twitter, I discovered that one under-appreciated fact about Variational Auto-Encoders is that they also optimize the joint log-likelihood $\log p_\theta(x, \epsilon)$ of the data $x$ and the auxiliary noise variable $\epsilon$.

Variational Auto-Encoders rely on the Reparametrization Trick, a straightforward application of the Change of Variable formula. According to this formula, if $z = g(\epsilon)$ and $g$ is bijective, then:

$$p_Z(z) = p_\epsilon\big(g^{-1}(z)\big) \left| \det \frac{\partial g}{\partial \epsilon} \right|^{-1}.$$
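To make the formula concrete, here is a small numerical sanity check on a hypothetical bijection $g(\epsilon) = \exp(\epsilon)$ applied to a standard Gaussian (this turns it into a log-normal):

```python
import numpy as np

def p_eps(e):
    # density of the standard Gaussian base variable
    return np.exp(-0.5 * e ** 2) / np.sqrt(2.0 * np.pi)

# Bijection g(eps) = exp(eps), so g^{-1}(z) = log(z) and the
# Jacobian dg/deps evaluated at g^{-1}(z) equals z itself.
def p_z(z):
    # change-of-variable formula: p_Z(z) = p_eps(g^{-1}(z)) / |det J|
    return p_eps(np.log(z)) / z

# Sanity check: the transformed density still integrates to one
grid = np.linspace(1e-6, 60.0, 400_000)
mass = np.sum(p_z(grid)) * (grid[1] - grid[0])
print(mass)
```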

In our case, given a data point $x$, the standard Variational Auto-Encoder algorithm defines a bijective transformation $z = g_x(\epsilon) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ from the auxiliary variable $\epsilon$, with a Jacobian determinant of $\prod_i \sigma_{\phi,i}(x)$. Therefore, $q_\phi(z \mid x) = \mathcal{N}(\epsilon; 0, I) \prod_i \sigma_{\phi,i}(x)^{-1}$, which checks with the fact that $q_\phi(z \mid x) = \mathcal{N}\big(z; \mu_\phi(x), \sigma_\phi^2(x)\big)$. Likewise, $p_\theta(x, \epsilon) = p_\theta(x, z) \prod_i \sigma_{\phi,i}(x)$.
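This identity is easy to verify numerically; the sketch below uses made-up encoder outputs for a single data point and checks that the base density divided by the Jacobian determinant matches the Gaussian density $q_\phi(z \mid x)$ evaluated directly:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_normal(v, mean, std):
    # log density of a diagonal Gaussian, summed over the last axis
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(std)
                  - 0.5 * ((v - mean) / std) ** 2, axis=-1)

# Hypothetical encoder outputs for one data point x
mu = np.array([0.7, -0.2, 1.1])
sigma = np.array([0.5, 1.3, 0.9])

eps = rng.standard_normal(3)
z = mu + sigma * eps  # bijective in eps for a fixed x

# Left-hand side: base density divided by |det Jacobian| = prod(sigma)
lhs = log_normal(eps, 0.0, 1.0) - np.sum(np.log(sigma))
# Right-hand side: the Gaussian density q(z|x) evaluated directly
rhs = log_normal(z, mu, sigma)
print(lhs, rhs)  # identical up to floating-point error
```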

I’ll define $q(x, \epsilon) = q_{\mathcal{D}}(x)\, \mathcal{N}(\epsilon; 0, I)$ as the true data and auxiliary variable joint distribution and $p_\theta(x, \epsilon)$ as the model distribution. If we manipulate the log-likelihood of $(x, \epsilon)$ according to the model $p_\theta$, we obtain:

$$\begin{aligned}
\log p_\theta(x, \epsilon) &= \log p_\theta(x, z) + \sum_i \log \sigma_{\phi,i}(x) \\
&= \log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x) + \log \mathcal{N}(\epsilon; 0, I).
\end{aligned}$$

As $\mathbb{E}_{p(\epsilon)}\big[\log \mathcal{N}(\epsilon; 0, I)\big]$ is constant (the negative entropy of the standard Gaussian), optimizing the Evidence Lower Bound is equivalent in this case to optimizing the expected joint log-likelihood $\mathbb{E}_{q(x, \epsilon)}\big[\log p_\theta(x, \epsilon)\big]$.
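The equivalence can be checked pointwise on a toy fully Gaussian model (all numbers below are hypothetical): for any draw of $\epsilon$, the ELBo integrand equals $\log p_\theta(x, \epsilon) - \log p(\epsilon)$ exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_normal(v, mean, std):
    # log density of a diagonal Gaussian, summed over the last axis
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(std)
                  - 0.5 * ((v - mean) / std) ** 2, axis=-1)

# Toy 1-D Gaussian VAE (all parameters hypothetical):
# p(z) = N(0, 1), p(x|z) = N(x; 2z, 1), q(z|x) = N(z; 0.4x, 0.8^2)
x = np.array([1.5])
mu, sigma = 0.4 * x, np.array([0.8])

eps = rng.standard_normal(1)
z = mu + sigma * eps

# ELBo integrand: log p(x|z) + log p(z) - log q(z|x)
elbo_term = (log_normal(x, 2 * z, 1.0) + log_normal(z, 0.0, 1.0)
             - log_normal(z, mu, sigma))

# Model joint on (x, eps): log p(x, eps) = log p(x, z) + sum log sigma
log_joint = (log_normal(x, 2 * z, 1.0) + log_normal(z, 0.0, 1.0)
             + np.sum(np.log(sigma)))

# The two differ exactly by log p(eps), a term whose expectation is
# the (constant) negative entropy of the standard Gaussian.
print(elbo_term - (log_joint - log_normal(eps, 0.0, 1.0)))
```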

This observation serves as an interesting a posteriori justification for the coupling layer architecture in NICE and Real NVP, but also highlights the presence of a “triangular” pattern (e.g. in the Jacobian of the mapping $(x, \epsilon) \mapsto (x, z)$) in several tractable probabilistic generative learning algorithms.

[Figure: triangular pattern confirmed]


The motivation for this rewriting comes from discussions with Hugo Larochelle. This connection was also discussed with my PhD supervisor, Yoshua Bengio.