It is well known from the classical literature on variational inference that Variational Auto-Encoders optimize the Evidence Lower Bound (ELBO), a lower bound on the log-likelihood of the data. I will point out in this blog post that they also optimize a joint likelihood.
Reminder on Variational Auto-Encoders
Given a data space $\mathcal{X}$ and a continuous latent space $\mathcal{Z}$, the standard Variational Auto-Encoder aims at learning a directed generative model defined by a Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$ and a generator network defining a distribution $p_\theta(x \mid z)$ on a data point $x$ conditioned on a latent variable $z$.
Since optimizing the marginal log-likelihood $\log p_\theta(x)$ is in general intractable, their inventors chose to rely on an auxiliary distribution $q_\phi(z \mid x)$, called the approximate posterior, to optimize the ELBO, a lower bound on the log-likelihood:

$$\log p_\theta(x) \;\geq\; \mathcal{L}(x; \theta, \phi) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\big].$$
Building a reasonable approximate posterior is called doing approximate inference. Instead of doing approximate inference at every gradient iteration, the standard Variational Auto-Encoder defines $q_\phi(z \mid x)$ as a Gaussian $\mathcal{N}\big(z; \mu_\phi(x), \mathrm{diag}(\sigma_\phi(x)^2)\big)$ through the functions $\mu_\phi$ and $\sigma_\phi$ used to amortize inference.
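As an illustration of amortized inference, here is a minimal NumPy sketch (with hypothetical toy weights and dimensions, not the architecture from the papers) of an encoder that maps a data point to the parameters of $q_\phi(z \mid x)$ in a single forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy encoder weights: data dimension 4, latent dimension 2.
W_h, b_h = rng.normal(size=(8, 4)), np.zeros(8)
W_mu, b_mu = rng.normal(size=(2, 8)), np.zeros(2)
W_logsig, b_logsig = rng.normal(size=(2, 8)), np.zeros(2)

def encode(x):
    """Amortized inference: one network maps any data point x to the parameters
    (mu_phi(x), sigma_phi(x)) of the Gaussian approximate posterior q(z | x)."""
    h = np.tanh(W_h @ x + b_h)
    mu = W_mu @ h + b_mu
    sigma = np.exp(W_logsig @ h + b_logsig)  # log-parametrization keeps sigma positive
    return mu, sigma

mu, sigma = encode(rng.normal(size=4))  # a single forward pass, no per-example optimization
```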
Since the expectation is over a parametrized distribution $q_\phi(z \mid x)$, its optimization would normally rely on a high-variance gradient estimator such as the one derived in REINFORCE. However, the standard Variational Auto-Encoder is able to fully differentiate the cost function via the Reparametrization Trick, by defining an auxiliary standard Gaussian random variable $\epsilon \sim \mathcal{N}(0, I)$ and, given a data point $x$, redefining the latent variable as $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$.
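Below is a minimal, self-contained sketch of a single-sample ELBO estimate using the Reparametrization Trick, assuming a unit-variance Gaussian decoder and hypothetical toy weights; the KL term is the analytic KL between a diagonal Gaussian and the standard Gaussian prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, sigma, decoder_mean):
    """Single-sample ELBO estimate via the reparametrization trick.
    mu, sigma are the encoder outputs for x; decoder_mean maps z to the mean
    of a unit-variance Gaussian p(x | z) (all hypothetical toy choices)."""
    eps = rng.standard_normal(mu.shape)   # auxiliary noise, does not depend on phi
    z = mu + sigma * eps                  # reparametrized sample from q(z | x)
    log_px_z = np.sum(-0.5 * (x - decoder_mean(z)) ** 2 - 0.5 * np.log(2 * np.pi))
    # Analytic KL between N(mu, diag(sigma^2)) and the standard Gaussian prior.
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)
    return log_px_z - kl

# Toy usage with a hypothetical linear decoder.
W = rng.normal(size=(4, 2))
print(elbo_estimate(rng.normal(size=4), np.zeros(2), np.ones(2), lambda z: W @ z))
```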
Please read the original papers for more information: Auto-Encoding Variational Bayes and Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
Optimizing a joint
After some discussions following my talk at Twitter, I discovered that one under-appreciated fact about Variational Auto-Encoders is that they also optimize the joint log-likelihood $\log p_\theta(x, \epsilon)$ of the data and the auxiliary noise variable.
Variational Auto-Encoders rely on the Reparametrization Trick, a straightforward application of the Change of Variable formula. According to this formula, if $z = f(\epsilon)$ with $\epsilon \sim p(\epsilon)$ and $f$ is bijective, then:

$$\log q(z) = \log p(\epsilon) - \log \left|\det \frac{\partial f}{\partial \epsilon}(\epsilon)\right|.$$
In our case, given a data point $x$, the standard Variational Auto-Encoder algorithm defines a bijective transformation $\epsilon \mapsto z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ from the auxiliary variable $\epsilon \sim \mathcal{N}(0, I)$, with a Jacobian determinant of $\prod_i \sigma_{\phi,i}(x)$. Therefore, $\log q_\phi(z \mid x) = \log p(\epsilon) - \sum_i \log \sigma_{\phi,i}(x)$, which checks with the fact that $q_\phi(z \mid x) = \mathcal{N}\big(z; \mu_\phi(x), \mathrm{diag}(\sigma_\phi(x)^2)\big)$. Likewise, $\log p(\epsilon) = \log q_\phi(z \mid x) + \sum_i \log \sigma_{\phi,i}(x)$.
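A quick numerical sanity check of this identity, with hypothetical toy values for $\mu_\phi(x)$ and $\sigma_\phi(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, std):
    # Log-density of a diagonal Gaussian, summed over dimensions.
    return float(np.sum(-0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2))

# Hypothetical posterior parameters mu_phi(x), sigma_phi(x) for some fixed data point x.
mu = np.array([0.3, -1.2])
sigma = np.array([0.5, 2.0])

eps = rng.standard_normal(2)   # auxiliary noise epsilon ~ N(0, I)
z = mu + sigma * eps           # reparametrized latent sample

# Change of variables: log q(z | x) = log p(eps) - sum_i log sigma_i(x) ...
lhs = log_normal(eps, 0.0, 1.0) - np.sum(np.log(sigma))
# ... which matches a direct evaluation of the Gaussian N(z; mu, diag(sigma^2)).
rhs = log_normal(z, mu, sigma)
assert np.isclose(lhs, rhs)
```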
I’ll define $q(x, \epsilon) = q(x)\,p(\epsilon)$ as the true data and auxiliary variable joint distribution and $p_\theta(x, \epsilon)$ as the model distribution. If we manipulate the log-likelihood of $(x, \epsilon)$ according to the model $p_\theta$, we obtain:

$$\begin{aligned}
\log p_\theta(x, \epsilon) &= \log p_\theta(x, z) + \sum_i \log \sigma_{\phi,i}(x) \\
&= \log p_\theta(x \mid z) + \log p(z) + \sum_i \log \sigma_{\phi,i}(x) \\
&= \log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x) + \log p(\epsilon),
\end{aligned}$$

where $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. Taking the expectation over $q(x, \epsilon)$ gives

$$\mathbb{E}_{q(x, \epsilon)}\big[\log p_\theta(x, \epsilon)\big] = \mathbb{E}_{q(x)}\big[\mathcal{L}(x; \theta, \phi)\big] + \mathbb{E}_{p(\epsilon)}\big[\log p(\epsilon)\big].$$
As $\mathbb{E}_{p(\epsilon)}[\log p(\epsilon)]$, the negative entropy of $\epsilon$, is constant, optimizing the Evidence Lower Bound is in this case equivalent to optimizing the expected joint log-likelihood $\mathbb{E}_{q(x, \epsilon)}[\log p_\theta(x, \epsilon)]$.
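This equivalence can be checked pointwise on a toy example: for any $(x, \epsilon)$, the model joint log-likelihood equals the single-sample ELBO integrand plus $\log p(\epsilon)$. A small sketch, assuming hypothetical posterior parameters and a unit-variance linear decoder:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal(v, mean, std):
    return float(np.sum(-0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2))

# Toy setup: a data point x, posterior parameters (mu, sigma), and a
# unit-variance linear Gaussian decoder with hypothetical weights W, b.
x = np.array([0.7, -0.4, 1.1])
mu, sigma = np.array([0.3, -1.2]), np.array([0.5, 2.0])
W, b = rng.normal(size=(3, 2)), np.zeros(3)

eps = rng.standard_normal(2)
z = mu + sigma * eps

log_px_z = log_normal(x, W @ z + b, 1.0)   # log p(x | z)
log_pz = log_normal(z, 0.0, 1.0)           # log p(z), standard Gaussian prior
log_qz_x = log_normal(z, mu, sigma)        # log q(z | x)
log_peps = log_normal(eps, 0.0, 1.0)       # log p(eps)

# Model joint over (x, eps): change of variables from z to eps adds sum_i log sigma_i(x).
log_joint = log_px_z + log_pz + np.sum(np.log(sigma))
# The single-sample ELBO integrand plus log p(eps) gives the same value.
assert np.isclose(log_joint, log_px_z + log_pz - log_qz_x + log_peps)
```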
This observation serves as an interesting a posteriori justification for the coupling layer architecture in NICE and Real NVP, but also highlights the presence of a “triangular” pattern (e.g. in the Jacobian of the mapping $(x, \epsilon) \mapsto (x, z)$) in several tractable probabilistic generative learning algorithms.
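For concreteness, here is a minimal sketch of a NICE-style additive coupling layer together with a finite-difference check that its Jacobian is block triangular (the weights of the coupling function are hypothetical toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(2, 2))  # hypothetical weights of the coupling function m

def coupling(x):
    """NICE-style additive coupling layer: y1 = x1, y2 = x2 + m(x1).
    Its Jacobian is block lower triangular with unit diagonal, so the
    log-determinant is zero and the density remains tractable."""
    x1, x2 = x[:2], x[2:]
    return np.concatenate([x1, x2 + np.tanh(W @ x1)])

x = rng.normal(size=4)
delta = 1e-6
# Finite-difference Jacobian: J[i, j] = d y_i / d x_j.
J = np.stack([(coupling(x + delta * e) - coupling(x)) / delta for e in np.eye(4)], axis=1)
assert np.allclose(J[:2, 2:], 0.0)  # y1 does not depend on x2: triangular pattern
```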
Triangular pattern confirmed