## Variational Auto-Encoders Optimize a Joint

It is well known from the classical literature on variational inference that *Variational Auto-Encoders* optimize the *Evidence Lower Bound* (*ELBo*), a lower bound on the log-likelihood of the data. I will point out in this blog post that they also optimize a joint likelihood.

### Reminder on Variational Auto-Encoders

Given a data space $\mathcal{X}$ and a continuous latent space $\mathcal{Z}$, the standard Variational Auto-Encoder aims at learning a directed generative model $p_\theta(x, z) = p(z)\,p_\theta(x \mid z)$ defined by a Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$ and a generator network defining a distribution $p_\theta(x \mid z)$ on a data point $x$ conditioned on a latent variable $z$.
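
To make the directed structure concrete, here is a minimal numpy sketch of ancestral sampling from such a model; the `generator_mean` function is a hypothetical stand-in for the generator network, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the generator network (a real VAE uses a neural net).
W = rng.standard_normal((2, 4))

def generator_mean(z):
    return np.tanh(z @ W)

# Ancestral sampling from the directed model p(z) p(x | z):
z = rng.standard_normal(2)                            # z ~ N(0, I), the Gaussian prior
x = generator_mean(z) + 0.1 * rng.standard_normal(4)  # x ~ p(x | z), Gaussian observation model
print(x)
```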

Since optimizing $\log p_\theta(x)$ is in general intractable, their inventors chose to rely on an auxiliary distribution $q_\phi(z \mid x)$ called the *approximate posterior* to optimize the *ELBo*, a lower bound on the log-likelihood:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big] = \mathcal{L}(x; \theta, \phi).$$

Building a reasonable approximate posterior is called doing approximate inference. Instead of doing approximate inference at every gradient iteration, the standard Variational Auto-Encoder defines $q_\phi(z \mid x)$ as a Gaussian $\mathcal{N}\big(z; \mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big)$ through the functions $\mu_\phi$ and $\sigma_\phi$ used to amortize inference.

Since the expectation is over a parametrized distribution $q_\phi(z \mid x)$, its optimization would normally rely on a high-variance gradient estimator as derived in *REINFORCE*. However, the standard Variational Auto-Encoder is able to fully differentiate the cost function via the *Reparametrization Trick* by defining an auxiliary standard Gaussian random variable $\epsilon \sim \mathcal{N}(0, I)$ and, given a data point $x$, redefining the latent variable as $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$.
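
To make the trick concrete, here is a minimal numpy sketch of a single-sample *ELBo* estimate; the `encoder` and `decoder_logpdf` functions are toy stand-ins for the actual networks, and all values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # Returns the mean and standard deviation of q(z | x) (toy 2-d latent).
    mu = 0.5 * x[:2]
    sigma = 0.3 * np.ones(2)
    return mu, sigma

def decoder_logpdf(x, z):
    # log p(x | z) for a Gaussian decoder with unit variance (toy reconstruction).
    x_hat = np.concatenate([z, z])
    return -0.5 * np.sum((x - x_hat) ** 2) - 0.5 * x.size * np.log(2 * np.pi)

def elbo_estimate(x):
    mu, sigma = encoder(x)
    eps = rng.standard_normal(mu.shape)   # auxiliary noise, eps ~ N(0, I)
    z = mu + sigma * eps                  # reparametrization trick: z is differentiable in mu, sigma
    # Analytic KL between q(z | x) = N(mu, diag(sigma^2)) and the prior N(0, I).
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)
    return decoder_logpdf(x, z) - kl

x = rng.standard_normal(4)
print(elbo_estimate(x))
```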

For more information, please read the original papers: *Auto-Encoding Variational Bayes* and *Stochastic Backpropagation and Approximate Inference in Deep Generative Models*.

### Optimizing a joint

After some discussions following my talk at Twitter, I discovered that one under-appreciated fact about Variational Auto-Encoders is that they also optimize the joint log-likelihood $\log p_{\theta, \phi}(x, \epsilon)$ of the data $x$ and the auxiliary variable $\epsilon$.

Variational Auto-Encoders rely on the *Reparametrization Trick*, a straightforward application of the *Change of Variable formula*. According to this formula, if $z = g(\epsilon)$ with $\epsilon \sim p(\epsilon)$ and $g$ is bijective, then:

$$q(z) = p\big(g^{-1}(z)\big)\,\left|\det\left(\frac{\partial g}{\partial \epsilon}\right)\right|^{-1}.$$

In our case, given a data point $x$, the standard Variational Auto-Encoder algorithm defines a bijective transformation $z = g_\phi(x, \epsilon) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ from the auxiliary $\epsilon$ with a Jacobian determinant of $\prod_i \sigma_{\phi, i}(x)$. Therefore, $q_\phi(z \mid x) = p(\epsilon) \prod_i \sigma_{\phi, i}(x)^{-1}$, which checks with the fact that $q_\phi(z \mid x) = \mathcal{N}\big(z; \mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big)$. Likewise, $p(\epsilon) = q_\phi(z \mid x) \prod_i \sigma_{\phi, i}(x)$.
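
This identity is easy to verify numerically. Here is a small numpy/scipy sketch with arbitrary toy values standing in for $\mu_\phi(x)$ and $\sigma_\phi(x)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy encoder outputs for some fixed data point x (values are arbitrary).
mu = np.array([0.2, -1.0, 0.7])
sigma = np.array([0.5, 1.5, 0.1])

eps = rng.standard_normal(3)
z = mu + sigma * eps                                   # bijective in eps for fixed x

log_q_z = norm.logpdf(z, loc=mu, scale=sigma).sum()    # log q(z | x)
log_p_eps = norm.logpdf(eps).sum()                     # log p(eps)
log_det = np.log(sigma).sum()                          # log |det dz/deps|

# Change of variable: log q(z | x) = log p(eps) - log |det dz/deps|
print(np.allclose(log_q_z, log_p_eps - log_det))       # True
```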

I’ll define $q(x, \epsilon) = q_{\mathcal{D}}(x)\,p(\epsilon)$ as the true data and auxiliary variable joint distribution and $p_{\theta, \phi}(x, \epsilon) = p_\theta\big(x, g_\phi(x, \epsilon)\big) \prod_i \sigma_{\phi, i}(x)$ as the model distribution. If we manipulate the log-likelihood of $x$ according to the model $p_\theta$, we obtain:

$$\begin{aligned}
\log p_\theta(x) &\geq \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big] \\
&= \mathbb{E}_{p(\epsilon)}\Big[\log p_\theta\big(x, g_\phi(x, \epsilon)\big) + \sum_i \log \sigma_{\phi, i}(x) - \log p(\epsilon)\Big] \\
&= \mathbb{E}_{p(\epsilon)}\big[\log p_{\theta, \phi}(x, \epsilon)\big] + H\big(p(\epsilon)\big).
\end{aligned}$$

Therefore:

$$\mathbb{E}_{q_{\mathcal{D}}(x)}\big[\mathcal{L}(x; \theta, \phi)\big] = \mathbb{E}_{q(x, \epsilon)}\big[\log p_{\theta, \phi}(x, \epsilon)\big] + H\big(p(\epsilon)\big).$$

As $H\big(p(\epsilon)\big)$ is constant, optimizing the Evidence Lower Bound is in this case equivalent to optimizing the expected joint log-likelihood $\mathbb{E}_{q(x, \epsilon)}\big[\log p_{\theta, \phi}(x, \epsilon)\big]$.
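
As a sanity check, here is a minimal numpy/scipy sketch showing that a single-sample *ELBo* estimate and $\log p_{\theta, \phi}(x, \epsilon)$ differ exactly by $\log p(\epsilon)$, whose expectation is the constant $-H\big(p(\epsilon)\big)$; the prior, decoder, and encoder outputs below are all hypothetical toy choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Toy quantities for one data point x (all values hypothetical).
x = np.array([0.3, -0.8])
mu, sigma = np.array([0.1, 0.4]), np.array([0.6, 0.9])   # encoder outputs for x
eps = rng.standard_normal(2)
z = mu + sigma * eps                                      # reparametrized latent

def log_p_joint_xz(x, z):
    # log p(z) + log p(x | z): standard normal prior and a unit-variance
    # Gaussian decoder whose mean is simply z (toy generator).
    return norm.logpdf(z).sum() + norm.logpdf(x, loc=z).sum()

# Single-sample ELBo estimate: log p(x, z) - log q(z | x).
elbo = log_p_joint_xz(x, z) - norm.logpdf(z, loc=mu, scale=sigma).sum()

# Joint log-likelihood of (x, eps) under the model, via the change of variable.
log_p_joint_xeps = log_p_joint_xz(x, z) + np.log(sigma).sum()

# The two differ exactly by log p(eps), a term whose expectation is -H(p(eps)).
print(np.allclose(elbo, log_p_joint_xeps - norm.logpdf(eps).sum()))  # True
```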

This observation serves as an interesting *a posteriori* justification for the coupling layer architecture in NICE and Real NVP, but it also highlights the presence of a “triangular” pattern (e.g. in the Jacobian of the mapping $(x, \epsilon) \mapsto (x, z)$) in several tractable probabilistic generative learning algorithms.

*Triangular pattern confirmed*

### Acknowledgements

The motivation for this rewriting comes from discussions with Hugo Larochelle. This connection was also discussed with my PhD supervisor, Yoshua Bengio.