Mixtures from Depth

One of the predominant approaches to deep generative modelling has been the generator network paradigm, where a deep neural network transforms a simple base distribution to approximate a target distribution. This family of methods includes generative adversarial networks, variational autoencoders, normalizing flows, and diffusion models. The go-to ways to increase the capacity of these models are to increase the width or depth of the network, or to apply other architectural advances to the generator.

Another way to increase the capacity of a generative model is mixture modelling. Mixture models build a more expressive probability density by using multiple instances of a simpler model (the mixture components) and combining them through a simple convex sum.

Primer on Mixture Models

The idea behind mixture models is to combine multiple instances of simpler families of probability distributions to build a more expressive distribution. More formally, these instances \(p_{X \mid K}^{(\theta)}\) (mixture components) are indexed by a discrete variable \(\textcolor{#8c2323}{K} \in \{1, \dots, \lvert K\rvert\}\) and the resulting distribution density is expressed as a convex sum of the component distribution densities,

\[p_{X}^{(\theta)}(x) = \sum_{\textcolor{#8c2323}{k}=1}^{\lvert K\rvert}{p_{\textcolor{#8c2323}{K}}^{(\theta)}(\textcolor{#8c2323}{k}) p_{X \mid \textcolor{#8c2323}{K}}^{(\theta)}(x \mid \textcolor{#8c2323}{k})},\]

where the convex coefficients \(p_{K}^{(\theta)}(k)\) (called mixing weights) are non-negative and sum to \(1\):

\[\sum_{k=1}^{\lvert K\rvert}{p_{K}^{(\theta)}(k)} = 1.\]

The generative process goes as follows: sample the mixture component index \(k\), then sample from the associated component distribution, i.e.,

\[\begin{aligned} k &\sim p_{K}^{(\theta)} \\ x &\sim p_{X \mid K}^{(\theta)}(\cdot \mid k). \end{aligned}\]
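
To make this two-step generative process concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian mixture; the mixing weights, means, and scales are made up purely for illustration, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D Gaussian mixture: mixing weights, component means and scales
# (all values are made up for the example).
weights = np.array([0.2, 0.5, 0.3])   # p_K(k): non-negative, sums to 1
means = np.array([-2.0, 0.0, 3.0])
scales = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    """Two-step generative process: k ~ p_K, then x ~ p_{X|K}(. | k)."""
    k = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(loc=means[k], scale=scales[k])

def mixture_density(x):
    """p_X(x) = sum_k p_K(k) p_{X|K}(x | k)."""
    x = np.asarray(x, dtype=float)[..., None]
    comp = np.exp(-0.5 * ((x - means) / scales) ** 2) / (scales * np.sqrt(2.0 * np.pi))
    return comp @ weights

x = sample_mixture(5)
print(x, mixture_density(x))
```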

When naively applied to these generator network architectures, mixture models introduce a difficult trade-off between network capacity and the number of mixture components given a fixed computation and memory budget. On the one hand, each instance of the deep generative model should be expressive enough to approximate the target distribution; on the other hand, the computation required for one model is multiplied by the number of mixture components. At one end of this trade-off, practitioners often opt for more but simpler mixture components, e.g., Gaussians, as part of a larger deep generative model.

An underexplored trick could be used in conjunction with several types of deep generative models, e.g., variational autoencoders, normalizing flows, and diffusion models, to increase their capacity with little overhead in computation or memory. This trick relies on one key principle of deep models: the reuse of previous computation.

Primer on Flow-Based Models

I’ll explain the method by applying it to one family of deep generative models I’m familiar with: normalizing flows. First, let’s go through a short, high-level description of flow-based generative models (for a more comprehensive description, read any of these reviews, blogposts, or course notes, or watch these videos).

Flow-based generative models rely on an expressive, parametrized, invertible function (flow) \(f^{(\theta)}\) with a tractable Jacobian determinant \(\left\lvert\frac{\partial f^{(\theta)}}{\partial x}\right\rvert\). Given a simple (and often standard) base distribution \(p_Z\), we can compute the density of the resulting deep generative model, and therefore approximately maximize the log-likelihood, using the change of variables formula:

\[p_{X}^{(\theta)}(x) = p_Z\big(f^{(\theta)}(x)\big) \left\lvert\frac{\partial f^{(\theta)}}{\partial x}(x)\right\rvert.\]

We can then generate from this model using the inverse \(\big(f^{(\theta)}\big)^{-1}\) of this invertible function \(f^{(\theta)}\) as follows:

\[\begin{aligned} z &\sim p_Z \\ x &= \big(f^{(\theta)}\big)^{-1}(z). \end{aligned}\]
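
As a toy illustration, here is a minimal NumPy sketch of a flow-based model using a single elementwise affine map \(z = f^{(\theta)}(x) = s \odot x + t\): the change of variables formula gives the log-density, and the inverse map gives the sampler. The class and function names (AffineFlow, log_standard_normal) are hypothetical and chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineFlow:
    """Elementwise affine flow z = f(x) = s * x + t: a toy invertible map whose
    Jacobian determinant is simply prod_i |s_i|."""

    def __init__(self, s, t):
        self.s, self.t = np.asarray(s, dtype=float), np.asarray(t, dtype=float)

    def forward(self, x):
        # Returns z = f(x) and log |det df/dx|.
        return self.s * x + self.t, np.sum(np.log(np.abs(self.s)))

    def inverse(self, z):
        # x = f^{-1}(z).
        return (z - self.t) / self.s

def log_standard_normal(z):
    # Log-density of the standard Gaussian base distribution p_Z.
    return np.sum(-0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi), axis=-1)

flow = AffineFlow(s=[2.0, 0.5], t=[1.0, -1.0])

# Density: log p_X(x) = log p_Z(f(x)) + log |det df/dx|.
x = rng.normal(size=(4, 2))
z, log_det = flow.forward(x)
log_px = log_standard_normal(z) + log_det

# Sampling: z ~ p_Z, then x = f^{-1}(z).
samples = flow.inverse(rng.normal(size=(4, 2)))
```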

The Trick

Often, this flow \(f^{(\theta)}\) is a composition of simpler invertible components \(\big(f_{l}^{(\theta)}\big)_{l \leq \lvert L\rvert}\) with \(\lvert L\rvert\) being the number of invertible layers:

\[f^{(\theta)} = f_{\lvert L\rvert}^{(\theta)} \circ f_{\lvert L\rvert-1}^{(\theta)} \circ \dots \circ f_1^{(\theta)}.\]

We adopt the following notation,

\[\begin{aligned} f_{\leq l}^{(\theta)} &= f_l^{(\theta)} \circ f_{l-1}^{(\theta)} \circ \dots \circ f_1^{(\theta)} \\ z_l &= f_{\leq l}^{(\theta)}(x) = f_l^{(\theta)}(z_{l-1}), \end{aligned}\]

and recall that the inverse of a composition of functions is the composition of their inverses in reverse order,

\[\big(f_{\leq l}^{(\theta)}\big)^{-1} = \big(f_1^{(\theta)}\big)^{-1} \circ \dots \circ \big(f_l^{(\theta)}\big)^{-1},\]

and that the Jacobian determinant of a composition of functions is the product of their respective Jacobian determinants,

\[\left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert = \left\lvert\frac{\partial f_{l}^{(\theta)}}{\partial z_{l-1}}(z_{l-1})\right\rvert \dots \left\lvert\frac{\partial f_{1}^{(\theta)}}{\partial x}(x)\right\rvert.\]

Therefore, each partial composition \(f_{\leq l}^{(\theta)}\) can stand on its own as a flow-based model. Interestingly, there is shared computation between \(f_{\leq l}^{(\theta)}\) and \(f_{\leq l'}^{(\theta)}\): if \(l < l'\), then most of the computation used to evaluate the log-likelihood under \(f_{\leq l}^{(\theta)}\) is reused for \(f_{\leq l'}^{(\theta)}\).
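
Concretely, continuing the hypothetical NumPy sketch above (reusing the AffineFlow class), a single forward pass through the composition yields every partial composition \(z_l = f_{\leq l}^{(\theta)}(x)\) together with the cumulative log-Jacobian-determinant of that prefix:

```python
# Continuing the sketch above: a composition of AffineFlow layers. One forward
# pass produces every intermediate z_l = f_{<=l}(x) and the cumulative
# log |det d f_{<=l} / dx| of each prefix, at essentially no extra cost.
layers = [AffineFlow(s=rng.uniform(0.5, 2.0, size=2), t=rng.normal(size=2))
          for _ in range(4)]

def forward_all_prefixes(x):
    zs, log_dets = [], []
    z, total_log_det = x, 0.0
    for layer in layers:
        z, log_det = layer.forward(z)
        total_log_det = total_log_det + log_det
        zs.append(z)                    # z_l = f_{<=l}(x)
        log_dets.append(total_log_det)  # log |det d f_{<=l} / dx|
    return zs, log_dets
```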

With little overhead, we obtain the following tied-parameter mixture model:

\[\begin{aligned} p_{X}^{(\theta)}(x) &= \sum_{l=1}^{\lvert L\rvert}{p_{L}^{(\theta)}(l)~p_Z\big(f_{\leq l}^{(\theta)}(x)\big) \left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert}. \end{aligned}\]
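
Using the prefixes computed above, this mixture log-density can be evaluated from that single forward pass. A sketch, with uniform mixing weights over depth chosen arbitrarily for illustration:

```python
# Mixing weights over depth p_L(l); uniform here, purely for illustration.
mixing_weights = np.full(len(layers), 1.0 / len(layers))

def mixture_from_depth_log_density(x):
    """log p_X(x) = log sum_l p_L(l) p_Z(f_{<=l}(x)) |det d f_{<=l}/dx|,
    evaluated from a single forward pass through the full flow."""
    zs, log_dets = forward_all_prefixes(x)
    per_level = np.stack([log_standard_normal(z) + log_det
                          for z, log_det in zip(zs, log_dets)], axis=-1)
    # Log-sum-exp over depths for numerical stability.
    m = per_level.max(axis=-1, keepdims=True)
    return m.squeeze(-1) + np.log(np.sum(np.exp(per_level - m) * mixing_weights,
                                         axis=-1))
```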

Since the base distribution factor appears at every level, the associated loss is very reminiscent of deep supervision of neural nets and of stochastic depth training.

Not only do we gain a potentially strictly more expressive family of distributions without compromise, but the sampling process becomes

\[\begin{aligned} z &\sim p_Z \\ l &\sim p_{L}^{(\theta)} \\ x &= \big(f_{\leq l}^{(\theta)}\big)^{-1}(z), \end{aligned}\]

which is potentially faster than sampling from the flow \(f^{(\theta)}\).
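
A corresponding sampler (again continuing the hypothetical sketch above) draws a depth \(l\) and then inverts only the first \(l\) layers, so shallower components are cheaper to sample from:

```python
def sample_mixture_from_depth(n):
    """z ~ p_Z, l ~ p_L, then x = f_{<=l}^{-1}(z): shallower depths invert
    fewer layers, hence the potential speed-up over the full flow."""
    z = rng.normal(size=(n, 2))
    depths = rng.choice(len(layers), size=n, p=mixing_weights)  # 0-based index of l
    xs = np.empty_like(z)
    for i in range(n):
        x = z[i]
        # (f_{<=l})^{-1} = f_1^{-1} o ... o f_l^{-1}: apply inverses in reverse order.
        for layer in reversed(layers[: depths[i] + 1]):
            x = layer.inverse(x)
        xs[i] = x
    return xs
```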

Gating Network

As described in “Mixture Density Networks”, one can make the mixture weights \(p_{L}^{(\theta)}\) a function of another variable. In particular, the observation made in “Locally-Connected Transformations for Deep GMMs” is that these mixture weights can be a function of \(z\) through a gating network \(p_{L \mid Z}^{(\theta)}\).

Applied to this method, this yields the density expression,

\[\begin{aligned} p_{X}^{(\theta)}(x) &= \sum_{l=1}^{\lvert L\rvert}{p_{L \textcolor{#8c2323}{\mid Z}}^{(\theta)}\big(l \textcolor{#8c2323}{\mid f_{\leq l}^{(\theta)}(x)}\big)~p_Z\big(f_{\leq l}^{(\theta)}(x)\big) \left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert}, \end{aligned}\]

and the sampling process,

\[\begin{aligned} z &\sim p_Z \\ l &\sim p_{L \textcolor{#8c2323}{\mid Z}}^{(\theta)}(\cdot \textcolor{#8c2323}{\mid z}) \\ x &= \big(f_{\leq l}^{(\theta)}\big)^{-1}(z), \end{aligned}\]

which means that, through a single loss function (the negative log-likelihood), the model can be trained to adapt its computation and potentially sample more efficiently.
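
As a sketch of how the gating network could slot into the running example, here is a hypothetical linear-softmax gate \(p_{L \mid Z}^{(\theta)}\), used both in the density (conditioned on \(z_l = f_{\leq l}^{(\theta)}(x)\)) and in the sampler (conditioned on the drawn \(z\)); the gate's parametrization is made up for illustration.

```python
# A hypothetical linear-softmax gating network p_{L|Z}(. | z).
W = 0.1 * rng.normal(size=(2, len(layers)))
b = np.zeros(len(layers))

def gating_probs(z):
    logits = z @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def gated_mixture_log_density(x):
    """log p_X(x) with weights p_{L|Z}(l | f_{<=l}(x)) instead of fixed p_L(l)."""
    zs, log_dets = forward_all_prefixes(x)
    per_level = np.stack(
        [np.log(gating_probs(z)[..., l]) + log_standard_normal(z) + log_det
         for l, (z, log_det) in enumerate(zip(zs, log_dets))], axis=-1)
    m = per_level.max(axis=-1, keepdims=True)
    return m.squeeze(-1) + np.log(np.sum(np.exp(per_level - m), axis=-1))

def sample_gated(n):
    """z ~ p_Z, l ~ p_{L|Z}(. | z), then x = f_{<=l}^{-1}(z)."""
    z = rng.normal(size=(n, 2))
    xs = np.empty_like(z)
    for i in range(n):
        depth = rng.choice(len(layers), p=gating_probs(z[i]))
        x = z[i]
        for layer in reversed(layers[: depth + 1]):
            x = layer.inverse(x)
        xs[i] = x
    return xs
```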

The gating network term can also be interpreted as a modifier to the base distribution \(p_Z\) for each component, changing it to \(\propto p_{L \mid Z}^{(\theta)}(l \mid \cdot)~p_Z\). This results in a more flexible parametrization of the base distribution, similar in spirit to the Learned Accept/Reject Sampling and Noise Contrastive Prior approaches; compared to these, the sampling process is less wasteful (no rejection step involved), albeit with a more constrained parametrization. Note that, as a result of this gating network,

\[p_{L, X}^{(\theta)}(l, x) = p_{L \mid Z}^{(\theta)} \big(l \mid f_{\leq l}^{(\theta)}(x)\big)~p_Z\big(f_{\leq l}^{(\theta)}(x)\big) \left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert\]

cannot be easily decomposed into a product \(p_{L \mid X}^{(\theta)}(l \mid x) \cdot p_{X}^{(\theta)}(x)\) nor \(p_{X \mid L}^{(\theta)}(x \mid l) \cdot p_{L}^{(\theta)}(l)\).

Limits and Extensions

One could compose this method, i.e., use this mixtures-from-depth technique for \(p_Z\) as well. While this results in a multiplicative increase in the number of “modes”, like a deep mixture model, it also comes with the same computational complications. There are, however, alternatives to overcome this issue.

While this method can be extended to other tractable probabilistic models such as autoregressive models, extending it to models trained through an evidence lower bound (variational autoencoders and some diffusion models) is less straightforward, although possible through a modified evidence lower bound, e.g.,

\[\log\big(p(x)\big) \geq \sum_{l=1}^{\lvert L\rvert} q_{L \mid X}(l \mid x) \big(\mathcal{L}_Z(l) + \mathcal{L}_L(l)\big)\]

where,

\[\begin{align*} \mathcal{L}_Z(l) &= \int q_{Z \mid L, X}(z \mid l, x) \log\left(\frac{p_{X, Z_{<L} \mid L}(x, z_{<l} \mid l)p_{Z}(z_l)}{q_{Z \mid L, X}(z \mid l, x)}\right) dz,\\ \mathcal{L}_L(l) &= \int q_{Z \mid L, X}(z \mid l, x) \log\left(\frac{p_{L \mid Z_L}(l \mid z_l)}{q_{L \mid X}(l \mid x)}\right) dz. \end{align*}\]

Acknowledgement

I’d like to thank Kyle Kastner, David Warde-Farley, Erin Grant and Arthur Gretton for encouraging discussions leading to the writing of this blogpost.

Citation

For attribution in academic contexts, please cite this work as

Dinh, Laurent (2022) "Mixtures From Depth", Thesis by Blogpost.

BibTeX citation:

@misc{dinh_depth_2022,
  title={Mixtures From Depth},
  url={https://laurent-dinh.github.io/2022/01/22/mixture.html},
  journal={Thesis by Blogpost},
  author={Dinh, Laurent},
  year={2022},
  month={Jan}
}