Mixtures from Depth

One of the predominant approaches to deep generative modelling has been the generator network paradigm, where a deep neural network transforms a simple base distribution to approximate a target distribution. This family of methods includes generative adversarial networks, variational autoencoders, normalizing flows, and diffusion models. The go-to ways to increase the capacity of these models are to increase the width or depth of the network, or to apply other architectural advances to the generator.

Another way to increase the capacity of a generative model is mixture modelling. Mixture models build a more expressive probability density by using multiple instances of a simpler model (the mixture components) and combining them through a simple convex sum.

Primer on Mixture Models

The idea behind mixture models is to combine multiple instances of simpler families of probability distributions to build a more expressive distribution. More formally, these instances \(p_{X \mid K}^{(\theta)}\) (mixture components) are indexed by a discrete variable \(\textcolor{#8c2323}{K} \in \{1, \dots, \lvert K\rvert\}\) and the resulting distribution density is expressed as a convex sum of the component distribution densities,

\[p_{X}^{(\theta)}(x) = \sum_{\textcolor{#8c2323}{k}=1}^{\lvert K\rvert}{p_{\textcolor{#8c2323}{K}}^{(\theta)}(\textcolor{#8c2323}{k}) p_{X \mid \textcolor{#8c2323}{K}}^{(\theta)}(x \mid \textcolor{#8c2323}{k})},\]

where the convex coefficients \(p_{K}^{(\theta)}(k)\) (called mixing weights) are non-negative and sum to \(1\):

\[\sum_{k=1}^{\lvert K\rvert}{p_{K}^{(\theta)}(k)} = 1.\]

The generative process goes as follows: sample the mixture component index \(k\), then sample from the associated component distribution, i.e.,

\[\begin{aligned} k &\sim p_{K}^{(\theta)} \\ x &\sim p_{X \mid K}^{(\theta)}(\cdot \mid k). \end{aligned}\]
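
To make this two-step generative process concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian mixture; the mixing weights, means, and scales are made up purely for illustration, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D Gaussian mixture: mixing weights, component means and scales
# (all values are made up for the example).
weights = np.array([0.2, 0.5, 0.3])   # p_K(k): non-negative, sums to 1
means = np.array([-2.0, 0.0, 3.0])
scales = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    """Two-step generative process: k ~ p_K, then x ~ p_{X|K}(. | k)."""
    k = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(loc=means[k], scale=scales[k])

def mixture_density(x):
    """p_X(x) = sum_k p_K(k) p_{X|K}(x | k)."""
    x = np.asarray(x, dtype=float)[..., None]
    comp = np.exp(-0.5 * ((x - means) / scales) ** 2) / (scales * np.sqrt(2.0 * np.pi))
    return comp @ weights

x = sample_mixture(5)
print(x, mixture_density(x))
```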

When naively applied to these generator network architectures, mixture models introduce a difficult trade-off between network capacity and the number of mixture components given a fixed computation and memory budget. On the one hand, each instance of the deep generative model should be expressive enough to approximate the target distribution; on the other hand, the computation required for one model is multiplied by the number of mixture components. At one end of this trade-off, practitioners often opt for more but simpler mixture components, e.g., Gaussians, as part of a larger deep generative model.

An underexplored trick could be used in conjunction with several types of deep generative models, e.g., variational autoencoders, normalizing flows, and diffusion models, to increase their capacity with little overhead in computation or memory. This trick relies on one key principle of deep models: the reuse of previous computation.

Primer on Flow-Based Models

I’ll explain the method by applying it to one family of deep generative models I’m familiar with: normalizing flows. First, let’s go through a short, high-level description of flow-based generative models (for a more comprehensive description, read any of these reviews, blogposts, or course notes, or watch these videos).

Flow-based generative models rely on an expressive, parametrized, invertible function (flow) \(f^{(\theta)}\) with a tractable Jacobian determinant \(\left\lvert\frac{\partial f^{(\theta)}}{\partial x}\right\rvert\). Given a simple (and often standard) base distribution \(p_Z\), we can compute the density of the resulting deep generative model, and therefore approximately maximize the log-likelihood, using the change of variables formula:

\[p_{X}^{(\theta)}(x) = p_Z\big(f^{(\theta)}(x)\big) \left\lvert\frac{\partial f^{(\theta)}}{\partial x}(x)\right\rvert.\]

We can then generate from this model using the inverse \(\big(f^{(\theta)}\big)^{-1}\) of this invertible function \(f^{(\theta)}\) as follows:

\[\begin{aligned} z &\sim p_Z \\ x &= \big(f^{(\theta)}\big)^{-1}(z). \end{aligned}\]
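
As a toy illustration, here is a minimal NumPy sketch of a flow-based model using a single elementwise affine map \(z = f^{(\theta)}(x) = s \odot x + t\): the change of variables formula gives the log-density, and the inverse map gives the sampler. The class and function names (AffineFlow, log_standard_normal) are hypothetical and chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineFlow:
    """Elementwise affine flow z = f(x) = s * x + t: a toy invertible map whose
    Jacobian determinant is simply prod_i |s_i|."""

    def __init__(self, s, t):
        self.s, self.t = np.asarray(s, dtype=float), np.asarray(t, dtype=float)

    def forward(self, x):
        # Returns z = f(x) and log |det df/dx|.
        return self.s * x + self.t, np.sum(np.log(np.abs(self.s)))

    def inverse(self, z):
        # x = f^{-1}(z).
        return (z - self.t) / self.s

def log_standard_normal(z):
    # Log-density of the standard Gaussian base distribution p_Z.
    return np.sum(-0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi), axis=-1)

flow = AffineFlow(s=[2.0, 0.5], t=[1.0, -1.0])

# Density: log p_X(x) = log p_Z(f(x)) + log |det df/dx|.
x = rng.normal(size=(4, 2))
z, log_det = flow.forward(x)
log_px = log_standard_normal(z) + log_det

# Sampling: z ~ p_Z, then x = f^{-1}(z).
samples = flow.inverse(rng.normal(size=(4, 2)))
```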

The Trick

Often, this flow \(f^{(\theta)}\) is a composition of simpler invertible components \(\big(f_{l}^{(\theta)}\big)_{l \leq \lvert L\rvert}\) with \(\lvert L\rvert\) being the number of invertible layers:

\[f^{(\theta)} = f_{\lvert L\rvert}^{(\theta)} \circ f_{\lvert L\rvert-1}^{(\theta)} \circ \dots \circ f_1^{(\theta)}.\]

We adopt the following notation,

\[\begin{aligned} f_{\leq l}^{(\theta)} &= f_l^{(\theta)} \circ f_{l-1}^{(\theta)} \circ \dots \circ f_1^{(\theta)} \\ z_l &= f_{\leq l}^{(\theta)}(x) = f_l^{(\theta)}(z_{l-1}), \end{aligned}\]

and recall that the inverse of a composition of functions is the composition of their inverses in reverse order,

\[\big(f_{\leq l}^{(\theta)}\big)^{-1} = \big(f_1^{(\theta)}\big)^{-1} \circ \dots \circ \big(f_l^{(\theta)}\big)^{-1},\]

and that the Jacobian determinant of a composition of functions is the product of their respective Jacobian determinants,

\[\left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert = \left\lvert\frac{\partial f_{l}^{(\theta)}}{\partial z_{l-1}}(z_{l-1})\right\rvert \dots \left\lvert\frac{\partial f_{1}^{(\theta)}}{\partial x}(x)\right\rvert.\]

Therefore, each partial composition \(f_{\leq l}^{(\theta)}\) can stand on its own as a flow-based model. Interestingly, there is shared computation between \(f_{\leq l}^{(\theta)}\) and \(f_{\leq l'}^{(\theta)}\): if \(l < l'\), then most of the computation used to evaluate the log-likelihood under \(f_{\leq l}^{(\theta)}\) is reused for \(f_{\leq l'}^{(\theta)}\).
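
Concretely, continuing the hypothetical NumPy sketch above (reusing the AffineFlow class), a single forward pass through the composition yields every partial composition \(z_l = f_{\leq l}^{(\theta)}(x)\) together with the cumulative log-Jacobian-determinant of that prefix:

```python
# Continuing the sketch above: a composition of AffineFlow layers. One forward
# pass produces every intermediate z_l = f_{<=l}(x) and the cumulative
# log |det d f_{<=l} / dx| of each prefix, at essentially no extra cost.
layers = [AffineFlow(s=rng.uniform(0.5, 2.0, size=2), t=rng.normal(size=2))
          for _ in range(4)]

def forward_all_prefixes(x):
    zs, log_dets = [], []
    z, total_log_det = x, 0.0
    for layer in layers:
        z, log_det = layer.forward(z)
        total_log_det = total_log_det + log_det
        zs.append(z)                    # z_l = f_{<=l}(x)
        log_dets.append(total_log_det)  # log |det d f_{<=l} / dx|
    return zs, log_dets
```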

With little overhead, we obtain the following tied-parameter mixture model:

\[\begin{aligned} p_{X}^{(\theta)}(x) &= \sum_{l=1}^{\lvert L\rvert}{p_{L}^{(\theta)}(l)~p_Z\big(f_{\leq l}^{(\theta)}(x)\big) \left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert}. \end{aligned}\]
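
Using the prefixes computed above, this mixture log-density can be evaluated from that single forward pass. A sketch, with uniform mixing weights over depth chosen arbitrarily for illustration:

```python
# Mixing weights over depth p_L(l); uniform here, purely for illustration.
mixing_weights = np.full(len(layers), 1.0 / len(layers))

def mixture_from_depth_log_density(x):
    """log p_X(x) = log sum_l p_L(l) p_Z(f_{<=l}(x)) |det d f_{<=l}/dx|,
    evaluated from a single forward pass through the full flow."""
    zs, log_dets = forward_all_prefixes(x)
    per_level = np.stack([log_standard_normal(z) + log_det
                          for z, log_det in zip(zs, log_dets)], axis=-1)
    # Log-sum-exp over depths for numerical stability.
    m = per_level.max(axis=-1, keepdims=True)
    return m.squeeze(-1) + np.log(np.sum(np.exp(per_level - m) * mixing_weights,
                                         axis=-1))
```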

Since the base distribution factor appears at every level, the associated loss is very reminiscent of deep supervision of neural nets and of stochastic depth training.

Not only do we gain a potentially strictly more expressive family of distributions without compromise, but the sampling process becomes

\[\begin{aligned} z &\sim p_Z \\ l &\sim p_{L}^{(\theta)} \\ x &= \big(f_{\leq l}^{(\theta)}\big)^{-1}(z), \end{aligned}\]

which is potentially faster than sampling from the flow \(f^{(\theta)}\).
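
A corresponding sampler (again continuing the hypothetical sketch above) draws a depth \(l\) and then inverts only the first \(l\) layers, so shallower components are cheaper to sample from:

```python
def sample_mixture_from_depth(n):
    """z ~ p_Z, l ~ p_L, then x = f_{<=l}^{-1}(z): shallower depths invert
    fewer layers, hence the potential speed-up over the full flow."""
    z = rng.normal(size=(n, 2))
    depths = rng.choice(len(layers), size=n, p=mixing_weights)  # 0-based index of l
    xs = np.empty_like(z)
    for i in range(n):
        x = z[i]
        # (f_{<=l})^{-1} = f_1^{-1} o ... o f_l^{-1}: apply inverses in reverse order.
        for layer in reversed(layers[: depths[i] + 1]):
            x = layer.inverse(x)
        xs[i] = x
    return xs
```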

Gating Network

As described in “Mixture Density Networks”, one can make the mixture weights \(p_{L}^{(\theta)}\) a function of another variable. In particular, the observation made in “Locally-Connected Transformations for Deep GMMs” is that these mixture weights can be a function of \(z\) through a gating network \(p_{L \mid Z}^{(\theta)}\).

Applied to this method, this yields the density expression,

\[\begin{aligned} p_{X}^{(\theta)}(x) &= \sum_{l=1}^{\lvert L\rvert}{p_{L \textcolor{#8c2323}{\mid Z}}^{(\theta)}\big(l \textcolor{#8c2323}{\mid f_{\leq l}^{(\theta)}(x)}\big)~p_Z\big(f_{\leq l}^{(\theta)}(x)\big) \left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert}, \end{aligned}\]

and the sampling process,

\[\begin{aligned} z &\sim p_Z \\ l &\sim p_{L \textcolor{#8c2323}{\mid Z}}^{(\theta)}(\cdot \textcolor{#8c2323}{\mid z}) \\ x &= \big(f_{\leq l}^{(\theta)}\big)^{-1}(z), \end{aligned}\]

which means that, through a single loss function (the negative log-likelihood), the model can be trained to adapt its computation and potentially sample more efficiently.
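
As a sketch of how the gating network could slot into the running example, here is a hypothetical linear-softmax gate \(p_{L \mid Z}^{(\theta)}\), used both in the density (conditioned on \(z_l = f_{\leq l}^{(\theta)}(x)\)) and in the sampler (conditioned on the drawn \(z\)); the gate's parametrization is made up for illustration.

```python
# A hypothetical linear-softmax gating network p_{L|Z}(. | z).
W = 0.1 * rng.normal(size=(2, len(layers)))
b = np.zeros(len(layers))

def gating_probs(z):
    logits = z @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def gated_mixture_log_density(x):
    """log p_X(x) with weights p_{L|Z}(l | f_{<=l}(x)) instead of fixed p_L(l)."""
    zs, log_dets = forward_all_prefixes(x)
    per_level = np.stack(
        [np.log(gating_probs(z)[..., l]) + log_standard_normal(z) + log_det
         for l, (z, log_det) in enumerate(zip(zs, log_dets))], axis=-1)
    m = per_level.max(axis=-1, keepdims=True)
    return m.squeeze(-1) + np.log(np.sum(np.exp(per_level - m), axis=-1))

def sample_gated(n):
    """z ~ p_Z, l ~ p_{L|Z}(. | z), then x = f_{<=l}^{-1}(z)."""
    z = rng.normal(size=(n, 2))
    xs = np.empty_like(z)
    for i in range(n):
        depth = rng.choice(len(layers), p=gating_probs(z[i]))
        x = z[i]
        for layer in reversed(layers[: depth + 1]):
            x = layer.inverse(x)
        xs[i] = x
    return xs
```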

The gating network term can also be interpreted as a modifier to the base distribution \(p_Z\) for each component, changing it to \(\propto p_{L \mid Z}^{(\theta)}(l \mid \cdot)~p_Z\). This results in a more flexible parametrization of the base distribution, similar in spirit to the Learned Accept/Reject Sampling and Noise Contrastive Prior approaches; compared to these, the sampling process is less wasteful (no rejection step involved), albeit with a more constrained parametrization. Note that, as a result of this gating network,

\[p_{L, X}^{(\theta)}(l, x) = p_{L \mid Z}^{(\theta)} \big(l \mid f_{\leq l}^{(\theta)}(x)\big)~p_Z\big(f_{\leq l}^{(\theta)}(x)\big) \left\lvert\frac{\partial f_{\leq l}^{(\theta)}}{\partial x}(x)\right\rvert\]

cannot be easily decomposed into a product \(p_{L \mid X}^{(\theta)}(l \mid x) \cdot p_{X}^{(\theta)}(x)\) nor \(p_{X \mid L}^{(\theta)}(x \mid l) \cdot p_{L}^{(\theta)}(l)\).

Limits and Extensions

One could compose this method, i.e., use this mixtures-from-depth technique for \(p_Z\) as well. While this results in a multiplicative increase in the number of “modes”, like a deep mixture model, it also comes with the same computational complications. There are, however, alternatives to overcome this issue.

While this method can be extended to other tractable probabilistic models such as autoregressive models, extending it to models trained through an evidence lower bound (variational autoencoders and some diffusion models) is less straightforward, although possible through a modified evidence lower bound, e.g.,

\[\log\big(p(x)\big) \geq \sum_{l=1}^{\lvert L\rvert} q_{L \mid X}(l \mid x) \big(\mathcal{L}_Z(l) + \mathcal{L}_L(l)\big)\]

where,

\[\begin{align*} \mathcal{L}_Z(l) &= \int q_{Z \mid L, X}(z \mid l, x) \log\left(\frac{p_{X, Z_{<L} \mid L}(x, z_{<l} \mid l)p_{Z}(z_l)}{q_{Z \mid L, X}(z \mid l, x)}\right) dz,\\ \mathcal{L}_L(l) &= \int q_{Z \mid L, X}(z \mid l, x) \log\left(\frac{p_{L \mid Z_L}(l \mid z_l)}{q_{L \mid X}(l \mid x)}\right) dz. \end{align*}\]

Acknowledgement

I’d like to thank Kyle Kastner, David Warde-Farley, Erin Grant and Arthur Gretton for encouraging discussions leading to the writing of this blogpost.

Citation

For attribution in academic contexts, please cite this work as

Dinh, Laurent (2022) "Mixtures From Depth", Thesis by Blogpost.

BibTeX citation:

@misc{dinh_depth_2022,
  title={Mixtures From Depth},
  url={https://laurent-dinh.github.io/2022/01/22/mixture.html},
  journal={Thesis by Blogpost},
  author={Dinh, Laurent},
  year={2022},
  month={Jan}
}