Simon Leglaive
CentraleSupélec, IETR (UMR CNRS 6164), France
Seminar @ LISTEN joint laboratory
Palaiseau (France) - January 12, 2023
Samir Sadok1
Laurent Girin2
Xavier Alameda-Pineda3
Renaud Séguier1
1 CentraleSupélec, IETR (UMR CNRS 6164), France
2 Univ. Grenoble Alpes, CNRS, Grenoble-INP, GIPSA-lab, France
3 Inria, Univ. Grenoble Alpes, CNRS, LJK, France
🚧 Ongoing work 🚧
The majority of successful applications of machine/deep learning rely on supervised learning.
The Cityscapes dataset
(Cordts et al., 2016)
More than 300 days of annotation! 😱
It is intractable to collect labels for every scenario and task.
M. Cordts et al., The Cityscapes dataset for semantic urban scene understanding, IEEE CVPR 2016
Zhao et al., ICNet for real-time semantic segmentation on high-resolution images, ECCV 2018
We need unsupervised methods that can learn to unveil the underlying structure of the data with few or no ground-truth labels.
GENESIS (Engelcke et al., 2020), a generative model of 3D scenes capable of both decomposing and generating scenes by capturing relationships between scene components. Image credits: (Engelcke et al., 2020).
Deep latent variable generative models have emerged as promising approaches.
M. Engelcke et al., GENESIS: Generative scene inference and sampling with object-centric latent representations, ICLR 2020.
High-dimensional data x∈Rd such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently.
From a generative perspective, this regularity suggests that there exists a smaller dimensional latent variable z∈Rℓ that generated x∈Rd, ℓ≪d.
Picture credits: wayhomestudio on Freepik.
Generative modeling consists in estimating the parameters θ so that pθ(x)≈p⋆(x) according to some measure of fit, for instance the Kullback-Leibler (KL) divergence.
When the model includes a deep neural network, we obtain a deep generative model.
Prior
p(z)=N(z;0,I)
Generative model
pθ(x∣z)=N(x;μθ(z),Σθ(z))
Inference model
qϕ(z∣x)=N(z;μϕ(x),Σϕ(x))
The inference model approximates the intractable exact posterior distribution
pθ(z∣x) = pθ(x∣z)p(z) / ∫ pθ(x∣z)p(z) dz
Σϕ(x) = diag{vϕ(x)},  Σθ(z) = diag{vθ(z)}
D.P. Kingma and M. Welling, Auto-encoding variational Bayes, ICLR 2014.
D.J. Rezende et al., Stochastic backpropagation and approximate inference in deep generative models, ICML 2014.
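As a rough illustration of how these Gaussian distributions are parametrized in practice (a minimal sketch with made-up layer sizes, not the architecture used in this work):

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Inference model q_phi(z|x) = N(z; mu_phi(x), diag{v_phi(x)})."""
    def __init__(self, d, l, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, l)       # mean mu_phi(x)
        self.logvar = nn.Linear(hidden, l)   # log of the diagonal variance v_phi(x)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class GaussianDecoder(nn.Module):
    """Generative model p_theta(x|z) = N(x; mu_theta(z), diag{v_theta(z)})."""
    def __init__(self, d, l, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(l, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, d)       # mean mu_theta(z)
        self.logvar = nn.Linear(hidden, d)   # log of the diagonal variance v_theta(z)

    def forward(self, z):
        h = self.net(z)
        return self.mu(h), self.logvar(h)
```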
The VAE parameters are estimated by maximizing the evidence lower bound (ELBO) (Neal and Hinton, 1999; Jordan et al., 1999), defined by: L(ϕ,θ) = Eqϕ(z∣x)[ln pθ(x∣z)] − DKL(qϕ(z∣x) ∥ p(z)), where the first term measures the reconstruction accuracy and the second term acts as a regularization.
The ELBO can also be decomposed as: L(ϕ,θ)=lnpθ(x)−DKL(qϕ(z∣x)∥pθ(z∣x)).
Generative model parameters estimation
max_θ L(ϕ,θ), with L(ϕ,θ) ≤ ln pθ(x)
Inference model parameters estimation
max_ϕ L(ϕ,θ) ⇔ min_ϕ DKL(qϕ(z∣x) ∥ pθ(z∣x))
R.M. Neal and G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in M. I. Jordan (Ed.), Learning in graphical models, 1999.
M.I. Jordan et al., An introduction to variational methods for graphical models, Machine Learning, 1999.
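In practice, the ELBO is estimated with the reparameterization trick and optimized by stochastic gradient ascent over both ϕ and θ. A minimal sketch, assuming the Gaussian encoder/decoder modules above and a single Monte Carlo sample:

```python
import torch

def elbo(x, encoder, decoder):
    """One-sample Monte Carlo estimate of the ELBO for a Gaussian VAE."""
    mu_z, logvar_z = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    mu_x, logvar_x = decoder(z)
    # Reconstruction accuracy: ln p_theta(x|z) for a diagonal Gaussian (up to constants)
    rec = -0.5 * (logvar_x + (x - mu_x) ** 2 / logvar_x.exp()).sum(dim=-1)
    # Regularization: closed-form KL( N(mu_z, diag v_z) || N(0, I) )
    kl = -0.5 * (1 + logvar_z - mu_z ** 2 - logvar_z.exp()).sum(dim=-1)
    return (rec - kl).mean()  # maximize this quantity w.r.t. phi and theta
```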
A trained VAE can be used for generation, transformation, and downstream tasks.
Ideally, the learned representation should be disentangled (Higgins et al., 2018), i.e., somehow easy to relate to independent and interpretable high-level characteristics of the data.
Supervised learning from disentangled representations has been found to be more sample-efficient, more robust, and better in terms of generalization (van Steenkiste et al., 2019).
Over the past few years, the VAE has been extended in many ways, including for processing dynamical or multimodal data.
In this talk, we will present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning.
Speaker's identity and global emotional state
↪ static, shared (AV)
Lip movements, phonemic information (part of)
↪ dynamical, shared (AV)
Other facial movements and head pose
↪ dynamical, modality-specific (V)
Pitch variations, phonemic information (part of)
↪ dynamical, modality-specific (A)
We seek to learn a multimodal dynamical VAE that disentangles these AV speech latent factors: dynamical and modality-specific, dynamical and audiovisual, static and audiovisual.
Video credits: K. Wang et al., MEAD: A large-scale audio-visual dataset for emotional talking-face generation, ECCV, 2020.
x(a)∈Rda×T
Observed dynamical audio data
x(v)∈Rdv×T
Observed dynamical visual data
w∈Rℓw
Latent static audiovisual data
z(av)∈Rℓav×T
Latent dynamical audiovisual data
z(a)∈Rℓa×T
Latent dynamical audio data
z(v)∈Rℓv×T
Latent dynamical visual data
Defining the generative model amounts to defining the joint distribution of all variables: pθ(x(a),x(v),z(av),z(a),z(v),w).
By structuring the dependencies between these variables we hope to learn the desired disentangled representation of audiovisual speech in an unsupervised manner.
The temporal model in MDVAE is largely inspired by the disentangled sequential autoencoder (DSAE) of Li and Mandt (2018).
MDVAE can be seen as a multimodal extension of DSAE.
Y. Li and S. Mandt, "Disentangled sequential autoencoder", ICML 2018.
The global probabilistic graphical model of MDVAE is defined by the following Bayesian network.
pθ(x(a),x(v),z(av),z(a),z(v),w)=pθ(x(a)∣z(av),z(a),w)pθ(x(v)∣z(av),z(v),w)pθ(z(av))pθ(z(a))pθ(z(v))pθ(w)
To complete the generative model, we also need to define the temporal dependencies for the sequential variables.
pθ(x(a)∣z(av),z(a),w) = ∏_{t=1}^{T} pθ(xt(a)∣zt(av),zt(a),w),  pθ(x(v)∣z(av),z(v),w) = ∏_{t=1}^{T} pθ(xt(v)∣zt(av),zt(v),w)
pθ(z(a)) = ∏_{t=1}^{T} pθ(zt(a)∣z1:t−1(a)),  pθ(z(av)) = ∏_{t=1}^{T} pθ(zt(av)∣z1:t−1(av)),  pθ(z(v)) = ∏_{t=1}^{T} pθ(zt(v)∣z1:t−1(v))
These distributions are Gaussians parametrized by neural networks and p(w)=N(w;0,I).
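For example, an autoregressive prior such as pθ(zt(av)∣z1:t−1(av)) can be implemented with a recurrent network that summarizes z1:t−1(av) in its hidden state. The LSTM-based parametrization below is only an assumption for the sketch, not necessarily the network used in MDVAE:

```python
import torch
import torch.nn as nn

class AutoregressivePrior(nn.Module):
    """p_theta(z_t | z_{1:t-1}): a Gaussian whose parameters depend on the past latent vectors."""
    def __init__(self, l, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(l, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, l)
        self.logvar = nn.Linear(hidden, l)

    def forward(self, z):  # z: (batch, T, l)
        # Shift by one step so that the prediction at time t only depends on z_{1:t-1}
        z_past = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        h, _ = self.rnn(z_past)
        return self.mu(h), self.logvar(h)  # parameters of p(z_t | z_{1:t-1}) for all t
```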
As in the standard VAE, we need to define an inference model that approximates the posterior:
qϕ(z(av),z(a),z(v),w∣x(a),x(v))≈pθ(z(av),z(a),z(v),w∣x(a),x(v)).
The exact posterior is intractable, but using the chain rule, the Bayesian network of MDVAE, and the D-separation principle (Geiger et al., 1990; Bishop, 2006), we can analyze the structure of the exact posterior, i.e., how the latent variables depend on each other and on the observed variables.
See (Girin et al., 2021) for an extensive discussion of D-separation in the context of DVAEs.
L. Girin et al., Dynamical variational autoencoders: A comprehensive review, Foundations and Trends in Machine Learning, 2021.
D. Geiger et al., Identifying independence in Bayesian networks, Networks, 1990.
C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
The inference model qϕ(z(av),z(a),z(v),w∣x(a),x(v)) decomposes as the product of four terms:
qϕ(w∣x(a),x(v))×qϕ(z(av)∣x(a),x(v),w)×qϕ(z(a)∣x(a),z(av),w)×qϕ(z(v)∣x(v),z(av),w)
To complete the inference model, we also need to look at the posterior temporal dependencies for the dynamical latent variables z(av), z(a) and z(v).
Using again the chain rule, the MDVAE Bayesian network, and the D-separation principle, we have:
qϕ(z(av)∣x(a),x(v),w) = ∏_{t=1}^{T} qϕ(zt(av)∣z1:t−1(av), xt:T(a), xt:T(v), w)
qϕ(z(a)∣x(a),z(av),w) = ∏_{t=1}^{T} qϕ(zt(a)∣z1:t−1(a), xt:T(a), zt(av), w)
qϕ(z(v)∣x(v),z(av),w) = ∏_{t=1}^{T} qϕ(zt(v)∣z1:t−1(v), xt:T(v), zt(av), w)
These distributions and qϕ(w∣x(a),x(v)) are Gaussians parametrized by neural networks.
Drawing the corresponding probabilistic graphical model at inference time would be difficult and not really informative.
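As an illustration of one term, qϕ(w∣x(a),x(v)) can be implemented by encoding each modality, pooling over time (since w is static), and fusing the two streams. The temporal averaging and the fusion by concatenation below are assumptions of the sketch, not necessarily the MDVAE architecture:

```python
import torch
import torch.nn as nn

class StaticAVEncoder(nn.Module):
    """q_phi(w | x^(a), x^(v)): Gaussian posterior over the static audiovisual variable w."""
    def __init__(self, da, dv, lw, hidden=256):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(da, hidden), nn.Tanh())
        self.enc_v = nn.Sequential(nn.Linear(dv, hidden), nn.Tanh())
        self.mu = nn.Linear(2 * hidden, lw)
        self.logvar = nn.Linear(2 * hidden, lw)

    def forward(self, xa, xv):  # xa: (batch, T, da), xv: (batch, T, dv)
        # w is static, so the per-frame features are pooled over the whole sequence
        ha = self.enc_a(xa).mean(dim=1)
        hv = self.enc_v(xv).mean(dim=1)
        h = torch.cat([ha, hv], dim=-1)
        return self.mu(h), self.logvar(h)
```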
As in the standard VAE, learning the MDVAE generative and inference model parameters consists in maximizing the ELBO
L(ϕ,θ) = Eqϕ(z(av),z(a),z(v),w∣x(a),x(v))[ln pθ(x(a),x(v)∣z(av),z(a),z(v),w)] − DKL(qϕ(z(av),z(a),z(v),w∣x(a),x(v)) ∥ pθ(z(av),z(a),z(v),w)), where the first term is the reconstruction accuracy and the second is the regularization.
Developing this expression is a bit more complicated than with the standard VAE, but there is no fundamental difficulty.
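For instance, applying the chain rule of the KL divergence to the factorized inference model gives a sum of per-variable terms (a sketch of the first level of the development, written here in LaTeX; the per-time-step development of each sequential term is omitted):

```latex
D_{\mathrm{KL}}(q_\phi \,\|\, p_\theta)
 = D_{\mathrm{KL}}\!\left(q_\phi(w \mid x^{(a)}, x^{(v)}) \,\|\, p(w)\right)
 + \mathbb{E}_{q_\phi(w \mid x^{(a)}, x^{(v)})}\!\left[
     D_{\mathrm{KL}}\!\left(q_\phi(z^{(av)} \mid x^{(a)}, x^{(v)}, w) \,\|\, p_\theta(z^{(av)})\right)\right]
 + \mathbb{E}_{q_\phi(w, z^{(av)} \mid x^{(a)}, x^{(v)})}\!\left[
     D_{\mathrm{KL}}\!\left(q_\phi(z^{(a)} \mid x^{(a)}, z^{(av)}, w) \,\|\, p_\theta(z^{(a)})\right)
   + D_{\mathrm{KL}}\!\left(q_\phi(z^{(v)} \mid x^{(v)}, z^{(av)}, w) \,\|\, p_\theta(z^{(v)})\right)\right]
```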
Standard VAEs tend to reconstruct blurred outputs, which is particularly true for image data.
The vector-quantized VAE (VQ-VAE) (van den Oord et al., 2017) learns a discrete latent representation to overcome this limitation.
Before being fed to the decoder, the continuous latent vector is quantized using a discrete codebook that is jointly learned with the network architecture.
A. van den Oord et al., Neural discrete representation learning, NeurIPS 2017.
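A minimal sketch of the quantization step (nearest-neighbor lookup in a learned codebook with the straight-through gradient estimator of van den Oord et al.); the codebook size and dimension below are placeholders:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Quantize continuous latent vectors using a learned discrete codebook."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):  # z_e: (batch, code_dim), continuous encoder output
        # Nearest codebook entry for each continuous latent vector
        dist = torch.cdist(z_e, self.codebook.weight)        # (batch, num_codes)
        idx = dist.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Straight-through estimator: gradients bypass the non-differentiable argmin
        z_q_st = z_e + (z_q - z_e).detach()
        # Codebook and commitment losses, added to the reconstruction loss during training
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() + 0.25 * ((z_e - z_q.detach()) ** 2).mean()
        return z_q_st, vq_loss, idx
```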
The first stage consists in learning a VQ-VAE independently for each modality and without temporal modeling.
Rather than learning the MDVAE on the raw audio and visual data, we will use the intermediate compressed representation of the VQ-VAEs before quantization.
In the second stage, we learn the MDVAE model "inside" the frozen VQ-VAE. This 2-stage training improves the reconstruction quality, but it also speeds up the training of the MDVAE model.
The disentanglement between static versus dynamical and modality-specific versus audiovisual latent speech factors occurs during this second training stage.
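Schematically, the second training stage looks like the sketch below; `mdvae.elbo` and the VQ-VAE `encoder` attribute are placeholder names for this illustration:

```python
import torch

def train_mdvae_second_stage(mdvae, vqvae_a, vqvae_v, dataloader, optimizer):
    """Stage 2: freeze the per-modality VQ-VAEs and train MDVAE on their
    pre-quantization features (schematic sketch, placeholder APIs)."""
    for p in list(vqvae_a.parameters()) + list(vqvae_v.parameters()):
        p.requires_grad = False              # VQ-VAEs were trained in stage 1 and stay frozen
    for xa, xv in dataloader:                # synchronized audio / visual frame sequences
        with torch.no_grad():
            fa = vqvae_a.encoder(xa)         # intermediate representation, before quantization
            fv = vqvae_v.encoder(xv)
        loss = -mdvae.elbo(fa, fv)           # maximize the MDVAE ELBO on these features
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```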
We use about 30 hours of audiovisual emotional speech from the MEAD dataset
Preprocessing:
Image credits: (K. Wang et al., 2020)
K. Wang et al., MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation, ECCV, 2020
The first set of experiments consists in studying what characteristics of the audiovisual speech data are encoded in w, z(av), z(v), and z(a).
We will present qualitative results obtained by reconstructing an audiovisual speech sequence using some of the latent variables from another sequence.
We transfer z(av) from the central sequence in red to the surrounding sequences.
Lip and jaw movements are transferred.
We transfer z(v) from the central sequence in red to the surrounding sequences.
Head and eyelid movements are transferred.
We transfer z(av) and z(v) from the central sequence in red to the surrounding sequences. The identity and global emotional state are preserved because w is unaltered.
Interpolation of the static audiovisual latent variable w
pθ(x(v)∣z(av),z(v),w) = ∏_{t=1}^{T} pθ(xt(v)∣zt(av),zt(v),w̃t), where w̃t = αt w + (1−αt) w′ and αt = (T−t)/(T−1).
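In code, the interpolation amounts to a per-frame convex combination of the two static latent vectors; `decode_frame` stands for the visual decoder and is a placeholder for this sketch:

```python
def interpolate_w(w, w_prime, z_av, z_v, decode_frame):
    """Decode a video while linearly morphing the static latent variable from w to w'."""
    T = len(z_av)
    frames = []
    for t in range(T):
        alpha = (T - 1 - t) / (T - 1)        # alpha = 1 at the first frame, 0 at the last
        w_tilde = alpha * w + (1 - alpha) * w_prime
        frames.append(decode_frame(z_av[t], z_v[t], w_tilde))
    return frames
```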
The qualitative analysis confirmed that:
These conclusions are confirmed quantitatively by measuring the impact of swapping latent variables on the action units (not presented today).
The task is to denoise the visual modality.
We compare MDVAE with two unimodal baselines trained on the visual modality only:
Denoising is done by simply encoding and decoding the corrupted (audio)visual speech sequence.
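In other words, no denoising-specific objective is involved; schematically (with a placeholder MDVAE encode/decode API):

```python
def denoise(mdvae, xa_corrupted, xv_corrupted):
    """Denoising = plain encoding/decoding of the corrupted sequence (placeholder API)."""
    w, z_av, z_a, z_v = mdvae.encode(xa_corrupted, xv_corrupted)
    return mdvae.decode(w, z_av, z_a, z_v)   # reconstructed (denoised) audio and video
```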
Y. Li and S. Mandt, "Disentangled sequential autoencoder", ICML 2018.
The performance gap between MDVAE and the unimodal baselines is larger for the corruption of the mouth region.
This is because MDVAE exploits the audio modality.
The qualitative analysis of the latent representations learned by MDVAE suggests that the static audiovisual latent variable w encodes the speaker's emotion.
We propose to use the mean vector of the Gaussian inference model qϕ(w∣x(a),x(v)) as the input of a multinomial logistic regression model trained for emotion classification on the MEAD dataset (8 classes).
The mean vector is simply obtained by a forward pass through the encoder network corresponding to qϕ(w∣x(a),x(v)).
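A sketch of the downstream classifier, assuming the posterior means have already been extracted for a labeled subset of MEAD; scikit-learn's logistic regression is one straightforward choice:

```python
from sklearn.linear_model import LogisticRegression

def fit_emotion_classifier(W_train, y_train, W_test, y_test):
    """W_*: posterior means of q_phi(w | x^(a), x^(v)); y_*: emotion labels (8 classes)."""
    clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression for multi-class y
    clf.fit(W_train, y_train)
    return clf, clf.score(W_test, y_test)    # fitted model and test accuracy
```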
We compare the performance of MDVAE with its unimodal counterparts:
| Method | F1 score (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| Audiovisual transformer (Chumachenko et al., 2022) | 88.65 | 89.37 | 87.96 |
| MDVAE w/o finetuning + multinomial logistic reg. | 82.86 | 81.98 | 83.76 |
| MDVAE w/ finetuning + multinomial logistic reg. | 89.62 | 89.55 | 89.71 |
Credits: (Chumachenko et al., 2022)
EfficientFace is pre-trained on AffectNet, the largest dataset of in-the-wild facial images labeled with emotions.
Finetuning MDVAE is unsupervised.
S.R Livingstone and F.A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English", PloS one, 2018.
K. Chumachenko et al., "Self-attention fusion for audiovisual emotion recognition with incomplete data", arXiv preprint arXiv:2201.11095, 2022.
We proposed the MDVAE to learn structured representations of multimodal and dynamical data.
Why?
Collecting labels for every scenario and task is intractable; we need alternatives to supervised learning.
How?
Deep generative modeling is a powerful unsupervised learning paradigm that can be applied to many different types of data, in particular multimodal and sequential data.
We can learn structured and interpretable representations by modeling probabilistic dependencies between observed and latent variables.
What?
Various applications in audiovisual speech processing, using one single model.
MDVAE effectively combines the audio and visual information in static (w) and dynamical (z(av)) latent variables:
Talking faces can be synthesized by transferring z(av) from one sequence to another, which preserves the speaker's identity, emotional state and visual-only facial movements.
The audio modality provides robustness with respect to corruption of the visual modality on the mouth region.
The static audiovisual latent variable w can be used for emotion recognition with few labeled data, and with much better accuracy compared with unimodal baselines.
MDVAE is also competitive with a state-of-the-art method based on audiovisual transformers.
Extensions and applications of MDVAE include:
M. Sadeghi et al., "Audio-visual speech enhancement using conditional variational auto-encoders", IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020
A. Caillon and P. Esling, "RAVE: A variational autoencoder for fast and high-quality neural audio synthesis", arXiv preprint arXiv:2111.05011, 2021