class: middle, center $$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$ $$ \global\def\myxa#1{{\color{green}\mathbf{x}\_{#1}^{(a)}}} $$ $$ \global\def\myza#1{{\color{green}\mathbf{z}\_{#1}^{(a)}}} $$ $$ \global\def\myxv#1{{\color{purple}\mathbf{x}\_{#1}^{(v)}}} $$ $$ \global\def\myzv#1{{\color{purple}\mathbf{z}\_{#1}^{(v)}}} $$ $$ \global\def\myzav#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\myzds#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\mywav{{\color{brown}\mathbf{w}^{(av)}}} $$ $$ \global\def\myw{{\color{brown}\mathbf{w}}} $$ $$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$ $$ \global\def\myS#1{{\color{green}\mathbf{S}\_{#1}}} $$ $$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myu#1{\mathbf{u}\_{#1}} $$ $$ \global\def\mya#1{\mathbf{a}\_{#1}} $$ $$ \global\def\myv#1{\mathbf{v}\_{#1}} $$ $$ \global\def\mythetaz{\theta\_\myz{}} $$ $$ \global\def\mythetax{\theta\_\myx{}} $$ $$ \global\def\mythetas{\theta\_\mys{}} $$ $$ \global\def\mythetaa{\theta\_\mya{}} $$ $$ \global\def\bs#1{{\boldsymbol{#1}}} $$ $$ \global\def\diag{\text{diag}} $$ $$ \global\def\mbf{\mathbf} $$ $$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$ $$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$ $$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\neq{\mathrel{\char`≠}} $$ .vspace[ ] # Low-dimensional generative modeling of
multimodal sequential data .small-vspace[ ] ### Applications to audiovisual speech processing .small-vspace[ ] .center[Simon Leglaive] .small.center[CentraleSupélec, IETR (UMR CNRS 6164), France] .small-vspace[ ] .center.small[.underline[Collaborators]: Xavier Alameda-Pineda, Guénolé Fiche, Laurent Girin, Radu Horaud,
Francesc Moreno-Noguer, Samir Sadok, Renaud Séguier.] .center.width-7[] .grid[ .kol-1-6[ .left.width-120[] ] .kol-2-3[ .small.center[Universität Hamburg - September 11, 2025] ] .kol-1-6[ .right.width-65[]] ] --- class: middle count: false ## .bold[Outline]
1. Low-dimensional generative modeling...
- The variational autoencoder (VAE). - A rapid overview of 3 contributions around the VAE.
2. ... of multimodal sequential data
- The multimodal dynamical VAE (MDVAE). - Applications to audiovisual speech processing. .vspace[ ] .center.bold[Feel free to ask questions at any time!] --- class: middle ## Low-dimensional modeling of high-dimensional structured data .center.width-90[] **High-dimensional data** $\myx{} \in \mathbb{R}^d$ such as 3D human meshes, natural images, or speech signals exhibit some form of **structure**, preventing their dimensions from varying independently. - From a **geometric perspective**, this regularity suggests that the high-dimensional data actually lie on a much lower-dimensional manifold. - From a **generative perspective**, it suggests that there exists a lower-dimensional **latent variable** $\myz{} \in \mathbb{R}^\ell$ that generated $\myx{} \in \mathbb{R}^d$, $\ell \ll d$. .credit[Picture credits:
wayhomestudio
on Freepik. ] --- class: middle ## Latent-variable generative modeling .small-vspace[ ] .center.width-90[] - Generative modeling consists of estimating the parameters $\theta$ so that $p\_\theta(\myx{}) \approx p^\star(\myx{})$ according to some measure of fit, for instance the Kullback-Leibler (KL) divergence. - When the model includes a deep neural network, we obtain a **deep generative model**. --- class: middle ## The variational autoencoder (VAE) .tiny[(Kingma and Welling, 2014; Rezende et al., 2014)] .grid[ .kol-1-2[ - .underline[Prior]: $\qquad p(\myz{}) = \mathcal{N}(\myz{}; \mathbf{0}, \mathbf{I})$ - .underline[Generative model]: $ p\_\theta(\myx{} | \myz{}) = \mathcal{N}\left( \myx{}; \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{}) \right) $ .small[where] $\small \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{})$ .small[are the outputs of the **decoder**.] - .underline[Inference model]: $ q\_\phi(\myz{} | \myx{}) = \mathcal{N}\left( \myz{}; \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{}) \right) \\\\$ .small[where] $\small \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{})$ .small[are the outputs of the **encoder**.] ] .kol-1-2[ .vspace[ ] .center.width-90[] ] ] .small-nvspace[ ] .credit[ .vspace[ ] D.P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. ICLR.
D.J. Rezende, S. Mohamed, D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML. ] --- class: middle ## VAE objective The VAE parameters are estimated by maximizing the **evidence lower bound** (ELBO) .small[(Neal and Hinton, 1999; Jordan et al., 1999)] defined by: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \underbrace{\mathbb{E}\_{q\_\phi(\myz{} | \myx{})} [\ln p\_\theta(\myx{} | \myz{})]}\_{\text{reconstruction accuracy}} - \underbrace{D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p(\myz{}))}\_{\text{regularization}}. \end{aligned} $$ The ELBO can also be decomposed as: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \ln p\_\theta(\myx{}) - D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{})). \end{aligned} $$ .alert[
.left-column[ .underline[Generative model parameters estimation] $$ \underset{\theta}{\max}\, \Big\\{ \mathcal{L}(\phi, \theta) \le \ln p\_\theta(\myx{}) \Big\\} $$ ] .right-column[ .underline[Inference model parameters estimation] $$ \underset{\phi}{\max}\, \mathcal{L}(\phi, \theta) \,\,\Leftrightarrow\,\, \underset{\phi}{\min}\, D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{}))$$ ] .reset-column[ ]
] .credit[ R.M. Neal and G.E. Hinton (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), .italic[Learning in graphical models].
M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning. ] --- ## One model, many tasks We can use a pre-trained VAE for several different tasks. .center.width-80[] --- ## Generation task How to generate new data samples? .center.width-80[] --- ## Decoder-based downstream task How to infer the latent variables given auxiliary signals? .center.width-80[] --- ## Transformation task How to disentangle, interpret, and control the latent representation? .center.width-80[] .credit[I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, A. Lerchner (2018). Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.] --- ## Encoder-based downstream task How to build sample-efficient, robust, and generalizable information extraction systems? .center.width-80[] .credit[S. van Steenkiste, F. Locatello, J. Schmidhuber, O. Bachem. Are disentangled representations helpful for abstract visual reasoning? NeurIPS, 2019.] --- class: middle, center 3 examples of transformation and decoder-based downstream tasks. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-70[] - We want to analyze the structure of the latent space of a VAE pretrained on short-term speech power spectra. .credit[ S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2023). Learning and controlling the source-filter representation of speech with a variational autoencoder. Speech Communication. ] --- count:false ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - We take the perspective of the **source-filter model of speech production** proposed by Fant .small[(1970)]. - The production of speech results from the interaction of a source with a time-varying filter. - Assumption: We can control the source and the filter independently of each other. - Important parameters are the **fundamental frequency** and the **formant frequencies**. .credit[ S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2023). Learning and controlling the source-filter representation of speech with a variational autoencoder. Speech Communication. G. Fant (1970). Acoustic theory of speech production. 2. Walter de Gruyter. ] --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-60[] - We want to study how the fundamental and formant frequencies are encoded within the VAE latent space. - We generate short synthetic speech signals with a single attribute varying at a time. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - Each sequence of short-term speech power spectra is encoded by the pretrained VAE encoder. - We obtain a sequence of 16-dimensional latent vectors. - Intuition: Because only one attribute varies, we expect the latent vectors to lie on a lower-dimensional manifold of the original latent space. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - Indeed, we can learn an "invertible" mapping between the VAE latent space and a linear subspace of lower dimension while retaining most of the data variance (see the sketch below).
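A minimal, self-contained sketch of this subspace analysis (illustrative, not the paper's code): PCA is fitted to the latent vectors of a trajectory along which a single attribute varies, and the leading principal directions give the attribute-specific subspace. The random matrix below stands in for the actual encoder outputs.

```python
# Minimal sketch of the latent-subspace analysis (illustrative, not the paper's code).
# Z stands in for the 16-dimensional VAE latent vectors of a trajectory where a
# single attribute (e.g., the fundamental frequency) varies.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))                # replace with the encoder means of the trajectory

pca = PCA().fit(Z)
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.99)) + 1   # smallest dimension retaining 99% of the variance
U = pca.components_[:k]                       # (k, 16) orthonormal basis of the latent subspace

# "Invertible" mapping: project onto the subspace and map back to the latent space.
coords = (Z - pca.mean_) @ U.T
Z_back = coords @ U + pca.mean_
```

Repeating this analysis for each attribute yields one basis per attribute; the near-zero dot products between these bases are what the orthogonality claim on the next slide refers to.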
--- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - The subspaces for different attributes are orthogonal to each other, as can be verified by simply computing the dot product between the vectors that span each subspace. - The latent representation learned by the VAE is consistent with the source-filter model. - It suggests that we could manipulate one attribute in its latent subspace without affecting the others, similarly to how humans produce speech according to the source-filter model. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - Using only a few seconds of artificial and automatically labeled speech data, we learned to regress from the speech attribute to a specific position in the corresponding subspace. - This allows us to perform disentangled speech manipulations by applying a simple translation within the latent subspaces. --- class: middle ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-80[] .alert[We can generate / transform speech spectrograms given input trajectories of the fundamental and formant frequencies.] --- ## Decoder-based downstream task example ### Unsupervised speech enhancement .small[(Bando et al., 2018; Leglaive et al., 2018)] .center.width-80[] .medium[ - Given a VAE pretrained on clean speech, the speech enhancement task consists of **inferring the VAE latent variables given a noisy speech recording**. - Approximate posterior inference using Markov chain Monte Carlo .small[(Bando et al., 2018; Leglaive et al., 2018)] or variational inference .small[(Leglaive et al., 2020)]. - Test-time iterative optimization: good in terms of adaptation, but slow and with limited performance compared to supervised methods on in-domain data. ] .credit[ Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara (2018). Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization. IEEE ICASSP.
S. Leglaive, L. Girin, R. Horaud (2018). A variance modeling framework based on variational autoencoders for speech enhancement. IEEE MLSP.
S. Leglaive, X. Alameda-Pineda, L. Girin, R. Horaud (2020). A Recurrent Variational Autoencoder for Speech Enhancement. IEEE ICASSP. ] --- .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .small-nvspace[ ] .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .credit[ R. Chao, W.-H. Cheng, M. La Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, Y. Tsao (2024). An investigation of incorporating Mamba for speech enhancement. IEEE SLT.
https://huggingface.co/spaces/rc19477/Speech_Enhancement_Mamba ] --- ## Decoder-based downstream task example ### Masked generative human mesh recovery (HMR) .small[(Fiche et al., 2025)] .grid[ .kol-1-2[ .width-100[] ] .kol-1-2[ .tiny-vspace[ ]
] ] .medium[ - **HMR is an ill-posed problem**: There exist infinitely many 3D human meshes that explain the 2D observation equally well. - We account for this ambiguity by designing the inference model as a **conditional masked generative model**. - We can obtain the **most likely output** by computing the argmax of the conditional distribution, or generate **multiple plausible outputs** by stochastic sampling. ] .credit[ G. Fiche, S. Leglaive, X. Alameda-Pineda, F. Moreno-Noguer (2025). MEGA: Masked generative autoencoder for human mesh recovery. IEEE/CVF CVPR. ] --- class: middle, center The variance of the predictions can be interpreted as a measure of **uncertainty**. .center.width-100[] --- class: center, middle Underlying these downstream tasks are fundamental questions concerning the **design of the VAE generative and inference models themselves**, particularly in terms of **probabilistic modeling** and neural network **architectures** that should be well aligned with the **structure of the data**. In the rest of this presentation we will focus on **multimodal and sequential data**. --- class: middle .grid[ .kol-2-3[ ## Multimodality - Different modalities provide complementary information. - Multimodality is useful for the **perception** and the **analysis of humans** using **signal processing** and **machine learning**: - The visual modality is **robust** to acoustic noise .small[(useful for speech enhancement, ASR, etc.)]; - The audio modality is **robust** to occlusion and lighting conditions .small[(useful for person identification, etc.)]; - Some phenomena or tasks are **ambiguous** when perceived or addressed through a single modality
.small[(e.g., emotional expressions, visemes, speech extraction, etc.)] .alert[Multimodality provides robustness and helps disambiguation.] ] .kol-1-3[.center.width-60[] ] ] .reset-column[ ] .credit.pull-right[Picture credits:
wayhomestudio
on Freepik.] --- ## VAEs for multimodal and sequential data .center.width-80[] - The VAE has been extended in many ways, including for processing multimodal **or** sequential data. - But a model for both multimodal and sequential data was lacking. .credit[ .vspace[ ] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, X. Alameda-Pineda (2021). Dynamical variational autoencoders: A comprehensive review. Foundations and Trends in Machine Learning.
M. Suzuki, K. Nakayama, Y. Matsuo (2017). Joint multimodal learning with deep generative models. ICLR Workshop.
M. Wu and N. Goodman (2018). Multimodal generative models for scalable weakly-supervised learning. NeurIPS.
W.-N. Hsu and J. R. Glass (2018). Disentangling by partitioning: A representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264.
Y. Shi, N. Siddharth, B. Paige, P. Torr (2019). Variational mixture-of-experts autoencoders for multi-modal deep generative models. NeurIPS.
T. Sutter, I. Daunhawer, J. E. Vogt (2021). Generalized Multimodal ELBO. ICLR. ] --- class: middle, center In .small[(Sadok et al., 2024)] we proposed MDVAE, a multimodal **and** dynamical VAE applied to audiovisual speech representation learning. .credit[ S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2024). A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Networks. ] --- class: middle count: false .center[ # Multimodal dynamical VAE (MDVAE) ]
- .bold[Generative model] - Inference model - Training --- ## Expressive audiovisual speech generative process .grid[ .kol-1-2[ .underline[**Observed** audiovisual speech data:] -
Audio
: $\hspace{1cm} \myxa{} \in \mathbb{R}^{d\_a \times T}$ -
Visual
: $\hspace{.98cm} \myxv{} \in \mathbb{R}^{d\_v \times T}$ .center[
] ] .kol-1-2[ .underline[**Latent** variables:] - Static,
audiovisual
: $\hspace{1.2cm}\myw \in \mathbb{R}^{\ell\_w}$
.small[(e.g., speaker's identity and global emotional state)] - Dynamical,
audiovisual
: $\hspace{.65cm}\myzav{} \in \mathbb{R}^{\ell\_{av} \times T}$
.small[(e.g., lip movements, phonemic information (part of))] - Dynamical,
visual
: $\hspace{1.4cm}\myzv{} \in \mathbb{R}^{\ell\_v \times T}$
.small[(e.g., other facial movements and head pose)] - Dynamical,
audio
: $\hspace{1.45cm}\myza{} \in \mathbb{R}^{\ell\_a \times T}$
.small[(e.g., pitch variations, phonemic information (part of))] ] ] .alert[We want to decompose / generate audiovisual speech into / from these latent factors, without supervision.] .credit[Video credits: K. Wang et al., MEAD: A large-scale audio-visual dataset for emotional talking-face generation, ECCV, 2020.] --- class: middle ## MDVAE generative model .center.width-80[] - The MDVAE generative model is defined as a **Bayesian network** .small[(Pearl, 1985)]. - It corresponds to a **factorization of the joint distribution** $ p\_\theta\left(\myxa{}, \myxv{}, \myzav{},\myza{},\myzv{}, \myw{}\right).$ - By **defining "appropriate" probabilistic dependencies** between the variables we will learn the desired disentangled representation of audiovisual speech, **without supervision**. - Let's go over each term involved in the factorization. .credit[J. Pearl (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 7).] --- class: middle count:false .center.width-100[] $$ \hspace{-.5cm} \small \boxed{p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right)}, \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ --- class: middle count: false .center.width-100[] $$ \hspace{-.5cm} \small p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right), \hspace{.3cm} \boxed{p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right)} $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ --- class: middle count: false .center.width-100[] $$ \hspace{-.5cm} \small p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right), \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small \boxed{p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right)}, \hspace{.3cm} \boxed{p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right)}, \hspace{.3cm} \boxed{p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right)} $$ --- exclude: true class: middle Important remark: - The variables $\myxa{t}$ and $\myxv{t}$ are considered conditionally independent over time, but they are not marginally independent thanks to the autoregressive prior model. 
- Said differently, the autoregressive model over the latent variables induces the temporal dynamics over the observed audio and visual speech variables. .vspace[ ] $$ \hspace{-.5cm} \small {p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right)}, \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ --- class: middle count: false .small-nvspace[ ] .center.width-100[] $$ \hspace{-.5cm} \small {p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right)}, \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ .medium[ All distributions are Gaussians parametrized by neural networks (MLPs or RNNs) and $ \small p(\myw{}) = \mathcal{N}(\myw{}; \mathbf{0}, \mathbf{I})$. ] --- class: middle count: false .center[ # Multimodal dynamical VAE (MDVAE) ]
- Generative model - .bold[Inference model] - Training --- class: middle ## MDVAE inference model - The inference model is an approximation of the intractable posterior. It defines **how we go from the observed audiovisual speech data to the latent variables**. - It can also be represented as a probabilistic graphical model, which corresponds to a factorization of the approximate posterior $$q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right).$$ - This factorization is **not chosen but induced** by the generative model and conditional independence criteria following the principle of D-separation .small[(Geiger et al., 1990)]. .credit[ D. Geiger et al., Identifying independence in Bayesian networks, Networks, 1990. ] --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} \boxed{ q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)} \times q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) \times q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) \times q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right).$$ --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right) \times \boxed{ q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) } \times q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) \times q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right).$$ --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right) \times q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) \times \boxed{ q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) } \times \boxed{ q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right) }. $$ --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right) \times q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) \times { q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) } \times { q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right) }. $$ - Temporal dependencies are not detailed. - Gaussian distributions parametrized by neural networks taking conditioning variables as input. --- class: middle count: false .center[ # Multimodal dynamical VAE (MDVAE) ]
- Generative model - Inference model - .bold[Training] --- class: middle ## MDVAE training .grid[ .kol-3-5[ - The MDVAE model is trained on the **MEAD** dataset
.small[(K. Wang et al., 2020)]: - 30 hours of expressive audiovisual speech; - 48 speakers; - 8 emotions with 3 levels of intensity. - The objective function is the ELBO, as in standard VAEs. - **2-stage training** approach using vector quantized VAEs (VQ-VAEs) .small[(van den Oord et al., 2017)]. ] .kol-2-5[ .center.width-100[] .center[.small[Image credits: (K. Wang et al., 2020)]] ] ] .credit[ K. Wang et al., MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation, ECCV, 2020
A. van den Oord et al., Neural discrete representation learning, NeurIPS 2017.] --- class: middle ## 1st training stage .center[We learn a VQ-VAE model independently for **each modality** and without **temporal modeling**.] .vspace[ ] .center.width-100[] --- class: middle ## 2nd training stage .center[We then learn the MDVAE model **"inside" the frozen VQ-VAEs**, using as input the compressed representations of the VQ-VAEs before quantization.] .vspace[ ] .center.width-100[] .center[The use of VQ-VAEs improves the reconstruction quality and speeds up the training.] --- class: middle count: false .center[ # Experiments on audiovisual speech ]
- .bold[Audiovisual speech manipulations] - Audiovisual facial image denoising - Audiovisual speech emotion recognition --- class: middle ## Qualitative analysis of the learned representations - We want to study what characteristics of the audiovisual speech data are encoded in the latent variables $\myw{}$, $\myzav{}$, $\myzv{}$, and $\myza{}$. - We will reconstruct an audiovisual speech sequence using some of the latent variables from another sequence. .small-vspace[ ] .center.width-90[] --- class: middle, center .grid[ .kol-1-2[ We transfer $\myzav{}$ from the central sequence in red to the surrounding sequences. .center[
] Lip and jaw movements are transferred. ] .kol-1-2[ We transfer $\myzv{}$ from the central sequence in red to the surrounding sequences. .center[
] Head and eyelid movements are transferred. ] ] --- class: middle, center .small-nvspace[ ] We change $\myw{}$ of the central sequence to obtain the surrounding sequences. .center[
] --- class: middle .center[Interpolation of the static audiovisual latent variable $\myw{}$] .grid[ .kol-1-2[ .center[
] .caption[Same emotion, different identities.] ] .kol-1-2[ .center[
].caption[Same identity, different emotions.] ] ] $$ \small \hspace{-.5cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\boxed{\tilde{\myw{}}\_t}\right), \hspace{.3cm} \tilde{\myw{}}\_t = \alpha\_t \myw{} + (1- \alpha\_t) \myw{}', \hspace{.3cm} \alpha\_t = (T-t)/(T-1). $$ --- class: middle - The qualitative analysis confirmed that: - The static audiovisual latent variable $\myw{}$ encodes the speaker's identity and global emotional state. - The dynamical audiovisual latent variable $\myzav{}$ encodes the speaker's lip and jaw movements. - The dynamical visual latent variable $\myzv{}$ encodes the remaining facial movements such as the eyes and head movements. - These conclusions are confirmed quantitatively by measuring the impact of swapping latent variables on the action units (cf. paper). --- class: middle count: false .center[ # Experiments on audiovisual speech ]
- Audiovisual speech manipulations - .bold[Audiovisual facial image denoising] - Audiovisual speech emotion recognition --- class: middle - We artificially **corrupt the visual modality** by adding random Gaussian noise to localized regions of the 6 central frames of a 10-frame-long sequence. The **audio modality is unaltered**. .center.width-70[] - The task is to reconstruct the visual modality. - We compare three different VAE-based models: - .bold[VQ-VAE] .small[(van den Oord et al., 2017)]: unimodal, without temporal modeling. - .bold[DSAE] .small[(Li and Mandt, 2018)]: unimodal, with temporal modeling, trained in two stages. - .bold[MDVAE]: multimodal, with temporal modeling. .alert[Denoising is done by simply encoding and decoding the corrupted (audio)visual speech sequence.] .credit[Y. Li and S. Mandt, "Disentangled sequential autoencoder", ICML 2018.] --- class: middle, black-slide .center.width-100[] --- class: middle .grid[ .kol-3-5[ .center.width-95[] ] .kol-2-5[
- Results are obtained by averaging over 200 test sequences. - Metrics are computed on the corrupted region. .small-vspace[ ] .alert-90[ The performance gap between MDVAE and the unimodal baselines is larger for the corruption of the mouth region. This is because MDVAE exploits the audio modality.] ] ] --- class: middle count: false .center[ # Experiments on audiovisual speech ]
- Audiovisual speech manipulations - Audiovisual facial image denoising - .bold[Audiovisual speech emotion recognition] --- class: middle - The qualitative analysis of the latent representations learned by MDVAE suggests that the static audiovisual latent variable $\myw{}$ encodes the speaker's emotion. - We use the **mean vector of the Gaussian inference model** $q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)$ as the input of a **multinomial logistic regression model** trained for emotion classification on the MEAD dataset (8 classes). - The mean vector is simply obtained by a forward pass through the encoder network corresponding to $q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)$. - We compare the performance of MDVAE with its **unimodal counterparts**: - .bold[A-DSAE] relies on the audio-only inference model $q\_\phi\left(\myw{} \mid \myxa{}\right)$; - .bold[V-DSAE] relies on the visual-only inference model $q\_\phi\left(\myw{} \mid \myxv{}\right)$. --- class: middle .center.width-80[] - Using the exact same experimental protocol, MDVAE outperforms its two unimodal counterparts by about 50% in accuracy. - With less than 10% of the labeled data, MDVAE reaches 90% of its maximal performance. --- class: middle .grid[ .kol-2-3[ - We also evaluate the audiovisual emotion classification performance on **RAVDESS** .small[(Livingstone and Russo, 2018)]. - MDVAE is **fine-tuned** in an **unsupervised** manner on RAVDESS; **only the logistic regression model is supervised** (680 parameters). - MDVAE is compared with a SOTA method based on an audiovisual transformer model .small[(Chumachenko et al., 2022)]. It uses a **strong EfficientFace backbone** for image feature extraction, pre-trained on AffectNet, the largest dataset of in-the-wild facial images labeled with emotions. ] .kol-1-3[ .center.width-100[] .center.small[Credits: (Chumachenko et al., 2022)] ] ] .credit[ S.R. Livingstone and F.A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English", PLoS ONE, 2018.
K. Chumachenko et al., "Self-attention fusion for audiovisual emotion recognition with incomplete data", IEEE ICPR, 2022. ] --- class: middle
| | Accuracy (%) | F1 score (%) |
|---|---|---|
| Audiovisual transformer (Chumachenko et al., 2022) | 79.2 | 78.2 |
| MDVAE + logistic regression | 79.3 | 80.7 |
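A minimal sketch of the protocol behind the MDVAE row (illustrative, not the released code): a multinomial logistic regression is trained on the per-sequence mean of $q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)$ produced by the frozen encoder, so the classifier has only $(\ell\_w + 1) \times 8$ parameters. Random features stand in for the encoder outputs so the snippet runs as-is, and the latent dimension is illustrative.

```python
# Sketch of the emotion-classification protocol (illustrative, not the released code).
# W_* stand in for the per-sequence means of q_phi(w | x^(a), x^(v)) from the frozen
# MDVAE encoder; the logistic regression is the only supervised component.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
ell_w, n_classes = 64, 8                       # illustrative latent size, 8 emotion classes
W_train, y_train = rng.normal(size=(1000, ell_w)), rng.integers(0, n_classes, 1000)
W_test, y_test = rng.normal(size=(200, ell_w)), rng.integers(0, n_classes, 200)

clf = LogisticRegression(max_iter=1000)        # (ell_w + 1) * n_classes trainable parameters
clf.fit(W_train, y_train)
y_pred = clf.predict(W_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}, "
      f"weighted F1 = {f1_score(y_test, y_pred, average='weighted'):.3f}")
```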
.vspace[ ] - MDVAE obtains competitive results even though only 680 of its parameters are learned using emotion labels ↪ this confirms the effectiveness of the static audiovisual representation $\myw{}$. - The supervised AV transformer could not be easily trained on RAVDESS and MEAD due to label mismatch, which further highlights the value of unsupervised representation learning. --- class: middle # Conclusion - We proposed the MDVAE model to learn **structured representations of multimodal and dynamical data**. - The key to learning a meaningful representation was to structure the probabilistic generative model using different latent variables to encode different characteristics of the data. - Experiments on audiovisual speech have shown that **the model effectively disentangles static and dynamical information** as well as **audiovisual and modality-specific information**. - The **same model** was used for different generation and recognition tasks (audiovisual speech manipulations, audiovisual facial image denoising, audiovisual speech emotion recognition). --- class: middle count: false .center[ # Thank you! ] .big-vspace[ ] .footnote[ Sadok, S., Leglaive, S., Girin, L., Alameda-Pineda, X., & Séguier, R. (2024). A multimodal dynamical variational autoencoder for
audiovisual speech representation learning. Neural Networks, 172, 106120. - .bold[arXiv]: https://arxiv.org/pdf/2305.03582.pdf - .bold[Code]: https://github.com/samsad35/code-mdvae - .bold[Demo]: https://samsad35.github.io/site-mdvae/ ]