class: middle, center $$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$ $$ \global\def\myxa#1{{\color{green}\mathbf{x}\_{#1}^{(a)}}} $$ $$ \global\def\myza#1{{\color{green}\mathbf{z}\_{#1}^{(a)}}} $$ $$ \global\def\myxv#1{{\color{purple}\mathbf{x}\_{#1}^{(v)}}} $$ $$ \global\def\myzv#1{{\color{purple}\mathbf{z}\_{#1}^{(v)}}} $$ $$ \global\def\myzav#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\myzds#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\mywav{{\color{brown}\mathbf{w}^{(av)}}} $$ $$ \global\def\myw{{\color{brown}\mathbf{w}}} $$ $$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$ $$ \global\def\myS#1{{\color{green}\mathbf{S}\_{#1}}} $$ $$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myu#1{\mathbf{u}\_{#1}} $$ $$ \global\def\mya#1{\mathbf{a}\_{#1}} $$ $$ \global\def\myv#1{\mathbf{v}\_{#1}} $$ $$ \global\def\mythetaz{\theta\_\myz{}} $$ $$ \global\def\mythetax{\theta\_\myx{}} $$ $$ \global\def\mythetas{\theta\_\mys{}} $$ $$ \global\def\mythetaa{\theta\_\mya{}} $$ $$ \global\def\bs#1{{\boldsymbol{#1}}} $$ $$ \global\def\diag{\text{diag}} $$ $$ \global\def\mbf{\mathbf} $$ $$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$ $$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$ $$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\neq{\mathrel{\char`≠}} $$ .big-vspace[ ] # A quick overview of the variational autoencoder, downstream tasks and extensions .vspace[ ] .center[Simon Leglaive] .small.center[CentraleSupélec, IETR (UMR CNRS 6164), France] .big-vspace[ ] .center.width-7[] .grid[ .kol-1-6[ .left.width-120[] ] .kol-2-3[ .small.center[IMAGE department seminar - November 18, 2025] ] .kol-1-6[ .right.width-65[]] ] --- class: middle, center count: false # The variational autoencoder --- class: middle ## Low-dimensional modeling of high-dimensional structured data .center.width-90[] **High-dimensional data** $\myx{} \in \mathbb{R}^d$ such as 3D human meshes, natural images, or speech signals exhibit some form of **structure**, preventing their dimensions from varying independently. - From a **geometric perspective**, this regularity suggests that the high-dimensional data actually live in a much lower-dimensional manifold. - From a **generative perspective**, it suggests that there exists a smaller dimensional **latent variable** $\myz{} \in \mathbb{R}^\ell$ that generated $\myx{} \in \mathbb{R}^d$, $\ell \ll d$. .credit[Picture credits:
wayhomestudio
on Freepik. ] --- class: middle ## Latent-variable generative modeling .small-vspace[ ] .center.width-90[] - Generative modeling consists of estimating the parameters $\theta$ so that $p\_\theta(\myx{}) \approx p^\star(\myx{})$ according to some measure of fit, for instance the Kullback-Leibler (KL) divergence. - When the model includes a deep neural network, we obtain a **deep generative model**. --- class: middle .nvspace[ ] ## The variational autoencoder (VAE) .tiny[(Kingma and Welling, 2014; Rezende et al., 2014)] .grid[ .kol-1-2[ - .underline[Prior]: $\qquad p(\myz{}) = \mathcal{N}(\myz{}; \mathbf{0}, \mathbf{I})$ - .underline[Generative model]: $ p\_\theta(\myx{} | \myz{}) = \mathcal{N}\left( \myx{}; \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{}) \right) $ .small[where] $\small \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{})$ .small[are the outputs of the **decoder**.] - .underline[Inference model]: $ q\_\phi(\myz{} | \myx{}) = \mathcal{N}\left( \myz{}; \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{}) \right) \\\\$ .small[where] $\small \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{})$ .small[are the outputs of the **encoder**.] ] .kol-1-2[ .vspace[ ] .center.width-90[] ] ] The **model is trained without supervision**, inspired by variational Bayesian inference techniques .small[(Jordan et al., 1999)]. .small-nvspace[ ] .credit[ .vspace[ ] D.P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. ICLR.
D.J. Rezende, S. Mohamed, D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning. ] --- ## One model, many tasks We can use a pre-trained VAE for several different tasks. .width-80[]
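---

## The Gaussian VAE in code (sketch)

A minimal sketch of the Gaussian VAE defined two slides earlier: the encoder outputs the mean and (diagonal) covariance of $q\_\phi(\myz{} | \myx{})$, the decoder those of $p\_\theta(\myx{} | \myz{})$, and sampling uses the reparameterization trick. This assumes PyTorch; the class name, layer sizes, and architecture are illustrative choices made for this slide, not the implementation of any cited work.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Toy VAE: standard Gaussian prior, Gaussian encoder and decoder
    with diagonal covariances parameterized by log-variances.
    Architecture and sizes are illustrative only."""

    def __init__(self, d=784, ell=16, hidden=256):
        super().__init__()
        # Inference model q_phi(z|x): outputs [mu_phi(x), log-variance]
        self.encoder = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 2 * ell)
        )
        # Generative model p_theta(x|z): outputs [mu_theta(z), log-variance]
        self.decoder = nn.Sequential(
            nn.Linear(ell, hidden), nn.Tanh(), nn.Linear(hidden, 2 * d)
        )

    def forward(self, x):
        mu_z, logvar_z = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
        mu_x, logvar_x = self.decoder(z).chunk(2, dim=-1)
        return mu_x, logvar_x, mu_z, logvar_z
```

For the generation task of the next slide, one would sample $\myz{} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and decode it, e.g. `mu_x, _ = model.decoder(torch.randn(1, 16)).chunk(2, dim=-1)`.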
--- ## Generation task How to generate new data samples? .center.width-80[] --- ## Decoder-based downstream task How to infer the latent variables given auxiliary signals? .center.width-80[] --- ## Transformation task How to disentangle, interpret, and control the latent representation? .center.width-80[] .credit[I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, A. Lerchner (2018). Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.] --- ## Encoder-based downstream task How to build sample-efficient, robust, and generalizable information extraction systems? .center.width-80[] .credit[S. van Steenkiste, F. Locatello, J. Schmidhuber, O. Bachem. Are disentangled representations helpful for abstract visual reasoning?, NeurIPS, 2019.] --- class: middle, center count: false # A rapid tour of downstream tasks based on VAEs --- ### Decoder-based downstream task ### .medium[Unsupervised speech enhancement] .tiny[(Bando et al., 2018; Leglaive et al., 2018)] .small-vspace[ ] .center.width-70[] .medium[ - Given a VAE pretrained on clean speech, the speech enhancement task consists of **inferring the VAE latent variables given a noisy speech recording**. - Approximate posterior inference using Markov chain Monte Carlo .small[(Bando et al., 2018; Leglaive et al., 2018)] or variational inference .small[(Leglaive et al., 2020)]. - Test-time iterative optimization: good in terms of adaptation but slow and performance limited compared to supervised methods on in-domain data. ] .credit[ Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara (2018). Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization. IEEE ICASSP.
S. Leglaive, L. Girin, R. Horaud (2018). A variance modeling framework based on variational autoencoders for speech enhancement. IEEE MLSP.
S. Leglaive, X. Alameda-Pineda, L. Girin, R. Horaud (2020). A recurrent variational autoencoder for speech enhancement. IEEE ICASSP. ] --- .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .small-nvspace[ ] .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .credit[ R. Chao, W.-H. Cheng, M. La Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, Y. Tsao (2024). An investigation of incorporating Mamba for speech enhancement. IEEE SLT.
https://huggingface.co/spaces/rc19477/Speech_Enhancement_Mamba ] --- ### Decoder-based downstream task ### .medium[Masked generative human mesh recovery (HMR)] .tiny[(Fiche et al., 2025)] .grid[ .kol-1-2[ .width-100[] ] .kol-1-2[ .tiny-vspace[ ]
] ] .medium[ - **HMR is an ill-posed problem**: infinitely many 3D human meshes can explain the 2D observation equally well. - We account for this ambiguity by designing the inference model as a **conditional masked generative model**. - We can obtain the **most likely output** by computing the argmax of the conditional distribution, or generate **multiple plausible outputs** by stochastic sampling. ] .credit[ G. Fiche, S. Leglaive, X. Alameda-Pineda, F. Moreno-Noguer (2025). MEGA: Masked generative autoencoder for human mesh recovery. IEEE/CVF CVPR. ] --- class: middle, center The variance of the predictions can be interpreted as a measure of **uncertainty**. .center.width-100[] --- ### Transformation downstream task ### .medium[Expressive audiovisual speech processing] .tiny[(Sadok et al., 2024)] .small-nvspace[ ] .grid[ .kol-1-2[ .underline.medium[**Observed** audiovisual speech data:] .medium[ -
Audio
: $\hspace{1cm} \myxa{} \in \mathbb{R}^{d\_a \times T}$ -
Visual
: $\hspace{.98cm} \myxv{} \in \mathbb{R}^{d\_v \times T}$ ] .center[
] ] .kol-1-2[ .underline.medium[**Latent** variables:] .medium[ - Static,
audiovisual
: $\hspace{1.2cm}\myw \in \mathbb{R}^{\ell\_w}$
.small[(e.g., speaker's identity and global emotional state)] - Dynamical,
audiovisual
: $\hspace{.65cm}\myzav{} \in \mathbb{R}^{\ell\_{av} \times T}$
.small[(e.g., lip movements, phonemic information (part of))] - Dynamical,
visual
: $\hspace{1.4cm}\myzv{} \in \mathbb{R}^{\ell\_v \times T}$
.small[(e.g., other facial movements and head pose)] - Dynamical,
audio
: $\hspace{1.45cm}\myza{} \in \mathbb{R}^{\ell\_a \times T}$
.small[(e.g., pitch variations, phonemic information (part of))] ] ] We proposed the multimodal dynamical VAE (MDVAE) to decompose audiovisual speech into these latent factors and to generate it from them, without supervision. .credit[Video credits: K. Wang et al., MEAD: A large-scale audio-visual dataset for emotional talking-face generation, ECCV, 2020.
S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2024). A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Networks. ] --- class: middle, center .grid[ .kol-1-2[ We transfer $\myzav{}$ from the central sequence in red to the surrounding sequences. .center[
] Lip and jaw movements are transferred. ] .kol-1-2[ We transfer $\myzv{}$ from the central sequence in red to the surrounding sequences. .center[
] Head and eyelid movements are transferred. ] ] --- class: middle, center .small-nvspace[ ] We change $\myw{}$ of the central sequence to obtain the surrounding sequences. .center[
] --- ### Encoder-based downstream task ### .medium[Emotion recognition using MDVAE] .tiny[(Sadok et al., 2024)] .grid[ .kol-1-2[
] .kol-1-2[ .center.width-100[] ] ] The mean of the static audiovisual latent variable $\myw{}$ is used as the input of a **linear classifier** for emotion recognition. **Only this linear classifier with 680 parameters is trained using emotion labels.** - The audiovisual approach outperforms its two unimodal counterparts by ~50% in accuracy. - With less than 10% of the labeled data, the model reaches 90% of its maximal performance. --- class: middle .grid[ .kol-2-3[ - We compare MDVAE with a SOTA method based on audiovisual transformers .small[(Chumachenko et al., 2022)]. - It uses a **strong EfficientFace backbone** for image feature extraction, pre-trained on AffectNet, the largest dataset of in-the-wild facial images labeled with emotions.
<table>
  <thead>
    <tr><th></th><th>Accuracy (%)</th><th>F1 score (%)</th></tr>
  </thead>
  <tbody>
    <tr><td>Audiovisual transformer<br>(Chumachenko et al., 2022)</td><td>79.2</td><td>78.2</td></tr>
    <tr><td>Proposed approach</td><td>79.3</td><td>80.7</td></tr>
  </tbody>
</table>
] .kol-1-3[ .center.width-100[] .center.small[Credits: (Chumachenko et al., 2022)] ] ] .alert[MDVAE obtains competitive results with only 680 parameters learned using emotion labels
↪ **confirms the effectiveness of the learned unsupervised representation** $\myw{}$.] .credit[ K. Chumachenko et al., "Self-attention fusion for audiovisual emotion recognition with incomplete data", IEEE ICPR, 2022. ] --- class: middle, center count: false # VAE extensions for structured data --- class: middle Underlying these downstream tasks are fundamental questions concerning the **design of the VAE generative and inference models themselves**, in terms of 1. **Probabilistic modeling** ↪ *What are the probabilistic dependencies between the observed data and the latent variables of interest?* 2. **Neural architectures** ↪ *How to implement these dependencies using neural networks?* Both the probabilistic model and the neural architectures should be well aligned with the **structure of the data** .small[(spatial 2D/3D, sequential, continuous, discrete, etc.)]. --- ## VAE extensions for sequential and multimodal data .center.width-85[] .credit[ D.P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. ICLR.
D.J. Rezende, S. Mohamed, D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
M. Wu and N. Goodman (2018). Multimodal generative models for scalable weakly-supervised learning. NeurIPS.
L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, X. Alameda-Pineda (2021). Dynamical variational autoencoders: A comprehensive review. Foundations and Trends in Machine Learning.
S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2024). A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Networks. ] --- class: middle ## Conclusion - Collecting labels for every scenario and task is intractable; we need **alternatives to supervised learning**. - **Deep generative** modeling with the **VAE** is a powerful **unsupervised** learning paradigm that can be applied to many different types of data, such as **multimodal** and **sequential data**. - We can learn **structured** and **interpretable representations** by **modeling probabilistic dependencies** between the observed data and the latent variables of interest. - One pretrained VAE can address **multiple downstream tasks**: generation, transformation, decoder-based or encoder-based. - Yes, we also process **audio signals** in the IMAGE department 😉 --- class: middle count: false .center[ # Thank you! ] --- class: middle ## VAE objective The VAE parameters are estimated by maximizing the **evidence lower bound** (ELBO) .small[(Neal and Hinton, 1999; Jordan et al., 1999)] defined by: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \underbrace{\mathbb{E}\_{q\_\phi(\myz{} | \myx{})} [\ln p\_\theta(\myx{} | \myz{})]}\_{\text{reconstruction accuracy}} - \underbrace{D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p(\myz{}))}\_{\text{regularization}}. \end{aligned} $$ The ELBO can also be decomposed as: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \ln p\_\theta(\myx{}) - D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{})). \end{aligned} $$ .alert[
.left-column[ .underline[Generative model parameters estimation] $$ \underset{\theta}{\max}\, \Big\\{ \mathcal{L}(\phi, \theta) \le \ln p\_\theta(\myx{}) \Big\\} $$ ] .right-column[ .underline[Inference model parameters estimation] $$ \underset{\phi}{\max}\, \mathcal{L}(\phi, \theta) \,\,\Leftrightarrow\,\, \underset{\phi}{\min}\, D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{}))$$ ] .reset-column[ ]
] .credit[ R.M. Neal and G.E. Hinton (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), .italic[Learning in graphical models].
M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning. ]
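---

class: middle

## The ELBO in code (sketch)

A minimal sketch of the ELBO above as a training loss for the diagonal-Gaussian VAE, using a single Monte Carlo sample of $\myz{} \sim q\_\phi(\myz{} | \myx{})$ for the reconstruction term and the closed-form KL divergence for the regularization term. This assumes PyTorch; the function name, tensor shapes, and single-sample estimate are illustrative choices, not taken from the cited works.

```python
import math
import torch

def negative_elbo(x, mu_x, logvar_x, mu_z, logvar_z):
    """-ELBO = -E_q[ln p_theta(x|z)] + KL(q_phi(z|x) || N(0, I)),
    for diagonal Gaussians, averaged over the batch dimension."""
    # Reconstruction accuracy: Gaussian log-likelihood ln p_theta(x|z),
    # summed over the data dimensions (single Monte Carlo sample of z)
    log_px = -0.5 * (
        math.log(2 * math.pi) + logvar_x + (x - mu_x) ** 2 / logvar_x.exp()
    ).sum(dim=-1)
    # Regularization: closed-form KL between N(mu_z, diag(exp(logvar_z)))
    # and the standard Gaussian prior N(0, I)
    kl = 0.5 * (mu_z ** 2 + logvar_z.exp() - logvar_z - 1.0).sum(dim=-1)
    return (kl - log_px).mean()
```

Maximizing the ELBO with respect to $(\phi, \theta)$ then amounts to minimizing this loss by gradient descent on the encoder and decoder parameters, e.g. `loss = negative_elbo(x, *model(x))` with a model such as the illustrative `GaussianVAE` sketched earlier.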