class: middle, center $$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$ $$ \global\def\myxa#1{{\color{green}\mathbf{x}\_{#1}^{(a)}}} $$ $$ \global\def\myza#1{{\color{green}\mathbf{z}\_{#1}^{(a)}}} $$ $$ \global\def\myxv#1{{\color{purple}\mathbf{x}\_{#1}^{(v)}}} $$ $$ \global\def\myzv#1{{\color{purple}\mathbf{z}\_{#1}^{(v)}}} $$ $$ \global\def\myzav#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\myzds#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\mywav{{\color{brown}\mathbf{w}^{(av)}}} $$ $$ \global\def\myw{{\color{brown}\mathbf{w}}} $$ $$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$ $$ \global\def\myS#1{{\color{green}\mathbf{S}\_{#1}}} $$ $$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myu#1{\mathbf{u}\_{#1}} $$ $$ \global\def\mya#1{\mathbf{a}\_{#1}} $$ $$ \global\def\myv#1{\mathbf{v}\_{#1}} $$ $$ \global\def\mythetaz{\theta\_\myz{}} $$ $$ \global\def\mythetax{\theta\_\myx{}} $$ $$ \global\def\mythetas{\theta\_\mys{}} $$ $$ \global\def\mythetaa{\theta\_\mya{}} $$ $$ \global\def\bs#1{{\boldsymbol{#1}}} $$ $$ \global\def\diag{\text{diag}} $$ $$ \global\def\mbf{\mathbf} $$ $$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$ $$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$ $$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\neq{\mathrel{\char`≠}} $$ .big-vspace[ ] # A quick overview of the variational autoencoder, downstream tasks and extensions .vspace[ ] .center[Simon Leglaive] .small.center[CentraleSupélec, IETR (UMR CNRS 6164), France] .big-vspace[ ] .center.width-7[] .grid[ .kol-1-6[ .left.width-120[] ] .kol-2-3[ .small.center[IMAGE department seminar - November 18, 2025] ] .kol-1-6[ .right.width-65[]] ] --- class: middle, center count: false # The variational autoencoder --- class: middle ## Low-dimensional modeling of high-dimensional structured data .center.width-90[] **High-dimensional data** $\myx{} \in \mathbb{R}^d$ such as 3D human meshes, natural images, or speech signals exhibit some form of **structure**, preventing their dimensions from varying independently. - From a **geometric perspective**, this regularity suggests that the high-dimensional data actually live in a much lower-dimensional manifold. - From a **generative perspective**, it suggests that there exists a smaller dimensional **latent variable** $\myz{} \in \mathbb{R}^\ell$ that generated $\myx{} \in \mathbb{R}^d$, $\ell \ll d$. .credit[Picture credits:
wayhomestudio
on Freepik. ] --- class: middle ## Latent-variable generative modeling .small-vspace[ ] .center.width-90[] - Generative modeling consists of estimating the parameters $\theta$ so that $p\_\theta(\myx{}) \approx p^\star(\myx{})$ according to some measure of fit, for instance the Kullback-Leibler (KL) divergence. - When the model includes a deep neural network, we obtain a **deep generative model**. --- class: middle .nvspace[ ] ## The variational autoencoder (VAE) .tiny[(Kingma and Welling, 2014; Rezende et al., 2014)] .grid[ .kol-1-2[ - .underline[Prior]: $\qquad p(\myz{}) = \mathcal{N}(\myz{}; \mathbf{0}, \mathbf{I})$ - .underline[Generative model]: $ p\_\theta(\myx{} | \myz{}) = \mathcal{N}\left( \myx{}; \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{}) \right) $ .small[where] $\small \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{})$ .small[are the outputs of the **decoder**.] - .underline[Inference model]: $ q\_\phi(\myz{} | \myx{}) = \mathcal{N}\left( \myz{}; \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{}) \right) \\\\$ .small[where] $\small \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{})$ .small[are the outputs of the **encoder**.] ] .kol-1-2[ .vspace[ ] .center.width-90[] ] ] The **model is trained without supervision**, inspired by variational Bayesian inference techniques .small[(Jordan et al., 1999)]. .small-nvspace[ ] .credit[ .vspace[ ] D.P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. ICLR.
D.J. Rezende, S. Mohamed, D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning. ] --- ## One model, many tasks We can use a pre-trained VAE for several different tasks. .width-80[]
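---

## The Gaussian VAE in code (sketch)

A minimal sketch of the Gaussian VAE defined two slides earlier: the encoder outputs the mean and (diagonal) covariance of $q\_\phi(\myz{} | \myx{})$, the decoder those of $p\_\theta(\myx{} | \myz{})$, and sampling uses the reparameterization trick. This assumes PyTorch; the class name, layer sizes, and architecture are illustrative choices made for this slide, not the implementation of any cited work.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Toy VAE: standard Gaussian prior, Gaussian encoder and decoder
    with diagonal covariances parameterized by log-variances.
    Architecture and sizes are illustrative only."""

    def __init__(self, d=784, ell=16, hidden=256):
        super().__init__()
        # Inference model q_phi(z|x): outputs [mu_phi(x), log-variance]
        self.encoder = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 2 * ell)
        )
        # Generative model p_theta(x|z): outputs [mu_theta(z), log-variance]
        self.decoder = nn.Sequential(
            nn.Linear(ell, hidden), nn.Tanh(), nn.Linear(hidden, 2 * d)
        )

    def forward(self, x):
        mu_z, logvar_z = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
        mu_x, logvar_x = self.decoder(z).chunk(2, dim=-1)
        return mu_x, logvar_x, mu_z, logvar_z
```

For the generation task of the next slide, one would sample $\myz{} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and decode it, e.g. `mu_x, _ = model.decoder(torch.randn(1, 16)).chunk(2, dim=-1)`.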
--- ## Generation task How to generate new data samples? .center.width-80[] --- ## Decoder-based downstream task How to infer the latent variables given auxiliary signals? .center.width-80[] --- ## Transformation task How to disentangle, interpret, and control the latent representation? .center.width-80[] .credit[I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, A. Lerchner (2018). Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.] --- ## Encoder-based downstream task How to build sample-efficient, robust, and generalizable information extraction systems? .center.width-80[] .credit[S. van Steenkiste, F. Locatello, J. Schmidhuber, O. Bachem. Are disentangled representations helpful for abstract visual reasoning?, NeurIPS, 2019.] --- class: middle, center count: false # A rapid tour of downstream tasks based on VAEs --- ### Decoder-based downstream task ### .medium[Unsupervised speech enhancement] .tiny[(Bando et al., 2018; Leglaive et al., 2018)] .small-vspace[ ] .center.width-70[] .medium[ - Given a VAE pretrained on clean speech, the speech enhancement task consists of **inferring the VAE latent variables given a noisy speech recording**. - Approximate posterior inference using Markov chain Monte Carlo .small[(Bando et al., 2018; Leglaive et al., 2018)] or variational inference .small[(Leglaive et al., 2020)]. - Test-time iterative optimization: good in terms of adaptation but slow and performance limited compared to supervised methods on in-domain data. ] .credit[ Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara (2018). Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization. IEEE ICASSP.
S. Leglaive, L. Girin, R. Horaud (2018). A variance modeling framework based on variational autoencoders for speech enhancement. IEEE MLSP.
S. Leglaive, X. Alameda-Pineda, L. Girin, R. Horaud (2020). A recurrent variational autoencoder for speech enhancement. IEEE ICASSP. ] --- .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .small-nvspace[ ] .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .credit[ R. Chao, W.-H. Cheng, M. La Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, Y. Tsao (2024). An investigation of incorporating Mamba for speech enhancement. IEEE SLT.
https://huggingface.co/spaces/rc19477/Speech_Enhancement_Mamba ] --- ### Decoder-based downstream task ### .medium[Masked generative human mesh recovery (HMR)] .tiny[(Fiche et al., 2025)] .grid[ .kol-1-2[ .width-100[] ] .kol-1-2[ .tiny-vspace[ ]
] ] .medium[ - **HMR is an ill-posed problem**: infinitely many 3D human meshes can explain the 2D observation equally well. - We account for this ambiguity by designing the inference model as a **conditional masked generative model**. - We can obtain the **most likely output** by computing the argmax of the conditional distribution, or generate **multiple plausible outputs** by stochastic sampling. ] .credit[ G. Fiche, S. Leglaive, X. Alameda-Pineda, F. Moreno-Noguer (2025). MEGA: Masked generative autoencoder for human mesh recovery. IEEE/CVF CVPR. ] --- class: middle, center The variance of the predictions can be interpreted as a measure of **uncertainty**. .center.width-100[] --- ### Transformation downstream task ### .medium[Expressive audiovisual speech processing] .tiny[(Sadok et al., 2024)] .small-nvspace[ ] .grid[ .kol-1-2[ .underline.medium[**Observed** audiovisual speech data:] .medium[ -
Audio
: $\hspace{1cm} \myxa{} \in \mathbb{R}^{d\_a \times T}$ -
Visual
: $\hspace{.98cm} \myxv{} \in \mathbb{R}^{d\_v \times T}$ ] .center[
] ] .kol-1-2[ .underline.medium[**Latent** variables:] .medium[ - Static,
audiovisual
: $\hspace{1.2cm}\myw \in \mathbb{R}^{\ell\_w}$
.small[(e.g., speaker's identity and global emotional state)] - Dynamical,
audiovisual
: $\hspace{.65cm}\myzav{} \in \mathbb{R}^{\ell\_{av} \times T}$
.small[(e.g., lip movements, phonemic information (part of))] - Dynamical,
visual
: $\hspace{1.4cm}\myzv{} \in \mathbb{R}^{\ell\_v \times T}$
.small[(e.g., other facial movements and head pose)] - Dynamical,
audio
: $\hspace{1.45cm}\myza{} \in \mathbb{R}^{\ell\_a \times T}$
.small[(e.g., pitch variations, phonemic information (part of))] ] ] We proposed the multimodal dynamical VAE (MDVAE) to decompose audiovisual speech into these latent factors and to generate it from them, without supervision. .credit[Video credits: K. Wang et al., MEAD: A large-scale audio-visual dataset for emotional talking-face generation, ECCV, 2020.
S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2024). A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Networks. ] --- class: middle, center .grid[ .kol-1-2[ We transfer $\myzav{}$ from the central sequence in red to the surrounding sequences. .center[
] Lip and jaw movements are transferred. ] .kol-1-2[ We transfer $\myzv{}$ from the central sequence in red to the surrounding sequences. .center[
] Head and eyelid movements are transferred. ] ] --- class: middle, center .small-nvspace[ ] We change $\myw{}$ of the central sequence to obtain the surrounding sequences. .center[
] --- ### Encoder-based downstream task ### .medium[Emotion recognition using MDVAE] .tiny[(Sadok et al., 2024)] .grid[ .kol-1-2[
] .kol-1-2[ .center.width-100[] ] ] The mean of the static audiovisual latent variable $\myw{}$ is used as the input of a **linear classifier** for emotion recognition. **Only this linear classifier with 680 parameters is trained using emotion labels.** - The audiovisual approach outperforms its two unimodal counterparts by ~50% in accuracy. - With less than 10% of the labeled data, the model reaches 90% of its maximal performance. --- class: middle .grid[ .kol-2-3[ - We compare MDVAE with a SOTA method based on audiovisual transformers .small[(Chumachenko et al., 2022)]. - It uses a **strong EfficientFace backbone** for image feature extraction, pre-trained on AffectNet, the largest dataset of in-the-wild facial images labeled with emotions.
<table>
  <thead>
    <tr><th></th><th>Accuracy (%)</th><th>F1 score (%)</th></tr>
  </thead>
  <tbody>
    <tr><td>Audiovisual transformer<br>(Chumachenko et al., 2022)</td><td>79.2</td><td>78.2</td></tr>
    <tr><td>Proposed approach</td><td>79.3</td><td>80.7</td></tr>
  </tbody>
</table>
] .kol-1-3[ .center.width-100[] .center.small[Credits: (Chumachenko et al., 2022)] ] ] .alert[MDVAE obtains competitive results with only 680 parameters learned using emotion labels
↪ **confirms the effectiveness of the learned unsupervised representation** $\myw{}$.] .credit[ K. Chumachenko et al., "Self-attention fusion for audiovisual emotion recognition with incomplete data", IEEE ICPR, 2022. ] --- class: middle, center count: false # VAE extensions for structured data --- class: middle Underlying these downstream tasks are fundamental questions concerning the **design of the VAE generative and inference models themselves**, in terms of 1. **Probabilistic modeling** ↪ *What are the probabilistic dependencies between the observed data and the latent variables of interest?* 2. **Neural architectures** ↪ *How to implement these dependencies using neural networks?* Both the probabilistic model and the neural architectures should be well aligned with the **structure of the data** .small[(spatial 2D/3D, sequential, continuous, discrete, etc.)]. --- ## VAE extensions for sequential and multimodal data .center.width-85[] .credit[ D.P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. ICLR.
D.J. Rezende, S. Mohamed, D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
M. Wu and N. Goodman (2018). Multimodal generative models for scalable weakly-supervised learning. NeurIPS.
L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, X. Alameda-Pineda (2021). Dynamical variational autoencoders: A comprehensive review. Foundations and Trends in Machine Learning.
S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2024). A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Networks. ] --- class: middle ## Conclusion - Collecting labels for every scenario and task is intractable; we need **alternatives to supervised learning**. - **Deep generative** modeling with the **VAE** is a powerful **unsupervised** learning paradigm that can be applied to many different types of data, such as **multimodal** and **sequential data**. - We can learn **structured** and **interpretable representations** by **modeling probabilistic dependencies** between the observed data and the latent variables of interest. - One pretrained VAE can address **multiple downstream tasks**: generation, transformation, decoder-based or encoder-based. - Yes, we also process **audio signals** in the IMAGE department 😉 --- class: middle count: false .center[ # Thank you! ] --- class: middle ## VAE objective The VAE parameters are estimated by maximizing the **evidence lower bound** (ELBO) .small[(Neal and Hinton, 1999; Jordan et al., 1999)] defined by: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \underbrace{\mathbb{E}\_{q\_\phi(\myz{} | \myx{})} [\ln p\_\theta(\myx{} | \myz{})]}\_{\text{reconstruction accuracy}} - \underbrace{D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p(\myz{}))}\_{\text{regularization}}. \end{aligned} $$ The ELBO can also be decomposed as: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \ln p\_\theta(\myx{}) - D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{})). \end{aligned} $$ .alert[
.left-column[ .underline[Generative model parameters estimation] $$ \underset{\theta}{\max}\, \Big\\{ \mathcal{L}(\phi, \theta) \le \ln p\_\theta(\myx{}) \Big\\} $$ ] .right-column[ .underline[Inference model parameters estimation] $$ \underset{\phi}{\max}\, \mathcal{L}(\phi, \theta) \,\,\Leftrightarrow\,\, \underset{\phi}{\min}\, D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{}))$$ ] .reset-column[ ]
] .credit[ R.M. Neal and G.E. Hinton (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), .italic[Learning in graphical models].
M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning. ]
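---

class: middle

## The ELBO in code (sketch)

A minimal sketch of the ELBO above as a training loss for the diagonal-Gaussian VAE, using a single Monte Carlo sample of $\myz{} \sim q\_\phi(\myz{} | \myx{})$ for the reconstruction term and the closed-form KL divergence for the regularization term. This assumes PyTorch; the function name, tensor shapes, and single-sample estimate are illustrative choices, not taken from the cited works.

```python
import math
import torch

def negative_elbo(x, mu_x, logvar_x, mu_z, logvar_z):
    """-ELBO = -E_q[ln p_theta(x|z)] + KL(q_phi(z|x) || N(0, I)),
    for diagonal Gaussians, averaged over the batch dimension."""
    # Reconstruction accuracy: Gaussian log-likelihood ln p_theta(x|z),
    # summed over the data dimensions (single Monte Carlo sample of z)
    log_px = -0.5 * (
        math.log(2 * math.pi) + logvar_x + (x - mu_x) ** 2 / logvar_x.exp()
    ).sum(dim=-1)
    # Regularization: closed-form KL between N(mu_z, diag(exp(logvar_z)))
    # and the standard Gaussian prior N(0, I)
    kl = 0.5 * (mu_z ** 2 + logvar_z.exp() - logvar_z - 1.0).sum(dim=-1)
    return (kl - log_px).mean()
```

Maximizing the ELBO with respect to $(\phi, \theta)$ then amounts to minimizing this loss by gradient descent on the encoder and decoder parameters, e.g. `loss = negative_elbo(x, *model(x))` with a model such as the illustrative `GaussianVAE` sketched earlier.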