class: middle, center $$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$ $$ \global\def\myxa#1{{\color{green}\mathbf{x}\_{#1}^{(a)}}} $$ $$ \global\def\myza#1{{\color{green}\mathbf{z}\_{#1}^{(a)}}} $$ $$ \global\def\myxv#1{{\color{purple}\mathbf{x}\_{#1}^{(v)}}} $$ $$ \global\def\myzv#1{{\color{purple}\mathbf{z}\_{#1}^{(v)}}} $$ $$ \global\def\myzav#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\myzds#1{{\color{brown}\mathbf{z}\_{#1}^{(av)}}} $$ $$ \global\def\mywav{{\color{brown}\mathbf{w}^{(av)}}} $$ $$ \global\def\myw{{\color{brown}\mathbf{w}}} $$ $$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$ $$ \global\def\myS#1{{\color{green}\mathbf{S}\_{#1}}} $$ $$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myu#1{\mathbf{u}\_{#1}} $$ $$ \global\def\mya#1{\mathbf{a}\_{#1}} $$ $$ \global\def\myv#1{\mathbf{v}\_{#1}} $$ $$ \global\def\mythetaz{\theta\_\myz{}} $$ $$ \global\def\mythetax{\theta\_\myx{}} $$ $$ \global\def\mythetas{\theta\_\mys{}} $$ $$ \global\def\mythetaa{\theta\_\mya{}} $$ $$ \global\def\bs#1{{\boldsymbol{#1}}} $$ $$ \global\def\diag{\text{diag}} $$ $$ \global\def\mbf{\mathbf} $$ $$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$ $$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$ $$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\neq{\mathrel{\char`≠}} $$ .vspace[ ] # Low-dimensional generative modeling of
multimodal sequential data .small-vspace[ ] ### Applications to audiovisual speech processing .small-vspace[ ] .center[Simon Leglaive] .small.center[CentraleSupélec, IETR (UMR CNRS 6164), France] .small-vspace[ ] .center.small[.underline[Collaborators]: Xavier Alameda-Pineda, Guénolé Fiche, Laurent Girin, Radu Horaud,
Francesc Moreno-Noguer, Samir Sadok, Renaud Séguier.] .center.width-7[] .grid[ .kol-1-6[ .left.width-120[] ] .kol-2-3[ .small.center[Universität Hamburg - September 11, 2025] ] .kol-1-6[ .right.width-65[]] ] --- class: middle count: false ## .bold[Outline]
1. Low-dimensional generative modeling...
- The variational autoencoder (VAE). - A rapid overview of 3 contributions around the VAE.
2. ... of multimodal sequential data
- The multimodal dynamical VAE (MDVAE). - Applications to audiovisual speech processing. .vspace[ ] .center.bold[Feel free to ask questions at any time!] --- class: middle ## Low-dimensional modeling of high-dimensional structured data .center.width-90[] **High-dimensional data** $\myx{} \in \mathbb{R}^d$ such as 3D human meshes, natural images, or speech signals exhibit some form of **structure**, preventing their dimensions from varying independently. - From a **geometric perspective**, this regularity suggests that the high-dimensional data actually lie on a much lower-dimensional manifold. - From a **generative perspective**, it suggests that there exists a lower-dimensional **latent variable** $\myz{} \in \mathbb{R}^\ell$ that generated $\myx{} \in \mathbb{R}^d$, $\ell \ll d$. .credit[Picture credits:
wayhomestudio
on Freepik. ] --- class: middle ## Latent-variable generative modeling .small-vspace[ ] .center.width-90[] - Generative modeling consists of estimating the parameters $\theta$ so that $p\_\theta(\myx{}) \approx p^\star(\myx{})$ according to some measure of fit, for instance the Kullback-Leibler (KL) divergence. - When the model includes a deep neural network, we obtain a **deep generative model**. --- class: middle ## The variational autoencoder (VAE) .tiny[(Kingma and Welling, 2014; Rezende et al., 2014)] .grid[ .kol-1-2[ - .underline[Prior]: $\qquad p(\myz{}) = \mathcal{N}(\myz{}; \mathbf{0}, \mathbf{I})$ - .underline[Generative model]: $ p\_\theta(\myx{} | \myz{}) = \mathcal{N}\left( \myx{}; \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{}) \right) $ .small[where] $\small \boldsymbol{\mu}\_\theta(\myz{}), \boldsymbol{\Sigma}\_\theta(\myz{})$ .small[are the outputs of the **decoder**.] - .underline[Inference model]: $ q\_\phi(\myz{} | \myx{}) = \mathcal{N}\left( \myz{}; \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{}) \right) \\\\$ .small[where] $\small \boldsymbol{\mu}\_\phi(\myx{}), \boldsymbol{\Sigma}\_\phi(\myx{})$ .small[are the outputs of the **encoder**.] ] .kol-1-2[ .vspace[ ] .center.width-90[] ] ] .small-nvspace[ ] .credit[ .vspace[ ] D.P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. ICLR.
D.J. Rezende, S. Mohamed, D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML. ] --- class: middle ## VAE objective The VAE parameters are estimated by maximizing the **evidence lower bound** (ELBO) .small[(Neal and Hinton, 1999; Jordan et al., 1999)] defined by: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \underbrace{\mathbb{E}\_{q\_\phi(\myz{} | \myx{})} [\ln p\_\theta(\myx{} | \myz{})]}\_{\text{reconstruction accuracy}} - \underbrace{D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p(\myz{}))}\_{\text{regularization}}. \end{aligned} $$ The ELBO can also be decomposed as: $$\begin{aligned} \mathcal{L}(\phi, \theta) &= \ln p\_\theta(\myx{}) - D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{})). \end{aligned} $$ .alert[
.left-column[ .underline[Generative model parameters estimation] $$ \underset{\theta}{\max}\, \Big\\{ \mathcal{L}(\phi, \theta) \le \ln p\_\theta(\myx{}) \Big\\} $$ ] .right-column[ .underline[Inference model parameters estimation] $$ \underset{\phi}{\max}\, \mathcal{L}(\phi, \theta) \,\,\Leftrightarrow\,\, \underset{\phi}{\min}\, D\_{\text{KL}}(q\_\phi(\myz{} | \myx{}) \parallel p\_\theta(\myz{} | \myx{}))$$ ] .reset-column[ ]
] .credit[ R.M. Neal and G.E. Hinton (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), .italic[Learning in graphical models].
M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning. ] --- ## One model, many tasks We can use a pre-trained VAE for several different tasks. .center.width-80[] --- ## Generation task How to generate new data samples? .center.width-80[] --- ## Decoder-based downstream task How to infer the latent variables given auxiliary signals? .center.width-80[] --- ## Transformation task How to disentangle, interpret, and control the latent representation? .center.width-80[] .credit[I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, A. Lerchner (2018). Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.] --- ## Encoder-based downstream task How to build sample-efficient, robust, and generalizable information extraction systems? .center.width-80[] .credit[S. van Steenkiste, F. Locatello, J. Schmidhuber, O. Bachem. Are disentangled representations helpful for abstract visual reasoning? NeurIPS, 2019.] --- class: middle, center 3 examples of transformation and decoder-based downstream tasks. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-70[] - We want to analyze the structure of the latent space of a VAE pretrained on short-term speech power spectra. .credit[ S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2023). Learning and controlling the source-filter representation of speech with a variational autoencoder. Speech Communication. ] --- count:false ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - We take the perspective of the **source-filter model of speech production** proposed by Fant .small[(1970)]. - The production of speech results from the interaction of a source with a time-varying filter. - Assumption: We can control the source and the filter independently of each other. - Important parameters are the **fundamental frequency** and the **formant frequencies**. .credit[ S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2023). Learning and controlling the source-filter representation of speech with a variational autoencoder. Speech Communication. G. Fant (1970). Acoustic theory of speech production. 2. Walter de Gruyter. ] --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-60[] - We want to study how the fundamental and formant frequencies are encoded within the VAE latent space. - We generate short synthetic speech signals with a single attribute varying at a time. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - Each sequence of short-term speech power spectra is encoded by the pretrained VAE encoder. - We obtain a sequence of 16-dimensional latent vectors. - Intuition: Because only one attribute varies, we expect the latent vectors to lie on a lower-dimensional manifold of the original latent space. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - Indeed, we can learn an "invertible" mapping between the VAE latent space and a linear subspace of lower dimension while retaining most of the data variance (see the sketch below).
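A minimal, self-contained sketch of this subspace analysis (illustrative, not the paper's code): PCA is fitted to the latent vectors of a trajectory along which a single attribute varies, and the leading principal directions give the attribute-specific subspace. The random matrix below stands in for the actual encoder outputs.

```python
# Minimal sketch of the latent-subspace analysis (illustrative, not the paper's code).
# Z stands in for the 16-dimensional VAE latent vectors of a trajectory where a
# single attribute (e.g., the fundamental frequency) varies.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))                # replace with the encoder means of the trajectory

pca = PCA().fit(Z)
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.99)) + 1   # smallest dimension retaining 99% of the variance
U = pca.components_[:k]                       # (k, 16) orthonormal basis of the latent subspace

# "Invertible" mapping: project onto the subspace and map back to the latent space.
coords = (Z - pca.mean_) @ U.T
Z_back = coords @ U + pca.mean_
```

Repeating this analysis for each attribute yields one basis per attribute; the near-zero dot products between these bases are what the orthogonality claim on the next slide refers to.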
--- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - The subspaces for different attributes are orthogonal to each other, as can be verified by simply computing the dot product between the vectors that span each subspace. - The latent representation learned by the VAE is consistent with the source-filter model. - It suggests that we could manipulate one attribute in its latent subspace without affecting the others, similarly to how humans produce speech according to the source-filter model. --- ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-95[] - Using only a few seconds of artificial and automatically labeled speech data, we learned to regress from the speech attribute to a specific position in the corresponding subspace. - This allows us to perform disentangled speech manipulations by applying a simple translation within the latent subspaces. --- class: middle ## Transformation task example ### Source-filter VAE .small[(Sadok et al., 2023)] .center.width-80[] .alert[We can generate / transform speech spectrograms given input trajectories of the fundamental and formant frequencies.] --- ## Decoder-based downstream task example ### Unsupervised speech enhancement .small[(Bando et al., 2018; Leglaive et al., 2018)] .center.width-80[] .medium[ - Given a VAE pretrained on clean speech, the speech enhancement task consists of **inferring the VAE latent variables given a noisy speech recording**. - Approximate posterior inference using Markov chain Monte Carlo .small[(Bando et al., 2018; Leglaive et al., 2018)] or variational inference .small[(Leglaive et al., 2020)]. - Test-time iterative optimization: good in terms of adaptation, but slow and with limited performance compared to supervised methods on in-domain data. ] .credit[ Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara (2018). Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization. IEEE ICASSP.
S. Leglaive, L. Girin, R. Horaud (2018). A variance modeling framework based on variational autoencoders for speech enhancement. IEEE MLSP.
S. Leglaive, X. Alameda-Pineda, L. Girin, R. Horaud (2020). A Recurrent Variational Autoencoder for Speech Enhancement. IEEE ICASSP. ] --- .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .small-nvspace[ ] .grid[ .kol-1-3.center[
.small-nvspace[ ]
.center.small[Mixture] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[RVAE VEM (Leglaive et al., 2020)] ] .kol-1-3.center[
.small-nvspace[ ]
.center.small[SEMamba (Chao et al., 2024)] ] ] .credit[ R. Chao, W.-H. Cheng, M. La Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, Y. Tsao (2024). An investigation of incorporating Mamba for speech enhancement. IEEE SLT.
https://huggingface.co/spaces/rc19477/Speech_Enhancement_Mamba ] --- ## Decoder-based downstream task example ### Masked generative human mesh recovery (HMR) .small[(Fiche et al., 2025)] .grid[ .kol-1-2[ .width-100[] ] .kol-1-2[ .tiny-vspace[ ]
] ] .medium[ - **HMR is an ill-posed problem**: There exist infinitely many 3D human meshes that explain the 2D observation equally well. - We account for this ambiguity by designing the inference model as a **conditional masked generative model**. - We can obtain the **most likely output** by computing the argmax of the conditional distribution, or generate **multiple plausible outputs** by stochastic sampling. ] .credit[ G. Fiche, S. Leglaive, X. Alameda-Pineda, F. Moreno-Noguer (2025). MEGA: Masked generative autoencoder for human mesh recovery. IEEE/CVF CVPR. ] --- class: middle, center The variance of the predictions can be interpreted as a measure of **uncertainty**. .center.width-100[] --- class: center, middle Underlying these downstream tasks are fundamental questions concerning the **design of the VAE generative and inference models themselves**, particularly in terms of **probabilistic modeling** and neural network **architectures** that should be well aligned with the **structure of the data**. In the rest of this presentation we will focus on **multimodal and sequential data**. --- class: middle .grid[ .kol-2-3[ ## Multimodality - Different modalities provide complementary information. - Multimodality is useful for the **perception** and the **analysis of humans** using **signal processing** and **machine learning**: - The visual modality is **robust** to acoustic noise .small[(useful for speech enhancement, ASR, etc.)]; - The audio modality is **robust** to occlusion and lighting conditions .small[(useful for person identification, etc.)]; - Some phenomena or tasks are **ambiguous** when perceived or addressed through a single modality
.small[(e.g., emotional expressions, visemes, speech extraction, etc.)] .alert[Multimodality provides robustness and helps disambiguation.] ] .kol-1-3[.center.width-60[] ] ] .reset-column[ ] .credit.pull-right[Picture credits:
wayhomestudio
on Freepik.] --- ## VAEs for multimodal and sequential data .center.width-80[] - The VAE has been extended in many ways, including for processing multimodal **or** sequential data. - But a model for both multimodal and sequential data was lacking. .credit[ .vspace[ ] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, X. Alameda-Pineda (2021). Dynamical variational autoencoders: A comprehensive review. Foundations and Trends in Machine Learning.
M. Suzuki, K. Nakayama, Y. Matsuo (2017). Joint multimodal learning with deep generative models. ICLR Workshop.
M. Wu and N. Goodman (2018). Multimodal generative models for scalable weakly-supervised learning. NeurIPS.
W.-N. Hsu and J. R. Glass (2018). Disentangling by partitioning: A representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264.
Y. Shi, N. Siddharth, B. Paige, P. Torr (2019). Variational mixture-of-experts autoencoders for multi-modal deep generative models. NeurIPS.
T. Sutter, I. Daunhawer, J. E. Vogt (2021). Generalized Multimodal ELBO. ICLR. ] --- class: middle, center In .small[(Sadok et al., 2024)] we proposed MDVAE, a multimodal **and** dynamical VAE applied to audiovisual speech representation learning. .credit[ S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier (2024). A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Networks. ] --- class: middle count: false .center[ # Multimodal dynamical VAE (MDVAE) ]
- .bold[Generative model] - Inference model - Training --- ## Expressive audiovisual speech generative process .grid[ .kol-1-2[ .underline[**Observed** audiovisual speech data:] -
Audio
: $\hspace{1cm} \myxa{} \in \mathbb{R}^{d\_a \times T}$ -
Visual
: $\hspace{.98cm} \myxv{} \in \mathbb{R}^{d\_v \times T}$ .center[
] ] .kol-1-2[ .underline[**Latent** variables:] - Static,
audiovisual
: $\hspace{1.2cm}\myw \in \mathbb{R}^{\ell\_w}$
.small[(e.g., speaker's identity and global emotional state)] - Dynamical,
audiovisual
: $\hspace{.65cm}\myzav{} \in \mathbb{R}^{\ell\_{av} \times T}$
.small[(e.g., lip movements, phonemic information (part of))] - Dynamical,
visual
: $\hspace{1.4cm}\myzv{} \in \mathbb{R}^{\ell\_v \times T}$
.small[(e.g., other facial movements and head pose)] - Dynamical,
audio
: $\hspace{1.45cm}\myza{} \in \mathbb{R}^{\ell\_a \times T}$
.small[(e.g., pitch variations, phonemic information (part of))] ] ] .alert[We want to decompose / generate audiovisual speech into / from these latent factors, without supervision.] .credit[Video credits: K. Wang et al., MEAD: A large-scale audio-visual dataset for emotional talking-face generation, ECCV, 2020.] --- class: middle ## MDVAE generative model .center.width-80[] - The MDVAE generative model is defined as a **Bayesian network** .small[(Pearl, 1985)]. - It corresponds to a **factorization of the joint distribution** $ p\_\theta\left(\myxa{}, \myxv{}, \myzav{},\myza{},\myzv{}, \myw{}\right).$ - By **defining "appropriate" probabilistic dependencies** between the variables we will learn the desired disentangled representation of audiovisual speech, **without supervision**. - Let's go over each term involved in the factorization. .credit[J. Pearl (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 7).] --- class: middle count:false .center.width-100[] $$ \hspace{-.5cm} \small \boxed{p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right)}, \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ --- class: middle count: false .center.width-100[] $$ \hspace{-.5cm} \small p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right), \hspace{.3cm} \boxed{p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right)} $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ --- class: middle count: false .center.width-100[] $$ \hspace{-.5cm} \small p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right), \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small \boxed{p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right)}, \hspace{.3cm} \boxed{p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right)}, \hspace{.3cm} \boxed{p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right)} $$ --- exclude: true class: middle Important remark: - The variables $\myxa{t}$ and $\myxv{t}$ are considered conditionally independent over time, but they are not marginally independent thanks to the autoregressive prior model. 
- Said differently, the autoregressive model over the latent variables induces the temporal dynamics over the observed audio and visual speech variables. .vspace[ ] $$ \hspace{-.5cm} \small {p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right)}, \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ --- class: middle count: false .small-nvspace[ ] .center.width-100[] $$ \hspace{-.5cm} \small {p\_\theta\left(\myxa{} \mid \myzav{},\myza{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxa{t} \mid \myzav{t},\myza{t},\myw{}\right)}, \hspace{.3cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\myw{}\right) $$ $$ \hspace{-.5cm} \small p\_\theta\left(\myza{}\right) = \prod\_{t=1}^T p\_\theta\left(\myza{t} \mid \myza{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzav{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzav{t} \mid \myzav{1:t-1} \right), \hspace{.3cm} p\_\theta\left(\myzv{}\right) = \prod\_{t=1}^T p\_\theta\left(\myzv{t} \mid \myzv{1:t-1} \right) $$ .medium[ All distributions are Gaussians parametrized by neural networks (MLPs or RNNs) and $ \small p(\myw{}) = \mathcal{N}(\myw{}; \mathbf{0}, \mathbf{I})$. ] --- class: middle count: false .center[ # Multimodal dynamical VAE (MDVAE) ]
- Generative model - .bold[Inference model] - Training --- class: middle ## MDVAE inference model - The inference model is an approximation of the intractable posterior. It defines **how we go from the observed audiovisual speech data to the latent variables**. - It can also be represented as a probabilistic graphical model, which corresponds to a factorization of the approximate posterior $$q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right).$$ - This factorization is **not chosen but induced** by the generative model and conditional independence criteria following the principle of D-separation .small[(Geiger et al., 1990)]. .credit[ D. Geiger et al., Identifying independence in Bayesian networks, Networks, 1990. ] --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} \boxed{ q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)} \times q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) \times q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) \times q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right).$$ --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right) \times \boxed{ q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) } \times q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) \times q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right).$$ --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right) \times q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) \times \boxed{ q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) } \times \boxed{ q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right) }. $$ --- class: middle count: false .center.width-100[] The inference model $\small q\_\phi\left(\myzav{},\myza{},\myzv{}, \myw{} \mid \myxa{}, \myxv{}\right)$ decomposes as the product of four terms: $$\small \hspace{-.5cm} q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right) \times q\_\phi\left(\myzav{}\mid \myxa{}, \myxv{}, \myw{} \right) \times { q\_\phi\left(\myza{} \mid \myxa{}, \myzav{}, \myw{}\right) } \times { q\_\phi\left(\myzv{} \mid \myxv{}, \myzav{}, \myw{}\right) }. $$ - Temporal dependencies are not detailed. - Gaussian distributions parametrized by neural networks taking conditioning variables as input. --- class: middle count: false .center[ # Multimodal dynamical VAE (MDVAE) ]
- Generative model - Inference model - .bold[Training] --- class: middle ## MDVAE training .grid[ .kol-3-5[ - The MDVAE model is trained on the **MEAD** dataset
.small[(K. Wang et al., 2020)]: - 30 hours of expressive audiovisual speech; - 48 speakers; - 8 emotions with 3 levels of intensity. - The objective function is the ELBO, as in standard VAEs. - **2-stage training** approach using vector quantized VAEs (VQ-VAEs) .small[(van den Oord et al., 2017)]. ] .kol-2-5[ .center.width-100[] .center[.small[Image credits: (K. Wang et al., 2020)]] ] ] .credit[ K. Wang et al., MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation, ECCV, 2020
A. van den Oord et al., Neural discrete representation learning, NeurIPS 2017.] --- class: middle ## 1st training stage .center[We learn a VQ-VAE model independently for **each modality** and without **temporal modeling**.] .vspace[ ] .center.width-100[] --- class: middle ## 2nd training stage .center[We then learn the MDVAE model **"inside" the frozen VQ-VAEs**, using as input the compressed representations of the VQ-VAEs before quantization.] .vspace[ ] .center.width-100[] .center[The use of VQ-VAEs improves the reconstruction quality and speeds up the training.] --- class: middle count: false .center[ # Experiments on audiovisual speech ]
- .bold[Audiovisual speech manipulations] - Audiovisual facial image denoising - Audiovisual speech emotion recognition --- class: middle ## Qualitative analysis of the learned representations - We want to study what characteristics of the audiovisual speech data are encoded in the latent variables $\myw{}$, $\myzav{}$, $\myzv{}$, and $\myza{}$. - We will reconstruct an audiovisual speech sequence using some of the latent variables from another sequence. .small-vspace[ ] .center.width-90[] --- class: middle, center .grid[ .kol-1-2[ We transfer $\myzav{}$ from the central sequence in red to the surrounding sequences. .center[
] Lip and jaw movements are transferred. ] .kol-1-2[ We transfer $\myzv{}$ from the central sequence in red to the surrounding sequences. .center[
] Head and eyelid movements are transferred. ] ] --- class: middle, center .small-nvspace[ ] We change $\myw{}$ of the central sequence to obtain the surrounding sequences. .center[
] --- class: middle .center[Interpolation of the static audiovisual latent variable $\myw{}$] .grid[ .kol-1-2[ .center[
] .caption[Same emotion, different identities.] ] .kol-1-2[ .center[
].caption[Same identity, different emotions.] ] ] $$ \small \hspace{-.5cm} p\_\theta\left(\myxv{} \mid \myzav{},\myzv{},\myw{}\right) = \prod\_{t=1}^T p\_\theta\left(\myxv{t} \mid \myzav{t},\myzv{t},\boxed{\tilde{\myw{}}\_t}\right), \hspace{.3cm} \tilde{\myw{}}\_t = \alpha\_t \myw{} + (1- \alpha\_t) \myw{}', \hspace{.3cm} \alpha\_t = (T-t)/(T-1). $$ --- class: middle - The qualitative analysis confirmed that: - The static audiovisual latent variable $\myw{}$ encodes the speaker's identity and global emotional state. - The dynamical audiovisual latent variable $\myzav{}$ encodes the speaker's lip and jaw movements. - The dynamical visual latent variable $\myzv{}$ encodes the remaining facial movements such as the eyes and head movements. - These conclusions are confirmed quantitatively by measuring the impact of swapping latent variables on the action units (cf. paper). --- class: middle count: false .center[ # Experiments on audiovisual speech ]
- Audiovisual speech manipulations - .bold[Audiovisual facial image denoising] - Audiovisual speech emotion recognition --- class: middle - We artificially **corrupt the visual modality** by adding random Gaussian noise to localized regions of the 6 central frames of a 10-frame-long sequence. The **audio modality is unaltered**. .center.width-70[] - The task is to reconstruct the visual modality. - We compare three different VAE-based models: - .bold[VQ-VAE] .small[(van den Oord et al., 2017)]: unimodal, without temporal modeling. - .bold[DSAE] .small[(Li and Mandt, 2018)]: unimodal, with temporal modeling, trained in two stages. - .bold[MDVAE]: multimodal, with temporal modeling. .alert[Denoising is done by simply encoding and decoding the corrupted (audio)visual speech sequence.] .credit[Y. Li and S. Mandt, "Disentangled sequential autoencoder", ICML 2018.] --- class: middle, black-slide .center.width-100[] --- class: middle .grid[ .kol-3-5[ .center.width-95[] ] .kol-2-5[
- Results are obtained by averaging over 200 test sequences. - Metrics are computed on the corrupted region. .small-vspace[ ] .alert-90[ The performance gap between MDVAE and the unimodal baselines is larger for the corruption of the mouth region. This is because MDVAE exploits the audio modality.] ] ] --- class: middle count: false .center[ # Experiments on audiovisual speech ]
- Audiovisual speech manipulations - Audiovisual facial image denoising - .bold[Audiovisual speech emotion recognition] --- class: middle - The qualitative analysis of the latent representations learned by MDVAE suggests that the static audiovisual latent variable $\myw{}$ encodes the speaker's emotion. - We use the **mean vector of the Gaussian inference model** $q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)$ as the input of a **multinomial logistic regression model** trained for emotion classification on the MEAD dataset (8 classes). - The mean vector is simply obtained by a forward pass through the encoder network corresponding to $q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)$. - We compare the performance of MDVAE with its **unimodal counterparts**: - .bold[A-DSAE] relies on the audio-only inference model $q\_\phi\left(\myw{} \mid \myxa{}\right)$; - .bold[V-DSAE] relies on the visual-only inference model $q\_\phi\left(\myw{} \mid \myxv{}\right)$. --- class: middle .center.width-80[] - Using the exact same experimental protocol, MDVAE outperforms its two unimodal counterparts by about 50% in accuracy. - With less than 10% of the labeled data, MDVAE reaches 90% of its maximal performance. --- class: middle .grid[ .kol-2-3[ - We also evaluate the audiovisual emotion classification performance on **RAVDESS** .small[(Livingstone and Russo, 2018)]. - MDVAE is **fine-tuned** in an **unsupervised** manner on RAVDESS; **only the logistic regression model is supervised** (680 parameters). - MDVAE is compared with a SOTA method based on an audiovisual transformer model .small[(Chumachenko et al., 2022)]. It uses a **strong EfficientFace backbone** for image feature extraction, pre-trained on AffectNet, the largest dataset of in-the-wild facial images labeled with emotions. ] .kol-1-3[ .center.width-100[] .center.small[Credits: (Chumachenko et al., 2022)] ] ] .credit[ S.R. Livingstone and F.A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English", PLoS ONE, 2018.
K. Chumachenko et al., "Self-attention fusion for audiovisual emotion recognition with incomplete data", IEEE ICPR, 2022. ] --- class: middle
| | Accuracy (%) | F1 score (%) |
|---|---|---|
| Audiovisual transformer (Chumachenko et al., 2022) | 79.2 | 78.2 |
| MDVAE + logistic regression | 79.3 | 80.7 |
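A minimal sketch of the protocol behind the MDVAE row (illustrative, not the released code): a multinomial logistic regression is trained on the per-sequence mean of $q\_\phi\left(\myw{} \mid \myxa{}, \myxv{}\right)$ produced by the frozen encoder, so the classifier has only $(\ell\_w + 1) \times 8$ parameters. Random features stand in for the encoder outputs so the snippet runs as-is, and the latent dimension is illustrative.

```python
# Sketch of the emotion-classification protocol (illustrative, not the released code).
# W_* stand in for the per-sequence means of q_phi(w | x^(a), x^(v)) from the frozen
# MDVAE encoder; the logistic regression is the only supervised component.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
ell_w, n_classes = 64, 8                       # illustrative latent size, 8 emotion classes
W_train, y_train = rng.normal(size=(1000, ell_w)), rng.integers(0, n_classes, 1000)
W_test, y_test = rng.normal(size=(200, ell_w)), rng.integers(0, n_classes, 200)

clf = LogisticRegression(max_iter=1000)        # (ell_w + 1) * n_classes trainable parameters
clf.fit(W_train, y_train)
y_pred = clf.predict(W_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}, "
      f"weighted F1 = {f1_score(y_test, y_pred, average='weighted'):.3f}")
```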
.vspace[ ] - MDVAE obtains competitive results even though only 680 of its parameters are learned using emotion labels ↪ this confirms the effectiveness of the static audiovisual representation $\myw{}$. - The supervised AV transformer could not be easily trained on RAVDESS and MEAD due to label mismatch, which further highlights the value of unsupervised representation learning. --- class: middle # Conclusion - We proposed the MDVAE model to learn **structured representations of multimodal and dynamical data**. - The key to learning a meaningful representation was to structure the probabilistic generative model using different latent variables to encode different characteristics of the data. - Experiments on audiovisual speech have shown that **the model effectively disentangles static and dynamical information** as well as **audiovisual and modality-specific information**. - The **same model** was used for different generation and recognition tasks (audiovisual speech manipulations, audiovisual facial image denoising, audiovisual speech emotion recognition). --- class: middle count: false .center[ # Thank you! ] .big-vspace[ ] .footnote[ Sadok, S., Leglaive, S., Girin, L., Alameda-Pineda, X., & Séguier, R. (2024). A multimodal dynamical variational autoencoder for
audiovisual speech representation learning. Neural Networks, 172, 106120. - .bold[arXiv]: https://arxiv.org/pdf/2305.03582.pdf - .bold[Code]: https://github.com/samsad35/code-mdvae - .bold[Demo]: https://samsad35.github.io/site-mdvae/ ]