Simon Leglaive
CentraleSupélec, IETR (UMR CNRS 6164), France
Seminar @ LISTEN joint laboratory
Palaiseau (France) - January 12, 2023
Samir Sadok1
Laurent Girin2
Xavier Alameda-Pineda3
Renaud Séguier1
1 CentraleSupélec, IETR (UMR CNRS 6164), France
2 Univ. Grenoble Alpes, CNRS, Grenoble-INP, GIPSA-lab, France
3 Inria, Univ. Grenoble Alpes, CNRS, LJK, France
🚧 Ongoing work 🚧
The majority of successful applications of machine/deep learning rely on supervised learning.
The Cityscapes dataset
(Cordts et al., 2016)
More than 300 days of annotation! 😱
It is intractable to collect labels for every scenario and task.
M. Cordts et al., The Cityscapes dataset for semantic urban scene understanding, IEEE CVPR 2016
Zhao et al., ICNet for real-time semantic segmentation on high-resolution images, ECCV 2018
We need unsupervised methods that can learn to unveil the underlying structure of the data with few or no ground-truth labels.
GENESIS (Engelcke et al., 2020), a generative model of 3D scenes capable of both decomposing and generating scenes by capturing relationships between scene components. Image credits: (Engelcke et al., 2020).
Deep latent variable generative models have emerged as promising approaches.
M. Engelcke et al., GENESIS: Generative scene inference and sampling with object-centric latent representations, ICLR 2020.
High-dimensional data x∈Rd such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently.
From a generative perspective, this regularity suggests that there exists a smaller dimensional latent variable z∈Rℓ that generated x∈Rd, ℓ≪d.
Picture credits: wayhomestudio on Freepik.
Generative modeling consists in estimating the parameters θ so that pθ(x)≈p⋆(x) according to some measure of fit, for instance the Kullback-Leibler (KL) divergence.
When the model includes a deep neural network, we obtain a deep generative model.
Prior
p(z)=N(z;0,I)
Generative model
pθ(x∣z)=N(x;μθ(z),Σθ(z))
Inference model
qϕ(z∣x)=N(z;μϕ(x),Σϕ(x))
The inference model approximates the intractable exact posterior distribution
pθ(z∣x) = pθ(x∣z)p(z) / ∫ pθ(x∣z)p(z) dz
Σϕ(x) = diag{vϕ(x)},  Σθ(z) = diag{vθ(z)}
D.P. Kingma and M. Welling, Auto-encoding variational Bayes, ICLR 2014.
D.J. Rezende et al., Stochastic backpropagation and approximate inference in deep generative models, ICML 2014.
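As a rough illustration of how these Gaussian distributions are parametrized in practice (a minimal sketch with made-up layer sizes, not the architecture used in this work):

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Inference model q_phi(z|x) = N(z; mu_phi(x), diag{v_phi(x)})."""
    def __init__(self, d, l, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, l)       # mean mu_phi(x)
        self.logvar = nn.Linear(hidden, l)   # log of the diagonal variance v_phi(x)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class GaussianDecoder(nn.Module):
    """Generative model p_theta(x|z) = N(x; mu_theta(z), diag{v_theta(z)})."""
    def __init__(self, d, l, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(l, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, d)       # mean mu_theta(z)
        self.logvar = nn.Linear(hidden, d)   # log of the diagonal variance v_theta(z)

    def forward(self, z):
        h = self.net(z)
        return self.mu(h), self.logvar(h)
```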
The VAE parameters are estimated by maximizing the evidence lower bound (ELBO) (Neal and Hinton, 1999; Jordan et al., 1999), defined by: L(ϕ,θ) = Eqϕ(z∣x)[ln pθ(x∣z)] − DKL(qϕ(z∣x) ∥ p(z)), where the first term measures the reconstruction accuracy and the second term acts as a regularization.
The ELBO can also be decomposed as: L(ϕ,θ)=lnpθ(x)−DKL(qϕ(z∣x)∥pθ(z∣x)).
Generative model parameters estimation
max_θ L(ϕ,θ), with L(ϕ,θ) ≤ ln pθ(x)
Inference model parameters estimation
max_ϕ L(ϕ,θ) ⇔ min_ϕ DKL(qϕ(z∣x) ∥ pθ(z∣x))
R.M. Neal and G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in M. I. Jordan (Ed.), Learning in graphical models, 1999.
M.I. Jordan et al., An introduction to variational methods for graphical models, Machine Learning, 1999.
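In practice, the ELBO is estimated with the reparameterization trick and optimized by stochastic gradient ascent over both ϕ and θ. A minimal sketch, assuming the Gaussian encoder/decoder modules above and a single Monte Carlo sample:

```python
import torch

def elbo(x, encoder, decoder):
    """One-sample Monte Carlo estimate of the ELBO for a Gaussian VAE."""
    mu_z, logvar_z = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    mu_x, logvar_x = decoder(z)
    # Reconstruction accuracy: ln p_theta(x|z) for a diagonal Gaussian (up to constants)
    rec = -0.5 * (logvar_x + (x - mu_x) ** 2 / logvar_x.exp()).sum(dim=-1)
    # Regularization: closed-form KL( N(mu_z, diag v_z) || N(0, I) )
    kl = -0.5 * (1 + logvar_z - mu_z ** 2 - logvar_z.exp()).sum(dim=-1)
    return (rec - kl).mean()  # maximize this quantity w.r.t. phi and theta
```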
A trained VAE can be used for generation, transformation, and downstream tasks.
Ideally, the learned representation should be disentangled (Higgins et al., 2018), i.e., somehow easy to relate to independent and interpretable high-level characteristics of the data.
Supervised learning from disentangled representations has been found to be more sample-efficient, more robust, and better in terms of generalization (van Steenkiste et al., 2019).
Over the past few years, the VAE has been extended in many ways, including for processing dynamical or multimodal data.
In this talk, we will present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning.
Speaker's identity and global emotional state
↪ static, shared (AV)
Lip movements, phonemic information (part of)
↪ dynamical, shared (AV)
Other facial movements and head pose
↪ dynamical, modality-specific (V)
Pitch variations, phonemic information (part of)
↪ dynamical, modality-specific (A)
We seek to learn a multimodal dynamical VAE that disentangles these AV speech latent factors: dynamical and modality-specific, dynamical and audiovisual, static and audiovisual.
Video credits: K. Wang et al., MEAD: A large-scale audio-visual dataset for emotional talking-face generation, ECCV, 2020.
x(a)∈Rda×T
Observed dynamical audio data
x(v)∈Rdv×T
Observed dynamical visual data
w∈Rℓw
Latent static audiovisual data
z(av)∈Rℓav×T
Latent dynamical audiovisual data
z(a)∈Rℓa×T
Latent dynamical audio data
z(v)∈Rℓv×T
Latent dynamical visual data
Defining the generative model amounts to defining the joint distribution of all variables: pθ(x(a),x(v),z(av),z(a),z(v),w).
By structuring the dependencies between these variables we hope to learn the desired disentangled representation of audiovisual speech in an unsupervised manner.
The temporal model in MDVAE is largely inspired by the disentangled sequential autoencoder (DSAE) of Li and Mandt (2018).
MDVAE can be seen as a multimodal extension of DSAE.
Y. Li and S. Mandt, "Disentangled sequential autoencoder", ICML 2018.
The global probabilistic graphical model of MDVAE is defined by the following Bayesian network.
pθ(x(a),x(v),z(av),z(a),z(v),w)=pθ(x(a)∣z(av),z(a),w)pθ(x(v)∣z(av),z(v),w)pθ(z(av))pθ(z(a))pθ(z(v))pθ(w)
To complete the generative model, we also need to define the temporal dependencies for the sequential variables.
pθ(x(a)∣z(av),z(a),w) = ∏_{t=1}^{T} pθ(xt(a)∣zt(av),zt(a),w),  pθ(x(v)∣z(av),z(v),w) = ∏_{t=1}^{T} pθ(xt(v)∣zt(av),zt(v),w)
pθ(z(a)) = ∏_{t=1}^{T} pθ(zt(a)∣z1:t−1(a)),  pθ(z(av)) = ∏_{t=1}^{T} pθ(zt(av)∣z1:t−1(av)),  pθ(z(v)) = ∏_{t=1}^{T} pθ(zt(v)∣z1:t−1(v))
These distributions are Gaussians parametrized by neural networks and p(w)=N(w;0,I).
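For example, an autoregressive prior such as pθ(zt(av)∣z1:t−1(av)) can be implemented with a recurrent network that summarizes z1:t−1(av) in its hidden state. The LSTM-based parametrization below is only an assumption for the sketch, not necessarily the network used in MDVAE:

```python
import torch
import torch.nn as nn

class AutoregressivePrior(nn.Module):
    """p_theta(z_t | z_{1:t-1}): a Gaussian whose parameters depend on the past latent vectors."""
    def __init__(self, l, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(l, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, l)
        self.logvar = nn.Linear(hidden, l)

    def forward(self, z):  # z: (batch, T, l)
        # Shift by one step so that the prediction at time t only depends on z_{1:t-1}
        z_past = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        h, _ = self.rnn(z_past)
        return self.mu(h), self.logvar(h)  # parameters of p(z_t | z_{1:t-1}) for all t
```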
As in the standard VAE, we need to define an inference model that approximates the posterior:
qϕ(z(av),z(a),z(v),w∣x(a),x(v))≈pθ(z(av),z(a),z(v),w∣x(a),x(v)).
The exact posterior is intractable, but using the chain rule, the Bayesian network of MDVAE, and the D-separation principle (Geiger et al., 1990; Bishop, 2006), we can analyze the structure of the exact posterior, i.e., how the latent variables depend on each other and on the observed variables.
See (Girin et al., 2021) for an extensive discussion of D-separation in the context of DVAEs.
L. Girin et al., Dynamical variational autoencoders: A comprehensive review, Foundations and Trends in Machine Learning, 2021.
D. Geiger et al., Identifying independence in Bayesian networks, Networks, 1990.
C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
The inference model qϕ(z(av),z(a),z(v),w∣x(a),x(v)) decomposes as the product of four terms:
qϕ(w∣x(a),x(v))×qϕ(z(av)∣x(a),x(v),w)×qϕ(z(a)∣x(a),z(av),w)×qϕ(z(v)∣x(v),z(av),w)
To complete the inference model, we also need to look at the posterior temporal dependencies for the dynamical latent variables z(av), z(a) and z(v).
Using again the chain rule, the MDVAE Bayesian network, and the D-separation principle, we have:
qϕ(z(av)∣x(a),x(v),w) = ∏_{t=1}^{T} qϕ(zt(av)∣z1:t−1(av), xt:T(a), xt:T(v), w)
qϕ(z(a)∣x(a),z(av),w) = ∏_{t=1}^{T} qϕ(zt(a)∣z1:t−1(a), xt:T(a), zt(av), w)
qϕ(z(v)∣x(v),z(av),w) = ∏_{t=1}^{T} qϕ(zt(v)∣z1:t−1(v), xt:T(v), zt(av), w)
These distributions and qϕ(w∣x(a),x(v)) are Gaussians parametrized by neural networks.
Drawing the corresponding probabilistic graphical model at inference time would be difficult and not really informative.
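As an illustration of one term, qϕ(w∣x(a),x(v)) can be implemented by encoding each modality, pooling over time (since w is static), and fusing the two streams. The temporal averaging and the fusion by concatenation below are assumptions of the sketch, not necessarily the MDVAE architecture:

```python
import torch
import torch.nn as nn

class StaticAVEncoder(nn.Module):
    """q_phi(w | x^(a), x^(v)): Gaussian posterior over the static audiovisual variable w."""
    def __init__(self, da, dv, lw, hidden=256):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(da, hidden), nn.Tanh())
        self.enc_v = nn.Sequential(nn.Linear(dv, hidden), nn.Tanh())
        self.mu = nn.Linear(2 * hidden, lw)
        self.logvar = nn.Linear(2 * hidden, lw)

    def forward(self, xa, xv):  # xa: (batch, T, da), xv: (batch, T, dv)
        # w is static, so the per-frame features are pooled over the whole sequence
        ha = self.enc_a(xa).mean(dim=1)
        hv = self.enc_v(xv).mean(dim=1)
        h = torch.cat([ha, hv], dim=-1)
        return self.mu(h), self.logvar(h)
```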
As in the standard VAE, learning the MDVAE generative and inference model parameters consists in maximizing the ELBO
L(ϕ,θ) = Eqϕ(z(av),z(a),z(v),w∣x(a),x(v))[ln pθ(x(a),x(v)∣z(av),z(a),z(v),w)] − DKL(qϕ(z(av),z(a),z(v),w∣x(a),x(v)) ∥ pθ(z(av),z(a),z(v),w)), where the first term is the reconstruction accuracy and the second is the regularization.
Developing this expression is a bit more complicated than with the standard VAE, but there is no fundamental difficulty.
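For instance, applying the chain rule of the KL divergence to the factorized inference model gives a sum of per-variable terms (a sketch of the first level of the development, written here in LaTeX; the per-time-step development of each sequential term is omitted):

```latex
D_{\mathrm{KL}}(q_\phi \,\|\, p_\theta)
 = D_{\mathrm{KL}}\!\left(q_\phi(w \mid x^{(a)}, x^{(v)}) \,\|\, p(w)\right)
 + \mathbb{E}_{q_\phi(w \mid x^{(a)}, x^{(v)})}\!\left[
     D_{\mathrm{KL}}\!\left(q_\phi(z^{(av)} \mid x^{(a)}, x^{(v)}, w) \,\|\, p_\theta(z^{(av)})\right)\right]
 + \mathbb{E}_{q_\phi(w, z^{(av)} \mid x^{(a)}, x^{(v)})}\!\left[
     D_{\mathrm{KL}}\!\left(q_\phi(z^{(a)} \mid x^{(a)}, z^{(av)}, w) \,\|\, p_\theta(z^{(a)})\right)
   + D_{\mathrm{KL}}\!\left(q_\phi(z^{(v)} \mid x^{(v)}, z^{(av)}, w) \,\|\, p_\theta(z^{(v)})\right)\right]
```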
Standard VAEs tend to reconstruct blurred outputs, which is particularly true for image data.
The vector-quantized VAE (VQ-VAE) (van den Oord et al., 2017) learns a discrete latent representation to overcome this limitation.
Before being fed to the decoder, the continuous latent vector is quantized using a discrete codebook that is jointly learned with the network architecture.
A. van den Oord et al., Neural discrete representation learning, NeurIPS 2017.
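A minimal sketch of the quantization step (nearest-neighbor lookup in a learned codebook with the straight-through gradient estimator of van den Oord et al.); the codebook size and dimension below are placeholders:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Quantize continuous latent vectors using a learned discrete codebook."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):  # z_e: (batch, code_dim), continuous encoder output
        # Nearest codebook entry for each continuous latent vector
        dist = torch.cdist(z_e, self.codebook.weight)        # (batch, num_codes)
        idx = dist.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Straight-through estimator: gradients bypass the non-differentiable argmin
        z_q_st = z_e + (z_q - z_e).detach()
        # Codebook and commitment losses, added to the reconstruction loss during training
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() + 0.25 * ((z_e - z_q.detach()) ** 2).mean()
        return z_q_st, vq_loss, idx
```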
The first stage consists in learning a VQ-VAE independently for each modality and without temporal modeling.
Rather than learning the MDVAE on the raw audio and visual data, we will use the intermediate compressed representation of the VQ-VAEs before quantization.
In the second stage, we learn the MDVAE model "inside" the frozen VQ-VAE. This 2-stage training improves the reconstruction quality, but it also speeds up the training of the MDVAE model.
The disentanglement between static versus dynamical and modality-specific versus audiovisual latent speech factors occurs during this second training stage.
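Schematically, the second training stage looks like the sketch below; `mdvae.elbo` and the VQ-VAE `encoder` attribute are placeholder names for this illustration:

```python
import torch

def train_mdvae_second_stage(mdvae, vqvae_a, vqvae_v, dataloader, optimizer):
    """Stage 2: freeze the per-modality VQ-VAEs and train MDVAE on their
    pre-quantization features (schematic sketch, placeholder APIs)."""
    for p in list(vqvae_a.parameters()) + list(vqvae_v.parameters()):
        p.requires_grad = False              # VQ-VAEs were trained in stage 1 and stay frozen
    for xa, xv in dataloader:                # synchronized audio / visual frame sequences
        with torch.no_grad():
            fa = vqvae_a.encoder(xa)         # intermediate representation, before quantization
            fv = vqvae_v.encoder(xv)
        loss = -mdvae.elbo(fa, fv)           # maximize the MDVAE ELBO on these features
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```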
We use about 30 hours of audiovisual emotional speech from the MEAD dataset
Preprocessing:
Image credits: (K. Wang et al., 2020)
K. Wang et al., MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation, ECCV, 2020
The first set of experiments consists in studying what characteristics of the audiovisual speech data are encoded in w, z(av), z(v), and z(a).
We will present qualitative results obtained by reconstructing an audiovisual speech sequence using some of the latent variables from another sequence.
We transfer z(av) from the central sequence in red to the surrounding sequences.
Lip and jaw movements are transferred.
We transfer z(v) from the central sequence in red to the surrounding sequences.
Head and eyelid movements are transferred.
We transfer z(av) and z(v) from the central sequence in red to the surrounding sequences. The identity and global emotional state are preserved because w is unaltered.
Interpolation of the static audiovisual latent variable w
pθ(x(v)∣z(av),z(v),w) = ∏_{t=1}^{T} pθ(xt(v)∣zt(av),zt(v),w̃t), where w̃t = αt w + (1−αt) w′ and αt = (T−t)/(T−1).
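In code, the interpolation amounts to a per-frame convex combination of the two static latent vectors; `decode_frame` stands for the visual decoder and is a placeholder for this sketch:

```python
def interpolate_w(w, w_prime, z_av, z_v, decode_frame):
    """Decode a video while linearly morphing the static latent variable from w to w'."""
    T = len(z_av)
    frames = []
    for t in range(T):
        alpha = (T - 1 - t) / (T - 1)        # alpha = 1 at the first frame, 0 at the last
        w_tilde = alpha * w + (1 - alpha) * w_prime
        frames.append(decode_frame(z_av[t], z_v[t], w_tilde))
    return frames
```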
The qualitative analysis confirmed that:
These conclusions are confirmed quantitatively by measuring the impact of swapping latent variables on the action units (not presented today).
The task is to denoise the visual modality.
We compare MDVAE with two unimodal baselines trained on the visual modality only:
Denoising is done by simply encoding and decoding the corrupted (audio)visual speech sequence.
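In other words, no denoising-specific objective is involved; schematically (with a placeholder MDVAE encode/decode API):

```python
def denoise(mdvae, xa_corrupted, xv_corrupted):
    """Denoising = plain encoding/decoding of the corrupted sequence (placeholder API)."""
    w, z_av, z_a, z_v = mdvae.encode(xa_corrupted, xv_corrupted)
    return mdvae.decode(w, z_av, z_a, z_v)   # reconstructed (denoised) audio and video
```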
Y. Li and S. Mandt, "Disentangled sequential autoencoder", ICML 2018.
The performance gap between MDVAE and the unimodal baselines is larger for the corruption of the mouth region.
This is because MDVAE exploits the audio modality.
The qualitative analysis of the latent representations learned by MDVAE suggests that the static audiovisual latent variable w encodes the speaker's emotion.
We propose to use the mean vector of the Gaussian inference model qϕ(w∣x(a),x(v)) as the input of a multinomial logistic regression model trained for emotion classification on the MEAD dataset (8 classes).
The mean vector is simply obtained by a forward pass through the encoder network corresponding to qϕ(w∣x(a),x(v)).
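A sketch of the downstream classifier, assuming the posterior means have already been extracted for a labeled subset of MEAD; scikit-learn's logistic regression is one straightforward choice:

```python
from sklearn.linear_model import LogisticRegression

def fit_emotion_classifier(W_train, y_train, W_test, y_test):
    """W_*: posterior means of q_phi(w | x^(a), x^(v)); y_*: emotion labels (8 classes)."""
    clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression for multi-class y
    clf.fit(W_train, y_train)
    return clf, clf.score(W_test, y_test)    # fitted model and test accuracy
```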
We compare the performance of MDVAE with its unimodal counterparts:
| Method | F1 score (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| Audiovisual transformer (Chumachenko et al., 2022) | 88.65 | 89.37 | 87.96 |
| MDVAE w/o finetuning + multinomial logistic reg. | 82.86 | 81.98 | 83.76 |
| MDVAE w/ finetuning + multinomial logistic reg. | 89.62 | 89.55 | 89.71 |
Credits: (Chumachenko et al., 2022)
EfficientFace is pre-trained on AffectNet, the largest dataset of in-the-wild facial images labeled with emotions.
Finetuning MDVAE is unsupervised.
S.R Livingstone and F.A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English", PloS one, 2018.
K. Chumachenko et al., "Self-attention fusion for audiovisual emotion recognition with incomplete data", arXiv preprint arXiv:2201.11095, 2022.
We proposed the MDVAE to learn structured representations of multimodal and dynamical data.
Why?
Collecting labels for every scenario and task is intractable; we need alternatives to supervised learning.
How?
Deep generative modeling is a powerful unsupervised learning paradigm that can be applied to many different types of data, in particular multimodal and sequential data.
We can learn structured and interpretable representations by modeling probabilistic dependencies between observed and latent variables.
What?
Various applications in audiovisual speech processing, using one single model.
MDVAE effectively combines the audio and visual information in static (w) and dynamical (z(av)) latent variables:
Talking faces can be synthesized by transferring z(av) from one sequence to another, which preserves the speaker's identity, emotional state and visual-only facial movements.
The audio modality provides robustness with respect to corruption of the visual modality on the mouth region.
The static audiovisual latent variable w can be used for emotion recognition with few labeled data, and with much better accuracy compared with unimodal baselines.
MDVAE is also competitive with a state-of-the-art method based on audiovisual transformers.
Extensions and applications of MDVAE include:
M. Sadeghi et al., "Audio-visual speech enhancement using conditional variational auto-encoders", IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020
A. Caillon and P. Esling, "RAVE: A variational autoencoder for fast and high-quality neural audio synthesis", arXiv preprint arXiv:2111.05011, 2021