class: middle, center $$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$ $$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$ $$ \global\def\myS#1{{\color{green}\mathbf{S}\_{#1}}} $$ $$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myu#1{\mathbf{u}\_{#1}} $$ $$ \global\def\mya#1{\mathbf{a}\_{#1}} $$ $$ \global\def\myv#1{\mathbf{v}\_{#1}} $$ $$ \global\def\mythetaz{\theta\_\myz{}} $$ $$ \global\def\mythetax{\theta\_\myx{}} $$ $$ \global\def\mythetas{\theta\_\mys{}} $$ $$ \global\def\mythetaa{\theta\_\mya{}} $$ $$ \global\def\bs#1{{\boldsymbol{#1}}} $$ $$ \global\def\diag{\text{diag}} $$ $$ \global\def\mbf{\mathbf} $$ $$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$ $$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$ $$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\neq{\mathrel{\char`≠}} $$ .vspace[ ] # Learning and controlling the source-filter representation of speech with a VAE .vspace[ ] .center.bold[Simon Leglaive] .small.center[CentraleSupélec, IETR (UMR CNRS 6164), France] .vspace[ ] .center.width-7[![](images/logo_PoS.png)] .grid[ .kol-1-5[ .vspace[ ] .left.width-120[![](images/logo_CS.svg)] ] .kol-3-5[ .small.center[June 20, 2022
Neural Interfacing Lab, Department for Neurosurgery, Maastricht University, The Netherlands ] ] .kol-1-5[ .vspace[ ] .right.width-50[![](images/logo_IETR.png)]] ] --- class: middle ## Joint work with .grid[ .kol-1-4[ .center.width-90.circle[![](images/samir.jpeg)
Samir Sadok
.small[1]
] ] .kol-1-4[ .center.width-90.circle[![](images/laurent.png)
Laurent Girin
.small[2]
] ] .kol-1-4[ .center.width-90.circle[![](images/xavi-square-light.jpg)
Xavier Alameda-Pineda
.small[3]
] ] .kol-1-4[ .center.width-90.circle[![](images/renaud.jpg)
Renaud Séguier
.small[1]
] ] .kol-1-3[ .small.center[
1
CentraleSupélec, IETR UMR CNRS 6164, France] ] .kol-1-3[ .small.center[
2
Univ. Grenoble Alpes, CNRS, Grenoble-INP, GIPSA-lab, France] ] .kol-1-3[ .small.center[
3
Inria, Univ. Grenoble Alpes, CNRS, LJK, France] ] ] .credit[ S. Sadok, S. Leglaive, L. Girin, X. Alameda-Pineda, R. Séguier, [Learning and controlling the source-filter representation of speech with a variational autoencoder](https://arxiv.org/abs/2204.07075), arXiv preprint arXiv:2204.07075, 2022. ] --- class: middle exclude: true ## Outline - Motivation - Identifying and learning latent subspaces encoding the source-filter characteristics of speech - Controlling the source-filter characteristics by moving in the VAE latent space - Experimental results - Conclusion --- class: middle, center # Motivation --- class: middle ## Inverse problems in audio signal processing .small-vspace[ ] .center.width-100[![](images/speech_enhancement_illust.svg)] .center.medium[Source separation, speech enhancement, inpainting, phase retrieval, bandwidth extension, ...] .alert[We need a probabilistic/generative model of the latent signal of interest] --- class: middle ## Non-stationary Gaussian model .tiny[(Ephraim and Malah, 1984)] .grid[ .kol-2-3[ - Let $\myS{} \in \mathbb{C}^{F \times T}$ denote an audio/speech signal in the short-time Fourier transform (STFT) domain, with $$ p(\myS{}) = \prod\_\{t=1\}^T p(\mys{t}) = \prod\_\{t=1\}^T \mathcal{N}\_c\left( \mys{t}; \mathbf{0}, \diag \\{ \mbf{v}\_{s,t} \\} \right). $$ - $\mys{t} \in \mathbb{C}^F$ denotes the complex-valued spectrum of the signal at time frame $t$. - $\mbf{v}\_{s,t} \in \mathbb{R}\_+^{F} $ represents the **expected power spectrum of the signal** at time frame $t$. ] .kol-1-3[ .center.width-80[![](images/speech_spectrogram.svg)] ] ] .alert[The variance is usually constrained to encode specific spectro-temporal characteristics.] .credit[Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE TASSP 1984.] .footnote[This Gaussian model implies that the entries of $|\mys{t}|^{\odot 2}$ follow an exponential or Gamma distribution parametrized by $\mbf{v}\_{s,t}$. ] ---
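class: middle

## The non-stationary Gaussian model in code .small[(illustrative sketch)]

A minimal NumPy sketch (ours, not from the cited works) of the model on the previous slide: frames are drawn from $\mathcal{N}\_c\left( \mathbf{0}, \diag \\{ \mbf{v}\_{s,t} \\} \right)$ for an arbitrary variance $\mbf{v}\_{s,t}$, and we check that the power spectrogram is exponentially distributed with mean $\mbf{v}\_{s,t}$, as stated in the footnote of the previous slide.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 513, 1000                                   # frequency bins, time frames
v = rng.gamma(shape=2.0, scale=1.0, size=(F, T))   # arbitrary expected power spectrum v_{s,t}

# s_t ~ N_c(0, diag(v_{s,t})): independent real and imaginary parts with variance v/2 each.
s = np.sqrt(v / 2) * (rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T)))

# The entries of |s_t|^2 then follow an exponential distribution with mean v_{s,t},
# so the ratio below concentrates around 1.
print(np.mean(np.abs(s) ** 2 / v))
```

---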
## The variance modeling framework .tiny[(Vincent et al., 2010)] From **"explicit" signal models** to **data-driven** approaches: - Structured-sparsity-inducing priors for **modeling** tonal and transient sounds
.small[(Févotte et al., 2007)] - Non-negative matrix factorization (NMF) for **modeling** spectrograms as non-negative linear combinations of **learned** spectral templates
.small[(Benaroya et al., 2003; Févotte et al., 2009; Ozerov et al., 2012)] - (Dynamical) variational autoencoder (VAE) for **learning** (spectro-temporal) spectral structures
.small[(Bando et al., 2018; Leglaive et al., 2018; 2020; Girin et al., 2021)] .credit[ E. Vincent et al., Probabilistic modeling paradigms for audio source separation, In: Machine Audition: Principles, Algorithms and Systems, 2010.
C. Févotte et al., Sparse linear regression with structured priors and application to denoising of musical audio, IEEE TASLP, 2007.
L. Benaroya et al., Non negative sparse representation for Wiener based source separation with a single sensor, IEEE ICASSP 2003.
C. Févotte et al., Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis, Neural Computation, 2009.
A. Ozerov et al., A general flexible framework for the handling of prior information in audio source separation, IEEE/ACM TASLP, 2012.
Y. Bando et al., Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization, IEEE ICASSP 2018.
S. Leglaive et al., A variance modeling framework based on variational autoencoders for speech enhancement, IEEE MLSP 2018.
S. Leglaive et al., A recurrent variational autoencoder for speech enhancement, IEEE ICASSP 2020.
L. Girin et al., Dynamical variational autoencoders: A comprehensive review, Foundations and Trends in Machine Learning, 2021. ] --- class: middle ## NMF-based variance modeling .tiny[(Févotte et al., 2009)] .grid[ .kol-1-2[
$$ p(\mys{t}) = \mathcal{N}\_c\Big( \mys{t}; \mathbf{0}, \diag \\{ \mbf{v}\_{s,t} = \mathbf{W} \myhnmf{t} \\} \Big), $$
- $\mathbf{W} \in \mathbb{R}\_+^{F \times K}$ is a **dictionary** matrix of spectral templates; - $\myhnmf{t} \in \mathbb{R}\_+^{K}$ is the **low-dimensional activation** vector at time frame $t$; - $K$ is the rank of the factorization. ] .kol-1-2[ .right.width-100[![](images/NMF_sound.svg)] ] ] .credit[ C. Févotte et al., Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis, Neural Computation, 2009. ] --- class: middle exclude: true **Maximum likelihood parameters estimation** is equivalent to solving .small[(Févotte et al., 2009)]: $$ \underset{\mathbf{W}\in \mathbb{R}\_+^{F \times K},\, \mathbf{H} \in \mathbb{R}\_+^{K \times T}}{\min} \sum\limits\_{t=1}^T d\_{\text{IS}} \left( |\mys{t}|^{\odot 2} ; \mathbf{W} \myhnmf{t} \right), $$ where $\myhnmf{t} = (\mathbf{H})\_{:,t}$ and $d\_{\text{IS}}$ is the **Itakura-Saito (IS) divergence**. .credit[ C. Févotte et al., [Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis](), Neural Computation, 2009. ] --- class: middle ## VAE-based variance modeling .tiny[(Kingma and Welling, 2014; Bando et al., 2018)] .grid[ .kol-3-5[ $p(\mys{t} \mid \myz{t} ) = \mathcal{N}\_c\Big( \mys{t}; \mathbf{0}, \diag\left\\{ \mbf{v}\_{s,t} = \mathbf{v}\_\theta(\myz{t}) \right\\} \Big),$ - $\myz{t} \in \mathbb{R}^K$ is a **low-dimensional latent vector** with $p(\myz{t}) = \mathcal{N}(\myz{t}; \mathbf{0}, \mathbf{I})$. ] .kol-2-5[ - $\mathbf{v}\_\theta: \mathbb{R}^K \mapsto \mathbb{R}\_+^F$ is a neural network (decoder) of parameters $\theta$. - $p(\mys{t} ) = \displaystyle \int p(\mys{t} \mid \myz{t} ) p(\myz{t}) d\myz{t}$. ] ] .small-nvspace[ ] .center.width-70[![](images/decoder_VAE_sound.svg)] .credit[ D.P. Kingma and M. Welling, Auto-encoding variational Bayes, ICLR 2014.
Y. Bando et al., Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization, IEEE ICASSP 2018.] --- class: middle exclude: true - Similarly to NMF, **maximum likelihood parameters estimation** is equivalent to solving an optimization problem involving the **IS divergence** .small[(Bando et al., 2018)]: $$ \underset{\theta}{\min} \sum\limits\_{t=1}^T \mathbb{E}\_{q\_{\phi}(\myz{t} | \mys{t})} \left[ d\_{\text{IS}} \left( |\mys{t}|^{\odot 2} ; \mathbf{v}\_\theta(\myz{t}) \right) \right], $$ where $q\_{\phi}(\myz{t} | \mys{t})$ is the inference model which approximates the intractable exact posterior, implemented with an encoder neural network .small[(and whose parameters should also be estimated)]. - This problem corresponds to the maximization of the reconstruction accuracy term in the VAE objective function (ELBO). .credit[ Y. Bando et al., [Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization](), IEEE ICASSP 2018. ] --- class: middle ## NMF vs. VAE for variance modeling .width-100[![](images/VAE_vs_NMF_2.svg)] .center[In speech enhancement, the VAE model outperforms the NMF model .small[(Leglaive et al., 2018)].] .alert[ .small-nvspace[ ] However, contrary to NMF, .bold[we cannot directly relate the learned representation to interpretable properties of the signal.]
This is the problem we are going to tackle. .small-nvspace[ ] ] .credit[S. Leglaive et al., A variance modeling framework based on variational autoencoders for speech enhancement, IEEE MLSP 2018.] --- class: middle, center # Analyzing the VAE latent space --- class: middle ## Complete VAE model .grid[ .center.kol-1-5[ .underline[Prior] $ \small p(\myz{}) = \mathcal{N}(\myz{}; \mathbf{0}, \mathbf{I})$ ] .center.kol-2-5[ .underline[Generative model] $ \small p\_\theta(\mys{} | \myz{}) = \mathcal{N}\_c\left( \mys{}; \mathbf{0}, \text{diag}\left\\{ \mathbf{v}\_\theta(\myz{}) \right\\} \right) $ ] .center.kol-2-5[ .underline[Inference model] $ \small q\_\phi(\myz{} | \mys{}) = \mathcal{N}\left( \myz{}; \boldsymbol{\mu}\_\phi(\mys{}), \text{diag}\left\\{ \mathbf{v}\_\phi(\mys{}) \right\\} \right)$ ] ] .center.width-80[![](images/full_VAE_spectro.svg)] .center[We trained **a vanilla VAE** on about 25 hours of unlabeled speech signals at 16 kHz.] --- class: middle ## Analysis-resynthesis by encoding-decoding .grid[ .kol-9-12[ .left.width-90[![](images/analysis_resynthesis.svg)] ] .kol-3-12[ .small[Original signal]
.small[Reconstruction with
] .small[$\hspace{.25cm}$1.] .small[oracle phase]
.small[$\hspace{.25cm}$2.] .small[Griffin-Lim] .tiny[(Griffin and Lim, 1984)]
.small[$\hspace{.25cm}$3.] .small[WaveGlow] .tiny[(Prenger et al., 2019)]
] ] .credit[ D. Griffin and J.S. Lim, Signal estimation from modified short-time Fourier transform, IEEE TASSP, 1984.
R. Prenger et al., Waveglow: A flow-based generative network for speech synthesis, IEEE ICASSP, 2019. ] --- class: center, middle
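## Analysis-resynthesis in code .small[(illustrative sketch)]

A hedged sketch of the encode-decode loop of the previous slide. The encoder/decoder below are untrained stand-ins and `speech.wav` is a placeholder 16 kHz file (both are assumptions, not the authors' code): the power spectra are encoded, the decoder returns the modeled variance, and the waveform is resynthesized from $\sqrt{\mathbf{v}\_\theta(\myz{})}$ with Griffin-Lim.

```python
import numpy as np
import torch
import librosa

F, K = 513, 16   # STFT bins (n_fft = 1024) and latent dimension (illustrative values)

# Untrained stand-ins for the trained VAE encoder and decoder
# (in practice these are the networks trained on ~25 hours of speech).
encoder = torch.nn.Sequential(torch.nn.Linear(F, 128), torch.nn.Tanh(),
                              torch.nn.Linear(128, 2 * K))   # outputs [mu_phi, log v_phi]
decoder = torch.nn.Sequential(torch.nn.Linear(K, 128), torch.nn.Tanh(),
                              torch.nn.Linear(128, F))       # outputs log v_theta(z)

wav, sr = librosa.load("speech.wav", sr=16000)               # any 16 kHz speech file
S = librosa.stft(wav, n_fft=1024, hop_length=256)            # complex STFT, shape (F, T)
power = torch.tensor(np.abs(S).T ** 2, dtype=torch.float32)  # one frame per row

with torch.no_grad():
    mu, _ = encoder(power).chunk(2, dim=-1)                  # posterior mean of q_phi(z | s)
    v = decoder(mu).exp()                                    # modeled power spectrum v_theta(z)

# Resynthesis from the modeled magnitude; the missing phase is estimated with Griffin-Lim.
wav_rec = librosa.griffinlim(np.sqrt(v.numpy().T), n_iter=64,
                             hop_length=256, win_length=1024)
```

--- class: center, middle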
.alert[Understanding the structure of the latent space using natural speech signals is difficult, so let's "open the black box" with .bold[simpler speech signals].]
.center.width-80[![](images/latent_representation_2.svg)] --- class: middle ## Source-filter model of .small[(voiced)] speech production .small-vspace[ ] .center.width-100[![](images/source_filter.svg)] .small-vspace[ ] .alert[ The source-filter model proposed by (Fant, 1970) considers that the production of speech results from the interaction of a .bold[source signal] with a .bold[linear filter]. ] .small-vspace[ ] .grid[ .kol-1-2[ .small[ - In voiced speech, the source originates from the vibration of the **vocal folds**. This vibration is characterized by the **fundamental frequency**, loosely referred to as the **pitch**. ] ] .kol-1-2[ .small[ - The source signal is modified by the **vocal tract**, which is assumed to act as a **linear filter**. The cavities of the vocal tract give rise to **resonances**, which are called the **formants**. ] ] ] .credit[G. Fant, Acoustic theory of speech production (No. 2), Walter de Gruyter, 1970.] --- class: middle .grid[ .kol-2-5[ .center.width-95[![](images/aeiou_indiv_spec.svg)] ] .kol-3-5[ - By moving the speech articulators (tongue, lips, jaw), humans modify the shape of their vocal tract, which results in a change of the formant frequencies.
.center[
]
- The source-filter model tells us that **we can control the source ($f_0$) independently of the filter** (the formants) .small[(Fant, 1970)]. - The first formant frequencies $\\{f\_i\\}\_{i \ge 1}$ can also be controlled independently of each other
.small[(MacDonald et al., 2011)].
.credit[E. N. MacDonald, Probing the independence of formant control using altered auditory feedback, JASA, 2011.] ] ] --- class: middle ## Automatically-labeled artificial speech trajectories .center.width-100[![](images/soundgen_trajectories_hor.svg)] - We generate datasets $\\{\mathcal{D}\_i\\}\_{i=0}^3$ containing a **few seconds of vowel-like speech power spectra** where only one factor $f\_i$ varies, all other factors $\\{f\_j\\}\_{j \neq i}$, being arbitrarily fixed. - We used Soundgen .small[(Anikin, 2019)], an artificial speech synthesizer based on the source-filter model. - All examples in $\mathcal{D}\_i$ are **automatically-labeled** with $f_i$ (this is an input of soundgen). .alert[We are going to investigate the VAE latent representation associated with these trajectories.] .credit[A. Anikin, Soundgen: An open-source tool for synthesizing nonverbal vocalizations, Behavior Research Methods, 2019.] --- class: middle ## Aggregated posterior - Let $ \hat{p}^{(i)}(\mys{}) = \frac{1}{\\#\mathcal{D}\_i} \sum\limits\_{\mys{n} \in \mathcal{D}\_i} \delta(\mys{} - \mys{n})$ denote the empirical distribution associated with $\mathcal{D}\_i$. - The **aggregated posterior** is a marginal distribution over $\myz{}$ defined by "aggregating, or averaging, the VAE approximate posterior $q\_\phi(\myz{} | \mys{})$ over $\hat{p}^{(i)}(\mys{})$": $$ \hat{q}\_\phi^{(i)}(\myz{}) = \mathbb{E}\_{{p}^{(i)}(\mys{})}[q\_\phi(\myz{} | \mys{})] = \int q\_\phi(\myz{} | \mys{}) \hat{p}^{(i)}(\mys{}) d\mys{} = \frac{1}{\\#\mathcal{D}\_i} \sum\limits\_{\mys{n} \in \mathcal{D}\_i} q\_\phi(\myz{} | \mys{n}). $$ - For instance, we have $$\boldsymbol{\mu}\_{\phi}(\mathcal{D}\_i) = \mathbb{E}\_{\hat{q}\_\phi^{(i)}(\myz{})}[\myz{}] = \frac{1}{\\# \mathcal{D}\_i}\sum\limits\_{\mys{n} \in \mathcal{D}\_i} \mathbb{E}\_{q\_\phi(\myz{}|\mys{n})}[ \myz{} ] = \frac{1}{\\# \mathcal{D}\_i}\sum\limits\_{\mys{n} \in \mathcal{D}\_i} \boldsymbol{\mu}\_{\phi}(\mys{n}). $$ - In the following, without loss of generality, we assume centered latent vectors: $$ \myz{} \leftarrow \myz{} - \boldsymbol{\mu}\_{\phi}(\mathcal{D}\_i). $$ --- exclude: true ## Latent subspace learning .center.width-100[![](images/sf_vae_project_traj.svg)] - Because one single factor $f_i$ varies in $\mathcal{D}\_i$, we expect the corresponding latent vectors .small[(obtained with the VAE encoder)] to live in a **lower-dimensional manifold of the original latent space $\mathbb{R}^K$**. - We assume this manifold to be a **linear subspace** characterized by its semi-orthogonal basis matrix $\mathbf{U}\_i \in \mathbb{R}^{K \times M\_i}, M\_i < K$, computed by solving $$ \underset{\mathbf{U} \in \mathbb{R}^{K \times M\_i}}{\min} \frac{1}{\\#\mathcal{D}\_i} \sum\_{\mys{n} \in \mathcal{D}\_i} \mathbb{E}\_{q\_\phi(\myz{}|\mys{n})}\left[ \parallel \myz{} - \mathbf{U}\mathbf{U}^\top\myz{} \parallel\_2^2 \right], \qquad s.t.\, \mathbf{U}^\top \mathbf{U} = \mathbf{I}. $$ .footnote-b[ Without loss of generality, we assume that the latent vector $\myz{}$ has been centered by subtracting $\frac{1}{\\# \mathcal{D}\_i}\sum\limits\_{\mys{n} \in \mathcal{D}\_i} \mathbb{E}\_{q\_\phi(\myz{}|\mys{n})}[ \myz{} ] = \frac{1}{\\# \mathcal{D}\_i}\sum\limits\_{\mys{n} \in \mathcal{D}\_i} \boldsymbol{\mu}\_{\phi}(\mys{n})$. 
] --- class: middle ## Source-filter latent subspace learning - .bold[Intuition]: Because one single factor $f_i$ varies in $\mathcal{D}\_i$, we expect the corresponding latent vectors to live in a **lower-dimensional manifold of the original latent space $\mathbb{R}^K$**. .small-vspace[ ] .center.width-80[![](images/sf_vae_project_traj.svg)] .small-vspace[ ] - We assume this manifold to be a **linear subspace** characterized by its semi-orthogonal basis matrix $\mathbf{U}\_i \in \mathbb{R}^{K \times M\_i}, M\_i < K$, computed by solving $$ \underset{\mathbf{U} \in \mathbb{R}^{K \times M\_i}}{\min}\,\, \mathbb{E}\_{\hat{q}\_\phi^{(i)}(\myz{})}\left[ \parallel \myz{} - \mathbf{U}\mathbf{U}^\top\myz{} \parallel\_2^2 \right], \qquad s.t.\,\, \mathbf{U}^\top \mathbf{U} = \mathbf{I}. $$ - As in principal component analysis (PCA), a closed-form solution is obtained by an eigendecomposition of a symmetric positive semi-definite matrix. --- class: middle ## Trajectories in the learned latent subspaces - For each element $\mys{} \in \mathcal{D}\_i$, we plot $\mathbb{E}\_{q\_\phi(\myz{}|\mys{})}[ \mathbf{U}\_i^\top\myz{} ] = \mathbf{U}\_i^\top \boldsymbol{\mu}\_{\phi}(\mys{}) \in \mathbb{R}^{M\_i}$ $\footnotesize (M\_i = 3)$. .grid[ .kol-1-4[ .width-115[![](images/pitch_viz.gif)] ] .kol-1-4[ .width-100[![](images/f1_viz.gif)] ] .kol-1-4[ .width-100[![](images/f2_viz.gif)] ] .kol-1-4[ .width-100[![](images/f3_viz.gif)] ] ] .nvspace[ ] .grid[ .kol-1-4[ .caption[$f\_0$ latent trajectory] ] .kol-1-4[ .caption[$f\_1$ latent trajectory] ] .kol-1-4[ .caption[$f\_2$ latent trajectory] ] .kol-1-4[ .caption[$f\_3$ latent trajectory] ] ] - Two speech spectra with close values for the factor $f\_i$ have latent representations that are also close in the learned subspaces. .alert[The latent representation learned by the VAE preserves the notion of proximity in terms of fundamental and formant frequencies.] --- class: middle ## Disentanglement analysis .alert[The proposed approach offers a natural and straightforward way to .bold[quantitatively measure] if the VAE managed to learn a .bold[disentangled representation] of the source-filter characteristics of speech.] - By looking at the eigenvalues associated with the columns of $\mathbf{U}\_i \in \mathbb{R}^{K \times M\_i}$, we can measure the **amount of variance that is retained by the projection** $\mathbf{U}\_i \mathbf{U}\_i^\top$. - If a small number of components $M\_i$ represents most of the variance, it indicates that **only a few intrinsic dimensions of the latent space are dedicated to the factor** $f\_i$. - If for two different factors $f\_i$ and $f\_j$, the columns of $\mathbf{U}\_i$ are orthogonal to those of $\mathbf{U}\_j$, the two factors are encoded in **orthogonal subspaces and therefore disentangled** .small[(Higgins et al., 2018)]. .credit[I. Higgins et al., Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.] --- class: middle .grid[ .kol-1-2[ .center.width-100[![](images/correlation.svg)] ] .kol-1-2[ - We choose $M\_i$ so as to retain 80% of the data variance after projection onto the latent subspaces. It gives $M\_0 = 4, M\_1 = 1, M\_2 = 3, M\_3 = 3$. - We compute the dot product between all pairs of unit vectors in the matrices
$\\{\mathbf{U}\_i \in \mathbb{R}^{K \times M\_i}\\}\_{i=0}^3$. - Except for a correlation value of $−0.21$ between $f\_1$ and the 1st component of $f\_2$, all values are below $0.13$ .small[(in absolute value)]. ] ] .alert[This analysis confirms the orthogonality of the source-filter latent subspaces and the disentanglement of the corresponding factors in the VAE latent space. ] --- class: middle ## Conclusion - Using only a **few seconds of artificially generated speech**, we put in evidence that **a VAE trained in an unsupervised manner** learns a latent representation that is consistent with the **source-filter model** of speech production. Indeed, the fundamental frequency and first formant frequencies are encoded in **orthogonal subspaces** of the original VAE latent space. - It suggests that we could **manipulate one factor in its latent subspace without affecting the others**, similarly as how humans produce speech according to the source-filter model. --- class: middle, center # Moving in the source-filter latent subspaces --- class: middle ## Disentangled speech manipulation in the VAE latent space .center.width-100[![](images/moving_4.svg)] We can transform a speech spectrum by analyzing it with the VAE encoder, applying the following **affine transformation**, and resynthesizing with the VAE decoder: $$ {\color{magenta}\tilde{\mathbf{z}}} = {\color{blue}\mathbf{z}} - \mathbf{U}\_i \mathbf{U}\_i^\top {\color{blue}\mathbf{z}} + \mathbf{U}\_i {\color{magenta}\mathbf{g}\_{\eta\_i}(y)}. $$ .alert[This transformation allows us to .bold[move only in the subspace associated with] $f\_i$, leaving other source-filter factors unchanged thanks to the orthogonality property.] --- class: middle exclude: true ## Speech manipulations in the source-filter latent subspaces Given a source latent vector $\myz{}$, drawn from $p(\myz{})$ for generation or from $q\_\phi(\myz{} | \mys{})$ for transformation, and a target value $y$ for the factor $f\_i$, we can apply the following affine transformation: $$ \myztilde{} = \myz{} \underbrace{ \,-\, \mathbf{U}\_i \mathbf{U}\_i^\top \myz{}}\_{(i)} \, \underbrace{\,+\, \mathbf{U}\_i \mathbf{g}\_{\eta\_i}(y)}\_{(ii)}, $$ which consists in $\hspace{.35cm} (i)$ substracting the projection of $\myz{}$ onto the subspace associated with $f\_i$; $\hspace{.35cm} (ii)$ adding the target component provided by the regression model $\mathbf{g}\_{\eta\_i}(y) \in \mathbb{R}^{M\_i}$. .alert[This transformation allows us to .bold[move only in the subspace associated with] $f\_i$, leaving other source-filter factors unchanged thanks to the orthogonality property.] .footnote[We do not need the value of the factor $f\_i$ associated with the source vector $\myz{}$, only the one associated with the target vector $\myztilde{}$.] 
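--- class: middle

## Computing the subspace basis $\mathbf{U}\_i$ .small[(illustrative sketch)]

A NumPy sketch (ours, not the authors' code) of the subspace-learning step introduced earlier, assuming arrays `mu` and `var` of posterior means and variances returned by the encoder on $\mathcal{D}\_i$: the basis $\mathbf{U}\_i$ is given by the leading eigenvectors of the second-order moment matrix of the aggregated posterior $\hat{q}\_\phi^{(i)}(\myz{})$.

```python
import numpy as np

def learn_subspace(mu, var, m):
    """Semi-orthogonal basis U_i of the latent subspace for one factor f_i.

    mu  : (N, K) posterior means mu_phi(s_n) for the examples of D_i
    var : (N, K) posterior variances v_phi(s_n)
    m   : subspace dimension M_i
    """
    mu = mu - mu.mean(axis=0)                        # center the latent vectors
    # Second-order moment of the aggregated posterior q_hat^(i)(z)
    C = (mu.T @ mu) / len(mu) + np.diag(var.mean(axis=0))
    eigval, eigvec = np.linalg.eigh(C)               # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:m]             # keep the m largest
    retained = eigval[order].sum() / eigval.sum()    # fraction of variance kept by U U^T
    return eigvec[:, order], retained                # (K, m) basis, retained variance
```

Computing $|\mathbf{U}\_i^\top \mathbf{U}\_j|$ for two different factors then gives the correlation values used in the disentanglement analysis.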
--- class: middle ## Weakly-supervised piecewise linear regression learning .center.width-100[![](images/regression.svg)] Making now use of the labels in $\mathcal{D}\_i$, we learn a piecewise-linear regression model $\mathbf{g}\_{\eta\_i} : \mathbb{R}\_+ \mapsto \mathbb{R}^{M\_i}$ from the value $y \in \mathbb{R}\_+$ of the factor $f\_i$ to the data coordinates $\mathbf{U}\_i^\top \myz{}$ in the latent subspace: $$ \eta\_i = \underset{\eta}{\argmin}\,\, \mathbb{E}\_{\hat{q}\_\phi^{(i)}(\myz{}, y)}\Big[ \lVert \mathbf{g}\_{\eta}(y) - \mathbf{U}\_i^\top \myz{} \rVert\_2^2 \Big], $$ .medium[ where $\hat{q}\_\phi^{(i)}(\myz{}, y) = \displaystyle \int q\_\phi(\myz{} | \mys{}) \hat{p}^{(i)}(\mys{}, y) d\mys{}$ and $\hat{p}^{(i)}(\mys{}, y)$ is the empirical distribution of $\mathcal{D}\_i = \\{ (\mys{n}, y\_n) \\}\_n$. ] --- class: middle, center # Qualitative results --- class: middle .center.small[Fundamental and formant frequency manipulation of the vowel /a/ uttered by a female speaker] .center.width-80[![](images/w04ah.svg)] .center.width-60[![](images/illust_transform.svg)] --- class: middle .small-nvspace[ ] .center.small[Spectrogram generated from input trajectories of the fundamental and formant frequencies] .center.width-80[![](images/gen_spec.svg)] .grid[ .kol-2-3[ .center.width-90[![](images/illust_gen_2.svg)] ] .kol-1-3[ .center[
] ] ] .small-nvspace[ ] .alert[We have defined a deep generative model of speech spectrograms that is .bold[conditioned on interpretable trajectories] of the fundamental and formant frequencies.] --- class: middle, center .small-nvspace[ ] .grid[ .kol-1-3[ .center.width-90[![](images/pitch_encdec.svg)]
] .kol-1-3[ .center.width-90[![](images/pitch_trans_whisper.svg)]
] .kol-1-3[ .center.width-85[![](images/pitch_trans_gaussian.svg)]
] ] .grid[ .kol-1-3[ .center.width-90[![](images/pitch_trans_increase.svg)]
] .kol-1-3[ .center.width-90[![](images/pitch_trans_decrease.svg)]
] .kol-1-3[ .center.width-85[![](images/pitch_trans_sine.svg)]
] ] .caption[(top left) reconstructed w/o modification, (top middle) whispered spectrogram obtained with ${\tilde{\mathbf{z}}} = {\mathbf{z}} - \mathbf{U}\_0 \mathbf{U}\_0^\top {\mathbf{z}}$, (other) various $f_0$ transformations. Waveforms are obtained from the spectrograms using WaveGlow .small[(Prenger et al., 2019)]. ] --- class: middle, center
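## Moving in a source-filter subspace .small[(illustrative sketch)]

The transformations shown on the previous slide reduce to a few lines of NumPy once a basis $\mathbf{U}\_i$ and a regression model are available. The sketch below uses random stand-ins for the learned quantities (basis, latent vector, and regression targets are all dummy data), so it only illustrates the mechanics of $\myztilde{} = \myz{} - \mathbf{U}\_i \mathbf{U}\_i^\top \myz{} + \mathbf{U}\_i \mathbf{g}\_{\eta\_i}(y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 16, 4                                        # latent and subspace dimensions (illustrative)
U0, _ = np.linalg.qr(rng.standard_normal((K, M)))   # random stand-in for the learned f0 basis U_0
z = rng.standard_normal(K)                          # stand-in for an encoded (centered) frame

# Stand-in for the piecewise-linear regression g_eta_0: linear interpolation between
# the mean subspace coordinates observed for a grid of f0 values (dummy numbers here).
y_grid = np.linspace(100.0, 300.0, 21)              # f0 values in Hz
coords = rng.standard_normal((len(y_grid), M))      # would be U_0^T z averaged per f0 value
g = lambda y: np.array([np.interp(y, y_grid, coords[:, m]) for m in range(M)])

def transform(z, U, target=None):
    """Remove the factor's component from z; optionally add the target one."""
    z_new = z - U @ (U.T @ z)                       # "whispering" effect when target is None
    return z_new if target is None else z_new + U @ target

z_whisper = transform(z, U0)                        # f0 component removed
z_160hz = transform(z, U0, target=g(160.0))         # f0 moved to 160 Hz, formants untouched
```

With the actual learned quantities, decoding such vectors with the VAE decoder produces spectrograms like those shown on the previous slide.

--- class: middle, center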
# Quantitative results We refer you to the paper, or you can ask for the backup slides.
.alert.left[ In summary, a quantitative analysis using datasets of English vowels and speech utterances confirms that - source-filter factors can be manipulated accurately, especially $f\_0$; - varying one factor (e.g., $f\_0$) has little effect on the others (e.g., the formants). ] --- class: middle, center # Conclusion --- class: middle .width-100[![](images/sf_vae_overview.svg)] In this work, given a **VAE** trained on **hours of unlabeled speech data** and a **few seconds of automatically-labeled data** generated with an artificial speech synthesizer, - we put in evidence that the latent representation learned by a VAE is consistent with the **source-filter model of speech production** .small[(Fant, 1970)]; - we proposed a **weakly-supervised** method to learn how to **move in the VAE latent space**, so as to perform **disentangled speech manipulations**. --- class: middle ## Future work - Take the non-linear nature of the manifolds into account; - Address the phase reconstruction issue, with better neural vocoders or working directly in the time domain .small[(Caillon and Esling, 2021)]; - Extend the approach to multi-microphone and reverberant signals, to learn both spectro-temporal and spatial representations of speech; - Exploit the invariance of the projected representations to perform analysis (e.g., $f_0$ estimation); - Leverage the proposed conditional deep generative speech model to guide VAE-based speech enhancement methods with the pitch information. .credit[ A. Caillon and P. Esling, RAVE: A variational autoencoder for fast and high-quality neural audio synthesis, arXiv preprint arXiv:2111.05011, 2021. ] --- class: middle, center
# Thank you
.alert[Code and audio examples available online
.small[https://samsad35.github.io/site-sfvae/]] --- class: middle, center # Quantitative results --- class: middle .grid[ .kol-3-4[ **Dataset** 12 English vowels $\times$ 50 male and 50 female speakers, labeled with fundamental and formant frequencies. **Task** We transform each vowel by varying one single factor $f\_i$ at a time. ] .kol-1-4[
| | Min (Hz) | Max (Hz) | Step (Hz) |
|:---:|:---:|:---:|:---:|
| $f_0$ | 100 | 300 | 1 |
| $f_1$ | 300 | 900 | 10 |
| $f_2$ | 1100 | 2700 | 20 |
| $f_3$ | 2200 | 3200 | 20 |
] ] .small-nvspace[ ] **Metrics** - .bold[Accuracy and disentanglement] (lower is better) We compute the relative absolute error $\delta f\_i = | \hat{y} - y |/y \times 100 \%, $ where $y$ is the target value for $f\_i$ and $\hat{y}$ its estimation on the output transformed signal. - .bold[Speech naturalness] (higher is better) We use NISQA
(Mittag and Möller, 2020), an objective metric developed in the context of speech transformation algorithms to be highly correlated with subjective mean opinion scores. .credit[G. Mittag and S. Möller, Deep learning based assessment of synthetic speech naturalness, Interspeech, 2020.] --- class: middle .small-nvspace[ ] **Methods** - .bold[TD-PSOLA] .small[(Moulines and Charpentier, 1990)] performs $f\_0$ modification through a decomposition of the signal into pitch-synchronized overlapping frames. - .bold[WORLD] .small[(Morise et al., 2016)] is a vocoder also used for $f\_0$ modification. It decomposes the signal into three components characterizing $f\_0$, the aperiodicity, and the spectral envelope. - The .bold[VAE baseline] .small[(Hsu et al., 2017)] consists in applying translations directly in the VAE latent space: $$ \myztilde{} = \myz{} - \boldsymbol{\mu}\_{\text{src}} + \boldsymbol{\mu}\_{\text{trgt}}, $$ where $\boldsymbol{\mu}\_{\text{src}}$ and $\boldsymbol{\mu}\_{\text{trgt}}$ are predefined latent attribute representations associated with the source and target values of the factor to be modified, respectively. Computing $\boldsymbol{\mu}\_{\text{src}}$ requires analyzing the input speech signal (e.g., to estimate $f\_0$), unlike the proposed method, which only relies on a projection of $\myz{}$. .credit[ E. Moulines and F. Charpentier, Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, Speech Communication, 1990.
M. Morise et al., World: a vocoder-based high-quality speech synthesis system for real-time applications, IEICE TIS, 2016.
W.-N. Hsu et al., Learning latent representations for speech generation and transformation, Interspeech, 2017. ] --- class: middle .center.width-95[![](images/results_f0.svg)] .medium[ - The proposed method always outperforms the baseline. - $\delta f\_0$ is lower than 1% for the proposed method $\rightarrow$ very good precision in $f\_0$ manipulation. - WORLD obtains the best performance in terms of disentanglement $(\delta f\_i, i > 0)$ because the source and filter contributions are decoupled in the architecture of the vocoder. - Traditional signal processing methods obtain the best performance in terms of speech naturalness (NISQA), probably because they directly operate in the time domain (no phase reconstruction issue). ] --- class: middle .grid[ .kol-1-3[ .center.width-100[![](images/results_f1.svg)] ] .kol-1-3[ .center.width-100[![](images/results_f2.svg)] ] .kol-1-3[ .center.width-95[![](images/results_f3.svg)] ] ] - In terms of accuracy, the proposed method always outperforms the baseline (by 7%, 5% and 5% for $f\_1$, $f\_2$ and $f\_3$, respectively). - In terms of disentanglement, the pitch is much less affected by formant manipulations with the proposed method. --- class: middle - A similar analysis on a dataset of short speech utterances (TIMIT) leads to similar conclusions. - This dataset is **phonemically richer** than the isolated vowels dataset. - However, it is not labeled with the fundamental and formant frequencies, so the ground truth required to measure disentanglement is estimated on the original speech signals, which makes the evaluation **less reliable**. --- class: middle - The objective of this study is not to compete with traditional signal processing methods such as TD-PSOLA and WORLD for pitch shifting. - It is rather to advance the understanding of deep generative modeling of speech signals and to compare honestly with highly-specialized traditional systems. - TD-PSOLA and WORLD exploit signal models that are specifically designed for the task at hand, while the proposed method is data-driven and the exact same methodology applies for modifying $f\_0$ or the formant frequencies. - TD-PSOLA is still a strong baseline that is difficult to outperform with deep learning techniques, see, e.g., controllable LPCNet .small[(Morrison et al., 2020)]. .credit[ M. Morrison et al., Controllable Neural Prosody Synthesis, Interspeech, 2020. ] --- class: middle, center # VAE model training --- ## Parameter estimation - Direct maximization of the marginal likelihood is intractable due to non-linearities. - For any distribution $q\_\phi(\mathbf{z} | \mathbf{x})$, we have .small[(Neal and Hinton, 1999; Jordan et al., 1999)] $$ \ln p(\mathbf{x}; \theta) = \mathcal{L}(\mathbf{x}; \phi, \theta) + D\_{\text{KL}}(q\_\phi(\mathbf{z} | \mathbf{x}) \parallel p\_\theta(\mathbf{z} | \mathbf{x})),$$ where $\mathcal{L}(\mathbf{x}; \phi, \theta)$ is the **evidence lower bound** (ELBO) defined by $$ \mathcal{L}(\mathbf{x}; \phi, \theta) = \mathbb{E}\_{q\_\phi(\mathbf{z} | \mathbf{x})} [\ln p(\mathbf{x}, \mathbf{z}; \theta) - \ln q\_\phi(\mathbf{z} | \mathbf{x})]. $$ .credit[ R.M. Neal and G.E. Hinton, [A view of the EM algorithm that justifies incremental, sparse, and other variants](http://www.cs.toronto.edu/~radford/ftp/emk.pdf), in M. I. Jordan (Ed.), .italic[Learning in graphical models], 1999.
M.I. Jordan et al., [An introduction to variational methods for graphical models](https://people.eecs.berkeley.edu/~jordan/papers/variational-intro.pdf), Machine Learning, 1999.] -- count: false .left-column.center[
.bold[Problem #1] $$ \underset{\theta}{\max}\, \mathcal{L}(\mathbf{x}; \phi, \theta),$$ where $\mathcal{L}(\mathbf{x}; \phi, \theta) \le \ln p(\mathbf{x}; \theta)$ ] -- count: false .right-column.center[
.bold[Problem #2] $$ \underset{\phi}{\max}\, \mathcal{L}(\mathbf{x}; \phi, \theta) $$ $$ \Leftrightarrow \underset{\phi}{\min}\, D\_{\text{KL}}(q\_\phi(\mathbf{z} | \mathbf{x}) \parallel p\_\theta(\mathbf{z} | \mathbf{x}))$$ ] .reset-column[ ] --- .grid[ .kol-6-10[ ## ELBO The ELBO is now fully defined: $$ \begin{aligned} \mathcal{L}(\mathbf{x}; \phi, \theta) &= \mathbb{E}\_{q\_\phi(\mathbf{z} | \mathbf{x})} [\ln p(\mathbf{x}, \mathbf{z}; \theta) - \ln q\_\phi(\mathbf{z} | \mathbf{x})] \\\\ &= \underbrace{\mathbb{E}\_{q\_\phi(\mathbf{z} | \mathbf{x})} [\ln p\_\theta(\mathbf{x} | \mathbf{z})]}\_{\text{reconstruction accuracy}} - \underbrace{D\_{\text{KL}}(q\_\phi(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z}))}\_{\text{regularization}}. \end{aligned} $$ - prior: $ \hspace{2cm} p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$ - likelihood model: $ \hspace{.45cm} p\_\theta(\mathbf{x} | \mathbf{z} ) = \mathcal{N}\left( \mathbf{x}; \boldsymbol{\mu}\_\theta(\mathbf{z}), \diag\left\\{ \mathbf{v}\_\theta(\mathbf{z}) \right\\} \right)$ - inference model: $ \hspace{.42cm} q\_\phi(\mathbf{z} | \mathbf{x}) = \mathcal{N}\left( \mathbf{z}; \boldsymbol{\mu}\_\phi(\mathbf{x}), \diag\left\\{ \mathbf{v}\_\phi(\mathbf{x}) \right\\} \right)$ ] .kol-3-10[ .center.width-90[![](images/VAE_complete.svg)] ] ] .small-nvspace[ ] -- count: false The reconstruction accuracy term is approximated with a Monte Carlo estimate: $$\mathbb{E}\_{q\_\phi(\mathbf{z} | \mathbf{x})} [\ln p\_\theta(\mathbf{x} | \mathbf{z})] \approx \frac{1}{R} \sum\_{r=1}^R \ln p\_\theta(\mathbf{x} | \tilde{\mathbf{z}}\_r ), \qquad \text{with} \quad \tilde{\mathbf{z}}\_r \sim q\_\phi(\mathbf{z} | \mathbf{x}). $$ --- .grid[ .kol-6-10[ ## ELBO The ELBO is now fully defined: $$ \begin{aligned} \mathcal{L}(\mathbf{x}; \phi, \theta) &= \mathbb{E}\_{q\_\phi(\mathbf{z} | \mathbf{x})} [\ln p(\mathbf{x}, \mathbf{z}; \theta) - \ln q\_\phi(\mathbf{z} | \mathbf{x})] \\\\ &= \underbrace{\mathbb{E}\_{q\_\phi(\mathbf{z} | \mathbf{x})} [\ln p\_\theta(\mathbf{x} | \mathbf{z})]}\_{\text{reconstruction accuracy}} - \underbrace{D\_{\text{KL}}(q\_\phi(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z}))}\_{\text{regularization}}. \end{aligned} $$ - prior: $ \hspace{2cm} p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$ - likelihood model: $ \hspace{.45cm} p\_\theta(\mathbf{x} | \mathbf{z} ) = \mathcal{N}\left( \mathbf{x}; \boldsymbol{\mu}\_\theta(\mathbf{z}), \diag\left\\{ \mathbf{v}\_\theta(\mathbf{z}) \right\\} \right)$ - inference model: $ \hspace{.42cm} q\_\phi(\mathbf{z} | \mathbf{x}) = \mathcal{N}\left( \mathbf{z}; \boldsymbol{\mu}\_\phi(\mathbf{x}), \diag\left\\{ \mathbf{v}\_\phi(\mathbf{x}) \right\\} \right)$ ] .kol-3-10[ .center.width-90[![](images/VAE_complete.svg)] ] ] .small-nvspace[ ] The reconstruction accuracy term is approximated with a Monte Carlo estimate, using the so-called **reparametrization trick**, to make the (sampled version of the) ELBO derivable w.r.t. $\phi$: $$\mathbb{E}\_{q\_\phi(\mathbf{z} | \mathbf{x})} [\ln p\_\theta(\mathbf{x} | \mathbf{z})] \approx \frac{1}{R} \sum\_{r=1}^R \ln p\_\theta(\mathbf{x} | \tilde{\mathbf{z}}\_r ), \qquad \begin{cases} \boldsymbol{\epsilon}\_r &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\\\ \tilde{\mathbf{z}}\_r &= \boldsymbol{\mu}\_\phi(\mathbf{x}) + \diag\left\\{ \mathbf{v}\_\phi(\mathbf{x}) \right\\}^{\frac{1}{2}} \boldsymbol{\epsilon}\_r \end{cases}. 
$$ --- ## Training procedure ** Step 1: Pick an example in the dataset ** $$ \begin{aligned} \mathcal{L}(\mathbf{x}; \phi, \theta) = \ln p\_\theta({\color{brown}\mathbf{x}} | \tilde{\mathbf{z}}) - D\_{\text{KL}}(q\_\phi(\mathbf{z} | {\color{brown}\mathbf{x}}) \parallel {\color{green}p(\mathbf{z})}) \end{aligned} $$ .center.width-80[![](images/training_1.png)] --- ## Training procedure ** Step 2: Map through the encoder ** $$ \begin{aligned} \mathcal{L}(\mathbf{x}; \phi, \theta) = \ln p\_\theta({\color{green}\mathbf{x}} | \tilde{\mathbf{z}}) - D\_{\text{KL}}({\color{brown} q\_\phi(\mathbf{z} | } {\color{green}\mathbf{x} } {\color{brown})} \parallel {\color{green}p(\mathbf{z})}) \end{aligned} $$ .center.width-80[![](images/training_2.png)] --- ## Training procedure ** Step 3: Sample from the inference model ** $$ \begin{aligned} \mathcal{L}(\mathbf{x}; \phi, \theta) = \ln p\_\theta({\color{green}\mathbf{x}} | {\color{brown}\tilde{\mathbf{z}}}) -{\color{green} D\_{\text{KL}}(q\_\phi(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z}))} \end{aligned} $$ .center.width-80[![](images/training_3.png)] --- ## Training procedure ** Step 4: Map through the decoder ** $$ \begin{aligned} \mathcal{L}(\mathbf{x}; \phi, \theta) = {\color{green}\ln p\_\theta(\mathbf{x} | \tilde{\mathbf{z}})} -{\color{green} D\_{\text{KL}}(q\_\phi(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z}))} \end{aligned} $$ .center.width-80[![](images/training_4.png)] --- ## Training procedure ** Step 5: Gradient ascent step on the ELBO ** $$ \begin{aligned} \mathcal{L}(\mathbf{x}; \phi, \theta) = \ln p\_\theta(\mathbf{x} | \tilde{\mathbf{z}}) - D\_{\text{KL}}(q\_\phi(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z})) \end{aligned} $$ .center.width-80[![](images/training_5.png)] Encoder-decoder shape, which correspond to an inference-generation process. .footnote[In practice, one averages over mini batches before doing the backpropagation.] --- class: middle ## At test time .center.width-100[![](images/testing.png)] - The encoder was primarily introduced in order to estimate the parameters of the decoder. - We do not need the encoder for generating new samples. - But it is useful if we need to do inference.
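--- class: middle

## One VAE training step in code .small[(illustrative sketch)]

A hedged PyTorch sketch of the training procedure summarized in the previous slides (standard VAE objective; not the authors' implementation). The reconstruction term uses the complex Gaussian speech likelihood of the main part of the talk, which reduces, up to an additive constant, to the Itakura-Saito divergence between $|\mys{}|^{\odot 2}$ and $\mathbf{v}\_\theta(\myz{})$.

```python
import torch

def vae_training_step(power, encoder, decoder, optimizer):
    """One gradient step on the negative ELBO for a batch of power spectra.

    power   : (B, F) tensor of observed |s_t|^2 frames
    encoder : network returning [mu_phi, log v_phi] of size 2K
    decoder : network returning log v_theta(z) of size F
    """
    mu, logvar = encoder(power).chunk(2, dim=-1)

    # Reparametrization trick: z = mu + sqrt(v_phi) * eps, with eps ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Reconstruction term: -ln N_c(s; 0, diag(v_theta(z))) equals, up to a constant,
    # the Itakura-Saito divergence between |s|^2 and v_theta(z).
    log_v = decoder(z)
    recon = torch.sum(log_v + power * torch.exp(-log_v), dim=-1)

    # KL divergence between q_phi(z | s) and the standard Gaussian prior
    kl = 0.5 * torch.sum(mu ** 2 + logvar.exp() - logvar - 1.0, dim=-1)

    loss = (recon + kl).mean()          # negative ELBO averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```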