DEGREASE (ANR-23-CE23-0009) is a 42-month project (2024/04 - 2027/09) funded by the French National Research Agency (ANR) within the Young Researcher program (JCJC) and coordinated by Simon Leglaive.
Figure 1: Illustration of the speech enhancement task.
DEGREASE stands for deep generative and inference models for weakly-supervised speech enhancement. Speech enhancement consists of improving the quality and intelligibility of a speech signal in a recording degraded, for instance, by interfering sound sources and reverberation (see Figure 1). Speech enhancement finds applications in various technologies for human and machine listening (hearing aids, assistive listening, vocal assistants, smartphones, smart homes, etc.).
Figure 2: The (now) conventional approach to supervised speech enhancement.
In recent years, there has been great progress in speech enhancement thanks to deep learning models trained in a supervised manner. Supervised speech enhancement involves three main ingredients, as illustrated in Figure 2:
Figure 3: High-level overview of the methodology proposed in DEGREASE.
The scientific ambition of the DEGREASE project is to develop speech enhancement methods that can leverage real unlabeled recordings of noisy and reverberant speech at training time and that can adapt to new acoustic conditions at test time. To reach this objective, we propose a methodology at the crossroads of audio signal processing, probabilistic graphical modeling, and deep learning, based on deep generative and inference models specifically designed for processing multi-microphone speech signals.
The probabilistic generative modeling approach will allow us to treat the clean speech signals as partially observed variables during training. Models will thus be learned in a semi-supervised manner at training time, and they will be adapted in an unsupervised manner at test time. Speech enhancement will be achieved by inverting the learned generative model, i.e., by performing inference.
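To make the idea of enhancement as inference concrete, here is a minimal toy sketch. It is not the DEGREASE methodology: it assumes a scalar, zero-mean Gaussian generative model in which the noisy observation is x = s + n, with known speech and noise variances. Under that assumption, inverting the model amounts to computing the posterior mean of s given x, which is the classical Wiener gain applied to x. The function names `wiener_posterior_mean` and `mse`, and the variance values, are hypothetical and chosen only for illustration.

```python
import random


def wiener_posterior_mean(x, speech_var, noise_var):
    """Posterior mean E[s | x] for the toy model x = s + n,
    with s ~ N(0, speech_var) and n ~ N(0, noise_var) independent."""
    gain = speech_var / (speech_var + noise_var)  # Wiener gain, between 0 and 1
    return [gain * xi for xi in x]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


# Sample from the assumed generative model (illustrative variances).
rng = random.Random(0)
speech_var, noise_var = 1.0, 0.5
clean = [rng.gauss(0.0, speech_var ** 0.5) for _ in range(10000)]
noisy = [s + rng.gauss(0.0, noise_var ** 0.5) for s in clean]

# "Enhancement" = inference of the clean signal given the noisy observation.
enhanced = wiener_posterior_mean(noisy, speech_var, noise_var)

print("MSE noisy vs clean:   ", mse(noisy, clean))
print("MSE enhanced vs clean:", mse(enhanced, clean))
```

In this toy setting the variances are given; in a learned generative model they would be replaced by quantities estimated from data, which is where the semi-supervised training and the unsupervised test-time adaptation described above would come in.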
The spatial and time-frequency structure of multi-microphone speech signals suggests the existence of a low-dimensional latent variable involved in the generative process of the data. In DEGREASE, we also seek to relate this low-dimensional latent representation to physical properties of the signal (spatial and spectro-temporal properties such as the spatial location and the pitch contour of the speech source). This interpretability will allow us to inform the speech enhancement process using external information provided for instance by off-the-shelf pitch or spatial location predictors.
The outcomes of the DEGREASE project are expected to help build more reliable speech technologies that can work optimally in diverse and uncontrolled acoustic environments.