DEGREASE

Description and objectives

DEGREASE (ANR-23-CE23-0009) is a 45-month project (2024/04 - 2028/01) funded by the French National Research Agency (ANR) within the Young Researcher program (JCJC) and coordinated by Simon Leglaive.

Speech enhancement

Figure 1: Illustration of the speech enhancement task.

DEGREASE stands for deep generative and inference models for weakly-supervised speech enhancement. Speech enhancement consists of improving the quality and intelligibility of a speech signal in a degraded recording, where the degradation is caused for instance by interfering sound sources and reverberation (see Figure 1). Speech enhancement finds applications in various technologies for human and machine listening (hearing aids, assistive listening, vocal assistants, smartphones, smart homes, etc.).

The conventional fully-supervised approach

Figure 2: The (now) conventional approach to supervised speech enhancement.

In recent years, there has been great progress in speech enhancement thanks to deep learning models trained in a supervised manner. Supervised speech enhancement involves three main ingredients, as illustrated in Figure 2:

A model, which provides a prediction of the clean speech signal given the noisy recording. State-of-the-art methods today rely on deep neural networks.
A metric, which measures the discrepancy between the clean speech estimate and the ground-truth signal. During training, the metric corresponds to a differentiable loss function, which is minimized to estimate the model parameters. At test time, the metric (which can differ from the training loss function) is used to evaluate the performance of the trained model.
A labeled dataset, which consists of noisy speech signals and their corresponding clean reference signals; we say that the noisy speech signals are labeled with the clean speech signals. During training, the clean speech reference signals are used as the targets for the model. At test time, these are used to compute intrusive performance metrics.
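
As a concrete illustration of an intrusive metric, the sketch below computes the scale-invariant signal-to-distortion ratio (SI-SDR) between a clean speech estimate and its clean reference. The project description does not name a specific metric, so the choice of SI-SDR, the function name si_sdr, and the random test signals are assumptions made purely for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better).

    Illustrative choice of intrusive metric; not taken from the project page.
    """
    # Remove the mean so that a constant offset does not affect the score.
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))


# Hypothetical stand-ins for a clean reference and a degraded estimate.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.5 * rng.standard_normal(16000)

score_noisy = si_sdr(noisy, clean)
score_rescaled = si_sdr(0.3 * noisy, clean)  # same signal, different gain
print(f"SI-SDR of noisy signal: {score_noisy:.2f} dB")
print(f"SI-SDR after rescaling: {score_rescaled:.2f} dB")
```

The two printed scores agree because the metric is invariant to the gain of the estimate. Note that the function requires the clean reference, which is exactly what is missing for real-world recordings.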

Unfortunately, it is very difficult, if not impossible, to acquire labeled noisy speech signals in real-world conditions due to cross-talk between microphones. Therefore, datasets for supervised learning have to be generated artificially, by creating synthetic mixtures of isolated speech and noise signals. Artificially generated training data are, however, inevitably mismatched with real-world noisy speech recordings, which can result in poor speech enhancement performance when the mismatch is severe. Moreover, if the task or the evaluation domain changes, supervised learning requires collecting new data and retraining the model, which is costly in both time and computation. These limitations of supervised speech enhancement contrast with the impressive adaptability of the human auditory system when it comes to perceiving speech in unknown adverse acoustic conditions.
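
The synthetic mixtures mentioned above are commonly built by scaling a noise signal to reach a chosen signal-to-noise ratio (SNR) and adding it to an isolated speech signal. The following is a minimal sketch of that idea, not the project's data pipeline: the function name mix_at_snr, the 5 dB target, and the sine and white-noise stand-ins for real recordings are all assumptions.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target SNR (in dB).

    Returns the noisy mixture and the scaled noise component.
    Hypothetical helper written for illustration only.
    """
    # Repeat the noise if it is shorter than the speech, then trim it.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2)
    # Gain such that speech_power / (gain**2 * noise_power) = 10**(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise


# Stand-ins for isolated recordings: a 220 Hz tone and white noise at 16 kHz.
sample_rate = 16000
time = np.arange(sample_rate) / sample_rate
speech = 0.1 * np.sin(2.0 * np.pi * 220.0 * time)
noise = np.random.default_rng(0).standard_normal(sample_rate // 2)

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
measured_snr = 10.0 * np.log10(np.mean(speech**2) / np.mean(scaled_noise**2))
print(f"Measured SNR: {measured_snr:.2f} dB")
```

A mixture built this way comes with its clean reference by construction, which is what makes supervised training possible. It also shows where the mismatch comes from: a simple additive mixture of this kind does not reproduce the reverberation and recording conditions of a real acoustic scene.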

DEGREASE

Figure 3: High-level overview of the methodology proposed in DEGREASE.

The scientific ambition of the DEGREASE project is to develop speech enhancement methods that can leverage real unlabeled recordings of noisy and reverberant speech at training time and that can adapt to new acoustic conditions at test time. To reach this objective we propose a methodology at the crossroads of audio signal processing, probabilistic graphical modeling, and deep learning, which is based on deep generative and inference models specifically designed for the processing of multi-microphone speech signals.

The probabilistic generative modeling approach will allow us to consider the clean speech signals as partially-observed variables during training. Models will thus be learned in a semi-supervised manner at training time, and they will be adapted in an unsupervised manner at test time. Speech enhancement will be achieved by inverting the learned generative model, i.e., by performing inference.
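
To make the idea of enhancement by inference concrete, here is a deliberately simple toy example, which is not the model developed in DEGREASE. It assumes a scalar generative model in which the noisy observation is the sum of zero-mean Gaussian speech and independent zero-mean Gaussian noise with known variances. Inverting this model gives a posterior mean of the clean speech equal to the observation multiplied by a Wiener-like gain. All names and variance values below are hypothetical.

```python
import numpy as np


def posterior_mean(noisy, speech_var, noise_var):
    """Posterior mean of the speech given the noisy observation.

    Toy model assumed here: noisy = speech + noise, with
    speech ~ N(0, speech_var) and noise ~ N(0, noise_var), independent.
    """
    gain = speech_var / (speech_var + noise_var)  # Wiener-like gain in [0, 1]
    return gain * noisy


# Sample from the assumed generative model.
rng = np.random.default_rng(0)
speech_var, noise_var = 1.0, 0.5
num_samples = 100_000
speech = rng.normal(0.0, np.sqrt(speech_var), num_samples)
noise = rng.normal(0.0, np.sqrt(noise_var), num_samples)
noisy = speech + noise

# Inference: estimate the latent clean signal from the observation alone.
estimate = posterior_mean(noisy, speech_var, noise_var)

mse_noisy = np.mean((noisy - speech) ** 2)
mse_estimate = np.mean((estimate - speech) ** 2)
print(f"MSE of the raw observation: {mse_noisy:.3f}")
print(f"MSE of the posterior mean:  {mse_estimate:.3f}")
```

In this toy model the expected error of the posterior mean is speech_var * noise_var / (speech_var + noise_var), which is lower than the noise variance. The inference step itself never uses the clean signal, only the model parameters, which is the property that makes unsupervised adaptation at test time conceivable.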

The spatial and time-frequency structure of multi-microphone speech signals suggests the existence of a low-dimensional latent variable involved in the generative process of the data. In DEGREASE, we also seek to relate this low-dimensional latent representation to physical properties of the signal (spatial and spectro-temporal properties such as the spatial location and the pitch contour of the speech source). This interpretability will allow us to inform the speech enhancement process using external information provided for instance by off-the-shelf pitch or spatial location predictors.

The outcomes of the DEGREASE project are expected to help build more reliable speech technologies that can work optimally in diverse and uncontrolled acoustic environments.