class: middle, hide-count count: false $$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$ $$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$ $$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$ $$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myu#1{\mathbf{u}\_{#1}} $$ $$ \global\def\mya#1{\mathbf{a}\_{#1}} $$ $$ \global\def\myv#1{\mathbf{v}\_{#1}} $$ $$ \global\def\mythetaz{\theta\_\myz{}} $$ $$ \global\def\mythetax{\theta\_\myx{}} $$ $$ \global\def\mythetas{\theta\_\mys{}} $$ $$ \global\def\mythetaa{\theta\_\mya{}} $$ $$ \global\def\bs#1{{\boldsymbol{#1}}} $$ $$ \global\def\diag{\text{diag}} $$ $$ \global\def\mbf{\mathbf} $$ $$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$ $$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$ $$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$ .big-vspace[ ] .center[ # The CHiME-7 UDASE Task ## Unsupervised Domain Adaptation for Conversational Speech Enhancement ] .vspace[ ] .center[Simon Leglaive
<sup>1</sup> $\quad$ Léonie Borne<sup>2</sup> $\quad$ Efthymios Tzinis<sup>3</sup> $\quad$ Mostafa Sadeghi<sup>4</sup> $\quad$ Matthieu Fraticelli<sup>5</sup> $\quad$ Scott Wisdom<sup>6</sup> $\quad$ Manuel Pariente<sup>2</sup> $\quad$ Daniel Pressnitzer<sup>5</sup> $\quad$ John R. Hershey<sup>6</sup>
] .small-vspace[ ] .small.center[
<sup>1</sup>CentraleSupélec, IETR, France $\quad$ <sup>2</sup>Pulse Audition, France $\quad$ <sup>3</sup>University of Illinois at Urbana-Champaign, USA $\quad$ <sup>4</sup>Inria, France $\quad$ <sup>5</sup>Ecole Normale Supérieure, PSL University, CNRS, France $\quad$ <sup>6</sup>
Google, USA] .vspace[ ] .small.center[CHiME-2023 Workshop - Dublin, Ireland - August 25, 2023] --- class: middle, hide-count count: false .big[ ## Outline ] - ### .bold[Introduction and motivation] - ### Data, evaluation, baseline - ### Submissions and results - ### Conclusion --- exclude: true class: middle ## Speech enhancement .center.width-70[![](images/speech_enhancement.svg)] .center[The speech enhancement task is to estimate a clean speech signal from a noisy recording.] --- exclude: true class: middle ## Conventional supervised deep learning approach .center.width-90[![](images/speech_enhancement_deep_learning.svg)] - In recent years, there has been great progress in speech enhancement thanks to the use of **deep learning** models. - Most speech enhancement methods today rely on deep neural networks that are trained in a **supervised** manner. --- class: middle ## Conventional supervised speech enhancement .center.width-100[![](images/supervised_speech_enhancement.svg)] .alert[Given the impossibility of acquiring labeled noisy speech signals in real-world conditions, datasets are generated artificially by creating synthetic mixtures of isolated speech and noise signals.] --- class: middle ## Limitations of supervised learning for speech enhancement - Creating a synthetic dataset of "realistic" noisy speech mixtures is not easy. - Supervised speech enhancement methods are effective as long as the acoustic recording conditions at test time are covered by the synthetic training data. - If the test domain deviates from the synthetic training domain, it will be necessary to rebuild a synthetic training dataset and retrain the model. .alert[ Wouldn't it be easier and more effective to .bold[automatically adapt models] on .bold[real unlabeled] noisy speech recordings, without the need of ground-truth clean speech signals? ] --- class: middle ## The CHiME-7 UDASE task .bold[Unsupervised domain adaptation...] - The task focuses on **single-channel** speech enhancement in a specific **target domain** for which **no well-matched labeled data** is available for training. - The task consists of using **unlabeled data** in the **target domain** to adapt supervised speech enhancement models trained on **synthetic labeled data** in a **source domain**. .bold[... for conversational speech enhancement] - The target domain corresponds to **real conversational speech recordings** of the CHiME-5 dataset .small[(Barker et al., 2018)]. - Dinner party scenario, with multiple speakers in a **noisy** and **reverberant environment**. - The goal is to estimate the clean, potentially multi-speaker, reverberant speech, **removing the additive background noise**. .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech. ] --- class: middle exclude: true ## The UDASE task - The UDASE task focuses on **single-channel speech enhancement in a specific domain for which no well-matched labeled data is available** for training. - The target test domain corresponds to the **multi-speaker conversational speech recordings of the CHiME-5 dataset** (dinner party scenario). Given a CHiME-5 recording with one or more speakers and additive noise, the goal is to predict the clean and potentially multi-speaker speech signal, removing the additive noise. 
This task is motivated by the assistive listening use case, in which a speech enhancement algorithm can help any individual to better engage in a conversation. - The synthetic labeled LibriMix dataset is used for **supervised learning on mismatched data**. .alert[ The UDASE task consists of using the CHiME-5 unlabeled data to adapt a supervised speech enhancement model trained on the mismatched LibriMix data. ] --- class: middle, hide-count count: false .big[ ## Outline ] - ### Introduction and motivation - ### .bold[Data, evaluation, baseline] - ### Submissions and results - ### Conclusion --- class: middle, compact-list ## Data The task builds upon 3 datasets of noisy speech mixtures, with up to 3 overlapping speakers. .center.width-90[![](images/datasets.svg)] $\hspace{1cm}$
.alert[The final objective is to perform well in the target domain (CHiME-5 dataset).] .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech.
Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). [LibriMix: An open-source dataset for generalizable speech separation](https://arxiv.org/abs/2005.11262). arXiv preprint arXiv:2005.11262. ] --- class: middle, compact-list exclude: true ## Data .tiny-nvspace[ ] .grid[ .kol-1-3[ .center[**CHiME-5** .small[(Barker et al., 2018)]
.small.bold[Adapted for the UDASE task]]
.medium[ - Train, dev, eval - **Unlabeled** data - **Target domain** - **Real** recordings of conversational speech - Reverberant speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**LibriMix**
.small[(Cosentino et al., 2020)]]
.medium[ - Train, dev - **Labeled** data - **Source domain** - **Synthetic** mixtures
- Speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**Reverberant LibriCHiME-5**
.small[(new)]]
.medium[ - Dev, eval - **Labeled** data - Close to **target domain** - **Synthetic** mixtures
- Reverberant speech + noise - 1 to 3 overlapping speakers ] ] ] .alert[The final objective is to perform well on the CHiME-5 dataset.] .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech.
Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). [LibriMix: An open-source dataset for generalizable speech separation](https://arxiv.org/abs/2005.11262). arXiv preprint arXiv:2005.11262. ] --- class: compact-list count: false exclude: true ## Data .tiny-nvspace[ ] .grid[ .kol-1-3[ .center[**CHiME-5** .small[(Barker et al., 2018)]
.small.bold[Adapted for the UDASE task]]
.medium[ - Train, dev, eval - In-domain **unlabeled** data - **Real** recordings of conversational speech - Reverberant speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**LibriMix**
.small[(Cosentino et al., 2020)]]
.medium[ - Train and dev only - Out-of-domain **labeled** data - **Synthetic** mixtures
- Speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**Reverberant LibriCHiME-5**
.small[(new)]]
.medium[ - Dev and eval only - Close to in-domain **labeled** data - **Synthetic** mixtures
- Reverberant speech + noise - 1 to 3 overlapping speakers ] ] ] .medium[ Example of use: 1. Train and validate a supervised model on the out-of-domain LibriMix dataset. 2. Unsupervised domain adaptation on the in-domain CHiME-5 dataset (training). 3. Validate and evaluate the adapted model on CHiME-5 and reverberant LibriCHiME-5. ] .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech.
Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). [LibriMix: An open-source dataset for generalizable speech separation](https://arxiv.org/abs/2005.11262). arXiv preprint arXiv:2005.11262. ] --- class: middle ## Evaluation stage 1 - objective evaluation - **Scale-invariant signal-to-distortion ratio** (SI-SDR) .small[(Le Roux et al., 2019)] on the reverberant LibriCHiME-5 dataset .small[(`eval/{1,2,3}`)]; - **DNSMOS P.835** .small[(Reddy et al., 2022)] on the single-speaker segments of the CHiME-5 dataset .small[(`eval/1`)]. Nonintrusive (i.e., reference-free) objective metric that provides performance scores for: - the speech signal quality (**SIG**); - the background intrusiveness (**BAK**); - the overall quality (**OVRL**). These are predictions of mean opinion scores (MOS), between 1 and 5, the higher the better. .credit[ Le Roux, J., Wisdom, S., Erdogan, H., & Hershey, J. R. (2019). [SDR–half-baked or well done?](https://arxiv.org/abs/1811.02508). In IEEE ICASSP.
Reddy, C. K., Gopal, V., & Cutler, R. (2022). [DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors](https://arxiv.org/abs/2110.01763). In IEEE ICASSP. ] --- class: middle ## Evaluation stage 2 - listening test in the target CHiME-5 domain - Conducted by .bold[Hend ElGhazaly] and .bold[Jon Barker] at the University of Sheffield, over more than three weeks, with two 2.5-hour test sessions per day. **.bold[Many thanks to them!]** - ITU-T P.835 methodology, with 3 absolute category rating (ACR) scales to separately evaluate - the speech signal quality (**SIG**); - the background intrusiveness (**BAK**); - the overall quality (**OVRL**). - We obtained **mean opinion scores** (MOS) for each scale. - The final ranking is based on the OVRL MOS. .credit[ ITU-T recommendation P.835 (2003). [Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm](https://www.itu.int/rec/T-REC-P.835-200311-I/en). ] --- class: middle, compact-list ## RemixIT baseline .small[(Tzinis et al., 2022)] .center.width-100[![](images/remixit.svg)] .center.small.italic[Figure adapted from (Tzinis et al., 2022)] - The teacher and student models were based on the same Sudo rm -rf sound separation model .small[(Tzinis et al., 2020)]. - They were trained by minimizing the negative SI-SDR loss for both the speech and noise sources. .credit[ Tzinis, E. et al. (2022). [RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing](https://arxiv.org/abs/2202.08862). In IEEE Journal of Selected Topics in Signal Processing.
Tzinis, E., Wang, Z., & Smaragdis, P. (2020). [Sudo rm -rf: Efficient networks for universal audio source separation](https://arxiv.org/abs/2007.06833). In IEEE MLSP. ] --- class: middle exclude: true We provided two versions of the student model: 1. **RemixIT**: trained using the raw audio segments of the CHiME-5 train set, which do not always contain speech. 2. **RemixIT-VAD**: trained using the audio segments that were automatically labeled as containing speech by Brouhaha’s voice activity detector .small[(Lavechin et al., 2022)]. .credit[ Lavechin, M. et al. (2022). [Brouhaha: Multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation](https://arxiv.org/abs/2210.13248). arXiv preprint arXiv:2210.13248. ] --- class: middle, hide-count count: false .big[ ## Outline ] - ### Introduction and motivation - ### Data, evaluation, baseline - ### .bold[Submissions and results] - ### Conclusion --- class: middle ## Submissions We received 5 submissions from 3 teams: - **N&B** from the **Northwestern Polytechnical University** and **ByteDance** (China) .small[.bold[The NWPU-ByteAudio System for CHiME-7 Task 2 UDASE Challenge]
Z. Zhang (NWPU), R. Han (NWPU), Z. Wang (NWPU), X. Xia (ByteDance), Y. Xiao (ByteDance), L. Xie (NWPU)] - **ISDS1** and **ISDS2** from **Sogang University** (Korea) .small[.bold[CHiME-7 Challenge Track 2 Technical Report: Purification Adapted Speech Enhancement System]
J. Jang and M.-W. Koo (Sogang University)] - **CMGAN-base** and **CMGAN-FT** from the **University of Sheffield** (UK) .small[.bold[The University of Sheffield CHiME-7 UDASE Challenge Speech Enhancement System]
G. L. Close, W. Ravenscroft, T. Hain, S. Goetze (University of Sheffield)] --- ## Objective evaluation results .width-100[![](images/OVRL_vs_SISDR.svg)] .center[The systems selected for the listening test were CMGAN-FT, N&B, ISDS1, RemixIT-VAD.] .footnote[BAK and SIG scores, tables of results, boxplots and violin plots are available on the [challenge website](https://www.chimechallenge.org/current/task2/results). ] --- class: middle, center, hide-count count: false ## Listening test results --- .small-nvspace[ ] .width-100[![](images/BAK_MOS_listening_test_pres.svg)] --- count: false .small-nvspace[ ] .width-100[![](images/BAK_MOS_listening_test_pres.svg)] .width-100[![](images/SIG_MOS_listening_test_pres.svg)] --- count: false .small-nvspace[ ] .width-100[![](images/BAK_MOS_listening_test_pres.svg)] .width-100[![](images/SIG_MOS_listening_test_pres.svg)] .width-100[![](images/OVRL_MOS_listening_test_pres.svg)] --- class: middle, hide-count count: false .big[ ## Outline ] - ### Introduction and motivation - ### Data, evaluation, baseline - ### Submissions and results - ### .bold[Conclusion] --- class: middle ## Conclusion .compact-list[ - Only 2 systems among 4 improved the overall quality compared to the unprocessed noisy speech signal. - Unsupervised domain adaptation (UDA) for speech enhancement (SE) is a difficult problem, but of great practical interest. - We hope that the design of the CHiME-7 UDASE task, the submitted systems and their evaluation will facilitate and foster the development of UDA methods for SE. - We will provide a further analysis of the objective evaluation and listening test results. ] .alert[Congratulations to the .bold[NWPU and ByteDance winning team], and many thanks to all participants!] --- class: middle, center, hide-count, compact-list count: false .large[ ## Thank you ] .footnote.left[ Resources: - arXiv preprint presenting the task and the baseline: https://arxiv.org/abs/2307.03533 - GitHub: https://github.com/UDASE-CHiME2023 - Eventually, listening test data (audio signals and human listening scores) will be made available online. ]
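---

class: middle
count: false

## Appendix: computing SI-SDR (stage 1 objective evaluation)

.small[A minimal NumPy sketch of the scale-invariant SDR of (Le Roux et al., 2019), as used for the stage-1 objective evaluation on reverberant LibriCHiME-5. This is an illustration of the metric only, not the official challenge evaluation code.]

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # Signals are assumed zero-mean; remove any DC offset first.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference makes the metric invariant to the estimate's gain.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference      # scaled reference ("target" component)
    residual = estimate - target    # everything in the estimate that is not the target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))

# Toy usage: a noisy estimate of a clean reference signal.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(f"SI-SDR: {si_sdr(clean, noisy):.1f} dB")  # roughly 20 dB for this toy example
```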
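---

class: middle
count: false

## Appendix: one RemixIT self-training step (sketch)

.small[A schematic PyTorch-style sketch of the bootstrapped-remixing step of the RemixIT baseline (Tzinis et al., 2022). The `teacher` and `student` callables stand for Sudo rm -rf models returning (speech, noise) estimates; their definition, the data loading, and the teacher-update schedule are assumptions of this illustration, not the actual baseline code.]

```python
import torch

def si_sdr_db(ref, est, eps=1e-8):
    # Batched scale-invariant SDR in dB, inputs of shape (batch, samples).
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    residual = est - target
    return 10 * torch.log10((target.pow(2).sum(-1) + eps) / (residual.pow(2).sum(-1) + eps))

def remixit_step(student, teacher, noisy_batch, optimizer):
    # 1) The frozen teacher pseudo-labels unlabeled in-domain (CHiME-5) mixtures.
    with torch.no_grad():
        t_speech, t_noise = teacher(noisy_batch)
    # 2) Noise estimates are shuffled across the batch and remixed with the
    #    speech estimates to form new, pseudo-labeled training mixtures.
    perm = torch.randperm(noisy_batch.shape[0], device=noisy_batch.device)
    remixed = t_speech + t_noise[perm]
    # 3) The student separates the remixes and is trained with the negative
    #    SI-SDR loss on both the speech and the noise sources.
    s_speech, s_noise = student(remixed)
    loss = -(si_sdr_db(t_speech, s_speech) + si_sdr_db(t_noise[perm], s_noise)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # (In RemixIT, the teacher weights are then periodically refreshed from the student.)
    return loss.item()
```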