class: middle, hide-count count: false $$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$ $$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$ $$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$ $$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$ $$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$ $$ \global\def\myu#1{\mathbf{u}\_{#1}} $$ $$ \global\def\mya#1{\mathbf{a}\_{#1}} $$ $$ \global\def\myv#1{\mathbf{v}\_{#1}} $$ $$ \global\def\mythetaz{\theta\_\myz{}} $$ $$ \global\def\mythetax{\theta\_\myx{}} $$ $$ \global\def\mythetas{\theta\_\mys{}} $$ $$ \global\def\mythetaa{\theta\_\mya{}} $$ $$ \global\def\bs#1{{\boldsymbol{#1}}} $$ $$ \global\def\diag{\text{diag}} $$ $$ \global\def\mbf{\mathbf} $$ $$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$ $$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$ $$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$ $$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$ $$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$ .big-vspace[ ] .center[ # The CHiME-7 UDASE Task ## Unsupervised Domain Adaptation for Conversational Speech Enhancement ] .vspace[ ] .center[Simon Leglaive
<sup>1</sup> $\quad$ Léonie Borne<sup>2</sup> $\quad$ Efthymios Tzinis<sup>3</sup> $\quad$ Mostafa Sadeghi<sup>4</sup> $\quad$ Matthieu Fraticelli<sup>5</sup> $\quad$ Scott Wisdom<sup>6</sup> $\quad$ Manuel Pariente<sup>2</sup> $\quad$ Daniel Pressnitzer<sup>5</sup> $\quad$ John R. Hershey<sup>6</sup>
] .small-vspace[ ] .small.center[
<sup>1</sup>CentraleSupélec, IETR, France $\quad$ <sup>2</sup>Pulse Audition, France $\quad$ <sup>3</sup>University of Illinois at Urbana-Champaign, USA $\quad$ <sup>4</sup>Inria, France $\quad$ <sup>5</sup>Ecole Normale Supérieure, PSL University, CNRS, France $\quad$ <sup>6</sup>
Google, USA] .vspace[ ] .small.center[CHiME-2023 Workshop - Dublin, Ireland - August 25, 2023] --- class: middle, hide-count count: false .big[ ## Outline ] - ### .bold[Introduction and motivation] - ### Data, evaluation, baseline - ### Submissions and results - ### Conclusion --- exclude: true class: middle ## Speech enhancement .center.width-70[![](images/speech_enhancement.svg)] .center[The speech enhancement task is to estimate a clean speech signal from a noisy recording.] --- exclude: true class: middle ## Conventional supervised deep learning approach .center.width-90[![](images/speech_enhancement_deep_learning.svg)] - In recent years, there has been great progress in speech enhancement thanks to the use of **deep learning** models. - Most speech enhancement methods today rely on deep neural networks that are trained in a **supervised** manner. --- class: middle ## Conventional supervised speech enhancement .center.width-100[![](images/supervised_speech_enhancement.svg)] .alert[Given the impossibility of acquiring labeled noisy speech signals in real-world conditions, datasets are generated artificially by creating synthetic mixtures of isolated speech and noise signals.] --- class: middle ## Limitations of supervised learning for speech enhancement - Creating a synthetic dataset of "realistic" noisy speech mixtures is not easy. - Supervised speech enhancement methods are effective as long as the acoustic recording conditions at test time are covered by the synthetic training data. - If the test domain deviates from the synthetic training domain, it will be necessary to rebuild a synthetic training dataset and retrain the model. .alert[ Wouldn't it be easier and more effective to .bold[automatically adapt models] on .bold[real unlabeled] noisy speech recordings, without the need of ground-truth clean speech signals? ] --- class: middle ## The CHiME-7 UDASE task .bold[Unsupervised domain adaptation...] - The task focuses on **single-channel** speech enhancement in a specific **target domain** for which **no well-matched labeled data** is available for training. - The task consists of using **unlabeled data** in the **target domain** to adapt supervised speech enhancement models trained on **synthetic labeled data** in a **source domain**. .bold[... for conversational speech enhancement] - The target domain corresponds to **real conversational speech recordings** of the CHiME-5 dataset .small[(Barker et al., 2018)]. - Dinner party scenario, with multiple speakers in a **noisy** and **reverberant environment**. - The goal is to estimate the clean, potentially multi-speaker, reverberant speech, **removing the additive background noise**. .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech. ] --- class: middle exclude: true ## The UDASE task - The UDASE task focuses on **single-channel speech enhancement in a specific domain for which no well-matched labeled data is available** for training. - The target test domain corresponds to the **multi-speaker conversational speech recordings of the CHiME-5 dataset** (dinner party scenario). Given a CHiME-5 recording with one or more speakers and additive noise, the goal is to predict the clean and potentially multi-speaker speech signal, removing the additive noise. 
This task is motivated by the assistive listening use case, in which a speech enhancement algorithm can help any individual to better engage in a conversation. - The synthetic labeled LibriMix dataset is used for **supervised learning on mismatched data**. .alert[ The UDASE task consists of using the CHiME-5 unlabeled data to adapt a supervised speech enhancement model trained on the mismatched LibriMix data. ] --- class: middle, hide-count count: false .big[ ## Outline ] - ### Introduction and motivation - ### .bold[Data, evaluation, baseline] - ### Submissions and results - ### Conclusion --- class: middle, compact-list ## Data The task builds upon 3 datasets of noisy speech mixtures, with up to 3 overlapping speakers. .center.width-90[![](images/datasets.svg)] $\hspace{1cm}$
.alert[The final objective is to perform well in the target domain (CHiME-5 dataset).] .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech.
Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). [LibriMix: An open-source dataset for generalizable speech separation](https://arxiv.org/abs/2005.11262). arXiv preprint arXiv:2005.11262. ] --- class: middle, compact-list exclude: true ## Data .tiny-nvspace[ ] .grid[ .kol-1-3[ .center[**CHiME-5** .small[(Barker et al., 2018)]
.small.bold[Adapted for the UDASE task]]
.medium[ - Train, dev, eval - **Unlabeled** data - **Target domain** - **Real** recordings of conversational speech - Reverberant speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**LibriMix**
.small[(Cosentino et al., 2020)]]
.medium[ - Train, dev - **Labeled** data - **Source domain** - **Synthetic** mixtures
- Speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**Reverberant LibriCHiME-5**
.small[(new)]]
.medium[ - Dev, eval - **Labeled** data - Close to **target domain** - **Synthetic** mixtures
- Reverberant speech + noise - 1 to 3 overlapping speakers ] ] ] .alert[The final objective is to perform well on the CHiME-5 dataset.] .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech.
Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). [LibriMix: An open-source dataset for generalizable speech separation](https://arxiv.org/abs/2005.11262). arXiv preprint arXiv:2005.11262. ] --- class: compact-list count: false exclude: true ## Data .tiny-nvspace[ ] .grid[ .kol-1-3[ .center[**CHiME-5** .small[(Barker et al., 2018)]
.small.bold[Adapted for the UDASE task]]
.medium[ - Train, dev, eval - In-domain **unlabeled** data - **Real** recordings of conversational speech - Reverberant speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**LibriMix**
.small[(Cosentino et al., 2020)]]
.medium[ - Train and dev only - Out-of-domain **labeled** data - **Synthetic** mixtures
- Speech + noise - 1 to 3 overlapping speakers ] ] .kol-1-3[ .center[**Reverberant LibriCHiME-5**
.small[(new)]]
.medium[ - Dev and eval only - Close to in-domain **labeled** data - **Synthetic** mixtures
- Reverberant speech + noise - 1 to 3 overlapping speakers ] ] ] .medium[ Example of use: 1. Train and validate a supervised model on the out-of-domain LibriMix dataset. 2. Unsupervised domain adaptation on the in-domain CHiME-5 dataset (training). 3. Validate and evaluate the adapted model on CHiME-5 and reverberant LibriCHiME-5. ] .credit[ Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). [The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines](https://arxiv.org/abs/1803.10609). In Interspeech.
Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). [LibriMix: An open-source dataset for generalizable speech separation](https://arxiv.org/abs/2005.11262). arXiv preprint arXiv:2005.11262. ] --- class: middle ## Evaluation stage 1 - objective evaluation - **Scale-invariant signal-to-distortion ratio** (SI-SDR) .small[(Le Roux et al., 2019)] on the reverberant LibriCHiME-5 dataset .small[(`eval/{1,2,3}`)]; - **DNSMOS P.835** .small[(Reddy et al., 2022)] on the single-speaker segments of the CHiME-5 dataset .small[(`eval/1`)]. Nonintrusive (i.e., reference-free) objective metric that provides performance scores for: - the speech signal quality (**SIG**); - the background intrusiveness (**BAK**); - the overall quality (**OVRL**). These are predictions of mean opinion scores (MOS), between 1 and 5, the higher the better. .credit[ Le Roux, J., Wisdom, S., Erdogan, H., & Hershey, J. R. (2019). [SDR–half-baked or well done?](https://arxiv.org/abs/1811.02508). In IEEE ICASSP.
Reddy, C. K., Gopal, V., & Cutler, R. (2022). [DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors](https://arxiv.org/abs/2110.01763). In IEEE ICASSP. ] --- class: middle ## Evaluation stage 2 - listening test in the target CHiME-5 domain - Conducted by .bold[Hend ElGhazaly] and .bold[Jon Barker] at the University of Sheffield, over more than three weeks, with two 2.5-hour test sessions per day. **.bold[Many thanks to them!]** - ITU-T P.835 methodology, with 3 absolute category rating (ACR) scales to separately evaluate - the speech signal quality (**SIG**); - the background intrusiveness (**BAK**); - the overall quality (**OVRL**). - We obtained **mean opinion scores** (MOS) for each scale. - The final ranking is based on the OVRL MOS. .credit[ ITU-T recommendation P.835 (2003). [Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm](https://www.itu.int/rec/T-REC-P.835-200311-I/en). ] --- class: middle, compact-list ## RemixIT baseline .small[(Tzinis et al., 2022)] .center.width-100[![](images/remixit.svg)] .center.small.italic[Figure adapted from (Tzinis et al., 2022)] - The teacher and student models were based on the same Sudo rm -rf sound separation model .small[(Tzinis et al., 2020)]. - They were trained by minimizing the negative SI-SDR loss for both the speech and noise sources. .credit[ Tzinis, E. et al. (2022). [RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing](https://arxiv.org/abs/2202.08862). In IEEE Journal of Selected Topics in Signal Processing.
Tzinis, E., Wang, Z., & Smaragdis, P. (2020). [Sudo rm -rf: Efficient networks for universal audio source separation](https://arxiv.org/abs/2007.06833). In IEEE MLSP. ] --- class: middle exclude: true We provided two versions of the student model: 1. **RemixIT**: trained using the raw audio segments of the CHiME-5 train set, which do not always contain speech. 2. **RemixIT-VAD**: trained using the audio segments that were automatically labeled as containing speech by Brouhaha’s voice activity detector .small[(Lavechin et al., 2022)]. .credit[ Lavechin, M. et al. (2022). [Brouhaha: Multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation](https://arxiv.org/abs/2210.13248). arXiv preprint arXiv:2210.13248. ] --- class: middle, hide-count count: false .big[ ## Outline ] - ### Introduction and motivation - ### Data, evaluation, baseline - ### .bold[Submissions and results] - ### Conclusion --- class: middle ## Submissions We received 5 submissions from 3 teams: - **N&B** from the **Northwestern Polytechnical University** and **ByteDance** (China) .small[.bold[The NWPU-ByteAudio System for CHiME-7 Task 2 UDASE Challenge]
Z. Zhang (NWPU), R. Han (NWPU), Z. Wang (NWPU), X. Xia (ByteDance), Y. Xiao (ByteDance), L. Xie (NWPU)] - **ISDS1** and **ISDS2** from **Sogang University** (Korea) .small[.bold[CHiME-7 Challenge Track 2 Technical Report: Purification Adapted Speech Enhancement System]
J. Jang and M.-W. Koo (Sogang University)] - **CMGAN-base** and **CMGAN-FT** from the **University of Sheffield** (UK) .small[.bold[The University of Sheffield CHiME-7 UDASE Challenge Speech Enhancement System]
G. L. Close, W. Ravenscroft, T. Hain, S. Goetze (University of Sheffield)] --- ## Objective evaluation results .width-100[![](images/OVRL_vs_SISDR.svg)] .center[The systems selected for the listening test were CMGAN-FT, N&B, ISDS1, RemixIT-VAD.] .footnote[BAK and SIG scores, tables of results, boxplots and violin plots are available on the [challenge website](https://www.chimechallenge.org/current/task2/results). ] --- class: middle, center, hide-count count: false ## Listening test results --- .small-nvspace[ ] .width-100[![](images/BAK_MOS_listening_test_pres.svg)] --- count: false .small-nvspace[ ] .width-100[![](images/BAK_MOS_listening_test_pres.svg)] .width-100[![](images/SIG_MOS_listening_test_pres.svg)] --- count: false .small-nvspace[ ] .width-100[![](images/BAK_MOS_listening_test_pres.svg)] .width-100[![](images/SIG_MOS_listening_test_pres.svg)] .width-100[![](images/OVRL_MOS_listening_test_pres.svg)] --- class: middle, hide-count count: false .big[ ## Outline ] - ### Introduction and motivation - ### Data, evaluation, baseline - ### Submissions and results - ### .bold[Conclusion] --- class: middle ## Conclusion .compact-list[ - Only 2 systems among 4 improved the overall quality compared to the unprocessed noisy speech signal. - Unsupervised domain adaptation (UDA) for speech enhancement (SE) is a difficult problem, but of great practical interest. - We hope that the design of the CHiME-7 UDASE task, the submitted systems and their evaluation will facilitate and foster the development of UDA methods for SE. - We will provide a further analysis of the objective evaluation and listening test results. ] .alert[Congratulations to the .bold[NWPU and ByteDance winning team], and many thanks to all participants!] --- class: middle, center, hide-count, compact-list count: false .large[ ## Thank you ] .footnote.left[ Resources: - arXiv preprint presenting the task and the baseline: https://arxiv.org/abs/2307.03533 - GitHub: https://github.com/UDASE-CHiME2023 - Eventually, listening test data (audio signals and human listening scores) will be made available online. ]
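---

class: middle
count: false

## Appendix: computing SI-SDR (stage 1 objective evaluation)

.small[A minimal NumPy sketch of the scale-invariant SDR of (Le Roux et al., 2019), as used for the stage-1 objective evaluation on reverberant LibriCHiME-5. This is an illustration of the metric only, not the official challenge evaluation code.]

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # Signals are assumed zero-mean; remove any DC offset first.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference makes the metric invariant to the estimate's gain.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference      # scaled reference ("target" component)
    residual = estimate - target    # everything in the estimate that is not the target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))

# Toy usage: a noisy estimate of a clean reference signal.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(f"SI-SDR: {si_sdr(clean, noisy):.1f} dB")  # roughly 20 dB for this toy example
```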
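---

class: middle
count: false

## Appendix: one RemixIT self-training step (sketch)

.small[A schematic PyTorch-style sketch of the bootstrapped-remixing step of the RemixIT baseline (Tzinis et al., 2022). The `teacher` and `student` callables stand for Sudo rm -rf models returning (speech, noise) estimates; their definition, the data loading, and the teacher-update schedule are assumptions of this illustration, not the actual baseline code.]

```python
import torch

def si_sdr_db(ref, est, eps=1e-8):
    # Batched scale-invariant SDR in dB, inputs of shape (batch, samples).
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    residual = est - target
    return 10 * torch.log10((target.pow(2).sum(-1) + eps) / (residual.pow(2).sum(-1) + eps))

def remixit_step(student, teacher, noisy_batch, optimizer):
    # 1) The frozen teacher pseudo-labels unlabeled in-domain (CHiME-5) mixtures.
    with torch.no_grad():
        t_speech, t_noise = teacher(noisy_batch)
    # 2) Noise estimates are shuffled across the batch and remixed with the
    #    speech estimates to form new, pseudo-labeled training mixtures.
    perm = torch.randperm(noisy_batch.shape[0], device=noisy_batch.device)
    remixed = t_speech + t_noise[perm]
    # 3) The student separates the remixes and is trained with the negative
    #    SI-SDR loss on both the speech and the noise sources.
    s_speech, s_noise = student(remixed)
    loss = -(si_sdr_db(t_speech, s_speech) + si_sdr_db(t_noise[perm], s_noise)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # (In RemixIT, the teacher weights are then periodically refreshed from the student.)
    return loss.item()
```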