class: middle, center

$$ \global\def\myx#1{{\color{green}\mathbf{x}\_{#1}}} $$
$$ \global\def\mys#1{{\color{green}\mathbf{s}\_{#1}}} $$
$$ \global\def\myz#1{{\color{brown}\mathbf{z}\_{#1}}} $$
$$ \global\def\myhnmf#1{{\color{brown}\mathbf{h}\_{#1}}} $$
$$ \global\def\myztilde#1{{\color{brown}\tilde{\mathbf{z}}\_{#1}}} $$
$$ \global\def\myu#1{\mathbf{u}\_{#1}} $$
$$ \global\def\mya#1{\mathbf{a}\_{#1}} $$
$$ \global\def\myv#1{\mathbf{v}\_{#1}} $$
$$ \global\def\mythetaz{\theta\_\myz{}} $$
$$ \global\def\mythetax{\theta\_\myx{}} $$
$$ \global\def\mythetas{\theta\_\mys{}} $$
$$ \global\def\mythetaa{\theta\_\mya{}} $$
$$ \global\def\bs#1{{\boldsymbol{#1}}} $$
$$ \global\def\diag{\text{diag}} $$
$$ \global\def\mbf{\mathbf} $$
$$ \global\def\myh#1{{\color{purple}\mbf{h}\_{#1}}} $$
$$ \global\def\myhfw#1{{\color{purple}\overrightarrow{\mbf{h}}\_{#1}}} $$
$$ \global\def\myhbw#1{{\color{purple}\overleftarrow{\mbf{h}}\_{#1}}} $$
$$ \global\def\myg#1{{\color{purple}\mbf{g}\_{#1}}} $$
$$ \global\def\mygfw#1{{\color{purple}\overrightarrow{\mbf{g}}\_{#1}}} $$
$$ \global\def\mygbw#1{{\color{purple}\overleftarrow{\mbf{g}}\_{#1}}} $$
$$ \global\def\neq{\mathrel{\char`≠}} $$

# Audio source separation
### based on the sparsity and spatial diversity of the source signals

.vspace[ ]

.center[Simon Leglaive]

.vspace[ ]

.small.center[CentraleSupélec]

---
class: middle

## Today
.alert-g[How to **exploit the structure** of **audio/speech signals** to solve the audio **source separation** problem.]

---
class: middle

## General objectives

The audio source separation problem today is a pretext to discuss general methodological principles involved in many different signal processing problems:

1. The **observation model** to relate the latent signal(s) of interest to the observations;
2. The **latent signal model** to make the resolution of the problem tractable;
3. The **algorithm** to estimate the model parameters and recover the latent signal(s).

It will also be an opportunity to reuse some concepts that we saw in the last lesson on signal representations, in particular how signal transformations can be useful to define models.

---
class: middle

## Source separation

.center[.width-50[![](images/illust_sep_sources.svg)]]

.alert-g[The goal is to separate a set of $J$ source signals $s\_j(t)$, $j \in \\\\{1,...,J\\\\}$, given a set of $I$ mixture signals $x\_i(t)$, $i \in \\\\{1,...,I\\\\}$.]

The source separation problem is mainly characterized by
- the type of mixing (instantaneous vs. convolutive);
- the relative number of sources and microphones (under/over-determined problem).

---
class: middle, center

# Observation model

---
class: middle

## Linear instantaneous model

- We aim to model the relationship between the **latent signals of interest** (the sources) and the **observations** (the mixtures).

- The simplest mixture model is **linear** and **instantaneous**:
$$
x\_i(t) = \sum\_{j=1}^J a\_{ij} \, s\_j(t),
$$
or equivalently, in matrix form:
$$
\begin{bmatrix}
x\_1(t) \\\\
\vdots \\\\
x\_{I}(t)
\end{bmatrix} =
\begin{bmatrix}
a\_{11} & a\_{12} & ... & a\_{1J} \\\\
a\_{21} & a\_{22} & ... & a\_{2J} \\\\
\vdots & \vdots & \vdots & \vdots \\\\
a\_{I1} & a\_{I2} & ... & a\_{IJ}
\end{bmatrix}
\begin{bmatrix}
s\_1(t) \\\\
s\_2(t) \\\\
\vdots \\\\
s\_J(t)
\end{bmatrix}.
$$
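---
class: middle

In code, a linear instantaneous mixture is a single matrix product applied to all samples at once. Below is a minimal NumPy sketch (the shapes and variable names are illustrative assumptions, not part of the lesson material):

```python
import numpy as np

rng = np.random.default_rng(0)

J, I, T = 3, 2, 16000               # sources, microphones, samples
s = rng.standard_normal((J, T))     # source signals, one per row
A = rng.standard_normal((I, J))     # instantaneous mixing matrix

# x_i(t) = sum_j a_ij s_j(t), for all t at once
x = A @ s                           # mixture signals, shape (I, T)
```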
---
class: middle

.grid[
.kol-2-3[
- **Magnetoencephalography** and **electroencephalography** (M/EEG) are non-invasive techniques to record brain activity. They capture the magnetic and electric signals produced by active neurons from the scalp surface.

- Each M/EEG sensor captures a **linear combination** of the different brain activities.

- The linear combination is considered **instantaneous** due to the proximity between the brain and the sensors.

.center[.width-40[![](images/MNE_python.png)]]
.center.tiny[Credits: MNE Python]
]
.kol-1-3[
.center[.width-90[![](images/EEG.png)]]
.center.tiny[Credits: Hulton-Deutsch / Corbis Historical / Getty Images]
]
]

---
class: middle

- An electrocardiogram (ECG) is a recording of the heart's electrical activity through repeated cardiac cycles.

- The fetal ECG provides important information about the health of the fetus. Its extraction involves the elimination of the maternal ECG components and other interfering signals from the ECG measurements obtained during pregnancy.

- This can be formulated as a source separation problem, where the mixture is usually assumed linear and instantaneous.

.grid[
.kol-1-2[.center[.width-80[![](images/ECG_BSS.png)]]]
.kol-1-2[.center[.width-60[![](images/ECG_BSS2.png)]]]
]

.center.tiny[Credits: D. Sugumar et al., Joint blind source separation algorithms in the separation of non-invasive maternal and fetal ECG, IECS, 2014.]

---
class: middle

## Anechoic model

In audio, the instantaneous mixing model rarely holds, due to the propagation of the sound source in the acoustic medium.

.grid[
.kol-3-5[
- Let us consider a source (a balloon) and a microphone in an open-air environment without any obstacle.

- We assume the source and the microphone are not reflecting sound.

- The balloon explodes, producing a source signal $s(t)$.
]
.kol-2-5[
.right.width-90[![](images/free_field.png)]
]
]

.alert-g[How is the source signal $s(t)$ related to the microphone signal $x(t)$?]

---
class: middle

.grid[
.kol-3-5[
The signal acquired by the microphone is given by
$$
x(t) = \frac{1}{\sqrt{4 \pi}d} s\left(t - \frac{d}{c}f\_s\right)
$$
where
- $d$ is the source-to-microphone distance (in m);
- $c = 343$ is the speed of sound (in m/s, at 20°C);
- $d/c$ is the time of arrival (in s);
- $f\_s$ is the sampling rate (in Hz).
]
.kol-2-5[
.right.width-90[![](images/free_field.png)]
]
]

.alert-g[At the microphone, the source signal is simply **attenuated** and **delayed**. This is an **anechoic recording**.]
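---
class: middle

This free-field model is straightforward to simulate. A minimal sketch, assuming the delay is rounded to an integer number of samples (a real simulator would use fractional-delay filtering):

```python
import numpy as np

def anechoic_mic(s, d, fs, c=343.0):
    """Free-field model: the source signal is attenuated and delayed.

    s: source signal (1D array), d: distance (m), fs: sampling rate (Hz).
    """
    gain = 1.0 / (np.sqrt(4 * np.pi) * d)
    delay = int(round(d / c * fs))        # time of arrival, in samples
    x = np.zeros_like(s)
    x[delay:] = gain * s[:len(s) - delay]
    return x

# e.g., a balloon popping 5 m away from the microphone, at 16 kHz
x = anechoic_mic(np.random.randn(16000), d=5.0, fs=16000)
```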
---
class: middle

- In a real recording situation, the acoustic environment includes obstacles that affect the sound propagation in many different ways.

- The interaction between the source signal and the acoustic environment is what gives rise to the signal at the microphone.

.width-100[![](images/sound_interaction_surface.png)]

.credit[Image credits: Diego Di Carlo’s Ph.D. Thesis [“Echo-aware signal processing for audio scene analysis”](https://tel.archives-ouvertes.fr/tel-03133271v2/document), 2020.]

???

Specular reflection: The direction of the reflected wave is symmetrical to the direction of the incident wave with respect to the surface normal.

Diffuse reflection: The wave may be reflected in many directions depending on its wavelength and on the dimensions and irregularities of the surface.

Diffraction: The wave is diffracted in a way that depends on its wavelength, the shape of the obstacle or opening, its material and the angle of incidence.

Refraction and absorption: The absorption ratio depends on the material and the angle of incidence.

---
class: middle

## Convolutive model

- This interaction is accurately represented by a **convolution**:
$$
x(t) = \[h \star s \](t) = \sum\limits\_{\tau=0}^{L\_h-1} h(\tau) s(t - \tau) ,
$$
where $h(t)$ is called the **room impulse response** (RIR) and characterizes the acoustic path between the source and microphone locations.

.grid[
.kol-1-2[
- It is called the room .italic[impulse] response because it is the response of the room when the source is an impulse (Dirac delta function):
$$x(t) = \[h \star \delta \](t) = h(t).$$
]
.kol-1-2[
.center.width-80[![](images/RIR_balloon.png)]
]
]

---
class: middle

The previous anechoic model corresponds to the case where the RIR only characterizes the **direct path** between the source and the microphone:

$$ h(t) = \frac{1}{\sqrt{4 \pi}d} \delta\left(t - \frac{d}{c}f\_s\right), \qquad \delta(t) = \begin{cases} 1 & t = 0 \\\\ 0 & t \neq 0 \end{cases}.$$

.width-100[![](images/RIR_DP.svg)]

---
class: middle

But in a real room, many reflections of the sound source arrive at the microphone. This is called **reverberation**.

.width-100[![](images/RIR_LR.svg)]

---
class: middle

## Summary

- To sum up, we have seen different mixture models:

  - Linear instantaneous mixture model: $\hspace{.5cm}$ $ x\_i(t) = \sum\limits\_{j=1}^J a\_{ij} \, s\_j(t). $

  - Anechoic mixture model: $\hspace{2.05cm}$ $ x\_i(t) = \sum\limits\_{j=1}^J a\_{ij} \, s\_j(t - \tau\_{ij}). $

  - Convolutive mixture model: $\hspace{1.65cm}$ $ x\_i(t) = \sum\limits\_{j=1}^J \[a\_{ij} \star \, s\_j \](t). $

- The more "expressive" the mixture model, the larger the number of **unknown** mixing parameters. We are not directly interested in the mixing parameters, but we will have to estimate them to solve the source separation problem. More unknowns make the problem more difficult to solve. **There is in general a trade-off between modeling accuracy and estimation tractability**.
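---
class: middle

The convolutive mixture model of the summary can be simulated directly with FFT-based convolution. A minimal sketch (the function name and array shapes are illustrative assumptions):

```python
import numpy as np
from scipy.signal import fftconvolve

def convolutive_mix(s, h):
    """Convolutive mixing: x_i(t) = sum_j [h_ij * s_j](t).

    s: source signals, shape (J, T); h: RIRs, shape (I, J, Lh).
    """
    I, J, Lh = h.shape
    T = s.shape[1]
    x = np.zeros((I, T + Lh - 1))        # 'full' convolution length
    for i in range(I):
        for j in range(J):
            x[i] += fftconvolve(s[j], h[i, j])
    return x
```

Setting each `h[i, j]` to a scaled, shifted unit impulse recovers the anechoic model, and an impulse at $\tau = 0$ recovers the instantaneous model.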
---
class: middle, center

# Latent source signal model

---
class: middle

## Blind under-determined source separation

For simplicity, let us consider the linear instantaneous model:

$$
\begin{bmatrix}
x\_1(t) \\\\
\vdots \\\\
x\_{I}(t)
\end{bmatrix} =
\begin{bmatrix}
a\_{11} & a\_{12} & ... & a\_{1J} \\\\
a\_{21} & a\_{22} & ... & a\_{2J} \\\\
\vdots & \vdots & \vdots & \vdots \\\\
a\_{I1} & a\_{I2} & ... & a\_{IJ}
\end{bmatrix}
\begin{bmatrix}
s\_1(t) \\\\
s\_2(t) \\\\
\vdots \\\\
s\_J(t)
\end{bmatrix}.
$$

Moreover, we consider an **under-determined scenario** where the number of microphones $I$ is lower than the number of sources $J$, i.e. **we have more unknowns than equations**.

.alert-g[This is an ill-posed problem that admits an infinite number of solutions.]

---
class: middle

## Regularization with a signal model

- We need to **introduce additional information** about the latent signals of interest to compensate for the lack of observations.

- This additional information will be provided through a **source signal model**, which can take different forms:

  - simplifying assumptions (e.g., the sources have disjoint time supports);
  - deterministic model (e.g., $s\_j(t)$ is the sum of a few sinusoids with exponentially decaying amplitudes);
  - probabilistic model (e.g., $s\_j(t)$ is a locally stationary Gaussian process).

.footnote[Signal models are usually expressed mathematically. Using mathematical representations of signals allows us to derive algorithms to solve real-world problems (e.g., source separation).]

???

Intuition:
- Solve $ 8 = s\_1 + s\_2 $
- ?
- $s\_j$ is a rugby score.
- $s\_1 = 3$ and $s\_2 = 5$, or the inverse.

---
class: middle

## Signal modeling in a transformed domain

- Sometimes, it is easier to define a signal model in a **transformed domain**.

- We are processing non-stationary audio signals, so we consider a linear **time-frequency transform** .small[(see the previous lesson on signal representations)], such as the MDCT or the STFT.

- The observation model in the transformed domain is simply given by
$$
\begin{bmatrix}
X\_1(f,n) \\\\
\vdots \\\\
X\_{I}(f,n)
\end{bmatrix} =
\begin{bmatrix}
a\_{11} & a\_{12} & ... & a\_{1J} \\\\
a\_{21} & a\_{22} & ... & a\_{2J} \\\\
\vdots & \vdots & \vdots & \vdots \\\\
a\_{I1} & a\_{I2} & ... & a\_{IJ}
\end{bmatrix}
\begin{bmatrix}
S\_1(f,n) \\\\
S\_2(f,n) \\\\
\vdots \\\\
S\_J(f,n)
\end{bmatrix},
$$
where $\cdot(f,n)$ denotes a signal coefficient in the STFT or MDCT domain, at the time-frequency point $(f,n) \in \\{0,...,F-1\\} \times \\{0,...,N-1\\}$.
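---
class: middle

Because the STFT is linear, the mixing matrix is the same in the time and transformed domains. The following sketch checks this numerically (illustrative shapes; `scipy.signal.stft` is applied along the last axis):

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
J, I, T, fs = 3, 2, 16000, 16000
s = rng.standard_normal((J, T))      # sources
A = rng.standard_normal((I, J))      # mixing matrix
x = A @ s                            # time-domain mixtures

_, _, S = stft(s, fs=fs, nperseg=512)   # shape (J, F, N)
_, _, X = stft(x, fs=fs, nperseg=512)   # shape (I, F, N)

# X(f,n) = A S(f,n) at every time-frequency point
print(np.allclose(X, np.einsum('ij,jfn->ifn', A, S)))  # True
```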
---
class: middle

Let us consider a mixture of $J=3$ speech sources over $I=2$ microphones, represented in the waveform, MDCT or STFT domain by:

$$
\begin{bmatrix}
x\_1(\cdot) \\\\
x\_{2}(\cdot)
\end{bmatrix} =
\begin{bmatrix}
a\_{11} & a\_{12} & a\_{13} \\\\
a\_{21} & a\_{22} & a\_{23}
\end{bmatrix}
\begin{bmatrix}
s\_1(\cdot) \\\\
s\_2(\cdot) \\\\
s\_3(\cdot)
\end{bmatrix} =
s\_1(\cdot) \begin{bmatrix} a\_{11} \\\\ a\_{21} \end{bmatrix} +
s\_2(\cdot) \begin{bmatrix} a\_{12} \\\\ a\_{22} \end{bmatrix} +
s\_3(\cdot) \begin{bmatrix} a\_{13} \\\\ a\_{23} \end{bmatrix}.
$$

.width-90[![](images/source_separation_transformed_domain_sparsity.png)]

.alert-g[The figure shows the mixture coefficients in the different domains; what do you observe, and why?]

???

- Source signals are sparse in the time-frequency domain.
- Therefore, the mixture coefficients tend to cluster around the columns of the mixing matrix.

---
class: middle

- We see that some structure emerges in the MDCT/STFT representation of the mixture signals.

- This structure actually originates from the source signals, which tend to be sparse in the time-frequency domain.

- Sparsity is a central notion in signal processing, and we can define signal models that encode this characteristic of natural signals and images.

.grid[
.kol-1-3[
.caption[Source 1]
.width-100[![](images/voice_man_1.png)]
]
.kol-1-3[
.caption[Source 2]
.width-100[![](images/voice_woman_1.png)]
]
.kol-1-3[
.caption[Source 3]
.width-95[![](images/voice_woman_2.png)]
]
]
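---
class: middle

The clustering of the mixture coefficients around the columns of the mixing matrix is easy to reproduce with synthetic sparse coefficients. A minimal sketch (the sparsity pattern is simulated here, not computed from real speech):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
J, K = 3, 5000                           # sources, TF coefficients

# Sparse synthetic coefficients: most values are exactly zero
S = rng.standard_normal((J, K)) * (rng.random((J, K)) < 0.05)
A = rng.standard_normal((2, J))          # stereo mixing matrix
X = A @ S                                # mixture coefficients

# The point cloud concentrates along the directions of A's columns
plt.scatter(X[0], X[1], s=2, alpha=0.4)
plt.xlabel('mic 1 coefficient')
plt.ylabel('mic 2 coefficient')
plt.show()
```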
---
class: middle, center

## The Degenerate Unmixing Estimation Technique (DUET)

We are now going to see how sparsity is used in the DUET algorithm to solve the audio source separation problem for anechoic stereophonic mixtures.

.credit.left[S. Rickard, ["The DUET Blind Source Separation Algorithm"](https://pdfs.semanticscholar.org/1413/746141f2871e0f45056a7696e019b8f8a100.pdf), 2007.]

---
class: middle

## Unmixing or source separation

We want to estimate individual speech source signals from a **stereophonic mixture**.

.center.width-40[![](images/recording_congif.png)]

We have 3 source signals, so the problem is under-determined or **degenerate**.

---
class: middle

## Anechoic stereo mixture model in the time domain

Each microphone signal is the sum of delayed and attenuated source signals:

$$
x\_i(t) = \sum\_{j=1}^J \frac{1}{\sqrt{4 \pi}d\_{ij}} s\_j\left(t - \frac{d\_{ij}}{c}f\_s\right),
$$

where $d\_{ij}$ is the distance between the $j$-th source and the $i$-th microphone.

.center.width-30[![](images/multi-mic-anec.svg)]

---
class: middle

Without loss of generality, we absorb the attenuation and delay parameters at the first microphone into the definition of the source signal, i.e., we define the "new" source signal by:

$$
\tilde{s}\_j(t) = \frac{1}{\sqrt{4 \pi}d\_{1j}} s\_j\left(t - \frac{d\_{1j}}{c}f\_s\right)
$$

such that the mixture model becomes

$$
x\_1(t) = \sum\_{j=1}^J \tilde{s}\_j(t), \qquad x\_2(t) = \sum\_{j=1}^J a\_j \tilde{s}\_j(t - \delta\_j),
$$

where

- $a\_j = d\_{1j}/d\_{2j}$ is the **relative attenuation factor** of the $j$-th source, also called the inter-microphone level ratio;
- $\delta\_j = \displaystyle \frac{d\_{2j} - d\_{1j}}{c} f\_s$ is the **time difference of arrival** (TDoA) of the $j$-th source, also called the inter-microphone time difference.

These parameters convey information about the spatial location of the source.

---
class: middle

In matrix form, we have:

$$
\begin{bmatrix}
x\_1(t) \\\\
x\_{2}(t)
\end{bmatrix} =
\begin{bmatrix}
1 & 1 & ... & 1 \\\\
a\_{1} & a\_{2} & ... & a\_{J}
\end{bmatrix}
\begin{bmatrix}
s\_1(t - \delta\_1) \\\\
s\_2(t - \delta\_2) \\\\
\vdots \\\\
s\_J(t - \delta\_J)
\end{bmatrix},
$$

where we removed the tilde to simplify the notations (and with a slight abuse of notation, since the delays $\delta\_j$ only affect the second microphone). In the following, we will refer to $\left\\{({a}\_j,{\delta}\_j)\right\\}\_{j=1}^J$ as the **mixing parameters**.

---
class: middle

## Anechoic stereo mixture model in the STFT domain

Assuming that the TDoAs are small relative to the STFT analysis window length $L$, we have:

$$
s\_j(t-\delta\_j) \overset{\text{STFT}}{\longleftrightarrow} \exp\left({-\imath 2 \pi \frac{f \delta\_j}{L}}\right) S\_j(f, n).
$$

The mixture model thus rewrites in the STFT domain as follows:

$$
\begin{bmatrix}
X\_{1}(f, n) \\\\
X\_{2}(f, n)
\end{bmatrix} =
\begin{bmatrix}
1 & ... & 1 \\\\
a\_1 \exp\left({ \displaystyle -\imath 2 \pi \frac{f\delta\_1}{L} }\right) & ... & a\_J \exp\left({ \displaystyle -\imath 2 \pi \frac{f\delta\_J}{L} }\right)
\end{bmatrix}
\begin{bmatrix}
S\_{1}(f, n) \\\\
\vdots \\\\
S\_{J}(f, n)
\end{bmatrix}.
$$

---
class: middle

## DUET principle

.alert-g.left[
It is possible to blindly separate an arbitrary number of sources from .bold[two anechoic mixtures] provided that

- the .bold[time–frequency representations of the sources do not overlap] (assumption 1),
- the sources have .bold[different spatial locations] (assumption 2).
]

---
class: middle

## W-disjoint orthogonality .small[(assumption 1)]

- Source signals are sparse and have disjoint time-frequency supports. In other words, **at most one source is active at each time-frequency point $(f,n)$**.

- The W-disjoint orthogonality hypothesis can be formalized by:
$$ S\_{j}(f,n)S\_{k}(f,n) = 0, \qquad \forall (f,n), \qquad \forall j \neq k.$$

- The mixture model simplifies as follows:
$$
\begin{bmatrix}
X\_{1}(f,n) \\\\
X\_{2}(f,n)
\end{bmatrix} =
\begin{bmatrix}
1 \\\\
a\_{\mathcal{I}(f,n)} \exp\left({\displaystyle -\imath 2 \pi \frac{f\delta\_{\mathcal{I}(f,n)}}{L} }\right)
\end{bmatrix}
S\_{\mathcal{I}(f,n)}(f, n),
$$
where $\mathcal{I}(f,n) \in \\{1,2,...,J \\}$ indicates which source is active at time-frequency point $(f,n)$.

- It is the mathematical idealization of a milder assumption considering that **every time–frequency point in the mixture is dominated by the contribution of at most one source**.

---
class: middle

## Unmixing with binary masking

W-disjoint orthogonality is crucial to DUET because it allows for separating the mixture into its component sources using **binary masks**:

$$ \hat{S}\_{j}(f,n) = M\_{j}(f,n) X\_1(f,n),$$

where the mask is defined by:

$$ M\_j(f,n) = \begin{cases} 1 & \text{if } \mathcal{I}(f,n) = j \\\\ 0 & \text{otherwise} \end{cases}. $$

.alert-g[
.bold[The source separation problem now becomes that of estimating which source is active at each time-frequency point.]

This is where the second assumption of DUET comes into play.
]
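---
class: middle

Given the binary masks, the separation step itself is a one-liner per source, followed by an inverse STFT. A minimal sketch, assuming masks of shape $(J, F, N)$ and the same STFT parameters as the forward transform:

```python
import numpy as np
from scipy.signal import istft

def apply_masks(X1, masks, fs, nperseg):
    """Estimate sources by binary masking of the first mixture STFT.

    X1: mixture STFT, shape (F, N); masks: binary, shape (J, F, N).
    """
    estimates = []
    for M in masks:
        S_hat = M * X1                            # masked coefficients
        _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)
        estimates.append(s_hat)
    return np.stack(estimates)                    # shape (J, T)
```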
---
class: middle

## DUET algorithm

Let us recall the mixture model under the W-disjoint orthogonality assumption:

$$
\begin{bmatrix}
X\_{1}(f,n) \\\\
X\_{2}(f,n)
\end{bmatrix} =
\begin{bmatrix}
1 \\\\
a\_{\mathcal{I}(f,n)} \exp\left({\displaystyle -\imath 2 \pi \frac{f\delta\_{\mathcal{I}(f,n)}}{L} }\right)
\end{bmatrix}
S\_{\mathcal{I}(f,n)}(f, n).
$$

.alert-g[
The main observation that DUET leverages is that .bold[the ratio of the mixtures in the STFT domain does not depend on the source signal but only on the mixing parameters associated with the active source]:

$$
\frac{X\_{2}(f,n)}{X\_{1}(f,n)} = a\_j\exp\left({-\imath 2 \pi \frac{f\delta\_j}{L}}\right), \qquad \forall (f, n) \in \Omega\_j = \\{(f,n),\, \mathcal{I}(f,n) = j\\}.
$$
]

---
class: middle

- Let us define the **local attenuations** and **delays** by:

$$ \hat{a}(f,n) = \left|\frac{X\_{2}(f,n)}{X\_{1}(f,n)}\right|,$$

$$ \hat{\delta}(f,n) = -\frac{1}{2 \pi f/L}\arg\left(\frac{X\_{2}(f,n)}{X\_{1}(f,n)}\right), \qquad f > 0.$$

- Using the key observation in the previous slide, we have:

$$ \Big(\hat{a}(f,n), \hat{\delta}(f,n)\Big) = \Big(a\_j, \delta\_j\Big), \qquad \forall (f, n) \in \Omega\_j = \\{(f,n),\, \mathcal{I}(f,n) = j\\}. $$
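---
class: middle

The local attenuations and delays can be computed for all time-frequency points at once. A minimal sketch (the `eps` guard and the handling of the undefined $f = 0$ row are implementation assumptions):

```python
import numpy as np

def local_features(X1, X2, L, eps=1e-12):
    """Local attenuation and delay estimates from the STFT ratio.

    X1, X2: mixture STFTs, shape (F, N); L: STFT window length.
    """
    R = X2 / (X1 + eps)                          # mixture ratio
    a_hat = np.abs(R)
    f = np.arange(X1.shape[0]).reshape(-1, 1)    # frequency bin index
    with np.errstate(divide='ignore', invalid='ignore'):
        # delay in samples; undefined at f = 0 (the slide assumes f > 0)
        delta_hat = -np.angle(R) * L / (2 * np.pi * f)
    return a_hat, delta_hat
```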
---
class: middle

## Spatial diversity .small[(assumption 2)]

.alert-g[
We assume that the .bold[sources have different spatial locations], that is

$$ (a\_j \neq a\_k) \text{ or } (\delta\_j \neq \delta\_k), \qquad \forall j \neq k. $$

.medium[We recall that $a\_j$ and $\delta\_j$ encode the position of the $j$-th source relative to the microphones.]
]

.center.width-40[![](images/recording_congif.png)]

---
class: middle

## 2D histogram of local attenuations and delays

The local attenuations and delays that are computed from the mixture signals can thus **only take values among the actual mixing parameters, which are assumed to be all different**:

$$ \big(\hat{a}(f,n), \hat{\delta}(f,n)\big) \in \\{(a\_j, \delta\_j)\\}\_{j=1}^J, \qquad \forall (f,n). $$

.alert-g.center[
What should we obtain if we build a 2D histogram of the local attenuations and delays $\big(\hat{a}(f,n), \hat{\delta}(f,n)\big)$?
]

---
class: middle, center

.grid[
.kol-1-2[
.width-95[![](images/DUET_hist.png)]
]
.kol-1-2[
.width-95[![](images/DUET_hist_2.png)]
]
]

.small-nvspace[ ]

The observations do not perfectly match the model, but we can still identify three clusters.

---
class: middle

- From the 2D histogram, we can estimate the mixing parameters $\\{(\hat{a}\_j, \hat{\delta}\_j)\\}\_{j=1}^J$ by peak picking.

- We recall that in principle, for all time-frequency points $(f,n)$,
$$ \big(\hat{a}(f,n), \hat{\delta}(f,n)\big) \in \\{(\hat{a}\_j, \hat{\delta}\_j)\\}\_{j=1}^J.$$

- We can thus build the time-frequency masks for source separation as follows:
$$ M\_j(f, n) = \begin{cases} 1 & \text{if } \big(\hat{a}(f,n), \hat{\delta}(f,n)\big) = (\hat{a}\_j, \hat{\delta}\_j) \\\\ 0 & \text{otherwise} \end{cases}. $$

- In practice, because not all the assumptions are strictly satisfied, the local attenuations and delays will not be precisely equal to the estimated mixing parameters, but they will cluster around them. We will **need a metric** to measure the proximity.

---
class: middle

## Summary of DUET

1. Construct the STFT representations $X\_1(f,n)$ and $X\_2(f,n)$ of both mixtures.

2. Take the ratio of the two mixtures and extract the local attenuations and delays $$\left\\{\big(\hat{a}(f,n), \hat{\delta}(f,n)\big)\right\\}\_{(f,n)}.$$

3. Compute a 2D histogram and estimate the mixing parameters $\left\\{(\hat{a}\_j, \hat{\delta}\_j)\right\\}\_{j=1}^J$ by peak picking.

4. Build the binary masks
$$ M\_j(f, n) = \begin{cases} 1 & \text{if } \big(\hat{a}(f,n), \hat{\delta}(f,n)\big) \approx (\hat{a}\_j, \hat{\delta}\_j) \\\\ 0 & \text{otherwise} \end{cases}. $$

5. Estimate the sources by $ \hat{S}\_{j}(f,n) = M\_{j}(f,n) X\_1(f,n)$.

6. Compute the inverse STFT to get the time-domain source signals.
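---
class: middle

A hedged sketch of steps 3 and 4, reusing the hypothetical `local_features` helper from the earlier slide. Here each time-frequency point is simply assigned to the nearest peak in Euclidean distance, whereas Rickard's algorithm uses a weighted histogram and a more careful proximity metric:

```python
import numpy as np

def duet_masks(a_hat, delta_hat, peaks):
    """Assign each TF point to the closest (a_j, delta_j) peak.

    peaks: list of (a_j, delta_j) pairs picked on the 2D histogram.
    Returns binary masks of shape (J, F, N).
    """
    # Squared distance of every TF point to every peak
    # (note: the undefined f = 0 row should be excluded in practice)
    dists = [(a_hat - a) ** 2 + (delta_hat - d) ** 2 for a, d in peaks]
    nearest = np.argmin(np.stack(dists), axis=0)   # index of closest peak
    return np.stack([nearest == j for j in range(len(peaks))])

# Step 3: 2D histogram of the local features (excluding f = 0),
# then peak picking, e.g. with
# hist, a_edges, d_edges = np.histogram2d(a_hat[1:].ravel(),
#                                         delta_hat[1:].ravel(), bins=50)
```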
---
class: middle, center

.grid[
.kol-1-3[
.caption[Mask 1]
.width-100[![](images/mask1.png)]
]
.kol-1-3[
.caption[Mask 2]
.width-100[![](images/mask2.png)]
]
.kol-1-3[
.caption[Mask 3]
.width-95[![](images/mask3.png)]
]
]
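---
class: middle

Putting everything together, a compact end-to-end sketch on a toy anechoic mixture, reusing the hypothetical helpers `local_features`, `duet_masks`, and `apply_masks` sketched on the previous slides (here the peaks are taken as the true mixing parameters instead of being picked on the histogram):

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
fs, L, J, T = 16000, 1024, 3, 4 * 16000

# Toy sparse sources and anechoic stereo mixture
s = rng.standard_normal((J, T)) * (rng.random((J, T)) < 0.05)
a, delta = np.array([0.8, 1.0, 1.2]), np.array([-3, 0, 3])
x1 = s.sum(axis=0)
x2 = sum(a[j] * np.roll(s[j], delta[j]) for j in range(J))

_, _, X1 = stft(x1, fs=fs, nperseg=L)
_, _, X2 = stft(x2, fs=fs, nperseg=L)

a_hat, delta_hat = local_features(X1, X2, L)      # step 2
masks = duet_masks(a_hat, delta_hat,
                   list(zip(a, delta)))           # steps 3-4 (oracle peaks)
s_hat = apply_masks(X1, masks, fs=fs, nperseg=L)  # steps 5-6
```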