This web page presents source separation examples obtained with the proposed method [1]. It relies on a Student's t NMF-based source model defined in the MDCT domain, and the impulse responses of the mixing filters are also modeled with Student's t distributions (a schematic form of the source model is sketched just below). We compare this approach with others from the literature: the framework of Ozerov et al. [2] (rank-1 and rank-2 variants), the multichannel NMF of Sawada et al. [3] (rank 2), and our previous method based on unconstrained time-domain mixing filters [4].
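For reference, the sketch below gives the generic form of a Student's t NMF source model defined on MDCT coefficients. It is only an illustration: the symbols (s, w, h, ν) are our own notation, and the exact parameterization, priors and inference procedure are those described in [1].

```latex
% Schematic form of a Student's t NMF source model in the MDCT domain.
% Illustration only; the notation (s_{j,fn}, w, h, nu) is ours, not necessarily
% that of [1]. See the paper for the exact parameterization.
\[
  s_{j,fn} \;\sim\; \mathcal{T}_{\nu_j}\!\bigl(0,\, \sqrt{v_{j,fn}}\,\bigr),
  \qquad
  v_{j,fn} \;=\; \sum_{k} w_{j,fk}\, h_{j,kn},
  \quad w_{j,fk} \ge 0,\; h_{j,kn} \ge 0,
\]
where $s_{j,fn}$ is the real-valued MDCT coefficient of source $j$ at frequency
bin $f$ and time frame $n$, and $\mathcal{T}_{\nu}(\mu,\sigma)$ denotes a
Student's t distribution with $\nu$ degrees of freedom, location $\mu$ and
scale $\sigma$.
```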
The baseline algorithms and the proposed method are run using oracle NMF dictionaries (i.e., spectral dictionaries learned from the true source signals); all other model parameters are blindly estimated.
Matlab code for [1] is available here.
The stereo mixtures were created using source signals from the MTG MASS database and room impulse responses from the MIRD database [5]. Note that we evaluate the source separation quality in terms of reconstructed stereo source images, so when listening you can also pay attention to the estimated spatial position of the sources.
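To make the mixing process concrete, here is a minimal MATLAB sketch of how a stereo source image and the mixture can be formed from a mono source and a pair of room impulse responses. The file names and normalization are hypothetical; this is not the script actually used to generate the examples.

```matlab
% Illustrative sketch of stereo mixture creation (hypothetical file names;
% not the actual script used for these examples).
[s, fs] = audioread('source_drums.wav');   % mono source signal
h       = audioread('rir_stereo.wav');     % [L x 2] left/right room impulse responses

% Stereo source image: convolve the source with each channel of the RIR.
img = [conv(s, h(:,1)), conv(s, h(:,2))];

% The stereo mixture is the sum of the stereo images of all sources, e.g.
% x = img_drums + img_guitar1 + img_guitar2 + img_voice;

audiowrite('image_drums.wav', img ./ max(abs(img(:))), fs);
```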
[1] S. Leglaive, R. Badeau, and G. Richard. "Student's t source and mixing models for multichannel audio source separation", in IEEE Transactions on Audio, Speech and Language Processing, vol. 26, no. 6, 2018.
[2] A. Ozerov, E. Vincent, and F. Bimbot. "A general flexible framework for the handling of prior information in audio source separation", in IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 4, 2012.
[3] H. Sawada, H. Kameoka, S. Araki, and N. Ueda. "Multichannel extensions of non-negative matrix factorization with complex-valued data", in IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 5, 2013.
[4] S. Leglaive, R. Badeau, and G. Richard. "Multichannel audio source separation: variational inference of time-frequency sources from time-domain observations", in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017.
[5] E. Hadad, F. Heese, P. Vary, and S. Gannot. "Multichannel audio database in various acoustic environments", in Proc. of the IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Juan-les-Pins, France, 2014.
| | Drums | Guitar 1 | Guitar 2 | Voice |
|---|---|---|---|---|
| Original source images (stereo) | | | | |
| Proposed - w/o adapted TF window | | | | |
| Proposed - w/ adapted TF window | | | | |
| Ozerov et al. - rank 1 | | | | |
| Ozerov et al. - rank 2 | | | | |
| Sawada et al. - rank 2 | | | | |
| Unconstrained time-domain filters | | | | |
In this second section we present the source separation results for a mixture rendered with three different reverberation times: 160, 360 and 610 ms. The musical excerpt is taken from "Ana" by Vieux Farka Touré. We chose this excerpt because of the impulsiveness of the drums, which makes the reverberation easy to hear. Moreover, some issues with our previous method [4] are particularly audible on this excerpt (see Section 3 below).
a) Reverberation time of 160 ms
Stereo mixture:

| | Drums | Voice | Bass |
|---|---|---|---|
| Original source images (stereo) | | | |
| Proposed - w/o adapted TF window | | | |
| Proposed - w/ adapted TF window | | | |
| Ozerov et al. - rank 1 | | | |
| Ozerov et al. - rank 2 | | | |
| Sawada et al. - rank 2 | | | |
| Unconstrained time-domain filters | | | |
b) Reverberation time of 360 ms
Stereo mixture:

| | Drums | Voice | Bass |
|---|---|---|---|
| Original source images (stereo) | | | |
| Proposed - w/o adapted TF window | | | |
| Proposed - w/ adapted TF window | | | |
| Ozerov et al. - rank 1 | | | |
| Ozerov et al. - rank 2 | | | |
| Sawada et al. - rank 2 | | | |
| Unconstrained time-domain filters | | | |
c) Reverberation time of 610 ms
Stereo mixture:

| | Drums | Voice | Bass |
|---|---|---|---|
| Original source images (stereo) | | | |
| Proposed - w/o adapted TF window | | | |
| Proposed - w/ adapted TF window | | | |
| Ozerov et al. - rank 1 | | | |
| Ozerov et al. - rank 2 | | | |
| Sawada et al. - rank 2 | | | |
| Unconstrained time-domain filters | | | |
For the mixtures with reverberation times of 360 and 610 ms, the results obtained with our previous method [4] are not satisfactory. This is due to the unconstrained estimation of the mixing filters in that method. To illustrate this point, you can listen below to the true stereo mixing filter used for the drums in the previous audio example (reverberation time of 610 ms), along with its estimate obtained with [4] (unconstrained) and with the proposed method (constrained). As can be heard, part of the voice source signal leaks into the unconstrained mixing filter estimate. The proposed method does not suffer from this issue, precisely because probabilistic priors guide the estimation of the mixing filters.
| True mixing filter | Unconstrained estimation | Constrained estimation |
|---|---|---|
| | | |