Under-determined speech and music mixtures


We propose to repeat last year's Stereo Audio Source Separation Evaluation Campaign with fresh data.

Results


See the results over test and development data.

Test data


Download test.zip (22 MB)
These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Glen Phillips, Mark Engelberg, Psycho Voyager, Nine Inch Nails and Ali Farka Touré for music source signals and Shoko Araki and Emmanuel Vincent for mixture signals.

The test data contains three types of stereo mixtures:
  • instantaneous mixtures (static sources scaled by positive gains)
  • synthetic convolutive mixtures (static sources filtered by synthetic room impulse responses simulating a pair of omnidirectional microphones via the Roomsim toolbox)
  • live recordings (static sources played through loudspeakers in a meeting room, recorded one at a time by a pair of omnidirectional microphones and subsequently added together)
The room dimensions are the same for the synthetic convolutive mixtures and the live recordings (4.45 x 3.55 x 2.5 m). The reverberation time is set to either 130 ms or 250 ms and the distance between the two microphones to either 5 cm or 1 m, resulting in 9 mixing conditions overall (the instantaneous condition plus 2 x 2 conditions each for the synthetic convolutive mixtures and the live recordings).

For each mixing condition, 6 mixture signals have been generated from different sets of source signals placed at different spatial positions:
  • 4 male speech sources
  • 4 female speech sources
  • 3 male speech sources
  • 3 female speech sources
  • 3 non-percussive music sources
  • 3 music sources including drums
The source directions of arrival vary between -60 degrees and +60 degrees with a minimum spacing of 15 degrees, and the distances between the sources and the center of the microphone pair vary between 80 cm and 1.20 m.

The data consist of stereo WAV audio files, which can be imported into Matlab using the wavread command. These files are named test_<srcset>_<mixtype> [ _<reverb>_<spacing> ] _mix.wav, where <srcset> is a shortcut for the set of source signals, <mixtype> a shortcut for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.
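
As a minimal sketch, one such file can be loaded in Matlab as follows (the shortcut values in the filename are hypothetical and should be replaced by those of an actual downloaded file):

  % Minimal sketch: load one stereo test mixture
  fname = 'test_male3_inst_mix.wav';   % hypothetical <srcset> = male3, <mixtype> = inst
  [x, fs] = wavread(fname);            % x: samples x 2 channels, fs: sampling frequency in Hz
  fprintf('%d samples, %d channels, %d Hz\n', size(x, 1), size(x, 2), fs);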

Development data


Download dev1.zip (91 MB) (former development data of the Stereo Audio Source Separation Evaluation Campaign, complemented with new data for the additional mixing conditions considered above)
Download dev2.zip (47 MB) (former test data of the Stereo Audio Source Separation Evaluation Campaign)
These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Another Dreamer and Alex Q for music source signals and Hiroshi Sawada, Shoko Araki and Emmanuel Vincent for mixture signals.

The data consist of Matlab MAT-files and WAV audio files, which can be imported into Matlab using the load and wavread commands, respectively. These files are named as follows:
  • dev1_<srcset> [ _<mixtype>_<reverb> ] _src_<j>.wav: mono source signal
  • dev1_<srcset>_inst_matrix.mat: mixing matrix for instantaneous mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_setup.txt: positions of the sources for convolutive mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_filt.mat: mixing filter system for convolutive mixtures
  • dev1_<srcset>_<mixtype> [ _<reverb>_<spacing> ] _sim_<j>.wav: stereo contribution of a source signal to the two mixture channels
  • dev1_<srcset>_<mixtype> [ _<reverb>_<spacing> ] _mix.wav: stereo mixture signal
where <srcset> is a shortcut for the set of source signals, <mixtype> a shortcut for the mixture type, <reverb> the reverberation time, <spacing> the microphone spacing and <j> the source index.
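
As a hedged sketch, the main files of one instantaneous development mixture can be loaded as follows (the shortcut value 'male3' is hypothetical; the variables stored in the MAT-file can be listed with the fieldnames command):

  % Hedged sketch: load the development data for one instantaneous mixture
  [x, fs] = wavread('dev1_male3_inst_mix.wav');    % stereo mixture signal
  mixdata = load('dev1_male3_inst_matrix.mat');    % mixing matrix; inspect with fieldnames(mixdata)
  s1 = wavread('dev1_male3_src_1.wav');            % first mono source signal
  img1 = wavread('dev1_male3_inst_sim_1.wav');     % its stereo contribution to the mixture channels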

All mixture signals and source image signals have a duration of 10 s. Music source signals have a duration of 11 s to avoid border effects within the convolutive mixtures: the last 10 s are selected once the mixing system has been applied.
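
For instance, assuming fs denotes the sampling frequency and y a mixture (samples x channels) obtained by applying the mixing system to the 11 s music sources, the retained excerpt is simply:

  y10 = y(end-10*fs+1:end, :);   % keep the last 10 s, discarding the convolution onset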

Tasks and reference software

Based on the outcomes of the panel discussion at ICA'07, the source separation problem has been split into four tasks:
  1. source counting (estimate the number of sources)
  2. mixing system estimation (estimate the mixing matrix for instantaneous mixtures or the frequency-dependent mixing matrix for convolutive mixtures)
  3. source signal estimation (estimate the mono source signals)
  4. source spatial image estimation (estimate the stereo contribution of each source to the two mixture channels)

Participants are welcome to use some of the Matlab reference software below to build their own algorithms.

An example use of this software is given for instantaneous and convolutive mixtures in example_inst.m and example_conv.m, respectively.

Submission


Each participant is asked to submit the results of his/her algorithm for task 3 or 4, as preferred.
The results for tasks 1 and 2 may also be submitted if possible; when available, they will help diagnose the performance of the various parts of the algorithm.

In addition, each participant is asked to provide basic information about his/her algorithm (e.g. a bibliographical reference) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU (for instance, an algorithm that processes one excerpt in 30 s on a 3 GHz CPU runs at 10 s per excerpt and per GHz).

Note that the submitted audio files will be made available on a website under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.

Evaluation criteria

So far, there is no agreed-upon criterion for the evaluation of estimated mixing systems for under-determined mixtures. We propose to use an SNR-like criterion that we call the Mixing Error Ratio (MER), expressed in decibels. This criterion is computed in each frequency bin between the estimated mixing matrix and the true mixing matrix, allowing an arbitrary scaling for each source, and averaged over frequency. All source orderings are tested and the ordering leading to the best MER is selected.
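
For illustration only, a minimal Matlab sketch of such a criterion for the instantaneous (single-matrix) case could look as follows; the exact definition used for the evaluation may differ in its details:

  % Hedged sketch of an MER-like criterion for an instantaneous mixing matrix.
  % Ae and A are the estimated and true 2 x n mixing matrices; each estimated column
  % is compared to the corresponding true column up to an arbitrary scaling, and the
  % ordering maximising the mean MER is retained. Save as mer_inst.m.
  function mer = mer_inst(Ae, A)
  n = size(A, 2);
  orderings = perms(1:n);                    % all candidate source orderings
  mer = -Inf;
  for p = 1:size(orderings, 1)
      Aep = Ae(:, orderings(p, :));
      m = zeros(1, n);
      for j = 1:n
          a = A(:, j); ae = Aep(:, j);
          coll = (a' * ae) / (a' * a) * a;   % component of ae collinear to the true column
          m(j) = 10 * log10(sum(abs(coll).^2) / sum(abs(ae - coll).^2));
      end
      mer = max(mer, mean(m));
  end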

We propose to evaluate the estimated source signals via the criteria defined in the BSS_EVAL toolbox. These criteria allow an arbitrary filtering between the estimated source and the true source and measure interference and artifact distortion separately. All source orderings are tested and the ordering leading to the best SIR is selected.
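
A typical call looks as follows (hedged: the exact function name and signature depend on the BSS_EVAL version; the call below follows a recent release):

  % se and s: [number of sources x number of samples] matrices of estimated and true
  % mono source signals; perm returns the source ordering selected by the toolbox
  [SDR, SIR, SAR, perm] = bss_eval_sources(se, s);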

Similarly, we propose to evaluate the estimated source spatial image signals via the criteria used for the Stereo Audio Source Separation Evaluation Campaign. These criteria distinguish spatial (or filtering) distortion, interference and artifacts. All source orderings are tested and the ordering leading to the best SIR is selected.
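
A similar hedged call applies to the spatial image criteria (again, the function name below is that of a recent BSS_EVAL release):

  % ime and im: [number of sources x samples x 2 channels] arrays of estimated and true
  % source spatial images
  [SDR, ISR, SIR, SAR, perm] = bss_eval_images(ime, im);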

Performance will be compared to that of ideal binary masking as a benchmark (i.e. binary masks providing maximum SDR), computed over an STFT or a cochleagram.
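
As a hedged sketch of the STFT variant, assigning each time-frequency bin to the source whose true image has the largest magnitude approximates the maximum-SDR binary mask (the actual benchmark code may differ, e.g. in its STFT parameters or in using a cochleagram; all variable names below are illustrative):

  % X: STFT of one mixture channel; S{j}: STFT of the true image of source j on that
  % channel, computed with the same analysis parameters
  n = numel(S);
  mag = zeros([size(X) n]);
  for j = 1:n
      mag(:, :, j) = abs(S{j});
  end
  [dummy, winner] = max(mag, [], 3);   % dominant source in each time-frequency bin
  Se = cell(1, n);
  for j = 1:n
      Se{j} = X .* (winner == j);      % masked mixture STFT = estimated source image
  end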

The above performance criteria and benchmarks are implemented in the Matlab reference software. Example uses are again given in example_inst.m and example_conv.m.
Note: the computation of these criteria may take some time, due to the need to compute the best source ordering and the actual filter distortion between the estimated sources and the true sources.

Potential participants

  • Dan Barry (dan.barry (a) dit_ie)
  • Pau Bofill (pau (a) ac_upc_edu)
  • Andreas Ehmann (aehmann (a) uiuc_edu)
  • Vikrham Gowreesunker (gowr0001 (a) umn_edu)
  • Matt Kleffner (kleffner (a) uiuc_edu)
  • Nikolaos Mitianoudis (n.mitianoudis (a) imperial_ac_uk)
  • Hiroshi Sawada (sawada (a) cslab_kecl_ntt_co_jp)
  • Emmanuel Vincent (emmanuel.vincent (a) irisa_fr)
  • Ming Xiao (xiaoming1968 (a) 163_com)
  • Ron Weiss (ronw (a) ee_columbia_edu)
  • Michael Mandel (mim (a) ee_columbia_edu)
  • Shoko Araki (shoko (a) cslab_kecl_ntt_co_jp)
  • Yosuke Izumi (izumi (a) hil_t_u-tokyo_ac_jp)
  • Taesu Kim (taesu (a) ucsd_edu)
  • Maximo Cobos (mcobos (a) iteam_upv_es)
  • John Woodruff (woodruff.95 (a) osu_edu)
  • Antonio Rebordao (antonio (a) gavo_t_u-tokyo_ac_jp)

Task proposed by: Emmanuel Vincent, Shoko Araki, Pau Bofill
