Head-geometry mixtures of two speech sources in real environments, impinging from many directions
We propose a set of stereo recordings of two stationary speech sources, located at various spatial positions in an anechoic chamber and in an office room, received by an artificial dummy head equipped with hearing aid microphones.
Results
See the results webpage
Test data
Download anechoic.zip (166MB)
Download office.zip (185MB)
These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors of the dataset are Hendrik Kayser and Jörn Anemüller. The collection of the dataset was supported by the European Commission under IP DIRAC.
The sound sources are generated by convolving clean speech signals taken from the NOIZEUS speech corpus (P. C. Loizou: Speech Enhancement: Theory and Practice, CRC Press, 2007) with a multitude of impulse responses measured in two real rooms. The impulse responses were measured using a dummy head equipped with binaural hearing aid microphones, and a high SNR of the measurements was ensured, so that real-room recordings can be reproduced with high fidelity. The signals received by the hearing aid microphones were recorded directly, without being affected by any further processing steps in the hearing aids.
For each recording, the sources are located on a circle around the receiver at a fixed distance. In the anechoic chamber, the source-to-receiver distance is 3 meters and the source position is varied over a full circle in steps of 20°, yielding 18 stereo signals per source. In the office room, the distance is 1 meter and the source positions lie on the front hemisphere of the artificial head, ranging from 90° to the left to 90° to the right in steps of 10°, yielding 19 stereo signals per source.
By superimposing the provided stereo signal of source one at one angle with the stereo signal of source two at one of the remaining 17 or 18 different angles, a total of 17*18=306 spatially non-degenerate mixtures can be obtained for the anechoic situation and 18*19=342 for the office room.
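As a sanity check, the position grids and mixture counts above can be reproduced with a few lines (Python used here for illustration; the angle values are only labels and do not appear in this form in the archives):

```python
# Anechoic chamber: full circle in 20-degree steps -> 18 positions.
anechoic_angles = list(range(0, 360, 20))
# Office room: front hemisphere, -90 to +90 degrees in 10-degree steps -> 19 positions.
office_angles = list(range(-90, 100, 10))

def count_mixtures(angles):
    """Ordered (src1, src2) pairs with two distinct positions: n * (n - 1)."""
    n = len(angles)
    return n * (n - 1)

print(len(anechoic_angles), count_mixtures(anechoic_angles))  # 18 306
print(len(office_angles), count_mixtures(office_angles))      # 19 342
```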
The difficulty of separating the two sources depends on their spatial positions. Hence, the provided data allow a systematic evaluation and comparison of the robustness with respect to spatial variability of different algorithms. Separation can be performed under benign anechoic and more challenging reverberant conditions.
The data consist of WAV audio files, each containing one possible stereo mixture of both sources. The files are named as follows:
<situation>_src1_<pos1>-src2_<pos2>_16kHz.wav
where <situation> is either 'anechoic' or 'office', and <pos1> and <pos2> denote the angles of incidence of source one and source two according to the coordinate system. The sampling frequency is 16 kHz.
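A small helper can enumerate the expected file names (a sketch only; the exact angle formatting in the archives, e.g. zero-padding or sign conventions, is an assumption to verify against the actual file listing):

```python
def mixture_filename(situation, pos1, pos2):
    """File name for the mixture of source one at pos1 and source two at pos2,
    following the stated convention. Plain integer-degree formatting is an
    assumption."""
    return f"{situation}_src1_{pos1}-src2_{pos2}_16kHz.wav"

print(mixture_filename("anechoic", 20, 40))  # anechoic_src1_20-src2_40_16kHz.wav
```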
Development data
Download anechoic_dev.zip (3.6MB)
Download office_dev.zip (3.6MB)
These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors of the dataset are Hendrik Kayser and Jörn Anemüller. The collection of the dataset was supported by the European Commission under IP DIRAC.
The development data sets consist of two different mixtures of two sound sources in each environment (anechoic and office), with three stereo WAV audio files per mixture, named in the following way:
<situation>_{mix,src1,src2}_16kHz_dev{1,2}.wav
and two mono WAV audio files with the underlying source signals:
<situation>_src{1,2}_clean_16kHz_dev.wav.
Tasks
Based on the outcomes of the panel discussion at ICA'07, the source separation problem has been split into three tasks:
- mixing system estimation (estimate the frequency-dependent mixing matrix)
- source signal estimation (estimate the mono source signals)
- source spatial image estimation (estimate the stereo contribution of each source to the two mixture channels)
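The three tasks refer to one underlying model: in the STFT domain, each frequency bin is approximately an instantaneous 2 x 2 mixture. A minimal NumPy sketch with toy dimensions (all names and sizes hypothetical) shows how the mixing matrix (task 1), the mono sources (task 2) and the spatial images (task 3) relate:

```python
import numpy as np

rng = np.random.default_rng(0)
nbin, nframes = 4, 10  # toy sizes; real data would use the STFT of 16 kHz audio

# Frequency-dependent 2x2 mixing matrix (task 1 target), one matrix per bin.
A = rng.standard_normal((2, 2, nbin)) + 1j * rng.standard_normal((2, 2, nbin))
# Mono source STFTs (task 2 target): 2 sources x nbin x nframes.
S = rng.standard_normal((2, nbin, nframes)) + 1j * rng.standard_normal((2, nbin, nframes))

# Per-bin instantaneous mixing approximates the convolutive mixture:
# X[:, f, :] = A[:, :, f] @ S[:, f, :]
X = np.einsum('ijf,jft->ift', A, S)

# The spatial image of source j (task 3 target) keeps only that source's column.
img1 = np.einsum('ijf,jft->ift', A[:, :1, :], S[:1])
img2 = np.einsum('ijf,jft->ift', A[:, 1:, :], S[1:])
assert np.allclose(X, img1 + img2)  # the spatial images sum to the mixture
```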
Submission
Each participant is asked to submit the results of his/her algorithm for task 2 or 3 over all the mixtures from either or both test sets.
The results for task 1 may also be submitted if possible. They will help diagnosing the performance of various parts of the algorithm when available.
Due to the large amount of data, each participant should make his/her results available online in the form of an archive called anechoic_results.zip or office_results.zip.
The included files must be named as follows:
- <situation>_src1_<pos1>-src2_<pos2>_16kHz_src<j>.wav: estimated source <j>, mono WAV file sampled at 16 kHz
- <situation>_src1_<pos1>-src2_<pos2>_16kHz_sim<j>.wav: estimated spatial image of source <j>, stereo WAV file sampled at 16 kHz
- <situation>_src1_<pos1>-src2_<pos2>_16kHz_mixing.mat: estimated mixing system, Matlab MAT file containing a 2 x 2 x nbin frequency-dependent mixing matrix where nbin is the chosen number of STFT bins
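For the MAT file, something like the following would produce the required 2 x 2 x nbin array (a sketch using SciPy; the file name is one hypothetical example, and the variable name stored inside the file, 'A' here, is not specified by the rules above and is an assumption):

```python
import os
import tempfile

import numpy as np
from scipy.io import savemat, loadmat

nbin = 512  # chosen number of STFT bins
A = np.zeros((2, 2, nbin), dtype=complex)  # placeholder estimated mixing system

# The variable name inside the MAT file ('A') is an assumption; the task text
# only fixes the file name pattern and the 2 x 2 x nbin shape.
path = os.path.join(tempfile.gettempdir(), "anechoic_src1_0-src2_40_16kHz_mixing.mat")
savemat(path, {"A": A})
assert loadmat(path)["A"].shape == (2, 2, nbin)
```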
Each participant should then send an email to the organizers providing
- contact information (name, affiliation)
- basic information about his/her algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
- the URL of the tarball(s)
Note that the submitted audio files will be made available on a website under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.
Evaluation criteria
We plan to use the same evaluation criteria as for the under-determined speech and music mixtures dataset, so that results are comparable.

The estimated mixing systems will be evaluated using an SNR-like criterion that we call the Mixing Error Ratio (MER), expressed in decibels. This criterion is computed in each frequency bin between the estimated mixing matrix and the true matrix, allowing arbitrary scaling for each source, and averaged over frequency. All source orderings are tested and the ordering leading to the best MER is selected.
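The exact MER computation is defined by the campaign's evaluation code; as an illustration only, a simplified per-bin, scale-invariant version with the best-ordering search might look like this (all names hypothetical, not the official metric):

```python
import numpy as np

def mer_db(A_true, A_est):
    """Simplified Mixing Error Ratio sketch: per frequency bin and per source,
    scale the estimated mixing column optimally onto the true column (arbitrary
    scaling is allowed) and measure the residual energy in dB; average over
    bins and sources. Both source orderings are tried and the best is kept."""
    def one_order(est):
        vals = []
        for f in range(A_true.shape[2]):
            for j in range(2):
                a, b = A_true[:, j, f], est[:, j, f]
                scale = np.vdot(b, a) / np.vdot(b, b)  # least-squares scaling
                err = a - scale * b
                num = np.vdot(a, a).real
                den = max(np.vdot(err, err).real, 1e-30)  # guard against log(0)
                vals.append(10 * np.log10(num / den))
        return float(np.mean(vals))
    return max(one_order(A_est), one_order(A_est[:, ::-1, :]))

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2, 8))
assert mer_db(A, 2.0 * A) > 100        # perfect up to scaling -> very high MER
assert mer_db(A, A[:, ::-1, :]) > 100  # permutation resolved by ordering search
```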
The estimated source signals will be evaluated via the criteria defined in the BSS_EVAL toolbox. These criteria allow an arbitrary filtering between the estimated source and the true source, and measure interference and artifact distortions separately. All source orderings are tested and the ordering leading to the best SIR is selected.
Similarly, the estimated spatial source image signals will be evaluated via the criteria used for the Stereo Audio Source Separation Evaluation Campaign. These criteria distinguish spatial (or filtering) distortion, interference and artifacts. All source orderings are tested and the ordering leading to the best SIR is selected.
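The ordering search common to both signal criteria can be illustrated with a deliberately simplified SIR that omits BSS_EVAL's filter allowance (a sketch under that assumption, not the official implementation):

```python
import numpy as np

def sir_db(true, est):
    """Project the estimate onto the true source; the remainder counts as
    interference. (BSS_EVAL additionally allows an arbitrary filter between
    estimate and true source, which is omitted here for brevity.)"""
    target = (np.dot(est, true) / np.dot(true, true)) * true
    interf = est - target
    return 10 * np.log10(np.dot(target, target) / max(np.dot(interf, interf), 1e-30))

def best_ordering(true1, true2, est1, est2):
    """Try both source orderings and keep the one with the higher mean SIR."""
    keep = (sir_db(true1, est1) + sir_db(true2, est2)) / 2
    swap = (sir_db(true1, est2) + sir_db(true2, est1)) / 2
    return ("keep", keep) if keep >= swap else ("swap", swap)

rng = np.random.default_rng(2)
s1, s2 = rng.standard_normal(1000), rng.standard_normal(1000)
order, _ = best_ordering(s1, s2, s2 + 0.01 * s1, s1 + 0.01 * s2)
assert order == "swap"  # the estimates came out in the opposite order
```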
The above performance criteria are implemented in the provided evaluation code; example uses are given in example_conv.m.
Note: the computation of these criteria may take some time, due to the need to compute the best ordering and the actual filter distortion between the estimated sources and the true sources.
Participants will also be asked to submit the relative running time of their algorithm, expressed in seconds per test excerpt and per GHz of CPU.
Potential participants
If you might consider participating, please add your name and email address here and sign up for the mailing list to receive further announcements:
- Jörn Anemüller (joern.anemueller (a) uni-oldenburg_de)
- Hendrik Kayser (hendrik.kayser (a) uni-oldenburg_de)
- Ron Weiss (ronw (a) ee_columbia_edu)
- Michael Mandel (mim (a) ee_columbia_edu)
- John Woodruff (woodruff.95 (a) osu_edu)
Task proposed by: Hendrik Kayser, Jörn Anemüller
Support by the European Commission under the integrated project DIRAC (Detection and Identification of Rare Audio-visual Cues), IST-027787, is gratefully acknowledged. Thanks to Volker Hohmann, Stephan Ewert and Thomas Rohdenburg, who have made significant contributions to the development and implementation of the measurement techniques, and to HörTech, the German centre of competence on hearing technology, for its support.