This page presents audio examples of the separation methods compared in the paper below, both on synthetic mixtures from the test set and on real-life recordings.

We compare our proposed IS³ model with three baselines: the Harmonic-Percussive Sound Separation (HPSS) method, adapted from music signal processing research and run with two different margin parameters ($p_m = 1$ and $p_m = 2$); the wavelet-based method of Nongpiur [1]; and a Conv-TasNet model [2] trained on the same data as IS³.

References

[1] R. C. Nongpiur. Impulse noise removal in speech using wavelets. In ICASSP, 2008.

[2] Y. Luo and N. Mesgarani. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.

Synthetic examples on the test set

These first examples come from the test set, which was generated with the same pipeline as the training and validation datasets described in the paper.

Each example selected from the test set consists of the following components:

  • A stationary background track
  • An impulsive sounds track
  • A mixture track

The mixture track is the input to each of the four separation methods (HPSS being run with two margin settings), which extract the stationary and impulsive components from it.

Example 1

Clean signals and mix

Impulsive sounds
Stationary Background
Mix


Separated signals with the different methods

Impulsive sounds Stationary Background
IS³
HPSS $p_m = 1$
HPSS $p_m = 2$
Nongpiur
Conv-TasNet


Example 2

Clean signals and mix

Impulsive sounds
Stationary Background
Mix


Separated signals with the different methods

Impulsive sounds Stationary Background
IS³
HPSS $p_m = 1$
HPSS $p_m = 2$
Nongpiur
Conv-TasNet


Example 3: empty impulsive track

Clean signals and mix

Impulsive sounds
Stationary Background
Mix


Separated signals with the different methods

Impulsive sounds Stationary Background
IS³
HPSS $p_m = 1$
HPSS $p_m = 2$
Nongpiur
Conv-TasNet


Observations: Both HPSS configurations show significant leakage: stationary content remains in the separated impulsive track, and impulsive content remains in the separated stationary track. Increasing the margin parameter to $p_m = 2$ reduces the leakage of the stationary background into the impulsive track.
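As a rough illustration of why a larger margin reduces leakage, the sketch below computes median-filtering HPSS masks with a margin on a toy spectrogram. It is not the implementation evaluated on this page: the function names, the kernel size, the toy data, and the assumption that $p_m$ acts as a threshold on the ratio between the two median-filtered spectrograms are all hypothetical choices made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_1d(spec, size, axis):
    """Median-filter a 2-D array along one axis (odd size, edge padding)."""
    pad = size // 2
    pad_width = [(0, 0), (0, 0)]
    pad_width[axis] = (pad, pad)
    padded = np.pad(spec, pad_width, mode="edge")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)


def hpss_masks(spec, margin=1.0, kernel=9, eps=1e-10):
    """Return binary (stationary, impulsive) masks for a magnitude spectrogram.

    spec has shape (frequency, time). The margin is assumed to play the
    role of p_m: a bin is kept only if one component exceeds the other
    by at least this factor.
    """
    # Filtering along time keeps horizontal (sustained) structures.
    stationary = median_filter_1d(spec, kernel, axis=1)
    # Filtering along frequency keeps vertical (broadband, short) structures.
    impulsive = median_filter_1d(spec, kernel, axis=0)
    mask_stationary = stationary / (impulsive + eps) > margin
    mask_impulsive = impulsive / (stationary + eps) >= margin
    return mask_stationary, mask_impulsive


# Toy spectrogram: low-level noise, one sustained tone, one broadband click.
rng = np.random.default_rng(0)
spec = rng.uniform(0.0, 0.1, size=(32, 64))
spec[10, :] = 1.0   # sustained tone at frequency bin 10
spec[:, 30] = 1.0   # click at time frame 30

stat1, imp1 = hpss_masks(spec, margin=1.0)
stat2, imp2 = hpss_masks(spec, margin=2.0)

print("bins in impulsive mask, margin 1:", int(imp1.sum()))
print("bins in impulsive mask, margin 2:", int(imp2.sum()))
```

In this formulation, a margin of 1 assigns almost every time-frequency bin to one of the two tracks, so ambiguous background bins can end up in the impulsive mask. With a margin of 2, a bin is kept only when one component clearly dominates, and ambiguous bins are assigned to neither mask.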

Nongpiur’s method performs poorly on the impulsive track, introducing audio artefacts. On the stationary track, the original method from Nongpiur’s article only attenuates impulsive sounds, which remain partially present. It is worth noting that the parameter selection in this wavelet-based approach is highly dependent on the type of impulses and ambient sounds (speech in the original article). While efforts were made to optimize parameters for our context, the diversity of sound types and acoustic scenes in our study leads to inconsistent performance across examples.

Conv-TasNet outperforms the other baselines but still leaks some of the stationary background into the impulsive track. In contrast, IS³ delivers the best separation, producing a clean impulsive track and a clean stationary track. However, it slightly attenuates the resonance of impulsive sounds, making them sound slightly drier than in the target track. Additionally, in examples where the impulsive track’s frequency components are more dispersed, separation is harder, and the level of the background track drops slightly during impulse events.

Notably, in the third example where the impulsive track is silent, IS³ successfully generates a completely silent impulsive track, whereas other methods introduce artefacts or stationary background leakage.
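For readers who download the audio and want to check silence numerically rather than by ear, the following minimal sketch measures the RMS level of a track in dB relative to full scale. The function names, the -60 dB threshold, and the two synthetic test tracks are illustrative assumptions, not values taken from the paper.

```python
import math


def rms_dbfs(samples, floor_db=-120.0):
    """RMS level in dB relative to full scale (amplitude 1.0).

    Returns floor_db for empty or all-zero input, since log10(0) is undefined.
    """
    n = len(samples)
    if n == 0:
        return floor_db
    mean_square = sum(x * x for x in samples) / n
    if mean_square <= 0.0:
        return floor_db
    return max(floor_db, 10.0 * math.log10(mean_square))


def is_silent(samples, threshold_db=-60.0):
    """Treat a track as silent if its RMS level is at or below the threshold.

    The threshold is a hypothetical choice for illustration.
    """
    return rms_dbfs(samples) <= threshold_db


# Hypothetical tracks: one second at 16 kHz.
silent_track = [0.0] * 16000
leaky_track = [0.01 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]

print("silent track is silent:", is_silent(silent_track))
print("leaky track is silent:", is_silent(leaky_track))
```

A track with residual background leakage, even at low amplitude, has a measurable RMS level, whereas a truly empty track sits at the floor value.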

Examples on real recordings

To better assess the performance of the model, we also test it on real-life recordings, for which no ground truth is available for either the impulsive or the stationary source.

Example 1

Original recording

Mix


Separated signals with the different methods

Impulsive sounds Stationary Background
IS³
HPSS $p_m = 1$
HPSS $p_m = 2$
Nongpiur
Conv-TasNet


Example 2

Original recording

Mix


Separated signals with the different methods

Impulsive sounds Stationary Background
IS³
HPSS $p_m = 1$
HPSS $p_m = 2$
Nongpiur
Conv-TasNet