Noise reduction techniques based on deep learning have demonstrated impressive performance in
enhancing the overall quality of recorded speech. While these approaches are highly performant,
their application in audio engineering can be limited by a number of factors. These include
operation only on speech without support for music, lack of real-time capability, lack of
interpretable control parameters, operation at lower sample rates, and a tendency to introduce
artifacts. On the other hand, signal processing-based noise reduction algorithms offer
fine-grained control and operate on a broad range of content; however, they often require
manual operation to achieve the best results. To address the limitations of both approaches, in
this work we introduce a method that leverages a signal processing-based denoiser that, when
combined with a neural network controller, enables fully automatic and high-fidelity noise
reduction on both speech and music signals. We evaluate our proposed method with objective
metrics and a perceptual listening test. Our evaluation reveals that speech enhancement models
can be extended to music; however, training the model to remove only stationary noise is
critical. Furthermore, our proposed approach achieves performance on par with deep learning
models, while being significantly more efficient and introducing fewer artifacts in some cases.
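To make the general pattern concrete, the following is a minimal PyTorch sketch of the idea described above: a lightweight neural controller predicts the parameters of a differentiable signal processing denoiser, here implemented as a soft spectral gate. The class name, gate design, and hyperparameters are illustrative assumptions of ours, not the paper's actual implementation.

import torch
import torch.nn as nn

class NeuralControlledDenoiser(nn.Module):
    """Illustrative sketch (not the authors' exact design): a small
    network predicts per-frame, per-band thresholds for a
    differentiable spectral gate."""

    def __init__(self, n_fft: int = 1024, hop: int = 256, hidden: int = 128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        n_bins = n_fft // 2 + 1
        # Controller: maps per-frame log-magnitude spectra to gate thresholds.
        self.controller = nn.Sequential(
            nn.Linear(n_bins, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_bins),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples) noisy mono audio
        window = torch.hann_window(self.n_fft, device=x.device)
        X = torch.stft(x, self.n_fft, self.hop, window=window, return_complex=True)
        mag = X.abs()  # (batch, bins, frames)
        feats = torch.log1p(mag).transpose(1, 2)  # (batch, frames, bins)
        # Predicted thresholds in (0, 1), one per time-frequency bin.
        thresh = torch.sigmoid(self.controller(feats)).transpose(1, 2)
        # Soft spectral gate: attenuate bins whose relative magnitude falls
        # below the predicted threshold; the sigmoid keeps it differentiable.
        rel = mag / (mag.amax(dim=(1, 2), keepdim=True) + 1e-8)
        gain = torch.sigmoid(20.0 * (rel - thresh))
        Y = X * gain
        return torch.istft(Y, self.n_fft, self.hop, window=window, length=x.shape[-1])

# Usage: because the gate is differentiable, the controller can be trained
# end to end against a loss between the denoised output and a clean target.
model = NeuralControlledDenoiser()
noisy = torch.randn(1, 48000)  # placeholder: 1 s of audio at 48 kHz
denoised = model(noisy)        # same shape as the input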
Citation
@inproceedings{steinmetz2023highfidelity,
  title={High-fidelity noise reduction with differentiable signal processing},
  author={Steinmetz, Christian J. and Walther, Thomas and Reiss, Joshua D.},
  booktitle={155th Convention of the Audio Engineering Society},
  year={2023}
}
Audio Examples
Speech enhancement systems for music
While speech enhancement models such as Adobe Enhance Speech and DeepFilterNet have achieved
impressive performance in improving the quality of full-band speech signals, these systems cannot
be used to denoise non-speech signals. When run on music recordings, they either corrupt the
musical content, fail to remove any noise, or remove the music signal entirely.
The following examples demonstrate the results of using speech enhancement models on non-speech
sources, compared to our proposed approach, which works on all audio sources.
[Audio example table: for each recording (AcGtr + Vocal 1, Classical Guitar, AcGtr, AcGtr + Vocal 2, Jazz Piano), players compare the Noisy input, Adobe Enhance Speech, DeepFilterNet2 (Schröter et al.), and Tape It (ours).]
Listening test stimuli
The following examples are the denoised recordings used in the perceptual listening test. The
results of the listening test are shown in the boxplot below. We compare our proposed method
(Tape It) against variants of HDemucs trained on the same dataset, as well as iZotope RX Spectral
Denoise. We manually adjust the iZotope denoiser, selecting a noise-only section when one is
available; otherwise, we use the automatic mode. We also compare against our model without
stage 2 training.
[Audio example table: for each listening test item (A, B, C, E, F, G, H, I, J, K, L), players compare the Noisy input, Tape It (ours), Tape It (Stage 1) (ours), HDemucs, HDemucs (DNS), and iZotope.]
Test set
The following are selected examples from the held-out test dataset. Here we compare our approach
(Tape It) against other models trained on our dataset, including HDemucs and DCUNet, as well as
RNNoise, which was pretrained for speech enhancement.