High-fidelity noise reduction
with differentiable signal processing

Christian J. Steinmetz1     Thomas Walther2     Joshua D. Reiss1

1Centre for Digital Music, Queen Mary University of London

2Tape It Music GmbH

Abstract


Noise reduction techniques based on deep learning have demonstrated impressive performance in enhancing the overall quality of recorded speech. While these approaches are highly performant, their application in audio engineering can be limited due to a number of factors. These include operation only on speech without support for music, lack of real-time capability, lack of interpretable control parameters, operation at lower sample rates, and a tendency to introduce artifacts. On the other hand, signal processing-based noise reduction algorithms offer fine-grained control and operation on a broad range of content; however, they often require manual operation to achieve the best results. To address the limitations of both approaches, in this work we introduce a method that leverages a signal processing-based denoiser that, when combined with a neural network controller, enables fully automatic and high-fidelity noise reduction on both speech and music signals. We evaluate our proposed method with objective metrics and a perceptual listening test. Our evaluation reveals that speech enhancement models can be extended to music; however, training the model to remove only stationary noise is critical. Furthermore, our proposed approach achieves performance on par with the deep learning models, while being significantly more efficient and introducing fewer artifacts in some cases.
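To make the pipeline concrete, the sketch below shows one way a neural controller can drive a differentiable signal-processing denoiser so that the whole system is trainable end-to-end. It uses a simple per-band spectral gate as the denoiser; the module and parameter names are illustrative and do not reproduce the exact architecture from the paper.

```python
import torch

class NeuralControllerDenoiser(torch.nn.Module):
    """Sketch of a neural network that predicts the parameters of a
    differentiable denoiser (here, a coarse per-band spectral gate).
    Illustrative only; not the exact architecture from the paper."""

    def __init__(self, n_fft: int = 2048, hop: int = 512, n_bands: int = 32):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        n_bins = n_fft // 2 + 1
        # assign each STFT bin to one of n_bands coarse bands
        self.register_buffer("band_map", torch.linspace(0, n_bands - 1e-3, n_bins).long())
        # small controller that maps the average spectrum to per-band thresholds
        self.controller = torch.nn.Sequential(
            torch.nn.Linear(n_bins, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, n_bands),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples)
        window = torch.hann_window(self.n_fft, device=x.device)
        X = torch.stft(x, self.n_fft, self.hop, window=window, return_complex=True)
        mag = X.abs()  # (batch, bins, frames)

        # the controller observes the long-term spectrum and predicts a
        # normalized threshold for each band
        thresholds = self.controller(mag.mean(dim=-1)).sigmoid()  # (batch, bands)
        per_bin = thresholds[:, self.band_map].unsqueeze(-1)      # (batch, bins, 1)

        # differentiable soft gate: attenuate bins that fall below the threshold
        norm_mag = mag / (mag.amax(dim=(1, 2), keepdim=True) + 1e-8)
        gain = torch.sigmoid(10.0 * (norm_mag - per_bin))
        Y = X * gain

        return torch.istft(Y, self.n_fft, self.hop, window=window, length=x.shape[-1])
```

Because every operation in the gate is differentiable, the controller can be trained directly on a loss between the denoised output and a clean reference, which is what allows a signal-processing denoiser of this kind to be operated fully automatically.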

Citation


                
    @inproceedings{steinmetz2023highfidelity,
        title={High-fidelity noise reduction with differentiable signal processing},
        author={Steinmetz, Christian J. and Walther, Thomas and Reiss, Joshua D.},
        booktitle={155th Convention of the Audio Engineering Society},
        year={2023}
    }

Audio Examples


Speech enhancement systems for music

While speech enhancement models such as Adobe Enhance Speech and DeepFilterNet have achieved impressive performance in improving the quality of full-band speech signals, these systems cannot be used to denoise non-speech signals. When run on music recordings, they either corrupt the musical content, fail to remove any noise, or remove the music entirely. The following examples demonstrate the results of using speech enhancement models on non-speech sources compared to our proposed approach, which works on all audio sources.

Each example compares the noisy input against Adobe Enhance Speech, DeepFilterNet2 (Schröter et al.), and Tape It (ours). Sources: AcGtr + Vocal 1, Classical Guitar, AcGtr, AcGtr + Vocal 2, and Jazz Piano.
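For reference, the open-source speech enhancement baseline above can be run offline from its published toolkit. The snippet below sketches inference with DeepFilterNet2, following the usage pattern shown in the DeepFilterNet repository; the file names are placeholders and the exact API may differ between releases.

```python
# Sketch of running DeepFilterNet2 on a file, based on the usage
# documented in the DeepFilterNet repository (API may vary by version).
from df.enhance import enhance, init_df, load_audio, save_audio

model, df_state, _ = init_df()                              # load the default pretrained model
audio, _ = load_audio("noisy_music.wav", sr=df_state.sr())  # resample to the model sample rate
enhanced = enhance(model, df_state, audio)                  # run the enhancement model
save_audio("enhanced_music.wav", enhanced, df_state.sr())
```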


Listening test stimuli

The following examples are the denoised recordings used in the perceptual listening test. The results from the listening test are shown below in the boxplot. We compare our proposed method (Tape It) against variants of HDemucs trained on the same dataset, as well as iZotope RX Spectral Denoise. We manually adjust the iZotope denoiser, selecting a noise-only section when one is available; otherwise, we use the automatic mode. We also compare against our model without Stage 2 training.

Each stimulus (A, B, C, E, F, G, H, I, J, K, L) is presented as the noisy input alongside outputs from Tape It (ours), Tape It (Stage 1) (ours), HDemucs, HDemucs (DNS), and iZotope.

Test set


The following are selected examples from the held-out test dataset. Here we compare our approach (Tape It) against other models trained on our dataset, including HDemucs and DCUNet. We also compare against RNNoise, which was pretrained for speech enhancement.

Clean sources are drawn from GuitarSet, DSD100, and VCTK, with noise from Freesound and Tape It internal recordings. Each example presents the source, the noise, and the noisy mixture alongside outputs from Tape It (ours), HDemucs, HDemucs (DNS), DCUNet, and RNNoise.
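As an illustration of how test examples of this kind are typically assembled, the sketch below mixes a clean source with a noise recording at a target signal-to-noise ratio. This is a generic construction under our own assumptions, not necessarily the exact procedure used to build the dataset.

```python
import torch

def make_noisy_mixture(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a clean signal with noise at a target SNR in dB.

    Illustrative only; the dataset in the paper may use a different
    scaling or loudness convention.
    """
    # trim or loop the noise so it matches the clean signal length
    if noise.shape[-1] < clean.shape[-1]:
        reps = clean.shape[-1] // noise.shape[-1] + 1
        noise = noise.repeat(reps)
    noise = noise[..., : clean.shape[-1]]

    # scale the noise to achieve the requested SNR
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    noise = noise * torch.sqrt(target_noise_power / noise_power)

    return clean + noise

# example: a 10 dB SNR mixture from random placeholder signals
clean = torch.randn(48000)
noise = torch.randn(16000)
noisy = make_noisy_mixture(clean, noise, snr_db=10.0)
```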