Speech Enhancement and Dereverberation with Diffusion-based Generative Models
Richter, Welker, Lemercier et al.
In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a different corpus than used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse.
Building upon the authors' prior work, this paper employs diffusion-based generative models for speech enhancement. It gives a detailed exposition of the diffusion process, which is based on a stochastic differential equation (SDE), together with an in-depth theoretical analysis. Unlike conventional conditional generation tasks, the reverse process does not start from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches the forward process, whose drift term moves the state from clean speech toward noisy speech. The study shows that high-quality clean speech estimates can be generated with only 30 diffusion steps. Adapting the network architecture yields significant gains in speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of the original method.
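The reverse process described above can be illustrated with a minimal sketch. It assumes a drift of the form γ(y − x) and an exponentially growing diffusion coefficient g(t); neither is spelled out in this summary, and the parameter values, the start and end times, and the `score_fn` callback (a stand-in for the trained score network) are illustrative assumptions. The sampler is a plain Euler-Maruyama discretization on real-valued toy arrays, not necessarily the sampler used in the paper.

```python
import numpy as np

# Illustrative values only (assumptions, not taken from this summary).
GAMMA, SIGMA_MIN, SIGMA_MAX = 1.5, 0.05, 0.5
LOG_K = np.log(SIGMA_MAX / SIGMA_MIN)


def diffusion_coeff(t):
    """g(t): assumed exponentially growing diffusion coefficient."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t * np.sqrt(2.0 * LOG_K)


def kernel_std(t):
    """Std of x_t given x_0 under the assumed SDE (closed form)."""
    k = SIGMA_MAX / SIGMA_MIN
    var = (SIGMA_MIN ** 2 * (k ** (2 * t) - np.exp(-2 * GAMMA * t))
           * LOG_K / (GAMMA + LOG_K))
    return np.sqrt(var)


def reverse_sample(y, score_fn, n_steps=30, t_start=1.0, t_end=0.03, seed=0):
    """Euler-Maruyama integration of the reverse SDE, started near y.

    y        : noisy speech representation (real-valued toy array here)
    score_fn : hypothetical callable (x, y, t) -> estimate of grad log p_t(x|y)
    """
    rng = np.random.default_rng(seed)
    # Start from noisy speech plus Gaussian noise, not from pure noise.
    x = y + kernel_std(t_start) * rng.standard_normal(y.shape)
    ts = np.linspace(t_start, t_end, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t - t_next  # positive step size while moving backward in time
        g = diffusion_coeff(t)
        # Reverse-time drift: f(x, y) - g(t)^2 * score, with f = gamma * (y - x).
        drift = GAMMA * (y - x) - g ** 2 * score_fn(x, y, t)
        x = x - drift * dt + g * np.sqrt(dt) * rng.standard_normal(y.shape)
    return x
```

With the default `n_steps=30`, the score function is evaluated 30 times, which is the cost the paragraph above refers to. A predictor-corrector scheme would add further evaluations per step.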
Speech enhancement aims to recover clean speech signals from audio recordings affected by acoustic noise or reverberation. This is a classical signal processing problem with important applications in telephone communications, hearing aids, speech recognition, and related fields.
This paper aims to design a purely generative diffusion model that learns the prior distribution of clean speech to achieve high-quality speech enhancement and dereverberation in the complex STFT domain.
Innovative SDE Diffusion Process: Proposes a stochastic differential equation with a drift term, so that the forward process moves from clean speech toward noisy speech
Improved Network Architecture: Adopts the NCSN++ architecture in place of the original complex U-Net, significantly improving performance
Unified Framework: A single framework capable of handling both speech enhancement and dereverberation tasks
Comprehensive Evaluation: Includes cross-dataset evaluation, real-world data testing, and subjective listening experiments
Efficiency Optimization: Balances performance and computational speed through different sampler configurations
Theoretical Analysis: Provides detailed theoretical derivations and analysis of the diffusion process
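The forward process with a drift term can be made concrete with a small sketch of its perturbation kernel. It assumes an Ornstein-Uhlenbeck-style drift γ(y − x) and a diffusion coefficient that grows exponentially from `sigma_min` to `sigma_max`; this form and the default values are assumptions that the summary above does not state. Under that assumption, the mean of the state interpolates from clean speech `x0` toward noisy speech `y`, and the standard deviation has a closed form. The unit-variance circular complex noise in `sample_state` is likewise an illustrative convention.

```python
import numpy as np


def perturbation_kernel(x0, y, t, gamma=1.5, sigma_min=0.05, sigma_max=0.5):
    """Mean and std of x_t given x0 and y under the assumed drift SDE."""
    log_k = np.log(sigma_max / sigma_min)
    decay = np.exp(-gamma * t)
    # The mean decays exponentially from clean speech x0 toward noisy speech y.
    mean = decay * x0 + (1.0 - decay) * y
    # Closed-form variance of the assumed SDE at time t.
    var = (sigma_min ** 2 * ((sigma_max / sigma_min) ** (2 * t) - decay ** 2)
           * log_k / (gamma + log_k))
    return mean, np.sqrt(var)


def sample_state(x0, y, t, rng):
    """Draw x_t for complex STFT coefficients x0 (clean) and y (noisy)."""
    mean, std = perturbation_kernel(x0, y, t)
    # Circular complex Gaussian noise with unit variance (assumed convention).
    z = (rng.standard_normal(x0.shape)
         + 1j * rng.standard_normal(x0.shape)) / np.sqrt(2.0)
    return mean + std * z
```

At `t = 0` the kernel returns `x0` with zero standard deviation. As `t` grows, the mean approaches `y`, which is why the reverse process can start from noisy speech plus Gaussian noise.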
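The paper operates in the complex STFT domain and applies a magnitude compression transform to each complex coefficient $c$:
$$\tilde{c} = \beta \, |c|^{\alpha} \, e^{i \angle(c)}$$
where $\alpha \in (0, 1]$ is the compression exponent and $\beta \in \mathbb{R}_{+}$ is the scaling factor.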
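The transform and its inverse can be sketched in a few lines of NumPy. The function names and the default values of `alpha` and `beta` are illustrative assumptions; the summary above gives only the admissible ranges.

```python
import numpy as np


def compress(c, alpha=0.5, beta=0.15):
    """Compress the magnitude of complex STFT coefficients, keeping the phase."""
    return beta * np.abs(c) ** alpha * np.exp(1j * np.angle(c))


def decompress(c_tilde, alpha=0.5, beta=0.15):
    """Invert `compress`: undo the scaling, then the magnitude exponent."""
    return (np.abs(c_tilde) / beta) ** (1.0 / alpha) * np.exp(1j * np.angle(c_tilde))
```

Because only the magnitude is altered, applying `decompress` to an enhanced spectrogram restores the original scale before the inverse STFT, provided the same `alpha` and `beta` are used in both directions.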
Under mismatched conditions (trained on VB-DMD, tested on WSJ0-CHiME3), SGMSE+ outperforms other methods on all metrics, demonstrating superior generalization capability.
Compared to the original SGMSE, SGMSE+ improves POLQA by 0.75 and PESQ by 0.68, which underlines the importance of the network architecture.
The paper cites extensive related work, primarily including:
Song et al. (2021): Score-based generative modeling through stochastic differential equations
Lu et al. (2022): Conditional diffusion probabilistic model for speech enhancement
Vincent (2011): A connection between score matching and denoising autoencoders
Anderson (1982): Reverse-time diffusion equation models
Overall Assessment: This is a high-quality research paper that is strong in theory, method design, and experimental validation. It applies diffusion models to speech enhancement and, through its SDE design and improved network architecture, reaches performance comparable to discriminative models while generalizing better to unseen data. Despite concerns about computational cost and artifacts, its theoretical contributions and practical value make it an important work in the field.