Speech Enhancement and Dereverberation with Diffusion-based Generative Models

Richter, Welker, Lemercier et al.
In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a different corpus than used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse.

Basic Information

  • Paper ID: 2208.05830
  • Title: Speech Enhancement and Dereverberation with Diffusion-based Generative Models
  • Authors: Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann
  • Categories: eess.AS (Audio and Speech Processing), cs.LG (Machine Learning), cs.SD (Sound)
  • Publication Date: August 2022 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2208.05830
  • Code Link: https://github.com/sp-uhh/sgmse

Abstract

Building on the authors' prior work, this paper applies diffusion-based generative models to speech enhancement. It gives a detailed exposition of the diffusion process, which is based on a stochastic differential equation (SDE), together with an in-depth theoretical analysis. Unlike conventional conditional generation tasks, the reverse process does not start from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches the forward process, whose drift term moves clean speech toward noisy speech. The study shows that only 30 diffusion steps are needed to generate high-quality clean speech estimates. An adapted network architecture yields significant gains in speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of the original method.

Research Background and Motivation

Problem Definition

Speech enhancement aims to recover clean speech signals from audio recordings affected by acoustic noise or reverberation. This is a classical signal processing problem with important applications in telephone communications, hearing aids, speech recognition, and related fields.

Limitations of Existing Methods

  1. Limitations of Discriminative Models:
    • Difficulty in covering all possible acoustic conditions in training data
    • May produce unnatural speech distortions
    • Limited generalization capability
  2. Issues with VAE-based Generative Models:
    • Dimensionality reduction constraints in latent layers
    • Encoder sensitivity to noisy inputs
    • Dependence on linear noise models
  3. Shortcomings of Existing Diffusion Models:
    • CDiffuSE requires explicit environmental noise estimation
    • Poor preservation of high-frequency information

Research Motivation

This paper aims to design a purely generative diffusion model that learns the prior distribution of clean speech to achieve high-quality speech enhancement and dereverberation in the complex STFT domain.

Core Contributions

  1. Innovative SDE Diffusion Process: Proposes a stochastic differential equation with drift term that enables the forward process to transition from clean speech to noisy speech
  2. Improved Network Architecture: Adopts NCSN++ architecture replacing the original complex U-Net, significantly enhancing performance
  3. Unified Framework: A single framework capable of handling both speech enhancement and dereverberation tasks
  4. Comprehensive Evaluation: Includes cross-dataset evaluation, real-world data testing, and subjective listening experiments
  5. Efficiency Optimization: Balances performance and computational speed through different sampler configurations
  6. Theoretical Analysis: Provides detailed theoretical derivations and analysis of the diffusion process

Methodology Details

Task Definition

  • Input: Noisy/reverberant speech signal $y$
  • Output: Clean speech signal $x_0$
  • Constraint: Preserve speech naturalness and intelligibility

Data Representation

The paper operates in the complex STFT domain and applies a magnitude compression transformation to each coefficient $c$: $\tilde{c} = \beta |c|^{\alpha} e^{i\angle(c)}$, where $\alpha \in (0, 1]$ is the compression exponent and $\beta \in \mathbb{R}^+$ is the scaling factor.
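As a minimal sketch of this transformation, the snippet below compresses a single complex STFT coefficient and inverts the operation. The function names and the default values `alpha=0.5` and `beta=0.15` are illustrative assumptions; this summary does not state the values used in the paper.

```python
import cmath

def compress(c: complex, alpha: float = 0.5, beta: float = 0.15) -> complex:
    """Apply c_tilde = beta * |c|^alpha * exp(i * angle(c)) to one coefficient.

    alpha and beta defaults are illustrative assumptions, not values from the text.
    """
    return beta * abs(c) ** alpha * cmath.exp(1j * cmath.phase(c))

def decompress(c_tilde: complex, alpha: float = 0.5, beta: float = 0.15) -> complex:
    """Invert compress(): undo the scaling first, then the magnitude exponent."""
    magnitude = (abs(c_tilde) / beta) ** (1.0 / alpha)
    return magnitude * cmath.exp(1j * cmath.phase(c_tilde))

if __name__ == "__main__":
    c = 3 + 4j                      # |c| = 5
    c_tilde = compress(c)
    print(abs(c_tilde))             # magnitude is compressed, phase is unchanged
    print(decompress(c_tilde))      # recovers c up to floating-point error
```

Only the magnitude is altered; the phase passes through unchanged, so the mapping is invertible for any $\alpha \in (0, 1]$ and $\beta > 0$.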

Stochastic Differential Equation Design

Forward Process

Define a linear SDE: $\mathrm{d}x_t = f(x_t, y)\,\mathrm{d}t + g(t)\,\mathrm{d}w$

where:

  • Drift coefficient: $f(x_t, y) = \gamma (y - x_t)$
  • Diffusion coefficient: $g(t) = \sigma_{\min} \left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^{t} \sqrt{2 \log\left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)}$
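The two coefficients can be written down directly, and because the drift is linear in $x_t$, the distribution of $x_t$ given $(x_0, y)$ is Gaussian with a closed-form mean and variance. The sketch below implements the coefficients and that closed form. The closed form is a derivation from the SDE as stated here (solving the mean and variance ODEs), not a formula quoted in this summary, and all function names are hypothetical.

```python
import math

# SDE parameters as listed under Implementation Details in this summary.
SIGMA_MIN, SIGMA_MAX, GAMMA = 0.05, 0.5, 1.5

def drift(x_t, y, gamma=GAMMA):
    """Drift f(x_t, y) = gamma * (y - x_t): pulls the state toward the noisy signal y."""
    return gamma * (y - x_t)

def diffusion(t, s_min=SIGMA_MIN, s_max=SIGMA_MAX):
    """Diffusion coefficient g(t), growing exponentially in t."""
    rho = math.log(s_max / s_min)
    return s_min * (s_max / s_min) ** t * math.sqrt(2.0 * rho)

def kernel_mean(x0, y, t, gamma=GAMMA):
    """Mean of x_t: exponential interpolation from x0 (t = 0) toward y."""
    w = math.exp(-gamma * t)
    return w * x0 + (1.0 - w) * y

def kernel_std(t, s_min=SIGMA_MIN, s_max=SIGMA_MAX, gamma=GAMMA):
    """Std of x_t, from solving v'(t) = -2 * gamma * v(t) + g(t)^2 with v(0) = 0."""
    rho = math.log(s_max / s_min)
    var = s_min ** 2 * ((s_max / s_min) ** (2 * t) - math.exp(-2 * gamma * t))
    return math.sqrt(var * rho / (gamma + rho))

if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0):
        print(t, kernel_mean(1.0, 2.0, t), kernel_std(t))
```

At $t = 0$ the mean equals $x_0$ and the standard deviation is zero; as $t$ grows the mean drifts toward $y$ while the noise level increases, which is the "clean speech to noisy speech" behaviour the drift term is meant to produce.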

Reverse Process

The corresponding reverse SDE is: $\mathrm{d}x_t = \left[f(x_t, y) - g(t)^2 s_\theta(x_t, y, t)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}$

where $s_\theta(x_t, y, t)$ is the learnable score function.
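To illustrate how the reverse SDE is integrated backward in time, here is a scalar sketch using a plain Euler-Maruyama discretisation. Several things are assumptions made for the example: the paper uses a predictor-corrector sampler rather than this simpler scheme, the stopping time `t_eps=0.03` is illustrative, the closed-form `mean` and `std` are derived from the linear forward SDE, and `oracle_score` is a toy stand-in for the trained network that is exact only when the clean value is known.

```python
import math
import random

S_MIN, S_MAX, GAMMA = 0.05, 0.5, 1.5      # SDE parameters listed in this summary
RHO = math.log(S_MAX / S_MIN)

def g(t):
    """Diffusion coefficient g(t)."""
    return S_MIN * (S_MAX / S_MIN) ** t * math.sqrt(2.0 * RHO)

def mean(x0, y, t):
    """Mean of x_t given (x0, y), derived from the linear drift gamma * (y - x_t)."""
    w = math.exp(-GAMMA * t)
    return w * x0 + (1.0 - w) * y

def std(t):
    """Standard deviation of x_t given (x0, y), derived from the same SDE."""
    var = S_MIN ** 2 * ((S_MAX / S_MIN) ** (2 * t) - math.exp(-2 * GAMMA * t))
    return math.sqrt(var * RHO / (GAMMA + RHO))

def oracle_score(x0):
    """Toy stand-in for s_theta: the exact score when the clean value x0 is known."""
    return lambda x_t, y, t: -(x_t - mean(x0, y, t)) / std(t) ** 2

def reverse_sample(score, y, n_steps=30, t_max=1.0, t_eps=0.03, rng=random):
    """Integrate the reverse SDE from t_max down to t_eps (t_eps is an assumption)."""
    # Start from noisy speech plus Gaussian noise, not from pure noise.
    x = y + std(t_max) * rng.gauss(0.0, 1.0)
    dt = (t_max - t_eps) / n_steps
    for i in range(n_steps):
        t = t_max - i * dt
        rev_drift = GAMMA * (y - x) - g(t) ** 2 * score(x, y, t)
        # Time runs backward, so the drift is subtracted over a positive step dt.
        x = x - rev_drift * dt + g(t) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

if __name__ == "__main__":
    x_hat = reverse_sample(oracle_score(x0=1.0), y=2.0, rng=random.Random(0))
    print(x_hat)  # expected to land near the clean value 1.0, not the noisy value 2.0
```

In a real system `score` would be the trained network $s_\theta$ evaluated on complex spectrograms; the scalar oracle only makes the mechanics of the update visible.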

Training Objective

Based on denoising score matching, the training objective is: $\arg\min_\theta \mathbb{E}_{t,(x_0,y),z,x_t|(x_0,y)}\left[\left\| s_\theta(x_t, y, t) + \frac{z}{\sigma(t)} \right\|_2^2\right]$
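The objective can be sketched for a single training example as follows, reading $z$ as the standard Gaussian noise used to form $x_t = \mu(x_0, y, t) + \sigma(t) z$ and $\sigma(t)$ as the standard deviation of that perturbation. The closed-form `mean` and `std` are derived from the forward SDE rather than quoted from this summary, and `score_model` is a hypothetical callable standing in for the network.

```python
import math
import random

S_MIN, S_MAX, GAMMA = 0.05, 0.5, 1.5      # SDE parameters listed in this summary
RHO = math.log(S_MAX / S_MIN)

def mean(x0, y, t):
    """Mean of x_t given (x0, y) under the drift gamma * (y - x_t)."""
    w = math.exp(-GAMMA * t)
    return w * x0 + (1.0 - w) * y

def std(t):
    """Standard deviation sigma(t) of x_t given (x0, y)."""
    var = S_MIN ** 2 * ((S_MAX / S_MIN) ** (2 * t) - math.exp(-2 * GAMMA * t))
    return math.sqrt(var * RHO / (GAMMA + RHO))

def dsm_loss(score_model, x0, y, t, z):
    """Squared error || s_theta(x_t, y, t) + z / sigma(t) ||^2 for one example.

    x0, y and z are equal-length sequences (real here; complex values also work).
    """
    sigma = std(t)
    x_t = [mean(a, b, t) + sigma * n for a, b, n in zip(x0, y, z)]
    s = score_model(x_t, y, t)
    return sum(abs(si + ni / sigma) ** 2 for si, ni in zip(s, z))

def sample_loss(score_model, x0, y, rng=random, t_eps=0.03):
    """Draw t and z at random, then evaluate the loss (t_eps is an assumption)."""
    t = rng.uniform(t_eps, 1.0)
    z = [rng.gauss(0.0, 1.0) for _ in x0]
    return dsm_loss(score_model, x0, y, t, z)
```

A model that outputs exactly $-z/\sigma(t)$ drives the loss to zero, which is why the trained network can be interpreted as an estimate of the score of the perturbed distribution.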

Network Architecture

The model uses the NCSN++ architecture, whose main characteristics are:

  1. Multi-resolution U-Net structure
  2. Progressive growth pathways
  3. Global attention mechanisms
  4. Time embedding: Fourier embedding that encodes the diffusion time step $t$
  5. Residual blocks: Based on BigGAN residual network blocks

Experimental Setup

Datasets

  1. WSJ0-CHiME3: WSJ0 clean speech with CHiME3 noise, SNR range 0-20 dB
  2. VB-DMD (VoiceBank-DEMAND): Standard speech enhancement benchmark dataset
  3. WSJ0-REVERB: Reverberation data simulated using pyroomacoustics, T60 range 0.4-1.0 seconds

Evaluation Metrics

  • Full-Reference Metrics: POLQA, PESQ, ESTOI, SI-SDR, SI-SIR, SI-SAR
  • No-Reference Metrics: DNSMOS, SIG, BAK, OVRL, WVMOS
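Of the full-reference metrics, SI-SDR is simple enough to sketch directly. The function below follows the commonly used definition (project the estimate onto the reference, then compare target and residual energy); it is an illustration, not the evaluation code used in the paper, and the function name is hypothetical.

```python
import math

def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length signals.

    Raises ZeroDivisionError if the estimate is an exact scaled copy of the reference.
    """
    dot = sum(r * e for r, e in zip(reference, estimate))
    energy = sum(r * r for r in reference)
    scale = dot / energy                                  # optimal scaling of the reference
    target = [scale * r for r in reference]               # part of the estimate explained by the reference
    residual = [e - tr for e, tr in zip(estimate, target)]
    target_energy = sum(v * v for v in target)
    residual_energy = sum(v * v for v in residual)
    return 10.0 * math.log10(target_energy / residual_energy)

if __name__ == "__main__":
    ref = [1.0, 0.0]
    est = [3.0, 0.3]
    print(si_sdr(ref, est))                        # about 20 dB
    print(si_sdr(ref, [5.0 * e for e in est]))     # unchanged by rescaling the estimate
```

Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero constant leaves the score unchanged, which is the property the "scale-invariant" prefix refers to.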

Comparison Methods

  • Generative Models: STCN, DVAE, CDiffuSE, SGMSE (original)
  • Discriminative Models: MetricGAN+, Conv-TasNet, GaGNet, TCN+SA+S

Implementation Details

  • STFT parameters: Window length 510 samples, hop length 128 samples, Hann window
  • SDE parameters: $\sigma_{\min} = 0.05$, $\sigma_{\max} = 0.5$, $\gamma = 1.5$
  • Training: 4 × Quadro RTX 6000 GPUs, 160 epochs, learning rate $10^{-4}$
  • Sampling: 30-step reverse process, predictor-corrector sampler
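The STFT settings determine the size of the spectrogram the network sees. As a small worked example, the helper below computes the number of one-sided frequency bins and the number of frames for a given signal length. It assumes no padding at the signal edges and an FFT size equal to the window length; both are assumptions made for illustration, as is the 16 kHz sampling rate in the usage line.

```python
def stft_shape(n_samples: int, window: int = 510, hop: int = 128):
    """Return (frequency bins, frames) for a one-sided STFT without edge padding."""
    n_bins = window // 2 + 1                  # one-sided spectrum of a real signal
    if n_samples < window:
        return n_bins, 0
    n_frames = 1 + (n_samples - window) // hop
    return n_bins, n_frames

if __name__ == "__main__":
    # One second of audio at an assumed 16 kHz sampling rate.
    print(stft_shape(16000))  # (256, 122)
```

A window of 510 samples yields 256 frequency bins, a power of two, which is convenient for a U-Net that repeatedly halves the resolution.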

Experimental Results

Main Results

Speech Enhancement Performance (WSJ0-CHiME3)

Method        Training Set   POLQA   PESQ   SI-SDR (dB)
SGMSE+        WSJ0-C3        3.73    2.96   18.3
Conv-TasNet   WSJ0-C3        3.65    2.99   19.9
MetricGAN+    WSJ0-C3        3.52    3.03   10.5
CDiffuSE      WSJ0-C3        3.08    2.27   9.2

Cross-Dataset Generalization

Under mismatched conditions (trained on VB-DMD, tested on WSJ0-CHiME3), SGMSE+ outperforms other methods on all metrics, demonstrating superior generalization capability.

Dereverberation Performance (WSJ0-REVERB)

Method        POLQA   PESQ   SI-SDR (dB)
SGMSE+        3.24    2.66   1.6
Conv-TasNet   2.41    1.84   1.6
GaGNet        2.62    1.98   -0.6

Ablation Studies

Sampler Configuration Optimization

  • Predictor-Corrector Sampler: One correction step achieves optimal performance balance
  • Step Selection: 30 steps reach performance saturation
  • Computational Efficiency: Real-time factor (RTF) of 1.77, i.e., about 1.77 seconds of processing per second of audio
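The cost implied by these sampler settings can be estimated with simple arithmetic. The sketch below assumes that each predictor step and each corrector step requires one call to the score network, which is typical for predictor-corrector samplers but is not spelled out in this summary, and it uses the usual definition of RTF as processing time divided by audio duration. The function names are hypothetical.

```python
def score_evaluations(n_steps: int = 30, corrector_steps: int = 1) -> int:
    """Score-network calls per utterance: one predictor call plus the corrector calls per step."""
    return n_steps * (1 + corrector_steps)

def processing_seconds(audio_seconds: float, rtf: float = 1.77) -> float:
    """Processing time implied by a real-time factor (processing time / audio duration)."""
    return rtf * audio_seconds

if __name__ == "__main__":
    print(score_evaluations())        # 60 network calls for 30 steps with one corrector step
    print(processing_seconds(10.0))   # about 17.7 seconds for a 10-second utterance
```

Dropping the corrector halves the number of network calls under this assumption, which is the kind of speed versus quality trade-off the sampler study explores.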

Architecture Improvement Effects

Compared to the original SGMSE, SGMSE+ improves POLQA by 0.75 and PESQ by 0.68, which underlines the importance of the network architecture.

Subjective Listening Experiments

MUSHRA experiment results show SGMSE+ achieves the highest scores, particularly demonstrating excellent robustness under mismatched conditions.

Real-World Data Evaluation

On DNS Challenge 2020 real-world noise data, SGMSE+ performs best on all no-reference metrics.

Related Work

Discriminative Model Approaches

  • Time-frequency masking: Learning ideal binary or ratio masks
  • Complex spectral mapping: Direct estimation of complex STFT coefficients
  • Time-domain methods: End-to-end waveform processing

Generative Model Approaches

  • VAE-based: Learning speech prior distribution, but limited by latent space dimensionality reduction
  • GAN methods: Implicit density estimation, but unstable training
  • Diffusion models: Recently emerging, divided into regeneration and direct modeling categories

Diffusion Model Applications in Speech

  • Speech regeneration: Methods such as CDiffuSE
  • Direct modeling: SGMSE series methods in this paper

Conclusions and Discussion

Main Conclusions

  1. Improved network architecture is the key factor for performance enhancement
  2. Generative models outperform discriminative models in cross-dataset generalization
  3. A single framework effectively handles multiple speech restoration tasks
  4. A 30-step diffusion process achieves high-quality speech generation

Limitations

  1. Computational Complexity: Larger computational burden compared to discriminative models
  2. Artifacts: May produce "vocalization" artifacts under extremely low SNR conditions
  3. Phase Modeling: Limited phase enhancement effects with complex-valued modeling
  4. Parameter Sensitivity: Requires careful tuning of SDE parameters

Future Directions

  1. Incorporate speech activity detection and phoneme information conditioning
  2. Explore more efficient sampling strategies
  3. Investigate phase enhancement under shorter frame lengths
  4. Extend to other speech restoration tasks

In-Depth Evaluation

Strengths

  1. Theoretical Contribution: Provides complete SDE theoretical derivations and analysis
  2. Methodological Innovation: Clever drift term design enables task adaptation
  3. Comprehensive Experiments: Includes cross-dataset, real-world data, and subjective evaluation
  4. Practical Value: Open-source code facilitates reproducibility and application
  5. Clear Presentation: Detailed theoretical derivations and well-designed experiments

Weaknesses

  1. Computational Efficiency: RTF of 1.77 leaves room for real-time performance improvement
  2. Artifact Issues: "Vocalization" artifacts under low SNR require resolution
  3. Parameter Tuning: SDE parameters require dataset-specific optimization
  4. Theoretical Analysis: Insufficient analysis of forward-backward process mismatch effects

Impact

  1. Academic Value: Provides important reference for diffusion model applications in speech processing
  2. Practical Value: Achieves competitive performance on multiple benchmark datasets
  3. Reproducibility: Provides complete code and audio samples
  4. Inspirational Value: Offers a general framework for other speech restoration tasks

Applicable Scenarios

  1. Speech Enhancement: Telephone communications, hearing aids
  2. Dereverberation: Post-processing for indoor speech recordings
  3. Speech Restoration: Historical recording restoration
  4. Preprocessing: Front-end processing for speech recognition systems

References

The paper cites extensive related work, primarily including:

  • Song et al. (2021): Score-based generative modeling through stochastic differential equations
  • Lu et al. (2022): Conditional diffusion probabilistic model for speech enhancement
  • Vincent (2011): A connection between score matching and denoising autoencoders
  • Anderson (1982): Reverse-time diffusion equation models

Overall Assessment: This is a high-quality research paper demonstrating excellence in theoretical innovation, methodological design, and experimental validation. The paper successfully applies diffusion models to speech enhancement tasks, achieving performance comparable to discriminative models through clever SDE design and network architecture improvements, while demonstrating superior generalization capability. Despite computational efficiency and artifact concerns, its theoretical contributions and practical value make it an important work in the field.