
Speech Enhancement and Dereverberation with Diffusion-based Generative Models

Richter, Welker, Lemercier et al.

In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a different corpus than used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse.



Basic Information

Paper ID: 2208.05830
Title: Speech Enhancement and Dereverberation with Diffusion-based Generative Models
Authors: Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann
Categories: eess.AS (Audio and Speech Processing), cs.LG (Machine Learning), cs.SD (Sound)
Publication Date: August 2022 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2208.05830
Code Link: https://github.com/sp-uhh/sgmse

Abstract

Building upon the authors' prior work, this paper employs diffusion-based generative models for speech enhancement. It provides a detailed exposition of the diffusion process, which is based on a stochastic differential equation (SDE), together with an in-depth theoretical analysis. Unlike conventional conditional generation tasks, the reverse process is not initialized from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches a forward process whose drift term moves the state from clean speech toward noisy speech. The study demonstrates that high-quality clean speech estimates can be generated with only 30 diffusion steps. An adapted network architecture yields significant performance gains in speech enhancement, indicating that the network, rather than the formalism, was the primary limitation of the original method.

Research Background and Motivation

Problem Definition

Speech enhancement aims to recover clean speech signals from audio recordings affected by acoustic noise or reverberation. This is a classical signal processing problem with important applications in telephone communications, hearing aids, speech recognition, and related fields.

Limitations of Existing Methods

Limitations of Discriminative Models:
- Difficulty in covering all possible acoustic conditions in training data
- May produce unnatural speech distortions
- Limited generalization capability
Issues with VAE-based Generative Models:
- Dimensionality reduction constraints in latent layers
- Encoder sensitivity to noisy inputs
- Dependence on linear noise models
Shortcomings of Existing Diffusion Models:
- CDiffuSE requires explicit environmental noise estimation
- Poor preservation of high-frequency information

Research Motivation

This paper aims to design a purely generative diffusion model that learns the prior distribution of clean speech to achieve high-quality speech enhancement and dereverberation in the complex STFT domain.

Core Contributions

Innovative SDE Diffusion Process: Proposes a stochastic differential equation with a drift term that lets the forward process transition from clean speech to noisy speech
Improved Network Architecture: Adopts the NCSN++ architecture in place of the original complex U-Net, significantly improving performance
Unified Framework: A single framework capable of handling both speech enhancement and dereverberation tasks
Comprehensive Evaluation: Includes cross-dataset evaluation, real-world data testing, and subjective listening experiments
Efficiency Optimization: Balances performance and computational speed through different sampler configurations
Theoretical Analysis: Provides detailed theoretical derivations and analysis of the diffusion process

Methodology Details

Task Definition

Input: Noisy/reverberant speech signal $y$
Output: Clean speech signal $x_0$
Constraint: Preserve speech naturalness and intelligibility

Data Representation

The paper operates in the complex STFT domain using magnitude compression transformation: $\tilde{c} = \beta|c|^{\alpha}e^{i\angle(c)}$ where $\alpha \in (0,1]$ is the compression exponent and $\beta \in \mathbb{R}^+$ is the scaling factor.
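As a minimal sketch of the transformation above, the following Python snippet compresses the magnitude of a single complex STFT coefficient while keeping its phase, and then inverts the operation. The values alpha = 0.5 and beta = 0.15, as well as the function names, are illustrative assumptions and are not settings stated in this summary.

```python
import cmath

def compress(c: complex, alpha: float = 0.5, beta: float = 0.15) -> complex:
    """Apply c_tilde = beta * |c|**alpha * exp(i * angle(c))."""
    return cmath.rect(beta * abs(c) ** alpha, cmath.phase(c))

def decompress(c_tilde: complex, alpha: float = 0.5, beta: float = 0.15) -> complex:
    """Invert the compression: |c| = (|c_tilde| / beta)**(1 / alpha), phase unchanged."""
    return cmath.rect((abs(c_tilde) / beta) ** (1.0 / alpha), cmath.phase(c_tilde))

# Example coefficient with magnitude 4 and phase 0.3 rad (hypothetical values).
c = 4.0 * cmath.exp(1j * 0.3)
c_tilde = compress(c)          # magnitude becomes 0.15 * sqrt(4), phase stays 0.3
c_back = decompress(c_tilde)   # recovers c up to rounding error
```

With an exponent below 1, large magnitudes are reduced relative to small ones, which narrows the dynamic range the model has to cover; the phase passes through unchanged.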

Stochastic Differential Equation Design

Forward Process

Define a linear SDE: $dx_t = f(x_t, y)dt + g(t)dw$

where:

Drift coefficient: $f(x_t, y) = \gamma(y - x_t)$
Diffusion coefficient: $g(t) = \sigma_{min}\left(\frac{\sigma_{max}}{\sigma_{min}}\right)^t\sqrt{2\log\left(\frac{\sigma_{max}}{\sigma_{min}}\right)}$
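Because the drift is linear in $x_t$, the state $x_t$ given $(x_0, y)$ is Gaussian, and its mean and standard deviation follow from solving the mean and variance ODEs of the SDE above. The sketch below evaluates $g(t)$ and these closed forms for real scalars, and cross-checks the variance numerically. The closed-form expressions are derived here from the stated drift and diffusion coefficients, the default parameter values are the ones listed under Implementation Details, and all function names are illustrative.

```python
import math

SIGMA_MIN, SIGMA_MAX, GAMMA = 0.05, 0.5, 1.5  # values from the SDE parameter list

def g(t: float) -> float:
    """Diffusion coefficient g(t) of the forward SDE."""
    log_ratio = math.log(SIGMA_MAX / SIGMA_MIN)
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t * math.sqrt(2.0 * log_ratio)

def mean(x0: float, y: float, t: float) -> float:
    """Mean of x_t given (x0, y): solves dm/dt = gamma * (y - m), m(0) = x0."""
    w = math.exp(-GAMMA * t)
    return w * x0 + (1.0 - w) * y

def std(t: float) -> float:
    """Std of x_t given (x0, y): solves dv/dt = -2 * gamma * v + g(t)**2, v(0) = 0."""
    log_ratio = math.log(SIGMA_MAX / SIGMA_MIN)
    growth = (SIGMA_MAX / SIGMA_MIN) ** (2.0 * t) - math.exp(-2.0 * GAMMA * t)
    return math.sqrt(SIGMA_MIN ** 2 * growth * log_ratio / (GAMMA + log_ratio))

def std_numeric(t: float, n: int = 20000) -> float:
    """Cross-check: integrate the variance ODE with small explicit Euler steps."""
    v, dt = 0.0, t / n
    for k in range(n):
        v += dt * (-2.0 * GAMMA * v + g(k * dt) ** 2)
    return math.sqrt(v)

print(mean(0.0, 1.0, 1.0), std(1.0), std_numeric(1.0))
```

The mean interpolates exponentially from $x_0$ toward $y$, which is what "moving from clean speech to noisy speech" means in this formulation, while the standard deviation starts at zero and grows with $t$.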

Reverse Process

The corresponding reverse SDE is: $dx_t = [f(x_t, y) - g(t)^2s_\theta(x_t, y, t)]dt + g(t)d\bar{w}$

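where $s_\theta(x_t, y, t)$ is the score model, a neural network trained to approximate the score $\nabla_{x_t}\log p_t(x_t|y)$, and $\bar{w}$ is a Wiener process running backward in time. In this form $dt$ is a negative time increment, since the equation is integrated from $t = T$ down to $t = 0$.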
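To make the reverse SDE concrete, the following toy example integrates it with plain Euler-Maruyama steps on a real scalar. It rests on several assumptions that are not part of the paper's method: the clean signal is a single known value, so the exact score of the Gaussian perturbation kernel can stand in for the trained network $s_\theta$; the closed-form mean and variance are derived from the forward SDE; and the end time t_eps = 0.03 is an illustrative choice. The initialization at $y$ plus Gaussian noise follows the description in the abstract.

```python
import math
import random

SIGMA_MIN, SIGMA_MAX, GAMMA = 0.05, 0.5, 1.5
LOG_RATIO = math.log(SIGMA_MAX / SIGMA_MIN)

def g(t):
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t * math.sqrt(2.0 * LOG_RATIO)

def mean(x0, y, t):
    w = math.exp(-GAMMA * t)
    return w * x0 + (1.0 - w) * y

def var(t):
    growth = (SIGMA_MAX / SIGMA_MIN) ** (2.0 * t) - math.exp(-2.0 * GAMMA * t)
    return SIGMA_MIN ** 2 * growth * LOG_RATIO / (GAMMA + LOG_RATIO)

def toy_score(x_t, x0, y, t):
    """Exact score of the Gaussian kernel; a stand-in for s_theta (assumption)."""
    return -(x_t - mean(x0, y, t)) / var(t)

def reverse_sample(x0, y, n_steps=30, t_max=1.0, t_eps=0.03, rng=random):
    """Euler-Maruyama integration of the reverse SDE from t_max down to t_eps."""
    dt = (t_max - t_eps) / n_steps
    # Start from the noisy observation plus Gaussian noise, not from pure noise.
    x = y + math.sqrt(var(t_max)) * rng.gauss(0.0, 1.0)
    for i in range(n_steps):
        t = t_max - i * dt
        drift = GAMMA * (y - x) - g(t) ** 2 * toy_score(x, x0, y, t)
        # Stepping backward in time: subtract drift * dt, then add fresh noise.
        x = x - drift * dt + g(t) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sample(x0=0.5, y=0.8, rng=rng) for _ in range(2000)]
avg = sum(samples) / len(samples)
print(f"clean x0 = 0.5, noisy y = 0.8, mean of reverse samples = {avg:.3f}")
```

In this toy setting the samples concentrate near the clean value rather than the noisy one, even though the process starts at $y$. A real system replaces `toy_score` with the trained network and operates on complex spectrograms.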

Training Objective

Based on denoising score matching, the training objective is: $\arg\min_\theta \mathbb{E}_{t,(x_0,y),z,x_t|(x_0,y)}\left[\left\|s_\theta(x_t, y, t) + \frac{z}{\sigma(t)}\right\|_2^2\right]$
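The objective can be read procedurally: draw a time $t$ and Gaussian noise $z$, form $x_t$ as the kernel mean plus $\sigma(t) z$, and regress the score model onto $-z/\sigma(t)$. The sketch below estimates this loss by Monte Carlo for real scalars. The closed-form mean and $\sigma(t)$ are derived from the forward SDE, the uniform sampling of $t$ on [t_eps, 1] and the toy data are assumptions, and `score_fn` is a hypothetical callable standing in for the network.

```python
import math
import random

SIGMA_MIN, SIGMA_MAX, GAMMA = 0.05, 0.5, 1.5
LOG_RATIO = math.log(SIGMA_MAX / SIGMA_MIN)

def mean(x0, y, t):
    w = math.exp(-GAMMA * t)
    return w * x0 + (1.0 - w) * y

def sigma(t):
    growth = (SIGMA_MAX / SIGMA_MIN) ** (2.0 * t) - math.exp(-2.0 * GAMMA * t)
    return math.sqrt(SIGMA_MIN ** 2 * growth * LOG_RATIO / (GAMMA + LOG_RATIO))

def dsm_loss(score_fn, pairs, rng, t_eps=0.03, t_max=1.0):
    """Monte Carlo estimate of E[(s(x_t, y, t) + z / sigma(t))**2] over (x0, y) pairs."""
    total = 0.0
    for x0, y in pairs:
        t = rng.uniform(t_eps, t_max)
        z = rng.gauss(0.0, 1.0)
        x_t = mean(x0, y, t) + sigma(t) * z   # sample from the perturbation kernel
        total += (score_fn(x_t, y, t) + z / sigma(t)) ** 2
    return total / len(pairs)

# Toy data (assumption): the clean value is a deterministic function of y.
data_rng = random.Random(0)
pairs = [(0.5 * y, y) for y in (data_rng.uniform(-1.0, 1.0) for _ in range(1000))]

def oracle(x_t, y, t):
    """Score that knows the toy relation x0 = 0.5 * y; it should drive the loss to ~0."""
    return -(x_t - mean(0.5 * y, y, t)) / sigma(t) ** 2

def zero(x_t, y, t):
    """Uninformative baseline score."""
    return 0.0

loss_oracle = dsm_loss(oracle, pairs, random.Random(1))
loss_zero = dsm_loss(zero, pairs, random.Random(1))
print(loss_oracle, loss_zero)
```

The oracle reaches a loss of essentially zero only because the toy clean value is fully determined by $y$; with real speech the minimizer is the score of the marginal $p_t(x_t|y)$ and the loss does not vanish.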

Network Architecture

The score model uses the NCSN++ architecture, whose main characteristics are:

Multi-resolution U-Net structure
Progressive growth pathways
Global attention mechanisms
Time embedding: Uses Fourier embedding to encode temporal information
Residual blocks: Based on BigGAN residual network blocks
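As an illustration of the Fourier time embedding listed above, the sketch below maps a scalar diffusion time to a vector of sine and cosine features with fixed random frequencies. The embedding size, the frequency scale, the seed, and the function name are assumptions chosen for illustration; the actual NCSN++ configuration is not given in this summary.

```python
import math
import random

def make_fourier_embedding(dim: int = 8, scale: float = 16.0, seed: int = 0):
    """Return embed(t) -> [sin(2*pi*w_k*t)..., cos(2*pi*w_k*t)...] with fixed random w_k."""
    rng = random.Random(seed)
    freqs = [rng.gauss(0.0, 1.0) * scale for _ in range(dim)]  # drawn once, then fixed

    def embed(t: float) -> list:
        angles = [2.0 * math.pi * w * t for w in freqs]
        return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

    return embed

embed = make_fourier_embedding()
e = embed(0.5)   # 16 features: 8 sines followed by 8 cosines
```

Feeding such a vector to the network, instead of the raw scalar $t$, gives each layer a bounded, multi-frequency description of where in the diffusion process the current input lies.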

Experimental Setup

Datasets

WSJ0-CHiME3: WSJ0 clean speech with CHiME3 noise, SNR range 0-20dB
VB-DMD (VoiceBank-DEMAND): Standard speech enhancement benchmark dataset
WSJ0-REVERB: Reverberation data simulated using pyroomacoustics, T60 range 0.4-1.0 seconds

Evaluation Metrics

Full-Reference Metrics: POLQA, PESQ, ESTOI, SI-SDR, SI-SIR, SI-SAR
No-Reference Metrics: DNSMOS, SIG, BAK, OVRL, WVMOS
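Among the full-reference metrics, SI-SDR has a compact closed form: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The following sketch follows this standard definition for real-valued sequences stored as Python lists. It is an illustration with a hypothetical function name, not the evaluation code used by the authors.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for two equal-length real sequences."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                       # optimal scaling of the reference
    target = [alpha * r for r in reference]        # part of the estimate explained by it
    residual = [e - s for e, s in zip(estimate, target)]
    target_energy = sum(s * s for s in target)
    residual_energy = sum(n * n for n in residual)  # zero for a perfect estimate
    return 10.0 * math.log10(target_energy / residual_energy)

reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.5, -1.0, -0.5]
value = si_sdr(estimate, reference)
rescaled = si_sdr([3.0 * e for e in estimate], reference)  # same value: scale-invariant
```

In this example the target energy is 2 and the residual energy is 0.5, so the ratio is 4 and the result is 10·log10(4), roughly 6.02 dB, regardless of how the estimate is scaled.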

Comparison Methods

Generative Models: STCN, DVAE, CDiffuSE, SGMSE (original)
Discriminative Models: MetricGAN+, Conv-TasNet, GaGNet, TCN+SA+S

Implementation Details

STFT parameters: Window length 510, hop length 128, Hann window
SDE parameters: $\sigma_{min}=0.05$ , $\sigma_{max}=0.5$ , $\gamma=1.5$
Training: 4×Quadro RTX 6000, 160 epochs, learning rate $10^{-4}$
Sampling: 30-step reverse process, predictor-corrector sampler
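The STFT settings listed above determine the shape of the spectrogram that the score model processes. The sketch below computes the number of frequency bins and frames, assuming a one-sided STFT whose FFT size equals the window length and centered framing (the signal is padded by half a window on each side). The 16 kHz sampling rate and the two-second duration are illustrative assumptions that this summary does not state.

```python
def stft_shape(num_samples: int, n_fft: int = 510, hop: int = 128) -> tuple:
    """Return (frequency_bins, frames) for a one-sided STFT with centered padding."""
    freq_bins = n_fft // 2 + 1        # one-sided spectrum of a real-valued signal
    frames = 1 + num_samples // hop   # centered framing pads n_fft // 2 on both sides
    return freq_bins, frames

sample_rate = 16000                    # assumed; not stated in this summary
bins, frames = stft_shape(2 * sample_rate)
print(bins, frames)                    # prints: 256 251
```

A window length of 510 gives exactly 256 frequency bins, a power of two, which is presumably convenient for a U-Net that halves the resolution several times.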

Experimental Results

Main Results

Speech Enhancement Performance (WSJ0-CHiME3)

Method	Training Set	POLQA	PESQ	SI-SDR (dB)
SGMSE+	WSJ0-C3	3.73	2.96	18.3
Conv-TasNet	WSJ0-C3	3.65	2.99	19.9
MetricGAN+	WSJ0-C3	3.52	3.03	10.5
CDiffuSE	WSJ0-C3	3.08	2.27	9.2

Cross-Dataset Generalization

Under mismatched conditions (trained on VB-DMD, tested on WSJ0-CHiME3), SGMSE+ outperforms other methods on all metrics, demonstrating superior generalization capability.

Dereverberation Performance (WSJ0-REVERB)

Method	POLQA	PESQ	SI-SDR (dB)
SGMSE+	3.24	2.66	1.6
Conv-TasNet	2.41	1.84	1.6
GaGNet	2.62	1.98	-0.6

Ablation Studies

Sampler Configuration Optimization

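Predictor-Corrector Sampler: One corrector step per predictor step gives the best trade-off between quality and computation
Step Selection: Performance saturates at 30 reverse steps
Computational Efficiency: Real-time factor (RTF) of 1.77, meaning that processing takes 1.77 times the duration of the audio and is therefore slower than real time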
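The real-time factor is the ratio of processing time to audio duration, so values above 1 mean the system cannot keep up with incoming audio. The short sketch below makes this arithmetic explicit; the 10-second utterance is a hypothetical example, and the function names are illustrative.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values above 1 are slower than real time."""
    return processing_seconds / audio_seconds

def processing_time(audio_seconds: float, rtf: float = 1.77) -> float:
    """Time needed to enhance a recording of the given length at a fixed RTF."""
    return audio_seconds * rtf

seconds_needed = processing_time(10.0)  # about 17.7 s for a 10 s utterance
```

At the reported RTF, enhancing a 10-second recording takes about 17.7 seconds on the hardware used for the measurement.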

Architecture Improvement Effects

Compared to the original SGMSE, SGMSE+ improves POLQA by 0.75 and PESQ by 0.68, which underlines the importance of the network architecture.

Subjective Listening Experiments

MUSHRA experiment results show SGMSE+ achieves the highest scores, particularly demonstrating excellent robustness under mismatched conditions.

Real-World Data Evaluation

On DNS Challenge 2020 real-world noise data, SGMSE+ performs best on all no-reference metrics.

Related Work

Discriminative Model Approaches

Time-frequency masking: Learning ideal binary or ratio masks
Complex spectral mapping: Direct estimation of complex STFT coefficients
Time-domain methods: End-to-end waveform processing

Generative Model Approaches

VAE-based: Learning speech prior distribution, but limited by latent space dimensionality reduction
GAN methods: Implicit density estimation, but unstable training
Diffusion models: Recently emerging, divided into regeneration and direct modeling categories

Diffusion Model Applications in Speech

Speech regeneration: Methods such as CDiffuSE
Direct modeling: SGMSE series methods in this paper

Conclusions and Discussion

Main Conclusions

Improved network architecture is the key factor for performance enhancement
Generative models outperform discriminative models in cross-dataset generalization
A single framework effectively handles multiple speech restoration tasks
A 30-step diffusion process achieves high-quality speech generation

Limitations

Computational Complexity: Larger computational burden compared to discriminative models
Artifacts: May produce "vocalization" artifacts under extremely low SNR conditions
Phase Modeling: Limited phase enhancement effects with complex-valued modeling
Parameter Sensitivity: Requires careful tuning of SDE parameters

Future Directions

Incorporate speech activity detection and phoneme information conditioning
Explore more efficient sampling strategies
Investigate phase enhancement under shorter frame lengths
Extend to other speech restoration tasks

In-Depth Evaluation

Strengths

Theoretical Contribution: Provides complete SDE theoretical derivations and analysis
Methodological Innovation: Clever drift term design enables task adaptation
Comprehensive Experiments: Includes cross-dataset, real-world data, and subjective evaluation
Practical Value: Open-source code facilitates reproducibility and application
Clear Presentation: Detailed theoretical derivations and well-designed experiments

Weaknesses

Computational Efficiency: An RTF of 1.77 is slower than real time, so real-time use requires further speed-ups
Artifact Issues: "Vocalization" artifacts under low SNR require resolution
Parameter Tuning: SDE parameters require dataset-specific optimization
Theoretical Analysis: Insufficient analysis of forward-backward process mismatch effects

Impact

Academic Value: Provides important reference for diffusion model applications in speech processing
Practical Value: Achieves competitive performance on multiple benchmark datasets
Reproducibility: Provides complete code and audio samples
Inspirational Value: Offers a general framework for other speech restoration tasks

Applicable Scenarios

Speech Enhancement: Telephone communications, hearing aids
Dereverberation: Post-processing for indoor speech recordings
Speech Restoration: Historical recording restoration
Preprocessing: Front-end processing for speech recognition systems

References

The paper cites extensive related work, primarily including:

Song et al. (2021): Score-based generative modeling through stochastic differential equations
Lu et al. (2022): Conditional diffusion probabilistic model for speech enhancement
Vincent (2011): A connection between score matching and denoising autoencoders
Anderson (1982): Reverse-time diffusion equation models

Overall Assessment: This is a high-quality research paper demonstrating excellence in theoretical innovation, methodological design, and experimental validation. The paper successfully applies diffusion models to speech enhancement tasks, achieving performance comparable to discriminative models through clever SDE design and network architecture improvements, while demonstrating superior generalization capability. Despite computational efficiency and artifact concerns, its theoretical contributions and practical value make it an important work in the field.