Déréverbération non-supervisée de la parole par modèle hybride

Bahrman, Fontaine, Richard

This paper introduces a new training strategy to improve speech dereverberation systems in an unsupervised manner using only reverberant speech. Most existing algorithms rely on paired dry/reverberant data, which is difficult to obtain. Our approach uses limited acoustic information, like the reverberation time (RT60), to train a dereverberation system. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics than the state-of-the-art.

Unsupervised Speech Dereverberation with Hybrid Model

Basic Information

Paper ID: 2510.09025
Title: Unsupervised Speech Dereverberation with Hybrid Model
Authors: Louis Bahrman, Mathieu Fontaine, Gaël Richard (LTCI, Télécom Paris, Institut Polytechnique de Paris)
Categories: cs.SD cs.AI eess.AS
Publication Date: October 10, 2025
Paper Link: https://arxiv.org/abs/2510.09025

Abstract

This paper proposes a novel training strategy to improve speech dereverberation systems in an unsupervised manner using only reverberant speech. Most existing algorithms rely on paired clean/reverberant data, which is difficult to obtain. The method uses limited acoustic information (such as reverberation time RT60) to train the dereverberation system. Experimental results demonstrate that the method achieves more consistent performance across various objective metrics compared to state-of-the-art approaches.

Research Background and Motivation

Core Problem: Indoors, wall reflections and diffraction around obstacles add reverberation to speech signals, which reduces intelligibility. Dereverberation aims to mitigate this effect.
Problem Significance: Reverberation severely affects speech quality and intelligibility. Effective dereverberation techniques are needed in applications such as speech recognition and communication systems.
Limitations of Existing Methods:
- Discriminative methods require large amounts of paired (clean, reverberant) data, which is difficult to obtain
- Generative methods, while requiring less supervision, still need clean speech data, which is harder to acquire than reverberant data
- Methods like MetricGAN-U use only reverberant signals but optimize a single metric, so gains on that metric do not carry over to the others
Research Motivation: Develop an unsupervised dereverberation method using only reverberant speech, leveraging limited acoustic information such as reverberation time for training.

Core Contributions

Reverberation self-supervised training framework: Uses a reverberation model, rather than a metric-based objective, to supervise deep neural network training
Reverberation time-aware training strategy: Combines acoustic modeling with deep learning, using parameters such as RT60 to guide training
More consistent performance: Matches or outperforms the metric-supervised baseline on SISDR, ESTOI, and WB-PESQ, though not on SRMR
Open-source implementation: Code, pre-trained models, and examples are released to support reproducibility

Method Details

Task Definition

Input: Reverberant speech signal Y
Output: Estimated clean speech signal Ŝ
Constraint: Training uses only reverberant signals without requiring paired clean/reverberant data

Model Architecture

1. Overall Framework

The method comprises three main components:

Reverberation Analyzer A: Estimates acoustic parameters (primarily RT60) from reverberant signals
RIS Synthesizer S: Synthesizes room impulse responses based on acoustic parameters
Convolution Model C: Performs cross-band convolution in the time-frequency domain
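The summary does not spell out how the three components interact during training. One plausible reading, sketched below, is that the network's clean-speech estimate is re-reverberated with a synthesized RIS and compared with the observed reverberant input. The function `training_step` and all toy stand-ins are hypothetical and only illustrate the data flow; they are not the authors' implementation.

```python
import numpy as np

def training_step(Y, dereverb_net, analyzer, synthesizer, conv_model, loss_fn):
    """One unsupervised step; only the reverberant input Y is needed."""
    S_hat = dereverb_net(Y)        # estimated clean speech
    theta = analyzer(Y)            # Reverberation Analyzer A: e.g. RT60
    H = synthesizer(theta)         # RIS Synthesizer S
    Y_hat = conv_model(H, S_hat)   # Convolution Model C: re-reverberate the estimate
    return loss_fn(Y_hat, Y)       # compare with the observed reverberant signal

# Toy stand-ins (hypothetical) that only demonstrate the data flow.
Y = np.ones((4, 6), dtype=complex)
loss = training_step(
    Y,
    dereverb_net=lambda y: 0.5 * y,
    analyzer=lambda y: {"rt60": 0.5},
    synthesizer=lambda theta: 2.0,
    conv_model=lambda h, s: h * s,
    loss_fn=lambda a, b: float(np.sum(np.abs(a - b) ** 2)),
)
```

In this toy setup the halved estimate is doubled again by the stand-in "RIS", so the re-reverberated signal matches the input and the loss is zero.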

2. Reverberation Model

Signal Model:

y(n) = (s ⋆ h)(n)

where y is the reverberant signal, s is the clean signal, and h is the room impulse response (RIS).

Polack Reverberation Model:

h_l(n) = b(n)e^(-3ln(10)n/(RT60·f_s))

where b(n)~N(0,σ²) is white noise and RT60 is the reverberation time.
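As a minimal sketch of the two formulas above, the NumPy snippet below draws white Gaussian noise, applies the exponential envelope, and convolves the result with a stand-in signal. The function name `polack_tail`, the default `sigma`, the truncation after one RT60, and the use of random noise in place of clean speech are illustrative assumptions.

```python
import numpy as np

def polack_tail(rt60, fs, length, sigma=1.0, seed=0):
    """h_l(n) = b(n) * exp(-3 ln(10) n / (RT60 * fs)), with b(n) ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    n = np.arange(length)
    b = rng.normal(0.0, sigma, size=length)
    envelope = np.exp(-3.0 * np.log(10.0) * n / (rt60 * fs))
    return b * envelope, envelope

fs, rt60 = 16000, 0.5
# Assumption: truncate the response after one RT60.
h, envelope = polack_tail(rt60, fs, length=int(rt60 * fs))

# Stand-in for one second of clean speech (random noise, for illustration only).
s = np.random.default_rng(1).normal(size=fs)
y = np.convolve(s, h)  # y(n) = (s * h)(n)

# Envelope level in dB at the last sample, just before n = RT60 * fs.
envelope_db_at_end = 20.0 * np.log10(envelope[-1])
```

The envelope starts at 1 and approaches -60 dB as n approaches RT60·f_s, which is the definition of RT60.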

3. Time-Frequency Domain Convolution

In the Short-Time Fourier Transform (STFT) domain, convolution is represented as:

Y_{f,t} = ∑_{f'} ∑_{t'} H_{f,f',t'} S_{f',t-t'}
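The double sum over source bands f' and frame lags t' can be sketched in NumPy as follows. The function name `crossband_conv`, the causal zero-padding at the start, and the array layout (H of shape F×F×T', S of shape F×T) are assumptions made for illustration.

```python
import numpy as np

def crossband_conv(H, S):
    """Y[f, t] = sum over f', t' of H[f, f', t'] * S[f', t - t'] (causal, zero-padded)."""
    F, _, n_taps = H.shape
    T = S.shape[1]
    Y = np.zeros((F, T), dtype=complex)
    for tp in range(min(n_taps, T)):
        # Tap t' mixes all source bands f' into each band f, delayed by t' frames.
        Y[:, tp:] += H[:, :, tp] @ S[:, : T - tp]
    return Y

rng = np.random.default_rng(0)
F, T, n_taps = 4, 10, 3
S = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))

# Sanity check: a filter with no cross-band terms and a single tap at t' = 1
# should act as a pure one-frame delay.
H = np.zeros((F, F, n_taps), dtype=complex)
H[np.arange(F), np.arange(F), 1] = 1.0
Y = crossband_conv(H, S)
```

Because the operation consists only of matrix products and additions, the same structure is differentiable when written in an autodiff framework.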

4. RIS Synthesizer

Given the acoustic parameters Θ (primarily RT60), the synthesized RIS is defined as:

S(Θ)(n) = {
  |b(n)|e^(-3ln(10)n/(RT60·f_s)), n > n_m
  1,                               n = 0  
  0,                               otherwise
}
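The piecewise definition above can be sketched directly in NumPy. The index n_m is not defined in this summary, so the snippet treats it as a free parameter; the function name `synthesize_ris`, the default `sigma`, and the example values are likewise assumptions.

```python
import numpy as np

def synthesize_ris(rt60, fs, n_m, length, sigma=1.0, seed=0):
    """Piecewise RIS: 1 at n = 0, |b(n)| * decay for n > n_m, 0 elsewhere."""
    rng = np.random.default_rng(seed)
    n = np.arange(length)
    b = rng.normal(0.0, sigma, size=length)
    tail = np.abs(b) * np.exp(-3.0 * np.log(10.0) * n / (rt60 * fs))
    h = np.where(n > n_m, tail, 0.0)  # tail branch and "otherwise" branch
    h[0] = 1.0                        # n = 0 branch
    return h

# Example values (n_m = 80 samples is an arbitrary illustrative choice).
h = synthesize_ris(rt60=0.5, fs=16000, n_m=80, length=8000)
```

With these values the response has a unit sample at n = 0, zeros up to and including n_m, and a non-negative decaying tail afterwards.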

Technical Innovations

Reverberation Self-Supervision Strategy: Unlike traditional metric supervision, directly uses physical reverberation models for supervision
Cross-Band Time-Frequency Convolution: Implements differentiable time-frequency domain convolution operations for gradient backpropagation
Reverberation Matching Loss Function:

L = ∑_{f,t} [ |Ŷ_{f,t} - Y_{f,t}|² + λ |log((1+γ|Ŷ_{f,t}|)/(1+γ|Y_{f,t}|))|² ]
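A minimal NumPy sketch of this loss follows, assuming both terms are summed over all time-frequency bins and that Ŷ is the re-reverberated estimate compared against the observed Y. The values of λ and γ are not given in this summary, so the defaults below are placeholders.

```python
import numpy as np

def reverb_matching_loss(Y_hat, Y, lam=1.0, gamma=1.0):
    """Squared complex error plus a weighted log-compressed magnitude error."""
    l2 = np.abs(Y_hat - Y) ** 2
    log_ratio = np.log((1.0 + gamma * np.abs(Y_hat)) / (1.0 + gamma * np.abs(Y)))
    return float(np.sum(l2 + lam * log_ratio ** 2))

# Single-bin worked example: Y = 1, Y_hat = 3, lam = gamma = 1.
# l2 term: |3 - 1|^2 = 4; log term: log((1 + 3) / (1 + 1))^2 = log(2)^2.
Y = np.array([[1.0 + 0.0j]])
Y_hat = np.array([[3.0 + 0.0j]])
example_loss = reverb_matching_loss(Y_hat, Y)
```

The first term is sensitive to phase, while the log term depends only on magnitudes and compresses large values, so λ trades off the two.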

Experimental Setup

Datasets

Training Data: Headset microphone recordings from WSJ1 dataset, 73 hours of audio, 60,307 segments
RIS Data: 32,000 RIS generated using pyroomacoustics from 2,000 simulated rooms
Room Parameters:
- Dimensions: [5, 10] × [5, 10] × [2.5, 4] m³
- RT60: [0.2, 1.0] s
- Source-microphone distance: [0.75, 2.5] m
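To make the configuration concrete, the sketch below draws room parameters from the ranges listed above. Uniform, independent sampling and an even split of RIS across rooms are assumptions; the summary states only the ranges and the totals, and the actual simulation used pyroomacoustics, which is not called here.

```python
import numpy as np

def sample_room(rng):
    """Draw one set of room parameters (assumed uniform within the listed ranges)."""
    return {
        "dims_m": rng.uniform([5.0, 5.0, 2.5], [10.0, 10.0, 4.0]),
        "rt60_s": rng.uniform(0.2, 1.0),
        "distance_m": rng.uniform(0.75, 2.5),
    }

rng = np.random.default_rng(0)
rooms = [sample_room(rng) for _ in range(2000)]

# 32,000 RIS over 2,000 rooms corresponds to 16 RIS per room if split evenly.
ris_per_room = 32000 // 2000
```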

Evaluation Metrics

SISDR: Scale-Invariant Signal-to-Distortion Ratio
ESTOI: Extended Short-Time Objective Intelligibility
WB-PESQ: Wideband Perceptual Evaluation of Speech Quality
SRMR: Speech-to-Reverberation Modulation Energy Ratio
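Of these metrics, SISDR has a closed form that is short enough to sketch. The snippet below follows the usual definition, in which the estimate is projected onto the reference before the ratio is taken; it assumes zero-mean, time-aligned signals and is not the evaluation code used in the paper.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for 1-D, time-aligned signals."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference   # part of the estimate explained by the reference
    residual = estimate - target # everything else counts as distortion
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

reference = np.array([1.0, 0.0, -1.0, 0.0])
estimate = reference + np.array([0.0, 0.1, 0.0, 0.1])  # error orthogonal to reference

value = si_sdr(estimate, reference)          # energy ratio 2 / 0.02 = 100, i.e. 20 dB
value_scaled = si_sdr(3.0 * estimate, reference)
```

Rescaling the estimate leaves the value unchanged, which is why the metric does not reward or penalize a pure gain mismatch.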

Comparison Methods

Fully Supervised Methods: FullSubNet and BiLSTM trained on paired data
Weakly Supervised Methods: Versions using oracle RT60
Blind Supervised Methods: Fully unsupervised version using estimated RT60
Baseline Methods: MetricGAN-U (BiLSTM+SRMR)

Implementation Details

Audio Processing: 16 kHz sampling rate, 512-point Hann window, 50% overlap
Optimizer: Adam optimizer
Stopping Criterion: Based on validation set SISDR metric
Models: FullSubNet (FSN) and BiLSTM neural network architectures
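The audio settings translate into a 32 ms window with a 16 ms hop and 257 frequency bins. A NumPy-only sketch of such an STFT is shown below; the absence of padding and the symmetric `np.hanning` window are assumptions, since the summary does not specify them.

```python
import numpy as np

FS = 16000        # sampling rate in Hz
N_FFT = 512       # window length in samples (32 ms at 16 kHz)
HOP = N_FFT // 2  # 50% overlap (16 ms)

def stft(x):
    """Framed, Hann-windowed one-sided FFT; returns shape (N_FFT // 2 + 1, n_frames)."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

x = np.random.default_rng(0).normal(size=FS)  # one second of noise as a stand-in
X = stft(x)
window_ms = 1000.0 * N_FFT / FS
```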

Experimental Results

Main Results

Model	Supervision	SISDR (dB)	ESTOI	WB-PESQ	SRMR
FSN	Fully Supervised	5.6±3.9	0.84±0.10	2.55±0.67	8.2±3.5
FSN	Weakly Supervised	2.9±3.5	0.71±0.15	1.78±0.70	6.9±2.8
FSN	Blind Supervised (Proposed)	2.8±3.4	0.71±0.15	1.78±0.70	6.9±2.8
BiLSTM	Fully Supervised	1.3±4.3	0.78±0.12	2.25±0.78	7.9±3.0
BiLSTM	Weakly Supervised	1.6±3.7	0.71±0.15	1.84±0.74	6.9±2.8
BiLSTM	Blind Supervised (Proposed)	1.5±3.7	0.71±0.15	1.84±0.74	6.9±2.8
BiLSTM	SRMR Baseline	-1.5±3.5	0.64±0.18	1.78±0.72	10.9±4.3
-	Reverberant Signal	-1.3±3.5	0.69±0.16	1.75±0.74	6.9±2.9

Key Findings

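Consistency Advantage: With the same BiLSTM architecture, the proposed method outperforms the SRMR baseline on SISDR (1.5 vs. -1.5), ESTOI (0.71 vs. 0.64), and WB-PESQ (1.84 vs. 1.78); the FSN variant ties the baseline on WB-PESQ (1.78) and exceeds it on the other two
Baseline Limitations: The MetricGAN-U baseline achieves the best SRMR (10.9) but falls below the unprocessed reverberant signal on SISDR (-1.5 vs. -1.3) and ESTOI (0.64 vs. 0.69), with only a marginal WB-PESQ gain (1.78 vs. 1.75)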
Estimation Robustness: The blind supervised version performs nearly identically to the weakly supervised version, indicating robustness to RT60 estimation errors
Model Adaptability: BiLSTM degrades less than FSN when moving from full to weak supervision (WB-PESQ 2.25 to 1.84 versus 2.55 to 1.78; SISDR 1.3 to 1.6 versus 5.6 to 2.9), possibly because it predicts only magnitude masks and is therefore insensitive to phase disturbances
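The consistency argument can be checked with simple arithmetic on the mean values from the results table. The snippet below computes each BiLSTM system's change relative to the unprocessed reverberant signal; it ignores the reported standard deviations, so it shows trends in the means rather than statistical significance.

```python
METRICS = ("SISDR", "ESTOI", "WB-PESQ", "SRMR")

# Mean values copied from the results table.
means = {
    "Reverberant Signal": (-1.3, 0.69, 1.75, 6.9),
    "BiLSTM Blind Supervised (Proposed)": (1.5, 0.71, 1.84, 6.9),
    "BiLSTM SRMR Baseline": (-1.5, 0.64, 1.78, 10.9),
}

reference = means["Reverberant Signal"]
deltas = {
    name: {m: round(v - r, 2) for m, v, r in zip(METRICS, values, reference)}
    for name, values in means.items()
    if name != "Reverberant Signal"
}

for name, delta in deltas.items():
    print(name, delta)
```

The proposed system is at or above the reverberant input on every metric (+2.8 SISDR, +0.02 ESTOI, +0.09 WB-PESQ, no change in SRMR), whereas the baseline gains 4.0 on SRMR while losing 0.2 on SISDR and 0.05 on ESTOI.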

Related Work

Traditional Methods

Statistical Signal Processing: Such as Weighted Prediction Error (WPE) methods
Convolutional Transfer Function Approximation: Models reverberation as a filtering process in subbands

Deep Learning Methods

Discriminative Methods: Directly predict clean signals or complex masks
Generative Methods: Such as Variational Autoencoders learning clean speech distributions
Hybrid Methods: Combining traditional models and deep learning, such as USDNet

Unsupervised Methods

MetricGAN-U: Uses adversarial networks to optimize specific metrics
Diffusion Model Methods: Such as BUDDy using diffusion models for blind dereverberation

Conclusions and Discussion

Main Conclusions

Reverberation self-supervision achieves more consistent performance improvements than metric self-supervision
The method improves performance across multiple objective metrics, avoiding limitations of single-metric optimization
Blind RT60 estimation does not significantly impact performance, enhancing practical applicability

Limitations

Model Complexity: Requires additional reverberation modeling components compared to purely data-driven methods
Parameter Dependency: Although blind estimation is possible, the method still depends on the accuracy of acoustic parameters such as RT60
Simplified Reverberation Model: The Polack model used is a simplified reverberation model that may not fully match real environments
Phase Sensitivity: Complex spectral methods (such as FSN) are more sensitive to phase disturbances in the reverberation model

Future Directions

Generative Extensions: Apply the method to generative models, which could better account for probabilistic RIS models
More Complex Reverberation Models: Consider more accurate physical reverberation models
Multi-Channel Extension: Extend to multi-microphone scenarios
Real-Time Applications: Optimize computational efficiency for real-time processing support

In-Depth Evaluation

Strengths

Strong Innovation: Proposes a reverberation self-supervision training strategy as a technically novel alternative to metric-based supervision
High Practical Value: Addresses the practical problem of difficulty in obtaining paired training data
Comprehensive Experiments: Thorough evaluation across multiple metrics and model architectures
Open-Source Contribution: Provides complete code and models, facilitating research reproducibility
Solid Theoretical Foundation: Based on mature acoustic reverberation theory

Weaknesses

Performance Gap: Still shows a notable gap to fully supervised training (for FSN, SISDR 2.8 vs. 5.6 and WB-PESQ 1.78 vs. 2.55)
Evaluation Limitations: Evaluation only on simulated data, lacking validation in real environments
Insufficient Parameter Sensitivity Analysis: Limited analysis of sensitivity to reverberation model parameters
Computational Overhead: Requires additional reverberation modeling computation during training

Impact

Academic Contribution: Provides a new unsupervised training paradigm for speech dereverberation
Practical Value: Reduces data requirements for high-quality dereverberation systems
Reproducibility: Open-source code and detailed experimental settings ensure reproducibility
Inspirational Significance: Provides insights for physical model supervision in other speech enhancement tasks

Applicable Scenarios

Data-Scarce Scenarios: Application environments lacking paired training data
Specific Acoustic Environments: Fixed environments with known basic acoustic parameters
Rapid Deployment: Systems requiring quick adaptation to new environments
Research Prototypes: As foundational components for more complex systems

References

The paper cites important works in related fields, including:

Classical theoretical foundations of the Polack reverberation model
Traditional dereverberation methods such as WPE
Recent unsupervised methods like MetricGAN-U
Advanced speech enhancement models such as FullSubNet
Related algorithms for blind reverberation parameter estimation

This paper presents an innovative unsupervised speech dereverberation framework that cleverly combines acoustic modeling and deep learning, achieving a good balance between practicality and performance. Although there remains a gap compared to fully supervised methods, it provides a valuable solution to address the practical challenge of data acquisition in real-world applications.