
Déréverbération non-supervisée de la parole par modèle hybride

Bahrman, Fontaine, Richard
This paper introduces a new training strategy to improve speech dereverberation systems in an unsupervised manner using only reverberant speech. Most existing algorithms rely on paired dry/reverberant data, which is difficult to obtain. Our approach uses limited acoustic information, like the reverberation time (RT60), to train a dereverberation system. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics than the state-of-the-art.

Unsupervised Speech Dereverberation with Hybrid Model

Basic Information

  • Paper ID: 2510.09025
  • Title: Unsupervised Speech Dereverberation with Hybrid Model
  • Authors: Louis Bahrman, Mathieu Fontaine, Gaël Richard (LTCI, Télécom Paris, Institut Polytechnique de Paris)
  • Categories: cs.SD cs.AI eess.AS
  • Publication Date: October 10, 2025
  • Paper Link: https://arxiv.org/abs/2510.09025

Abstract

This paper proposes a novel training strategy to improve speech dereverberation systems in an unsupervised manner using only reverberant speech. Most existing algorithms rely on paired clean/reverberant data, which is difficult to obtain. The method uses limited acoustic information (such as reverberation time RT60) to train the dereverberation system. Experimental results demonstrate that the method achieves more consistent performance across various objective metrics compared to state-of-the-art approaches.

Research Background and Motivation

  1. Core Problem: In indoor environments, reflections from walls and diffraction around obstacles add reverberation to speech signals, which reduces intelligibility. Dereverberation methods aim to mitigate this effect.
  2. Problem Significance: Reverberation severely affects speech quality and intelligibility. Effective dereverberation techniques are needed in applications such as speech recognition and communication systems.
  3. Limitations of Existing Methods:
    • Discriminative methods require large amounts of paired (clean, reverberant) data, which is difficult to obtain
    • Generative methods, while requiring less supervision, still need clean speech data, which is harder to acquire than reverberant data
    • Methods like MetricGAN-U use only reverberant signals but optimize a single metric, so the gains do not carry over to other metrics
  4. Research Motivation: Develop an unsupervised dereverberation method using only reverberant speech, leveraging limited acoustic information such as reverberation time for training.

Core Contributions

  1. Proposed a reverberation self-supervised training framework: Uses a reverberation model, rather than a quality metric, to supervise deep neural network training
  2. Designed a reverberation time-aware training strategy: Combines acoustic modeling and deep learning, utilizing parameters such as RT60 to guide training
  3. Achieved more consistent performance improvements: Outperforms metric-supervised methods across multiple objective metrics
  4. Provided open-source implementation: Released code, pre-trained models, and examples to facilitate research reproducibility

Method Details

Task Definition

Input: Reverberant speech signal Y
Output: Estimated clean speech signal Ŝ
Constraint: Training uses only reverberant signals without requiring paired clean/reverberant data

Model Architecture

1. Overall Framework

The method comprises three main components:

  • Reverberation Analyzer A: Estimates acoustic parameters (primarily RT60) from reverberant signals
  • RIR Synthesizer S: Synthesizes room impulse responses (RIRs) based on acoustic parameters
  • Convolution Model C: Performs cross-band convolution in the time-frequency domain
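
The way these three components fit together during training can be sketched as follows. This is a minimal time-domain illustration under stated assumptions, not the authors' implementation: the source performs the convolution in the STFT domain, and `dereverb_fn`, `analyzer_fn`, and `synth_rir` are hypothetical stand-ins for the network, the analyzer A, and the synthesizer S.

```python
import numpy as np

def synth_rir(rt60, fs=16000, rng=None):
    """Hypothetical stand-in for the synthesizer S: a decaying noise tail."""
    rng = np.random.default_rng() if rng is None else rng
    n = np.arange(int(rt60 * fs))
    h = np.abs(rng.standard_normal(n.size)) * np.exp(-3 * np.log(10) * n / (rt60 * fs))
    h[0] = 1.0  # direct path
    return h

def reverberation_matching_error(y, dereverb_fn, analyzer_fn, fs=16000, rng=None):
    """Re-reverberate the dry estimate and compare it with the observed signal."""
    s_hat = dereverb_fn(y)                    # network output: estimated dry speech
    rt60 = analyzer_fn(y)                     # analyzer A: estimated RT60 in seconds
    h = synth_rir(rt60, fs, rng)              # synthesizer S: impulse response from RT60
    y_hat = np.convolve(s_hat, h)[: len(y)]   # convolution model C (time domain here)
    return float(np.mean((y_hat - y) ** 2))   # plain MSE, used only for illustration
```

No clean reference appears anywhere in this loop: the only training target is the reverberant observation itself, which is what makes the scheme unsupervised.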

2. Reverberation Model

Signal Model:

y(n) = (s ⋆ h)(n)

where y is the reverberant signal, s is the clean signal, h is the room impulse response (RIR), and ⋆ denotes convolution.
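
As a small worked example of this signal model, the snippet below convolves a toy dry signal with a toy impulse response made of a direct path and a single echo. The values are illustrative only and do not come from the paper.

```python
import numpy as np

s = np.array([1.0, 0.5, -0.25, 0.0])   # toy dry signal
h = np.array([1.0, 0.0, 0.6])          # toy impulse response: direct path plus one echo 2 samples later
y = np.convolve(s, h)                  # y(n) = sum_k s(k) h(n - k)
print(y)                               # [ 1.    0.5   0.35  0.3  -0.15  0.  ]
```

Each output sample is the dry sample plus 0.6 times the dry sample from two steps earlier, which is the echo smearing the signal in time.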

Polack Reverberation Model:

h_l(n) = b(n)e^(-3ln(10)n/(RT60·f_s))

where b(n)~N(0,σ²) is white noise and RT60 is the reverberation time.
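
The decay constant in the exponent is chosen so that the energy envelope has dropped by 60 dB at n = RT60·f_s: the amplitude envelope equals e^(-3ln(10)) = 10^(-3) there, and 20·log10(10^(-3)) = -60 dB. A minimal NumPy sketch of the model follows; the function names and default arguments are assumptions made for illustration.

```python
import numpy as np

def polack_rir(rt60, fs=16000, duration=None, sigma=1.0, rng=None):
    """White Gaussian noise shaped by the exponential envelope of the Polack model."""
    rng = np.random.default_rng() if rng is None else rng
    duration = rt60 if duration is None else duration   # assumed default: one RT60 long
    n = np.arange(int(round(duration * fs)))
    envelope = np.exp(-3.0 * np.log(10.0) * n / (rt60 * fs))
    return sigma * rng.standard_normal(n.size) * envelope

def envelope_db(rt60, fs, t):
    """Envelope level in dB at time t (seconds), relative to n = 0."""
    n = t * fs
    return 20.0 * np.log10(np.exp(-3.0 * np.log(10.0) * n / (rt60 * fs)))

print(round(envelope_db(0.5, 16000, 0.5), 6))   # -60.0
```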

3. Time-Frequency Domain Convolution

In the Short-Time Fourier Transform (STFT) domain, convolution is represented as:

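Y_{f,t} = ∑_{f'} ∑_{t'} H_{f,f',t'} S_{f',t-t'}

where S_{f,t} and Y_{f,t} are the STFT coefficients of the clean and reverberant signals at frequency band f and frame t, and H_{f,f',t'} is the cross-band filter from band f' to band f at frame lag t'.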
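
The double sum can be written compactly as one matrix product per frame lag. The sketch below assumes a dense filter array `H` of shape (bands, bands, lags), zero signal before the first frame, and an output truncated to the input length; these choices are illustrative, and the paper's implementation may restrict the filter to neighboring bands.

```python
import numpy as np

def crossband_conv(H, S):
    """Y[f, t] = sum over f' and t' of H[f, f', t'] * S[f', t - t']."""
    n_frames = S.shape[1]
    Y = np.zeros((H.shape[0], n_frames), dtype=np.result_type(H, S))
    for lag in range(min(H.shape[2], n_frames)):
        # H[:, :, lag] mixes source bands f' into output bands f at this frame lag
        Y[:, lag:] += H[:, :, lag] @ S[:, : n_frames - lag]
    return Y
```

Because the operation consists only of matrix products and additions, the same structure is differentiable when written in an autodiff framework.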

4. RIR Synthesizer

The synthesized RIR is defined as:

S(Θ)(n) = {
  |b(n)|e^(-3ln(10)n/(RT60·f_s)), n > n_m
  1,                               n = 0  
  0,                               otherwise
}
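
A direct NumPy transcription of this piecewise definition is sketched below. The summary does not define n_m, so the code treats it as an assumed sample index after which the decaying tail begins; `synthesize_rir` and its arguments are hypothetical names.

```python
import numpy as np

def synthesize_rir(rt60, n_m, length, fs=16000, sigma=1.0, rng=None):
    """Unit direct path at n = 0, silence up to n_m, rectified decaying noise after n_m."""
    rng = np.random.default_rng() if rng is None else rng
    n = np.arange(length)
    b = sigma * rng.standard_normal(length)                   # b(n) ~ N(0, sigma^2)
    tail = np.abs(b) * np.exp(-3.0 * np.log(10.0) * n / (rt60 * fs))
    h = np.where(n > n_m, tail, 0.0)                          # zero everywhere except the tail
    h[0] = 1.0                                                # direct path
    return h
```

Unlike the plain Polack model, the tail here uses |b(n)|, so every synthesized tap is non-negative.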

Technical Innovations

  1. Reverberation Self-Supervision Strategy: Unlike traditional metric supervision, directly uses physical reverberation models for supervision
  2. Cross-Band Time-Frequency Convolution: Implements differentiable time-frequency domain convolution operations for gradient backpropagation
  3. Reverberation Matching Loss Function:
L = ∑|Ŷ_{f,t} - Y_{f,t}|² + λ|log((1+γ|Ŷ_{f,t}|)/(1+γ|Y_{f,t}|))|²
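
The loss can be sketched in NumPy as below. It assumes the sum runs over all time-frequency bins for both terms, and the default values of `lam` (λ) and `gamma` (γ) are placeholders, since the summary does not state the values used.

```python
import numpy as np

def reverberation_matching_loss(Y_hat, Y, lam=1.0, gamma=1.0):
    """Squared complex error plus a squared log-compressed magnitude ratio, summed over bins."""
    spectral = np.abs(Y_hat - Y) ** 2
    log_ratio = np.log((1.0 + gamma * np.abs(Y_hat)) / (1.0 + gamma * np.abs(Y)))
    return float(np.sum(spectral + lam * log_ratio ** 2))
```

The first term is sensitive to phase because it compares complex coefficients, while the second compares only compressed magnitudes and so weights low-energy bins more evenly.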

Experimental Setup

Datasets

  • Training Data: Headset microphone recordings from WSJ1 dataset, 73 hours of audio, 60,307 segments
  • RIR Data: 32,000 RIRs generated with pyroomacoustics from 2,000 simulated rooms
  • Room Parameters:
    • Dimensions: [5, 10] × [5, 10] × [2.5, 4] m
    • RT60: [0.2, 1.0] s
    • Source-microphone distance: [0.75, 2.5] m
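
The figures above imply 32,000 / 2,000 = 16 impulse responses per simulated room. A minimal sketch of drawing room configurations from the stated ranges is shown below; uniform sampling is an assumption, since the summary gives only the ranges, and the simulation step itself (done with pyroomacoustics in the paper) is omitted.

```python
import numpy as np

def sample_room(rng):
    """Draw one room configuration; uniform sampling over each range is an assumption."""
    return {
        "dims_m": (rng.uniform(5, 10), rng.uniform(5, 10), rng.uniform(2.5, 4)),
        "rt60_s": rng.uniform(0.2, 1.0),
        "src_mic_dist_m": rng.uniform(0.75, 2.5),
    }

rng = np.random.default_rng(0)
rooms = [sample_room(rng) for _ in range(2000)]
rirs_per_room = 32000 // len(rooms)   # 16
```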

Evaluation Metrics

  • SISDR: Scale-Invariant Signal-to-Distortion Ratio
  • ESTOI: Extended Short-Time Objective Intelligibility
  • WB-PESQ: Wideband Perceptual Evaluation of Speech Quality
  • SRMR: Speech-to-Reverberation Modulation Energy Ratio
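
Of these metrics, SISDR is simple enough to write out directly. The sketch below follows the common definition (project the estimate onto the reference, then compare target and residual energies in dB); the mean removal and the `eps` guard are conventional choices, not details taken from the paper.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Optimal scaling of the reference that best explains the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))
```

Rescaling the estimate leaves the value unchanged, which is why the metric rewards waveform fidelity rather than output level.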

Comparison Methods

  1. Fully Supervised Methods: FullSubNet and BiLSTM trained on paired data
  2. Weakly Supervised Methods: Versions trained with the oracle (ground-truth) RT60
  3. Blind Supervised Methods: Fully unsupervised versions trained with an RT60 estimated blindly from the reverberant signal
  4. Baseline Methods: MetricGAN-U (BiLSTM+SRMR)

Implementation Details

  • Audio Processing: 16 kHz sampling rate, 512-point Hann window, 50% overlap
  • Optimizer: Adam optimizer
  • Stopping Criterion: Based on validation set SISDR metric
  • Models: FullSubNet (FSN) and BiLSTM neural network architectures
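
At 16 kHz, a 512-point window spans 32 ms and a 50% overlap gives a 256-sample (16 ms) hop, yielding 257 frequency bins per frame. The NumPy-only sketch below illustrates these settings; the absence of padding and the symmetric Hann window from `np.hanning` are assumptions, and the authors' framing details may differ.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """One-sided STFT with a Hann window; returns shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop   # no padding: trailing samples are dropped
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1).T

fs = 16000
X = stft(np.zeros(fs))   # one second of audio
print(X.shape)           # (257, 61)
```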

Experimental Results

Main Results

Model  | Supervision                 | SISDR (dB) | ESTOI     | WB-PESQ   | SRMR
FSN    | Fully Supervised            | 5.6±3.9    | 0.84±0.10 | 2.55±0.67 | 8.2±3.5
FSN    | Weakly Supervised           | 2.9±3.5    | 0.71±0.15 | 1.78±0.70 | 6.9±2.8
FSN    | Blind Supervised (Proposed) | 2.8±3.4    | 0.71±0.15 | 1.78±0.70 | 6.9±2.8
BiLSTM | Fully Supervised            | 1.3±4.3    | 0.78±0.12 | 2.25±0.78 | 7.9±3.0
BiLSTM | Weakly Supervised           | 1.6±3.7    | 0.71±0.15 | 1.84±0.74 | 6.9±2.8
BiLSTM | Blind Supervised (Proposed) | 1.5±3.7    | 0.71±0.15 | 1.84±0.74 | 6.9±2.8
BiLSTM | SRMR Baseline               | -1.5±3.5   | 0.64±0.18 | 1.78±0.72 | 10.9±4.3
-      | Reverberant Signal          | -1.3±3.5   | 0.69±0.16 | 1.75±0.74 | 6.9±2.9

Key Findings

  1. Consistency Advantage: The proposed method matches or outperforms the SRMR baseline on SISDR, ESTOI, and WB-PESQ (the FSN variant ties it on WB-PESQ at 1.78)
  2. Baseline Limitations: The MetricGAN-U baseline achieves the best SRMR (10.9) but degrades SISDR and ESTOI, falling below the unprocessed reverberant signal on both
  3. Estimation Robustness: The blind supervised version performs nearly identically to the weakly supervised version, indicating robustness to RT60 estimation errors
  4. Model Adaptability: BiLSTM shows smaller performance degradation from fully supervised to weakly supervised settings, possibly because it only processes magnitude masks and is insensitive to phase disturbances
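
To make these comparisons concrete, the snippet below computes each system's change relative to the unprocessed reverberant signal, using only the mean values from the results table. The standard deviations in the table are large compared with several of these differences, so the numbers show trends rather than statistical significance.

```python
reverberant = {"SISDR": -1.3, "ESTOI": 0.69, "WB-PESQ": 1.75, "SRMR": 6.9}
systems = {
    "FSN blind (proposed)":    {"SISDR": 2.8,  "ESTOI": 0.71, "WB-PESQ": 1.78, "SRMR": 6.9},
    "BiLSTM blind (proposed)": {"SISDR": 1.5,  "ESTOI": 0.71, "WB-PESQ": 1.84, "SRMR": 6.9},
    "BiLSTM SRMR baseline":    {"SISDR": -1.5, "ESTOI": 0.64, "WB-PESQ": 1.78, "SRMR": 10.9},
}

# Difference of means: system minus unprocessed reverberant signal
gains = {
    name: {metric: round(scores[metric] - reverberant[metric], 2) for metric in reverberant}
    for name, scores in systems.items()
}
for name, delta in gains.items():
    print(name, delta)
```

The proposed variants gain 4.1 dB (FSN) and 2.8 dB (BiLSTM) in SISDR with no change in SRMR, while the SRMR baseline gains 4.0 in SRMR and loses 0.2 dB in SISDR and 0.05 in ESTOI.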

Related Work

Traditional Methods

  • Statistical Signal Processing: Such as Weighted Prediction Error (WPE) methods
  • Convolutional Transfer Function Approximation: Models reverberation as a filtering process in subbands

Deep Learning Methods

  • Discriminative Methods: Directly predict clean signals or complex masks
  • Generative Methods: Such as Variational Autoencoders learning clean speech distributions
  • Hybrid Methods: Combining traditional models and deep learning, such as USDNet

Unsupervised Methods

  • MetricGAN-U: Uses adversarial networks to optimize specific metrics
  • Diffusion Model Methods: Such as BUDDy using diffusion models for blind dereverberation

Conclusions and Discussion

Main Conclusions

  1. Reverberation self-supervision achieves more consistent performance improvements than metric self-supervision
  2. The method improves performance across multiple objective metrics, avoiding limitations of single-metric optimization
  3. Blind RT60 estimation does not significantly impact performance, enhancing practical applicability

Limitations

  1. Model Complexity: Requires additional reverberation modeling components compared to pure data-driven methods
  2. Parameter Dependency: Although RT60 can be estimated blindly, the method still depends on the accuracy of acoustic parameters such as RT60
  3. Simplified Reverberation Model: The Polack model used is a simplified reverberation model that may not fully match real environments
  4. Phase Sensitivity: Complex spectral methods (such as FSN) are more sensitive to phase disturbances in the reverberation model

Future Directions

  1. Generative Extensions: Apply the method to generative models, which can better account for probabilistic RIR models
  2. More Complex Reverberation Models: Consider more accurate physical reverberation models
  3. Multi-Channel Extension: Extend to multi-microphone scenarios
  4. Real-Time Applications: Optimize computational efficiency for real-time processing support

In-Depth Evaluation

Strengths

  1. Novelty: Proposes a reverberation self-supervision training strategy in which a physical reverberation model, rather than a quality metric, supervises training
  2. High Practical Value: Addresses the practical problem of difficulty in obtaining paired training data
  3. Comprehensive Experiments: Thorough evaluation across multiple metrics and model architectures
  4. Open-Source Contribution: Provides complete code and models, facilitating research reproducibility
  5. Solid Theoretical Foundation: Based on mature acoustic reverberation theory

Weaknesses

  1. Performance Gap: Still shows notable performance gap compared to fully supervised methods
  2. Evaluation Limitations: Evaluation only on simulated data, lacking validation in real environments
  3. Insufficient Parameter Sensitivity Analysis: Limited analysis of sensitivity to reverberation model parameters
  4. Computational Overhead: Requires additional reverberation modeling computation during training

Impact

  1. Academic Contribution: Provides a new unsupervised training paradigm for speech dereverberation
  2. Practical Value: Reduces data requirements for high-quality dereverberation systems
  3. Reproducibility: Open-source code and detailed experimental settings ensure reproducibility
  4. Inspirational Significance: Provides insights for physical model supervision in other speech enhancement tasks

Applicable Scenarios

  1. Data-Scarce Scenarios: Application environments lacking paired training data
  2. Specific Acoustic Environments: Fixed environments with known basic acoustic parameters
  3. Rapid Deployment: Systems requiring quick adaptation to new environments
  4. Research Prototypes: As foundational components for more complex systems

References

The paper cites important works in related fields, including:

  • Classical theoretical foundations of the Polack reverberation model
  • Traditional dereverberation methods such as WPE
  • Recent unsupervised methods like MetricGAN-U
  • Advanced speech enhancement models such as FullSubNet
  • Related algorithms for blind reverberation parameter estimation

This paper presents an innovative unsupervised speech dereverberation framework that cleverly combines acoustic modeling and deep learning, achieving a good balance between practicality and performance. Although there remains a gap compared to fully supervised methods, it provides a valuable solution to address the practical challenge of data acquisition in real-world applications.