Unsupervised Speech Dereverberation with Hybrid Model
This paper proposes a training strategy that improves speech dereverberation systems in an unsupervised manner, using only reverberant speech. Most existing algorithms rely on paired clean/reverberant data, which is difficult to obtain. The method instead uses limited acoustic information, such as the reverberation time (RT60), to train the dereverberation system. Experimental results show that the method achieves more consistent performance across several objective metrics than state-of-the-art approaches.
Core Problem: In indoor environments, speech signals are affected by wall reflections and diffraction around obstacles. The resulting reverberation reduces speech intelligibility, and dereverberation methods aim to mitigate this effect.
Problem Significance: Reverberation severely affects speech quality and intelligibility. Effective dereverberation techniques are needed in applications such as speech recognition and communication systems.
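As a rough illustration of how reflections smear a signal, reverberant speech is commonly modeled as the dry signal convolved with a room impulse response (RIR). The sketch below uses only the Python standard library; the function name `convolve` and the toy values are assumptions made for illustration and do not come from the paper.

```python
def convolve(signal, rir):
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


# Toy RIR (illustrative values): direct path at lag 0,
# then two attenuated reflections at lags 3 and 5.
toy_rir = [1.0, 0.0, 0.0, 0.5, 0.0, 0.25]

# A two-sample "dry" signal.
dry = [1.0, -1.0]

# Each reflection adds a delayed, attenuated copy of the dry signal.
wet = convolve(dry, toy_rir)
```

The delayed copies in `wet` overlap with later speech sounds in a real recording, which is what degrades intelligibility.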
Limitations of Existing Methods:
Discriminative methods require large amounts of paired (clean, reverberant) data, which is difficult to obtain
Generative methods, while requiring less supervision, still need clean speech data, which is harder to acquire than reverberant data
Methods like MetricGAN-U use only reverberant signals but optimize a single metric, so improvements do not carry over to other metrics
Research Motivation: Develop an unsupervised dereverberation method using only reverberant speech, leveraging limited acoustic information such as reverberation time for training.
Proposed a reverberation-based self-supervised training framework: a reverberation model, rather than a conventional metric, supervises the training of the deep neural network
Designed a reverberation-time-aware training strategy: combines acoustic modeling with deep learning, using parameters such as RT60 to guide training
Achieved more consistent performance improvements: outperforms metric-supervised methods across multiple objective metrics
Provided an open-source implementation: released code, pre-trained models, and examples to support reproducibility
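To make the role of RT60 concrete, the following is a minimal sketch of a synthetic late-reverberation impulse response in the style of Polack's statistical model: Gaussian noise shaped by an exponential envelope that falls by 60 dB at t = RT60. This is a generic textbook construction, not the paper's implementation; the function names, the 16 kHz sample rate, and the seed are assumptions made for illustration.

```python
import math
import random


def polack_decay_rate(rt60):
    """Amplitude decay rate (1/s) so that the envelope drops 60 dB at t = rt60."""
    return 3.0 * math.log(10.0) / rt60


def polack_rir(rt60, sample_rate=16000, duration=None, seed=0):
    """Synthetic RIR: white Gaussian noise multiplied by exp(-delta * t)."""
    rng = random.Random(seed)
    delta = polack_decay_rate(rt60)
    if duration is None:
        duration = rt60  # assumption: truncate the tail at RT60
    n = int(duration * sample_rate)
    return [
        rng.gauss(0.0, 1.0) * math.exp(-delta * k / sample_rate)
        for k in range(n)
    ]


def envelope_db(rt60, t):
    """Level of the deterministic envelope at time t, in dB relative to t = 0."""
    delta = polack_decay_rate(rt60)
    return 20.0 * math.log10(math.exp(-delta * t))


rir = polack_rir(rt60=0.5)
level_at_rt60 = envelope_db(0.5, 0.5)  # -60 dB by construction
```

A single scalar such as RT60 fixes the whole envelope here, which is why limited acoustic information can be enough to parameterize a reverberation model.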
Input: Reverberant speech signal Y
Output: Estimated clean speech signal Ŝ
Constraint: Training uses only reverberant signals without requiring paired clean/reverberant data
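The setup above can be written compactly under the usual convolutive model of reverberation. The notation is an assumption made for illustration: lowercase symbols denote time-domain signals, $f_\theta$ is a hypothetical name for the dereverberation network, and the last line is the standard form of the Polack model, with $b(t)$ zero-mean white Gaussian noise. Whether the paper uses exactly this parameterization is not stated in this summary.

```latex
\begin{align}
  y(t) &= (s * h)(t) = \sum_{\tau \ge 0} h(\tau)\, s(t - \tau), \\
  \hat{s} &= f_\theta(y), \\
  h(t) &= b(t)\, e^{-\delta t}, \qquad \delta = \frac{3 \ln 10}{\mathrm{RT}_{60}}, \quad t \ge 0 .
\end{align}
```

The choice of $\delta$ follows from requiring $20 \log_{10} e^{-\delta\, \mathrm{RT}_{60}} = -60$ dB. During training only $y$ is observed, and neither $s$ nor $h$ is available.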
Consistency Advantage: The proposed method outperforms the SRMR-based baseline on the SI-SDR, ESTOI, and WB-PESQ metrics
Baseline Limitations: The MetricGAN-U baseline achieves the best SRMR score but degrades the other metrics, in some cases to below the scores of the unprocessed reverberant signal
Estimation Robustness: The blind version performs nearly identically to the weakly supervised version, indicating robustness to RT60 estimation errors
Model Adaptability: BiLSTM shows a smaller performance drop from the fully supervised to the weakly supervised setting, possibly because it operates only on magnitude masks and is therefore insensitive to phase perturbations
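For reference, one of the metrics mentioned above, SI-SDR, can be computed as in the sketch below. It follows the standard definition (project the estimate onto the reference, then compare target and residual energies) and uses only the Python standard library. The function name, the `eps` stabilizer, and the toy signals are assumptions made for illustration and are not taken from the paper's code.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-target part and a residual.
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy example: the error is orthogonal to the reference and has
# 1/100 of its energy, giving about 20 dB.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
score = si_sdr(estimate, reference)

# Rescaling the estimate leaves the score essentially unchanged.
score_scaled = si_sdr([2.0 * e for e in estimate], reference)
```

Because the metric ignores overall gain, it rewards waveform fidelity rather than loudness, which is one reason it can disagree with a perceptual score such as SRMR.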
The paper cites important works in related fields, including:
Classical theoretical foundations of the Polack reverberation model
Traditional dereverberation methods such as WPE
Recent unsupervised methods like MetricGAN-U
Advanced speech enhancement models such as FullSubNet
Related algorithms for blind reverberation parameter estimation
This paper presents an unsupervised speech dereverberation framework that combines acoustic modeling with deep learning, balancing practicality and performance. Although a gap remains relative to fully supervised methods, the approach addresses the practical difficulty of acquiring paired training data in real-world applications.