Spatially-Augmented Sequence-to-Sequence Neural Diarization for Meetings
Li, Cheng, Zhang et al.
This paper proposes a Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) framework, which integrates direction-of-arrival (DOA) cues estimated by SRP-DNN into the S2SND backbone. A two-stage training strategy is adopted: the model is first trained with single-channel audio and DOA features, and then further optimized with multi-channel inputs under DOA guidance. In addition, a simulated DOA generation scheme is introduced to alleviate dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in the offline mode and over 19% improvement when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding good performance in both online and offline settings.
Speaker diarization aims to answer the question "who spoke when," serving as a fundamental preprocessing step for downstream tasks such as automatic speech recognition. Despite significant progress in this field, speaker diarization in meeting scenarios remains challenging, and each family of existing approaches has a limitation:
Early Modular Approaches: Segment the audio into short utterances and cluster them by speaker-embedding similarity. They assume each segment contains only one speaker, so they perform poorly on overlapping speech.
End-to-End Neural Diarization (EEND): Handles overlapping speech, but still relies primarily on acoustic embeddings.
Sequence-to-Sequence Neural Diarization (S2SND): Advances online diarization, but lacks explicit spatial information.
Most existing methods rely solely on acoustic embeddings, which are often unreliable in real meetings. The key question is: How can spatial cues from multi-channel recordings be leveraged to improve speaker diarization?
Proposes SA-S2SND Framework: Integrates DNN-derived DOA as explicit spatial input into S2SND for both online and offline speaker diarization
Designs Simulated DOA Method: Decouples spatial cues from array design, enabling effective utilization of spatial information without requiring large multi-channel corpora
Validates Effectiveness: Evaluates SA-S2SND on the AliMeeting dataset, demonstrating consistent DER improvements over the S2SND baseline in both online and offline modes
Two-Stage Training Strategy: Trains first with single-channel audio and DOA features, then further optimizes with multi-channel inputs under DOA guidance, giving a consistent progression from single-channel to multi-channel modeling
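The simulated DOA idea listed above can be illustrated with a small sketch. This summary does not describe how the simulated DOA is generated, so everything below is a hypothetical illustration and not the paper's scheme: the function name `simulate_doa`, the random fixed azimuth per speaker, the Gaussian bump shape, and the parameters `num_bins` and `sigma_deg` are all assumptions. The only point it demonstrates is that a DOA-like probability matrix can be derived from speaker activity labels alone, without a microphone array.

```python
import math
import random

def simulate_doa(labels, num_bins=360, sigma_deg=5.0, seed=0):
    """Build a T x A pseudo-DOA matrix from a T x S 0/1 activity matrix.

    Hypothetical scheme (an assumption, not taken from the paper):
    each speaker is given one random, fixed azimuth; every active speaker
    adds a circular Gaussian bump around that azimuth; each non-silent row
    is normalized to sum to 1, and silent rows stay all zeros.
    """
    rng = random.Random(seed)
    num_speakers = len(labels[0])
    azimuths = [rng.uniform(0.0, 360.0) for _ in range(num_speakers)]
    bin_width = 360.0 / num_bins
    out = []
    for frame in labels:
        row = [0.0] * num_bins
        for spk, active in enumerate(frame):
            if not active:
                continue
            for a in range(num_bins):
                diff = abs(a * bin_width - azimuths[spk])
                diff = min(diff, 360.0 - diff)  # wrap-around angular distance
                row[a] += math.exp(-0.5 * (diff / sigma_deg) ** 2)
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else row)
    return out

# Three frames, two speakers: speaker 0 alone, silence, then overlap.
doa = simulate_doa([[1, 0], [0, 0], [1, 1]], num_bins=36)
```

Because the azimuths are drawn at random rather than measured, a matrix produced this way is independent of any particular array geometry, which is the property the contribution above emphasizes.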
The speaker diarization task aims to determine the identity of active speakers for each time segment in multi-speaker audio. Input consists of multi-channel audio signals, and output comprises speaker activity labels and speaker representations for each time frame.
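To make the output format concrete, the sketch below converts a list of speaker segments into a frame-level activity matrix with one row per time frame and one column per speaker. The function name `segments_to_frames`, the `(speaker_index, start_sec, end_sec)` tuple layout, and the frame shift value are assumptions chosen for illustration; the summary does not specify them.

```python
def segments_to_frames(segments, num_speakers, duration, frame_shift=0.01):
    """Convert (speaker_index, start_sec, end_sec) segments into a T x S 0/1 matrix."""
    num_frames = int(round(duration / frame_shift))
    labels = [[0] * num_speakers for _ in range(num_frames)]
    for spk, start, end in segments:
        first = int(round(start / frame_shift))
        last = min(num_frames, int(round(end / frame_shift)))
        for t in range(first, last):
            labels[t][spk] = 1  # several speakers may be active in the same frame
    return labels

# Two speakers overlapping between 0.2 s and 0.3 s, with a coarse 0.1 s frame shift.
labels = segments_to_frames([(0, 0.0, 0.3), (1, 0.2, 0.5)],
                            num_speakers=2, duration=0.5, frame_shift=0.1)
# labels == [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]]
```

The third frame has both columns set to 1, which is the overlapping-speech case that clustering-based pipelines with a one-speaker-per-segment assumption cannot represent.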
Encoder: Conformer for modeling long-range dependencies
Representation Decoder: Generates target embeddings Ê
Detection Decoder: Predicts activity Ŷ
DOA Integration Method:
X = X + Linear_{R^A → R^D}(interpolate(O)) / D
where O ∈ R^{T''×A} is the DOA probability matrix, integrated into encoder representations through nearest-neighbor interpolation and linear projection.
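The integration step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that A is the number of DOA bins, that D is the encoder feature dimension, that X has one D-dimensional row per encoder frame, and that nearest-neighbor interpolation uses a floor-index convention. The helper names `nearest_interpolate`, `linear`, and `add_doa` are hypothetical, and the weights would be learned in practice.

```python
def nearest_interpolate(rows, target_len):
    """Resample a list of rows along time to target_len by nearest-neighbor indexing."""
    src_len = len(rows)
    # Floor-index convention (an assumption): target frame t copies source frame t * src / target.
    return [rows[min(src_len - 1, (t * src_len) // target_len)] for t in range(target_len)]

def linear(vec, weight, bias):
    """Project a length-A vector to length D; weight is D x A, bias has length D."""
    return [sum(w * v for w, v in zip(w_row, vec)) + b for w_row, b in zip(weight, bias)]

def add_doa(x, doa, weight, bias):
    """Return X + Linear(interpolate(O)) / D for X of shape T x D and O of shape T'' x A."""
    d = len(x[0])
    doa_up = nearest_interpolate(doa, len(x))  # align O with the time axis of X
    return [
        [xi + pi / d for xi, pi in zip(x_row, linear(o_row, weight, bias))]
        for x_row, o_row in zip(x, doa_up)
    ]

# Toy example: 4 encoder frames (D = 2), 2 DOA frames (A = 3).
x = [[0.0, 0.0] for _ in range(4)]
doa = [[1, 0, 0], [0, 0, 1]]
weight = [[2, 0, 0], [0, 0, 4]]  # hypothetical projection weights
bias = [0, 0]
fused = add_doa(x, doa, weight, bias)
# fused == [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
```

Each DOA frame is repeated to cover two encoder frames, projected from A to D dimensions, scaled by 1/D, and added to the encoder representation, so the spatial cue acts as an additive bias on the acoustic features.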
Compared to existing work, this paper is the first to effectively integrate explicit DOA cues into a sequence-to-sequence speaker diarization framework, and it proposes a simulated DOA generation scheme to reduce dependence on matched multi-channel corpora.
The paper cites 36 relevant references covering key works in speaker diarization, multi-channel signal processing, and deep learning, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality research paper that proposes innovative methods for utilizing spatial information in speaker diarization. The experimental design is rigorous, the results are convincing, and the practical value is substantial. The main innovation lies in effectively integrating explicit DOA cues into the sequence-to-sequence framework and in mitigating multi-channel data scarcity through two-stage training and simulated DOA generation.