Spatially-Augmented Sequence-to-Sequence Neural Diarization for Meetings

Li, Cheng, Zhang et al.

This paper proposes a Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) framework, which integrates direction-of-arrival (DOA) cues estimated by SRP-DNN into the S2SND backbone. A two-stage training strategy is adopted: the model is first trained with single-channel audio and DOA features, and then further optimized with multi-channel inputs under DOA guidance. In addition, a simulated DOA generation scheme is introduced to alleviate dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in the offline mode and over 19% improvement when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding good performance in both online and offline settings.

Basic Information

Paper ID: 2510.09505
Title: Spatially-Augmented Sequence-to-Sequence Neural Diarization for Meetings
Authors: Li Li, Ming Cheng, Hongyu Zhang, Juan Liu, Ming Li
Classification: eess.AS (Audio and Speech Processing)
Publication Date: October 10, 2025
Paper Link: https://arxiv.org/abs/2510.09505v1

Abstract

This paper proposes a spatially-augmented sequence-to-sequence neural diarization (SA-S2SND) framework that integrates direction-of-arrival (DOA) cues estimated via SRP-DNN into the S2SND backbone network. A two-stage training strategy is employed: the model is first trained using single-channel audio and DOA features, then further optimized with multi-channel inputs under DOA guidance. Additionally, a simulated DOA generation scheme is introduced to reduce dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in offline mode, with improvements exceeding 19% when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding strong performance in both online and offline settings.

Research Background and Motivation

Core Problem

Speaker diarization aims to answer the question "who spoke when," serving as a fundamental preprocessing step for downstream tasks such as automatic speech recognition. Despite significant progress in this field, speaker diarization in meeting scenarios remains challenging, primarily due to:

Overlapping Speech: Multiple speakers speaking simultaneously
Unreliable Speaker Embeddings: Difficulty in extracting speaker characteristics in noisy environments
Reverberation: Acoustic distortion caused by indoor environments

Limitations of Existing Methods

Early Modular Approaches: Segment audio into short utterances and cluster via speaker embedding similarity, assuming each segment contains only one speaker, performing poorly on overlapping speech
End-to-End Neural Diarization (EEND): While addressing overlap issues, still primarily relies on acoustic embeddings
Sequence-to-Sequence Diarization (S2SND): Progress in online diarization, but lacks explicit spatial information

Research Motivation

Most existing methods rely solely on acoustic embeddings, which are often unreliable in real meetings. The key question is: How can spatial cues from multi-channel recordings be leveraged to improve speaker diarization?

Core Contributions

Proposes SA-S2SND Framework: Integrates DNN-derived DOA as explicit spatial input into S2SND for both online and offline speaker diarization
Designs Simulated DOA Method: Decouples spatial cues from array design, enabling effective utilization of spatial information without requiring large multi-channel corpora
Validates Effectiveness: Verifies SA-S2SND on the AliMeeting dataset, demonstrating consistent DER improvements over S2SND baseline in both modes
Two-Stage Training Strategy: Trains first with single-channel audio, then extends to multi-channel, ensuring a consistent progression from pure acoustic to spatially-augmented modeling

Methodology Details

Task Definition

The speaker diarization task aims to determine the identity of active speakers for each time segment in multi-speaker audio. Input consists of multi-channel audio signals, and output comprises speaker activity labels and speaker representations for each time frame.

Model Architecture

1. DOA Estimation Module (SRP-DNN)

Employs SRP-DNN for robust multi-source DOA estimation:

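Core Concept: Learns direct-path inter-channel phase differences (DP-IPDs); for the k-th source, the DOA is represented by its elevation and azimuth: $\theta_k = [\theta_{ele}^k, \theta_{azi}^k]^T$

Training Objective: For microphone pair $(m, m')$ at frame $n$, the target is a weighted sum of the direct-path IPD vectors of the $K$ sources, with $\beta_k(n)$ the weight of source $k$: $R_{mm'}(n) = \sum_{k=1}^K \beta_k(n) r_{mm'}(\theta_k(n))$

Spatial Spectrum Construction: The predicted vector $\hat{R}_{mm'}(n)$ is correlated with the theoretical vector $r_{mm'}(\theta)$ of each candidate direction, then normalized by the number of microphone pairs and by $F$: $P'(\theta;n) = \frac{2}{M(M-1)F} \sum_{m=1}^{M-1} \sum_{m'=m+1}^M \Re\{\hat{R}_{mm'}(n)^H r_{mm'}(\theta)\}$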

Multi-Source Localization: Employs iterative detection-and-localization (IDL) strategy for multi-speaker scenarios.
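The spatial spectrum formula above can be illustrated with a small NumPy sketch. This is not the SRP-DNN implementation: the network that predicts the DP-IPD vectors is replaced by the theoretical vector of a known direction, and the array geometry (8 microphones on a 5 cm circle), the far-field azimuth-only model, the frequency grid, the sign convention of the phase, and all function names are assumptions made for illustration.

```python
import numpy as np

def theoretical_ipd(mic_xy, azimuths_rad, freqs_hz, c=343.0):
    """Far-field direct-path IPD vectors r_{mm'}(theta) for every pair m < m'.

    Returns a complex array of shape (num_pairs, num_angles, num_freqs).
    """
    unit = np.stack([np.cos(azimuths_rad), np.sin(azimuths_rad)], axis=-1)  # (A, 2)
    delays = -(mic_xy @ unit.T) / c                      # (M, A) relative arrival times
    m_idx, mp_idx = np.triu_indices(len(mic_xy), k=1)    # all pairs with m < m'
    tdoa = delays[mp_idx] - delays[m_idx]                # (P, A)
    # Assumed sign convention for the inter-channel phase difference.
    return np.exp(-2j * np.pi * freqs_hz[None, None, :] * tdoa[:, :, None])

def spatial_spectrum(r_hat, r_theory, num_mics):
    """P'(theta) = 2 / (M (M-1) F) * sum over pairs of Re{ R_hat^H r(theta) }.

    r_hat:    (num_pairs, num_freqs) predicted DP-IPD vectors for one frame.
    r_theory: (num_pairs, num_angles, num_freqs) candidate-direction vectors.
    """
    num_freqs = r_hat.shape[-1]
    # Inner product over frequency, summed over microphone pairs.
    inner = np.einsum("pf,paf->a", np.conj(r_hat), r_theory)
    return 2.0 * inner.real / (num_mics * (num_mics - 1) * num_freqs)

# Hypothetical setup: 8-mic circular array, 5-degree azimuth grid, 64 frequencies.
num_mics = 8
mic_angles = 2 * np.pi * np.arange(num_mics) / num_mics
mic_xy = 0.05 * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=-1)
azimuths = np.deg2rad(np.arange(0, 360, 5))
freqs = np.linspace(100.0, 4000.0, 64)

r_theory = theoretical_ipd(mic_xy, azimuths, freqs)
true_index = 20                                  # source at 100 degrees
r_hat = r_theory[:, true_index, :]               # stand-in for the network output
spectrum = spatial_spectrum(r_hat, r_theory, num_mics)
estimated_index = int(np.argmax(spectrum))
```

With a single source of unit weight the stand-in prediction equals the theoretical vector, so the spectrum peaks at the true direction with value 1. Multi-source frames would need the iterative detection-and-localization step, which this sketch does not cover.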

2. SA-S2SND Architecture

Based on S2SND backbone network, comprising four core modules:

Extractor: ResNet + segmental statistics pooling (SSP)
Encoder: Conformer for modeling long-range dependencies
Representation Decoder: Generates target embeddings Ê
Detection Decoder: Predicts activity Ŷ

DOA Integration Method: $X = X + \text{Linear}_{R^A \rightarrow R^D}(\text{interpolate}(O))/\sqrt{D}$

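where $O \in \mathbb{R}^{T'' \times A}$ is the DOA probability matrix. It is resampled to the frame rate of the encoder representations $X$ by nearest-neighbor interpolation, projected from dimension $A$ to the model dimension $D$ by a linear layer, scaled by $1/\sqrt{D}$, and added to $X$.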
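A minimal NumPy sketch of this additive injection is given below. The shapes, the random projection weights, the one-hot DOA rows, and the function names are illustrative assumptions; in the actual model the linear layer is learned and $X$ comes from the encoder.

```python
import numpy as np

def nearest_interpolate(doa, target_len):
    """Resample a (T'', A) matrix to (target_len, A) by nearest neighbor in time."""
    src_len = doa.shape[0]
    idx = np.floor(np.arange(target_len) * src_len / target_len).astype(int)
    return doa[idx]

def inject_doa(x, doa, weight, bias):
    """Return X + Linear(interpolate(O)) / sqrt(D), with Linear mapping R^A to R^D."""
    dim = x.shape[1]
    doa_up = nearest_interpolate(doa, x.shape[0])        # (T, A)
    return x + (doa_up @ weight + bias) / np.sqrt(dim)

# Hypothetical sizes: T = 8 encoder frames, T'' = 4 DOA frames, A = 6, D = 16.
rng = np.random.default_rng(0)
x = np.zeros((8, 16))                                    # placeholder encoder output
doa = np.eye(6)[[0, 2, 2, 5]]                            # one-hot direction per DOA frame
weight = rng.standard_normal((6, 16))                    # stand-in for learned weights
bias = np.zeros(16)
out = inject_doa(x, doa, weight, bias)
```

The $1/\sqrt{D}$ scaling keeps the injected term small relative to the encoder features, in the same way positional information is commonly added to a sequence representation.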

Technical Innovations

Explicit Spatial Cue Injection: Unlike blind fusion, directly uses DOA estimation to provide directional evidence
Simulated DOA Strategy:
- Real multi-channel speech + SRP-DNN estimated DOA
- Simulated multi-channel speech + randomly generated pseudo-DOA
Two-Stage Training:
- Part A: Single-channel model + multi-channel DOA (stages 1-3)
- Part B: Multi-channel model + multi-channel DOA (stages 4-5)
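The summary does not spell out how the pseudo-DOA for simulated speech is generated. One possible reading, sketched below purely as an assumption, is to give every simulated speaker a random azimuth bin and mark that bin whenever the speaker is active according to the mixture's ground-truth labels. The function name, the number of bins, and the binary (rather than probabilistic) encoding are all hypothetical.

```python
import numpy as np

def pseudo_doa(activity, num_bins, rng):
    """Build a (T, num_bins) pseudo-DOA matrix from (T, S) binary speaker activity."""
    num_frames, num_speakers = activity.shape
    # Assumption: each simulated speaker keeps one distinct random azimuth bin.
    bins = rng.choice(num_bins, size=num_speakers, replace=False)
    doa = np.zeros((num_frames, num_bins))
    for spk, azimuth_bin in enumerate(bins):
        doa[activity[:, spk] > 0, azimuth_bin] = 1.0     # mark frames where spk talks
    return doa, bins

# Toy mixture: 4 frames, 2 speakers, with one overlapped and one silent frame.
activity = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
rng = np.random.default_rng(0)
doa, bins = pseudo_doa(activity, num_bins=72, rng=rng)
active_bins_per_frame = doa.sum(axis=1).tolist()
```

Because the directions are drawn at random rather than estimated from a physical array, a scheme of this kind needs no multi-channel recordings, which is the property the simulated DOA strategy relies on.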

Experimental Setup

Datasets

Simulated Data: VoxCeleb2 (1M utterances, 6,112 speakers) for online mixture generation
Real Data: AliMeeting (training set 104.75h, evaluation set 4h, test set 10h)
- 8-channel far-field array and headset recordings
- Uses NARA-WPE dereverberated far-field array signals

Evaluation Metrics

DER (Diarization Error Rate): Computed without oracle VAD and without collar tolerance
Reports performance separately for 1-2 speaker and 2+ speaker scenarios
Performance comparison in online and offline modes
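For reference, DER is conventionally the sum of missed speech, false alarm, and speaker confusion durations divided by the total reference speech duration. The sketch below only illustrates that ratio; the durations are made-up numbers, not values from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER in percent from error durations and reference speech duration (seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech

# Hypothetical durations in seconds, for illustration only.
der = diarization_error_rate(missed=30.0, false_alarm=20.0,
                             confusion=50.0, total_speech=1000.0)
```

Scoring without oracle VAD means the system's own speech detection errors count toward the missed and false alarm terms.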

Baseline Methods

S2SND baseline (single-channel and multi-channel versions)
BUT System (state-of-the-art)
Different model sizes: Small (16.56M parameters) and Medium (45.96M parameters)

Implementation Details

Audio Processing: 8s window, 2s overlap, 80-dimensional log-Mel filterbank
Training: AdamW optimizer, BCE + ArcFace loss
Inference: Block-level sliding window, 0.8s online latency
Hardware: Two RTX-A6000 GPUs

Experimental Results

Main Results

Model	Channels	DOA	Total DER (Online %)	Total DER (Offline %)
S2SND	1	✗	16.03	13.59
SA-S2SND	1	✓	15.35	12.59
S2SND	8	✗	14.85	12.79
SA-S2SND	8	✓	12.93	10.84

Key Findings

Consistent Improvements: Adding DOA brings improvements across all configurations
- Single-channel: Online 4.2%↓, Offline 7.4%↓
- Multi-channel: Online 12.9%↓, Offline 15.2%↓
Multi-Speaker Scenario Advantages: More significant improvements in 2+ speaker scenarios, demonstrating robustness in complex dialogue conditions
Complementarity: Channel attention and DOA are highly complementary
- Channel attention captures correlations
- DOA provides explicit spatial cues
Parameter Efficiency: Best model (E4, 8-channel SA-S2SND with DOA) achieves 19.3%/20.2% relative DER reductions (online/offline) over the baseline (E1, single-channel S2SND without DOA), with a parameter count comparable to SOTA
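The relative reductions quoted above follow from the DER values in the Main Results table as (baseline - improved) / baseline. The short check below recomputes them; only the dictionary layout and variable names are introduced here.

```python
def relative_reduction(baseline, improved):
    """Relative DER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

# (online DER %, offline DER %) copied from the Main Results table.
der = {
    ("S2SND", 1): (16.03, 13.59),
    ("SA-S2SND", 1): (15.35, 12.59),
    ("S2SND", 8): (14.85, 12.79),
    ("SA-S2SND", 8): (12.93, 10.84),
}

def gains(baseline_key, improved_key):
    """Return (online, offline) relative reductions rounded to one decimal."""
    base, new = der[baseline_key], der[improved_key]
    return tuple(round(relative_reduction(b, n), 1) for b, n in zip(base, new))

single_channel = gains(("S2SND", 1), ("SA-S2SND", 1))
multi_channel = gains(("S2SND", 8), ("SA-S2SND", 8))
best_vs_baseline = gains(("S2SND", 1), ("SA-S2SND", 8))
```

The last comparison, 8-channel SA-S2SND against single-channel S2SND, is the one behind the "over 19%" figure in the abstract.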

DOA Analysis

In the AliMeeting training set, only 5.98% of the duration involves more than two simultaneous speakers
Simulated data shows negligible DOA errors
In real meeting data, azimuth estimation provides clear differentiation between different speakers

Speaker Diarization Development Timeline

Modular Methods: Traditional clustering-based approaches
End-to-End Neural Diarization (EEND): Multi-label prediction task
Target Speaker Voice Activity Detection (TSVAD): Combines modular and neural approaches
Sequence-to-Sequence Diarization (S2SND): Supports online diarization

Multi-Channel Processing Approaches

Speech Enhancement: Beamforming, etc., but may introduce distortion
Channel Fusion: Attention modules aggregate signals, but typically blind fusion
Explicit Features: DOA estimation, etc., providing direct directional evidence

Advantages of This Work

Compared to existing work, this paper is the first to effectively integrate explicit DOA cues into a sequence-to-sequence speaker diarization framework, and it proposes a simulated DOA strategy to reduce dependence on multi-channel corpora.

Conclusions and Discussion

Main Conclusions

Effectiveness of Spatial Cues: DOA cues significantly improve speaker diarization performance
Complementarity: Spatial information is highly complementary to cross-channel modeling
Practicality: Performs well in both online and offline settings
Generalization Capability: Simulated DOA strategy reduces dependence on specific array configurations

Limitations

Multi-Speaker Constraints: SRP-DNN's IDL strategy tracks at most two simultaneous speakers
Array Dependence: Requires retraining SRP-DNN to adapt to different array configurations
Computational Complexity: Adds computational overhead from DOA estimation

Future Directions

Multi-Speaker DOA Robustness: Improve handling of more than two simultaneous speakers
Joint Training Strategy: Explore end-to-end training of DOA estimation and speaker diarization
System Performance Enhancement: Further optimize overall system performance

In-Depth Evaluation

Strengths

Strong Innovation:
- First to effectively integrate explicit DOA cues into S2SND framework
- Proposes simulated DOA strategy addressing multi-channel data scarcity
- Well-designed two-stage training strategy
Comprehensive Experiments:
- Full evaluation on standard datasets
- Detailed ablation studies and analysis
- Fair comparison with SOTA methods
Solid Technical Foundation:
- Clever DOA integration method similar to positional encoding
- Addresses multi-channel array adaptation
- Supports both online and offline application scenarios
High Practical Value:
- Significant performance improvements (up to 19%+ relative gains)
- Good parameter efficiency
- Extensible to different array configurations

Weaknesses

Method Limitations:
- Constrained by SRP-DNN's two-speaker limitation
- Requires retraining DOA module for different arrays
- Authenticity of simulated DOA needs verification
Experimental Scope:
- Validation only on AliMeeting dataset
- Lacks robustness analysis under different acoustic conditions
- No computational complexity analysis provided
Insufficient Theoretical Analysis:
- Lacks theoretical explanation for DOA effectiveness
- No analysis of performance under different noise and reverberation conditions

Impact

Academic Contribution: Provides new perspectives on spatial information utilization in speaker diarization
Practical Value: Directly applicable to meeting transcription systems
Reproducibility: Detailed implementation details facilitate reproduction

Applicable Scenarios

Meeting Transcription: Real-time and offline speaker diarization for multi-person meetings
Intelligent Meeting Systems: End-to-end meeting understanding combined with speech recognition
Multi-Channel Speech Processing: Any speech separation task requiring spatial information utilization

References

The paper cites 36 relevant references covering key works in speaker diarization, multi-channel signal processing, and deep learning, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality research paper that proposes innovative methods for utilizing spatial information in speaker diarization. The experimental design is rigorous, results are convincing, and practical value is substantial. The main innovation lies in effectively integrating explicit DOA cues into the sequence-to-sequence framework and solving the multi-channel data scarcity problem through clever training strategies.