2025-11-14T05:22:11.004755

Spatially-Augmented Sequence-to-Sequence Neural Diarization for Meetings

Li, Cheng, Zhang et al.
This paper proposes a Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) framework, which integrates direction-of-arrival (DOA) cues estimated by SRP-DNN into the S2SND backbone. A two-stage training strategy is adopted: the model is first trained with single-channel audio and DOA features, and then further optimized with multi-channel inputs under DOA guidance. In addition, a simulated DOA generation scheme is introduced to alleviate dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in the offline mode and over 19% improvement when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding good performance in both online and offline settings.

Basic Information

  • Paper ID: 2510.09505
  • Title: Spatially-Augmented Sequence-to-Sequence Neural Diarization for Meetings
  • Authors: Li Li, Ming Cheng, Hongyu Zhang, Juan Liu, Ming Li
  • Classification: eess.AS (Audio and Speech Processing)
  • Publication Date: October 10, 2025
  • Paper Link: https://arxiv.org/abs/2510.09505v1

Abstract

This paper proposes a spatially-augmented sequence-to-sequence neural diarization (SA-S2SND) framework that integrates direction-of-arrival (DOA) cues estimated via SRP-DNN into the S2SND backbone network. A two-stage training strategy is employed: the model is first trained using single-channel audio and DOA features, then further optimized with multi-channel inputs under DOA guidance. Additionally, a simulated DOA generation scheme is introduced to reduce dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in offline mode, with improvements exceeding 19% when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding strong performance in both online and offline settings.

Research Background and Motivation

Core Problem

Speaker diarization aims to answer the question "who spoke when," serving as a fundamental preprocessing step for downstream tasks such as automatic speech recognition. Despite significant progress in this field, speaker diarization in meeting scenarios remains challenging, primarily due to:

  1. Overlapping Speech: Multiple speakers speaking simultaneously
  2. Unreliable Speaker Embeddings: Difficulty in extracting speaker characteristics in noisy environments
  3. Reverberation: Acoustic distortion caused by indoor environments

Limitations of Existing Methods

  1. Early Modular Approaches: Segment audio into short segments and cluster them by speaker-embedding similarity; because each segment is assumed to contain only one speaker, they perform poorly on overlapping speech
  2. End-to-End Neural Diarization (EEND): Addresses overlapping speech, but still relies primarily on acoustic embeddings
  3. Sequence-to-Sequence Diarization (S2SND): Advances online diarization, but lacks explicit spatial information

Research Motivation

Most existing methods rely solely on acoustic embeddings, which are often unreliable in real meetings. The key question is: How can spatial cues from multi-channel recordings be leveraged to improve speaker diarization?

Core Contributions

  1. Proposes SA-S2SND Framework: Integrates DNN-derived DOA as explicit spatial input into S2SND for both online and offline speaker diarization
  2. Designs Simulated DOA Method: Decouples spatial cues from array design, enabling effective utilization of spatial information without requiring large multi-channel corpora
  3. Validates Effectiveness: Verifies SA-S2SND on the AliMeeting dataset, demonstrating consistent DER improvements over S2SND baseline in both modes
  4. Two-Stage Training Strategy: Trains first with single-channel audio, then extends to multi-channel, ensuring a consistent progression from pure acoustic to spatially-augmented modeling

Methodology Details

Task Definition

The speaker diarization task aims to determine the identity of active speakers for each time segment in multi-speaker audio. Input consists of multi-channel audio signals, and output comprises speaker activity labels and speaker representations for each time frame.

Model Architecture

1. DOA Estimation Module (SRP-DNN)

Employs SRP-DNN for robust multi-source DOA estimation:

Core Concept: Learns direct-path inter-channel phase differences (DP-IPDs); for the k-th source, the DOA is represented as: $\theta_k = [\theta_{ele}^k, \theta_{azi}^k]^T$

Training Objective: Weighted sum of direct-path IPD vectors: $R_{mm'}(n) = \sum_{k=1}^{K} \beta_k(n)\, r_{mm'}(\theta_k(n))$

Spatial Spectrum Construction: $P'(\theta; n) = \frac{2}{M(M-1)F} \sum_{m=1}^{M-1} \sum_{m'=m+1}^{M} \Re\{\hat{R}_{mm'}(n)^H r_{mm'}(\theta)\}$

Here n is the frame index, K the number of sources, M the number of microphones, F the number of frequency bins, and $\hat{R}_{mm'}(n)$ the network's estimate of $R_{mm'}(n)$ for the microphone pair (m, m').

Multi-Source Localization: Employs iterative detection-and-localization (IDL) strategy for multi-speaker scenarios.
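
As a rough illustration of the spatial spectrum formula above, the following sketch scores candidate azimuths by correlating an estimated DP-IPD vector with theoretical far-field templates. Everything concrete here is an assumption made for the example and is not taken from the paper: the 4-microphone circular array, the azimuth-only search, the frequency grid, the speed of sound, and the function names `ipd_template` and `spatial_spectrum`. The "estimate" is simply the ideal template for a known direction, standing in for a network output.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def ipd_template(mic_a, mic_b, azimuth_deg, freqs_hz):
    """Far-field direct-path IPD vector r_{mm'}(theta) for one microphone pair."""
    az = math.radians(azimuth_deg)
    ux, uy = math.cos(az), math.sin(az)  # unit vector pointing toward the source
    # Time difference of arrival between the two microphones.
    tdoa = ((mic_a[0] - mic_b[0]) * ux + (mic_a[1] - mic_b[1]) * uy) / SPEED_OF_SOUND
    return [cmath.exp(-2j * math.pi * f * tdoa) for f in freqs_hz]


def spatial_spectrum(r_hat, mics, candidates_deg, freqs_hz):
    """P'(theta) = 2 / (M (M-1) F) * sum over pairs of Re{ R_hat^H r(theta) }."""
    M, F = len(mics), len(freqs_hz)
    spectrum = []
    for az in candidates_deg:
        total = 0.0
        for m in range(M - 1):
            for m2 in range(m + 1, M):
                template = ipd_template(mics[m], mics[m2], az, freqs_hz)
                # Real part of the inner product R_hat^H r(theta) for this pair.
                total += sum((rh.conjugate() * t).real
                             for rh, t in zip(r_hat[(m, m2)], template))
        spectrum.append(2.0 * total / (M * (M - 1) * F))
    return spectrum


# Hypothetical 4-mic circular array (radius 5 cm) and a source at 60 degrees.
mics = [(0.05, 0.0), (0.0, 0.05), (-0.05, 0.0), (0.0, -0.05)]
freqs = [500.0 * k for k in range(1, 9)]
true_az = 60
r_hat = {(m, m2): ipd_template(mics[m], mics[m2], true_az, freqs)
         for m in range(4) for m2 in range(m + 1, 4)}

candidates = list(range(0, 360, 5))
spectrum = spatial_spectrum(r_hat, mics, candidates, freqs)
best = candidates[max(range(len(candidates)), key=lambda i: spectrum[i])]
```

With a perfect single-source estimate the spectrum peaks at the true direction with a value of 1, which is what the normalization factor 2/(M(M-1)F) is designed to give. In a multi-source setting the peak for one source would be detected, its contribution removed, and the search repeated, which is the general idea behind an iterative detection-and-localization loop.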

2. SA-S2SND Architecture

Based on S2SND backbone network, comprising four core modules:

  1. Extractor: ResNet + segmental statistics pooling (SSP)
  2. Encoder: Conformer for modeling long-range dependencies
  3. Representation Decoder: Generates target embeddings Ê
  4. Detection Decoder: Predicts activity Ŷ

DOA Integration Method: $X = X + \text{Linear}_{\mathbb{R}^A \rightarrow \mathbb{R}^D}(\text{interpolate}(O)) / \sqrt{D}$

where O ∈ R^{T''×A} is the DOA probability matrix, integrated into encoder representations through nearest-neighbor interpolation and linear projection.
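
The injection step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`nearest_interpolate`, `linear`, `inject_doa`), the floor-index convention for nearest-neighbor interpolation, and the toy weight matrix are all hypothetical. X is treated as a T×D list of encoder features and O as a T''×A DOA probability matrix.

```python
import math


def nearest_interpolate(rows, target_len):
    """Nearest-neighbour resampling along time using a floor-index convention."""
    src_len = len(rows)
    return [rows[min(src_len - 1, (t * src_len) // target_len)]
            for t in range(target_len)]


def linear(row, weight, bias):
    """Project one A-dim row to D dims; weight is D x A, bias has length D."""
    return [sum(w * x for w, x in zip(w_row, row)) + b
            for w_row, b in zip(weight, bias)]


def inject_doa(X, O, weight, bias):
    """Return X + Linear(interpolate(O)) / sqrt(D), frame by frame."""
    D = len(X[0])
    O_up = nearest_interpolate(O, len(X))  # T'' frames -> T frames
    scale = 1.0 / math.sqrt(D)
    return [[x + scale * p for x, p in zip(x_row, linear(o_row, weight, bias))]
            for x_row, o_row in zip(X, O_up)]


# Toy shapes: T = 4, D = 4, T'' = 2, A = 3.
X = [[0.0] * 4 for _ in range(4)]
O = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [1.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0, 0.0]
X_aug = inject_doa(X, O, weight, bias)
```

Each DOA frame is repeated to cover two encoder frames, and the 1/sqrt(D) factor keeps the added term on a scale comparable to the encoder features, much as a scaled positional encoding would.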

Technical Innovations

  1. Explicit Spatial Cue Injection: Unlike blind fusion, directly uses DOA estimation to provide directional evidence
  2. Simulated DOA Strategy:
    • Real multi-channel speech + SRP-DNN estimated DOA
    • Simulated multi-channel speech + randomly generated pseudo-DOA
  3. Two-Stage Training:
    • Part A: Single-channel model + multi-channel DOA (stages 1-3)
    • Part B: Multi-channel model + multi-channel DOA (stages 4-5)
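
The summary does not spell out how the pseudo-DOA is generated, so the following is only one plausible reading, sketched for illustration: each simulated speaker is assigned a random azimuth bin, and the DOA matrix is activated at that bin whenever the speaker is active. The function name `simulate_doa_matrix`, the 72-bin (5-degree) azimuth grid, the one-hot activation, and the seed are all assumptions.

```python
import random


def simulate_doa_matrix(activity, num_bins=72, seed=0):
    """Build a pseudo-DOA matrix from speaker activity labels.

    activity[s][t] is 1 when speaker s is active in frame t, else 0.
    Returns (O, bins): O has shape T x num_bins, and bins[s] is the
    randomly chosen azimuth bin of speaker s.
    """
    rng = random.Random(seed)
    num_speakers = len(activity)
    # Draw a distinct azimuth bin for every speaker (hypothetical choice).
    bins = rng.sample(range(num_bins), num_speakers)
    num_frames = len(activity[0])
    O = [[0.0] * num_bins for _ in range(num_frames)]
    for s, b in enumerate(bins):
        for t in range(num_frames):
            if activity[s][t]:
                O[t][b] = 1.0
    return O, bins


# Two speakers, four frames, with overlap in the second frame.
activity = [[1, 1, 0, 0],
            [0, 1, 1, 0]]
O, bins = simulate_doa_matrix(activity)
active_per_frame = [sum(row) for row in O]
```

Because the directions are drawn at random rather than derived from a room or array model, data produced this way does not depend on any particular microphone geometry, which is the property the simulated DOA strategy relies on.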

Experimental Setup

Datasets

  1. Simulated Data: VoxCeleb2 (1M utterances, 6,112 speakers) for online mixture generation
  2. Real Data: AliMeeting (training set 104.75h, evaluation set 4h, test set 10h)
    • 8-channel far-field array and headset recordings
    • Uses NARA-WPE dereverberated far-field array signals

Evaluation Metrics

  • DER (Diarization Error Rate): Computed without oracle VAD and without a forgiveness collar (tolerance)
  • Reports performance separately for 1-2 speaker and 2+ speaker scenarios
  • Performance comparison in online and offline modes

Baseline Methods

  • S2SND baseline (single-channel and multi-channel versions)
  • BUT System (state-of-the-art)
  • Different model sizes: Small (16.56M parameters) and Medium (45.96M parameters)

Implementation Details

  • Audio Processing: 8s window, 2s overlap, 80-dimensional log-Mel filterbank
  • Training: AdamW optimizer, BCE + ArcFace loss
  • Inference: Block-level sliding window, 0.8s online latency
  • Hardware: Two RTX-A6000 GPUs
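
Reading "8s window, 2s overlap" literally, consecutive windows start 6 s apart. The sketch below lists the resulting window boundaries for a recording of a given length. The function name `window_bounds` and the choice to truncate the final window at the end of the recording are assumptions, not details from the paper.

```python
def window_bounds(duration_s, window_s=8.0, overlap_s=2.0):
    """Return (start, end) times in seconds of overlapping analysis windows."""
    hop = window_s - overlap_s  # 6 s between consecutive window starts
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break  # the recording is fully covered
        start += hop
    return bounds


chunks = window_bounds(20.0)
```

For a 20 s recording this yields three windows, each sharing 2 s with its neighbor.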

Experimental Results

Main Results

Model    | Channels | DOA | Total DER (Online %) | Total DER (Offline %)
S2SND    | 1        | No  | 16.03                | 13.59
SA-S2SND | 1        | Yes | 15.35                | 12.59
S2SND    | 8        | No  | 14.85                | 12.79
SA-S2SND | 8        | Yes | 12.93                | 10.84

Key Findings

  1. Consistent Improvements: Adding DOA brings improvements across all configurations
    • Single-channel: Online 4.2%↓, Offline 7.4%↓
    • Multi-channel: Online 12.9%↓, Offline 15.2%↓
  2. Multi-Speaker Scenario Advantages: More significant improvements in 2+ speaker scenarios, demonstrating robustness in complex dialogue conditions
  3. Complementarity: Channel attention and DOA are highly complementary
    • Channel attention captures cross-channel correlations
    • DOA provides explicit spatial cues
  4. Parameter Efficiency: The best model (E4, 8-channel SA-S2SND) achieves 19.3%/20.3% relative DER reductions (online/offline) over the single-channel S2SND baseline (E1), with a parameter count comparable to the SOTA system
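
The relative reductions quoted above follow from the DER values in the Main Results table via (baseline - system) / baseline. The short check below recomputes them; the function name `relative_reduction` is just a label for this example. Note that the offline gain of the best model over the single-channel baseline comes out closer to 20.2% from the rounded table entries, so the quoted 20.3% presumably reflects unrounded DERs.

```python
def relative_reduction(baseline_der, system_der):
    """Relative DER reduction in percent."""
    return 100.0 * (baseline_der - system_der) / baseline_der


# (online DER, offline DER) in percent, copied from the results table.
der = {
    ("S2SND", 1): (16.03, 13.59),
    ("SA-S2SND", 1): (15.35, 12.59),
    ("S2SND", 8): (14.85, 12.79),
    ("SA-S2SND", 8): (12.93, 10.84),
}

single_online = relative_reduction(der[("S2SND", 1)][0], der[("SA-S2SND", 1)][0])
single_offline = relative_reduction(der[("S2SND", 1)][1], der[("SA-S2SND", 1)][1])
multi_online = relative_reduction(der[("S2SND", 8)][0], der[("SA-S2SND", 8)][0])
multi_offline = relative_reduction(der[("S2SND", 8)][1], der[("SA-S2SND", 8)][1])
best_online = relative_reduction(der[("S2SND", 1)][0], der[("SA-S2SND", 8)][0])
best_offline = relative_reduction(der[("S2SND", 1)][1], der[("SA-S2SND", 8)][1])

for name, value in [("single-channel online", single_online),
                    ("single-channel offline", single_offline),
                    ("multi-channel online", multi_online),
                    ("multi-channel offline", multi_offline),
                    ("best vs. baseline online", best_online),
                    ("best vs. baseline offline", best_offline)]:
    print(f"{name}: {value:.1f}% relative DER reduction")
```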

DOA Analysis

  • In AliMeeting training set, only 5.98% of duration involves more than two simultaneous speakers
  • Simulated data shows negligible DOA errors
  • In real meeting data, azimuth estimation provides clear differentiation between different speakers

Speaker Diarization Development Timeline

  1. Modular Methods: Traditional clustering-based approaches
  2. End-to-End Neural Diarization (EEND): Multi-label prediction task
  3. Target Speaker Voice Activity Detection (TSVAD): Combines modular and neural approaches
  4. Sequence-to-Sequence Diarization (S2SND): Supports online diarization

Multi-Channel Processing Approaches

  1. Speech Enhancement: Beamforming, etc., but may introduce distortion
  2. Channel Fusion: Attention modules aggregate signals, but typically blind fusion
  3. Explicit Features: DOA estimation, etc., providing direct directional evidence

Advantages of This Work

Compared to existing work, this paper is the first to effectively integrate explicit DOA cues into a sequence-to-sequence speaker diarization framework, and it proposes a simulated DOA strategy to reduce dependence on multi-channel corpora.

Conclusions and Discussion

Main Conclusions

  1. Effectiveness of Spatial Cues: DOA cues significantly improve speaker diarization performance
  2. Complementarity: Spatial information is highly complementary to cross-channel modeling
  3. Practicality: Performs well in both online and offline settings
  4. Generalization Capability: Simulated DOA strategy reduces dependence on specific array configurations

Limitations

  1. Multi-Speaker Constraints: SRP-DNN's IDL strategy tracks at most two speakers
  2. Array Dependence: Requires retraining SRP-DNN to adapt to different array configurations
  3. Computational Complexity: Adds computational overhead from DOA estimation

Future Directions

  1. Multi-Speaker DOA Robustness: Improve handling of more than two simultaneous speakers
  2. Joint Training Strategy: Explore end-to-end training of DOA estimation and speaker diarization
  3. System Performance Enhancement: Further optimize overall system performance

In-Depth Evaluation

Strengths

  1. Strong Innovation:
    • First to effectively integrate explicit DOA cues into S2SND framework
    • Proposes simulated DOA strategy addressing multi-channel data scarcity
    • Well-designed two-stage training strategy
  2. Comprehensive Experiments:
    • Full evaluation on standard datasets
    • Detailed ablation studies and analysis
    • Fair comparison with SOTA methods
  3. Solid Technical Foundation:
    • Clever DOA integration method similar to positional encoding
    • Addresses multi-channel array adaptation
    • Supports both online and offline application scenarios
  4. High Practical Value:
    • Significant performance improvements (up to 19%+ relative gains)
    • Good parameter efficiency
    • Extensible to different array configurations

Weaknesses

  1. Method Limitations:
    • Constrained by SRP-DNN's two-speaker limitation
    • Requires retraining DOA module for different arrays
    • Authenticity of simulated DOA needs verification
  2. Experimental Scope:
    • Validation only on AliMeeting dataset
    • Lacks robustness analysis under different acoustic conditions
    • No computational complexity analysis provided
  3. Insufficient Theoretical Analysis:
    • Lacks theoretical explanation for DOA effectiveness
    • No analysis of performance under different noise and reverberation conditions

Impact

  1. Academic Contribution: Provides new perspectives on spatial information utilization in speaker diarization
  2. Practical Value: Directly applicable to meeting transcription systems
  3. Reproducibility: Detailed implementation details facilitate reproduction

Applicable Scenarios

  1. Meeting Transcription: Real-time and offline speaker diarization for multi-person meetings
  2. Intelligent Meeting Systems: End-to-end meeting understanding combined with speech recognition
  3. Multi-Channel Speech Processing: Any speech separation task requiring spatial information utilization

References

The paper cites 36 relevant references covering key works in speaker diarization, multi-channel signal processing, and deep learning, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality research paper that proposes innovative methods for utilizing spatial information in speaker diarization. The experimental design is rigorous, results are convincing, and practical value is substantial. The main innovation lies in effectively integrating explicit DOA cues into the sequence-to-sequence framework and solving the multi-channel data scarcity problem through clever training strategies.