NAP: Attention-Based Late Fusion for Automatic Sleep Staging
Rossi, van der Meer, Schmidt et al.
Polysomnography signals are highly heterogeneous, varying in modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols across datasets and clinical sites. Most existing models that process polysomnography data rely on a fixed subset of modalities or channels and therefore neglect to fully exploit its inherently multimodal nature. We address this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model which learns to combine multiple prediction streams using a tri-axial attention mechanism that captures temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to different input dimensions. By aggregating outputs from frozen, pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensembles, achieving state-of-the-art zero-shot generalization across multiple datasets. While demonstrated in the context of automated sleep staging from polysomnography, the proposed approach could be extended to other multimodal physiological applications.
Polysomnography (PSG) signals are highly heterogeneous: modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols all vary across datasets and clinical centers. Existing models for PSG data largely depend on fixed modality or channel subsets and thus fail to fully exploit its inherently multimodal nature. The paper addresses this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model that uses a tri-axial attention mechanism to learn how to combine multiple prediction streams, capturing temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to varying input dimensions. By aggregating outputs from frozen, pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensemble methods, achieving state-of-the-art zero-shot generalization across multiple datasets.
Core Problem: PSG data are heterogeneous in modality composition, channel configuration, and acquisition protocol, and existing models cannot fully exploit their multimodal nature.
Significance:
Sleep staging is the clinical gold standard for diagnosing sleep-wake disorders
Manual sleep staging is time-consuming and subject to observer bias
Multimodal information provides a more comprehensive view of sleep dynamics, facilitating better understanding of patient health status
Limitations of Existing Methods:
Most models rely on fixed modality or channel subsets
Simple soft-voting ensembles assume that averaging is a sufficient aggregation function
Soft voting implicitly treats all contributors as equally reliable
Soft voting operates at the epoch level, ignoring temporal dependencies
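To make the soft-voting limitations above concrete, here is a minimal sketch of epoch-level soft voting. The stage names follow the usual five-class convention; the probability values and the function name soft_vote are hypothetical and chosen only for illustration, not taken from the paper.

```python
def soft_vote(prob_streams):
    """Average class probabilities over predictors for a single epoch."""
    n = len(prob_streams)
    n_classes = len(prob_streams[0])
    return [sum(p[c] for p in prob_streams) / n for c in range(n_classes)]

STAGES = ["Wake", "N1", "N2", "N3", "REM"]

# Hypothetical outputs of three single-channel predictors for one epoch.
streams = [
    [0.10, 0.50, 0.30, 0.05, 0.05],  # predictor A favours N1
    [0.60, 0.20, 0.10, 0.05, 0.05],  # predictor B favours Wake
    [0.50, 0.30, 0.10, 0.05, 0.05],  # predictor C favours Wake
]

avg = soft_vote(streams)
predicted = STAGES[max(range(len(avg)), key=avg.__getitem__)]
```

Every predictor receives the same weight of 1/3, and the epoch is decided without looking at its neighbours. These are exactly the two assumptions that a learned, sequence-aware aggregator is meant to relax.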
Research Motivation: Develop an attention-based model capable of flexibly handling varying input dimensions, effectively aggregating multimodal prediction streams, and maintaining modularity.
Proposed NAP Model: A lightweight attention-based meta-model that learns to aggregate predictions from pretrained single-channel models by explicitly capturing temporal, spatial/channel, model-level, and cross-modal dependencies.
Extended Criss-Cross Attention Mechanism: Generalized the criss-cross attention mechanism from spatiotemporal dimensions to tri-axial attention as an effective fusion strategy.
Dimension-Adaptive Training: Extended dimension-adaptive training to dynamically sample varying sequence lengths, channel counts, model counts, and modality counts.
SOTA Zero-Shot Performance: Achieved state-of-the-art zero-shot generalization performance across multiple datasets, consistently outperforming individual predictors and simple ensemble methods.
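The idea of attending along one axis at a time can be sketched with NumPy as follows. This is a simplified illustration, not the paper's architecture. It assumes an input tensor of shape (T, C, P, D) for timesteps, channels, predictors, and feature size, uses the input itself as queries, keys, and values (no learned projections), and averages the three axis-wise outputs. The tensor layout and the averaging step are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Self-attention restricted to one axis of x, shape (T, C, P, D)."""
    xa = np.moveaxis(x, axis, -2)              # (..., L, D), L = length of chosen axis
    scores = xa @ np.swapaxes(xa, -1, -2)      # (..., L, L) pairwise similarities
    weights = softmax(scores / np.sqrt(x.shape[-1]), axis=-1)
    out = weights @ xa                         # (..., L, D) weighted mix along the axis
    return np.moveaxis(out, -2, axis)          # restore the original layout

def tri_axial_attention(x):
    """Average of attention along the time, channel, and predictor axes."""
    return sum(axis_attention(x, a) for a in (0, 1, 2)) / 3.0

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3, 2, 5))              # T=6, C=3, P=2, D=5
y = tri_axial_attention(x)                     # same shape as x
```

Joint attention over all T·C·P positions needs (T·C·P)² score entries, whereas the axis-wise form needs T·C·P·(T + C + P). Because nothing in the code depends on fixed axis lengths, the same function accepts inputs with different numbers of timesteps, channels, or predictors.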
Tri-Axial Attention Mechanism: Decomposes attention computation along the spatial, temporal, and predictor dimensions, which is more efficient and targeted than traditional joint attention.
Dynamic Dimension Adaptation: Randomly samples varying timesteps, modality sets, channel counts, and base predictors during training to enhance generalization.
Gradient Accumulation Strategy: Accumulates gradients across G distinct batches, avoiding padding and masking operations for improved computational efficiency.
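Dimension sampling and gradient accumulation fit together naturally: if each of the G batches has its own shape, nothing has to be padded or masked to a common size, and a single optimizer step is taken after all G gradients are summed. The sketch below shows that pattern on a toy one-parameter model. The model, the shape limits, the learning rate, and the synthetic target are all hypothetical stand-ins; only G and the overall pattern come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_shape(max_t=8, max_c=4, max_p=3):
    """Draw a random (timesteps, channels, predictors) configuration."""
    return (int(rng.integers(1, max_t + 1)),
            int(rng.integers(1, max_c + 1)),
            int(rng.integers(1, max_p + 1)))

def loss_and_grad(w, x, y):
    """Toy model: prediction is w * mean(x); squared error and its gradient in w."""
    m = x.mean()
    err = w * m - y
    return err ** 2, 2.0 * err * m

def accumulated_step(w, G=4, lr=0.1):
    """One parameter update from G batches that each keep their own shape."""
    grad_sum, shapes = 0.0, []
    for _ in range(G):
        shape = sample_shape()
        x = rng.random(shape)          # stand-in for prediction streams of this shape
        y = 2.0 * x.mean()             # synthetic target, so the optimum is w = 2
        _, g = loss_and_grad(w, x, y)
        grad_sum += g                  # accumulate; no padding to a common shape
        shapes.append(shape)
    return w - lr * grad_sum / G, shapes

w, shapes = accumulated_step(0.0)
for _ in range(500):
    w, _ = accumulated_step(w)
```

This works only because the parameter does not depend on T, C, or P. The same property is what lets an attention-based aggregator be trained on, and applied to, inputs of varying dimensions.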
Consistent Improvements: NAP improved zero-shot MF1 (macro F1) over the SOMNUS soft-voting baseline on most out-of-distribution datasets (values shown as SOMNUS → NAP)
DCSM: 0.803 → 0.815
DOD-H: 0.828 → 0.834
PHYS: 0.693 → 0.732
SEDF-SC: 0.734 → 0.752
SEDF-ST: 0.761 → 0.796
N1 Stage Improvements: MF1 improvements primarily stem from enhanced identification of the challenging N1 stage, with improvements in Wake stage recognition in some cases
Maximum Improvement Scenarios: NAP achieved the largest improvements on datasets where SOMNUS performed relatively poorly (e.g., PHYS and SEDF)
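The per-dataset gains implied by the figures above can be checked with a few lines of arithmetic. The numbers are copied from the list; reading each pair as (baseline, NAP) follows the comparison with SOMNUS described in this summary.

```python
# Zero-shot MF1 per dataset as (baseline, NAP), copied from the list above.
mf1 = {
    "DCSM": (0.803, 0.815),
    "DOD-H": (0.828, 0.834),
    "PHYS": (0.693, 0.732),
    "SEDF-SC": (0.734, 0.752),
    "SEDF-ST": (0.761, 0.796),
}

# Absolute gain per dataset, rounded to the three decimals reported.
gains = {name: round(after - before, 3) for name, (before, after) in mf1.items()}
largest = max(gains, key=gains.get)
mean_gain = round(sum(gains.values()) / len(gains), 3)

for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{gain:.3f}")
```

The gains are +0.039 on PHYS, +0.035 on SEDF-ST, +0.018 on SEDF-SC, +0.012 on DCSM, and +0.006 on DOD-H, for a mean of +0.022. The largest gains fall on the datasets with the lowest baseline scores, which matches the observation about PHYS and SEDF.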
While the paper lacks detailed ablation experiments, comparison with simple soft voting (SOMNUS) validates the advantages of attention mechanisms over simple averaging.
The paper cites important works in sleep medicine, deep learning, and multimodal fusion, including:
Berry et al. (2017): AASM sleep staging standards
Perslev et al. (2021): U-Sleep model
Phan et al. (2022): SleepTransformer
Huang et al. (2019): Original criss-cross attention work
Zhang et al. (2018, 2024): NSRR data resources
Overall Assessment: This is a high-quality machine learning paper addressing a clinically important problem with innovative solutions. The tri-axial attention mechanism design is elegant, and experimental results are convincing. While improvements in theoretical analysis and ablation studies are possible, its practical value and technical innovation make it an important contribution to multimodal physiological signal processing.