NAP: Attention-Based Late Fusion for Automatic Sleep Staging

Rossi, van der Meer, Schmidt et al.

Polysomnography signals are highly heterogeneous, varying in modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols across datasets and clinical sites. Most existing models that process polysomnography data rely on a fixed subset of modalities or channels and therefore neglect to fully exploit its inherently multimodal nature. We address this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model which learns to combine multiple prediction streams using a tri-axial attention mechanism that captures temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to different input dimensions. By aggregating outputs from frozen, pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensembles, achieving state-of-the-art zero-shot generalization across multiple datasets. While demonstrated in the context of automated sleep staging from polysomnography, the proposed approach could be extended to other multimodal physiological applications.

Basic Information

Paper ID: 2511.03488
Title: NAP: Attention-Based Late Fusion for Automatic Sleep Staging
Authors: Alvise Dei Rossi, Julia van der Meer, Markus H. Schmidt, Claudio L.A. Bassetti, Luigi Fiorillo, Francesca Faraci
Classification: cs.LG (Machine Learning)
Publication Date: November 5, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.03488v1

Abstract

Polysomnography (PSG) signals are highly heterogeneous: modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols all vary across datasets and clinical centers. Most existing models for PSG data depend on a fixed subset of modalities or channels and therefore fail to fully exploit its inherently multimodal nature. This paper addresses this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model that learns to combine multiple prediction streams using a tri-axial attention mechanism capturing temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to varying input dimensions. By aggregating outputs from frozen, pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensemble methods, achieving state-of-the-art zero-shot generalization across multiple datasets.

Research Background and Motivation

Problem Definition

Core Problem: Polysomnography (PSG) data are heterogeneous, varying in modality composition, channel configuration, and acquisition protocol, and existing models do not fully exploit their multimodal nature.
Significance:
- Sleep staging is the clinical gold standard for diagnosing sleep-wake disorders
- Manual sleep staging is time-consuming and subject to observer bias
- Multimodal information provides a more comprehensive view of sleep dynamics, facilitating better understanding of patient health status
Limitations of Existing Methods:
- Most models rely on fixed modality or channel subsets
- Simple soft-voting ensembles assume that averaging is a sufficient aggregation function
- Soft voting implicitly treats all contributors as equally reliable
- Soft voting operates at the epoch level, ignoring temporal dependencies
Research Motivation: Develop an attention-based model capable of flexibly handling varying input dimensions, effectively aggregating multimodal prediction streams, and maintaining modularity.

Core Contributions

Proposed NAP Model: A lightweight attention-based meta-model that learns to aggregate predictions from pretrained single-channel models by explicitly capturing temporal, spatial/channel, model-level, and cross-modal dependencies.
Extended Criss-Cross Attention Mechanism: Generalizes criss-cross attention from the spatiotemporal (two-axis) setting to tri-axial attention as an effective fusion strategy.
Dimension-Adaptive Training: Extended dimension-adaptive training to dynamically sample varying sequence lengths, channel counts, model counts, and modality counts.
SOTA Zero-Shot Performance: Achieved state-of-the-art zero-shot generalization performance across multiple datasets, significantly outperforming individual predictors and simple ensemble methods.

Methodology Details

Task Definition

Input: PSG recording X containing T consecutive 30-second sleep epochs, each associated with M physiological modalities
Output: Sleep stage predictions for each epoch, categorized into five classes: {Wake, N1, N2, N3, REM}
Constraints: Model must adapt to varying modality combinations, channel counts, and sequence lengths

Model Architecture

The NAP architecture comprises four main modules:

1. Base Predictions Generator

For modality $m_k$, channel $c_j$, and base predictor $b_\ell$, a frozen pretrained single-channel model generates a prediction $\hat{h}_{(m_k,c_j,b_\ell),t} \in \mathbb{R}^5$ for each epoch $t$
The sequence of these predictions forms a hypnodensity (a probabilistic representation of sleep stages over time)
Each prediction is linearly projected into a higher-dimensional feature space $\mathbb{R}^{d_{model}}$
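
The projection step can be illustrated with a minimal NumPy sketch. The array layout `(T, C, B, 5)`, the random stand-in predictions, and the use of a single projection shared by all streams are assumptions made for illustration; only the 5-class output and $d_{model}$ = 24 come from the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# d_model = 24 as reported; T, C, B are illustrative sizes (epochs, channels, base predictors).
T, C, B, n_classes, d_model = 10, 3, 2, 5, 24

# Hypothetical base-model outputs: one 5-class probability vector per (epoch, channel, predictor).
logits = rng.normal(size=(T, C, B, n_classes))
hypnodensity = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Linear projection R^5 -> R^{d_model}; weights are random here and would be learned in practice.
W = rng.normal(scale=0.1, size=(n_classes, d_model))
b = np.zeros(d_model)
tokens = hypnodensity @ W + b  # shape (T, C, B, d_model)
```

Because the projection acts only on the last axis, the same weights apply regardless of how many epochs, channels, or predictors are present.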

2. Tri-Axial Attention Encoder

Extends criss-cross attention into three pathways:

Spatial Attention: Attention computation along channel axis $C_{m_k}$:

$Z_s^{(i)} = \text{Softmax}\left(\frac{\text{LN}(Q_s^{(i)}) \text{LN}(K_s^{(i)})^T}{\sqrt{d_k}}\right) V_s^{(i)}$

Temporal Attention: Attention computation along sequence length axis T

Hybrid (Predictor-Level) Attention: Attention computation along base predictor axis $B_{m_k}$, i.e., across the base models that produced the predictions

Each pathway is allocated h/3 attention heads, with all pathway outputs concatenated.
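
The three pathways can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it uses one head per pathway (the summary reports two), random weights, no residual connections or feedforward sublayers, and assumes that the temporal and predictor pathways use the same layer-normalized form as the spatial formula above. All function and variable names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis, Wq, Wk, Wv):
    """Self-attention along one axis of x, which has shape (T, C, B, d)."""
    x = np.moveaxis(x, axis, -2)       # attended axis becomes the sequence axis
    q, k, v = x @ Wq, x @ Wk, x @ Wv   # each (..., n, d_k)
    scores = layer_norm(q) @ np.swapaxes(layer_norm(k), -1, -2) / np.sqrt(q.shape[-1])
    z = softmax(scores, axis=-1) @ v
    return np.moveaxis(z, -2, axis)    # restore the (T, C, B, d_k) layout

def tri_axial_attention(x, params):
    """Concatenate temporal (axis 0), spatial (axis 1), and predictor (axis 2) pathways."""
    outs = [axis_attention(x, axis, *params[axis]) for axis in (0, 1, 2)]
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
T, C, B, d = 6, 3, 2, 24
x = rng.normal(size=(T, C, B, d))
# One (Wq, Wk, Wv) triple per axis, each mapping d -> d/3 so the concatenation has d features.
params = {a: [rng.normal(scale=0.1, size=(d, d // 3)) for _ in range(3)] for a in (0, 1, 2)}
out = tri_axial_attention(x, params)   # shape (T, C, B, d)
```

Since every pathway attends along a single axis and the weights act only on the feature dimension, the same parameters accept inputs with a different number of epochs, channels, or predictors.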

3. Modality Fusion Layer

Employs attention-based fusion mechanism: $\alpha_{t,n} = \frac{\exp(\tanh(W_A x_{t,n} + b_A)^T u_A)}{\sum_{j=1}^N \exp(\tanh(W_A x_{t,j} + b_A)^T u_A)}$

Computes weighted combination: $\hat{z}_t = \sum_{n=1}^N \alpha_{t,n} \tilde{z}_{t,n}$
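
A minimal NumPy sketch of this fusion step follows. The formulas above score $x_{t,n}$ and combine $\tilde{z}_{t,n}$, so the function keeps them as separate arguments; passing the same array for both in the usage lines is an assumption, as are the attention width `d_a` and the choice of N = 2 streams.

```python
import numpy as np

def attention_fusion(x, z, W_A, b_A, u_A):
    """Fuse N streams per epoch.

    x: (T, N, d) features used to compute the scores
    z: (T, N, d) features that are combined
    Returns fused (T, d) and weights alpha (T, N).
    """
    # Row-vector convention: x @ W_A corresponds to W_A x in the formula.
    scores = np.tanh(x @ W_A + b_A) @ u_A                 # (T, N)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    fused = (alpha[..., None] * z).sum(axis=1)            # weighted sum over streams
    return fused, alpha

rng = np.random.default_rng(0)
T, N, d, d_a = 4, 2, 24, 16   # illustrative sizes; d_a is a hypothetical attention width
z = rng.normal(size=(T, N, d))
W_A = rng.normal(scale=0.1, size=(d, d_a))
b_A = np.zeros(d_a)
u_A = rng.normal(size=d_a)
fused, alpha = attention_fusion(z, z, W_A, b_A, u_A)
```

The weights are computed per epoch, so a stream can dominate in some epochs and be down-weighted in others.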

4. Classifier Head

Single hidden layer feedforward network trained end-to-end with cross-entropy loss.

Technical Innovations

Tri-Axial Attention Mechanism: Decomposes attention computation into spatial, temporal, and predictor dimensions, more efficient and targeted than traditional joint attention.
Dynamic Dimension Adaptation: Randomly samples varying timesteps, modality sets, channel counts, and base predictors during training to enhance generalization.
Gradient Accumulation Strategy: Accumulates gradients across G distinct batches before each optimizer step, so that each batch can keep a single sampled input shape. This avoids padding and masking operations and improves computational efficiency.
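
The interplay between dimension sampling and gradient accumulation can be sketched with a toy model. Everything below is a stand-in chosen for brevity: the sampling ranges, the mean-pooling softmax classifier, the random data, and the plain SGD update (the summary reports AdamW) are assumptions, not the paper's setup. Only G = 4 and the learning rate are taken from the reported hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, G, lr = 5, 4, 1e-3

def sample_shape(rng, max_T=20, max_C=4, max_B=3):
    """Draw one (T, C, B) configuration; the ranges are illustrative."""
    return tuple(int(rng.integers(1, m + 1)) for m in (max_T, max_C, max_B))

def toy_grad(W, x, y):
    """Cross-entropy gradient for a toy model: average all streams, then softmax(p @ W)."""
    p = x.mean(axis=(1, 2))                   # (T, 5), independent of C and B
    logits = p @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    probs[np.arange(len(y)), y] -= 1.0        # dL/dlogits
    return p.T @ probs / len(y)

W = np.zeros((n_classes, n_classes))
grad_sum = np.zeros_like(W)
shapes = []
for _ in range(G):                            # G forward-backward passes
    T, C, B = sample_shape(rng)
    x = rng.dirichlet(np.ones(n_classes), size=(T, C, B))  # (T, C, B, 5) random hypnodensities
    y = rng.integers(0, n_classes, size=T)    # random stand-in labels
    grad_sum += toy_grad(W, x, y)             # each pass keeps its own shape: no padding, no masks
    shapes.append((T, C, B))
W -= lr * grad_sum / G                        # one parameter update after G passes
```

Each pass processes tensors of a different shape, yet all of them contribute to the same update because only the gradients, which have the shape of the parameters, are summed.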

Experimental Setup

Datasets

Training Datasets:

BSWR: 8,410 PSG records (≈67,000 hours) covering the complete sleep-wake disorder spectrum
Held-out NSRR datasets: Including ABC, APOE, APPLES, CCSHS, CFS, CHAT, HOMEPAP, MESA, MNC, MROS, MSP, NCHSDB, SHHS, SOF, WSC

Evaluation Datasets (Zero-shot):

DOD-H & DOD-O: Healthy adults and OSA patients
DCSM: Danish Center for Sleep Medicine data
SEDF-SC & SEDF-ST: Sleep-EDF extended datasets
PHYS: PhysioNet Challenge 2018 data

Evaluation Metrics

Macro-averaged F1 score (Macro F1, MF1)
F1 scores for each sleep stage (F1W, F1N1, F1N2, F1N3, F1REM)
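
For reference, the two metrics can be computed as in the sketch below. The stage encoding (0 = Wake, 1 = N1, 2 = N2, 3 = N3, 4 = REM) and the convention of scoring a class with no true or predicted epochs as 0 are assumptions; the summary does not state whether MF1 is computed per recording or over pooled epochs.

```python
def per_class_f1(y_true, y_pred, n_classes=5):
    """F1 score for each class from two equal-length label sequences."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def macro_f1(y_true, y_pred, n_classes=5):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, n_classes)) / n_classes

y_true = [0, 0, 1, 2, 2, 3, 4, 4]
y_pred = [0, 1, 1, 2, 2, 3, 4, 2]
f1 = per_class_f1(y_true, y_pred)   # approximately [0.667, 0.667, 0.8, 1.0, 0.667]
mf1 = macro_f1(y_true, y_pred)      # approximately 0.76
```

Because every stage counts equally in the mean, a gain on a rare and difficult stage such as N1 moves MF1 as much as the same gain on N2.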

Comparison Methods

Best single-modality models (e.g., DeepResNetEEG, U-SleepEEG)
SOMNUS ensemble method (soft voting across all channels, modalities, and models)
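
Soft voting, the aggregation rule that the ensemble baseline relies on, amounts to averaging the class probabilities of all streams and taking the most probable stage. The sketch below is a generic illustration with made-up probabilities, not the SOMNUS code.

```python
import numpy as np

STAGES = ["Wake", "N1", "N2", "N3", "REM"]

def soft_vote(hypnodensities):
    """hypnodensities: (T, S, 5) array with S prediction streams per epoch."""
    mean_probs = hypnodensities.mean(axis=1)   # every stream is weighted equally
    return mean_probs.argmax(axis=1), mean_probs

# One epoch, three hypothetical streams: two lean slightly towards Wake, one strongly towards N1.
streams = np.array([[
    [0.60, 0.40, 0.00, 0.00, 0.00],
    [0.10, 0.80, 0.10, 0.00, 0.00],
    [0.55, 0.45, 0.00, 0.00, 0.00],
]])
labels, mean_probs = soft_vote(streams)
predicted_stage = STAGES[labels[0]]   # "N1": the averaged N1 probability (0.55) exceeds Wake (about 0.42)
```

The rule has no parameters, treats each epoch independently, and cannot learn that some streams are more reliable than others, which are the limitations NAP is designed to address.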

Implementation Details

Embedding dimension: $d_{model}$ = 24
Number of attention heads: h = 6 (2 heads per pathway)
Number of encoder layers: L = 4
Batch size: B = 8 records, K = 4 segments per record
Gradient accumulation: G = 4 forward-backward passes
Optimizer: AdamW, learning rate η = 10⁻³
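
The reported hyperparameters can be collected in a small configuration object that also checks their internal consistency. The class and field names are hypothetical, and the derived quantities assume that $d_{model}$ is split evenly across heads and that every accumulated pass uses a full batch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NAPConfig:
    d_model: int = 24              # embedding dimension
    n_heads: int = 6               # attention heads h
    n_pathways: int = 3            # spatial, temporal, predictor
    n_layers: int = 4              # encoder layers L
    records_per_batch: int = 8     # B
    segments_per_record: int = 4   # K
    grad_accum_steps: int = 4      # G
    learning_rate: float = 1e-3    # AdamW learning rate

    def __post_init__(self):
        if self.n_heads % self.n_pathways:
            raise ValueError("n_heads must be divisible by n_pathways")
        if self.d_model % self.n_heads:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def heads_per_pathway(self) -> int:
        return self.n_heads // self.n_pathways

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    @property
    def segments_per_update(self) -> int:
        return self.records_per_batch * self.segments_per_record * self.grad_accum_steps

cfg = NAPConfig()
# cfg.heads_per_pathway == 2, cfg.head_dim == 4, cfg.segments_per_update == 128
```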

Experimental Results

Main Results

Dataset	Model	MF1	F1W	F1N1	F1N2	F1N3	F1REM
BSWR	DeepResNetEEG	.695(.120)	.828(.143)	.397(.172)	.793(.148)	.629(.270)	.848(.180)
BSWR	SOMNUS	.708(.120)	.836(.141)	.404(.178)	.804(.146)	.696(.280)	.864(.173)
BSWR	NAP	.749(.117)‡	.856(.132)	.533(.164)	.809(.146)	.705(.260)	.864(.172)
DCSM	SOMNUS	.803(.084)	.983(.023)	.505(.153)	.858(.097)	.783(.202)	.891(.146)
DCSM	NAP	.815(.081)‡	.986(.020)	.550(.143)	.848(.103)	.802(.190)	.893(.145)

‡ Indicates a statistically significant improvement in MF1 over the other methods (p < 0.05)

Key Findings

Consistent Improvements: NAP improved zero-shot MF1 over the SOMNUS soft-voting ensemble on most out-of-distribution datasets (SOMNUS → NAP)
- DCSM: 0.803 → 0.815
- DOD-H: 0.828 → 0.834
- PHYS: 0.693 → 0.732
- SEDF-SC: 0.734 → 0.752
- SEDF-ST: 0.761 → 0.796
N1 Stage Improvements: MF1 improvements primarily stem from enhanced identification of the challenging N1 stage, with improvements in Wake stage recognition in some cases
Maximum Improvement Scenarios: NAP achieved the largest improvements on datasets where SOMNUS performed relatively poorly (e.g., PHYS and SEDF)
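
The absolute MF1 gains implied by the figures listed above can be checked with a few lines of arithmetic. The sketch assumes that the left-hand value in each pair is the SOMNUS baseline, as it is in the DCSM rows of the results table.

```python
# (baseline MF1, NAP MF1) pairs copied from the list above
mf1 = {
    "DCSM": (0.803, 0.815),
    "DOD-H": (0.828, 0.834),
    "PHYS": (0.693, 0.732),
    "SEDF-SC": (0.734, 0.752),
    "SEDF-ST": (0.761, 0.796),
}

gains = {name: round(nap - base, 3) for name, (base, nap) in mf1.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
# gains:  DCSM 0.012, DOD-H 0.006, PHYS 0.039, SEDF-SC 0.018, SEDF-ST 0.035
# ranked: PHYS, SEDF-ST, SEDF-SC, DCSM, DOD-H
```

The three datasets with the lowest baseline scores (PHYS, SEDF-SC, SEDF-ST) are also the three with the largest gains, which matches the observation that NAP helps most where soft voting struggles.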

Ablation Studies

The paper does not report detailed ablation experiments; the comparison with simple soft voting (SOMNUS) nonetheless supports the advantage of learned attention-based aggregation over plain averaging.

Main Research Directions

Automatic Sleep Staging: Multiple modeling paradigms using convolutional, recurrent, and attention networks
Multimodal Fusion: Early fusion (representation fusion) vs. late fusion (prediction aggregation)
Ensemble Methods: Soft voting strategies across channels, modalities, or models

Advantages of This Work

Flexibility: Handles arbitrary numbers of modalities, channels, and predictors
Temporal Modeling: Explicitly models temporal dependencies compared to epoch-level soft voting
Attention Mechanism: Learns adaptive weights rather than assuming equal weighting

Conclusions and Discussion

Main Conclusions

NAP effectively aggregates multimodal prediction streams through attention mechanisms, achieving SOTA zero-shot performance across multiple datasets
Principled late fusion can bridge performance gaps of existing methods on certain datasets
Tri-axial attention is an effective strategy for handling multidimensional dependencies

Limitations

Modality Constraints: Current experiments only consider EEG and EOG modalities due to pretrained model availability
Base Model Dependency: Performance is limited by the quality of pretrained single-channel models
Computational Overhead: While more efficient than joint attention, still requires additional computational resources

Future Directions

Extended Modalities: Integrate pretrained models for additional physiological signals (EMG, ECG, etc.)
Early Fusion: Adapt as Neural Aggregator of Representations for representation-level fusion
Cross-Domain Applications: Extend to other physiological signal applications requiring multimodal prediction aggregation

In-Depth Evaluation

Strengths

Strong Innovation: Novel tri-axial attention mechanism design effectively addresses multidimensional dependency modeling
High Practical Value: Addresses the important clinical problem of PSG data heterogeneity
Comprehensive Experiments: Extensive zero-shot evaluation across multiple large-scale datasets
Generalizable Framework: Extensible to other multimodal physiological signal applications

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical analysis and complexity analysis of the tri-axial attention mechanism
Limited Ablation Studies: No detailed analysis of individual component contributions (spatial, temporal, hybrid attention)
Incomplete Modality Coverage: Only validates EEG and EOG; lacks verification on other important modalities (EMG, ECG)

Impact

Academic Contribution: Provides new fusion strategies for multimodal physiological signal processing
Clinical Value: Potentially improves practicality and accuracy of automatic sleep staging systems
Reproducibility: Provides detailed implementation details facilitating reproduction and extension

Applicable Scenarios

Clinical Sleep Medicine: Automatic sleep staging across different hospital configurations and equipment
Multimodal Physiological Signals: Other medical applications requiring fusion of multiple physiological signal predictions
Heterogeneous Data Fusion: Any task requiring multimodal prediction aggregation with variable dimensions

References

The paper cites important works in sleep medicine, deep learning, and multimodal fusion, including:

Berry et al. (2017): AASM sleep staging standards
Perslev et al. (2021): U-Sleep model
Phan et al. (2022): SleepTransformer
Huang et al. (2019): Original criss-cross attention work
Zhang et al. (2018, 2024): NSRR data resources

Overall Assessment: This is a high-quality machine learning paper addressing a clinically important problem with innovative solutions. The tri-axial attention mechanism design is elegant, and experimental results are convincing. While improvements in theoretical analysis and ablation studies are possible, its practical value and technical innovation make it an important contribution to multimodal physiological signal processing.