NAP: Attention-Based Late Fusion for Automatic Sleep Staging

Rossi, van der Meer, Schmidt et al.
Polysomnography signals are highly heterogeneous, varying in modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols across datasets and clinical sites. Most existing models that process polysomnography data rely on a fixed subset of modalities or channels and therefore neglect to fully exploit its inherently multimodal nature. We address this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model which learns to combine multiple prediction streams using a tri-axial attention mechanism that captures temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to different input dimensions. By aggregating outputs from frozen, pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensembles, achieving state-of-the-art zero-shot generalization across multiple datasets. While demonstrated in the context of automated sleep staging from polysomnography, the proposed approach could be extended to other multimodal physiological applications.

Basic Information

  • Paper ID: 2511.03488
  • Title: NAP: Attention-Based Late Fusion for Automatic Sleep Staging
  • Authors: Alvise Dei Rossi, Julia van der Meer, Markus H. Schmidt, Claudio L.A. Bassetti, Luigi Fiorillo, Francesca Faraci
  • Classification: cs.LG (Machine Learning)
  • Publication Date: November 5, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2511.03488v1

Abstract

Polysomnography (PSG) signals are highly heterogeneous in modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols across datasets and clinical centers. Existing models for PSG data largely depend on fixed modality or channel subsets and therefore fail to fully exploit the data's inherently multimodal nature. This paper addresses this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model that uses a tri-axial attention mechanism to learn how to combine multiple prediction streams, capturing temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to varying input dimensions. By aggregating outputs from frozen pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensemble methods, achieving state-of-the-art zero-shot generalization performance across multiple datasets.

Research Background and Motivation

Problem Definition

  1. Core Problem: PSG data are heterogeneous in modality composition, channel configuration, and acquisition protocol, and existing models do not fully exploit their multimodal nature.
  2. Significance:
    • Sleep staging is the clinical gold standard for diagnosing sleep-wake disorders
    • Manual sleep staging is time-consuming and subject to observer bias
    • Multimodal information provides a more comprehensive view of sleep dynamics, facilitating better understanding of patient health status
  3. Limitations of Existing Methods:
    • Most models rely on fixed modality or channel subsets
    • Simple soft-voting ensembles assume that averaging is a sufficient aggregation function
    • Averaging implicitly treats all contributors as equally reliable
    • Soft voting operates at the epoch level, ignoring temporal dependencies
  4. Research Motivation: Develop an attention-based model capable of flexibly handling varying input dimensions, effectively aggregating multimodal prediction streams, and maintaining modularity.

Core Contributions

  1. Proposed NAP Model: A lightweight attention-based meta-model that learns to aggregate predictions from pretrained single-channel models by explicitly capturing temporal, spatial/channel, model-level, and cross-modal dependencies.
  2. Extended Criss-Cross Attention Mechanism: Generalized the criss-cross attention mechanism from two spatiotemporal axes to tri-axial attention as an effective fusion strategy.
  3. Dimension-Adaptive Training: Extended dimension-adaptive training to dynamically sample varying sequence lengths, channel counts, model counts, and modality counts.
  4. SOTA Zero-Shot Performance: Achieved state-of-the-art zero-shot generalization performance across multiple datasets, significantly outperforming individual predictors and simple ensemble methods.

Methodology Details

Task Definition

  • Input: PSG recording X containing T consecutive 30-second sleep epochs, each associated with M physiological modalities
  • Output: Sleep stage predictions for each epoch, categorized into five classes: {Wake, N1, N2, N3, REM}
  • Constraints: Model must adapt to varying modality combinations, channel counts, and sequence lengths

Model Architecture

The NAP architecture comprises four main modules:

1. Base Predictions Generator

  • For modality $m_k$, channel $c_j$, and base predictor $b_\ell$, generates predictions $\hat{h}_{(m_k,c_j,b_\ell),t} \in \mathbb{R}^5$
  • Predictions are linearly projected into high-dimensional feature space $\mathbb{R}^{d_{model}}$
  • Generates hypnodensities (probabilistic representations of sleep stages)

2. Tri-Axial Attention Encoder

Extends criss-cross attention into three pathways:

Spatial Attention: Attention computation along channel axis $C_{m_k}$:

$$Z_s^{(i)} = \text{Softmax}\left(\frac{\text{LN}(Q_s^{(i)}) \, \text{LN}(K_s^{(i)})^T}{\sqrt{d_k}}\right) V_s^{(i)}$$

Temporal Attention: Attention computation along sequence length axis T

Hybrid Attention: Attention computation along base predictor axis $B_{m_k}$

Each pathway is allocated h/3 attention heads, with all pathway outputs concatenated.
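
The three pathways can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: the predictions of one modality are laid out as a tensor of shape (T, C, B, d_model), the query/key/value projections are random matrices chosen only for the demo, the LayerNorm on queries and keys has no learnable parameters, and heads are assigned to the temporal, spatial, and predictor axes in that order. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm over the last (feature) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(q, k, v, axis):
    """Scaled dot-product attention along one axis of (T, C, B, d_k) tensors."""
    # Move the attended axis next to the feature axis: (..., L, d_k).
    q, k, v = (np.moveaxis(a, axis, -2) for a in (q, k, v))
    d_k = q.shape[-1]
    scores = layer_norm(q) @ np.swapaxes(layer_norm(k), -1, -2) / np.sqrt(d_k)
    out = softmax(scores, axis=-1) @ v
    return np.moveaxis(out, -2, axis)  # restore the original layout

def tri_axial_attention(x, w_q, w_k, w_v, n_heads=6):
    """x: (T, C, B, d_model); w_*: (d_model, d_model). Output has x's shape."""
    d_model = x.shape[-1]
    assert n_heads % 3 == 0 and d_model % n_heads == 0
    d_k = d_model // n_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_k, (h + 1) * d_k)
        axis = h // (n_heads // 3)  # 0: temporal, 1: spatial, 2: predictor
        heads.append(axial_attention(q[..., sl], k[..., sl], v[..., sl], axis))
    return np.concatenate(heads, axis=-1)  # concatenate all pathway outputs

rng = np.random.default_rng(0)
T, C, B, d_model = 5, 3, 2, 24
x = rng.normal(size=(T, C, B, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
z = tri_axial_attention(x, w_q, w_k, w_v)
print(z.shape)  # (5, 3, 2, 24)
```

Because the projection matrices act only on the feature axis, the same weights apply to any T, C, or B, which is what lets a model of this kind accept recordings with different channel and predictor counts.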

3. Modality Fusion Layer

Employs attention-based fusion mechanism:

$$\alpha_{t,n} = \frac{\exp(\tanh(W_A x_{t,n} + b_A)^T u_A)}{\sum_{j=1}^N \exp(\tanh(W_A x_{t,j} + b_A)^T u_A)}$$

Computes weighted combination:

$$\hat{z}_t = \sum_{n=1}^N \alpha_{t,n} \tilde{z}_{t,n}$$
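
The two fusion formulas can be traced with a small pure-Python sketch for a single timestep. The parameter values and the helper names `fusion_weights` and `fuse` are hypothetical, chosen only to make the arithmetic easy to follow; in the model, $W_A$, $b_A$, and $u_A$ would be learned.

```python
import math

def fusion_weights(xs, W_A, b_A, u_A):
    """Attention weights alpha_{t,n} over N streams at one timestep.

    xs: list of N feature vectors x_{t,n}; W_A: list of rows (each of length d);
    b_A, u_A: vectors with one entry per row of W_A.
    """
    scores = []
    for x in xs:
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(W_A, b_A)]
        scores.append(sum(h * u for h, u in zip(hidden, u_A)))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(zs, alphas):
    """Weighted sum of the N vectors z~_{t,n} using weights alpha_{t,n}."""
    d = len(zs[0])
    return [sum(a * z[j] for a, z in zip(alphas, zs)) for j in range(d)]

# Illustrative parameters (assumptions, not values from the paper).
W_A = [[1.0, 0.0], [0.0, 1.0]]
b_A = [0.0, 0.0]
u_A = [1.0, 0.0]
xs = [[1.0, 0.0], [-1.0, 0.0]]       # two streams at one timestep
alphas = fusion_weights(xs, W_A, b_A, u_A)
z_hat = fuse(xs, alphas)             # here z~ is taken equal to x for simplicity
```

The first stream scores tanh(1) and the second tanh(-1), so the first receives the larger weight; identical inputs would instead receive equal weights, which reduces the fusion to plain averaging.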

4. Classifier Head

Single hidden layer feedforward network trained end-to-end with cross-entropy loss.

Technical Innovations

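  1. Tri-Axial Attention Mechanism: Decomposes attention computation into spatial, temporal, and predictor dimensions. Each token attends along one axis at a time, so for T timesteps, C channels, and B predictors the attention cost scales with T·C·B·(T + C + B) rather than (T·C·B)² for joint attention over all tokens.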
  2. Dynamic Dimension Adaptation: Randomly samples varying timesteps, modality sets, channel counts, and base predictors during training to enhance generalization.
  3. Gradient Accumulation Strategy: Accumulates gradients across G distinct batches, avoiding padding and masking operations for improved computational efficiency.
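
A toy sketch of how shape sampling and gradient accumulation can work together, under loud assumptions: the "model" is a single scalar weight, the data are synthetic, plain gradient descent stands in for AdamW, and all names are hypothetical. The only point it illustrates is that every sample inside one batch shares a sampled (T, C, B) shape, so no padding or masking is needed, while the G accumulated passes expose each optimizer step to several shapes.

```python
import random

def sample_shape(rng, max_T, max_C, max_B):
    """Draw one (T, C, B) shape shared by every sample of a batch."""
    return rng.randint(1, max_T), rng.randint(1, max_C), rng.randint(1, max_B)

def make_batch(rng, shape, batch_size, true_w):
    """Toy batch: each sample is T*C*B values; the target is true_w * mean."""
    T, C, B = shape
    n = T * C * B
    xs = [[rng.uniform(0.5, 1.5) for _ in range(n)] for _ in range(batch_size)]
    ys = [true_w * sum(x) / n for x in xs]
    return xs, ys

def batch_grad(w, xs, ys):
    """Gradient of the mean squared error for the toy model y_hat = w * mean(x)."""
    g = 0.0
    for x, y in zip(xs, ys):
        m = sum(x) / len(x)
        g += 2.0 * (w * m - y) * m
    return g / len(xs)

def train(steps=200, G=4, lr=0.1, seed=0):
    rng = random.Random(seed)
    w, true_w = 0.0, 2.0
    for _ in range(steps):
        grad = 0.0
        for _ in range(G):  # G forward-backward passes, each with its own shape
            shape = sample_shape(rng, max_T=8, max_C=4, max_B=3)
            xs, ys = make_batch(rng, shape, batch_size=8, true_w=true_w)
            grad += batch_grad(w, xs, ys) / G  # accumulate, do not step yet
        w -= lr * grad  # one parameter update per G passes
    return w

w_final = train()
```

In a real training loop the inner body would be a forward and backward pass through the network, and the update would be the optimizer step.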

Experimental Setup

Datasets

Training Datasets:

  • BSWR: 8,410 PSG records (≈67,000 hours) covering the complete sleep-wake disorder spectrum
  • Held-out NSRR datasets: Including ABC, APOE, APPLES, CCSHS, CFS, CHAT, HOMEPAP, MESA, MNC, MROS, MSP, NCHSDB, SHHS, SOF, WSC

Evaluation Datasets (Zero-shot):

  • DOD-H & DOD-O: Healthy adults and OSA patients
  • DCSM: Danish Center for Sleep Medicine data
  • SEDF-SC & SEDF-ST: Sleep-EDF extended datasets
  • PHYS: PhysioNet Challenge 2018 data

Evaluation Metrics

  • Macro-averaged F1 score (Macro F1, MF1)
  • F1 scores for each sleep stage (F1W, F1N1, F1N2, F1N3, F1REM)
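
For reference, the two metrics can be computed as in the sketch below. It assumes the standard definitions (per-class F1 = 2·TP / (2·TP + FP + FN), macro F1 = unweighted mean over the five stages) and scores a class with no true or predicted epochs as 0, which is a convention chosen here rather than something stated in the paper. The function names are hypothetical.

```python
STAGES = ["Wake", "N1", "N2", "N3", "REM"]

def per_class_f1(y_true, y_pred, labels=STAGES):
    """F1 score for each sleep stage from per-epoch labels."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels=STAGES):
    """Unweighted mean of the per-class F1 scores."""
    f1 = per_class_f1(y_true, y_pred, labels)
    return sum(f1.values()) / len(labels)

y_true = ["Wake", "N1", "N2", "N2", "N3", "REM"]
y_pred = ["Wake", "N2", "N2", "N2", "N3", "REM"]
f1 = per_class_f1(y_true, y_pred)  # N1 scores 0.0, N2 scores 0.8, the rest 1.0
mf1 = macro_f1(y_true, y_pred)     # approximately 0.76
```

The example shows why macro averaging is sensitive to N1: a single missed N1 epoch costs a full class score even though five of six epochs are correct.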

Comparison Methods

  • Best single-modality models (e.g., DeepResNetEEG, U-SleepEEG)
  • SOMNUS ensemble method (soft voting across all channels, modalities, and models)
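
Soft voting, the baseline aggregation, amounts to averaging the hypnodensities of all streams for each epoch independently. The sketch below shows that operation with made-up probabilities; it is a generic illustration of soft voting, not SOMNUS code, and the function names are hypothetical.

```python
def soft_vote(hypnodensities):
    """Average N probability vectors for one epoch; each stream weighs 1/N."""
    n = len(hypnodensities)
    k = len(hypnodensities[0])
    return [sum(h[j] for h in hypnodensities) / n for j in range(k)]

def predict_stage(probs, labels=("Wake", "N1", "N2", "N3", "REM")):
    """Return the label with the highest averaged probability."""
    return labels[max(range(len(probs)), key=probs.__getitem__)]

# Three streams (e.g., different channels or base models) for one epoch.
streams = [
    [0.10, 0.50, 0.30, 0.05, 0.05],  # this stream alone would pick N1
    [0.10, 0.20, 0.60, 0.05, 0.05],
    [0.20, 0.30, 0.40, 0.05, 0.05],
]
avg = soft_vote(streams)
stage = predict_stage(avg)  # "N2"
```

Every stream contributes equally and each epoch is decided in isolation, which are the two assumptions a learned aggregator relaxes.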

Implementation Details

  • Embedding dimension: $d_{model}$ = 24
  • Number of attention heads: h = 6 (2 heads per pathway)
  • Number of encoder layers: L = 4
  • Batch size: B = 8 records, K = 4 segments per record
  • Gradient accumulation: G = 4 forward-backward passes
  • Optimizer: AdamW, learning rate η = 10⁻³
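
The listed hyperparameters can be collected in a small configuration object to make the derived quantities explicit. The class and field names are hypothetical. The per-head dimension assumes the usual even split of the embedding across heads, and the segment count per optimizer step assumes every accumulated pass uses a full batch; neither is stated in the summary above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NAPConfig:
    d_model: int = 24
    n_heads: int = 6
    n_layers: int = 4
    records_per_batch: int = 8      # B
    segments_per_record: int = 4    # K
    grad_accum_steps: int = 4       # G
    learning_rate: float = 1e-3     # AdamW

    @property
    def heads_per_pathway(self) -> int:
        # Three pathways share the heads equally (h / 3).
        return self.n_heads // 3

    @property
    def head_dim(self) -> int:
        # Assumption: d_model is split evenly across heads.
        return self.d_model // self.n_heads

    @property
    def segments_per_optimizer_step(self) -> int:
        # Assumption: each of the G passes processes B x K segments.
        return (self.records_per_batch * self.segments_per_record
                * self.grad_accum_steps)

cfg = NAPConfig()
print(cfg.heads_per_pathway, cfg.head_dim, cfg.segments_per_optimizer_step)
# 2 4 128
```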

Experimental Results

Main Results

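| Dataset | Model         | MF1         | F1W        | F1N1       | F1N2       | F1N3       | F1REM      |
|---------|---------------|-------------|------------|------------|------------|------------|------------|
| BSWR    | DeepResNetEEG | .695(.120)  | .828(.143) | .397(.172) | .793(.148) | .629(.270) | .848(.180) |
|         | SOMNUS        | .708(.120)  | .836(.141) | .404(.178) | .804(.146) | .696(.280) | .864(.173) |
|         | NAP           | .749(.117)‡ | .856(.132) | .533(.164) | .809(.146) | .705(.260) | .864(.172) |
| DCSM    | SOMNUS        | .803(.084)  | .983(.023) | .505(.153) | .858(.097) | .783(.202) | .891(.146) |
|         | NAP           | .815(.081)‡ | .986(.020) | .550(.143) | .848(.103) | .802(.190) | .893(.145) |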

‡ Indicates a statistically significant improvement in MF1 over the other methods (significance level α = 0.05)

Key Findings

  1. Consistent Improvements: NAP improved zero-shot MF1 over SOMNUS on most out-of-distribution datasets (SOMNUS → NAP)
    • DCSM: 0.803 → 0.815
    • DOD-H: 0.828 → 0.834
    • PHYS: 0.693 → 0.732
    • SEDF-SC: 0.734 → 0.752
    • SEDF-ST: 0.761 → 0.796
  2. N1 Stage Improvements: MF1 improvements primarily stem from enhanced identification of the challenging N1 stage, with improvements in Wake stage recognition in some cases
  3. Maximum Improvement Scenarios: NAP achieved the largest improvements on datasets where SOMNUS performed relatively poorly (e.g., PHYS and SEDF)

Ablation Studies

While the paper lacks detailed ablation experiments, comparison with simple soft voting (SOMNUS) validates the advantages of attention mechanisms over simple averaging.

Main Research Directions

  1. Automatic Sleep Staging: Multiple modeling paradigms using convolutional, recurrent, and attention networks
  2. Multimodal Fusion: Early fusion (representation fusion) vs. late fusion (prediction aggregation)
  3. Ensemble Methods: Soft voting strategies across channels, modalities, or models

Advantages of This Work

  1. Flexibility: Handles arbitrary numbers of modalities, channels, and predictors
  2. Temporal Modeling: Explicitly models temporal dependencies compared to epoch-level soft voting
  3. Attention Mechanism: Learns adaptive weights rather than assuming equal weighting

Conclusions and Discussion

Main Conclusions

  1. NAP effectively aggregates multimodal prediction streams through attention mechanisms, achieving SOTA zero-shot performance across multiple datasets
  2. Principled late fusion can bridge performance gaps of existing methods on certain datasets
  3. Tri-axial attention is an effective strategy for handling multidimensional dependencies

Limitations

  1. Modality Constraints: Current experiments only consider EEG and EOG modalities due to pretrained model availability
  2. Base Model Dependency: Performance is limited by the quality of pretrained single-channel models
  3. Computational Overhead: While more efficient than joint attention, still requires additional computational resources

Future Directions

  1. Extended Modalities: Integrate pretrained models for additional physiological signals (EMG, ECG, etc.)
  2. Early Fusion: Adapt as Neural Aggregator of Representations for representation-level fusion
  3. Cross-Domain Applications: Extend to other physiological signal applications requiring multimodal prediction aggregation

In-Depth Evaluation

Strengths

  1. Strong Innovation: Novel tri-axial attention mechanism design effectively addresses multidimensional dependency modeling
  2. High Practical Value: Addresses the important clinical problem of PSG data heterogeneity
  3. Comprehensive Experiments: Extensive zero-shot evaluation across multiple large-scale datasets
  4. Generalizable Framework: Extensible to other multimodal physiological signal applications

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical analysis and complexity analysis of the tri-axial attention mechanism
  2. Limited Ablation Studies: No detailed analysis of individual component contributions (spatial, temporal, hybrid attention)
  3. Incomplete Modality Coverage: Only validates EEG and EOG; lacks verification on other important modalities (EMG, ECG)

Impact

  1. Academic Contribution: Provides new fusion strategies for multimodal physiological signal processing
  2. Clinical Value: Potentially improves practicality and accuracy of automatic sleep staging systems
  3. Reproducibility: Provides detailed implementation details facilitating reproduction and extension

Applicable Scenarios

  1. Clinical Sleep Medicine: Automatic sleep staging across different hospital configurations and equipment
  2. Multimodal Physiological Signals: Other medical applications requiring fusion of multiple physiological signal predictions
  3. Heterogeneous Data Fusion: Any task requiring multimodal prediction aggregation with variable dimensions

References

The paper cites important works in sleep medicine, deep learning, and multimodal fusion, including:

  • Berry et al. (2017): AASM sleep staging standards
  • Perslev et al. (2021): U-Sleep model
  • Phan et al. (2022): SleepTransformer
  • Huang et al. (2019): Original criss-cross attention work
  • Zhang et al. (2018, 2024): NSRR data resources

Overall Assessment: This is a high-quality machine learning paper addressing a clinically important problem with innovative solutions. The tri-axial attention mechanism design is elegant, and experimental results are convincing. While improvements in theoretical analysis and ablation studies are possible, its practical value and technical innovation make it an important contribution to multimodal physiological signal processing.