MSM-Seg: A Modality-and-Slice Memory Framework with Category-Agnostic Prompting for Multi-Modal Brain Tumor Segmentation

Luo, Xu, Huang et al.

Multi-modal brain tumor segmentation is critical for clinical diagnosis, and it requires accurate identification of distinct internal anatomical subregions. While recent prompt-based segmentation paradigms enable interactive experiences for clinicians, existing methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world scenarios. To address these issues, we propose the MSM-Seg framework for multi-modal brain tumor segmentation. MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates multi-modal and inter-slice information with an efficient category-agnostic prompt for brain tumor understanding. To this end, we first devise a modality-and-slice memory attention (MSMA) to exploit the cross-modal and inter-slice relationships among the input scans. Then, we propose a multi-scale category-agnostic prompt encoder (MCP-Encoder) to provide tumor region guidance for decoding. Moreover, we devise a modality-adaptive fusion decoder (MF-Decoder) that leverages the complementary decoding information across different modalities to improve segmentation accuracy. Extensive experiments on different MRI datasets demonstrate that our MSM-Seg framework outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation. The code is available at https://github.com/xq141839/MSM-Seg.

Basic Information

Paper ID: 2510.10679
Title: MSM-Seg: A Modality-and-Slice Memory Framework with Category-Agnostic Prompting for Multi-Modal Brain Tumor Segmentation
Authors: Yuxiang Luo, Qing Xu, Hai Huang, Yuqi Ouyang, Zhen Chen, Wenting Duan
Classification: cs.CV (Computer Vision)
Published Journal: IEEE Transactions on Medical Imaging
Paper Link: https://arxiv.org/abs/2510.10679
Code Link: https://github.com/xq141839/MSM-Seg

Abstract

Multi-modal brain tumor segmentation is critical for clinical diagnosis, requiring accurate identification of different internal anatomical sub-regions. Although recent prompt-based segmentation paradigms provide interactive experiences for clinicians, existing methods overlook cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in practical scenarios. To address these issues, this paper proposes the MSM-Seg framework for multi-modal brain tumor segmentation. MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates cross-modal and inter-slice information with efficient category-agnostic prompting for brain tumor understanding.

Research Background and Motivation

Core Problems

Complexity of Multi-Modal Brain Tumor Segmentation: The task requires simultaneously identifying heterogeneous tumor components, including the contrast-enhanced core, necrotic regions, and peritumoral edema, each of which provides different clinical biomarkers for tumor grading and treatment planning.
Limitations of Existing Methods:
- Classical 3D multi-modal segmentation frameworks are computationally inefficient because of volumetric processing
- They neglect the natural sequential relationships between adjacent slices
- Methods such as SAM2 rely on category-specific annotations as prompts, which require labor-intensive manual annotation
- Existing approaches typically process different MRI modalities independently or through simple prior connections, failing to fully exploit the rich complementary information across modalities

Research Motivation

Different MRI modalities exhibit strong complementary relationships: FLAIR sequences excel at displaying peritumoral edema and high-signal lesions, while T1c sequences provide contrast-enhanced visualization of active tumor regions and blood-brain barrier disruption. This complementary relationship motivates the development of a unified framework capable of effectively capturing cross-modal relationships and spatial continuity.

Core Contributions

Proposes a Dual-Memory Segmentation Paradigm: Leverages cross-modal and inter-slice relationships in input scans to achieve comprehensive understanding of tumor sub-regions
Designs Modality-and-Slice Memory Attention Mechanism (MSMA): Efficiently utilizes cross-modal and inter-slice relationships to enhance multi-modal feature representation
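Develops Multi-Scale Category-Agnostic Prompt Encoder (MCP-Encoder) and Modality-Adaptive Fusion Decoder (MF-Decoder): The MCP-Encoder provides tumor region guidance for decoding, and the MF-Decoder leverages complementary decoding information across modalities to improve segmentation accuracy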
Achieves Significant Performance Improvements: Surpasses existing state-of-the-art segmentation methods on glioma and metastatic tumor datasets

Methodology Details

Task Definition

Given multi-modal MRI scans {X_{t,m}}, where t ∈ {1,...,T} denotes slice index and m ∈ {1,...,M} denotes modality index, the objective is to generate accurate brain tumor segmentation masks identifying three hierarchical regions: Enhanced Tumor (ET), Tumor Core (TC), and Whole Tumor (WT).

Model Architecture

1. Dual-Memory Segmentation Paradigm

The core idea is to establish progressive memory integration that gradually refines understanding of the entire tumor structure. Given input slice X_{t,m}, the model maintains latent state S_{t,m} ∈ R^{C×H×W} with update rule:

S_{t,m} = R(X_{t,m}, θ_{t,m}, S_{t,≺m}, S_{≺t})
Ŷ_{t,m} = P(S_{t,m})

Where:

R(·) is the state update function
P(·) is the segmentation prediction head
S_{t,≺m} represents cross-modal context from preceding modalities at current slice t
S_{≺t} represents inter-slice context from preceding slices
θ_{t,m} is the efficient category-agnostic prompt
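
The recurrence above can be illustrated with a minimal sketch. It assumes that slices are visited in order with modalities in an inner loop, that the two contexts are kept in fixed-size buffers, and that the state of the last modality stands in for a slice in the inter-slice memory. These choices, and the names run_dual_memory, update_fn, predict_fn, and prompt_fn, are hypothetical and not taken from the paper.

```python
from collections import deque

def run_dual_memory(scans, update_fn, predict_fn, prompt_fn, k=3, n=7):
    """Apply S_{t,m} = R(X_{t,m}, theta_{t,m}, S_{t,<m}, S_{<t}) over all slices.

    scans[t][m] is the input for slice t and modality m.
    k and n are the modality and slice memory buffer sizes.
    """
    slice_memory = deque(maxlen=n)           # S_{<t}: states of preceding slices
    predictions = []
    for t, slice_scans in enumerate(scans):
        modality_memory = deque(maxlen=k)    # S_{t,<m}: reset for every slice
        slice_preds = []
        state = None
        for m, x in enumerate(slice_scans):
            theta = prompt_fn(t, m)          # category-agnostic prompt
            state = update_fn(x, theta, list(modality_memory), list(slice_memory))
            slice_preds.append(predict_fn(state))   # Y_hat_{t,m} = P(S_{t,m})
            modality_memory.append(state)
        if state is not None:
            # Assumption: the last modality's state summarises the slice.
            slice_memory.append(state)
        predictions.append(slice_preds)
    return predictions

# Toy usage with scalar "states": the update simply sums all of its inputs.
toy_update = lambda x, theta, modal_ctx, slice_ctx: x + theta + sum(modal_ctx) + sum(slice_ctx)
toy_preds = run_dual_memory([[1, 2], [3, 4]], toy_update, lambda s: s, lambda t, m: 0)
```

In the toy run, each prediction accumulates the states of earlier modalities in the same slice and of earlier slices, which is the dependency structure the update rule describes.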

2. Modality-and-Slice Memory Attention (MSMA)

Image embeddings F are uniformly split along the channel dimension:

[F_slice, F_modal] = Split(F)

Embeddings are updated through self-attention:

Q_slice = SA(φ(F_slice)), Q_modal = SA(φ(F_modal))

Cross-attention integrates memory information:

Z = CA(Q=Q_slice, K=V=S_{≺t}) + CA(Q=Q_modal, K=V=S_{t,≺m})
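
The three steps of MSMA (channel split, self-attention, cross-attention against the two memories) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it assumes single-head scaled dot-product attention, treats φ as the identity, and assumes that both memories are token matrices whose channel width equals half of the embedding width. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention. q: (Nq, C); k, v: (Nk, C)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def msma(F, slice_memory, modal_memory):
    """F: (N, C) image tokens; each memory: (N_mem, C // 2) tokens."""
    F_slice, F_modal = np.split(F, 2, axis=-1)        # uniform channel split
    Q_slice = attention(F_slice, F_slice, F_slice)    # self-attention (phi omitted)
    Q_modal = attention(F_modal, F_modal, F_modal)
    # Cross-attention: queries come from the image, keys/values from memory.
    Z = attention(Q_slice, slice_memory, slice_memory) \
        + attention(Q_modal, modal_memory, modal_memory)
    return Z
```

Because the two cross-attention outputs are summed, both branches must produce the same channel width, which is why the sketch assumes an even split and memories of width C // 2.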

3. Multi-Scale Category-Agnostic Prompt Encoder (MCP-Encoder)

Supports two modes:

Category-Agnostic Prompt Mode: Requires only a single bounding box covering the entire tumor region
Automatic Mode: No manual annotation required; autonomously generates tumor region guidance

Multi-scale fusion process:

F^fusion_i = {
    Concat(F^fusion_{i-1}, F_i, G_i), if prompt available
    Concat(F^fusion_{i-1}, F_i), otherwise
}

Final tumor region guidance:

P = DS(σ(φ(F^fusion_l)))
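
The fusion rule and the guidance formula can be sketched as follows. The summary does not define φ, σ, or DS, so this sketch assumes a 1×1 channel projection for φ, a sigmoid for σ, and average-pool downsampling for DS. It also assumes that all scales have already been resized to a common spatial resolution. The function names fuse_scales and guidance are hypothetical.

```python
import numpy as np

def fuse_scales(features, prompts=None):
    """Concatenate per-scale features F_i (and prompt features G_i, if any).

    features: list of (C_i, H, W) arrays; prompts: optional list of (C_g, H, W).
    """
    fused = None
    for i, F_i in enumerate(features):
        parts = [] if fused is None else [fused]   # F^fusion_{i-1}
        parts.append(F_i)
        if prompts is not None:                    # prompt available
            parts.append(prompts[i])
        fused = np.concatenate(parts, axis=0)      # concat along channels
    return fused

def guidance(fused, weight, stride=2):
    """P = DS(sigma(phi(F^fusion_l))) under the assumptions stated above."""
    projected = np.einsum("oc,chw->ohw", weight, fused)   # phi: 1x1 projection
    activated = 1.0 / (1.0 + np.exp(-projected))          # sigma: sigmoid
    C, H, W = activated.shape
    Hs, Ws = H // stride, W // stride
    cropped = activated[:, :Hs * stride, :Ws * stride]
    # DS: average pooling with a stride x stride window.
    return cropped.reshape(C, Hs, stride, Ws, stride).mean(axis=(2, 4))
```

In automatic mode the prompts argument is simply omitted, so the same code path produces guidance without any manual annotation.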

4. Modality-Adaptive Fusion Decoder (MF-Decoder)

For each modality m at slice t, receives memory-enhanced embeddings Z_{t,m} and corresponding tumor guidance P_{t,m}. Prompt embeddings are fused through element-wise addition:

H_{t,m} = Z_{t,m} ⊕ P_{t,m}

Generates modality-specific predictions:

Ŷ_{t,m} = P_pd(H_{t,m}) ⊗ P_mlp(E_{t,m})

Final segmentation mask is obtained through adaptive weighting strategy:

Ŷ_t = Σ_{m=1}^M w_m · Ŷ_{t,m}
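
The final weighted sum can be illustrated with a short NumPy sketch. The summary does not state how the weights w_m are obtained, so this sketch assumes they come from a softmax over one scalar score per modality. The name fuse_modalities and the logits input are hypothetical.

```python
import numpy as np

def fuse_modalities(preds, logits):
    """Compute Y_hat_t = sum_m w_m * Y_hat_{t,m}.

    preds: (M, H, W) modality-specific predictions.
    logits: (M,) per-modality scores (assumed; turned into weights by softmax).
    """
    preds = np.asarray(preds, dtype=float)
    logits = np.asarray(logits, dtype=float)
    w = np.exp(logits - logits.max())
    w = w / w.sum()                          # weights are positive and sum to 1
    fused = np.tensordot(w, preds, axes=1)   # contracts the modality axis -> (H, W)
    return fused, w
```

With equal scores the fusion reduces to a plain average of the modality-specific predictions, and a larger score shifts the result toward that modality.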

Technical Innovations

Dual-Memory Mechanism: First to simultaneously model cross-modal and inter-slice relationships, breaking isolation between modalities and slices
Category-Agnostic Prompting: Eliminates labor-intensive category-specific annotation, improving clinical applicability
Modality-Adaptive Fusion: Dynamically selects the most informative modality for each voxel
Memory-Enhanced Attention: Effectively captures long-range dependencies and contextual information

Experimental Setup

Datasets

BraTS-METS: Brain metastatic tumor segmentation dataset containing 652 multi-contrast MRI examinations covering four modalities: T1, T1c, T2, and FLAIR

BraTS-AGPT: Adult post-treatment glioma segmentation dataset containing 1,349 cases, focusing on segmentation of residual or recurrent gliomas following therapeutic intervention

Evaluation Metrics

Dice Similarity Coefficient: Measures segmentation quality; higher values indicate better performance
95% Hausdorff Distance (HD95): Evaluates boundary delineation accuracy; lower values indicate more accurate boundaries
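
For reference, both metrics can be sketched in NumPy. The Dice function follows the standard definition 2|A∩B| / (|A| + |B|). The HD95 function is a simplified brute-force variant that takes two point sets (for example, boundary voxel coordinates) and returns the larger of the two directed 95th-percentile distances. Evaluation toolkits differ in how they extract surfaces and combine the two directions, so this is one reasonable convention rather than the exact procedure used in the paper.

```python
import numpy as np

def dice_score(pred, target):
    """Dice similarity for two binary masks, as a fraction in [0, 1]."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0   # assumption: two empty masks count as a perfect match
    return 2.0 * np.logical_and(pred, target).sum() / denom

def hd95(points_a, points_b):
    """95th-percentile Hausdorff distance between two non-empty point sets.

    points_a: (Na, D) coordinates; points_b: (Nb, D) coordinates.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (Na, Nb).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1)   # nearest point in b for every point in a
    b_to_a = dists.min(axis=0)   # nearest point in a for every point in b
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))
```

Multiplying the coordinates by the voxel spacing before calling hd95 gives distances in millimetres.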

Evaluation of three hierarchical tumor regions:

Enhanced Tumor (ET): Enhanced tumor region
Tumor Core (TC): Union of ET and non-enhanced tumor core
Whole Tumor (WT): Union of TC and surrounding non-enhanced FLAIR hyperintense regions

Comparison Methods

Includes traditional methods (TransBTS, EoFormer, 3D-TransUNet, UNETR++, nnUnet-V2, SegMamba-V2) and prompt-based methods (SAM, MA-SAM, SAM2, MedSAM-2, SAM2-Adapter, SAMed-2)

Implementation Details

Hardware: NVIDIA A6000 GPU
Optimizer: AdamW (β1=0.9, β2=0.999)
Learning rate: 1×10^-4, weight decay: 0.01
Batch size: 16, training epochs: 300
Image size: 256×256
Modality memory buffer k=3, slice memory buffer n=7
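
The reported settings can be collected in a small configuration object. The class and field names below are hypothetical; only the values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the implementation details."""
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    betas: tuple = (0.9, 0.999)      # AdamW beta1, beta2
    batch_size: int = 16
    epochs: int = 300
    image_size: tuple = (256, 256)
    modality_memory_k: int = 3       # modality memory buffer size
    slice_memory_n: int = 7          # slice memory buffer size

config = TrainConfig()
```

In a PyTorch training script, these values would typically be passed as torch.optim.AdamW(params, lr=config.learning_rate, betas=config.betas, weight_decay=config.weight_decay).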

Experimental Results

Main Results

BraTS-METS Dataset:

MSM-Seg achieves an average Dice score of 79.51%, surpassing the best traditional method, SegMamba-V2 (73.92%), by 5.59 percentage points
Improves over the best prompt-based method, SAMed-2 (77.47%), by 2.04 percentage points
HD95 is reduced from SAMed-2's 14.27 mm to 13.75 mm

BraTS-AGPT Dataset:

MSM-Seg achieves an average Dice score of 83.84%, surpassing SegMamba-V2 (76.49%) by 7.35 percentage points
Improves over SAMed-2 (81.44%) by 2.40 percentage points
HD95 is reduced from SAMed-2's 6.12 mm to 5.56 mm
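
The reported gaps are absolute differences between average scores, which the following snippet recomputes from the figures above. The dictionary layout and helper names are hypothetical; no values beyond those quoted in this section are used.

```python
# (average Dice in %, HD95 in mm); None where the summary gives no value.
results = {
    "BraTS-METS": {
        "MSM-Seg": (79.51, 13.75),
        "SAMed-2": (77.47, 14.27),
        "SegMamba-V2": (73.92, None),
    },
    "BraTS-AGPT": {
        "MSM-Seg": (83.84, 5.56),
        "SAMed-2": (81.44, 6.12),
        "SegMamba-V2": (76.49, None),
    },
}

def dice_gain(dataset, baseline):
    """Absolute Dice difference (percentage points) of MSM-Seg over a baseline."""
    ours = results[dataset]["MSM-Seg"][0]
    return round(ours - results[dataset][baseline][0], 2)

def hd95_reduction(dataset, baseline):
    """Absolute HD95 reduction (mm) of MSM-Seg relative to a baseline."""
    ours = results[dataset]["MSM-Seg"][1]
    return round(results[dataset][baseline][1] - ours, 2)
```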

Ablation Studies

Systematic ablation studies validate the contribution of each component:

MSMA: Provides 0.65% and 0.81% Dice improvement
MCP-Encoder: Contributes additional 0.87% and 1.07% improvement
MF-Decoder: Further enhances by 1.08% and 1.33%
Dual-Memory Paradigm: Most significant contribution, averaging 1.73% and 2.08% improvement

Memory Capacity Analysis

Modality Memory Capacity: Increasing from k=0 to k=3 shows continuous performance improvement, with k=3 achieving optimal results, averaging 5.13% and 3.98% Dice improvement

Slice Memory Capacity: Increasing n from 0 to 16 yields significant gains, with n=8 providing the best balance between accuracy and efficiency

Modality Sequence Robustness

T-test analysis shows no statistically significant differences across different modality input sequences (P-values > 0.05), indicating that MSM-Seg is robust to variations in modality order.

Related Work

Multi-Modal Brain Tumor Segmentation

Early research adopted U-shaped encoder-decoder frameworks with 3D CNNs. Recent methods integrate 3D CNNs with Vision Transformers (ViT) to capture local spatial patterns and global contextual information. Current research explores replacing ViT with Vision Mamba and RWKV to model long-range dependencies with linear computational complexity.

Memory-Based Prompt Segmentation

Memory mechanisms are widely applied in video object segmentation tasks. SAM2 introduces complex memory buffers and memory attention mechanisms to enhance prediction consistency across sequential slices in volumetric scans. Subsequent works such as ReSurgSAM2 and Medical SAM2 optimize memory buffer storage and similarity metrics.

Conclusions and Discussion

Main Conclusions

MSM-Seg effectively integrates cross-modal and inter-slice information through a dual-memory segmentation paradigm, combined with category-agnostic prompt design, achieving significant performance improvements in multi-modal brain tumor segmentation tasks and providing an efficient and practical solution for clinical applications.

Limitations

Computational Overhead: Dual-memory mechanism increases inference latency from 3.86s to 4.17s
Memory Capacity Constraints: Diminishing marginal returns with larger memory capacity
Dataset Scale: Validation only on two BraTS datasets; broader dataset validation needed

Future Directions

Explore more efficient memory mechanisms to reduce computational overhead
Extend to other medical image segmentation tasks
Investigate adaptive memory capacity selection strategies

In-Depth Evaluation

Strengths

Strong Technical Innovation: Dual-memory paradigm and category-agnostic prompt design demonstrate significant novelty
Comprehensive Experiments: Thorough ablation and comparative experiments validate method effectiveness
High Practical Value: Reduces annotation burden and improves clinical applicability
Significant Performance Gains: Surpasses state-of-the-art methods across multiple metrics

Weaknesses

Insufficient Computational Complexity Analysis: Lacks detailed time and space complexity analysis
Limited Cross-Dataset Generalization Validation: Validation only on BraTS series datasets
Missing Failure Case Analysis: No specific analysis of method failure cases

Impact

This work provides a new technical paradigm for multi-modal medical image segmentation. The dual-memory mechanism and category-agnostic prompt design have broad application potential and are expected to have significant impact on the medical image analysis field.

Applicable Scenarios

Clinical Brain Tumor Diagnosis: Reduces physician annotation workload
Multi-Modal Medical Image Segmentation: Extensible to other organs and diseases
Computer-Aided Diagnosis Systems: Provides foundation for high-precision segmentation

References

The paper cites 45 relevant references covering key works in multi-modal segmentation, Vision Transformers, SAM series methods, and other critical domains, providing a solid theoretical foundation for this research.