Deep Attention-guided Adaptive Subsampling

Shankaranarayana, Roy, Sudhakar et al.
Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a non-differentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not input-adaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.

Basic Information

  • Paper ID: 2510.12376
  • Title: Deep Attention-guided Adaptive Subsampling
  • Authors: Sharath M Shankaranarayana, Soumava Kumar Roy, Prasad Sudhakar, Chandan Aladahalli (GE Healthcare, Bangalore, India)
  • Categories: cs.CV, cs.AI, cs.LG
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.12376v1

Abstract

Despite significant performance improvements in deep neural networks, these advances often come at the cost of increased computational complexity and expense. In many scenarios, such as 3D volumetric or video classification tasks, not all slices or frames are necessary due to inherent redundancy. To address this issue, the authors propose a novel learnable subsampling framework that can be integrated into any neural network architecture. The framework dynamically adapts to inputs during inference through an attention-guided sampling module, achieving performance improvements while reducing the complexity of deep neural network models.

Research Background and Motivation

Core Problems

  1. Computational Efficiency Challenges: Deep neural networks face enormous computational costs when processing high-dimensional data such as videos and volumetric scans
  2. Data Redundancy: Substantial redundant information exists in 3D medical imaging and video data, where not all frames/slices are useful for the final task
  3. Sampling Strategy Limitations: Traditional uniform sampling or hand-crafted heuristic methods cannot identify and prioritize the most salient information

Limitations of Existing Methods

  1. Deep Probabilistic Subsampling (DPS): While effective, it learns fixed, content-agnostic strategies
  2. Active Deep Probabilistic Subsampling (ADPS): Although introducing instance-level adaptivity, it conditions only on already-sampled components without directly leveraging input features themselves
  3. Static Nature: Once learned, the sampling mechanism in existing methods remains static and cannot adapt to different inputs

Research Motivation

Addressing the limitations of existing methods, this paper proposes a dynamic sampling framework that is both task-adaptive and input-adaptive, capable of adjusting sampling strategies at inference time based on specific inputs.

Core Contributions

  1. Novel Plug-and-Play Neural Sampling Module: Proposes a module for dynamic sampling of 3D volumes and videos that adapts to inputs at inference time, achieving both task and input adaptivity
  2. Comprehensive Performance Validation: Validates the framework's effectiveness on eight medical imaging datasets, including six MedMNIST3D datasets, one public ultrasound video dataset, and one proprietary dataset collected in a clinical setting
  3. End-to-End Trainable Framework: Ensures end-to-end differentiability of discrete sample selection through Gumbel-Softmax reparameterization
  4. Interpretability: Produces sampling matrices as outputs, making the sampling process explicitly controllable and interpretable

Methodology Details

Task Definition

Given a sequence $X \in \mathbb{R}^{B \times T \times C \times H \times W}$ containing $T$ frames, the objective is to learn a sampling function $S_\theta$ that selects a subset of $k$ frames (where $k \ll T$).
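
One way to picture the selection step is as multiplication by a one-hot sampling matrix of shape k × T, the form the method later produces as output. The sketch below is illustrative only: it handles a single sequence with frames flattened to short lists, and the helper names `selected_indices` and `select_frames` are hypothetical, not taken from the paper.

```python
def selected_indices(P):
    """Return, for each of the k rows of a one-hot matrix P (k x T), the chosen frame index."""
    return [max(range(len(row)), key=row.__getitem__) for row in P]


def select_frames(P, frames):
    """Apply P (k x T) to frames (T x D): row j of the result is sum_t P[j][t] * frames[t]."""
    k, T, D = len(P), len(frames), len(frames[0])
    return [[sum(P[j][t] * frames[t][d] for t in range(T)) for d in range(D)]
            for j in range(k)]


# Toy sequence: T = 4 frames, each flattened to D = 2 values.
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
# k = 2 one-hot rows picking frames 1 and 3.
P = [[0, 1, 0, 0],
     [0, 0, 0, 1]]

subset = select_frames(P, frames)   # the two selected frames
indices = selected_indices(P)       # which frames were kept
```

Because each row of the matrix is one-hot, reading off the selected indices is trivial, which is what makes the sampling decision easy to inspect.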

Model Architecture

1. Lightweight Feature Extraction

The feature extraction module contains multiple parallel pathways to compute rich representations of the input sequence:

  • Temporal Dynamics Capture: Computes inter-frame variance across spatial and channel dimensions
  • Anatomical Boundary Identification: Applies Sobel and Laplacian kernel sets to compute edge magnitudes
  • Feature Aggregation: Concatenates extracted features to form a comprehensive feature representation $F \in \mathbb{R}^{B \times T \times d}$
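
The pathways above can be sketched roughly as follows. This is a minimal pure-Python illustration under several assumptions: frames are single-channel 2D lists, "inter-frame variance" is taken to mean the variance of the pixel-wise difference to the previous frame, only the Sobel kernels are used (the Laplacian pathway is omitted), and d = 2. All function names are hypothetical; the paper's exact feature definitions may differ.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def temporal_variance(prev, cur):
    """Variance of the pixel-wise difference between two frames (assumed definition)."""
    diffs = [c - p for prow, crow in zip(prev, cur) for p, c in zip(prow, crow)]
    return variance(diffs)


def conv_at(frame, kernel, y, x):
    """3x3 correlation response centred on the interior pixel (y, x)."""
    return sum(kernel[i][j] * frame[y - 1 + i][x - 1 + j]
               for i in range(3) for j in range(3))


def mean_edge_magnitude(frame):
    """Mean Sobel gradient magnitude over interior pixels."""
    h, w = len(frame), len(frame[0])
    mags = [math.hypot(conv_at(frame, SOBEL_X, y, x), conv_at(frame, SOBEL_Y, y, x))
            for y in range(1, h - 1) for x in range(1, w - 1)]
    return sum(mags) / len(mags)


def extract_features(frames):
    """Map T frames to a T x 2 feature list: [temporal variance, edge magnitude]."""
    feats = []
    for t, frame in enumerate(frames):
        prev = frames[t - 1] if t > 0 else frame  # first frame has no predecessor
        feats.append([temporal_variance(prev, frame), mean_edge_magnitude(frame)])
    return feats


flat = [[0.0] * 4 for _ in range(4)]              # constant frame, no edges
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]   # vertical edge in the middle
F = extract_features([flat, flat, step])
```

In this toy input, only the third frame differs from its predecessor and contains an edge, so it is the only frame with non-zero features. A practical implementation would compute the same quantities with batched tensor operations.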

2. Multi-Head Attention Layer

The aggregated feature tensor F is processed through a multi-head attention layer to generate final sampling logits:

$$s^h = \text{Softplus}(\text{MLP}^h(F))$$

$$A^h_{(:,j,:)} = a_{\text{base}} \odot s^h_{(:,j)}$$

$$A = \frac{1}{H} \sum_{h=1}^{H} A^h$$

where $H$ is the number of attention heads and $s^h \in \mathbb{R}^{B \times k}$ are head-specific scale factors.
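
The per-head scaling and averaging can be sketched for a single sample as below. This is an illustration under assumptions: the base logits are treated as a given vector of length T (this summary does not say how they are computed), the raw head outputs stand in for the MLP outputs, and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable Softplus: log(1 + exp(x)), always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def head_logits(a_base, raw_scales):
    """A^h[j][t] = a_base[t] * Softplus(raw_scales[j]) for one head."""
    scales = [softplus(r) for r in raw_scales]
    return [[a * s for a in a_base] for s in scales]


def average_heads(per_head):
    """Average a list of H matrices (each k x T) element-wise."""
    H = len(per_head)
    k, T = len(per_head[0]), len(per_head[0][0])
    return [[sum(A[j][t] for A in per_head) / H for t in range(T)] for j in range(k)]


a_base = [0.2, 1.0, -0.5, 0.7]    # T = 4 base logits (placeholder values)
raw = [[0.0, 1.0], [2.0, -1.0]]   # H = 2 heads, k = 2 raw outputs each (placeholders)
A = average_heads([head_logits(a_base, r) for r in raw])
```

Softplus keeps every scale factor positive, so each head can sharpen or flatten the base logits for a given sampling slot without reversing their ordering.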

3. Differentiable Gumbel-Softmax Sampling

To enable end-to-end training, the Gumbel-Softmax trick is employed for differentiable sampling:

Adaptive Temperature Scaling: $\tau = \tau_0 \cdot (0.5 + \sigma(\text{MLP}_{\text{temp}}(F)))$

Sampling Process: $G_{b,j,t} \sim \text{Gumbel}(0,1)$, $P_{\text{soft}} = \text{Softmax}_t\left(\frac{A + G}{\tau}\right)$

A straight-through estimator (STE) ensures differentiability, ultimately yielding the sampling matrix $P \in \mathbb{R}^{B \times k \times T}$.
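
A forward-pass sketch of this sampling step for one sample is given below. It is illustrative only: the scalar `temp_logit` stands in for the temperature MLP output, the function names are hypothetical, and the backward pass of the straight-through estimator is not shown because it needs an autodiff framework (there it is commonly written as hard + soft - stop_gradient(soft)).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_temperature(tau0, temp_logit):
    """tau = tau0 * (0.5 + sigmoid(temp_logit)), so tau lies in (0.5*tau0, 1.5*tau0)."""
    return tau0 * (0.5 + sigmoid(temp_logit))


def sample_gumbel(rng):
    """Draw from Gumbel(0, 1) via inverse transform: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_rows(A, tau, rng):
    """Return (P_soft, P_hard) for logits A of shape k x T."""
    P_soft, P_hard = [], []
    for row in A:
        noisy = [(a + sample_gumbel(rng)) / tau for a in row]
        soft = softmax(noisy)
        winner = max(range(len(soft)), key=soft.__getitem__)
        P_soft.append(soft)
        P_hard.append([1.0 if t == winner else 0.0 for t in range(len(soft))])
    return P_soft, P_hard


rng = random.Random(0)
A = [[0.1, 2.0, 0.3, 0.2], [1.5, 0.0, 0.1, 0.4]]      # k = 2, T = 4 placeholder logits
tau = adaptive_temperature(tau0=1.0, temp_logit=0.0)  # sigmoid(0) = 0.5, so tau = 1.0
P_soft, P_hard = gumbel_softmax_rows(A, tau, rng)
```

Because the sigmoid is bounded in (0, 1), the temperature always stays between 0.5 and 1.5 times the base value. Note that rows sampled independently, as here, can pick the same frame twice; how duplicates are handled is not described in this summary.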

Technical Innovations

  1. Dynamic Input Adaptation: Unlike DPS's static strategy, DAS (the proposed method) dynamically adjusts its sampling strategy based on input content
  2. Lightweight Design: Compared to ADPS's multi-stage process, DAS employs a single-pass lightweight module
  3. Adaptive Temperature Mechanism: Dynamically controls the trade-off between exploration and exploitation
  4. Multi-modal Feature Fusion: Combines temporal dynamics and spatial structure information

Experimental Setup

Datasets

  1. MedMNIST3D: Six 3D volumetric datasets (Organ, Nodule, Adrenal, Fracture, Vessel, Synapse) covering organ classification and pathology detection tasks
  2. Breast Ultrasound Video (BUSV): Public breast ultrasound video dataset used as a binary classification benchmark for breast lesions
  3. Internal Gastric Antrum Dataset: Proprietary clinical ultrasound video dataset collected in real hospital environments, with five-class gastric content labels

Evaluation Metrics

  • Balanced Accuracy
  • AUC (Area Under Curve)
  • All results are averaged over three independent runs
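
Balanced accuracy is the mean of per-class recall, which makes it more informative than plain accuracy on imbalanced data. The sketch below shows the computation on a toy example; the function name and the data are illustrative, not from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Imbalanced toy example: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10   # always predict the majority class
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
```

Here the majority-class predictor reaches 80% plain accuracy but only 50% balanced accuracy, since its recall on the minority class is zero.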

Comparison Methods

  1. Full Sequence: Processes all frames or slices (the reference baseline and the upper bound on computational cost)
  2. Random Sampling: Randomly selecting k frames
  3. Uniform Sampling: Equidistantly selecting frames
  4. Deep Probabilistic Subsampling (DPS): Task-adaptive but content-agnostic learned sampling
  5. Active Deep Probabilistic Subsampling (ADPS): Input-adaptive but conditioned only on already-sampled components
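
The two non-learned baselines are simple enough to sketch directly. The helper names are hypothetical, and the equidistant convention used here (start at index 0 with a fixed stride) is one of several reasonable choices; the paper's exact convention is not stated in this summary.

```python
import random


def uniform_indices(T, k):
    """Pick k equidistant frame indices from range(T)."""
    return [(i * T) // k for i in range(k)]


def random_indices(T, k, rng):
    """Pick k distinct frame indices uniformly at random, in temporal order."""
    return sorted(rng.sample(range(T), k))


T = 16
k = T // 2   # 50% sampling ratio
uni = uniform_indices(T, k)
rnd = random_indices(T, k, random.Random(0))
```

Both baselines ignore the frame content entirely, which is the property the learned samplers are meant to improve on.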

Implementation Details

  • Downstream Architecture: MobileNetV3-Small as feature extractor
  • Optimizer: Adam (lr=1e-4, batch size=16)
  • Sampling Ratio: All subsampling methods select 50% of original sequence length
  • Early Stopping Strategy: Based on validation loss
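
Early stopping on validation loss can be sketched as below. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values are all hypothetical; the paper's patience setting is not given in this summary. Only the learning rate, batch size, and sampling ratio in `CONFIG` come from the list above.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


CONFIG = {"lr": 1e-4, "batch_size": 16, "sampling_ratio": 0.5}  # values listed above

stopper = EarlyStopping(patience=2)   # patience chosen for illustration only
losses = [0.90, 0.80, 0.85, 0.82, 0.70]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With a patience of 2, training in this toy run stops at epoch 3, before the later improvement to 0.70 is seen, which illustrates why the patience value matters.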

Experimental Results

Main Results

Performance on Public Datasets (Table 1)

DAS outperforms DPS and ADPS on most MedMNIST3D datasets, although the margins over ADPS in the examples below are modest:

  • Organ Dataset: AUC 0.931 vs ADPS 0.928, accuracy 58.1% vs ADPS 57.3%
  • Nodule Dataset: AUC 0.799 vs ADPS 0.782, accuracy 75.8% vs ADPS 75.8% (tie)
  • Vessel Dataset: AUC 0.752 vs ADPS 0.739, accuracy 82.9% vs ADPS 80.7%

Performance on Internal Dataset (Table 2)

On the challenging gastric antrum dataset, DAS even surpasses the full sequence baseline:

  • AUC: 0.639 vs Full Sequence 0.611
  • Accuracy: 34.1% vs Full Sequence 30.1%
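
To put these numbers in context, the short calculation below computes the absolute gains and the chance level for a five-class task. It assumes the reported accuracy is the balanced accuracy listed under Evaluation Metrics, for which uniform guessing gives 1/5.

```python
das = {"auc": 0.639, "acc": 34.1}
full = {"auc": 0.611, "acc": 30.1}

auc_gain = round(das["auc"] - full["auc"], 3)   # absolute AUC difference
acc_gain = round(das["acc"] - full["acc"], 1)   # accuracy gain in percentage points
chance_acc = 100.0 / 5                          # uniform guessing over five classes
```

The gains are 0.028 AUC and 4.0 percentage points, while both methods remain fairly close to the 20% chance level, which is consistent with the dataset being described as challenging.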

Key Findings

  1. Redundancy Exploitation: ADPS and DAS achieve near full-sequence performance on many datasets, indicating that data redundancy exists in classification tasks and can be exploited by superior sampling strategies
  2. Real-World Scenario Advantages: DAS performs particularly well on noisy clinical ultrasound scans
  3. Computational Efficiency: Processes only half of the frames or slices while maintaining or improving performance, although the resulting savings are not quantified in the paper

Ablation Studies

Although the paper lacks detailed ablation experiments, comparisons with the different baselines suggest:

  • The importance of attention mechanisms (improvements over random and uniform sampling)
  • The value of input adaptivity (improvements over DPS)
  • The advantages of dynamic sampling (compared to static methods)

Related Work

Learnable Subsampling

  • DPS: First proposed a differentiable framework for learning task-adaptive sampling patterns, but adopts fixed content-agnostic strategies
  • ADPS: Extended DPS by enabling instance-adaptive sampling, but the multi-stage process introduces significant computational overhead at inference time

Attention Mechanisms

  • Widely used for identifying salient frames in videos, but often lack end-to-end differentiability or integration within a unified sampling framework

Differentiable Sampling Techniques

  • Gumbel-Softmax Trick: Enables training of networks with discrete choices
  • This work combines attention mechanisms with Gumbel-Softmax-based samplers, achieving high adaptivity and end-to-end trainability

Conclusions and Discussion

Main Conclusions

  1. DAS successfully achieves dual task and input adaptivity, dynamically adjusting sampling strategies at inference time
  2. Validates the method's effectiveness on multiple medical imaging datasets, particularly excelling in real clinical environments
  3. The framework is designed to be integrated into any neural network architecture, although the experiments use a single downstream network (MobileNetV3-Small)

Limitations

  1. Feature Extraction Dependency: Current use of predefined features (temporal variance, edge detection) may limit adaptivity
  2. Evaluation Scope: Primarily validated in medical imaging; generalization to other domains requires further verification
  3. Computational Overhead Analysis: Lacks detailed computational complexity analysis and actual inference time comparisons

Future Directions

The paper suggests a promising research direction: developing learnable feature extraction modules that can automatically identify salient features for guiding the sampling process, further enhancing DAS performance.

In-Depth Evaluation

Strengths

  1. Clear Problem Definition: Accurately identifies core limitations of existing methods (static vs. dynamic sampling)
  2. Technical Innovation: Cleverly combines attention mechanisms with differentiable sampling to achieve input adaptivity
  3. Comprehensive Experiments: Thorough evaluation across multiple datasets, including real clinical data
  4. High Practical Value: Simple and effective method, easily integrated into existing architectures

Weaknesses

  1. Lack of Theoretical Analysis: Missing theoretical analysis of convergence and stability
  2. Insufficient Ablation Studies: No detailed analysis of individual component contributions (multi-head attention, adaptive temperature, etc.)
  3. Quantification of Computational Efficiency: While claiming efficiency improvements, lacks specific computational time and memory usage comparisons
  4. Hyperparameter Sensitivity: No analysis of critical hyperparameter effects (e.g., number of heads H, temperature τ₀)

Impact

  1. Academic Contribution: Provides new perspectives for learnable sampling, particularly regarding input adaptivity
  2. Practical Application: Direct application value in medical image processing, especially suitable for resource-constrained environments
  3. Reproducibility: Method description is relatively clear, but released code and full implementation details are lacking

Applicable Scenarios

  1. Medical Image Analysis: 3D volumetric data and ultrasound video processing
  2. Video Understanding: Efficient processing of long video sequences
  3. Resource-Constrained Environments: Mobile devices and edge computing scenarios
  4. Real-Time Applications: Clinical diagnostic systems requiring rapid response

References

The paper cites key works in the field, including:

  • Gumbel-Softmax related works [3, 4]
  • Pioneering works on learnable sampling: DPS [1] and ADPS [2]
  • MedMNIST3D benchmark datasets [5]
  • Applications of attention mechanisms in video processing [7, 8]

Overall Assessment: This is a technically sound paper with clear problem definition. While it could be strengthened in theoretical analysis and experimental depth, the proposed dynamic input-adaptive sampling approach has significant value, particularly demonstrating strong potential in practical applications such as medical imaging. The simplicity and generality of the method contribute to its practical utility.