
Deep Attention-guided Adaptive Subsampling

Shankaranarayana, Roy, Sudhakar et al.

Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a non-differentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not input-adaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.


Basic Information

Paper ID: 2510.12376
Title: Deep Attention-guided Adaptive Subsampling
Authors: Sharath M Shankaranarayana, Soumava Kumar Roy, Prasad Sudhakar, Chandan Aladahalli (GE Healthcare, Bangalore, India)
Categories: cs.CV, cs.AI, cs.LG
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12376v1

Abstract

Despite significant performance improvements in deep neural networks, these advances often come at the cost of increased computational complexity and expense. In many scenarios, such as 3D volumetric or video classification tasks, not all slices or frames are necessary due to inherent redundancy. To address this issue, the authors propose a novel learnable subsampling framework that can be integrated into any neural network architecture. The framework dynamically adapts to inputs during inference through an attention-guided sampling module, achieving performance improvements while reducing the complexity of deep neural network models.

Research Background and Motivation

Core Problems

Computational Efficiency Challenges: Deep neural networks face enormous computational costs when processing high-dimensional data such as videos and volumetric scans
Data Redundancy: Substantial redundant information exists in 3D medical imaging and video data, where not all frames/slices are useful for the final task
Sampling Strategy Limitations: Traditional uniform sampling or hand-crafted heuristic methods cannot identify and prioritize the most salient information

Limitations of Existing Methods

Deep Probabilistic Subsampling (DPS): While effective, it learns fixed, content-agnostic strategies
Active Deep Probabilistic Subsampling (ADPS): Although introducing instance-level adaptivity, it conditions only on already-sampled components without directly leveraging input features themselves
Static Nature: Once learned, the sampling mechanism in existing methods remains static and cannot adapt to different inputs

Research Motivation

Addressing the limitations of existing methods, this paper proposes a dynamic sampling framework that is both task-adaptive and input-adaptive, capable of adjusting sampling strategies at inference time based on specific inputs.

Core Contributions

Novel Plug-and-Play Neural Sampling Module: Proposes a module for dynamic sampling of 3D volumes and videos that adapts to inputs at inference time, achieving both task and input adaptivity
Comprehensive Performance Validation: Validates the framework's effectiveness on eight medical imaging datasets, including six MedMNIST3D datasets, one public ultrasound video dataset, and one proprietary dataset collected in a clinical setting
End-to-End Trainable Framework: Ensures end-to-end differentiability of discrete sample selection through Gumbel-Softmax reparameterization
Interpretability: Produces sampling matrices as outputs, making the sampling process explicitly controllable and interpretable

Methodology Details

Task Definition

Given a sequence $X \in \mathbb{R}^{B \times T \times C \times H \times W}$ of T frames (batch size B, C channels, spatial size H × W), the objective is to learn a sampling function $S_\theta$ that selects a subset of k frames (where $k \ll T$).
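The shapes in this definition can be made concrete with a small sketch. It is only an illustration of what selecting k of T frames per sample means, not the paper's sampler: `select_frames` and the hand-written index array `idx` are hypothetical, and the indices are fixed by hand rather than learned. Note that each batch element may receive different indices, which is what input adaptivity requires.

```python
import numpy as np

def select_frames(x, idx):
    """Gather k frames per batch element.

    x:   (B, T, C, H, W) input sequence
    idx: (B, k) integer frame indices into the T axis
    returns: (B, k, C, H, W)
    """
    b = np.arange(x.shape[0])[:, None]  # (B, 1), broadcasts against idx
    return x[b, idx]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 1, 4, 4))            # B=2, T=8
idx = np.array([[0, 2, 5, 7], [1, 3, 4, 6]])    # different frames per sample
sub = select_frames(x, idx)                     # shape (2, 4, 1, 4, 4)
```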

Model Architecture

1. Lightweight Feature Extraction

The feature extraction module contains multiple parallel pathways to compute rich representations of the input sequence:

Temporal Dynamics Capture: Computes inter-frame variance across spatial and channel dimensions
Anatomical Boundary Identification: Applies Sobel and Laplacian kernel sets to compute edge magnitudes
Feature Aggregation: Concatenates extracted features to form a comprehensive feature representation $F \in \mathbb{R}^{B \times T \times d}$
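The three pathways above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' module: the exact variance definition, kernel sets, and feature dimension d are not given in this summary, so the function `frame_features`, the choice of d = 3, the frame-to-frame squared difference as the temporal cue, and the single 3×3 Sobel and Laplacian kernels are all illustrative choices.

```python
import numpy as np

def frame_features(x):
    """x: (B, T, C, H, W) -> F: (B, T, 3). Hypothetical feature choice."""
    g = x.mean(axis=2)  # collapse channels -> (B, T, H, W)

    # Temporal cue: mean squared difference to the previous frame (0 at t=0).
    diff = np.zeros_like(g)
    diff[:, 1:] = g[:, 1:] - g[:, :-1]
    temporal = (diff ** 2).mean(axis=(2, 3))

    # 3x3 Sobel gradients on the interior pixels, built from shifted slices.
    gx = (g[..., :-2, 2:] + 2 * g[..., 1:-1, 2:] + g[..., 2:, 2:]
          - g[..., :-2, :-2] - 2 * g[..., 1:-1, :-2] - g[..., 2:, :-2])
    gy = (g[..., 2:, :-2] + 2 * g[..., 2:, 1:-1] + g[..., 2:, 2:]
          - g[..., :-2, :-2] - 2 * g[..., :-2, 1:-1] - g[..., :-2, 2:])
    sobel = np.sqrt(gx ** 2 + gy ** 2).mean(axis=(2, 3))

    # 4-neighbour Laplacian on the interior pixels.
    lap = (g[..., :-2, 1:-1] + g[..., 2:, 1:-1] + g[..., 1:-1, :-2]
           + g[..., 1:-1, 2:] - 4 * g[..., 1:-1, 1:-1])
    laplacian = np.abs(lap).mean(axis=(2, 3))

    # Concatenate per-frame descriptors into F with d = 3.
    return np.stack([temporal, sobel, laplacian], axis=-1)
```

A frame that is constant in space and unchanged from its predecessor yields an all-zero feature row, so frames with motion or visible edges stand out in F.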

2. Multi-Head Attention Layer

The aggregated feature tensor F is processed through a multi-head attention layer to generate final sampling logits:

$s_h = \text{Softplus}(\text{MLP}_h(F))$

$A_h^{(:,j,:)} = a_{\text{base}} \odot s_h^{(:,j)}$

$A = \frac{1}{H} \sum_{h=1}^{H} A_h$

where H is the number of attention heads (not the frame height H in the shape of X) and $s_h \in \mathbb{R}^{B \times k}$ are head-specific scale factors.
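The broadcasting implied by these equations can be sketched as below. Several things are assumptions here, since the summary does not define them: $a_{\text{base}}$ is taken to be a per-frame base logit of shape (B, T), the raw MLP outputs are replaced by random stand-ins, and the helper name `combine_heads` is hypothetical. The sketch only shows how per-head, per-slot scale factors turn into logits of shape (B, k, T).

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); output is strictly positive.
    return np.logaddexp(0.0, z)

def combine_heads(a_base, head_outputs):
    """a_base: (B, T) assumed base logits.
    head_outputs: (H, B, k) raw per-head outputs standing in for MLP_h(F).
    returns A: (B, k, T), the head-averaged sampling logits."""
    s = softplus(head_outputs)                    # (H, B, k) scale factors
    a_heads = s[..., None] * a_base[:, None, :]   # (H, B, k, T) via broadcasting
    return a_heads.mean(axis=0)                   # average over heads

rng = np.random.default_rng(0)
a_base = rng.normal(size=(2, 6))                  # B=2, T=6
A = combine_heads(a_base, rng.normal(size=(4, 2, 3)))  # H=4, k=3 -> (2, 3, 6)
```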

3. Differentiable Gumbel-Softmax Sampling

To enable end-to-end training, the Gumbel-Softmax trick is employed for differentiable sampling:

Adaptive Temperature Scaling: $\tau = \tau_0 \cdot (0.5 + \sigma(\text{MLP}_{\text{temp}}(F)))$

Sampling Process: $G_{b,j,t} \sim \text{Gumbel}(0,1)$, followed by $P_{\text{soft}} = \text{Softmax}_t\left(\frac{A + G}{\tau}\right)$

A straight-through estimator (STE) then discretizes $P_{\text{soft}}$ in the forward pass while passing gradients through the soft relaxation, yielding the sampling matrix $P \in \mathbb{R}^{B \times k \times T}$.
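The sampling step can be illustrated with a forward-only NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the temperature input `raw` stands in for $\text{MLP}_{\text{temp}}(F)$, and NumPy has no autograd, so the straight-through estimator appears only as a comment. The k rows are also sampled independently here, so two rows may pick the same frame; whether the paper prevents this is not stated in this summary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_temperature(raw, tau0=1.0):
    # raw stands in for MLP_temp(F); tau0 is an assumed base temperature.
    return tau0 * (0.5 + sigmoid(raw))

def gumbel_softmax_sample(logits, tau, rng):
    """logits: (B, k, T). Returns (p_soft, p_hard), both (B, k, T)."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                       # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)         # stabilise the softmax
    p_soft = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_hard = np.zeros_like(p_soft)
    np.put_along_axis(p_hard, p_soft.argmax(axis=-1)[..., None], 1.0, axis=-1)
    # In an autograd framework, a common STE form is
    # p = p_hard + p_soft - stop_gradient(p_soft): hard forward, soft backward.
    return p_soft, p_hard

rng = np.random.default_rng(0)
B, k, T = 2, 3, 6
x = rng.normal(size=(B, T, 1, 4, 4))
logits = rng.normal(size=(B, k, T))
tau = adaptive_temperature(np.zeros((B, 1, 1)))   # sigmoid(0) = 0.5, so tau = 1
p_soft, p_hard = gumbel_softmax_sample(logits, tau, rng)
# Each one-hot row of p_hard picks one frame of x.
x_sub = np.einsum("bkt,btchw->bkchw", p_hard, x)  # (B, k, C, H, W)
```

Because the selection is expressed as a matrix product, the same sampling matrix can be inspected directly to see which frames were chosen for a given input.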

Technical Innovations

Dynamic Input Adaptation: Unlike DPS's static strategy, DAS dynamically adjusts sampling strategy based on input content
Lightweight Design: Compared to ADPS's multi-stage process, DAS employs a single-pass lightweight module
Adaptive Temperature Mechanism: Dynamically controls the trade-off between exploration and exploitation
Multi-modal Feature Fusion: Combines temporal dynamics and spatial structure information
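
A short worked bound may make the adaptive temperature mechanism more concrete. Assuming $\sigma$ denotes the logistic sigmoid and $\tau_0 > 0$ (neither is stated explicitly in this summary), the formula $\tau = \tau_0 \cdot (0.5 + \sigma(\text{MLP}_{\text{temp}}(F)))$ implies:

$0 < \sigma(z) < 1 \;\Rightarrow\; 0.5\,\tau_0 < \tau < 1.5\,\tau_0$

Under these assumptions the input can change the temperature by at most a factor of three between its extremes. A smaller $\tau$ sharpens the softmax toward a one-hot choice (exploitation), while a larger $\tau$ flattens it toward uniform (exploration).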

Experimental Setup

Datasets

MedMNIST3D: Six 3D volumetric datasets (Organ, Nodule, Adrenal, Fracture, Vessel, Synapse) covering multi-organ classification and pathology detection tasks
Breast Ultrasound Video (BUSV): Public breast ultrasound video dataset for binary classification benchmark of breast lesion detection
Internal Gastric Antrum Dataset: Proprietary clinical ultrasound video dataset collected in real hospital environments, containing five-class gastric content classification

Evaluation Metrics

Balanced Accuracy
AUC (Area Under Curve)
All results are averaged over three independent runs
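
For readers unfamiliar with the first metric, balanced accuracy is the mean of per-class recall, which matters on imbalanced clinical data. The helper below is an illustrative pure-Python sketch with a hypothetical name, not the evaluation code used in the paper. In the example, a classifier that always predicts class 0 has plain accuracy 0.8 but balanced accuracy 0.5.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]   # always predicts the majority class
score = balanced_accuracy(y_true, y_pred)  # recall 1.0 and 0.0 -> 0.5
```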

Comparison Methods

Full Sequence: Processing all frames or slices (computational upper bound)
Random Sampling: Randomly selecting k frames
Uniform Sampling: Equidistantly selecting frames
Deep Probabilistic Subsampling (DPS): Task-adaptive but content-agnostic learned sampling
Active Deep Probabilistic Subsampling (ADPS): Input-adaptive but conditioned only on already-sampled components
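
The two non-learned baselines in this list are simple enough to sketch. The function names are hypothetical, and the rounding rule for equidistant indices is an assumption, since the paper's exact index rule is not given in this summary.

```python
import numpy as np

def uniform_indices(T, k):
    # k equidistant positions between the first and last frame, rounded.
    return np.linspace(0, T - 1, num=k).round().astype(int)

def random_indices(T, k, rng):
    # k distinct frames chosen uniformly at random, kept in temporal order.
    return np.sort(rng.choice(T, size=k, replace=False))

rng = np.random.default_rng(0)
u = uniform_indices(16, 8)        # 50% sampling ratio, as in the experiments
r = random_indices(16, 8, rng)
```

Both baselines ignore the input content entirely, which is the property the learned samplers are meant to improve on.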

Implementation Details

Downstream Architecture: MobileNetV3-Small as feature extractor
Optimizer: Adam (lr=1e-4, batch size=16)
Sampling Ratio: All subsampling methods select 50% of original sequence length
Early Stopping Strategy: Based on validation loss

Experimental Results

Main Results

Performance on Public Datasets (Table 1)

DAS outperforms DPS and ADPS on most MedMNIST3D datasets, though the margins over ADPS shown below are modest:

Organ Dataset: AUC 0.931 vs ADPS 0.928, accuracy 58.1% vs ADPS 57.3%
Nodule Dataset: AUC 0.799 vs ADPS 0.782, accuracy 75.8% vs ADPS 75.8% (tied)
Vessel Dataset: AUC 0.752 vs ADPS 0.739, accuracy 82.9% vs ADPS 80.7%

Performance on Internal Dataset (Table 2)

On the challenging gastric antrum dataset, DAS even surpasses the full sequence baseline:

AUC: 0.639 vs Full Sequence 0.611
Accuracy: 34.1% vs Full Sequence 30.1%

Key Findings

Redundancy Exploitation: ADPS and DAS achieve near full-sequence performance on many datasets, indicating that data redundancy exists in classification tasks and can be exploited by superior sampling strategies
Real-World Scenario Advantages: DAS performs particularly well on noisy clinical ultrasound scans
Computational Efficiency: Achieves significant computational savings while maintaining or improving performance

Ablation Studies

Although the paper lacks detailed ablation experiments, comparisons with different baselines reveal:

The importance of attention mechanisms (improvements over random and uniform sampling)
The value of input adaptivity (improvements over DPS)
The advantages of dynamic sampling (compared to static methods)

Related Work

Learnable Subsampling

DPS: First proposed a differentiable framework for learning task-adaptive sampling patterns, but adopts fixed content-agnostic strategies
ADPS: Extended DPS by enabling instance-adaptive sampling, but the multi-stage process introduces significant computational overhead at inference time

Attention Mechanisms

Widely used for identifying salient frames in videos, but often lack end-to-end differentiability or integration within a unified sampling framework

Differentiable Sampling Techniques

Gumbel-Softmax Trick: Enables training of networks with discrete choices
This work combines attention mechanisms with Gumbel-Softmax-based samplers, achieving high adaptivity and end-to-end trainability

Conclusions and Discussion

Main Conclusions

DAS successfully achieves dual task and input adaptivity, dynamically adjusting sampling strategies at inference time
Validates the method's effectiveness on multiple medical imaging datasets, particularly excelling in real clinical environments
The framework demonstrates good generality and can be integrated into any neural network architecture

Limitations

Feature Extraction Dependency: Current use of predefined features (temporal variance, edge detection) may limit adaptivity
Evaluation Scope: Primarily validated in medical imaging; generalization to other domains requires further verification
Computational Overhead Analysis: Lacks detailed computational complexity analysis and actual inference time comparisons

Future Directions

The paper suggests a promising research direction: developing learnable feature extraction modules that can automatically identify salient features for guiding the sampling process, further enhancing DAS performance.

In-Depth Evaluation

Strengths

Clear Problem Definition: Accurately identifies core limitations of existing methods (static vs. dynamic sampling)
Technical Innovation: Cleverly combines attention mechanisms with differentiable sampling to achieve input adaptivity
Comprehensive Experiments: Thorough evaluation across multiple datasets, including real clinical data
High Practical Value: Simple and effective method, easily integrated into existing architectures

Weaknesses

Lack of Theoretical Analysis: Missing theoretical analysis of convergence and stability
Insufficient Ablation Studies: No detailed analysis of individual component contributions (multi-head attention, adaptive temperature, etc.)
Quantification of Computational Efficiency: While claiming efficiency improvements, lacks specific computational time and memory usage comparisons
Hyperparameter Sensitivity: No analysis of critical hyperparameter effects (e.g., number of heads H, temperature τ₀)

Impact

Academic Contribution: Provides new perspectives for learnable sampling, particularly regarding input adaptivity
Practical Application: Direct application value in medical image processing, especially suitable for resource-constrained environments
Reproducibility: Method description is relatively clear, but no code is released and several implementation details are missing

Applicable Scenarios

Medical Image Analysis: 3D volumetric data and ultrasound video processing
Video Understanding: Efficient processing of long video sequences
Resource-Constrained Environments: Mobile devices and edge computing scenarios
Real-Time Applications: Clinical diagnostic systems requiring rapid response

References

The paper cites key works in the field, including:

Gumbel-Softmax related works [3, 4]
Pioneering works on learnable sampling: DPS [1] and ADPS [2]
MedMNIST3D benchmark datasets [5]
Applications of attention mechanisms in video processing [7, 8]

Overall Assessment: This is a technically sound paper with clear problem definition. While it could be strengthened in theoretical analysis and experimental depth, the proposed dynamic input-adaptive sampling approach has significant value, particularly demonstrating strong potential in practical applications such as medical imaging. The simplicity and generality of the method contribute to its practical utility.