Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis

Shu, Luo, Poellinger et al.
Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.

Basic Information

  • Paper ID: 2510.12704
  • Title: Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis
  • Authors: Shelley Zixin Shu, Haozhe Luo, Alexander Poellinger, Mauricio Reyes
  • Classification: cs.CV cs.AI
  • Publication Date: October 14, 2025
  • Paper Link: https://arxiv.org/abs/2510.12704v1

Abstract

Transformer-based deep learning models have demonstrated superior feature representation and interpretability capabilities through attention mechanisms in medical imaging. However, these models are prone to learning spurious correlations, leading to bias and limited generalization ability. While human-machine attention alignment can mitigate these issues, it often relies on expensive manual supervision. This work proposes a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-discriminative attention without relying on restrictive priors, promoting robustness and flexibility. Validated on chest X-ray classification tasks using Vision Transformer (ViT), H-EGL surpasses two state-of-the-art explanation-guided learning methods, demonstrating superior classification accuracy and generalization ability while producing attention maps better aligned with human expert knowledge.

Research Background and Motivation

Problem Definition

The core problems addressed in this research are spurious correlation learning and attention alignment in Transformer-based medical imaging models. Specifically:

  1. Spurious Correlation Problem: Deep neural networks tend to learn spurious correlations in data, leading to shortcut learning, bias, and fairness issues
  2. Attention Alignment Challenge: While human-machine attention alignment can improve model robustness, it requires expensive manual annotation
  3. Limitations of Existing Methods: Pure self-supervised methods may reinforce incorrect interpretations, while contrastive learning methods lack standardized positive and negative sample generation approaches

Research Significance

In medical image diagnosis, model interpretability and reliability are crucial. Incorrect attention patterns may lead to:

  • Clinical decision errors
  • Missed critical pathological features
  • Model generalization failure across different data distributions

Limitations of Existing Methods

  1. Pure Supervised Methods: Rely on expensive expert annotation with high costs
  2. Pure Self-Supervised Methods: May reinforce spurious or incorrectly aligned interpretations
  3. Traditional Constraint Methods: Depend on rigid priors such as sparsity and smoothness, potentially inhibiting complex feature learning

Core Contributions

  1. Proposes H-EGL Framework: First application of hybrid explanation-guided methods to Transformer architecture, evaluating and enhancing human-machine attention alignment
  2. Designs DAL Component: Proposes Discriminative Attention Learning (DAL) that leverages class-discriminative attention maps for self-supervised learning
  3. Achieves Performance Improvement: Surpasses the compared state-of-the-art methods on chest X-ray classification, reaching a test AUC of 89.3%
  4. Enhances Interpretability: Generates attention maps better aligned with expert knowledge while maintaining classification performance

Methodology Details

Task Definition

  • Input: Chest X-ray images and disease label text
  • Output: Multi-label disease classification predictions and class-specific attention maps
  • Objective: Improve classification accuracy while generating attention maps aligned with human expert-annotated regions

Model Architecture

Overall Framework

H-EGL is built upon the DWARF architecture, employing a ViT encoder-decoder structure:

  1. Text Encoder: Frozen Med-KEBERT for processing disease labels
  2. Visual Encoder: Trainable ViT-B for processing 224×224 input images
  3. Cross-Attention Decoder: Fusing visual and textual features
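
As a rough illustration of how a cross-attention decoder can yield one attention map per disease label, the sketch below uses label embeddings as queries over image patch tokens. The patch size of 16 (which would give 14×14 = 196 patches for a 224×224 input), the tiny embedding dimension, the random embeddings, and the function names are all assumptions made for illustration; the source does not specify them, and this is not the DWARF or H-EGL implementation.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def class_attention_maps(label_queries, patch_keys):
    """Return one attention distribution over patches per label query."""
    dim = len(patch_keys[0])
    maps = []
    for q in label_queries:
        # Scaled dot-product score between this label and every patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in patch_keys]
        maps.append(softmax(scores))
    return maps

IMAGE_SIZE, PATCH_SIZE = 224, 16   # patch size is an assumption
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2
NUM_CLASSES, DIM = 4, 8            # DIM is kept tiny for the demo

rng = random.Random(0)
# Random stand-ins for the visual and text encoder outputs.
patch_keys = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
label_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]

attention = class_attention_maps(label_queries, patch_keys)
print(len(attention), len(attention[0]))  # 4 196
```

Each row of `attention` sums to 1 and can be reshaped to a 14×14 grid to visualize where a given label attends.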

Core Components

1. Human-Machine Alignment Module

Implements attention map alignment with expert annotations using a penalized Dice loss:

L_HA = 1 - (2×|A_i ⊙ M_i|)/(|A_i| + |M_i| + w_FP×N_FP)

Where A_i is the model-generated attention map, M_i is the expert mask, ⊙ is the element-wise product, and w_FP weights the false-positive penalty term N_FP.
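
The penalized Dice loss can be sketched on flattened maps as follows. This is a minimal sketch under stated assumptions: the source does not define N_FP, so here it is taken to be the attention mass that falls outside the expert mask; the function name and the `eps` stabilizer are also illustrative additions.

```python
def penalized_dice_loss(attention, mask, w_fp=1.0, eps=1e-8):
    """L_HA = 1 - 2|A ⊙ M| / (|A| + |M| + w_FP * N_FP) on flattened maps."""
    overlap = sum(a * m for a, m in zip(attention, mask))
    attention_mass = sum(attention)
    mask_mass = sum(mask)
    # Assumption: N_FP is the attention mass outside the expert mask.
    n_fp = sum(a * (1 - m) for a, m in zip(attention, mask))
    return 1.0 - 2.0 * overlap / (attention_mass + mask_mass + w_fp * n_fp + eps)

mask = [1, 1, 0, 0]
aligned = [1.0, 1.0, 0.0, 0.0]   # attention exactly on the mask
leaky = [1.0, 1.0, 1.0, 0.0]     # one extra activation outside the mask

loss_aligned = penalized_dice_loss(aligned, mask)
loss_leaky = penalized_dice_loss(leaky, mask)
loss_leaky_unpenalized = penalized_dice_loss(leaky, mask, w_fp=0.0)
```

With these toy inputs the aligned map gives a loss near 0, the leaky map gives about 1/3, and setting `w_fp=0` (plain Dice) lowers the leaky loss to about 0.2, which shows how the penalty term makes false-positive attention more costly.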

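2. Discriminative Attention Learning (DAL)

Enhances class discrimination by minimizing the similarity between attention maps of different classes:

L_DAL = (2)/(C(C-1)) × ∑_{i<j} |S(A_i, A_j)|

Where C is the number of classes and S(A_i, A_j) is the cosine similarity between attention maps A_i and A_j. The sum runs over the C(C-1)/2 distinct class pairs, so L_DAL is the mean absolute pairwise similarity.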
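
A minimal sketch of the DAL term, assuming each class attention map is given as a flattened list and that the sum runs over distinct class pairs (which is what the 2/(C(C-1)) normalization suggests). The function names are illustrative.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two flattened attention maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def dal_loss(attention_maps):
    """Mean absolute cosine similarity over all distinct class pairs."""
    c = len(attention_maps)
    if c < 2:
        return 0.0
    total = sum(abs(cosine_similarity(a, b))
                for a, b in combinations(attention_maps, 2))
    return 2.0 * total / (c * (c - 1))

# Three classes attending to the same patch: maximal similarity.
overlapping = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Three classes attending to disjoint patches: no similarity.
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_overlapping = dal_loss(overlapping)  # 1.0
loss_disjoint = dal_loss(disjoint)        # 0.0
```

No negative samples are constructed anywhere: the other classes' attention maps for the same image act as the contrast.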

Unified Loss Function

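L_H-EGL = L_CE + α×L_HA + β×L_DAL

Where L_CE is the classification (cross-entropy) loss, and α and β weight the human-alignment and DAL terms respectively.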
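
The weighting can be illustrated with a short sketch. The individual loss values below are hypothetical placeholders, not results from the paper; only the formula and the default weights α=1.0, β=1.0 come from the source.

```python
import math

def h_egl_loss(l_ce, l_ha, l_dal, alpha=1.0, beta=1.0):
    """Unified objective: L_CE + alpha * L_HA + beta * L_DAL."""
    return l_ce + alpha * l_ha + beta * l_dal

# Hypothetical per-batch loss values, for illustration only.
l_ce, l_ha, l_dal = 0.5, 0.3, 0.2

full = h_egl_loss(l_ce, l_ha, l_dal)                   # hybrid objective
human_only = h_egl_loss(l_ce, l_ha, l_dal, beta=0.0)   # drops the DAL term
dal_only = h_egl_loss(l_ce, l_ha, l_dal, alpha=0.0)    # drops the human term

assert math.isclose(full, 1.0)
assert math.isclose(human_only, 0.8)
assert math.isclose(dal_only, 0.7)
```

Setting β=0 or α=0 recovers the human-only and self-supervised-only configurations that the source lists as DWARF* (β=0) and DAL (α=0).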

Technical Innovations

  1. No Negative Sample Generation Required: DAL avoids the complex negative sample construction problem in traditional contrastive learning
  2. Flexible Inductive Bias: Does not depend on rigid constraints such as sparsity, maintaining the model's ability to learn complex features
  3. Direct Utilization of ViT Attention: Fully leverages the inherent attention mechanism of Transformers rather than post-hoc explanation tools
  4. Hybrid Supervision Strategy: Combines human guidance with self-supervised learning to trade off annotation cost against performance

Experimental Setup

Dataset

  • ChestXDet Dataset: Subset of NIH ChestX-ray14
  • Scale: 3,578 patients, 3,025 training samples, 553 test samples
  • Annotations: Bounding boxes and polygon annotations for 4 thoracic pathologies (atelectasis, cardiomegaly, consolidation, effusion)
  • Validation: Quality verified by three radiologists
  • Split: 80-20 training-validation split
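
For concreteness, an 80-20 split can be sketched as below. It assumes the split is applied to the 3,025 training samples and is a seeded random split by sample; the source does not say whether splitting was done per sample or per patient, and the function name is hypothetical.

```python
import random

def train_val_split(sample_ids, train_percent=80, seed=0):
    """Shuffle sample ids with a fixed seed and cut at train_percent."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = len(ids) * train_percent // 100
    return ids[:cut], ids[cut:]

# Assumption: the split is taken over the 3,025 training samples.
train_ids, val_ids = train_val_split(range(3025))
print(len(train_ids), len(val_ids))  # 2420 605
```

If one patient contributes several images, a patient-level split would be the safer choice to avoid leakage between training and validation data.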

Evaluation Metrics

  • Classification Metrics: AUC, F1 score, MCC (Matthews Correlation Coefficient)
  • Generalization Ability: Performance gap between validation and test sets
  • Robustness: Performance under different noise levels
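
Two of these quantities are simple enough to sketch directly. The MCC formula below is the standard binary definition; how the paper aggregates it across the four labels is not stated. The gap is assumed here to be the validation score minus the test score, and the numbers are toy values, not results from the paper.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def generalization_gap(val_score, test_score):
    """Assumption: gap = validation score minus test score."""
    return val_score - test_score

# Toy confusion counts: 40 TP, 10 FP, 10 FN, 40 TN.
example_mcc = mcc(tp=40, fp=10, fn=10, tn=40)  # 0.6
# Toy scores, in percent.
example_gap = generalization_gap(val_score=90.0, test_score=88.0)  # 2.0
```

A smaller gap means the model loses less when moving from the validation set to the held-out test set.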

Comparison Methods

  1. KAD: Knowledge-aware detection framework leveraging knowledge graphs for enhanced visual reasoning
  2. GAIN: Gradient-based attention network improving interpretability through refined attention mechanisms
  3. DWARF* (β=0): Explanation-guided learning using only human annotation guidance
  4. DAL (α=0): Pure self-supervised explanation-guided learning

Implementation Details

  • Optimizer: AdamW with learning rate 1e-5
  • Training Strategy: 1000 epochs, early stopping patience 50, 20-epoch warmup
  • Batch Size: 32
  • Hardware: RTX 4090 GPU, CUDA v12.2
  • Hyperparameters: α=1.0, β=1.0, w_FP=1
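
The training schedule above can be sketched with two small helpers. The linear shape of the warmup and the choice of a higher-is-better validation score as the early-stopping signal are assumptions; the source gives only the learning rate, warmup length, and patience.

```python
def warmup_lr(epoch, base_lr=1e-5, warmup_epochs=20):
    """Assumption: linear warmup to base_lr over the first warmup_epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Small demonstration with patience=3 instead of 50.
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(s) for s in [0.5, 0.6, 0.6, 0.59, 0.58]]
# decisions == [False, False, False, False, True]
```

With a patience of 50 and a cap of 1000 epochs, training would in practice usually end through early stopping rather than by reaching the epoch limit.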

Experimental Results

Main Results

Method   AUC_test(%)   AUC_gap(%)   F1_test(%)   F1_gap(%)   MCC_test(%)   MCC_gap(%)
KAD      88.1±0.3      2.5          68.2±2.5     1.8         57.5±2.3      4.8
GAIN     88.0±0.4      2.7          67.8±2.2     2.4         57.2±2.0      5.6
H-EGL    89.3±0.7      1.5          69.4±1.9     0.5         58.3±2.5      3.8

Key Findings:

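  • H-EGL achieves the best test score on all three metrics (AUC 89.3%, F1 69.4%, MCC 58.3%)
  • It also has the smallest validation-test gap on every metric (e.g., AUC gap of 1.5% vs. 2.5% for KAD and 2.7% for GAIN), indicating better generalization
  • Its AUC standard deviation (±0.7%) is higher than that of KAD (±0.3%) and GAIN (±0.4%), so the AUC margin of roughly one point should be read with that spread in mind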

Ablation Study

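  • H-EGL(α=0), i.e., DAL only: AUC 89.3±1.0%, indicating that DAL is effective on its own
  • H-EGL(β=0), i.e., human alignment only: AUC 88.4±0.2%, showing the contribution of human alignment
  • The hybrid method is reported as outperforming either component alone; on test AUC specifically, it matches DAL alone in mean (89.3%) with a smaller standard deviation (±0.7% vs. ±1.0%) and exceeds human alignment alone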

Robustness Analysis

Testing under different noise levels (σ=0, 0.03, 0.05, 0.1) shows:

  • All methods degrade as the noise level increases
  • H-EGL retains the best performance among the compared methods at every noise level, indicating greater robustness to input noise
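
A noise-robustness test of this kind can be sketched as follows. The sketch assumes additive zero-mean Gaussian noise with standard deviation σ applied to pixel intensities normalized to [0, 1], followed by clipping; the source lists only the σ values, so the noise model and the function name are assumptions.

```python
import random

def add_gaussian_noise(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise to [0, 1] pixels and clip the result."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

NOISE_LEVELS = [0, 0.03, 0.05, 0.1]  # sigma values listed in the source

# A toy flattened "image" with intensities in [0, 1].
image = [0.0, 0.25, 0.5, 0.75, 1.0]
noisy_versions = {sigma: add_gaussian_noise(image, sigma) for sigma in NOISE_LEVELS}
```

Each noisy version would then be passed through the trained model and scored with the same metrics, giving one performance value per method and noise level.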

Qualitative Analysis

Attention map visualization reveals:

  • Baseline KAD: Covers expert-annotated regions but incorrectly highlights bilateral lower lobes
  • DWARF: Reduces lower lobe false positives but incorrectly focuses on left lung
  • H-EGL and DAL: More accurately identify pathological regions, with markedly fewer false positives

Main Research Directions

  1. Explanation-Guided Learning (EGL): Utilizing explanation information to guide model learning
  2. Human-Machine Attention Alignment: Integrating human knowledge to improve model interpretability
  3. Transformer Applications in Medical Imaging: Leveraging attention mechanisms for disease diagnosis

Advantages of This Work

  • First application of hybrid explanation-guided methods to medical imaging Transformers
  • Proposes self-supervised attention learning strategy without negative samples
  • Achieves dual improvement in performance and interpretability

Conclusions and Discussion

Main Conclusions

  1. H-EGL effectively combines self-supervised and human supervision, achieving superior classification performance and attention alignment
  2. The DAL component provides flexible inductive bias, avoiding over-regularization
  3. The hybrid strategy achieves good balance between cost-effectiveness and performance

Limitations

  1. Dataset Scale: Validated only on relatively small ChestXDet dataset
  2. Disease Categories: Evaluated on only 4 thoracic diseases
  3. Architecture Dependency: Primarily designed for ViT architecture
  4. Hyperparameter Sensitivity: Optimal settings for α and β parameters may vary across tasks

Future Directions

  1. Dynamic Alignment Mechanisms: Explore adaptive adjustment of self-supervised and human alignment during training
  2. Large-Scale Validation: Verify on larger datasets and more disease categories
  3. Cross-Modal Extension: Extend to other medical imaging modalities
  4. Clinical Deployment: Investigate real-world clinical application effectiveness

In-Depth Evaluation

Strengths

  1. Methodological Innovation: First application of hybrid explanation-guided learning to medical imaging Transformers
  2. Technical Soundness: DAL design is elegant, avoiding complexity of traditional contrastive learning
  3. Comprehensive Experiments: Includes thorough comparative experiments, ablation studies, and robustness analysis
  4. Practical Value: Significantly improves interpretability while maintaining performance

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why the hybrid method is effective
  2. Computational Complexity: Insufficient analysis of additional loss terms' impact on training efficiency
  3. Hyperparameter Sensitivity: Insufficient guidance on α and β parameter selection
  4. Missing Clinical Validation: No expert evaluation in real clinical environments

Impact

  1. Academic Contribution: Provides new perspectives for medical imaging interpretability research
  2. Practical Value: Could be integrated into existing medical imaging diagnostic systems, subject to the clinical validation that is currently missing
  3. Reproducibility: Provides detailed implementation details facilitating reproduction

Applicable Scenarios

  1. Medical Image Diagnosis: Particularly suitable for clinical applications requiring high interpretability
  2. Multi-Label Classification Tasks: Extensible to other classification problems requiring attention alignment
  3. Resource-Constrained Environments: Hybrid supervision strategy suitable for scenarios with limited annotation resources

References

The paper cites multiple important related works, including:

  • Original Vision Transformer (ViT) paper [3]
  • Spurious correlation research in medical imaging [2, 5, 6]
  • Explanation-guided learning survey [4]
  • DWARF method [11] and KAD method [19]

Overall Assessment: This is a high-quality research paper making meaningful contributions to medical imaging interpretability. The hybrid explanation-guided learning framework is well-designed with sufficient experimental validation and convincing results. Despite some limitations, it provides a solid foundation and direction for future research.