2025-11-20T19:34:14.388746

Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis

Shu, Luo, Poellinger et al.

Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.

Basic Information

Paper ID: 2510.12704
Title: Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis
Authors: Shelley Zixin Shu, Haozhe Luo, Alexander Poellinger, Mauricio Reyes
Classification: cs.CV cs.AI
Publication Date: October 14, 2025
Paper Link: https://arxiv.org/abs/2510.12704v1

Abstract

Transformer-based deep learning models have demonstrated superior feature representation and interpretability capabilities through attention mechanisms in medical imaging. However, these models are prone to learning spurious correlations, leading to bias and limited generalization ability. While human-machine attention alignment can mitigate these issues, it often relies on expensive manual supervision. This work proposes a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-discriminative attention without relying on restrictive priors, promoting robustness and flexibility. Validated on chest X-ray classification tasks using Vision Transformer (ViT), H-EGL surpasses two state-of-the-art explanation-guided learning methods, demonstrating superior classification accuracy and generalization ability while producing attention maps better aligned with human expert knowledge.

Research Background and Motivation

Problem Definition

The core problems addressed in this research are spurious correlation learning and attention alignment in Transformer-based medical imaging models. Specifically:

Spurious Correlation Problem: Deep neural networks tend to learn spurious correlations in data, leading to shortcut learning, bias, and fairness issues
Attention Alignment Challenge: While human-machine attention alignment can improve model robustness, it requires expensive manual annotation
Limitations of Existing Methods: Pure self-supervised methods may reinforce incorrect interpretations, while contrastive learning methods lack standardized positive and negative sample generation approaches

Research Significance

In medical image diagnosis, model interpretability and reliability are crucial. Incorrect attention patterns may lead to:

Clinical decision errors
Missed critical pathological features
Model generalization failure across different data distributions

Limitations of Existing Methods

Pure Supervised Methods: Rely on costly expert annotation
Pure Self-Supervised Methods: May reinforce spurious or incorrectly aligned interpretations
Traditional Constraint Methods: Depend on rigid priors such as sparsity and smoothness, potentially inhibiting complex feature learning

Core Contributions

Proposes H-EGL Framework: First application of hybrid explanation-guided methods to Transformer architecture, evaluating and enhancing human-machine attention alignment
Designs DAL Component: Proposes Discriminative Attention Learning (DAL) that leverages class-discriminative attention maps for self-supervised learning
Achieves Performance Improvement: Surpasses existing state-of-the-art methods on chest X-ray classification, with test AUC reaching 89.3%
Enhances Interpretability: Generates attention maps better aligned with expert knowledge while maintaining classification performance

Methodology Details

Task Definition

Input: Chest X-ray images and disease label text
Output: Multi-label disease classification predictions and class-specific attention maps
Objective: Improve classification accuracy while generating attention maps aligned with human expert-annotated regions

Model Architecture

Overall Framework

H-EGL is built upon the DWARF architecture, employing a ViT encoder-decoder structure:

Text Encoder: Frozen Med-KEBERT for processing disease labels
Visual Encoder: Trainable ViT-B for processing 224×224 input images
Cross-Attention Decoder: Fusing visual and textual features

Core Components

1. Human-Machine Alignment Module

Aligns attention maps with expert annotations using a penalized Dice loss:

L_HA = 1 - (2×|A_i ⊙ M_i|)/(|A_i| + |M_i| + w_FP×N_FP)

Where A_i is the model-generated attention map, M_i is the expert mask, ⊙ is the element-wise product, and w_FP weights the false-positive penalty term N_FP (w_FP=1 in the reported experiments).
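
The penalized Dice loss above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `penalized_dice_loss` is hypothetical, maps are assumed to be flattened lists with attention values in [0, 1], and N_FP is assumed to be the attention mass that falls outside the expert mask, which the summary does not spell out.

```python
def penalized_dice_loss(attn, mask, w_fp=1.0, eps=1e-8):
    """Penalized Dice loss between a flattened attention map and a binary mask.

    attn: list of floats in [0, 1] (model attention, flattened)
    mask: list of 0/1 values (expert annotation, flattened)
    """
    if len(attn) != len(mask):
        raise ValueError("attention map and mask must have the same length")

    overlap = sum(a * m for a, m in zip(attn, mask))   # |A ⊙ M|
    attn_mass = sum(attn)                              # |A|
    mask_mass = sum(mask)                              # |M|
    # Assumption: N_FP is the attention mass lying outside the expert mask.
    n_fp = sum(a * (1 - m) for a, m in zip(attn, mask))

    denom = attn_mass + mask_mass + w_fp * n_fp + eps  # eps avoids division by zero
    return 1.0 - 2.0 * overlap / denom


# Attention that spills outside the mask is penalized more as w_fp grows.
perfect = penalized_dice_loss([1, 1, 0, 0], [1, 1, 0, 0])
spilled = penalized_dice_loss([1, 1, 1, 1], [1, 1, 0, 0], w_fp=1.0)
```

With w_fp=0 the expression reduces to the ordinary Dice loss, so the extra term only changes the result when attention lands outside the annotated region.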

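2. Discriminative Attention Learning (DAL)

Enhances class discrimination by minimizing the similarity between attention maps of different classes:

L_DAL = 2/(C(C-1)) × ∑_{i<j} |S(A_i, A_j)|

Where C is the number of classes and S(A_i, A_j) is the cosine similarity between the attention maps A_i and A_j of classes i and j. The sum runs over the C(C-1)/2 distinct class pairs, so L_DAL is the mean absolute pairwise similarity.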
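
A minimal sketch of the DAL term follows, assuming the double sum runs over distinct class pairs (which the 2/(C(C-1)) normalization suggests) and that each class-specific attention map is given as a flattened list of floats. The names `cosine_similarity` and `dal_loss` are illustrative, not taken from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two flattened attention maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)  # eps guards against all-zero maps


def dal_loss(attn_maps):
    """Mean absolute cosine similarity over all distinct class pairs."""
    c = len(attn_maps)
    if c < 2:
        raise ValueError("DAL needs attention maps for at least two classes")
    total = sum(abs(cosine_similarity(a, b))
                for a, b in combinations(attn_maps, 2))
    return 2.0 * total / (c * (c - 1))


# Disjoint maps give a loss near 0; identical maps give a loss near 1.
disjoint = dal_loss([[1.0, 0.0], [0.0, 1.0]])
identical = dal_loss([[1.0, 0.0], [1.0, 0.0]])
```

Because the loss only compares a sample's own class-specific maps with each other, no negative samples have to be constructed, which matches the motivation given for DAL.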

Unified Loss Function

L_H-EGL = L_CE + α×L_HA + β×L_DAL
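
The weighted sum can be illustrated as below. This is a sketch under stated assumptions: `h_egl_loss` is a hypothetical helper, the three inputs are already-computed scalar losses, and L_CE is taken to be the classification loss (its name suggests cross-entropy). Setting one weight to zero reproduces the two ablation variants listed among the comparison methods.

```python
def h_egl_loss(l_ce, l_ha, l_dal, alpha=1.0, beta=1.0):
    """Combine classification, human-alignment, and DAL losses.

    alpha weights the human-alignment term, beta weights the DAL term.
    """
    return l_ce + alpha * l_ha + beta * l_dal


# Full hybrid objective with the reported weights alpha = beta = 1.0.
full = h_egl_loss(0.5, 0.2, 0.1)
# beta = 0 keeps only human guidance (the DWARF* variant).
human_only = h_egl_loss(0.5, 0.2, 0.1, beta=0.0)
# alpha = 0 keeps only the self-supervised term (the DAL variant).
dal_only = h_egl_loss(0.5, 0.2, 0.1, alpha=0.0)
```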

Technical Innovations

No Negative Sample Generation Required: DAL avoids the complex negative sample construction problem in traditional contrastive learning
Flexible Inductive Bias: Does not depend on rigid constraints such as sparsity, maintaining the model's ability to learn complex features
Direct Utilization of ViT Attention: Fully leverages the inherent attention mechanism of Transformers rather than post-hoc explanation tools
Hybrid Supervision Strategy: Combines human guidance with self-supervised learning, trading annotation cost against performance

Experimental Setup

Dataset

ChestXDet Dataset: Subset of NIH ChestX-ray14
Scale: 3,578 chest X-rays in total (3,025 training samples + 553 test samples)
Annotations: Bounding boxes and polygon annotations for 4 thoracic pathologies (atelectasis, cardiomegaly, consolidation, effusion)
Validation: Quality verified by three radiologists
Split: 80-20 training-validation split

Evaluation Metrics

Classification Metrics: AUC, F1 score, MCC (Matthews Correlation Coefficient)
Generalization Ability: Performance gap between validation and test sets
Robustness: Performance under different noise levels
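
For readers less familiar with these metrics, the sketch below computes F1 and MCC for a single label from binary predictions, plus the generalization gap. It is illustrative only: the helper names are hypothetical, the gap is assumed to be validation score minus test score (the summary does not state the sign convention), and AUC is omitted because it needs ranked scores rather than hard predictions.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def generalization_gap(val_score, test_score):
    """Assumed convention: validation score minus test score."""
    return val_score - test_score


tp, fp, fn, tn = confusion_counts([1, 1, 0, 0], [1, 0, 0, 0])
```

For a multi-label task such as this one, these per-label values would typically be averaged across the four pathologies; the averaging scheme is not specified in the summary.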

Comparison Methods

KAD: Knowledge-enhanced Auto Diagnosis, a framework leveraging medical knowledge graphs for enhanced visual reasoning
GAIN: Guided Attention Inference Network, which improves interpretability by supervising gradient-based attention maps during training
DWARF* (β=0): Explanation-guided learning using only human annotation guidance
DAL (α=0): Pure self-supervised explanation-guided learning

Implementation Details

Optimizer: AdamW with learning rate 1e-5
Training Strategy: Up to 1000 epochs with early stopping (patience 50) and a 20-epoch warmup
Batch Size: 32
Hardware: RTX 4090 GPU, CUDA v12.2
Hyperparameters: α=1.0, β=1.0, w_FP=1
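
The schedule described above can be sketched as follows. This is a rough illustration, not the authors' training code: the warmup is assumed to be linear, early stopping is assumed to monitor a validation metric where higher is better, and both `warmup_lr` and `EarlyStopping` are hypothetical names.

```python
def warmup_lr(epoch, base_lr=1e-5, warmup_epochs=20):
    """Learning rate for a 0-indexed epoch under an assumed linear warmup."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr


class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True if training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Reported settings: learning rate 1e-5, 20 warmup epochs, patience 50.
stopper = EarlyStopping(patience=50)
first_epoch_lr = warmup_lr(0)
```

In a training loop, `stopper.step(val_auc)` would be called once per epoch for at most 1000 epochs, breaking out as soon as it returns True.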

Experimental Results

Main Results

Method	AUC_test(%)	AUC_gap(%)	F1_test(%)	F1_gap(%)	MCC_test(%)	MCC_gap(%)
KAD	88.1±0.3	2.5	68.2±2.5	1.8	57.5±2.3	4.8
GAIN	88.0±0.4	2.7	67.8±2.2	2.4	57.2±2.0	5.6
H-EGL	89.3±0.7	1.5	69.4±1.9	0.5	58.3±2.5	3.8

Key Findings:

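H-EGL achieves the best test score on all three metrics (AUC 89.3%, F1 69.4%, MCC 58.3%)
It has the smallest validation-test gap on every metric (AUC 1.5% vs. 2.5-2.7%, F1 0.5% vs. 1.8-2.4%, MCC 3.8% vs. 4.8-5.6%), indicating better generalization
Its test AUC standard deviation (0.7%) is slightly higher than that of KAD (0.3%) and GAIN (0.4%), so the 1.2-1.3 point AUC gains should be read alongside the run-to-run spread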

Ablation Study

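H-EGL (α=0, DAL only): AUC 89.3±1.0%, validating DAL effectiveness
H-EGL (β=0, human alignment only): AUC 88.4±0.2%, demonstrating the contribution of human alignment
On test AUC, the full hybrid (89.3±0.7%) matches DAL alone on the mean with lower spread and exceeds the human-alignment-only variant by 0.9 points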

Robustness Analysis

Testing under different noise levels (σ=0, 0.03, 0.05, 0.1) shows:

All methods experience performance degradation with increased noise
H-EGL maintains the best performance among the compared methods at every noise level, indicating superior robustness
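
A noise-robustness test of this kind can be sketched as below. The summary only gives the σ values, so the details are assumptions: the noise is taken to be additive zero-mean Gaussian, pixel intensities are assumed normalized to [0, 1] and clipped after perturbation, and `add_gaussian_noise` is a hypothetical helper.

```python
import random

NOISE_LEVELS = (0.0, 0.03, 0.05, 0.1)  # sigma values reported in the robustness analysis


def add_gaussian_noise(pixels, sigma, seed=None):
    """Add zero-mean Gaussian noise to flattened pixels and clip to [0, 1]."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


# Build one perturbed copy of a toy image per noise level.
image = [0.2, 0.5, 0.9]
perturbed = {sigma: add_gaussian_noise(image, sigma, seed=0)
             for sigma in NOISE_LEVELS}
```

Each perturbed copy would then be passed through the trained model, and the metrics compared across σ to see how quickly performance degrades.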

Qualitative Analysis

Attention map visualization reveals:

Baseline KAD: Covers expert-annotated regions but incorrectly highlights bilateral lower lobes
DWARF: Reduces lower lobe false positives but incorrectly focuses on left lung
H-EGL and DAL: More accurately identify pathological regions with significantly fewer false positives

Main Research Directions

Explanation-Guided Learning (EGL): Utilizing explanation information to guide model learning
Human-Machine Attention Alignment: Integrating human knowledge to improve model interpretability
Transformer Applications in Medical Imaging: Leveraging attention mechanisms for disease diagnosis

Advantages of This Work

First application of hybrid explanation-guided methods to medical imaging Transformers
Proposes self-supervised attention learning strategy without negative samples
Achieves dual improvement in performance and interpretability

Conclusions and Discussion

Main Conclusions

H-EGL effectively combines self-supervised and human supervision, achieving superior classification performance and attention alignment
The DAL component provides flexible inductive bias, avoiding over-regularization
The hybrid strategy achieves good balance between cost-effectiveness and performance

Limitations

Dataset Scale: Validated only on relatively small ChestXDet dataset
Disease Categories: Evaluated on only 4 thoracic diseases
Architecture Dependency: Primarily designed for ViT architecture
Hyperparameter Sensitivity: Optimal settings for α and β parameters may vary across tasks

Future Directions

Dynamic Alignment Mechanisms: Explore adaptive adjustment of self-supervised and human alignment during training
Large-Scale Validation: Verify on larger datasets and more disease categories
Cross-Modal Extension: Extend to other medical imaging modalities
Clinical Deployment: Investigate real-world clinical application effectiveness

In-Depth Evaluation

Strengths

Methodological Innovation: First application of hybrid explanation-guided learning to medical imaging Transformers
Technical Soundness: DAL design is elegant, avoiding complexity of traditional contrastive learning
Comprehensive Experiments: Includes thorough comparative experiments, ablation studies, and robustness analysis
Practical Value: Significantly improves interpretability while maintaining performance

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why the hybrid method is effective
Computational Complexity: Insufficient analysis of additional loss terms' impact on training efficiency
Hyperparameter Sensitivity: Insufficient guidance on α and β parameter selection
Missing Clinical Validation: No expert evaluation in real clinical environments

Impact

Academic Contribution: Provides new perspectives for medical imaging interpretability research
Practical Value: Can be directly applied to existing medical imaging diagnostic systems
Reproducibility: Provides detailed implementation details facilitating reproduction

Applicable Scenarios

Medical Image Diagnosis: Particularly suitable for clinical applications requiring high interpretability
Multi-Label Classification Tasks: Extensible to other classification problems requiring attention alignment
Resource-Constrained Environments: Hybrid supervision strategy suitable for scenarios with limited annotation resources

References

The paper cites multiple important related works, including:

Original Vision Transformer (ViT) paper [3]
Spurious correlation research in medical imaging [2, 5, 6]
Explanation-guided learning survey [4]
DWARF method [11] and KAD method [19]

Overall Assessment: This is a high-quality research paper making meaningful contributions to medical imaging interpretability. The hybrid explanation-guided learning framework is well-designed with sufficient experimental validation and convincing results. Despite some limitations, it provides a solid foundation and direction for future research.