
Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for RAG-Based Medical Diagnosis

Mazor, Hope
Retrieving relevant visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. We develop a multimodal retrieval model jointly optimized with an LVLM for medical diagnosis, unlike standard RAG which doesn't backpropagate LVLM errors to the retriever. Using only general-purpose backbones with lightweight fine-tuning, our model achieves competitive results with medically-pretrained models on clinical classification and VQA tasks. In a novel analysis, we find that different top-retrieved images often yield different predictions for the same target, and that these cases are challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap -- leaving ample room for improvement by future methods. Code available at https://github.com/Nirmaz/JOMED.

Basic Information

  • Paper ID: 2508.17394
  • Title: Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for RAG-Based Medical Diagnosis
  • Authors: Nir Mazor, Tom Hope (The Hebrew University of Jerusalem & The Allen Institute for AI)
  • Category: cs.CV
  • Publication Date: October 11, 2025 (arXiv v3)
  • Paper Link: https://arxiv.org/abs/2508.17394v3

Abstract

This paper develops a multimodal retrieval model jointly optimized with large vision-language models (LVLMs) for medical diagnosis. Unlike standard RAG, this approach backpropagates LVLM errors to the retriever. Using only general-purpose backbones and lightweight fine-tuning, the model achieves competitive results with medical pre-trained models on clinical classification and visual question-answering tasks. The study reveals that different top retrieved images often produce different predictions for the same target, presenting challenges for all models. Joint retrieval optimization significantly improves these cases, though oracle analysis indicates substantial room for improvement.

Research Background and Motivation

Problem Definition

Medical image diagnosis is a fundamental component of clinical decision-making. Large vision-language models (LVLMs) have been extensively explored for medical diagnosis. To enhance LVLM performance in the medical domain, retrieval-augmented generation (RAG) has been adopted and shows promising results.

Research Motivation

  1. Limitations of Standard RAG: Traditional RAG methods optimize retrievers and LVLMs independently, without backpropagating LVLM errors to the retriever
  2. Resource Intensity of Medical Pre-training: Medical domain pre-training is computationally expensive, necessitating exploration of lightweight alternatives
  3. Retrieval Inconsistency Problem: Different retrieval candidates may lead to different predictions for identical queries, affecting model reliability

Limitations of Existing Methods

  • Retrievers and LVLMs are trained separately in traditional multimodal RAG settings
  • Competitive performance requires large-scale medical pre-training
  • Lack of systematic analysis of retrieval inconsistency issues

Core Contributions

  1. Joint Optimization Framework: Proposes JOMED, a method for jointly optimizing multimodal retrievers and LVLMs for medical classification and visual question-answering tasks
  2. Lightweight Fine-tuning Strategy: Achieves competitive performance through lightweight fine-tuning of general-purpose backbones, without medical pre-training
  3. Direct Downstream Task Optimization: Unlike previous joint optimization approaches that require pre-training, optimizes directly on downstream tasks
  4. Retrieval Inconsistency Analysis: Identifies and analyzes the "inconsistent retrieval prediction" problem and shows that joint optimization improves these cases over standard RAG

Methodology Details

Task Definition

Given medical images and diagnostic questions, the system retrieves relevant visual and textual information from medical literature and hospital records, then generates accurate diagnostic answers based on retrieved information and query images.

Model Architecture

Overall Framework

JOMED comprises two main components:

  1. Multimodal Retriever: Dual-head architecture with a text retrieval head and an image retrieval head
  2. Reader: A large vision-language model that analyzes the retrieval candidates and generates answers

Training Strategy

JOMED is trained in two sequential stages:

Stage 1: Reader Retrieval-Augmented Fine-tuning

  • Objective: Improve reader performance on the target datasets and teach the reader to use retrieved (image, text) pairs effectively
  • Loss function: Negative log-likelihood of the answer a_d for query q_d, summed over queries d and retrieved candidates z_k
L(θ) = -∑_d ∑_k log p_θ(a_d | z_k ◦ q_d)
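The Stage 1 loss can be illustrated with a small sketch. It assumes the reader's log-probabilities of the gold answer are already available as a nested list indexed by query and candidate; the function name `reader_nll` and the toy numbers are hypothetical and not taken from the paper's code.

```python
import math

def reader_nll(log_probs):
    """Negative log-likelihood summed over queries d and candidates k.

    log_probs[d][k] is assumed to hold log p_theta(a_d | z_k ◦ q_d): the
    reader's log-probability of the gold answer a_d when the retrieved
    pair z_k is combined with query q_d. Obtaining these values from an
    LVLM is outside the scope of this sketch.
    """
    return -sum(lp for per_query in log_probs for lp in per_query)

# Toy example: 2 queries with 2 retrieved candidates each (made-up values).
toy_log_probs = [
    [math.log(0.9), math.log(0.5)],
    [math.log(0.8), math.log(0.25)],
]
loss = reader_nll(toy_log_probs)
```

In practice the loss would be computed on model outputs and minimized with respect to the reader parameters θ; the sketch only shows how the double sum is assembled.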

Stage 2: Sequential Multimodal Retriever Fine-tuning

  • Freezes the reader while optimizing the retriever's embedding space
  • Minimizes the KL divergence between the LVLM posterior distribution over retrieved candidates and the retriever distribution

Technical Innovations

1. Dual-Head Retrieval Architecture

  • Text Retrieval Head: Retrieves relevant (image, text) pairs based on textual similarity
  • Image Retrieval Head: Retrieves relevant pairs based on visual similarity
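A rough sketch of how two retrieval heads over a shared index of (image, text) pairs might be combined. Everything here is an assumption for illustration: the index layout, the cosine scoring, the way the query is embedded for each head, and the order-preserving merge are not described in the summary above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, key, k):
    """Ids of the k index entries most similar to query_vec under embedding `key`."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e[key]), reverse=True)
    return [e["id"] for e in ranked[:k]]

def dual_head_retrieve(text_head_query, image_head_query, index, k=2):
    """Retrieve with each head, then merge (hypothetical merge rule)."""
    text_hits = top_k(text_head_query, index, "text_vec", k)
    image_hits = top_k(image_head_query, index, "image_vec", k)
    # Keep first occurrence of each id, text-head hits first.
    merged = list(dict.fromkeys(text_hits + image_hits))
    return text_hits, image_hits, merged

# Toy index of (image, text) pairs with made-up 2-d embeddings.
index = [
    {"id": "a", "text_vec": [1.0, 0.0], "image_vec": [0.0, 1.0]},
    {"id": "b", "text_vec": [0.0, 1.0], "image_vec": [1.0, 0.0]},
    {"id": "c", "text_vec": [0.7, 0.7], "image_vec": [0.7, 0.7]},
]
text_hits, image_hits, merged = dual_head_retrieve([1.0, 0.1], [1.0, 0.1], index)
```

The point of the sketch is that the two heads can rank the same pairs differently, so the candidate pool passed to the reader depends on both similarity spaces.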

2. Customized Retrieval Loss

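Open-ended questions are first converted to closed-ended questions using the o3 model, which improves the training signal. The retriever is then trained to minimize the KL divergence between the LVLM distribution over retrieved candidates on the closed-ended form (which the superscript C appears to denote) and the retriever distribution:

KL(p_LVLM^C || p_RETR) = ∑_k p_LVLM^C(z_k) log(p_LVLM^C(z_k) / p_RETR(z_k))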
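The KL objective can be sketched as follows. The sketch assumes that the LVLM target distribution is a softmax over the frozen reader's per-candidate log-likelihoods of the correct answer (a perplexity-distillation-style choice, which may differ from the paper's customized loss) and that the retriever distribution is a softmax over similarity scores. All names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k * log(p_k / q_k); q is assumed strictly positive."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Assumed inputs for one query with two retrieved candidates z_1, z_2:
# log-likelihood of the correct answer under the frozen reader, per candidate,
reader_log_likelihoods = [0.0, 0.0]
# and the retriever's similarity score for each candidate.
retriever_scores = [math.log(3.0), 0.0]

p_lvlm = softmax(reader_log_likelihoods)   # target distribution (reader is frozen)
p_retr = softmax(retriever_scores)         # distribution being trained
loss = kl_divergence(p_lvlm, p_retr)
```

Because the reader is frozen in Stage 2, only the retriever scores would receive gradients; the loss shrinks as the retriever ranks candidates the way the reader finds them useful.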

3. Inference-Time Fusion Strategy

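The final answer probability marginalizes over the retrieved candidates, weighting the reader's prediction for each candidate by that candidate's retriever probability:

p_LVLM(a|q) = ∑_k p_LVLM(a|z_k ◦ q) · p_RETR(z_k|q)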
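A minimal sketch of this fusion step, assuming the reader returns a probability for each answer option per candidate and the retriever probabilities are already normalized. The function name and the toy values are illustrative only.

```python
def fuse_predictions(answer_probs_per_candidate, retriever_probs):
    """Weighted sum over candidates: sum_k p(a | z_k, q) * p(z_k | q).

    answer_probs_per_candidate[k] maps each answer to the reader's
    probability given candidate z_k; retriever_probs[k] is the weight
    of candidate z_k (assumed to sum to 1).
    """
    fused = {}
    for probs, weight in zip(answer_probs_per_candidate, retriever_probs):
        for answer, p in probs.items():
            fused[answer] = fused.get(answer, 0.0) + weight * p
    return fused

# Toy example: two candidates that disagree on the diagnosis.
per_candidate = [
    {"benign": 0.8, "malignant": 0.2},
    {"benign": 0.3, "malignant": 0.7},
]
fused = fuse_predictions(per_candidate, [0.6, 0.4])
prediction = max(fused, key=fused.get)
```

In the toy example the higher-weighted candidate dominates, so the fused prediction follows it even though the second candidate prefers the other label.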

Experimental Setup

Datasets

Classification Tasks

  • BreastMNIST: Breast ultrasound imaging, binary classification (546 training samples)
  • DermaMNIST: Pigmented skin lesions, multi-class (7,007 training samples)
  • RetinaMNIST: Retinal fundus images, multi-class (1,080 training samples)
  • VinDr-PCXR: Pediatric chest X-rays, multi-label, 15 classes (7,728 training samples)
  • BRSET: Brazilian ophthalmology dataset, multi-label, 14 classes (11,386 training samples)

Visual Question-Answering Tasks

  • VQA-RAD: Radiology VQA (1,753 training questions)
  • SLAKE-English: English subset of the bilingual SLAKE medical VQA dataset (4,920 training questions)
  • PathVQA: Pathology VQA (19,700 training questions)

Retrieval Index

External retrieval indices are constructed from PMC-OA, MIMIC-CXR, and ROCO, which contain medical images with corresponding captions or reports.

Evaluation Metrics

  • Classification Tasks: Accuracy (ACC) and macro F1 score
  • VQA Tasks: Exact match for closed-ended questions, token recall for open-ended questions
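The metrics listed above can be sketched in plain Python. The text normalization (lowercasing, whitespace tokenization) is an assumption, since the summary does not specify how answers are normalized, and the function names are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

def exact_match(prediction, answer):
    """Closed-ended questions: case-insensitive string equality (assumed normalization)."""
    return prediction.strip().lower() == answer.strip().lower()

def token_recall(prediction, answer):
    """Open-ended questions: fraction of gold-answer tokens found in the prediction."""
    pred_tokens = set(prediction.lower().split())
    gold_tokens = answer.lower().split()
    if not gold_tokens:
        return 0.0
    return sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)

f1 = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
em = exact_match("Yes ", "yes")
recall_full = token_recall("the left lung is affected", "left lung")
recall_half = token_recall("lung", "left lung")
```

Macro F1 weights every class equally, which is why it can sit far below accuracy on imbalanced datasets.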

Comparison Methods

  • RAG Baselines: MMed-RAG, RAD, standard fine-tuned RAG
  • Medical Pre-trained Models: BiomedGPT, LLaVA-Med variants, MedVInT, InternVL variants
  • General-Purpose Backbones: Pixtral (12B), Qwen2-VL (7B)

Experimental Results

Main Results

Classification Task Performance

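JOMED improves on the RAG baselines across the five medical classification benchmarks. In the table below, JOMED (Qwen2-VL) matches or exceeds MMed-RAG and FT RAG on every dataset, while JOMED (Pixtral) trails on Retina in the first reported value (60% vs 63%). Each cell lists two values, presumably accuracy and macro F1 in that order: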

Model | Breast | Derma | Retina | VinDr-PCXR | BRSET | Average
--- | --- | --- | --- | --- | --- | ---
MMed-RAG | 85%/84% | 75%/30% | 63%/46% | 55%/11% | 42%/30% | 64%/40%
FT RAG (Qwen2-VL) | 85%/82% | 71%/42% | 62%/48% | 55%/9% | 48%/27% | 64%/42%
JOMED (Qwen2-VL) | 87%/84% | 76%/50% | 65%/50% | 57%/14% | 49%/37% | 67%/47%
JOMED (Pixtral) | 90%/87% | 80%/62% | 60%/51% | 56%/14% | 51%/37% | 67%/50%
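As a sanity check on the table, the Average column can be recomputed from the per-dataset values. The sketch below uses the rounded percentages as printed here, so agreement only confirms that the rows are internally consistent; it does not reproduce the paper's own computation.

```python
# Per-dataset (first value, second value) pairs in table order:
# Breast, Derma, Retina, VinDr-PCXR, BRSET.
rows = {
    "MMed-RAG": [(85, 84), (75, 30), (63, 46), (55, 11), (42, 30)],
    "FT RAG (Qwen2-VL)": [(85, 82), (71, 42), (62, 48), (55, 9), (48, 27)],
    "JOMED (Qwen2-VL)": [(87, 84), (76, 50), (65, 50), (57, 14), (49, 37)],
    "JOMED (Pixtral)": [(90, 87), (80, 62), (60, 51), (56, 14), (51, 37)],
}

def column_means(pairs):
    """Mean of each of the two values across datasets, rounded to whole percent."""
    n = len(pairs)
    return tuple(round(sum(p[i] for p in pairs) / n) for i in range(2))

averages = {name: column_means(pairs) for name, pairs in rows.items()}
```

The recomputed means match the Average column for all four rows.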

VQA Task Performance

JOMED improves over MMed-RAG on the visual question-answering tasks, raising the average from 84%/50% to 87%/57% with Qwen2-VL and to 85%/55% with Pixtral:

Model | VQA-RAD | SLAKE | PathVQA | Average
--- | --- | --- | --- | ---
MMed-RAG | 74%/39% | 87%/81% | 90%/31% | 84%/50%
JOMED (Qwen2-VL) | 79%/48% | 90%/84% | 93%/38% | 87%/57%
JOMED (Pixtral) | 76%/45% | 90%/84% | 90%/36% | 85%/55%

Comparison with Medical Pre-trained Models

Without any medical pre-training, JOMED is competitive with large-scale medically pre-trained models:

  • Breast Dataset: JOMED (Pixtral) 90% vs GSCo 93%
  • Derma Dataset: JOMED (Pixtral) 80% vs MedVInT-TD 80%
  • VQA Tasks: Matches or exceeds LLaVA-Med variants on SLAKE and PathVQA

Ablation Studies

The ablations indicate that each component contributes:

  1. Text Retrieval Head: 2-3 percentage point improvement over FT RAG
  2. Image Retrieval Head: Additional 1-2 percentage point improvement
  3. Customized Retrieval Loss: Outperforms standard perplexity distillation loss

Inconsistent Retrieval Prediction Analysis

Problem Identification

The paper identifies an "inconsistent retrieval prediction" phenomenon: for the same query image, different retrieved candidates lead the model to different predictions. Such cases account for 3% to 93% of examples, depending on the dataset.
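One way to flag such cases is to run the reader once per retrieved candidate and check whether the predictions agree. The sketch below assumes that setup; the function names and the toy predictions are hypothetical.

```python
def is_inconsistent(predictions_per_candidate):
    """True if the retrieved candidates do not all yield the same prediction."""
    return len(set(predictions_per_candidate)) > 1

def inconsistency_rate(all_predictions):
    """Fraction of queries whose per-candidate predictions disagree."""
    flags = [is_inconsistent(preds) for preds in all_predictions]
    return sum(flags) / len(flags)

# Toy example: four queries, each with the reader's prediction per candidate.
toy_predictions = [
    ["benign", "benign", "benign"],
    ["benign", "malignant", "benign"],
    ["nevus", "melanoma", "nevus"],
    ["normal", "normal"],
]
rate = inconsistency_rate(toy_predictions)
```

Here two of the four toy queries are flagged, giving a rate of 0.5.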

Performance Improvement

JOMED achieves significant improvements on inconsistent prediction cases:

  • Qwen2-VL: Accuracy improvement +12%, F1 improvement +13%
  • Pixtral: Accuracy and F1 improvements +9%

Oracle Analysis

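Oracle analysis reveals that the correct diagnosis is frequently achievable using one of the top retrieved images, but in practice there is a large gap from oracle performance, and rerankers using frontier LVLMs do not close it. This leaves ample room for improvement by future methods.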
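The oracle gap can be made concrete with a small sketch. It assumes the oracle counts a query as solved when at least one retrieved candidate leads to the correct prediction, while the actual accuracy uses the model's single final prediction; names and toy data are illustrative.

```python
def oracle_accuracy(per_candidate_preds, gold):
    """Share of queries where some candidate yields the gold answer."""
    hits = [g in preds for preds, g in zip(per_candidate_preds, gold)]
    return sum(hits) / len(hits)

def actual_accuracy(final_preds, gold):
    """Share of queries where the model's final prediction is correct."""
    hits = [p == g for p, g in zip(final_preds, gold)]
    return sum(hits) / len(hits)

# Toy example with three queries whose gold label is always "a".
per_candidate = [["a", "b"], ["b", "b"], ["c", "a"]]
final = ["b", "b", "a"]
gold = ["a", "a", "a"]

oracle = oracle_accuracy(per_candidate, gold)
actual = actual_accuracy(final, gold)
gap = oracle - actual
```

A positive gap means a better candidate selector or reranker could, in principle, recover correct answers that the current system misses.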

Related Work

Retrieval-Augmented Joint Optimization

  • ATLAS: Large-scale joint optimization pre-training in general domains
  • REVEAL: Extension to multimodal settings requiring substantial pre-training
  • This paper: The first to explore joint optimization directly on downstream tasks in the medical domain

Medical Multimodal Retrieval-Augmented Methods

  • RAD: Retrieval-based classification approach
  • MMed-RAG: Multimodal RAG framework using medical pre-trained retrievers
  • PMC-VQA Series: Medical visual instruction tuning methods

Conclusions and Discussion

Main Conclusions

  1. Lightweight Joint Optimization Effective: Achieves competitive performance without medical pre-training
  2. Retrieval Inconsistency Prevalent: Important but overlooked problem
  3. Direct Downstream Optimization Feasible: Demonstrates viability of data-efficient joint optimization

Limitations

  1. Sequential Rather Than End-to-End: Gradients cannot simultaneously flow between retriever and reader
  2. Limited Evaluation Scope: Primarily focuses on classification and VQA, not report generation
  3. Incomplete Modality Coverage: Not evaluated on specialized modalities like PET, microscopy, OCT

Future Directions

  1. End-to-End Joint Optimization: Develop truly end-to-end training strategies
  2. Better Reranking Methods: Close gap with oracle performance
  3. Extension to More Tasks: Explore applications in report generation and other tasks

In-Depth Evaluation

Strengths

  1. Strong Methodological Innovation: First lightweight joint optimization directly on downstream medical tasks
  2. Comprehensive Experimental Design: Covers multiple datasets and task types with thorough comparison methods
  3. In-Depth Analysis: Systematically identifies and analyzes retrieval inconsistency problem
  4. High Practical Value: Avoids resource-intensive medical pre-training

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for why joint optimization works
  2. Sequential Training Limitations: Not truly end-to-end optimization
  3. Significant Oracle Gap: Notable gap between actual and theoretical maximum performance

Impact

  1. Academic Contribution: Provides new lightweight training paradigm for medical AI
  2. Practical Value: Lowers deployment barriers for medical AI systems
  3. Reproducibility: Code is available at https://github.com/Nirmaz/JOMED

Applicable Scenarios

  • AI diagnostic system deployment in resource-constrained medical institutions
  • Scenarios requiring rapid adaptation to specific medical center data distributions
  • Rapid prototyping in medical AI research

References

The paper cites extensive related work including:

  • Classical retrieval-augmented generation work (ATLAS, REVEAL, etc.)
  • Medical vision-language models (LLaVA-Med, BiomedGPT, etc.)
  • Multimodal retrieval methods (PMC-CLIP, BiomedCLIP, etc.)

Overall Assessment: This is a high-quality research paper proposing innovative lightweight joint optimization methods in medical AI. The paper presents clear technical contributions, rigorous experimental design, and in-depth analysis, providing valuable solutions for practical medical AI applications. Particularly noteworthy is the discovery and analysis of the retrieval inconsistency problem, which points to important directions for future research.