Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for RAG-Based Medical Diagnosis
Mazor, Hope
Retrieving relevant visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. We develop a multimodal retrieval model jointly optimized with an LVLM for medical diagnosis, unlike standard RAG which doesn't backpropagate LVLM errors to the retriever. Using only general-purpose backbones with lightweight fine-tuning, our model achieves competitive results with medically-pretrained models on clinical classification and VQA tasks. In a novel analysis, we find that different top-retrieved images often yield different predictions for the same target, and that these cases are challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap -- leaving ample room for improvement by future methods. Code available at https://github.com/Nirmaz/JOMED.
This paper develops a multimodal retrieval model that is jointly optimized with a large vision-language model (LVLM) for medical diagnosis. Unlike standard RAG, the approach backpropagates LVLM errors to the retriever. Using only general-purpose backbones and lightweight fine-tuning, the model achieves results competitive with medically pretrained models on clinical classification and visual question-answering (VQA) tasks. The study finds that different top-retrieved images often produce different predictions for the same target, and that these cases are challenging for all models, including non-retrieval models. Joint retrieval optimization significantly improves these cases over standard RAG, though oracle analysis indicates substantial room for improvement.
Medical image diagnosis is a fundamental component of clinical decision-making. Large vision-language models (LVLMs) have been extensively explored for medical diagnosis. To enhance LVLM performance in the medical domain, retrieval-augmented generation (RAG) has been adopted and shows promising results.
Limitations of Standard RAG: Traditional RAG methods optimize the retriever and the LVLM independently, without backpropagating LVLM errors to the retriever.
Resource Intensity of Medical Pre-training: Medical-domain pre-training is computationally expensive, which motivates lightweight alternatives.
Retrieval Inconsistency Problem: Different retrieval candidates may lead to different predictions for the same query, which affects model reliability.
Joint Optimization Framework: Proposes JOMED, a method for jointly optimizing a multimodal retriever and an LVLM for medical classification and visual question-answering tasks.
Lightweight Fine-tuning Strategy: Achieves competitive performance through lightweight fine-tuning of general-purpose backbones, with no medical pre-training.
Direct Downstream Task Optimization: Unlike earlier joint-optimization approaches that require pre-training, JOMED is optimized directly on downstream tasks.
Retrieval Inconsistency Analysis: Identifies and analyzes the "inconsistent retrieval prediction" problem and shows that joint optimization improves these cases over standard RAG.
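To illustrate what backpropagating LVLM errors to the retriever can mean, the sketch below uses one common joint-training formulation: the answer likelihood is marginalized over the retrieved candidates, weighted by a softmax over retriever scores. This summary does not state the exact JOMED objective, so the loss, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_nll(retriever_scores, answer_probs):
    """Assumed loss: -log sum_i p_R(d_i | q) * p_LM(y | q, d_i).

    retriever_scores: similarity score of each retrieved candidate (hypothetical).
    answer_probs: probability the LVLM assigns to the correct answer
                  when conditioned on each candidate (hypothetical).
    """
    weights = softmax(retriever_scores)
    marginal = sum(w * p for w, p in zip(weights, answer_probs))
    return -math.log(marginal)

def grad_wrt_scores(retriever_scores, answer_probs):
    """Analytic gradient of marginal_nll with respect to each retriever score.

    dL/ds_i = r_i * (1 - p_i / Z), with r = softmax(s) and Z = sum_j r_j * p_j.
    """
    weights = softmax(retriever_scores)
    marginal = sum(w * p for w, p in zip(weights, answer_probs))
    return [w * (1.0 - p / marginal) for w, p in zip(weights, answer_probs)]

# Toy example: three candidates with equal retriever scores.
scores = [0.0, 0.0, 0.0]
probs = [0.9, 0.1, 0.5]   # candidate 0 helps the LVLM most, candidate 1 least
loss = marginal_nll(scores, probs)
grads = grad_wrt_scores(scores, probs)
```

Under gradient descent, a negative gradient raises a candidate's score. In this toy case the helpful candidate gets a negative gradient and the unhelpful one a positive gradient, so the LVLM's error signal reshapes the retrieval ranking. A standard RAG pipeline with a frozen, separately trained retriever has no such signal.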
Given medical images and diagnostic questions, the system retrieves relevant visual and textual information from medical literature and hospital records, then generates accurate diagnostic answers based on retrieved information and query images.
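The retrieve-then-generate flow described above can be sketched as follows. This is a minimal illustration, assuming precomputed embeddings and cosine-similarity ranking; the index layout, the field names, and the prompt template are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_emb, index, k=2):
    """Return the k index entries most similar to the query embedding."""
    ranked = sorted(index, key=lambda item: cosine(query_emb, item["embedding"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Concatenate retrieved text with the question (hypothetical template)."""
    context = "\n".join(
        f"Reference {i + 1}: {item['text']}" for i, item in enumerate(retrieved)
    )
    return f"{context}\nQuestion: {question}\nAnswer:"

# Toy index: each entry pairs an embedding with its associated report text.
index = [
    {"id": "a", "embedding": [1.0, 0.1], "text": "Report A"},
    {"id": "b", "embedding": [0.0, 1.0], "text": "Report B"},
    {"id": "c", "embedding": [0.9, 0.2], "text": "Report C"},
]
top = retrieve_top_k([1.0, 0.0], index, k=2)
prompt = build_prompt("What is the diagnosis?", top)
```

In a real system the prompt would be passed to the LVLM together with the query image and any retrieved images; that step is omitted here because it depends on the specific model interface.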
Identifies the "inconsistent retrieval prediction" phenomenon: different retrieval candidates produce different predictions for the same query image. The share of such cases ranges from 3% to 93% across datasets.
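One way to measure this phenomenon is to run the model once per retrieved candidate and flag a query as inconsistent when the resulting predictions disagree. The sketch below assumes that protocol; the data layout, the function names, and the example labels are hypothetical.

```python
def is_inconsistent(preds_per_candidate):
    """True if different retrieved candidates lead to different predictions."""
    return len(set(preds_per_candidate)) > 1

def inconsistency_rate(predictions):
    """Fraction of queries whose per-candidate predictions disagree.

    predictions: dict mapping a query id to the list of predictions obtained
                 when conditioning on each top-retrieved candidate in turn.
    """
    if not predictions:
        return 0.0
    flagged = sum(is_inconsistent(p) for p in predictions.values())
    return flagged / len(predictions)

# Toy data: four query images, three retrieved candidates each.
example = {
    "img_001": ["pneumonia", "pneumonia", "pneumonia"],
    "img_002": ["pneumonia", "normal", "pneumonia"],
    "img_003": ["normal", "normal", "effusion"],
    "img_004": ["normal", "normal", "normal"],
}
rate = inconsistency_rate(example)
```

Here two of the four queries are inconsistent, so the rate is 0.5.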
Oracle analysis reveals that the correct answer is frequently achievable using one of the top retrieved images, but actual performance falls substantially short of the oracle, and rerankers based on frontier LVLMs do not close this gap, leaving room for future improvement.
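The oracle gap can be illustrated with a small computation. The sketch assumes that the oracle counts a query as correct if any top-retrieved candidate yields the right prediction, and that the actual prediction is the one obtained with the top-ranked candidate. Both definitions are simplifying assumptions, and the records are made-up toy data rather than results from the paper.

```python
def actual_accuracy(records):
    """Accuracy when using the prediction from the top-ranked candidate.

    Assumption: preds[0] corresponds to the highest-ranked retrieved item.
    """
    return sum(r["preds"][0] == r["label"] for r in records) / len(records)

def oracle_accuracy(records):
    """Accuracy if the best candidate could always be selected."""
    return sum(r["label"] in r["preds"] for r in records) / len(records)

# Toy records: per-candidate predictions and the ground-truth label.
records = [
    {"label": "pneumonia", "preds": ["pneumonia", "normal"]},
    {"label": "normal", "preds": ["pneumonia", "normal"]},
    {"label": "effusion", "preds": ["normal", "normal"]},
    {"label": "normal", "preds": ["normal", "normal"]},
]
actual = actual_accuracy(records)
oracle = oracle_accuracy(records)
gap = oracle - actual
```

In this toy data the second record drives the gap: a candidate that yields the correct answer was retrieved but not ranked first. That is the kind of case a better retriever or reranker would need to fix.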
Overall Assessment: This is a high-quality research paper proposing innovative lightweight joint optimization methods in medical AI. The paper presents clear technical contributions, rigorous experimental design, and in-depth analysis, providing valuable solutions for practical medical AI applications. Particularly noteworthy is the discovery and analysis of the retrieval inconsistency problem, which points to important directions for future research.