Retrieval-augmented large language models (RAG-LLMs) demonstrate superior performance in the medical domain by integrating external knowledge, particularly for clinical diagnosis. However, existing RAG methods struggle to tailor retrieval strategies based on diagnostic difficulty and information completeness of input samples, resulting in excessive and unnecessary retrieval that compromises computational efficiency and increases the risk of introducing noise, thereby reducing diagnostic accuracy. To address this issue, this paper proposes ICA-RAG (Information Completeness-guided Adaptive Retrieval-Augmented Generation), a novel framework that enhances the reliability of RAG in disease diagnosis. ICA-RAG leverages an adaptive control module to assess retrieval necessity based on input information completeness, optimizing retrieval operations and knowledge filtering to better align retrieval with clinical requirements. Experiments on three Chinese electronic medical record datasets demonstrate that ICA-RAG significantly outperforms baseline methods, highlighting its effectiveness in clinical diagnosis.
Large language models face two major challenges in medical tasks:
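Direct Disease Diagnosis: Given a token sequence $Q$ representing the input text, LLM text generation can be formalized as $\text{LLM}(Q, \text{prompt})$.

RAG Disease Diagnosis: Relevant knowledge is retrieved from external sources and integrated into generation, $\text{LLM}(Q, d, \text{prompt})$, where $d$ denotes the retrieved knowledge.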
Adaptive RAG Disease Diagnosis: Introducing a control function $F$ to assess the input $Q$ and select the generation path:

$$\begin{cases}
\text{LLM}(Q, \text{prompt}), & \text{if } F(Q) = \langle\text{Activate}\rangle \\
\text{LLM}(Q, d, \text{prompt}), & \text{otherwise}
\end{cases}$$

### Model Architecture

The ICA-RAG framework comprises three main stages:

#### Stage (a): Retrieval Decision Optimization Based on Input Information Completeness

1. **Text Segmentation**: Dividing input $Q$ into text units (sentences by default): $Q = \{s_i\}_{i=1}^n$
2. **Importance Classification**: Training a classifier to predict the importance of each unit:

   $$l_i = \text{Classifier}(s_i) \quad \forall i \in \{1, 2, \ldots, n\}$$

   Labels are categorized into three classes:
   - A: Information critical for diagnostic decision-making
   - B: Information that contributes positively to retrieval but from which the result cannot be inferred directly
   - C: Relatively unimportant information
3. **Information Completeness Calculation**:

   $$I_{\text{norm}}(Q) = \frac{1}{\alpha \cdot n} \sum_{i=1}^n \left(\alpha \cdot I(l_i = A) + \beta \cdot I(l_i = B) + \gamma \cdot I(l_i = C)\right)$$

   where $I(\cdot)$ is the indicator function and $\alpha$, $\beta$, $\gamma$ are the weights of classes A, B, and C.

#### Stage (b): Retrieval Based on Document Segmentation and Mapping

1. **Sentence-Level Retrieval**: Each sentence serves as a query to retrieve the top-$m$ relevant text chunks
2. **Document-Level Re-ranking**: Re-ranking documents based on the count of retrieved chunks per document
3. **Mapping Strategy**: Mapping text chunks back to their original documents and re-ranking based on chunk counts

#### Stage (c): Knowledge Filtering and Diagnostic Generation Based on Prompt Guidance

Using a differential diagnosis prompt template to filter irrelevant documents, simulating the physician's differential diagnosis process.

### Technical Innovations

1. **Information Completeness Assessment**: Transforming complex document understanding into simple sentence-level tasks
2. **Masking Annotation Strategy**: Automatically obtaining training labels through sequence masking operations
3. **Chunk-Document Mapping Re-ranking**: Computing re-ranking based solely on retrieval result values, reducing memory overhead
4. **Differential Diagnosis Filtering**: Filtering irrelevant information by simulating the clinical differential diagnosis process

## Experimental Setup

### Datasets

- **CMEMR**: Chinese Electronic Medical Record dataset
- **ClinicalBench**: Clinical benchmark dataset
- **CMB-Clin**: Chinese Medical Benchmark Clinical dataset

All datasets are configured as end-to-end diagnostic tasks, with patient information as input and physician diagnostic conclusions as ground truth labels.

### Evaluation Metrics

Disease names are standardized to International Classification of Diseases (ICD-10) terminology, and set-level Precision, Recall, and F1-score are computed using fuzzy matching (threshold 0.5).

### Baseline Methods

1. **Non-Retrieval Methods**: CoT, SC-CoT, ATP
2. **Standard Retrieval Methods**: RAG2, LongRAG
3. **Adaptive Retrieval Methods**: Adaptive-RAG, DRAGIN, SEAKR

### Implementation Details

- **Backbone Model**: Qwen2.5-7B-Instruct
- **Classifier**: BERT-base-Chinese
- **Retriever**: BM25
- **External Knowledge Base**: CMKD Clinical Medical Knowledge Database

## Experimental Results

### Main Results

| Method | CMEMR F1 (%) | ClinicalBench F1 (%) | CMB-Clin F1 (%) |
|--------|--------------|----------------------|-----------------|
| CoT | 48.82 | 38.46 | 52.14 |
| LongRAG | 49.07 | 39.25 | 51.81 |
| Adaptive-RAG | 49.27 | 38.04 | 53.44 |
| **ICA-RAG** | **50.88** | **40.79** | **53.53** |

Key Findings:

1. ICA-RAG achieves optimal or near-optimal F1 scores across all datasets
2. Compared to LongRAG, F1 improves by 1.81, 1.54, and 1.72 percentage points on CMEMR, ClinicalBench, and CMB-Clin respectively
3. ICA-RAG outperforms the other adaptive RAG methods; against Adaptive-RAG the margin ranges from 0.09 points (CMB-Clin) to 2.75 points (ClinicalBench)

### Ablation Study

Ablation results on the CMEMR dataset:

| Variant | F1 (%) | Change (points) |
|---------|--------|-----------------|
| ICA-RAG | 50.88 | - |
| w/o Decision | 48.07 | -2.81 |
| w/o Chunk | 49.78 | -1.10 |
| w/o M-rerank | 49.59 | -1.29 |
| w/o Diff | 49.85 | -1.03 |

### Efficiency Analysis

- **Temporal Efficiency**: Significant improvements compared to non-adaptive RAG methods
- **Parameter Efficiency**: The BERT-Base classifier (110M parameters) is more lightweight than Adaptive-RAG's T5-Large (770M parameters)
- **Applicability**: No access to LLM output probability distributions is needed, so the method applies to closed-source models and API deployments

## Related Work

### RAG Applications in Clinical Disease Diagnosis

- Most research employs basic retrieval methods, encoding external knowledge and task queries through embedding models
- Knowledge graphs are also widely adopted
- Optimization tailored to medical domain characteristics is lacking

### Adaptive RAG

- **FLARE and DRAGIN**: Activate search when the LLM generates low-confidence tokens
- **Self-RAG**: Trains models to dynamically retrieve, critique, and generate text
- **Adaptive-RAG**: Assesses query complexity to determine retrieval necessity
- Existing methods primarily target question-answering tasks and are difficult to transfer directly to medical diagnosis

## Conclusions and Discussion

### Main Conclusions

ICA-RAG effectively addresses the rigid retrieval strategy problem in traditional retrieval-augmented methods by optimizing adaptive retrieval decisions based on input information completeness, demonstrating strong adaptability in complex clinical scenarios.

### Limitations

1. **Annotation Strategy Constraints**: Because patient information can be redundant, LLMs may still reach correct diagnoses after key sentences are masked, leading to inaccurate annotation labels
2. **Complexity of Medical Text**: Clinical medical text contains abbreviations, synonyms, and aliases, and documentation varies significantly across physicians, which affects retrieval accuracy
3. **Manual Verification Requirements**: Automatic annotation strategies still require human inspection and correction

### Future Directions

1. Explore more effective medical text preprocessing strategies to enhance retrieval quality
2. Apply ICA-RAG to other medical tasks
3. Further optimize the retrieval process

## In-Depth Evaluation

### Strengths

1. **Strong Innovation**: First to propose an adaptive retrieval decision mechanism based on information completeness
2. **High Practicality**: Requires no fine-tuning of the backbone LLM, with strong applicability
3. **Comprehensive Experiments**: Thorough evaluation and ablation studies across multiple datasets
4. **Efficiency Improvements**: Significantly enhances computational efficiency while maintaining performance

### Weaknesses

1. **Dataset Limitations**: Validation only on Chinese EMR datasets, lacking cross-lingual and cross-domain verification
2. **Annotation Quality**: The automatic annotation strategy contains noise and requires human intervention
3. **Threshold Setting**: Lacks theoretical guidance for setting the information completeness thresholds θ₁ and θ₂
4. **Knowledge Base Dependency**: Performance depends heavily on external knowledge base quality

### Impact

1. **Academic Contribution**: Provides new insights for RAG applications in medical AI
2. **Practical Value**: Can be directly applied to clinical decision support systems
3. **Reproducibility**: Detailed method description and clear experimental setup

### Applicable Scenarios

1. **Clinical Diagnosis**: Particularly suitable for cases with complex symptoms requiring differential diagnosis
2. **Medical Question-Answering Systems**: Can enhance the accuracy and efficiency of medical consultation systems
3. **Medical Education**: Can serve as an auxiliary tool for medical student learning

## References

The paper cites 41 relevant references covering important works in large language models, retrieval-augmented generation, medical AI, and other domains, providing a solid theoretical foundation for the research.

---

**Overall Assessment**: This is a high-quality paper with significant contributions to the medical AI field. The authors address limitations of existing RAG methods in medical diagnosis and propose an innovative solution, validated through comprehensive experiments. Despite certain limitations, its innovation and practicality make it an important advance in the field.
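
To make the Stage (a) information completeness calculation concrete, the following is a minimal sketch of the $I_{\text{norm}}(Q)$ formula applied to a list of per-sentence labels. The weight values (`alpha=1.0`, `beta=0.5`, `gamma=0.0`) and the single cut-off in `should_retrieve` are illustrative assumptions, not values from the paper; in particular, how the paper's thresholds θ₁ and θ₂ are applied is not specified here.

```python
from typing import Sequence


def information_completeness(
    labels: Sequence[str],
    alpha: float = 1.0,   # assumed weight for class A (critical information)
    beta: float = 0.5,    # assumed weight for class B (helps retrieval)
    gamma: float = 0.0,   # assumed weight for class C (unimportant)
) -> float:
    """Compute I_norm(Q) from per-sentence importance labels "A", "B", "C"."""
    if not labels:
        raise ValueError("at least one labelled text unit is required")
    weights = {"A": alpha, "B": beta, "C": gamma}
    total = sum(weights[label] for label in labels)
    # Dividing by alpha * n keeps the score in [0, 1] when alpha is the
    # largest weight and all weights are non-negative.
    return total / (alpha * len(labels))


def should_retrieve(score: float, threshold: float = 0.5) -> bool:
    """Hypothetical single cut-off: retrieve when the input looks incomplete."""
    return score < threshold


labels = ["A", "B", "C", "A"]  # classifier output for four sentences
score = information_completeness(labels)
print(score, should_retrieve(score))
```

With these assumed weights, two class-A sentences, one class-B sentence, and one class-C sentence give a score of 0.625, which is above the illustrative cut-off, so this sketch would skip retrieval for that input.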
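
The chunk-to-document mapping and count-based re-ranking of Stage (b) can be sketched as follows. This is a simplified illustration under stated assumptions: `chunk_to_doc` is a hypothetical lookup from chunk ID to source document ID, each sentence's retrieval result is given as a list of chunk IDs, and every hit counts once even if the same chunk is retrieved by several sentences. The retriever itself (BM25 in the paper) is not reproduced.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def rerank_documents(
    hits_per_sentence: Sequence[Sequence[str]],
    chunk_to_doc: Dict[str, str],
    top_k: int = 3,
) -> List[Tuple[str, int]]:
    """Map retrieved chunks back to documents and rank documents by hit count."""
    counts: Counter = Counter()
    for chunk_ids in hits_per_sentence:      # one list of chunk IDs per query sentence
        for chunk_id in chunk_ids:
            counts[chunk_to_doc[chunk_id]] += 1
    # Only counts are needed for ranking, so no chunk text or scores are stored.
    return counts.most_common(top_k)


# Hypothetical mapping and retrieval results for three query sentences.
chunk_to_doc = {"c1": "docA", "c2": "docA", "c3": "docB", "c4": "docC"}
hits = [["c1", "c3"], ["c2", "c3"], ["c1", "c4"]]
print(rerank_documents(hits, chunk_to_doc, top_k=2))
```

Because the ranking uses only hit counts, the sketch needs no chunk embeddings or retrieval scores in memory, which is in line with the reduced memory overhead the summary attributes to this step.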
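
The set-level metrics described under Evaluation Metrics can be illustrated with the sketch below. The similarity function is an assumption: the summary states only that fuzzy matching with a threshold of 0.5 is used, so Python's `difflib.SequenceMatcher` ratio stands in for whatever measure the authors applied, and disease names are assumed to be already normalized to ICD-10 terminology.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def _similar(a: str, b: str, threshold: float) -> bool:
    # Stand-in similarity measure; the paper's actual matcher is not specified here.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def set_level_prf(
    predicted: Iterable[str],
    gold: Iterable[str],
    threshold: float = 0.5,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of disease names with fuzzy matching."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    # A prediction is correct if it fuzzily matches any gold label, and a gold
    # label is recovered if any prediction fuzzily matches it.
    matched_pred = sum(any(_similar(p, g, threshold) for g in ref) for p in pred)
    matched_gold = sum(any(_similar(p, g, threshold) for p in pred) for g in ref)
    precision = matched_pred / len(pred)
    recall = matched_gold / len(ref)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


print(set_level_prf(["pneumonia", "asthma"], ["pneumonia", "diabetes"]))
```

In this example only "pneumonia" matches, so precision, recall, and F1 are all 0.5. A near-variant such as "type 2 diabetes" against "type 2 diabetes mellitus" would count as a match under the same threshold.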