Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating $25$ models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.
- Paper ID: 2404.04067
- Title: Does Biomedical Training Lead to Better Medical Performance?
- Authors: Amin Dada, Osman Alperen Koraş, Marie Bauer, Jean-Philippe Corbeil, Amanda Butler Contreras, Constantin Marc Seibold, Kaleb E Smith, Julian Friedrich, Jens Kleesiek
- Classification: cs.CL cs.AI cs.LG
- Publication Date/Venue: arXiv preprint (submitted April 2024, updated October 2025)
- Paper Link: https://arxiv.org/abs/2404.04067v5
Large Language Models (LLMs) hold tremendous potential for healthcare applications, with biomedical domain-adapted models promising superior performance on medical tasks. However, the effectiveness of biomedical domain adaptation for clinical tasks remains uncertain. This study presents a direct comparison of 12 biomedical-adapted models and their general-domain foundation models across six clinical tasks. The results reveal that 11 of the 12 biomedical models exhibit performance degradation, challenging previous findings reporting positive effects of biomedical adaptation. Notably, prior positive results relied primarily on multiple-choice question (MCQA) evaluations, which may not reflect performance in real-world clinical applications.
The core research question is: Does specialized biomedical training truly enhance the performance of large language models on practical clinical tasks?
- Practical Application Needs: LLMs in healthcare have enormous potential to improve patient care quality and efficiency
- Resource Investment Considerations: Development of biomedical LLMs requires substantial computational resources and specialized data
- Safety Considerations: Medical applications demand exceptionally high standards for model accuracy and reliability
- Evaluation Method Constraints: Previous research primarily relied on multiple-choice question (MCQA) assessments, lacking evaluation on real clinical documents
- Inconsistent Conclusions: Recent studies have begun questioning the effectiveness of biomedical domain adaptation
- Lack of Systematic Comparison: Absence of direct systematic comparisons between multiple biomedical models and their foundation models
The authors aim to reveal the true effects of biomedical training through systematic evaluation on authentic clinical tasks, providing objective evidence for the field's development.
- Systematic Evaluation Framework: Constructed the CLUE (Clinical Language Understanding Evaluation) framework encompassing 6 practical clinical tasks
- Large-Scale Model Comparison: Evaluated 24 language models, including 12 biomedical models and their foundation models
- Disruptive Findings: Discovered that 11/12 biomedical models show performance degradation on clinical tasks, challenging conventional wisdom
- Open-Source Contribution: Released the complete evaluation pipeline to promote reproducible research
- In-Depth Error Analysis: Identified major issues in biomedical models: hallucinations, degraded instruction-following ability, etc.
The CLUE evaluation framework comprises 6 clinical tasks divided into two difficulty levels:
Level 1 (Simple Tasks, Short Input):
- MedNLI: Natural language inference based on MIMIC-III clinical notes
- MeQSum: Consumer health question summarization
- Problem Summary: Patient problem extraction from SOAP-structured clinical notes
Level 2 (Complex Tasks, Long Input):
- LongHealth: Long document comprehension and question-answering
- MeDiSumQA: Discharge summary question-answering and simplification
- MeDiSumCode: ICD-10 code prediction
Evaluated biomedical models include:
- Meditron Series (7B/70B): Continued pretraining based on Llama-2
- BioMistral Series: Training based on Mistral-7B
- OpenBioLLM Series (8B/70B): Training based on Llama-3 using SFT+DPO
- Med42 Series (8B/70B): Training based on Llama-3
- Other Models: Internist.ai, Aloe, Meditron3, etc.
- Real Clinical Task Evaluation: Unlike traditional MCQA, uses authentic clinical documents and tasks
- Multi-Dimensional Metrics: Combines ROUGE, BERTScore, UMLS entity F1, and other metrics
- Systematic Comparison: Each biomedical model directly compared with its foundation model
- Error Pattern Analysis: In-depth analysis of specific error types including hallucinations and repetitive loops
- MedNLI: 1,425 samples based on MIMIC-III clinical notes
- MeQSum: 1,000 consumer health inquiries
- Problem Summary: 237 SOAP-structured clinical notes
- LongHealth: 400 long document QA pairs (average 5,537 words)
- MeDiSumQA: 453 discharge summary QA pairs
- MeDiSumCode: 500 ICD-10 coding tasks
- Text Generation Tasks: ROUGE-1/2/L, BERTScore, UMLS entity F1
- Classification Tasks: Accuracy, F1 score
- Coding Tasks: Exact match, approximate match, valid code percentage
- 12 biomedical models compared with their corresponding foundation models
- Additional general-domain models as reference baselines
- Computational Resources: NVIDIA DGX A100 640GB nodes, approximately 1,536 GPU hours
- Prompting Strategy: 3-shot for Level 1, 1-shot for Level 2 (except LongHealth)
- Model Configuration: Hugging Face default instruction templates
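
The coding-task metrics listed above (exact match, approximate match, valid code percentage) can be illustrated with a minimal sketch. It is not the CLUE implementation. It assumes that exact and approximate match are set-level F1 scores, that approximate match compares only the first three characters of each code, and that validity can be approximated by a simplified shape check when no code table is supplied. The names `coding_metrics`, `set_f1`, and `ICD10_SHAPE` are hypothetical.

```python
import re

# Assumption: simplified ICD-10 shape (letter, digit, alphanumeric, then up to
# four more alphanumerics, dots removed). A real validity check would look the
# code up in an official code table passed in via `valid_codes`.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z][0-9A-Z]{0,4}$")


def normalize(code: str) -> str:
    """Upper-case a code and drop whitespace and the optional dot."""
    return code.strip().upper().replace(".", "")


def set_f1(pred: set, gold: set) -> float:
    """F1 between two sets of labels; 0.0 if there is no overlap."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def coding_metrics(predicted, gold, valid_codes=None):
    """Score one document's predicted codes against its gold codes."""
    pred = [normalize(c) for c in predicted]
    gold_set = {normalize(c) for c in gold}

    def is_valid(code: str) -> bool:
        if valid_codes is not None:
            return code in valid_codes
        return bool(ICD10_SHAPE.match(code))

    valid = [c for c in pred if is_valid(c)]
    valid_pct = 100.0 * len(valid) / len(pred) if pred else 0.0
    # Exact match: full codes must agree.
    exact = set_f1(set(valid), gold_set)
    # Approximate match (assumed): only the three-character category must agree.
    approx = set_f1({c[:3] for c in valid}, {c[:3] for c in gold_set})
    return {"exact_f1": exact, "approx_f1": approx, "valid_pct": valid_pct}


# Made-up example, not data from the paper.
result = coding_metrics(["I10", "E11.9", "XYZ"], ["I10", "E11.65"])
```

In this made-up example the prediction `E11.9` misses the gold code `E11.65` exactly but shares its three-character category `E11`, so it counts only toward approximate match, while `XYZ` fails the shape check and lowers the valid code percentage.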
| Model Category | Level 1 Avg Performance Change | Level 2 Avg Performance Change | Overall Trend |
|---|---|---|---|
| Meditron-7B | -7.08 | - | Degradation |
| Meditron-70B | -4.59 | - | Degradation |
| BioMistral-7B | +0.26 | +0.71 | Slight Improvement |
| BioMistral-7B-DARE | +2.93 | +2.70 | Improvement |
| OpenBioLLM-8B | -15.17 | -13.54 | Significant Degradation |
| Med42-8B | +2.51 | -1.40 | Mixed |
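
One way to read the "Avg Performance Change" columns is as the mean, over the tasks in a level, of the biomedical model's score minus its foundation model's score. The sketch below illustrates that reading only; the aggregation rule is an assumption, the function name `average_change` is hypothetical, and the scores are made up rather than taken from the paper.

```python
from statistics import mean


def average_change(biomedical_scores: dict, foundation_scores: dict) -> float:
    """Mean per-task score difference (biomedical minus foundation).

    Only tasks present for both models are compared.
    """
    shared = sorted(biomedical_scores.keys() & foundation_scores.keys())
    if not shared:
        raise ValueError("no tasks in common between the two models")
    return mean(biomedical_scores[t] - foundation_scores[t] for t in shared)


# Hypothetical Level 1 scores for one model pair (illustrative values only).
foundation = {"MedNLI": 75.0, "MeQSum": 31.0}
biomedical = {"MedNLI": 70.0, "MeQSum": 30.0}
delta = average_change(biomedical, foundation)
```

A negative value means the biomedical model scores lower than its foundation model on average, which corresponds to the rows marked "Degradation".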
Key Findings:
- Only BioMistral-7B-DARE consistently outperforms its foundation model across all tasks
- 11/12 models show performance degradation on at least one task
- 4 models show degradation across all tasks
Task Complexity Impact:
- Level 1 tasks: Some models show slight improvements
- Level 2 tasks: Most models show significant degradation
Model Scale Impact:
- 8B parameter models: More likely to achieve improvements
- 70B parameter models: More prone to performance degradation after training
Error Pattern Examples:
- Hallucination Issues: On LongHealth task 3, Llama3-OpenBioLLM-8B scores 1.55, down from 56.25 for its foundation model
- Repetitive Loops: Biomedical models frequently get stuck in token repetition, producing incoherent outputs
- ICD-10 Coding Errors: Models tend to increment digits rather than predict valid codes
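
The repetitive-loop failure mode described above can be flagged automatically with a simple heuristic. The sketch below is not the authors' analysis method. It assumes whitespace tokenization and measures the share of n-grams that occur more than once; the function name `repeated_ngram_ratio` and the default `n=4` are hypothetical choices.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams in `text` that occur more than once."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# Made-up outputs: one stuck in a loop, one ordinary sentence.
looping = "the patient has " * 10
normal = "The patient was admitted with chest pain and discharged after two days"
loop_score = repeated_ngram_ratio(looping)
normal_score = repeated_ngram_ratio(normal)
```

Outputs with a ratio near 1 are dominated by repeated phrases and are candidates for manual inspection. Any cut-off would need tuning per task, since legitimate summaries also repeat short phrases.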
- Divergence from MCQA Evaluation: Traditional multiple-choice evaluation shows positive effects, but actual clinical task performance degrades
- Importance of Foundation Model Quality: Starting from a newer general-domain model (e.g., Llama-3) matters more than biomedical adaptation
- Degraded Instruction-Following Ability: Biomedical training impairs the model's instruction-following capabilities
- Commercial Models: Med-PaLM, MedGemini
- Open-Source Models: Meditron, BioMistral, Internist.ai, Med42
Recent studies have begun questioning the effectiveness of biomedical adaptation:
- Jeong et al. (2024): Found no significant advantages for biomedical LLMs
- Ceballos-Arroyo et al. (2024): Domain adaptation may impair instruction-following
This paper provides empirical evidence for this controversy through systematic evaluation on authentic clinical tasks.
- Biomedical Training is Not Always Beneficial: Most biomedical models show performance degradation on practical clinical tasks
- Competitiveness of General-Domain Models: General-domain models like Meta-Llama-3.1-70B demonstrate superior performance
- Importance of Evaluation Methods: MCQA evaluation may be misleading; evaluation on real clinical tasks is more informative
- Potential of Weight Merging: BioMistral-DARE's success suggests weight merging is a promising direction
- Computational Resource Constraints: Did not explore different temperature settings, chain-of-thought prompting, and other techniques
- Data Contamination Risk: Because the evaluation uses public datasets, data contamination cannot be ruled out entirely
- Clinical Environment Differences: Evaluation was not conducted in real clinical settings
- Insufficient Safety Assessment: Requires prospective clinical trials to validate safety
- Improved Training Methods: Explore better domain adaptation strategies
- Data Quality Enhancement: Utilize higher-quality training data
- Weight Merging Techniques: Further research on weight merging methods
- Clinical Trial Validation: Test in real clinical environments
- Rigorous Research Design: Systematic comparison of 12 biomedical models with foundation models
- Practical Task Design: Uses authentic clinical documents and tasks, closely aligned with real applications
- Disruptive Findings: Challenges mainstream perspectives in the field
- High-Value Open-Source Contribution: Complete evaluation framework promotes subsequent research
- In-Depth Error Analysis: Detailed analysis of specific issues like hallucinations and repetition
- Limited Sample Size: Some tasks have relatively small sample numbers (e.g., Problem Summary with only 237 samples)
- Narrow Evaluation Scope: Primarily focuses on English and specific types of clinical tasks
- Lack of Theoretical Analysis: Insufficient theoretical explanation for why biomedical training causes performance degradation
- Insufficient Training Details: Limited description of specific training processes for various biomedical models
- Academic Value: Prompts a critical re-examination of assumptions in biomedical LLM research
- Practical Guidance: Helps practitioners make more rational model selection decisions
- Methodological Contribution: CLUE evaluation framework can be widely adopted
- Resource Optimization: Prevents blind investment in biomedical model development
- Model Selection Decisions: Choosing appropriate foundation models for medical AI applications
- Research Direction Guidance: Provides new perspectives for biomedical LLM research
- Evaluation Standard Setting: Establishes more rigorous standards for medical AI evaluation
- Investment Decision Reference: Provides basis for related investment and resource allocation
- Chen, Z. et al. (2023). MEDITRON-70B: Scaling Medical Pretraining for Large Language Models.
- Labrak, Y. et al. (2024). BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains.
- Jeong, D. P. et al. (2024). Medical adaptation of large language and vision-language models: Are we making progress?
- Ceballos-Arroyo, A. M. et al. (2024). Open (clinical) LLMs are sensitive to instruction phrasings.
Summary: Through a rigorous experimental design, this paper shows the limitations of biomedical training on practical clinical tasks and prompts a critical re-examination of the field's assumptions. While the conclusions may be surprising, the methodological rigor and significance of the findings make the paper an important contribution to medical AI research. The study is a reminder to evaluate the effects of specialized training more carefully and to recognize the value of general-domain models in medical applications.