Large language models (LLMs) excel at clinical information extraction but their computational demands limit practical deployment. Knowledge distillation--the process of transferring knowledge from larger to smaller models--offers a potential solution. We evaluate the performance of distilled BERT models, which are approximately 1,000 times smaller than modern LLMs, for clinical named entity recognition (NER) tasks. We leveraged state-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) as teacher labelers for medication, disease, and symptom extraction. We applied our approach to over 3,300 clinical notes spanning five publicly available datasets, comparing distilled BERT models against both their teacher labelers and BERT models fine-tuned on human labels. External validation was conducted using clinical notes from the MedAlign dataset. For disease extraction, F1 scores were 0.82 (teacher model), 0.89 (BioBERT trained on human labels), and 0.84 (BioBERT-distilled). For medication, F1 scores were 0.84 (teacher model), 0.91 (BioBERT-human), and 0.87 (BioBERT-distilled). For symptoms: F1 score of 0.73 (teacher model) and 0.68 (BioBERT-distilled). Distilled BERT models had faster inference (12x, 4x, 8x faster than GPT-4o, o1-mini, and Gemini Flash respectively) and lower costs (85x, 101x, 2x cheaper than GPT-4o, o1-mini, and Gemini Flash respectively). On the external validation dataset, the distilled BERT model achieved F1 scores of 0.883 (medication), 0.726 (disease), and 0.699 (symptom). Distilled BERT models were up to 101x cheaper and 12x faster than state-of-the-art LLMs while achieving similar performance on NER tasks. Distillation offers a computationally efficient and scalable alternative to large LLMs for clinical information extraction.
- Paper ID: 2501.00031
- Title: Distilling Large Language Models for Efficient Clinical Information Extraction
- Authors: Karthik S. Vedula, Annika Gupta, Akshay Swaminathan, Ivan Lopez, Suhana Bedi, Nigam H. Shah
- Classification: cs.CL (Computation and Language)
- Publication Date: January 3, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2501.00031
This study uses knowledge distillation to transfer knowledge from large language models to BERT models approximately 1,000 times smaller for clinical named entity recognition tasks. State-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) serve as teacher annotators to extract medications, diseases, and symptoms from over 3,300 clinical notes. The distilled BERT model achieves 4-12x faster inference and 2-101x lower cost while maintaining comparable performance, providing an efficient and scalable solution for clinical information extraction.
Clinical notes in electronic health records contain a substantial amount of valuable unstructured information that often cannot be captured in structured fields. Converting this free text into structured data is crucial for cohort selection, observational analysis, and question-answering systems, yet extracting information from clinical notes remains challenging.
- Traditional Methods: Rule-based approaches using string matching and medical ontologies are interpretable and computationally efficient but often fail to capture diverse representations of clinical entities, including synonyms, abbreviations, detailed descriptions, and spelling variations.
- Machine Learning Approaches: BERT-based models demonstrate strong performance, but current clinical NER models often focus on specific domains or entity types, limiting broad applicability. Fine-tuning requires substantial annotated data, which is costly and time-consuming.
- Large Language Models: LLMs excel at clinical NER tasks but require substantial computational resources and incur high costs. Proprietary LLMs also require HIPAA-compliant endpoints to handle protected health information.
Knowledge distillation offers a promising solution to these challenges by transferring knowledge from large models to smaller models, addressing limitations of domain-specific BERT models while avoiding expensive LLM deployment.
- Multi-Teacher Annotation System: Developed a teacher annotation system combining state-of-the-art LLMs (Gemini and OpenAI models) with medical ontologies (RxNorm and SNOMED) for clinical NER tasks across diverse note types.
- Efficient Distilled Model: Created and released a BERT-based distilled model approximately 1/1000 the size of modern LLMs, trained on over 2,000 clinical documents covering oncology progress notes, discharge summaries, radiology reports, and scientific abstracts.
- Comprehensive Evaluation: Evaluated the models on five public clinical datasets, including model failure mode analysis and external validation across health systems.
This research focuses on three distinct NER tasks:
- Medication Extraction: Identifying medication names and drug categories in clinical notes
- Disease Extraction: Identifying diseases, syndromes, and pathological conditions
- Symptom Extraction: Identifying patient symptoms and clinical manifestations
Each task uses the Inside-Outside (IO) tagging format: words that belong to an entity are tagged "Inside", and all other words are tagged "Outside".
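A minimal sketch of IO tagging, to make the format concrete. The helper name `io_tags`, the example sentence, and the span indices are all hypothetical and are not taken from the paper or its datasets. Each NER task gets its own tag sequence over the same tokens.

```python
def io_tags(tokens, entity_spans):
    """Tag each token "I" if it falls inside any entity span, else "O".

    entity_spans holds (start, end) token indices with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            tags[i] = "I"
    return tags


# Hypothetical example sentence (not from any dataset in the study).
tokens = ["Patient", "started", "on", "metformin", "for", "type", "2", "diabetes"]

# One tag sequence per task: the same tokens are labeled separately
# for medication extraction and for disease extraction.
medication_tags = io_tags(tokens, [(3, 4)])
disease_tags = io_tags(tokens, [(5, 8)])

for token, med, dis in zip(tokens, medication_tags, disease_tags):
    print(f"{token:10s} medication={med} disease={dis}")
```

Unlike BIO tagging, IO tagging has no "Begin" tag, so two entities that sit directly next to each other cannot be told apart from a single longer entity.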
- LLM Annotators: Evaluated four state-of-the-art LLMs as teacher annotators
- GPT-4o (version 2024-08-06)
- GPT-4o-mini (version 2024-07-18)
- o1-mini (version 2024-09-12)
- Gemini 1.5 Flash (gemini-1.5-flash-002)
- Ontology Annotators: Leveraged BioPortal Annotator API to access biomedical ontologies
- RxNorm: for medication extraction
- SNOMED CT: for disease and symptom extraction
- Optimal Teacher Combination: Evaluated all 31 non-empty subsets of the 5 teacher annotators available for each task (the four LLMs plus the task's ontology) and selected the combination with the highest F1 score on the development set.
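The subset search can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the predictions are toy data invented for the example, and merging teachers by taking the union of their "Inside" tags is an assumption, since this summary does not state how the outputs of several teachers are combined.

```python
from itertools import combinations


def token_f1(gold, pred):
    """Token-level F1 for IO tags, treating "I" as the positive class."""
    tp = sum(g == "I" and p == "I" for g, p in zip(gold, pred))
    fp = sum(g == "O" and p == "I" for g, p in zip(gold, pred))
    fn = sum(g == "I" and p == "O" for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def union_labels(label_lists):
    # ASSUMPTION: a token is "I" if any teacher in the subset tags it "I".
    return ["I" if "I" in column else "O" for column in zip(*label_lists)]


def best_teacher_subset(teacher_preds, gold):
    """Score every non-empty subset of teachers and return the best one."""
    names = sorted(teacher_preds)
    subsets = [
        subset
        for size in range(1, len(names) + 1)
        for subset in combinations(names, size)
    ]
    scored = [
        (token_f1(gold, union_labels([teacher_preds[n] for n in subset])), subset)
        for subset in subsets
    ]
    best_f1, best = max(scored, key=lambda item: item[0])
    return best, best_f1, len(subsets)


# Toy development set: made-up labels, NOT data or results from the paper.
gold = ["O", "I", "I", "O", "I", "O"]
teacher_preds = {
    "gpt-4o":           ["O", "I", "O", "O", "O", "O"],
    "gpt-4o-mini":      ["I", "I", "O", "O", "O", "O"],
    "o1-mini":          ["O", "I", "I", "O", "O", "I"],
    "gemini-1.5-flash": ["O", "O", "I", "O", "I", "O"],
    "rxnorm":           ["O", "I", "O", "I", "O", "O"],
}

best, best_f1, n_subsets = best_teacher_subset(teacher_preds, gold)
print(n_subsets, best, best_f1)
```

With 5 teachers there are 2^5 - 1 = 31 non-empty subsets, so an exhaustive search is cheap once each teacher's predictions on the development set have been cached.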
For each NER task, the optimal teacher annotation pipeline generates training labels, followed by fine-tuning independent BERT models:
- BERT base: General-purpose language model
- BioBERT: Pre-trained on biomedical literature
- BioClinBERT: Specifically designed for clinical text
Training parameters: learning rate = 2×10⁻⁵, batch size = 8, weight decay = 0.01, 10 epochs.
- Multi-Teacher Fusion Strategy: Unlike existing research using single teacher models, this study systematically evaluates 31 combinations of LLMs and ontologies, selecting optimal combinations for different tasks.
- Cross-Domain Generalization: Training and testing across multiple clinical note types, including discharge summaries, progress notes, radiology reports, etc.
- Cost-Benefit Analysis: Provides detailed inference time and cost comparisons, quantifying practical deployment advantages of distilled models.
- n2c2 2018 Track 2: 505 MIMIC-III discharge summaries with expert annotations for medication extraction
- Training: 303 documents, Test: 202 documents, Development: 25 documents
- NCBI Disease Corpus: 793 PubMed abstracts with expert annotations for disease extraction
- Using official dataset split
- CORAL Dataset: De-identified progress notes from 40 patients (20 breast cancer, 20 pancreatic cancer)
- Test: 35 documents, Development: 5 documents
Combined all available datasets, including 1,000 MIMIC-III clinical notes (stratified sampling by document type), resulting in 2,096 documents for teacher annotation.
Used MedAlign dataset for external validation, containing 276 longitudinal patient records from Stanford Hospital and Lucile Packard Children's Hospital.
Standard token-level precision, recall, and F1 score with human annotation as gold standard.
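As a sketch of what token-level scoring means under IO tagging, the snippet below counts true positives, false positives, and false negatives per token against the human labels. The function name `token_prf` and the two tag sequences are hypothetical; the exact tokenization and averaging used in the paper are not specified in this summary.

```python
def token_prf(gold, pred):
    """Return (precision, recall, f1) over tokens, with "I" as the positive class."""
    tp = sum(g == "I" and p == "I" for g, p in zip(gold, pred))
    fp = sum(g == "O" and p == "I" for g, p in zip(gold, pred))
    fn = sum(g == "I" and p == "O" for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up tag sequences for illustration only.
gold = ["O", "I", "I", "O", "I", "O", "O", "I"]  # human annotation (gold standard)
pred = ["O", "I", "O", "O", "I", "I", "O", "I"]  # model prediction

precision, recall, f1 = token_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the prediction has 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 0.75.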
- Direct teacher annotator predictions
- BERT models fine-tuned on human labels
- BERT models distilled from teacher labels
- Training on 4x NVIDIA H100 GPUs
- All LLMs executed through HIPAA-compliant API endpoints
- Standardized parameters: temperature = 0.01, top-p = 0.9
| Task | Optimal Combination | F1 Score |
|---|---|---|
| Disease Extraction | o1-mini | 0.787 |
| Medication Extraction | Gemini-1.5-flash + GPT-4o | 0.881 |
| Symptom Extraction | Gemini-1.5-flash + GPT-4o | 0.801 |
| Task | BERT + Human Labels | BERT + Teacher Labels | Teacher Annotator Only |
|---|---|---|---|
| Disease Extraction | 0.89 | 0.84 | 0.82 |
| Medication Extraction | 0.91 | 0.87 | 0.84 |
| Symptom Extraction | - | 0.68 | 0.73 |
| Model | Inference Time per Note in Seconds (change vs. Distilled BioBERT) | Cost per Note in USD (change vs. Distilled BioBERT) |
|---|---|---|
| Distilled BioBERT | 0.14 | 0.000187 |
| GPT-4o | 1.66 (+1086%) | 0.0159 (+8402%) |
| o1-mini | 0.58 (+314%) | 0.0189 (+10007%) |
| Gemini Flash | 1.17 (+736%) | 0.000460 (+146%) |
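The headline multipliers (12x, 4x, 8x faster; 85x, 101x, 2x cheaper) follow from dividing each LLM's per-note figure by the distilled model's. The short check below reproduces that arithmetic from the table values; the variable names are illustrative only.

```python
# Per-note (inference seconds, cost in USD), copied from the table above.
per_note = {
    "Distilled BioBERT": (0.14, 0.000187),
    "GPT-4o": (1.66, 0.0159),
    "o1-mini": (0.58, 0.0189),
    "Gemini Flash": (1.17, 0.000460),
}

base_time, base_cost = per_note["Distilled BioBERT"]

# Ratio of each LLM to the distilled model: (time ratio, cost ratio).
ratios = {
    name: (seconds / base_time, usd / base_cost)
    for name, (seconds, usd) in per_note.items()
    if name != "Distilled BioBERT"
}

for name, (time_ratio, cost_ratio) in ratios.items():
    print(f"{name}: {time_ratio:.1f}x the time, {cost_ratio:.1f}x the cost")
```

A ratio of r corresponds to a relative increase of (r - 1) x 100%, which is how the percentages in parentheses relate to the multipliers quoted in the abstract.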
Performance on MedAlign dataset:
- Medication Extraction: F1 = 0.883
- Disease Extraction: F1 = 0.726
- Symptom Extraction: F1 = 0.699
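Manual review revealed that most false positives were in fact correct predictions that the reference annotations had missed:
- Symptom Extraction: 82.05% of false positives were correct predictions missing from the reference annotations
- Medication Extraction: 62.93% of false positives were correct predictions missing from the reference annotations
- Disease Extraction: 73.33% of false positives were correct predictions missing from the reference annotations
- Performance Hierarchy: For disease and medication extraction, Human-labeled BERT > Teacher-labeled BERT > Direct teacher prediction; for symptom extraction, the teacher (F1 = 0.73) outperformed the distilled model (F1 = 0.68)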
- Limited Ontology Role: Ontology annotators absent from optimal combinations for symptom extraction
- BioBERT Advantages: Best performance on most tasks
- Significant Cost-Benefit: Distilled models 2-101x cheaper and 4-12x faster than LLMs
- Traditional Methods: Rule-based and ontology-based approaches, such as UMLS
- Deep Learning Methods: BERT-based models, including domain-specific variants like BioBERT and ClinicalBERT
- Weakly Supervised Methods: Such as TROVE, using UMLS ontologies to generate weak labels for BERT training
- General Distillation: Distillation from GPT-4 to medium-sized models like LLaMA
- Medical Domain Distillation: Successful use of DistilFLERT and distilled PubMedBERT in medical applications
- Multi-Teacher Fusion: Systematic evaluation of 31 LLM and ontology combinations
- Cross-Domain Validation: Verification of generalization across multiple note types and health systems
- Comprehensive Evaluation: Includes cost-benefit analysis and detailed error analysis
Distilled BERT models achieve performance comparable to large LLMs in clinical NER tasks with significantly lower computational costs and inference time, providing a practical solution for clinical information extraction.
- Inconsistent Teacher Quality: Particularly variable quality in symptom annotation
- Limited Entity Types: Covers only three entity types, excluding procedures and social determinants
- Missing Complex Tasks: Does not address assertion status (e.g., negation) or relation extraction
- Insufficient Prompt Engineering: All LLMs use identical prompts without task-specific optimization
- Test Set Quality: Annotation inconsistency issues present
- Extend to more entity types and complex NER tasks
- Improve prompt engineering strategies
- Explore advanced distillation techniques
- Enhance test set annotation quality
- High Practical Value: Addresses the real problem of high LLM deployment costs
- Systematic Methodology: Comprehensive evaluation of multiple teacher combination strategies
- Thorough Validation: Includes external validation and detailed error analysis
- Transparency: Provides code and detailed experimental settings
- Cost Quantification: Offers specific time and cost comparison data
- Limited Novelty: Knowledge distillation itself is not new; main contribution is application-level
- Insufficient Baseline Comparisons: Lacks direct comparison with other distillation methods
- Lack of Theoretical Analysis: No deep analysis of why certain teacher combinations perform better
- Limited Applicability: Primarily targets English clinical text; generalization capability needs verification
- High Practical Value: Provides feasible solutions for clinical NLP deployment
- Good Reproducibility: Provides complete code and dataset information
- High Adoption Potential: Methods extend to other medical NLP tasks
- Resource-Constrained Applications: Important for environments with limited computational resources
- Hospital Information Systems: Need real-time processing of large volumes of clinical notes
- Research Institutions: Limited computational resources but require high-quality NER
- Medical AI Products: Need to balance performance and deployment costs
- Multilingual Extension: Can serve as a foundation for clinical NER in other languages
The paper cites 61 related works, primarily including:
- BERT-related work: Devlin et al. (2019), Lee et al. (2020) BioBERT
- Knowledge Distillation: Hinton et al. (2015), Zhou et al. (2024)
- Clinical NLP: Henry et al. (2020) n2c2, Fleming et al. (2023) MedAlign
- Medical Ontologies: Bodenreider (2004) UMLS, Liu et al. (2005) RxNorm
This research provides a practical and efficient solution for clinical information extraction, using knowledge distillation to balance model performance against deployment cost, and it has strong potential for broader adoption.