Large language models work well for many NLP tasks, but they are hard to deploy in health settings with strict cost, latency, and privacy limits. We revisit a lightweight recipe for medical abstract classification and ask how far compact encoders can go under a controlled budget. Using the public medical abstracts corpus, we fine-tune BERT-base and DistilBERT with three objectives (standard cross-entropy, class-weighted cross-entropy, and focal loss), keeping tokenizer, sequence length, optimizer, and schedule fixed. DistilBERT with plain cross-entropy gives the best balance on the test set while using far fewer parameters than BERT-base. We report accuracy, Macro-F1, and Weighted-F1, release the evaluation code, and include confusion analyses to make error patterns clear. Our results suggest a practical default: start with a compact encoder and cross-entropy, then add calibration and task-specific checks before moving to heavier models.
- Paper ID: 2510.10025
- Title: Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default
- Authors: Jiaqi Liu, Lanruo Wang, Su Liu, Xin Hu
- Categories: cs.CL cs.AI
- Publication Date: October 11, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.10025
Large language models demonstrate strong performance across many NLP tasks but are difficult to deploy in medical environments with strict cost, latency, and privacy constraints. This paper revisits lightweight solutions for medical abstract classification, exploring the performance limits of compact encoders under controlled budgets. Using publicly available medical abstract corpora, the authors fine-tune BERT-base and DistilBERT with three objective functions (standard cross-entropy, class-weighted cross-entropy, and focal loss) while keeping tokenizers, sequence length, optimizers, and schedulers fixed. Results demonstrate that DistilBERT combined with standard cross-entropy achieves the best balance on the test set while using significantly fewer parameters than BERT-base.
With the rapid growth of biomedical literature, manual tracking has become infeasible, necessitating reliable automated systems for classification, triage, and summarization. While large language models demonstrate superior performance, their computational and memory costs limit their use in medical environments, particularly in scenarios with budget, latency, and privacy constraints (such as HIPAA compliance).
- Practical Deployment Requirements: Medical pipelines typically operate under strict cost-of-service and governance requirements (on-premises deployment, air-gapped or VPC-restricted deployment)
- Efficiency-Performance Trade-off: Compact encoders often provide better accuracy-efficiency trade-offs in terms of ease of fine-tuning and calibration
- Baseline Establishment: Establishing clean baselines is valuable for future comparisons with domain-specialized encoders
- Large models incur high deployment costs and latency
- Domain-adaptive pre-trained models (e.g., SciBERT, BioBERT) offer good performance but consume substantial resources
- The effectiveness of class imbalance handling methods (resampling, cost-sensitive loss) remains insufficiently validated on medical texts
- Establishing Lightweight Baselines: Systematically comparing the performance of BERT-base and DistilBERT on medical abstract classification tasks
- Loss Function Comparison: Comparing three loss functions (CE, WCE, FL) under controlled conditions
- Practical Guidance: Providing recommended deployment pathways: starting with compact encoders and cross-entropy
- Open-Source Contribution: Releasing evaluation code and detailed confusion matrix analysis to ensure reproducibility
- Efficiency Analysis: Providing efficiency gain analysis in terms of parameter count, disk footprint, and throughput
The task is defined as five-class, single-label classification of medical literature abstracts, using a publicly available medical abstract corpus from Hugging Face. The categories and their shares of the corpus are:
- Neoplasms (21.91%)
- Digestive System Diseases (10.35%)
- Nervous System Diseases (13.33%)
- Cardiovascular Diseases (21.13%)
- General Pathological Conditions (33.28%)
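The class shares above are the usual starting point for a class-weighted loss. The sketch below is a minimal illustration, assuming inverse-frequency weights rescaled to a mean of 1. The summary does not say how the paper derived its weights, so `inverse_frequency_weights` and the normalization are hypothetical choices, not the authors' method.

```python
# Class shares as listed above (percent of the corpus).
CLASS_SHARES = {
    "Neoplasms": 21.91,
    "Digestive System Diseases": 10.35,
    "Nervous System Diseases": 13.33,
    "Cardiovascular Diseases": 21.13,
    "General Pathological Conditions": 33.28,
}


def inverse_frequency_weights(shares):
    """Return weights proportional to 1/share, rescaled so their mean is 1.

    Assumption: one common weighting scheme, not necessarily the paper's.
    """
    raw = {name: 1.0 / share for name, share in shares.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {name: value / mean_raw for name, value in raw.items()}


weights = inverse_frequency_weights(CLASS_SHARES)

# Ratio between the largest and smallest class, a rough measure of skew.
imbalance_ratio = max(CLASS_SHARES.values()) / min(CLASS_SHARES.values())

if __name__ == "__main__":
    for name, weight in weights.items():
        print(f"{name}: {weight:.2f}")
    print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The largest class is only about 3.2 times the size of the smallest, which is the kind of moderate skew under which re-weighting may have little room to help.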
Encoder Selection:
- BERT-base-uncased (~110M parameters)
- DistilBERT-base-uncased (~66M parameters)
Classification Head: Randomly initialized linear classification layer (hidden size 768, output size 5)
Loss Function Comparison:
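- Standard Cross-Entropy (CE): $\mathcal{L}_{\mathrm{CE}} = -\log p_t$
- Class-Weighted Cross-Entropy (WCE): $\mathcal{L}_{\mathrm{WCE}} = -w_t \log p_t$
- Focal Loss (FL): $\mathcal{L}_{\mathrm{FL}} = -\alpha_t (1 - p_t)^{\gamma} \log p_t$, where $\gamma = 2.0$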
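To make the three objectives concrete, here is a minimal per-example sketch in plain Python, where `probs` is a predicted probability vector and `target` is the index of the true class (so `probs[target]` plays the role of p_t). The per-class weight lists are placeholders: the summary gives only γ = 2.0, not the values of w_t or α_t.

```python
import math


def cross_entropy(probs, target):
    """CE: -log p_t, where p_t is the probability assigned to the true class."""
    return -math.log(probs[target])


def weighted_cross_entropy(probs, target, class_weights):
    """WCE: -w_t * log p_t, with one weight per class."""
    return -class_weights[target] * math.log(probs[target])


def focal_loss(probs, target, alpha, gamma=2.0):
    """FL: -alpha_t * (1 - p_t)^gamma * log p_t; gamma=2.0 as in the summary."""
    p_t = probs[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical 5-class prediction; the true class (index 1) gets p_t = 0.5.
probs = [0.2, 0.5, 0.1, 0.1, 0.1]
uniform = [1.0] * 5  # placeholder weights, not values from the paper

ce = cross_entropy(probs, 1)
wce = weighted_cross_entropy(probs, 1, uniform)
fl = focal_loss(probs, 1, uniform)
```

With uniform weights, WCE reduces to CE, and focal loss with γ = 0 reduces to WCE. With γ = 2 and p_t = 0.5, the focal term scales the loss by (1 − 0.5)² = 0.25, which shows how confidently classified examples are down-weighted.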
- Controlled Experimental Design: Keeping tokenizers, sequence length, optimizers, and schedulers fixed while varying only the loss function
- Deployment-Oriented Approach: Focusing on deployment-friendly preprocessing and fixed-length strategies
- Comprehensive Evaluation: Combining accuracy, Macro-F1, Weighted-F1, and confusion matrix analysis
- Source: Hugging Face medical abstract corpus
- Scale: 10,395 training abstracts, 1,155 validation abstracts, 2,888 test abstracts
- Preprocessing: Minimalist deployment-friendly preprocessing, preserving punctuation, 256-token truncation/padding
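The fixed-length strategy can be sketched as follows. This is a simplified illustration that operates on a list of token ids and ignores special tokens such as `[CLS]` and `[SEP]`, which a real tokenizer would insert and preserve. The helper name `pad_or_truncate` is hypothetical; the padding id of 0 matches the uncased BERT vocabularies.

```python
MAX_LEN = 256  # fixed sequence length from the setup above
PAD_ID = 0     # padding token id in the BERT uncased vocabularies


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = list(token_ids[:max_len])          # drop anything past max_len
    n_pad = max_len - len(kept)               # 0 when the input was long enough
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask
```

Every example then has the same shape, which keeps batching and latency predictable at the cost of discarding text beyond the 256-token limit.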
- Accuracy: Overall classification accuracy
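- Macro-F1: Unweighted mean of per-class F1 scores, so every class counts equally regardless of size (sensitive to minority-class errors)
- Weighted-F1: Mean of per-class F1 scores weighted by class support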
- Confusion Matrix: Detailed error pattern analysis
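The metrics above can all be derived from a single confusion matrix. The following is a minimal stdlib-only sketch, not the released evaluation code; the function names and toy labels are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts examples with true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def classification_metrics(y_true, y_pred, n_classes):
    """Return accuracy, Macro-F1, and Weighted-F1 as fractions in [0, 1]."""
    cm = confusion_matrix(y_true, y_pred, n_classes)
    total = len(y_true)
    f1_scores, supports = [], []
    for c in range(n_classes):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n_classes)) - tp  # column sum minus diagonal
        fn = sum(cm[c]) - tp                               # row sum minus diagonal
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
        supports.append(sum(cm[c]))
    accuracy = sum(cm[c][c] for c in range(n_classes)) / total
    macro_f1 = sum(f1_scores) / n_classes
    weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / total
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy labels (not from the paper): class 0 is the majority class.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
metrics = classification_metrics(y_true, y_pred, n_classes=3)
```

Because Macro-F1 averages per-class scores without regard to support, a weak minority class pulls it down more than it pulls down accuracy or Weighted-F1.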
Systematic comparison of six configurations:
- BERT-base + CE/WCE/FL
- DistilBERT + CE/WCE/FL
- Optimizer: AdamW, learning rate 2×10^-5
- Batch Size: 16
- Training Epochs: 3
- Sequence Length: 256 tokens
- Model Selection: Best checkpoint based on validation set Macro-F1
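The settings above imply a small, fixed training budget. The sketch below records them in a config object and works out the number of optimizer steps, assuming a single device, no gradient accumulation, and a retained final partial batch (none of which the summary states). `TrainConfig` and `select_best_checkpoint` are hypothetical names, and the validation scores are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 2e-5
    batch_size: int = 16
    epochs: int = 3
    max_seq_len: int = 256
    train_examples: int = 10_395  # training split size reported in the summary

    @property
    def steps_per_epoch(self):
        # Assumption: the final partial batch is kept rather than dropped.
        return math.ceil(self.train_examples / self.batch_size)

    @property
    def total_steps(self):
        return self.steps_per_epoch * self.epochs


def select_best_checkpoint(val_macro_f1_by_epoch):
    """Return the 1-based epoch with the highest validation Macro-F1."""
    best_index = max(range(len(val_macro_f1_by_epoch)),
                     key=val_macro_f1_by_epoch.__getitem__)
    return best_index + 1


config = TrainConfig()
# Hypothetical validation scores, one per epoch, for illustration only.
best_epoch = select_best_checkpoint([0.61, 0.64, 0.63])
```

Under these assumptions, 10,395 examples at batch size 16 give 650 steps per epoch and 1,950 steps over three epochs.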
| Model | Loss Function | Accuracy (%) | Macro-F1 (%) | Weighted-F1 (%) |
|---|---|---|---|---|
| DistilBERT | CE | 64.61 | 64.38 | 63.25 |
| BERT-base | CE | 64.51 | 63.85 | 62.12 |
| BERT-base | WCE | 62.88 | 62.43 | 59.66 |
| DistilBERT | WCE | 62.29 | 62.22 | 59.24 |
Observation 1 - Loss Function Selection: For both encoders, WCE and FL underperform compared to CE. The relative decline in Macro-F1 indicates that emphasizing difficult/minority samples does not translate to better global balance on this corpus.
Observation 2 - Encoder Selection: DistilBERT matches or slightly exceeds BERT-base despite substantial capacity reduction, supporting compact baselines as a strong default choice under computational or latency constraints.
Observation 3 - Stability: The ranking (DistilBERT+CE > BERT+CE > {WCE, FL}) remains consistent across multiple runs.
- Stable Classes: Classes 1 and 4 maintain robustness across various losses and encoders
- Fragile Classes: Class 5 exhibits recall deficits and overflow to Class 4
- Redistribution Rather Than Reduction: WCE/FL slightly redistribute errors among adjacent classes but rarely reduce the total global error count
- Parameter Reduction: DistilBERT has about 40% fewer parameters than BERT-base (~66M vs. ~110M)
- Disk Footprint: Smaller checkpoint file sizes
- Inference Speed: Lower cold-start latency
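The parameter figures can be turned into rough footprint estimates with simple arithmetic. The sketch below assumes weights stored as 32-bit floats and counts only raw weights; the summary reports no measured checkpoint sizes, so these are back-of-the-envelope numbers rather than results from the paper.

```python
BERT_BASE_PARAMS = 110e6   # approximate count from the summary
DISTILBERT_PARAMS = 66e6   # approximate count from the summary
BYTES_PER_PARAM_FP32 = 4   # assumption: weights stored as 32-bit floats


def relative_reduction(small, large):
    """Fraction of parameters saved by the smaller model."""
    return 1.0 - small / large


def weights_size_mb(n_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Approximate size of the raw weights in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6


reduction = relative_reduction(DISTILBERT_PARAMS, BERT_BASE_PARAMS)
bert_mb = weights_size_mb(BERT_BASE_PARAMS)
distil_mb = weights_size_mb(DISTILBERT_PARAMS)

# Linear classification head: 768 x 5 weight matrix plus 5 biases.
head_params = 768 * 5 + 5
```

This gives a 40% reduction, roughly 440 MB versus 264 MB of fp32 weights, and a classification head of 3,845 parameters, which is negligible next to either encoder.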
The field has evolved from feature engineering models to fine-tuned Transformers customized for scientific and biomedical texts, including SciBERT, BioBERT, and ClinicalBERT. Novel pre-training approaches are combining structured laboratory data with knowledge-guided learning.
Class imbalance is typically addressed through resampling or cost-sensitive losses (such as re-weighting and focal loss). This work finds that under moderate skew and label ambiguity, these methods may amplify noise and reduce accuracy.
Efficiency methods such as distillation (DistilBERT), pruning, and quantization are widely used to reduce computation and latency.
- Simplicity is Effective: DistilBERT with cross-entropy is a robust, computationally efficient baseline
- Loss Function Selection: Under moderate class skew, standard cross-entropy outperforms weighted variants
- Practical Pathway: Recommending starting with compact encoders and cross-entropy, then adding calibration and task-specific checks
- Dataset Limitations: Using only one public corpus may not generalize to clinical notes or radiology reports
- Domain Transfer Risk: Results may not transfer to other medical text types due to domain shift
- Calibration Issues: Addressing calibration only through post-hoc scaling; further validation required before clinical deployment
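One common form of post-hoc scaling is temperature scaling, in which validation logits are divided by a single scalar before the softmax. The sketch below assumes that technique (the summary does not name the specific method used) and fits the temperature with a plain grid search over validation negative log-likelihood. The logits and labels are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature minimizing validation NLL by a simple grid search."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical overconfident validation logits (3 classes), for illustration only.
val_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
val_labels = [0, 1, 1, 2]
temperature = fit_temperature(val_logits, val_labels)
```

In this toy case the model is confident on all four examples but wrong on one, so the fitted temperature comes out above 1 and softens the probabilities. Dividing by a positive scalar never changes the arg-max, so accuracy and F1 are unaffected; only the confidence estimates move.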
- Multimodal Extension: Extending to multimodal inputs from charts
- Safety Auditing: Building robust safety and bias audits
- Longitudinal Prediction: Extending from static abstracts to longitudinal prediction
- Federated Learning: Exploring federated learning under privacy and non-IID settings
- High Practical Value: Focusing on actual deployment needs, considering cost, latency, and privacy constraints
- Rigorous Experimental Design: Controlled experiments with all variables fixed except the objective function
- Comprehensive Analysis: Providing detailed confusion matrices and per-class analysis
- Reproducibility: Releasing evaluation code and detailed implementation details
- Balanced Perspective: Offering a balanced view between performance and efficiency
- Single Dataset: Validation on only one dataset limits generalizability
- Limited Model Scope: Comparing only two encoders without including domain-specific models
- Insufficient Hyperparameter Tuning: Using fixed hyperparameters may limit the performance of certain methods
- Lack of Statistical Significance Testing: No confidence intervals reported for multiple runs
- Practical Guidance Value: Providing practical model selection guidance for medical AI practitioners
- Baseline Establishment: Establishing reliable lightweight baselines for future research
- Cost Awareness: Emphasizing the importance of model selection in resource-constrained environments
- Resource-Constrained Medical Environments: On-premises deployment, high privacy protection requirements
- Real-Time Classification Needs: Applications requiring low-latency responses
- Prototype Development: Starting point for more complex systems
- Educational Research: Teaching and foundational research in medical NLP
This paper cites 43 relevant references covering medical AI, model compression, class imbalance handling, and other aspects, providing a solid theoretical foundation. Important references include the original DistilBERT paper, medical domain pre-trained models (BioBERT, SciBERT), and key technical literature on focal loss.
Overall Assessment: This is a highly practical paper that, while limited in technical novelty, provides valuable practical guidance for medical text classification. The paper's controlled experimental design and comprehensive analysis are commendable and hold significant reference value for practitioners needing to deploy NLP systems in resource-constrained environments.