Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default

Liu, Wang, Liu et al.

Large language models work well for many NLP tasks, but they are hard to deploy in health settings with strict cost, latency, and privacy limits. We revisit a lightweight recipe for medical abstract classification and ask how far compact encoders can go under a controlled budget. Using the public medical abstracts corpus, we fine-tune BERT base and DistilBERT with three objectives (standard cross-entropy, class-weighted cross-entropy, and focal loss), keeping tokenizer, sequence length, optimizer, and schedule fixed. DistilBERT with plain cross-entropy gives the best balance on the test set while using far fewer parameters than BERT base. We report accuracy, Macro F1, and Weighted F1, release the evaluation code, and include confusion analyses to make error patterns clear. Our results suggest a practical default: start with a compact encoder and cross-entropy, then add calibration and task-specific checks before moving to heavier models.

Basic Information

Paper ID: 2510.10025
Title: Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default
Authors: Jiaqi Liu, Lanruo Wang, Su Liu, Xin Hu
Categories: cs.CL cs.AI
Publication Date: October 11, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10025

Abstract

Large language models demonstrate strong performance across many NLP tasks but are difficult to deploy in medical environments with strict cost, latency, and privacy constraints. This paper revisits lightweight solutions for medical abstract classification, exploring the performance limits of compact encoders under a controlled budget. Using a publicly available medical abstracts corpus, the authors fine-tune BERT-base and DistilBERT with three objective functions (standard cross-entropy, class-weighted cross-entropy, and focal loss) while keeping the tokenizer, sequence length, optimizer, and schedule fixed. DistilBERT with standard cross-entropy achieves the best balance on the test set while using significantly fewer parameters than BERT-base.

Research Background and Motivation

Problem Definition

With the rapid growth of biomedical literature, manual tracking has become infeasible, necessitating reliable automated systems for classification, triage, and summarization. While large language models demonstrate superior performance, their computational and memory costs limit their use in medical environments, particularly in scenarios with budget, latency, and privacy constraints (such as HIPAA compliance).

Research Motivation

Practical Deployment Requirements: Medical pipelines typically operate under strict cost-of-service and governance requirements (on-premises deployment, air-gapped or VPC-restricted deployment)
Efficiency-Performance Trade-off: Compact encoders often offer a better accuracy-efficiency trade-off and are easier to fine-tune and calibrate
Baseline Establishment: Establishing clean baselines is valuable for future comparisons with domain-specialized encoders

Limitations of Existing Approaches

Large models incur high deployment costs and latency
Domain-adaptive pre-trained models (e.g., SciBERT, BioBERT) offer good performance but consume substantial resources
The effectiveness of class imbalance handling methods (resampling, cost-sensitive loss) remains insufficiently validated on medical texts

Core Contributions

Establishing Lightweight Baselines: Systematically comparing the performance of BERT-base and DistilBERT on medical abstract classification tasks
Loss Function Comparison: Comparing three loss functions (CE, WCE, FL) under controlled conditions
Practical Guidance: Providing recommended deployment pathways: starting with compact encoders and cross-entropy
Open-Source Contribution: Releasing evaluation code and detailed confusion matrix analysis to ensure reproducibility
Efficiency Analysis: Providing efficiency gain analysis in terms of parameter count, disk footprint, and throughput

Methodology Details

Task Definition

The task is framed as five-class, single-label classification of medical abstracts, using a publicly available medical abstracts corpus from Hugging Face. The categories and their class shares are:

Neoplasms (21.91%)
Digestive System Diseases (10.35%)
Nervous System Diseases (13.33%)
Cardiovascular Diseases (21.13%)
General Pathological Conditions (33.28%)

Model Architecture

Encoder Selection:

BERT-base-uncased (~110M parameters)
DistilBERT-base-uncased (~66M parameters)

Classification Head: Randomly initialized linear classification layer (hidden size 768, output size 5)

Loss Function Comparison:

Standard Cross-Entropy (CE): $L_{CE} = -\log p_t$
Class-Weighted Cross-Entropy (WCE): $L_{WCE} = -w_t \log p_t$
Focal Loss (FL): $L_{FL} = -\alpha_t(1-p_t)^{\gamma} \log p_t$ , where $\gamma=2.0$
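
The three objectives above can be compared on a single example with a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the summary does not say how the weights $w_t$ and $\alpha_t$ are chosen, so the inverse-frequency scheme below (normalised to mean 1, built from the class shares listed under Task Definition) is an assumption, and the function names are illustrative.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """L_CE = -log p_t"""
    return -math.log(probs[target])

def weighted_cross_entropy(probs, target, class_weights):
    """L_WCE = -w_t log p_t"""
    return -class_weights[target] * math.log(probs[target])

def focal_loss(probs, target, alpha, gamma=2.0):
    """L_FL = -alpha_t (1 - p_t)^gamma log p_t, with gamma = 2.0 as in the paper."""
    p_t = probs[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)

# Class shares from the task definition, in the order listed there.
CLASS_SHARES = [0.2191, 0.1035, 0.1333, 0.2113, 0.3328]

def inverse_frequency_weights(shares):
    """ASSUMPTION: weights proportional to 1/share, rescaled to mean 1."""
    raw = [1.0 / s for s in shares]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

if __name__ == "__main__":
    weights = inverse_frequency_weights(CLASS_SHARES)
    probs = softmax([2.0, 0.5, 0.1, 0.1, 0.3])  # made-up logits for one abstract
    target = 1                                   # true class: the rarest category
    print("CE :", cross_entropy(probs, target))
    print("WCE:", weighted_cross_entropy(probs, target, weights))
    print("FL :", focal_loss(probs, target, weights))
```

With all weights set to 1, WCE reduces to CE, and with $\gamma = 0$ focal loss reduces to WCE, which makes the three objectives easy to sanity-check against each other.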

Technical Innovations

Controlled Experimental Design: Keeping tokenizers, sequence length, optimizers, and schedulers fixed while varying only the loss function
Deployment-Oriented Approach: Focusing on deployment-friendly preprocessing and fixed-length strategies
Comprehensive Evaluation: Combining accuracy, Macro-F1, Weighted-F1, and confusion matrix analysis

Experimental Setup

Dataset

Source: Hugging Face medical abstract corpus
Scale: 10,395 training abstracts, 1,155 validation abstracts, 2,888 test abstracts
Preprocessing: Minimalist deployment-friendly preprocessing, preserving punctuation, 256-token truncation/padding
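
The fixed-length strategy can be illustrated on already-tokenized input. This is a simplified sketch: it assumes token IDs are given as a plain list and uses 0 as the padding ID (the [PAD] ID in the BERT uncased vocabulary). Unlike a real tokenizer, it cuts the sequence at 256 without re-appending a closing special token, and `pad_or_truncate` is a hypothetical helper name.

```python
MAX_LEN = 256   # sequence length reported in the paper
PAD_ID = 0      # ASSUMPTION: padding token ID, as in the BERT uncased vocabulary

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])        # truncate long abstracts
    mask = [1] * len(ids)                  # 1 marks real tokens
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

if __name__ == "__main__":
    short_ids, short_mask = pad_or_truncate(list(range(1, 11)))
    long_ids, long_mask = pad_or_truncate(list(range(1, 301)))
    print(len(short_ids), sum(short_mask))
    print(len(long_ids), sum(long_mask))
```

Fixing the length keeps batch shapes constant, which makes latency and memory use predictable at deployment time.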

Evaluation Metrics

Accuracy: Overall classification accuracy
Macro-F1: Unweighted mean of the per-class F1 scores, so minority classes count as much as majority classes
Weighted-F1: Mean of the per-class F1 scores weighted by class support
Confusion Matrix: Detailed error pattern analysis
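
All three scalar metrics can be derived from the confusion matrix alone. The sketch below shows one way to do this in plain Python; it is not the released evaluation code, and the 2x2 matrix in the demo is made-up data chosen only to keep the arithmetic easy to follow.

```python
def per_class_stats(cm):
    """cm[i][j] = number of examples with true class i predicted as class j.

    Returns a list of (precision, recall, f1, support) per class.
    """
    n = len(cm)
    stats = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                          # true k, predicted as other
        fp = sum(cm[i][k] for i in range(n)) - tp     # other, predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats.append((precision, recall, f1, tp + fn))
    return stats

def summarize(cm):
    """Accuracy, Macro-F1 (equal class weight), Weighted-F1 (support weight)."""
    stats = per_class_stats(cm)
    total = sum(s[3] for s in stats)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return {
        "accuracy": correct / total,
        "macro_f1": sum(s[2] for s in stats) / len(stats),
        "weighted_f1": sum(s[2] * s[3] for s in stats) / total,
    }

if __name__ == "__main__":
    toy_cm = [[8, 2],
              [1, 9]]   # illustrative counts, not from the paper
    print(summarize(toy_cm))
```

Because Macro-F1 gives every class the same weight, a recall deficit in one class lowers it even when overall accuracy barely moves, which is why it is used here for model selection.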

Comparative Methods

Systematic comparison of six configurations:

BERT-base + CE/WCE/FL
DistilBERT + CE/WCE/FL

Implementation Details

Optimizer: AdamW, learning rate 2×10^-5
Batch Size: 16
Training Epochs: 3
Sequence Length: 256 tokens
Model Selection: Best checkpoint based on validation set Macro-F1
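
The settings above imply a small, easily reproduced training budget. The sketch below collects them in a config object and works out the number of optimizer steps. It assumes a single device, no gradient accumulation, a final partial batch that is kept, and one validation pass per epoch; none of these details are stated in this summary, and the class and function names are illustrative.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 2e-5    # AdamW learning rate
    batch_size: int = 16
    epochs: int = 3
    max_seq_len: int = 256
    train_examples: int = 10_395   # training split size

    def steps_per_epoch(self) -> int:
        # ASSUMPTION: the last, smaller batch is not dropped.
        return math.ceil(self.train_examples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs

def select_best_epoch(val_macro_f1):
    """Index of the epoch with the highest validation Macro-F1.

    Ties resolve to the earliest epoch.
    """
    return max(range(len(val_macro_f1)), key=lambda i: val_macro_f1[i])

if __name__ == "__main__":
    cfg = TrainConfig()
    print(cfg.steps_per_epoch(), cfg.total_steps())
    # Hypothetical validation scores, one per epoch:
    print(select_best_epoch([0.61, 0.64, 0.63]))
```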

Experimental Results

Main Results

Model	Loss Function	Accuracy (%)	Macro-F1 (%)	Weighted-F1 (%)
DistilBERT	CE	64.61	64.38	63.25
BERT-base	CE	64.51	63.85	62.12
BERT-base	WCE	62.88	62.43	59.66
DistilBERT	WCE	62.29	62.22	59.24
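
The gaps between cross-entropy and weighted cross-entropy can be read directly off the table above. The short sketch below computes them per encoder and metric, using only the four rows listed in the table (focal-loss rows are not listed there, so they are not included); the dictionary layout and function name are illustrative.

```python
# (model, loss) -> (accuracy, macro_f1, weighted_f1), in percent, from the table
RESULTS = {
    ("DistilBERT", "CE"):  (64.61, 64.38, 63.25),
    ("BERT-base", "CE"):   (64.51, 63.85, 62.12),
    ("BERT-base", "WCE"):  (62.88, 62.43, 59.66),
    ("DistilBERT", "WCE"): (62.29, 62.22, 59.24),
}
METRICS = ("accuracy", "macro_f1", "weighted_f1")

def ce_minus_wce(model, metric):
    """Percentage-point drop when switching from CE to WCE for one encoder."""
    idx = METRICS.index(metric)
    drop = RESULTS[(model, "CE")][idx] - RESULTS[(model, "WCE")][idx]
    return round(drop, 2)

if __name__ == "__main__":
    for model in ("DistilBERT", "BERT-base"):
        for metric in METRICS:
            print(model, metric, ce_minus_wce(model, metric))
```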

Key Findings

Observation 1 - Loss Function Selection: For both encoders, WCE and FL underperform compared to CE. The relative decline in Macro-F1 indicates that emphasizing difficult/minority samples does not translate to better global balance on this corpus.

Observation 2 - Encoder Selection: DistilBERT matches or slightly exceeds BERT-base despite substantial capacity reduction, supporting compact baselines as a strong default choice under computational or latency constraints.

Observation 3 - Stability: The ranking (DistilBERT+CE > BERT+CE > {WCE, FL}) remains consistent across multiple runs.

Error Pattern Analysis

Stable Classes: Classes 1 and 4 maintain robustness across various losses and encoders
Fragile Classes: Class 5 exhibits recall deficits and overflow to Class 4
Redistribution Rather Than Reduction: WCE/FL slightly redistribute errors among adjacent classes but rarely reduce the total global error count

Efficiency Gains

Parameter Reduction: DistilBERT has about 40% fewer parameters than BERT-base (~66M vs ~110M)
Disk Footprint: Smaller checkpoint file sizes
Inference Speed: Lower cold-start latency
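
The parameter figures translate into a rough disk-footprint estimate. The sketch below reproduces the 40% reduction and estimates checkpoint size under the assumption that weights are stored as 32-bit floats with no optimizer state. These sizes are back-of-the-envelope estimates, not measurements from the paper.

```python
BERT_BASE_PARAMS = 110e6     # approximate, as reported
DISTILBERT_PARAMS = 66e6     # approximate, as reported

def relative_reduction(small, large):
    """Fraction of parameters removed when going from large to small."""
    return 1.0 - small / large

def checkpoint_size_mb(n_params, bytes_per_param=4):
    """ASSUMPTION: fp32 weights only (4 bytes each), 1 MB = 1e6 bytes."""
    return n_params * bytes_per_param / 1e6

if __name__ == "__main__":
    print(relative_reduction(DISTILBERT_PARAMS, BERT_BASE_PARAMS))
    print(checkpoint_size_mb(DISTILBERT_PARAMS), checkpoint_size_mb(BERT_BASE_PARAMS))
```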

Related Work

Medical Text Classification

The field has evolved from feature engineering models to fine-tuned Transformers customized for scientific and biomedical texts, including SciBERT, BioBERT, and ClinicalBERT. Novel pre-training approaches are combining structured laboratory data with knowledge-guided learning.

Class Imbalance Handling

Class imbalance is typically addressed through resampling or cost-sensitive losses (such as re-weighting and focal loss). This work finds that under moderate skew and label ambiguity, these methods may amplify noise and reduce accuracy.

Model Efficiency

Efficiency methods such as distillation (DistilBERT), pruning, and quantization are widely used to reduce computation and latency.

Conclusions and Discussion

Main Conclusions

Simplicity is Effective: DistilBERT with cross-entropy is a robust, computationally efficient baseline
Loss Function Selection: Under moderate class skew, standard cross-entropy outperforms weighted variants
Practical Pathway: Recommending starting with compact encoders and cross-entropy, then adding calibration and task-specific checks

Limitations

Dataset Limitations: Using only one public corpus may not generalize to clinical notes or radiology reports
Domain Transfer Risk: Results may not transfer to other medical text types due to domain shift
Calibration Issues: Addressing calibration only through post-hoc scaling; further validation required before clinical deployment
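
One common form of post-hoc scaling is temperature scaling, where validation logits are divided by a single scalar T chosen to minimise negative log-likelihood. The summary does not say which scaling method the authors used, so the sketch below is an assumption offered for illustration only. It uses a simple grid search in plain Python, and the demo logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid (default 0.5 to 5.0) with the lowest NLL."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

if __name__ == "__main__":
    # Overconfident toy model: large margins, but one of four predictions is wrong.
    val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
    val_labels = [0, 0, 1, 1]
    t = fit_temperature(val_logits, val_labels)
    print(t, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, t))
```

Dividing by T leaves the predicted class unchanged and only softens or sharpens the confidence, so accuracy and F1 are unaffected by this step.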

Future Directions

Multimodal Extension: Extending to multimodal inputs from charts
Safety Auditing: Building robust safety and bias audits
Longitudinal Prediction: Extending from static abstracts to longitudinal prediction
Federated Learning: Exploring federated learning under privacy and non-IID settings

In-Depth Evaluation

Strengths

High Practical Value: Focusing on actual deployment needs, considering cost, latency, and privacy constraints
Rigorous Experimental Design: Controlled experiments with all variables fixed except the objective function
Comprehensive Analysis: Providing detailed confusion matrices and per-class analysis
Reproducibility: Releasing evaluation code and detailed implementation details
Balanced Perspective: Offering a balanced view between performance and efficiency

Weaknesses

Single Dataset: Validation on only one dataset limits generalizability
Limited Model Scope: Comparing only two encoders without including domain-specific models
Insufficient Hyperparameter Tuning: Using fixed hyperparameters may limit the performance of certain methods
Lack of Statistical Significance Testing: No confidence intervals reported for multiple runs

Impact

Practical Guidance Value: Providing practical model selection guidance for medical AI practitioners
Baseline Establishment: Establishing reliable lightweight baselines for future research
Cost Awareness: Emphasizing the importance of model selection in resource-constrained environments

Applicable Scenarios

Resource-Constrained Medical Environments: On-premises deployment, high privacy protection requirements
Real-Time Classification Needs: Applications requiring low-latency responses
Prototype Development: Starting point for more complex systems
Educational Research: Teaching and foundational research in medical NLP

References

This paper cites 43 relevant references covering medical AI, model compression, class imbalance handling, and other aspects, providing a solid theoretical foundation. Important references include the original DistilBERT paper, medical domain pre-trained models (BioBERT, SciBERT), and key technical literature on focal loss.

Overall Assessment: This is a highly practical paper that, while limited in technical novelty, provides valuable practical guidance for medical text classification. The paper's controlled experimental design and comprehensive analysis are commendable and hold significant reference value for practitioners needing to deploy NLP systems in resource-constrained environments.