Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default

Liu, Wang, Liu et al.
Large language models work well for many NLP tasks, but they are hard to deploy in health settings with strict cost, latency, and privacy limits. We revisit a lightweight recipe for medical abstract classification and ask how far compact encoders can go under a controlled budget. Using the public medical abstracts corpus, we fine-tune BERT-base and DistilBERT with three objectives (standard cross-entropy, class-weighted cross-entropy, and focal loss), keeping the tokenizer, sequence length, optimizer, and schedule fixed. DistilBERT with plain cross-entropy gives the best balance on the test set while using far fewer parameters than BERT-base. We report accuracy, Macro-F1, and Weighted-F1, release the evaluation code, and include confusion analyses to make error patterns clear. Our results suggest a practical default: start with a compact encoder and cross-entropy, then add calibration and task-specific checks before moving to heavier models.

Basic Information

  • Paper ID: 2510.10025
  • Title: Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default
  • Authors: Jiaqi Liu, Lanruo Wang, Su Liu, Xin Hu
  • Categories: cs.CL cs.AI
  • Publication Date: October 11, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10025

Abstract

Large language models demonstrate strong performance across many NLP tasks but are difficult to deploy in medical environments with strict cost, latency, and privacy constraints. This paper revisits lightweight solutions for medical abstract classification, exploring the performance limits of compact encoders under controlled budgets. Using publicly available medical abstract corpora, the authors fine-tune BERT-base and DistilBERT with three objective functions (standard cross-entropy, class-weighted cross-entropy, and focal loss) while keeping tokenizers, sequence length, optimizers, and schedulers fixed. Results demonstrate that DistilBERT combined with standard cross-entropy achieves the best balance on the test set while using significantly fewer parameters than BERT-base.

Research Background and Motivation

Problem Definition

With the rapid growth of biomedical literature, manual tracking has become infeasible, necessitating reliable automated systems for classification, triage, and summarization. While large language models demonstrate superior performance, their computational and memory costs limit their use in medical environments, particularly in scenarios with budget, latency, and privacy constraints (such as HIPAA compliance).

Research Motivation

  1. Practical Deployment Requirements: Medical pipelines typically operate under strict cost-of-service and governance requirements (on-premises deployment, air-gapped or VPC-restricted deployment)
  2. Efficiency-Performance Trade-off: Compact encoders often offer a better accuracy-efficiency trade-off and are easier to fine-tune and calibrate
  3. Baseline Establishment: Establishing clean baselines is valuable for future comparisons with domain-specialized encoders

Limitations of Existing Approaches

  • Large models incur high deployment costs and latency
  • Domain-adaptive pre-trained models (e.g., SciBERT, BioBERT) offer good performance but consume substantial resources
  • The effectiveness of class imbalance handling methods (resampling, cost-sensitive loss) remains insufficiently validated on medical texts

Core Contributions

  1. Establishing Lightweight Baselines: Systematically comparing the performance of BERT-base and DistilBERT on medical abstract classification tasks
  2. Loss Function Comparison: Comparing three loss functions (CE, WCE, FL) under controlled conditions
  3. Practical Guidance: Providing recommended deployment pathways: starting with compact encoders and cross-entropy
  4. Open-Source Contribution: Releasing evaluation code and detailed confusion matrix analysis to ensure reproducibility
  5. Efficiency Analysis: Providing efficiency gain analysis in terms of parameter count, disk footprint, and throughput

Methodology Details

Task Definition

The task is defined as a five-class, single-label classification problem over medical abstracts, using a publicly available medical abstract corpus from Hugging Face. The categories, with their class proportions, are:

  • Neoplasms (21.91%)
  • Digestive System Diseases (10.35%)
  • Nervous System Diseases (13.33%)
  • Cardiovascular Diseases (21.13%)
  • General Pathological Conditions (33.28%)
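
The class proportions above can be turned into approximate per-class counts and, for a class-weighted loss, into per-class weights. The following is a minimal sketch under two assumptions that the summary does not confirm: that the percentages describe the 10,395-abstract training split, and that weights are inverse-frequency values rescaled to a mean of 1. The function names are illustrative.

```python
# Class shares in percent, as listed in the task definition.
CLASS_SHARES = {
    "Neoplasms": 21.91,
    "Digestive System Diseases": 10.35,
    "Nervous System Diseases": 13.33,
    "Cardiovascular Diseases": 21.13,
    "General Pathological Conditions": 33.28,
}


def inverse_frequency_weights(shares):
    """Weight each class by 1/share, then rescale so the mean weight is 1.

    ASSUMPTION: inverse-frequency weighting is one common choice; it is not
    confirmed as the scheme used in the paper.
    """
    raw = {name: 1.0 / share for name, share in shares.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {name: value / mean_raw for name, value in raw.items()}


def approximate_counts(shares, n_examples):
    """Approximate per-class counts for a split with n_examples abstracts."""
    return {name: round(n_examples * share / 100.0)
            for name, share in shares.items()}


if __name__ == "__main__":
    weights = inverse_frequency_weights(CLASS_SHARES)
    # ASSUMPTION: the shares apply to the training split of 10,395 abstracts.
    counts = approximate_counts(CLASS_SHARES, 10_395)
    for name in CLASS_SHARES:
        print(f"{name:33s} count~{counts[name]:5d}  weight={weights[name]:.3f}")
```

Under this scheme the rarest class (Digestive System Diseases) receives the largest weight and the most common class (General Pathological Conditions) the smallest.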

Model Architecture

Encoder Selection:

  • BERT-base-uncased (~110M parameters)
  • DistilBERT-base-uncased (~66M parameters)

Classification Head: Randomly initialized linear classification layer (hidden size 768, output size 5)

Loss Function Comparison:

  1. Standard Cross-Entropy (CE): $L_{CE} = -\log p_t$
  2. Class-Weighted Cross-Entropy (WCE): $L_{WCE} = -w_t \log p_t$
  3. Focal Loss (FL): $L_{FL} = -\alpha_t (1 - p_t)^{\gamma} \log p_t$, where $\gamma = 2.0$
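
To make the three objectives concrete, here is a minimal per-example sketch in plain Python, where p_t is taken to be the softmax probability of the true class. It is an illustration of the formulas, not the authors' training code; the class weights and the alpha values are left as caller-supplied inputs because the summary does not report them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    """L_CE = -log p_t."""
    return -log_softmax(logits)[target]


def weighted_cross_entropy(logits, target, class_weights):
    """L_WCE = -w_t * log p_t, with w_t looked up by the true class."""
    return class_weights[target] * cross_entropy(logits, target)


def focal_loss(logits, target, alpha=None, gamma=2.0):
    """L_FL = -alpha_t * (1 - p_t)^gamma * log p_t.

    ASSUMPTION: alpha defaults to 1 for every class when not supplied.
    """
    log_p_t = log_softmax(logits)[target]
    p_t = math.exp(log_p_t)
    alpha_t = 1.0 if alpha is None else alpha[target]
    return -alpha_t * (1.0 - p_t) ** gamma * log_p_t


if __name__ == "__main__":
    logits = [2.0, 0.5, 0.1, -1.0, 0.3]  # hypothetical five-class logits
    for target in (0, 3):
        print(target,
              round(cross_entropy(logits, target), 4),
              round(focal_loss(logits, target), 4))
```

With gamma set to 0 and alpha left at 1, the focal loss reduces to plain cross-entropy; with gamma = 2.0 it down-weights examples the model already classifies confidently.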

Technical Innovations

  1. Controlled Experimental Design: Keeping tokenizers, sequence length, optimizers, and schedulers fixed while varying only the encoder and the loss function
  2. Deployment-Oriented Approach: Focusing on deployment-friendly preprocessing and fixed-length strategies
  3. Comprehensive Evaluation: Combining accuracy, Macro-F1, Weighted-F1, and confusion matrix analysis

Experimental Setup

Dataset

  • Source: Hugging Face medical abstract corpus
  • Scale: 10,395 training abstracts, 1,155 validation abstracts, 2,888 test abstracts
  • Preprocessing: Minimalist deployment-friendly preprocessing, preserving punctuation, 256-token truncation/padding
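
The fixed-length strategy can be sketched as a single pad-or-truncate step applied to every tokenized abstract. This is a simplified illustration, assuming a padding ID of 0 and ignoring special tokens such as [CLS] and [SEP], which a real tokenizer would preserve when truncating; the function name is hypothetical.

```python
def pad_or_truncate(token_ids, max_len=256, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_len long.

    ASSUMPTION: pad_id=0 and no special-token handling; a production
    tokenizer would keep its end-of-sequence marker when truncating.
    """
    ids = list(token_ids[:max_len])          # truncate long inputs
    mask = [1] * len(ids)                    # 1 marks real tokens
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


if __name__ == "__main__":
    short_ids, short_mask = pad_or_truncate(range(1, 11))
    long_ids, long_mask = pad_or_truncate(range(1, 401))
    print(len(short_ids), sum(short_mask))   # padded up to 256
    print(len(long_ids), sum(long_mask))     # truncated down to 256
```

Fixing the length this way keeps batch shapes constant, which simplifies deployment at the cost of discarding any tokens beyond position 256.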

Evaluation Metrics

  • Accuracy: Overall classification accuracy
  • Macro-F1: Unweighted mean of per-class F1 scores, so minority classes count as much as majority classes
  • Weighted-F1: Mean of per-class F1 scores weighted by class support
  • Confusion Matrix: Detailed error pattern analysis
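
The three scalar metrics can all be derived from the confusion matrix. The sketch below shows one way to compute them in plain Python; it is not the released evaluation code, and the toy labels in the usage section are invented for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts examples with true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_f1(cm):
    """F1 for each class: 2*TP / (2*TP + FP + FN), or 0 if undefined."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def summarize(y_true, y_pred, n_classes):
    """Return accuracy, Macro-F1, and Weighted-F1 as fractions in [0, 1]."""
    cm = confusion_matrix(y_true, y_pred, n_classes)
    total = len(y_true)
    f1 = per_class_f1(cm)
    support = [sum(row) for row in cm]
    return {
        "accuracy": sum(cm[k][k] for k in range(n_classes)) / total,
        "macro_f1": sum(f1) / n_classes,
        "weighted_f1": sum(f * s for f, s in zip(f1, support)) / total,
    }


if __name__ == "__main__":
    # Toy three-class example (invented labels, not from the paper).
    y_true = [0, 0, 0, 0, 1, 1, 2]
    y_pred = [0, 0, 0, 1, 1, 0, 2]
    print(summarize(y_true, y_pred, n_classes=3))
```

In the toy example Macro-F1 (0.75) exceeds Weighted-F1 (about 0.714) because the smallest class is predicted perfectly and counts equally in the macro average.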

Comparative Methods

Systematic comparison of six configurations:

  • BERT-base + CE/WCE/FL
  • DistilBERT + CE/WCE/FL

Implementation Details

  • Optimizer: AdamW, learning rate 2×10^-5
  • Batch Size: 16
  • Training Epochs: 3
  • Sequence Length: 256 tokens
  • Model Selection: Best checkpoint based on validation set Macro-F1
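
The setup above implies a small, fully enumerable experiment grid. The sketch below collects the reported hyperparameters, enumerates the six encoder and loss combinations, and derives the number of optimizer steps. It assumes the last partial batch is kept and that no gradient accumulation is used; the validation scores in the usage section are hypothetical placeholders.

```python
import math
from itertools import product

# Hyperparameters as reported in the summary.
CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "batch_size": 16,
    "epochs": 3,
    "max_seq_len": 256,
}
ENCODERS = ["bert-base-uncased", "distilbert-base-uncased"]
LOSSES = ["CE", "WCE", "FL"]


def steps_per_epoch(n_train, batch_size):
    """Optimizer steps per epoch. ASSUMPTION: the last partial batch is kept."""
    return math.ceil(n_train / batch_size)


def select_best_checkpoint(val_macro_f1_by_epoch):
    """Return the 1-based epoch with the highest validation Macro-F1.

    Ties resolve to the earliest epoch.
    """
    best_index = max(range(len(val_macro_f1_by_epoch)),
                     key=lambda i: val_macro_f1_by_epoch[i])
    return best_index + 1


if __name__ == "__main__":
    grid = list(product(ENCODERS, LOSSES))
    per_epoch = steps_per_epoch(10_395, CONFIG["batch_size"])
    print(len(grid), "configurations")
    print(per_epoch, "steps per epoch,", per_epoch * CONFIG["epochs"], "in total")
    # Hypothetical validation Macro-F1 values for one run.
    print("best epoch:", select_best_checkpoint([0.60, 0.64, 0.63]))
```

With 10,395 training abstracts and a batch size of 16, each run takes 650 steps per epoch, or 1,950 steps over three epochs.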

Experimental Results

Main Results

Model        Loss Function   Accuracy (%)   Macro-F1 (%)   Weighted-F1 (%)
DistilBERT   CE              64.61          64.38          63.25
BERT-base    CE              64.51          63.85          62.12
BERT-base    WCE             62.88          62.43          59.66
DistilBERT   WCE             62.29          62.22          59.24

Key Findings

Observation 1 - Loss Function Selection: For both encoders, WCE and FL underperform compared to CE. The relative decline in Macro-F1 indicates that emphasizing difficult/minority samples does not translate to better global balance on this corpus.
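
The relative decline mentioned in Observation 1 can be worked out directly from the Macro-F1 values in the results table. The sketch below covers only the CE and WCE rows, since the table as reproduced here lists no focal loss figures.

```python
# Macro-F1 (%) values copied from the results table.
MACRO_F1 = {
    ("DistilBERT", "CE"): 64.38,
    ("DistilBERT", "WCE"): 62.22,
    ("BERT-base", "CE"): 63.85,
    ("BERT-base", "WCE"): 62.43,
}


def absolute_drop(baseline, other):
    """Difference in percentage points."""
    return baseline - other


def relative_decline(baseline, other):
    """Drop expressed as a percentage of the baseline score."""
    return (baseline - other) / baseline * 100.0


if __name__ == "__main__":
    for encoder in ("DistilBERT", "BERT-base"):
        ce = MACRO_F1[(encoder, "CE")]
        wce = MACRO_F1[(encoder, "WCE")]
        print(f"{encoder}: -{absolute_drop(ce, wce):.2f} points, "
              f"{relative_decline(ce, wce):.1f}% relative")
```

Switching from CE to WCE costs DistilBERT 2.16 Macro-F1 points (about 3.4% relative) and BERT-base 1.42 points (about 2.2% relative).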

Observation 2 - Encoder Selection: DistilBERT matches or slightly exceeds BERT-base despite substantial capacity reduction, supporting compact baselines as a strong default choice under computational or latency constraints.

Observation 3 - Stability: The ranking (DistilBERT+CE > BERT+CE > {WCE, FL}) remains consistent across multiple runs.

Error Pattern Analysis

  • Stable Classes: Classes 1 and 4 maintain robustness across various losses and encoders
  • Fragile Classes: Class 5 exhibits recall deficits and overflow to Class 4
  • Redistribution Rather Than Reduction: WCE/FL slightly redistribute errors among adjacent classes but rarely reduce the total global error count
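
These error patterns are the kind of thing one reads off a confusion matrix. The sketch below shows how per-class recall, the largest off-diagonal confusions, and the total error count could be extracted. The 5x5 matrix is an invented toy example shaped to mimic the described Class 5 to Class 4 overflow; it is not data from the paper.

```python
def per_class_recall(cm):
    """Recall per class: correct predictions divided by row total."""
    return [row[k] / sum(row) if sum(row) else 0.0
            for k, row in enumerate(cm)]


def top_confusions(cm, k=3):
    """Largest off-diagonal cells as (count, true_class, predicted_class).

    Classes are reported 1-based to match the text.
    """
    cells = [(cm[t][p], t + 1, p + 1)
             for t in range(len(cm)) for p in range(len(cm)) if t != p]
    return sorted(cells, reverse=True)[:k]


def total_errors(cm):
    """Number of misclassified examples (everything off the diagonal)."""
    return sum(sum(row) for row in cm) - sum(cm[k][k] for k in range(len(cm)))


if __name__ == "__main__":
    # TOY matrix (rows = true class, columns = predicted class); invented.
    toy_cm = [
        [18, 0, 1, 0, 1],
        [1, 14, 0, 1, 4],
        [1, 0, 13, 1, 5],
        [0, 1, 0, 17, 2],
        [2, 2, 2, 8, 6],
    ]
    print([round(r, 2) for r in per_class_recall(toy_cm)])
    print(top_confusions(toy_cm, k=2))
    print(total_errors(toy_cm))
```

Comparing `total_errors` across loss functions, alongside the top confusions, distinguishes a loss that merely moves errors between neighboring classes from one that actually removes them.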

Efficiency Gains

  • Parameter Reduction: DistilBERT has about 40% fewer parameters than BERT-base (~66M vs ~110M)
  • Disk Footprint: Smaller checkpoint file sizes
  • Inference Speed: Lower cold-start latency
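
The parameter reduction, and a rough sense of the disk footprint, follow from simple arithmetic. The sketch below uses the approximate parameter counts given for the two encoders and assumes 32-bit weights with no optimizer state stored; actual checkpoint sizes depend on the serialization format.

```python
BERT_BASE_PARAMS = 110_000_000   # approximate, as stated for BERT-base
DISTILBERT_PARAMS = 66_000_000   # approximate, as stated for DistilBERT


def parameter_reduction(small, large):
    """Fraction of parameters removed when going from large to small."""
    return 1.0 - small / large


def checkpoint_size_mb(n_params, bytes_per_param=4):
    """Rough weight-only size in megabytes.

    ASSUMPTION: 4 bytes per parameter (fp32), no optimizer state.
    """
    return n_params * bytes_per_param / 1e6


if __name__ == "__main__":
    reduction = parameter_reduction(DISTILBERT_PARAMS, BERT_BASE_PARAMS)
    print(f"reduction: {reduction:.0%}")
    print(f"BERT-base  ~{checkpoint_size_mb(BERT_BASE_PARAMS):.0f} MB")
    print(f"DistilBERT ~{checkpoint_size_mb(DISTILBERT_PARAMS):.0f} MB")
```

Under these assumptions the weights alone come to roughly 440 MB for BERT-base and 264 MB for DistilBERT.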

Related Work

Medical Text Classification

The field has evolved from feature engineering models to fine-tuned Transformers customized for scientific and biomedical texts, including SciBERT, BioBERT, and ClinicalBERT. Novel pre-training approaches are combining structured laboratory data with knowledge-guided learning.

Class Imbalance Handling

Class imbalance is typically addressed through resampling or cost-sensitive losses (such as re-weighting and focal loss). This work finds that under moderate skew and label ambiguity, these methods may amplify noise and reduce accuracy.

Model Efficiency

Efficiency methods such as distillation (DistilBERT), pruning, and quantization are widely used to reduce computation and latency.

Conclusions and Discussion

Main Conclusions

  1. Simplicity is Effective: DistilBERT with cross-entropy is a robust, computationally efficient baseline
  2. Loss Function Selection: Under moderate class skew, standard cross-entropy outperforms weighted variants
  3. Practical Pathway: Recommending starting with compact encoders and cross-entropy, then adding calibration and task-specific checks

Limitations

  1. Dataset Limitations: Using only one public corpus may not generalize to clinical notes or radiology reports
  2. Domain Transfer Risk: Results may not transfer to other medical text types due to domain shift
  3. Calibration Issues: Addressing calibration only through post-hoc scaling; further validation required before clinical deployment
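
One common form of post-hoc scaling is temperature scaling, where a single scalar T divides the logits and is chosen to minimize negative log-likelihood on held-out data. The sketch below assumes that this is the kind of scaling meant; the summary does not name the method, and the grid search and toy logits are illustrative choices.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood at a given temperature."""
    total = sum(log_softmax(row, temperature)[y]
                for row, y in zip(logit_rows, labels))
    return -total / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL.

    ASSUMPTION: a coarse grid from 0.5 to 5.0 is enough for illustration.
    """
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


if __name__ == "__main__":
    # Toy validation set: confident logits, but one of four is wrong.
    logit_rows = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0],
                  [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
    labels = [0, 0, 1, 1]
    t = fit_temperature(logit_rows, labels)
    print("fitted temperature:", round(t, 2))
    print("NLL before:", round(mean_nll(logit_rows, labels, 1.0), 4))
    print("NLL after: ", round(mean_nll(logit_rows, labels, t), 4))
```

A fitted temperature above 1 softens overconfident probabilities without changing the predicted class, so accuracy and F1 are unaffected while the confidence scores become more trustworthy.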

Future Directions

  1. Multimodal Extension: Extending to multimodal inputs from charts
  2. Safety Auditing: Building robust safety and bias audits
  3. Longitudinal Prediction: Extending from static abstracts to longitudinal prediction
  4. Federated Learning: Exploring federated learning under privacy and non-IID settings

In-Depth Evaluation

Strengths

  1. High Practical Value: Focusing on actual deployment needs, considering cost, latency, and privacy constraints
  2. Rigorous Experimental Design: Controlled experiments with all variables fixed except the objective function
  3. Comprehensive Analysis: Providing detailed confusion matrices and per-class analysis
  4. Reproducibility: Releasing evaluation code and detailed implementation details
  5. Balanced Perspective: Offering a balanced view between performance and efficiency

Weaknesses

  1. Single Dataset: Validation on only one dataset limits generalizability
  2. Limited Model Scope: Comparing only two encoders without including domain-specific models
  3. Insufficient Hyperparameter Tuning: Using fixed hyperparameters may limit the performance of certain methods
  4. Lack of Statistical Significance Testing: No confidence intervals reported for multiple runs

Impact

  1. Practical Guidance Value: Providing practical model selection guidance for medical AI practitioners
  2. Baseline Establishment: Establishing reliable lightweight baselines for future research
  3. Cost Awareness: Emphasizing the importance of model selection in resource-constrained environments

Applicable Scenarios

  1. Resource-Constrained Medical Environments: On-premises deployment, high privacy protection requirements
  2. Real-Time Classification Needs: Applications requiring low-latency responses
  3. Prototype Development: Starting point for more complex systems
  4. Educational Research: Teaching and foundational research in medical NLP

References

This paper cites 43 relevant references covering medical AI, model compression, class imbalance handling, and other aspects, providing a solid theoretical foundation. Important references include the original DistilBERT paper, medical domain pre-trained models (BioBERT, SciBERT), and key technical literature on focal loss.


Overall Assessment: This is a highly practical paper that, while limited in technical novelty, provides valuable practical guidance for medical text classification. The paper's controlled experimental design and comprehensive analysis are commendable and hold significant reference value for practitioners needing to deploy NLP systems in resource-constrained environments.