Topic modeling is a useful tool for analyzing large corpora of written documents, particularly academic papers. Despite a wide variety of proposed topic modeling techniques, these techniques do not perform well when applied to medical texts. This can be due to the low number of documents available for some topics in the healthcare domain. In this paper, we propose ProtoTopic, a prototypical network-based topic model used for topic generation for a set of medical paper abstracts. Prototypical networks are efficient, explainable models that make predictions by computing distances between input datapoints and a set of prototype representations, making them particularly effective in low-data or few-shot learning scenarios. With ProtoTopic, we demonstrate improved topic coherence and diversity compared to two topic modeling baselines used in the literature, demonstrating the ability of our model to generate medically relevant topics even with limited data.
- Paper ID: 2510.13542
- Title: ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling
- Authors: Martin Licht, Sara Ketabi, Farzad Khalvati
- Category: cs.LG (Machine Learning)
- Publication Date: October 15, 2025
- Paper Link: https://arxiv.org/abs/2510.13542v1
Topic modeling is a useful tool for analyzing large document corpora, particularly academic papers. Although various topic modeling techniques exist, they perform poorly when applied to medical texts, potentially due to the scarcity of available documents for certain topics in the healthcare domain. This paper proposes ProtoTopic, a topic model based on prototypical networks for topic generation from medical paper abstracts. Prototypical networks are efficient and interpretable models that make predictions by computing distances between input data points and a set of prototype representations, proving particularly effective in low-data or few-shot learning scenarios. Through ProtoTopic, the authors demonstrate improved topic coherence and diversity compared to two baseline topic modeling approaches from the literature, validating the model's capability to generate medically relevant topics even with limited data.
- Core Problem: Existing topic modeling techniques perform poorly on medical texts, particularly in data-scarce scenarios
- Significance: The rapid growth of medical literature necessitates effective topic modeling tools to help researchers and clinicians quickly screen and locate relevant information
- Limitations of Existing Methods:
  - Insufficient training data: High-quality training data is scarce in clinical environments
  - Lack of interpretability: Most state-of-the-art models are black-box systems
  - Specificity of medical terminology: Medical texts contain specialized terminology and format variations
NLP applications in healthcare face three major challenges: data scarcity, lack of interpretability, and the specificity of medical terminology. Prototypical networks can effectively learn in few-shot scenarios while providing interpretability, making them an ideal choice for medical topic modeling.
- First Application of Prototypical Networks to Topic Modeling: Developed ProtoTopic, specifically designed for topic modeling of medical abstracts
- Comprehensive Performance Evaluation: Compared ProtoTopic against two topic modeling baselines from the literature (LDA and BERTopic)
- Multi-Topic Number Analysis: Investigated the impact of different topic numbers (25, 50, 100) on model performance
- Statistical Significance Validation: Verified ProtoTopic's significant advantages over baselines through t-tests
Input: Collection of medical paper abstracts
Output: Topic clustering results and representative keywords for each topic
Objective: Generate high-coherence, high-diversity medical topics in few-shot scenarios
Two Transformer models are used to generate text embeddings:
- PubMedBERT: A BERT variant specifically trained on medical papers, generating 768-dimensional vectors
- all-MiniLM-L6-v2: A general-purpose sentence Transformer, generating 384-dimensional vectors
K-means clustering is applied to embedding vectors to generate pseudo-labels:
- Documents are assigned to K clusters
- Each document's cluster assignment serves as its pseudo-label for training the prototypical network
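The pseudo-labeling step can be sketched as follows. This is a minimal illustration in plain Python, assuming Lloyd's algorithm with random initialization and 2-D toy vectors in place of the real 768- or 384-dimensional embeddings; the function names and toy data are hypothetical, and the paper's actual K-means settings are not stated here.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pseudo_labels(vectors, k, iterations=20, seed=0):
    """Cluster vectors with Lloyd's algorithm.

    Returns (labels, centers), where labels[i] is the pseudo-label of vectors[i].
    """
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]  # random initial centers
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector takes the index of its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy 2-D "embeddings" forming two well-separated groups (illustrative only).
toy_embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = kmeans_pseudo_labels(toy_embeddings, k=2)
print(labels)
```

The returned `labels` list plays the role of the class labels that the prototypical network is later trained on.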
The core algorithm is based on Snell et al.'s prototypical networks:
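Prototype Computation:
$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$$
where $S_k$ is the support set for class $k$, and $f_\phi$ is the embedding function.
Classification Probability:
$$p_\phi(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$$
Loss Function:
$$J(\phi) = -\log p_\phi(y = k \mid x)$$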
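To make the three formulas above concrete, here is a small sketch in plain Python. It assumes that the inputs are already embedded (so the embedding function is treated as the identity) and that the distance d is the squared Euclidean distance, as in Snell et al.; the function names and toy vectors are hypothetical.

```python
import math

def squared_euclidean(a, b):
    """Distance d between an embedded point and a prototype."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def compute_prototypes(support_embeddings, support_labels):
    """Prototype c_k: mean of the embedded support points of class k."""
    grouped = {}
    for vec, label in zip(support_embeddings, support_labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def class_probabilities(query_embedding, prototypes):
    """Softmax over negative distances from the query to every prototype."""
    neg_dists = {k: -squared_euclidean(query_embedding, c)
                 for k, c in prototypes.items()}
    max_val = max(neg_dists.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - max_val) for k, v in neg_dists.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def prototypical_loss(query_embedding, true_label, prototypes):
    """Negative log-probability of the true class for one query point."""
    return -math.log(class_probabilities(query_embedding, prototypes)[true_label])

# Two classes with two support points each (toy values).
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = [0, 0, 1, 1]
prototypes = compute_prototypes(support, support_labels)
probs = class_probabilities([0.0, 1.5], prototypes)
loss = prototypical_loss([0.0, 1.5], 0, prototypes)
print(prototypes, probs, loss)
```

A query that lies close to the class-0 prototype receives almost all of the probability mass for class 0, so its loss is near zero. In training, this loss would be averaged over the query points of an episode and backpropagated through the embedding network.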
Class-based TF-IDF (c-TF-IDF) is used to extract representative keywords for each topic. This method redefines term frequency as the percentage of a word's occurrence across all groups, rather than the proportion of groups containing that word.
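As an illustration of keyword extraction, the sketch below assumes the c-TF-IDF weighting given in the BERTopic paper (Grootendorst, 2022): the frequency of a term within a class is multiplied by log(1 + A / f), where A is the average number of words per class and f is the term's frequency across all classes. Whether ProtoTopic uses exactly this variant is an assumption; the function names and toy tokens are hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(class_tokens):
    """Score terms per class.

    class_tokens maps a topic id to the tokens of all documents in that topic.
    Returns a dict: topic id -> {term: score}.
    """
    class_counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class.
    avg_words = sum(len(t) for t in class_tokens.values()) / len(class_tokens)
    return {
        c: {term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()}
        for c, counts in class_counts.items()
    }

def top_keywords(scores, n=3):
    """Return the n highest-scoring terms per topic (ties broken alphabetically)."""
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for c, s in scores.items()
    }

# Toy topics; "patients" appears in both, so it is down-weighted.
toy_topics = {
    "cardiology": "patients heart heart failure heart statin".split(),
    "oncology": "patients tumor tumor chemotherapy tumor".split(),
}
scores = c_tf_idf(toy_topics)
keywords = top_keywords(scores, n=1)
print(keywords)
```

Terms shared across topics, such as "patients" here, score lower than terms concentrated in a single topic, which is the behavior that makes the extracted keywords topic-specific.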
- Few-Shot Learning Capability: Enables learning effective topic representations with minimal samples through prototypical networks
- Interpretability: Provides explanations by displaying the most similar prototype instances
- Domain Adaptability: Combines medical-specific embeddings (PubMedBERT) with general embeddings for comparison
- Episodic Training: Each episode contains 5 classes with 5 support samples and 5 query points per class
- Dataset: PubMed200k RCT
- Scale: 200,000 randomized controlled trial abstracts, 2.3 million sentences
- Preprocessing:
  - Removal of non-alphabetic characters
  - Conversion to lowercase
  - Text tokenization
  - Removal of high-frequency stop words (e.g., "the", "and", "of")
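The preprocessing steps listed above could look roughly like this. The sketch assumes whitespace tokenization and a tiny hand-picked stop-word set; both are illustrative assumptions, since the tokenizer and stop-word list actually used are not specified here.

```python
import re

# Small illustrative stop-word set (hypothetical; a real list would be much longer).
STOP_WORDS = {"the", "and", "of", "in", "was"}

def preprocess(text):
    """Apply the four listed steps in order and return the remaining tokens."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # 1. remove non-alphabetic characters
    text = text.lower()                        # 2. convert to lowercase
    tokens = text.split()                      # 3. tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # 4. drop stop words

tokens = preprocess("The efficacy of Drug-X (10 mg) was assessed in 120 patients.")
print(tokens)
```

Note that stripping non-alphabetic characters also splits hyphenated names and removes dosages, which may or may not be desirable for medical text.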
- Topic Coherence: Uses the CV metric to analyze co-occurrence of topic keywords in the corpus
- Topic Diversity: Extracts the top 25 keywords for each topic and calculates the percentage of unique words across all topic keywords
- LDA (Latent Dirichlet Allocation): Classical probabilistic topic model
- BERTopic: Neural topic model based on BERT embeddings
- Optimizer: Adam with a learning rate of 0.00005
- Training Configuration: 50 episodes/epoch, 10 epochs total
- Hardware: Google Colab T4 GPU (15GB RAM)
- Parameter Freezing: All layers of the pre-trained Transformer are frozen except the last two
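The episodic setup described above (5 classes per episode, each with 5 support samples and 5 query points) together with the schedule of 50 episodes per epoch for 10 epochs can be sketched as below. The sampler, its name, and the toy pseudo-labels are hypothetical; the actual training step (embedding, prototype computation, backpropagation) is only indicated by a comment.

```python
import random

def sample_episode(labels, n_way=5, n_support=5, n_query=5, rng=random):
    """Sample one episode.

    Returns (support, query): dicts mapping each sampled class to a list of
    document indices. Only classes with enough documents are eligible.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    eligible = [c for c, idxs in by_class.items()
                if len(idxs) >= n_support + n_query]
    classes = rng.sample(eligible, n_way)
    support, query = {}, {}
    for c in classes:
        chosen = rng.sample(by_class[c], n_support + n_query)
        support[c] = chosen[:n_support]   # used to compute the prototype
        query[c] = chosen[n_support:]     # used to compute the loss
    return support, query

rng = random.Random(0)
# Toy pseudo-labels: 25 clusters with 20 documents each.
pseudo_labels = [i % 25 for i in range(500)]

episodes_per_epoch, epochs = 50, 10
n_episodes = 0
for epoch in range(epochs):
    for _ in range(episodes_per_epoch):
        support, query = sample_episode(pseudo_labels, rng=rng)
        n_episodes += 1
        # A real training step would embed these documents, build prototypes
        # from the support set, and backpropagate the loss on the query set.
print(n_episodes)
```

With these settings the model sees 500 episodes in total, each involving 25 support and 25 query documents.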
25 Topics:
| Model | Coherence Score | Topic Diversity |
|---|---|---|
| LDA | 0.4910 | 40.8% |
| BERTopic | 0.5137 | 49.6% |
| ProtoTopic (all-MiniLM) | 0.5396 | 84.5% |
| ProtoTopic (PubMedBERT) | 0.5754 | 86.1% |
50 Topics:
| Model | Coherence Score | Topic Diversity |
|---|---|---|
| LDA | 0.5017 | 43.8% |
| BERTopic | 0.5394 | 54.5% |
| ProtoTopic (all-MiniLM) | 0.6789 | 73.5% |
| ProtoTopic (PubMedBERT) | 0.6734 | 75.9% |
100 Topics:
| Model | Coherence Score | Topic Diversity |
|---|---|---|
| LDA | 0.5090 | 55.6% |
| BERTopic | 0.6173 | 58.0% |
| ProtoTopic (all-MiniLM) | 0.7173 | 58.6% |
| ProtoTopic (PubMedBERT) | 0.7117 | 61.2% |
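The Topic Diversity column in the tables above is the share of unique words among the top keywords of all topics. A minimal sketch of that calculation follows; the function name and the toy keyword lists are hypothetical, and the toy example uses 3 keywords per topic instead of 25 to stay readable.

```python
def topic_diversity(topic_keywords, top_n=25):
    """Fraction of unique words among the top_n keywords of every topic."""
    top_words = [word for keywords in topic_keywords for word in keywords[:top_n]]
    return len(set(top_words)) / len(top_words)

# Two toy topics sharing one keyword: 5 unique words out of 6.
toy_topic_keywords = [
    ["knee", "ligament", "fracture"],
    ["insulin", "glucose", "fracture"],
]
diversity = topic_diversity(toy_topic_keywords, top_n=3)
print(f"{diversity:.1%}")
```

A value near 100% means the topics barely share keywords, while a low value means many topics repeat the same generic terms.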
T-tests (p < 0.00001) confirm that ProtoTopic significantly outperforms BERTopic on both coherence and diversity metrics.
- BERTopic: Generates overly generic keywords (e.g., "patients", "median", "overall"), lacking discriminative power
- ProtoTopic: Generates highly specific keywords, avoiding generic terms, such as specialized terminology for lower limb injuries
- Coherence Trend: Topic coherence increases with the number of topics for all models
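- Diversity Trend:
  - Baseline models: Diversity increases with topic number (LDA from 40.8% to 55.6%, BERTopic from 49.6% to 58.0%)
  - ProtoTopic: Diversity decreases with topic number (from 86.1% to 61.2% with PubMedBERT, and from 84.5% to 58.6% with all-MiniLM)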
- Probabilistic Models: LDA relies on the bag-of-words assumption and therefore ignores word order
- Neural Models:
  - LDA2VEC: Combines LDA with Word2Vec embeddings
  - ETM: Uses CBOW embeddings
  - BERTopic: Based on BERT embeddings
- Optimization Methods: Meta-learning algorithms such as MAML
- Metric Methods:
  - Siamese Networks
  - Matching Networks
  - Relation Networks
  - Prototypical Networks
- Computer Vision: Image classification tasks
- NLP Domain: ProSeNet, ProtoryNet, ProtoSeq and other text classification applications
- ProtoTopic outperforms baseline models on all evaluation metrics
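- Strong performance is achieved even with general-purpose embeddings: the all-MiniLM-L6-v2 variant slightly exceeds the PubMedBERT variant in coherence at 50 topics (0.6789 vs. 0.6734) and at 100 topics (0.7173 vs. 0.7117)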
- The model generates medically relevant and interpretable topics
- Loss Function: Uses only the basic prototypical network loss, without terms for cluster tightness or inter-prototype distance
- Clustering Algorithm: Only K-means is explored; other methods like HDBSCAN are not investigated
- Dimensionality Reduction: The effects of dimensionality reduction on high-dimensional embeddings are not explored
- User Evaluation: Lacks subjective evaluation from clinical practitioners
- Improve loss function design
- Explore different clustering techniques
- Investigate the impact of dimensionality reduction
- Conduct clinical user studies
- Strong Innovation: First application of prototypical networks to topic modeling tasks
- Comprehensive Experiments: Thorough comparisons across multiple embedding models and topic numbers
- Statistical Rigor: Provides statistical significance testing
- High Practical Value: Addresses data scarcity issues in the medical domain
- Good Interpretability: Prototypical networks provide intuitive explanation mechanisms
- Single Dataset: Validated only on the PubMed200k RCT dataset
- Limited Evaluation Dimensions: Lacks human evaluation and downstream task assessment
- Computational Complexity Not Analyzed: No computational efficiency comparison with baselines provided
- Hyperparameter Sensitivity: Insufficient analysis of key hyperparameter impacts
- Academic Contribution: Provides a new topic modeling paradigm for medical NLP
- Practical Value: Applicable to medical literature analysis and clinical decision support
- Reproducibility: Uses public datasets with detailed experimental settings
- Medical Literature Analysis: Helps researchers quickly understand large volumes of medical papers
- Clinical Knowledge Discovery: Identifies disease patterns from limited cases
- Cross-Domain Extension: Generalizable to other data-scarce specialized domains
This paper cites 45 relevant references covering topic modeling, few-shot learning, and prototypical networks, providing a solid theoretical foundation. Key references include:
- Snell et al. (2017): Prototypical Networks for Few-Shot Learning
- Grootendorst (2022): BERTopic neural topic modeling
- Blei et al. (2003): Latent Dirichlet Allocation
Overall Assessment: This paper presents an innovative and practical medical topic modeling method with significant value in addressing data scarcity issues. The experimental design is sound, results are convincing, and it makes meaningful contributions to the medical NLP field.