ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling

Licht, Ketabi, Khalvati

Topic modeling is a useful tool for analyzing large corpora of written documents, particularly academic papers. Despite a wide variety of proposed topic modeling techniques, these techniques do not perform well when applied to medical texts. This can be due to the low number of documents available for some topics in the healthcare domain. In this paper, we propose ProtoTopic, a prototypical network-based topic model used for topic generation for a set of medical paper abstracts. Prototypical networks are efficient, explainable models that make predictions by computing distances between input datapoints and a set of prototype representations, making them particularly effective in low-data or few-shot learning scenarios. With ProtoTopic, we demonstrate improved topic coherence and diversity compared to two topic modeling baselines used in the literature, demonstrating the ability of our model to generate medically relevant topics even with limited data.

Basic Information

Paper ID: 2510.13542
Title: ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling
Authors: Martin Licht, Sara Ketabi, Farzad Khalvati
Category: cs.LG (Machine Learning)
Publication Date: October 15, 2025
Paper Link: https://arxiv.org/abs/2510.13542v1

Abstract

Topic modeling is a useful tool for analyzing large document corpora, particularly academic papers. Although various topic modeling techniques exist, they perform poorly when applied to medical texts, potentially due to the scarcity of available documents for certain topics in the healthcare domain. This paper proposes ProtoTopic, a topic model based on prototypical networks for topic generation from medical paper abstracts. Prototypical networks are efficient and interpretable models that make predictions by computing distances between input data points and a set of prototype representations, proving particularly effective in low-data or few-shot learning scenarios. Through ProtoTopic, the authors demonstrate improved topic coherence and diversity compared to two baseline topic modeling approaches from the literature, validating the model's capability to generate medically relevant topics even with limited data.

Research Background and Motivation

Problem Definition

Core Problem: Existing topic modeling techniques perform poorly on medical texts, particularly in data-scarce scenarios
Significance: The rapid growth of medical literature necessitates effective topic modeling tools to help researchers and clinicians quickly screen and locate relevant information
Limitations of Existing Methods:
- Insufficient training data: High-quality training data is scarce in clinical environments
- Lack of interpretability: Most state-of-the-art models are black-box systems
- Specificity of medical terminology: Medical texts contain specialized terminology and format variations

Research Motivation

NLP applications in healthcare face three major challenges: data scarcity, lack of interpretability, and the specificity of medical terminology. Prototypical networks can effectively learn in few-shot scenarios while providing interpretability, making them an ideal choice for medical topic modeling.

Core Contributions

First Application of Prototypical Networks to Topic Modeling: Developed ProtoTopic, specifically designed for topic modeling of medical abstracts
Comprehensive Performance Evaluation: Compared ProtoTopic against two baseline models from the literature (LDA and BERTopic)
Multi-Topic Number Analysis: Investigated the impact of different topic numbers (25, 50, 100) on model performance
Statistical Significance Validation: Verified ProtoTopic's significant advantages over baselines through t-tests

Methodology Details

Task Definition

Input: Collection of medical paper abstracts
Output: Topic clustering results and representative keywords for each topic
Objective: Generate high-coherence, high-diversity medical topics in few-shot scenarios

Model Architecture

1. Text Embedding Generation

Two Transformer models are used to generate text embeddings:

PubMedBERT: A BERT variant specifically trained on medical papers, generating 768-dimensional vectors
all-MiniLM-L6-v2: A general-purpose sentence Transformer, generating 384-dimensional vectors

2. K-means Clustering

K-means clustering is applied to embedding vectors to generate pseudo-labels:

Documents are assigned to K clusters
Each document's cluster assignment serves as its pseudo-label for training the prototypical network
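
The pseudo-labeling step can be sketched as follows. This is a minimal, pure-Python version of Lloyd's algorithm, not the authors' implementation; the function names, the deterministic farthest-first initialization, and the toy 2-D vectors are assumptions made for illustration (the paper's embeddings have 384 or 768 dimensions).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def initial_centers(embeddings, k):
    """Farthest-first initialization (an assumption; chosen to be deterministic)."""
    centers = [list(embeddings[0])]
    while len(centers) < k:
        farthest = max(
            embeddings,
            key=lambda v: min(squared_distance(v, c) for c in centers),
        )
        centers.append(list(farthest))
    return centers


def kmeans_pseudo_labels(embeddings, k, iterations=20):
    """Run Lloyd's algorithm and return one cluster index per embedding."""
    centers = initial_centers(embeddings, k)
    labels = [0] * len(embeddings)
    for _ in range(iterations):
        # Assignment step: each document takes the index of its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centers[j]))
            for v in embeddings
        ]
        # Update step: move each center to the mean of its assigned documents.
        for j in range(k):
            members = [v for v, lab in zip(embeddings, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean_vector(members)
    return labels


# Toy 2-D "embeddings" forming two well-separated groups (illustrative data only).
toy_embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
pseudo_labels = kmeans_pseudo_labels(toy_embeddings, k=2)
```

The returned list holds one cluster index per document, and these indices are what a classifier would be trained against in place of human-annotated topic labels.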

3. Prototypical Network Training

The core algorithm is based on Snell et al.'s prototypical networks:

Prototype Computation: $c_k = \frac{1}{|S_k|} \sum_{(x_i,y_i) \in S_k} f_\phi(x_i)$

where $S_k$ is the support set for class k, and $f_\phi$ is the embedding function.

Classification Probability: $p_\phi(y=k|x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$

Loss Function: $J(\phi) = -\log p_\phi(y=k|x)$
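
The three formulas above can be traced on a toy example. The sketch below assumes squared Euclidean distance for $d$ (the choice made by Snell et al.; the summary does not state which distance ProtoTopic uses) and treats the vectors as already-embedded outputs of $f_\phi$. All names and data are illustrative.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance, assumed here as the metric d."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def compute_prototypes(support):
    """Map each class k to c_k, the mean of its embedded support points."""
    prototypes = {}
    for label, vectors in support.items():
        n = len(vectors)
        prototypes[label] = [sum(column) / n for column in zip(*vectors)]
    return prototypes


def class_probabilities(query, prototypes):
    """Softmax over negative distances from the query to every prototype."""
    logits = {k: -squared_distance(query, c) for k, c in prototypes.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - max_logit) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def prototypical_loss(query, true_label, prototypes):
    """Negative log-probability of the true class for one query point."""
    return -math.log(class_probabilities(query, prototypes)[true_label])


# Two classes with two embedded support points each (toy values).
support = {0: [[0.0, 0.0], [0.0, 2.0]], 1: [[4.0, 0.0], [4.0, 2.0]]}
prototypes = compute_prototypes(support)
probs = class_probabilities([1.0, 1.0], prototypes)
loss = prototypical_loss([1.0, 1.0], 0, prototypes)
```

Here the query lies at squared distance 1 from the class-0 prototype and 9 from the class-1 prototype, so nearly all probability mass goes to class 0 and the loss for label 0 is close to zero.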

4. Keyword Extraction

Class-based TF-IDF (c-TF-IDF) is used to extract representative keywords for each topic. All documents in a topic are treated as a single document, so term frequency is computed per topic. The inverse-frequency term is then based on how often a word occurs across all topics, rather than on the proportion of individual documents that contain it.
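
A minimal sketch of c-TF-IDF scoring is given below. It assumes the weighting used by BERTopic, $W_{t,c} = \mathrm{tf}_{t,c} \cdot \log(1 + A / f_t)$, where $A$ is the average number of words per topic and $f_t$ is the frequency of term $t$ across all topics; whether ProtoTopic uses exactly this variant is an assumption. The function names and toy tokens are illustrative.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """class_tokens maps topic id -> tokens of all documents in that topic."""
    counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    total_frequency = Counter()
    for counter in counts.values():
        total_frequency.update(counter)
    avg_words = sum(len(t) for t in class_tokens.values()) / len(class_tokens)
    scores = {}
    for c, counter in counts.items():
        class_size = sum(counter.values())
        scores[c] = {
            # Per-topic term frequency times the class-based inverse frequency.
            term: (freq / class_size) * math.log(1 + avg_words / total_frequency[term])
            for term, freq in counter.items()
        }
    return scores


def top_keywords(scores, n):
    """Highest-scoring n terms per topic; ties are broken alphabetically."""
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for c, s in scores.items()
    }


toy_topics = {
    0: "patients knee injury knee ligament patients".split(),
    1: "patients insulin glucose insulin diabetes patients".split(),
}
keywords = top_keywords(c_tf_idf(toy_topics), n=2)
```

In this toy corpus "patients" is frequent in both topics, so its inverse-frequency weight is low and it drops out of the top keywords of each topic.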

Technical Innovations

Few-Shot Learning Capability: Enables learning effective topic representations with minimal samples through prototypical networks
Interpretability: Provides explanations by displaying the most similar prototype instances
Domain Adaptability: Combines medical-specific embeddings (PubMedBERT) with general embeddings for comparison
Episodic Training: Each episode contains 5 classes with 5 support samples and 5 query points per class
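
The episode layout described above (5 classes, 5 support samples and 5 query points per class) can be sketched as a sampler over pseudo-labeled documents. The function name, its parameters, and the toy labels are hypothetical; the paper's actual sampling code is not shown in this summary.

```python
import random


def sample_episode(labels, n_way=5, n_support=5, n_query=5, seed=None):
    """Return (support, query) dicts mapping class -> list of document indices."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Only classes with enough documents can supply support and query points.
    eligible = [c for c, idx in by_class.items() if len(idx) >= n_support + n_query]
    classes = rng.sample(eligible, n_way)
    support, query = {}, {}
    for c in classes:
        chosen = rng.sample(by_class[c], n_support + n_query)
        support[c] = chosen[:n_support]   # used to compute the class prototype
        query[c] = chosen[n_support:]     # used to compute the loss
    return support, query


# Toy pseudo-labels: 160 documents spread evenly over 8 clusters.
toy_labels = [i % 8 for i in range(160)]
support, query = sample_episode(toy_labels, seed=0)
```

Each episode therefore touches 5 × (5 + 5) = 50 documents, with support and query sets kept disjoint within every class.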

Experimental Setup

Dataset

Dataset: PubMed200k RCT
Scale: 200,000 randomized controlled trial abstracts, 2.3 million sentences
Preprocessing:
- Removal of non-alphabetic characters
- Conversion to lowercase
- Text tokenization
- Removal of high-frequency stop words (e.g., "the", "and", "of")
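
The four preprocessing steps can be sketched in a few lines. The stop-word set below is a small illustrative stand-in (the paper's actual list is not given in this summary), and whitespace splitting is assumed as the tokenizer.

```python
import re

# Illustrative stop-word set; the list used in the paper is not specified here.
STOP_WORDS = {"the", "and", "of", "a", "in", "to", "with", "was", "were"}


def preprocess(text, stop_words=STOP_WORDS):
    """Strip non-alphabetic characters, lowercase, tokenize, drop stop words."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove non-alphabetic characters
    tokens = text.lower().split()              # lowercase, then whitespace-tokenize
    return [t for t in tokens if t not in stop_words]


tokens = preprocess("The efficacy of Drug-X (10 mg) was compared with placebo.")
```

Note that removing non-alphabetic characters also splits hyphenated names and drops dosages, which may matter for medical text.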

Evaluation Metrics

Topic Coherence: Uses the $C_V$ coherence metric, which scores how often a topic's keywords co-occur in the corpus
Topic Diversity: Extracts the top 25 keywords for each topic and calculates the percentage of unique words across all topic keywords
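
Topic diversity as defined above reduces to a ratio of unique keywords to total keywords. The sketch below uses a hypothetical function name and three toy topics with three keywords each instead of the 25 used in the paper.

```python
def topic_diversity(topic_keywords, top_n=25):
    """Fraction of unique words among the top_n keywords of all topics."""
    top_words = [word for words in topic_keywords for word in words[:top_n]]
    return len(set(top_words)) / len(top_words)


toy_keywords = [
    ["knee", "injury", "ligament"],
    ["insulin", "glucose", "patients"],
    ["knee", "patients", "stroke"],
]
diversity = topic_diversity(toy_keywords, top_n=3)
```

With 7 unique words among 9 keywords, the toy diversity is 7/9; a value of 1.0 would mean no keyword is shared between topics.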

Baseline Methods

LDA (Latent Dirichlet Allocation): Classical probabilistic topic model
BERTopic: Neural topic model based on BERT embeddings

Implementation Details

Optimizer: Adam with a learning rate of 0.00005 (5e-5)
Training Configuration: 50 episodes/epoch, 10 epochs total
Hardware: Google Colab T4 GPU (15GB RAM)
Parameter Freezing: All layers of pre-trained Transformers frozen except the last two layers

Experimental Results

Main Results

Quantitative Results

25 Topics:

Model	Coherence Score	Topic Diversity
LDA	0.4910	40.8%
BERTopic	0.5137	49.6%
ProtoTopic (all-MiniLM)	0.5396	84.5%
ProtoTopic (PubMedBERT)	0.5754	86.1%

50 Topics:

Model	Coherence Score	Topic Diversity
LDA	0.5017	43.8%
BERTopic	0.5394	54.5%
ProtoTopic (all-MiniLM)	0.6789	73.5%
ProtoTopic (PubMedBERT)	0.6734	75.9%

100 Topics:

Model	Coherence Score	Topic Diversity
LDA	0.5090	55.6%
BERTopic	0.6173	58.0%
ProtoTopic (all-MiniLM)	0.7173	58.6%
ProtoTopic (PubMedBERT)	0.7117	61.2%
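
To read the three tables side by side, the absolute gains of ProtoTopic (PubMedBERT) over BERTopic can be computed directly from the reported values. The snippet only restates numbers from the tables above; the variable names are illustrative.

```python
# (coherence, diversity in %) per topic count, copied from the tables above.
bertopic = {25: (0.5137, 49.6), 50: (0.5394, 54.5), 100: (0.6173, 58.0)}
prototopic_pubmedbert = {25: (0.5754, 86.1), 50: (0.6734, 75.9), 100: (0.7117, 61.2)}

gains = {}
for n_topics, (coherence, diversity) in prototopic_pubmedbert.items():
    base_coherence, base_diversity = bertopic[n_topics]
    gains[n_topics] = (
        round(coherence - base_coherence, 4),   # absolute coherence gain
        round(diversity - base_diversity, 1),   # gain in percentage points
    )

for n_topics, (coherence_gain, diversity_gain) in sorted(gains.items()):
    print(f"{n_topics} topics: +{coherence_gain:.4f} coherence, "
          f"+{diversity_gain:.1f} points diversity")
```

The diversity advantage narrows from 36.5 points at 25 topics to 3.2 points at 100 topics, while the coherence advantage is largest at 50 topics.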

Statistical Significance

T-tests (p < 0.00001) confirm that ProtoTopic significantly outperforms BERTopic on both coherence and diversity metrics.

Qualitative Results Analysis

Topic Specificity Comparison

BERTopic: Generates overly generic keywords (e.g., "patients", "median", "overall"), lacking discriminative power
ProtoTopic: Generates highly specific keywords, avoiding generic terms, such as specialized terminology for lower limb injuries

Trend Analysis

Coherence Trend: Topic coherence increases with the number of topics for all models
Diversity Trend:
- Baseline models: Diversity increases with topic number
- ProtoTopic: Diversity decreases with topic number (from 86.1% at 25 topics to 61.2% at 100 topics with PubMedBERT embeddings)

Related Work

Topic Modeling Evolution

Probabilistic Models: LDA employs bag-of-words assumptions, ignoring word order
Neural Models:
- LDA2VEC: Combines Word2Vec embeddings
- ETM: Uses CBOW embeddings
- BERTopic: Based on BERT embeddings

Few-Shot Learning

Optimization Methods: Meta-learning algorithms such as MAML
Metric Methods:
- Siamese Networks
- Matching Networks
- Relation Networks
- Prototypical Networks

Prototypical Network Applications

Computer Vision: Image classification tasks
NLP Domain: ProSeNet, ProtoryNet, ProtoSeq and other text classification applications

Conclusions and Discussion

Main Conclusions

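ProtoTopic outperforms both baseline models on coherence and diversity at every topic count tested (25, 50, 100)
Strong performance is achieved even with general-purpose embeddings: the all-MiniLM-L6-v2 variant beats both baselines at every topic count and slightly exceeds the PubMedBERT variant's coherence at 50 and 100 topics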
The model generates medically relevant and interpretable topics

Limitations

Loss Function: Uses only basic prototypical network loss without considering cluster tightness and inter-prototype distances
Clustering Algorithm: Only K-means is explored; other methods like HDBSCAN are not investigated
Dimensionality Reduction: The effects of dimensionality reduction on high-dimensional embeddings are not explored
User Evaluation: Lacks subjective evaluation from clinical practitioners

Future Directions

Improve loss function design
Explore different clustering techniques
Investigate the impact of dimensionality reduction
Conduct clinical user studies

In-Depth Evaluation

Strengths

Strong Innovation: First application of prototypical networks to topic modeling tasks
Comprehensive Experiments: Thorough comparisons across multiple embedding models and topic numbers
Statistical Rigor: Provides statistical significance testing
High Practical Value: Addresses data scarcity issues in the medical domain
Good Interpretability: Prototypical networks provide intuitive explanation mechanisms

Weaknesses

Single Dataset: Validation only on the PubMed200k RCT dataset
Limited Evaluation Dimensions: Lacks human evaluation and downstream task assessment
Computational Complexity Not Analyzed: No computational efficiency comparison with baselines provided
Hyperparameter Sensitivity: Insufficient analysis of key hyperparameter impacts

Impact

Academic Contribution: Provides a new topic modeling paradigm for medical NLP
Practical Value: Applicable to medical literature analysis and clinical decision support
Reproducibility: Uses public datasets with detailed experimental settings

Applicable Scenarios

Medical Literature Analysis: Helps researchers quickly understand large volumes of medical papers
Clinical Knowledge Discovery: Identifies disease patterns from limited cases
Cross-Domain Extension: Generalizable to other data-scarce specialized domains

References

This paper cites 45 relevant references covering topic modeling, few-shot learning, and prototypical networks, providing a solid theoretical foundation. Key references include:

Snell et al. (2017): Prototypical Networks for Few-Shot Learning
Grootendorst (2022): BERTopic neural topic modeling
Blei et al. (2003): Latent Dirichlet Allocation

Overall Assessment: This paper presents an innovative and practical medical topic modeling method with significant value in addressing data scarcity issues. The experimental design is sound, results are convincing, and it makes meaningful contributions to the medical NLP field.