Ontologies have become essential in today's digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluated using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.
- Paper ID: 2510.13839
- Title: Meronymic Ontology Extraction via Large Language Models
- Authors: Dekai Zhang (Imperial College London), Simone Conia (Sapienza University of Rome), Antonio Rago (Imperial College London & King's College London)
- Classification: cs.CL cs.AI
- Publication Date: October 11, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.13839
This paper uses recent advances in large language models (LLMs) to build a fully automated method for extracting product ontologies, in the form of part-whole (meronymic) relationships, from raw review texts. The ontologies it produces score higher than those of an existing BERT-based baseline when rated by an LLM judge. The work lays a foundation for broader use of LLMs in ontology extraction.
In the digital age, massive volumes of unstructured textual data require organization and structuring through ontologies. Particularly in e-commerce, countless product listings require appropriate product organizational structures. Part-whole relationships (meronymic relations) hold significant value in downstream tasks such as review aggregation, sentiment analysis, and product question-answering.
- High Manual Construction Costs: Manual ontology construction is a time-consuming, expensive, and labor-intensive process
- Insufficient Automation Methods: Previous research has primarily focused on extracting taxonomic relations rather than part-whole relationships
- Evaluation Difficulties: Lack of standard benchmark datasets makes it difficult to effectively evaluate the quality of part-whole ontologies
- Dependence on Manual Annotation: Existing methods such as the BERT approach by Oksanen et al. (2021) still require a certain degree of manual annotation
This paper aims to leverage the powerful capabilities of LLMs to develop a fully automated method for part-whole ontology extraction and propose a novel evaluation framework to validate the method's effectiveness.
- Proposes Fully Automated LLM Method: Develops a completely automated method using LLMs for part-whole ontology extraction that generalizes across different product categories
- Innovative Evaluation Framework: Introduces a novel approach using LLM-as-a-judge for empirical evaluation of various tasks in part-whole ontology extraction
- Performance Improvement Verification: Demonstrates through experiments that the LLM method significantly outperforms BERT-based baseline methods in relevance
- Open-Source Code: Provides complete implementation code to promote research reproducibility
Input: Product review texts
Output: Part-whole ontology graph containing concept nodes and "part-whole" relationships between them
Constraints: Relationships must be meaningful part-whole relations, and concepts must be product-relevant
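To make the output format concrete, a part-whole ontology can be held as a directed graph from wholes to their parts. The sketch below is illustrative only: the class name, its methods, and the cycle check are assumptions introduced here, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MeronymyOntology:
    """Directed graph storing, for each whole, the set of its direct parts."""

    product: str
    parts_of: dict[str, set[str]] = field(default_factory=dict)

    def add_concept(self, concept: str) -> None:
        self.parts_of.setdefault(concept, set())

    def add_part(self, part: str, whole: str) -> None:
        """Record that `part` is a part of `whole`, rejecting cycles (assumed rule)."""
        if part == whole or whole in self.all_parts(part):
            raise ValueError(f"{part!r} -> {whole!r} would create a cycle")
        self.add_concept(part)
        self.add_concept(whole)
        self.parts_of[whole].add(part)

    def all_parts(self, whole: str) -> set[str]:
        """Return the direct and indirect parts of `whole`."""
        seen: set[str] = set()
        stack = list(self.parts_of.get(whole, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parts_of.get(node, ()))
        return seen


onto = MeronymyOntology(product="watch")
onto.add_part("strap", "watch")
onto.add_part("buckle", "strap")
```

Here `onto.all_parts("watch")` contains both `strap` and `buckle`, since part-whole edges are followed transitively.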
The proposed method comprises a four-stage pipeline:
Aspect Extraction:
- Method: Fine-tuning Mistral-7B-Instruct-v0.2
- Training Data: SemEval-2014 Task 4 dataset (1,600 samples)
- Post-processing: POS tagging filtering to retain only nouns actually appearing in reviews
- Output Control: Selection of top 50 most frequent aspects
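The post-processing and output-control steps above can be sketched as a small filter. This is a minimal sketch under assumptions: `select_aspects` and the `is_noun` hook are hypothetical names, `is_noun` stands in for whichever POS tagger is used, and only single-word aspects are handled.

```python
import re
from collections import Counter
from typing import Callable, Iterable


def select_aspects(
    reviews: Iterable[tuple[str, list[str]]],
    is_noun: Callable[[str], bool],
    top_k: int = 50,
) -> list[str]:
    """Keep proposed aspects that are nouns present in their review, ranked by frequency.

    `reviews` pairs each review text with the aspects the model proposed for it.
    `is_noun` is a placeholder for a POS tagger (assumption, not the paper's code).
    """
    counts: Counter[str] = Counter()
    for text, aspects in reviews:
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        for aspect in aspects:
            term = aspect.lower().strip()
            # Terms absent from the review text are treated as hallucinated and dropped.
            if term in tokens and is_noun(term):
                counts[term] += 1
    return [term for term, _ in counts.most_common(top_k)]


toy_nouns = {"screen", "remote", "sound"}
reviews = [
    ("The screen is bright and the remote feels cheap.", ["screen", "remote", "bright"]),
    ("Great screen, poor sound.", ["screen", "sound", "speaker"]),
]
top = select_aspects(reviews, is_noun=toy_nouns.__contains__, top_k=2)
```

In this toy run, `bright` is dropped because it is not a noun and `speaker` is dropped because it never appears in the review it was proposed for.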
Synset Extraction:
- Embedding Model: Fine-tuned FastText model (handles spelling errors and abbreviations)
- Clustering Algorithm: Equidistant Node Clustering (ENC) based on cosine similarity
- Advantage: Produces more precise clustering results compared to K-means
- Representative Selection: Selects the most frequently occurring term in each synset as representative
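The idea of grouping aspects by cosine similarity and naming each group after its most frequent term can be illustrated with a simple greedy grouping. This is a stand-in for illustration and is not the ENC algorithm; the function names, the `0.8` threshold, and the toy vectors are all assumptions.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_synsets(
    embeddings: dict[str, list[float]],
    counts: dict[str, int],
    threshold: float = 0.8,
) -> dict[str, list[str]]:
    """Greedy grouping: a term joins the first group whose seed it resembles.

    Simple stand-in for illustration only, NOT the ENC algorithm.
    """
    groups: list[list[str]] = []
    # Visit frequent terms first so each group's seed is its most frequent member.
    for term in sorted(embeddings, key=lambda t: -counts.get(t, 0)):
        for group in groups:
            if cosine(embeddings[term], embeddings[group[0]]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    # The representative of each synset is its seed, i.e. its most frequent term.
    return {group[0]: group for group in groups}


embeddings = {
    "screen": [1.0, 0.0],
    "display": [0.9, 0.1],
    "remote": [0.0, 1.0],
}
counts = {"screen": 40, "display": 12, "remote": 25}
synsets = group_synsets(embeddings, counts)
```

With these toy vectors, `display` joins the `screen` synset and `remote` forms its own.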
Concept Extraction:
- Relevance Judgment: Uses LLM prompting to determine whether terms should be included in the ontology
- Filtering Criteria: Relevance, specificity, and hierarchical properties
Relation Extraction:
- Input Processing: Extracts sentences containing two aspects from different synsets
- Task Design: Multiple-choice questions (aspect A is part of aspect B / aspect B is part of aspect A / unrelated)
- Model Training: Fine-tunes Mistral model through distillation on 1,000 synthetic samples
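The input processing and multiple-choice framing above could look roughly like the following. It is a hedged sketch: the helper names, the sentence splitter, and the prompt wording are illustrative assumptions rather than the paper's actual prompt.

```python
import re
from itertools import combinations
from typing import Iterator


def cooccurring_pairs(
    review: str, synsets: dict[str, list[str]]
) -> Iterator[tuple[str, str, str]]:
    """Yield (sentence, rep_a, rep_b) for sentences mentioning two different synsets."""
    for sentence in re.split(r"(?<=[.!?])\s+", review.strip()):
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        present = [rep for rep, terms in synsets.items() if tokens & set(terms)]
        for rep_a, rep_b in combinations(present, 2):
            yield sentence, rep_a, rep_b


def build_prompt(sentence: str, a: str, b: str) -> str:
    """Format the three-way multiple-choice question (wording is illustrative)."""
    return (
        f"Sentence: {sentence}\n"
        f"(1) '{a}' is part of '{b}'\n"
        f"(2) '{b}' is part of '{a}'\n"
        f"(3) '{a}' and '{b}' are unrelated\n"
        "Answer with 1, 2, or 3."
    )


synsets = {"watch": ["watch"], "strap": ["strap", "band"], "battery": ["battery"]}
review = "The band on this watch is flimsy. Shipping was fast."
pairs = list(cooccurring_pairs(review, synsets))
prompt = build_prompt(*pairs[0])
```

Only the first sentence yields a pair, because it is the only one that mentions terms from two different synsets (`band` maps to the `strap` synset).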
- End-to-End LLM Pipeline: Achieves higher automation compared to BERT methods
- Structured Output Constraints: Uses JSON syntax constraints to ensure consistent output formatting
- Multi-Stage Optimization: Each stage is optimized for specific tasks to improve overall performance
- Hallucination Mitigation: Reduces LLM hallucination issues through POS tagging filtering and fine-tuning
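One way to picture the structured-output idea is a validator that accepts only well-formed JSON replies with an expected label. Constrained decoding would enforce the format at generation time; the sketch below only checks a reply after the fact, and the `relation` key and label names are hypothetical.

```python
import json
from typing import Optional

# Hypothetical label set for the three-way relation decision.
ALLOWED_RELATIONS = {"a_part_of_b", "b_part_of_a", "unrelated"}


def parse_relation_reply(reply: str) -> Optional[str]:
    """Return the relation label from a JSON reply, or None if it is malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("relation")
    # Guard against non-string values before the set lookup.
    if isinstance(label, str) and label in ALLOWED_RELATIONS:
        return label
    return None


ok = parse_relation_reply('{"relation": "a_part_of_b"}')
free_text = parse_relation_reply("The strap is part of the watch.")
unknown = parse_relation_reply('{"relation": "sibling"}')
```

Free-text answers and unknown labels both come back as `None`, so downstream code never has to interpret an unconstrained reply.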
- Source: Amazon Reviews 2023 dataset
- Product Categories: 5 categories (video games, televisions, necklaces, watches, stand mixers)
- Data Scale: 100,000 reviews per product category (26,464 for stand mixers)
- Processing Limitation: Only 1,000 reviews are used for the LLM tasks, to limit processing time
Term Evaluation Criteria:
- Relevance: Whether the term accurately represents a product part or component
- Specificity: Whether the term has an appropriate level of specificity
- Clarity: Whether the term clearly conveys intent and avoids ambiguity
- Product Match: Whether the term logically fits the given product
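The four term criteria above could be folded into a judge prompt along these lines. This is a sketch under stated assumptions: the 1 to 5 scale, the prompt wording, and the helper names are all hypothetical, and no particular judge API is called.

```python
import re
from statistics import mean
from typing import Optional

TERM_CRITERIA = ("relevance", "specificity", "clarity", "product match")


def build_term_judge_prompt(product: str, term: str) -> str:
    """Build a rating prompt; the 1 to 5 scale is an assumption."""
    criteria = ", ".join(TERM_CRITERIA)
    return (
        f"Product: {product}\nTerm: {term}\n"
        f"Rate the term from 1 to 5 considering {criteria}. "
        "Reply with a single integer."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit between 1 and 5, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def average_score(replies: list[str]) -> Optional[float]:
    """Average the parseable scores, ignoring replies without a valid rating."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return mean(scores) if scores else None


prompt = build_term_judge_prompt("stand mixer", "whisk")
avg = average_score(["4", "Score: 5", "n/a"])
```

Replies that carry no valid rating are skipped rather than counted as zero, so a malformed judge reply does not drag the average down.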
Relation Evaluation Criteria:
- Logical Hierarchy: Whether child nodes logically represent parts or features of parent nodes
- Contextual Match: Whether relationships are reasonable within Amazon product categories
- Clarity and Specificity: Whether relationships avoid ambiguity and clearly define part-whole relations
- Baseline Method: BERT-based method by Oksanen et al. (2021)
- Evaluation Method: Gemini 1.5 Flash as LLM judge
- Comparison Versions: Full version and shortened version (equal term count to baseline)
- Hardware: NVIDIA GeForce RTX 4090 GPU
- Optimizer: Adam (learning rate 10^-4)
- Fine-tuning Technique: LoRA (r=4, α=16)
- Training Epochs: 3, batch size 16
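For readers less familiar with the LoRA settings listed above, the standard LoRA formulation shows what $r = 4$ and $\alpha = 16$ control. The worked form below uses the usual LoRA notation ($W_0$, $A$, $B$, $d$, $k$), none of which appears in the summary, and the summary does not say which weight matrices were adapted.

```latex
\begin{align}
  h &= W_0 x + \frac{\alpha}{r}\, B A x,
      \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k} \\
  \frac{\alpha}{r} &= \frac{16}{4} = 4 \\
  \text{trainable parameters per adapted matrix} &= r\,(d + k) = 4\,(d + k)
\end{align}
```

As an illustration only, a square $4096 \times 4096$ projection (4096 is the hidden size of Mistral-7B) would have $4 \times 8192 = 32{,}768$ trainable adapter parameters, against $16{,}777{,}216$ in the frozen matrix, or roughly 0.2%.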
| Product Category | Proposed Method (Full) | Proposed Method (Shortened) | BERT Baseline |
|---|---|---|---|
| Video Games | 4.00 | 4.18 | 3.92 |
| Television | 4.06 | 4.05 | 3.95 |
| Necklace | 4.50 | 4.57 | 3.86 |
| Watch | 4.13 | 4.37 | 4.10 |
| Stand Mixer | 4.36 | 4.40 | 3.31 |
| Product Category | Proposed Method (Full) | Proposed Method (Shortened) | BERT Baseline |
|---|---|---|---|
| Video Games | 3.89 | 3.82 | 3.43 |
| Television | 3.99 | 4.56 | 3.21 |
| Necklace | 3.65 | 3.79 | 3.29 |
| Watch | 3.75 | 4.06 | 2.68 |
| Stand Mixer | 3.30 | 3.40 | 2.47 |
| Method | Average Score |
|---|---|
| Method A1 (Prompt Only) | 1.960 ± 0.006 |
| Method A2 (Prompt + Sentiment) | 2.259 ± 0.002 |
| Method A3 (Fine-tuning) | 2.662 ± 0.006 |
| Method | Video Games | Television | Necklace | Watch | Stand Mixer |
|---|---|---|---|---|---|
| Full Reviews | 3.811 | 4.155 | 3.397 | 3.570 | 3.080 |
| Excerpts | 3.727 | 3.726 | 3.481 | 3.398 | 2.493 |
| Excerpts + Fine-tuning | 3.893 | 3.987 | 3.646 | 3.747 | 3.303 |
| Stage | Average Time (minutes) |
|---|---|
| Aspect Extraction | 32.05 |
| Synset Extraction | 0.78 |
| Concept Extraction | 1.52 |
| Relation Extraction | 4.53 |
| Total | 38.89 |
| Stage | Average Time (minutes) |
|---|---|
| Entity Extraction | 1.66 |
| Aspect Extraction | 2.79 |
| Synset Extraction | 0.82 |
| Ontology Extraction | 1.36 |
| Total | 6.62 |
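The two timing tables can be cross-checked with a few lines of arithmetic. The sketch assumes, based on the stage names, that the first table (total 38.89 minutes) is the LLM pipeline and the second (total 6.62 minutes) is the BERT baseline; the variable names are illustrative.

```python
# Average stage times in minutes, copied from the two timing tables.
llm_stages = {"aspect": 32.05, "synset": 0.78, "concept": 1.52, "relation": 4.53}
bert_stages = {"entity": 1.66, "aspect": 2.79, "synset": 0.82, "ontology": 1.36}
LLM_TOTAL, BERT_TOTAL = 38.89, 6.62  # totals as reported in the tables

# Recomputed sums differ from the reported totals by 0.01, consistent with rounding.
llm_sum = round(sum(llm_stages.values()), 2)
bert_sum = round(sum(bert_stages.values()), 2)

# Runtime ratio between the two pipelines, and the share taken by aspect extraction.
ratio = round(LLM_TOTAL / BERT_TOTAL, 2)
aspect_share = round(llm_stages["aspect"] / LLM_TOTAL, 3)

print(llm_sum, bert_sum, ratio, aspect_share)
```

The ratio comes out just under 6, and aspect extraction alone accounts for over 80% of the LLM pipeline's runtime, which makes it the obvious target for any speed-up.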
- Quality Improvement: The LLM method scores above the BERT baseline in every product category, for both the full and shortened ontologies and for both term and relation quality
- Fine-tuning Importance: Fine-tuning brings significant performance improvements compared to pure prompting methods
- Computational Cost: The LLM method achieves higher quality but takes roughly 6 times as long to run as the BERT baseline (38.89 vs. 6.62 minutes on average)
- Clustering Algorithm Selection: ENC produces more precise synsets compared to K-means
Traditional ontology learning primarily relies on deep learning methods, but most focus on extracting taxonomic relations rather than part-whole relationships.
Recent research has begun exploring the application of LLMs in key ontology learning tasks such as term and relation extraction, but primarily focuses on taxonomic relations.
Due to the lack of standard benchmarks, ontology quality evaluation has been a persistent challenge. The LLM-as-a-judge method proposed in this paper provides a novel solution to this problem.
- LLM method significantly outperforms existing BERT methods in part-whole ontology extraction tasks
- Fine-tuning and structured output constraints are key factors for performance improvement
- LLM-as-a-judge provides a viable solution for ontology quality assessment
- Evaluation Dependency: Primarily relies on LLM-as-a-judge, lacking user study validation
- Computational Cost: Significantly increased computational cost compared to BERT methods
- Hallucination Issues: LLMs still exhibit hallucination problems in generating irrelevant aspects
- Benchmark Absence: Lack of standard benchmark datasets in the product ontology domain
- Standard Benchmark Construction: Establish standard benchmark datasets for this task
- User Study Validation: Verify the practical utility of ontologies through user studies
- Method Generalization: Explore application of the method to other ontology types (e.g., taxonomic ontologies)
- Hallucination Mitigation: Research methods integrating multiple LLMs to reduce single-model hallucinations
- Strong Innovation: First systematic application of LLMs to part-whole ontology extraction
- Complete Methodology: Provides an end-to-end complete pipeline solution
- Evaluation Innovation: Proposes the LLM-as-a-judge evaluation framework
- Comprehensive Experiments: Includes detailed ablation studies and efficiency analysis
- Open-Source Contribution: Provides complete open-source implementation
- Evaluation Limitations: Over-reliance on LLM evaluation, lacking human evaluation validation
- Cost Considerations: Significantly increased computational cost but insufficient discussion of cost-benefit tradeoffs
- Generalization: Validation on only 5 product categories; generalization requires further verification
- Baseline Comparison: Insufficient comparison with more existing methods
- Academic Value: Provides important reference for LLM applications in ontology construction
- Practical Value: Direct application potential in e-commerce and related domains
- Methodological Contribution: LLM-as-a-judge evaluation framework has broad applicability
- Reproducibility: Provides detailed implementation details and open-source code
- E-commerce Platforms: Product categorization and recommendation systems
- Knowledge Graph Construction: Automated ontology construction
- Information Extraction: Extracting structured relationships from unstructured text
- Review Analysis: Product feature and component identification
This paper cites important works in related fields, including:
- Oksanen et al. (2021): BERT-based product ontology extraction method
- Devlin et al. (2019): BERT model
- Jiang et al. (2023): Mistral model
- Pontiki et al. (2014): SemEval-2014 Task 4 dataset
Overall Assessment: This paper makes an important contribution to part-whole ontology extraction. The method is innovative, the experimental design is reasonable, and the results are convincing. Despite some limitations in evaluation methodology and computational cost, the paper offers valuable insights and tools for the field.