Large Language Model-Driven Database for Thermoelectric Materials

Itani, Zhang, Zang
Thermoelectric materials provide a sustainable way to convert waste heat into electricity. However, data-driven discovery and optimization of these materials are challenging because of the lack of a reliable database. Here we developed a comprehensive database of 7,123 thermoelectric compounds, containing key information such as chemical composition, structural details, Seebeck coefficient, electrical and thermal conductivity, power factor, and figure of merit (ZT). We used the GPTArticleExtractor workflow, powered by large language models (LLMs), to extract and curate data automatically from the scientific literature published in Elsevier journals. This process enabled the creation of a structured database that addresses the challenges of manual data collection. The open-access database could stimulate data-driven research and advance thermoelectric material analysis and discovery.

Basic Information

  • Paper ID: 2501.00564
  • Title: Large Language Model-Driven Database for Thermoelectric Materials
  • Authors: Suman Itani, Yibo Zhang, Jiadong Zang (University of New Hampshire)
  • Classification: cond-mat.mtrl-sci cs.DL
  • Publication Date: January 3, 2025 (Preprint)
  • Paper Link: https://arxiv.org/abs/2501.00564

Abstract

Thermoelectric materials provide a sustainable pathway for converting waste heat into electrical energy. However, data-driven discovery and optimization of these materials face challenges due to the lack of reliable databases. This study develops a comprehensive database containing 7,123 thermoelectric compounds with key information including chemical composition, structural details, Seebeck coefficient, electrical conductivity and thermal conductivity, power factor, and figure of merit (ZT). The research employs the GPTArticleExtractor workflow driven by large language models to automatically extract and organize data from scientific literature published in Elsevier journals. This process enables the creation of a structured database, addressing challenges of manual data collection. This open-access database can stimulate data-driven research and advance thermoelectric materials analysis and discovery.

Research Background and Motivation

Problem Definition

  1. Energy Conversion Requirements: As global energy challenges and environmental concerns grow, thermoelectric materials are gaining attention as a key technology for converting thermal energy directly into electrical energy.
  2. Data Scarcity Issues: Existing thermoelectric materials databases have significant limitations:
    • Most are based on first-principles calculations, limited to ideal undoped crystal structures
    • Experimental databases are small in scale and require manual curation
    • They lack structural information, which limits structure-property relationship studies

Research Significance

Thermoelectric material performance is quantified by the dimensionless figure of merit ZT:

ZT = S²σT/κ

where S is the Seebeck coefficient, σ is electrical conductivity, T is absolute temperature, and κ is thermal conductivity. Optimizing ZT requires simultaneous consideration of these interrelated properties, making material design highly challenging.
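As a worked illustration of the formula above, the following minimal sketch computes the power factor S²σ and ZT in SI units. The function names and the input values are illustrative assumptions, not entries from the database.

```python
def power_factor(seebeck_v_per_k, sigma_s_per_m):
    """Power factor S^2 * sigma, in W/(m*K^2)."""
    return seebeck_v_per_k ** 2 * sigma_s_per_m


def figure_of_merit(seebeck_v_per_k, sigma_s_per_m, temperature_k, kappa_w_per_mk):
    """Dimensionless ZT = S^2 * sigma * T / kappa, all inputs in SI units."""
    if kappa_w_per_mk <= 0:
        raise ValueError("thermal conductivity must be positive")
    return power_factor(seebeck_v_per_k, sigma_s_per_m) * temperature_k / kappa_w_per_mk


# Illustrative values only (hypothetical material):
# S = 200 uV/K, sigma = 1e5 S/m, T = 600 K, kappa = 1.5 W/(m*K)
pf = power_factor(200e-6, 1.0e5)                  # about 4e-3 W/(m*K^2)
zt = figure_of_merit(200e-6, 1.0e5, 600.0, 1.5)   # about 1.6
print(f"PF = {pf * 1e6:.0f} uW/(m*K^2), ZT = {zt:.2f}")
```

Because S enters squared while κ sits in the denominator, a change that raises σ but also raises κ (as carrier doping often does) can leave ZT unchanged or lower, which is the coupling the paragraph above refers to.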

Limitations of Existing Methods

  1. Traditional Approaches: Rely on experimental trial-and-error and theoretical simulations (DFT, MD), which are time-consuming and computationally expensive
  2. Existing Databases:
    • Computational databases cannot fully reflect actual material behavior
    • Experimental databases are limited in scale
    • Lack structural information for machine learning applications
  3. Automated Extraction: Tools like ChemDataExtractor show reduced accuracy when processing multi-compound articles

Core Contributions

  1. Large-Scale Database Construction: Created a comprehensive database containing 7,123 thermoelectric compounds, covering key thermoelectric properties and structural information
  2. Automated Data Extraction: Adopted the GPTArticleExtractor workflow utilizing large language models to automatically extract structured data from scientific literature
  3. Data Quality Assurance: Labels each entry as experimental or theoretical; approximately 66% of the entries are experimental, which enhances data reliability
  4. Open Access Resource: Provided open access at nemad.org, supporting data-driven thermoelectric materials research
  5. Structure-Property Relationships: First systematic inclusion of structural information in a thermoelectric materials database, supporting advanced methods such as graph neural networks

Methodology Details

Task Definition

Automatically extract property data and structural information of thermoelectric materials from scientific literature to construct a standardized structured database, including:

  • Input: Thermoelectric-related scientific literature published in Elsevier journals
  • Output: Standardized JSON format data containing chemical formulas, thermoelectric properties, and structural parameters
  • Constraints: Ensure data accuracy and unit consistency

Workflow Architecture

1. DOI Collection Phase

  • Filter relevant articles using keywords ("Thermoelectric", "Seebeck Coefficient", "Figure of Merit")
  • Collect approximately 20,000 DOIs from Elsevier journal database through web scraping scripts

2. Article Retrieval Phase

  • Download full texts in XML format using Elsevier API keys
  • Develop customized text and table parsing tools to convert XML to plain text CSV format
  • Remove nested tags and extraneous metadata
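The tag-stripping step can be sketched with the Python standard library. This is a minimal illustration, not the authors' parser: the element names in the sample, the column names, and the DOI are hypothetical, and real publisher XML uses namespaces and table markup that need additional handling.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_string):
    """Flatten an XML document to plain text, dropping all tags."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of nested elements in document order.
    chunks = (text.strip() for text in root.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


def to_csv(rows):
    """Serialize (doi, text) pairs as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["doi", "text"])  # hypothetical column names
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample; real article XML is far more deeply nested.
sample = (
    "<article><title>Bi2Te3 films</title>"
    "<para>The <bold>Seebeck</bold> coefficient was measured.</para></article>"
)
text = xml_to_plain_text(sample)
print(text)
print(to_csv([("10.0000/example", text)]))
```

Joining text nodes with a space keeps adjacent elements from fusing, at the cost of splitting formulas whose subscripts are marked up as separate elements; a production parser would treat those tags specially.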

3. Data Extraction and Compilation Phase

  • GPTArticleExtractor Core Technology:
    • Utilize GPT-4 model for data extraction through OpenAI API
    • Highly customizable prompt design targeting specific information extraction needs
    • Output structured JSON files conforming to predefined formats
    • Generate JSON object lists for multi-material articles
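A downstream consumer has to accept either a single JSON object or a list of objects, since multi-material articles produce lists. The sketch below shows one way to normalize and check such output; the field names and the required-key set are assumptions for illustration, not the schema used by GPTArticleExtractor, and no API call is made.

```python
import json

# Hypothetical minimal schema: every record must name a compound and its source type.
REQUIRED_KEYS = {"formula", "data_source"}


def parse_extraction(raw):
    """Parse model output into a list of record dicts, dropping malformed entries."""
    data = json.loads(raw)
    records = data if isinstance(data, list) else [data]
    valid = []
    for record in records:
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record):
            valid.append(record)
    return valid


# Hypothetical model response for an article describing two compounds.
raw_response = """
[
  {"formula": "compound_A", "data_source": "experimental", "zt": 0.9},
  {"formula": "compound_B", "data_source": "theoretical", "zt": null},
  {"note": "entry without a formula is discarded"}
]
"""
records = parse_extraction(raw_response)
print(len(records), [r["formula"] for r in records])
```

Rejecting incomplete objects at parse time keeps partially extracted entries from entering the database silently.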

Technical Innovations

  1. LLM-Driven Automation: GPT-4 demonstrates superior performance in understanding complex scientific texts compared to traditional NLP tools
  2. Multi-Material Processing Capability: Accurately handles articles describing multiple compounds and their properties
  3. Data Standardization: Developed data cleaning scripts to unify unit systems across different literature sources
  4. Quality Control: Distinguish between experimental and theoretical data, enhancing database reliability

Experimental Setup

Data Sources

  • Source: Scientific literature published in Elsevier journals
  • Scale: Approximately 20,000 relevant articles processed
  • Time Span: Covers historical thermoelectric materials research literature
  • Language: English scientific literature

Data Processing Pipeline

  1. XML to CSV Conversion: Converts the retrieved XML to plain text while preserving the core content found in the PDF versions of the articles
  2. GPT-4 Extraction: Uses carefully designed prompts for information extraction
  3. Data Cleaning: Unifies unit systems and data formats
  4. Quality Validation: Manual verification of critical data points

Extraction Targets

  • Chemical composition and compound types
  • Thermoelectric properties (S, σ, κ, PF, ZT) and measurement temperatures
  • Structural information (crystal structure, lattice parameters, space groups)
  • Data source identification (experimental/theoretical)

Experimental Results

Database Statistical Characteristics

Database Scale and Content

  • Total Compounds: 7,123 thermoelectric compounds
  • Data Source Proportion: 66% experimental data, 34% theoretical calculation data
  • Structuring Level: Complete JSON format supporting machine learning applications

Property Distribution Analysis

1. Seebeck Coefficient Distribution

  • Range: -200 μV/K to 3,000 μV/K
  • Characteristics: Includes n-type (negative) and p-type (positive) materials
  • High-value materials: Few compounds reach 3,000 μV/K, primarily from computational studies

2. Electrical Conductivity Distribution

  • Mean: 58,980.63 S/m
  • Median: 20,900.00 S/m
  • Maximum: Approximately 500,000 S/m
  • Distribution: Strongly right-skewed, with most materials showing lower conductivity

3. Thermal Conductivity Distribution

  • Mean: 2.17 W/(m·K)
  • Median: 1.10 W/(m·K)
  • Peak: Near 1 W/(m·K)
  • Characteristics: Most materials exhibit low thermal conductivity suitable for thermoelectric applications

4. Power Factor Distribution

  • Calculation Formula: PF = S² × σ
  • Mean: 1,165.54 μW/(m·K²)
  • Median: 526.86 μW/(m·K²)
  • Maximum: Approximately 7,000 μW/(m·K²)

5. Figure of Merit (ZT) Distribution

  • Mean: 0.75
  • Median: 0.72
  • Primary Range: 0.5-1.0
  • High-Performance Materials: Few reach ZT ≈ 4.0

Data Completeness Analysis

As shown in Figure 2 of the paper, data coverage varies across properties. This reflects the fact that individual articles rarely report every property for a compound, which is common in practice.
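Coverage of this kind is the fraction of records that carry a non-null value for each property. The following sketch computes it over a list of record dictionaries; the property keys and the sample records are hypothetical and do not reproduce the paper's figures.

```python
PROPERTIES = ("seebeck", "sigma", "kappa", "power_factor", "zt")  # assumed key names


def coverage(records, properties=PROPERTIES):
    """Return, per property, the fraction of records with a non-null value."""
    total = len(records)
    if total == 0:
        return {prop: 0.0 for prop in properties}
    return {
        prop: sum(1 for rec in records if rec.get(prop) is not None) / total
        for prop in properties
    }


# Hypothetical records: a missing key and an explicit None both count as absent.
sample_records = [
    {"seebeck": 180.0, "sigma": 9.0e4, "kappa": 1.2, "zt": 0.8},
    {"seebeck": -150.0, "sigma": None, "zt": 0.5},
    {"seebeck": 210.0, "power_factor": 1200.0},
    {"zt": 1.1},
]
for prop, fraction in coverage(sample_records).items():
    print(f"{prop}: {fraction:.0%}")
```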

Comparison with Existing Databases

  1. Computational Databases: Materials Project and JARVIS are primarily based on DFT calculations
  2. Experimental Databases: Smaller in scale, such as the manually curated database by Gaultois et al.
  3. Automated Extraction: Sierepeklis and Cole used ChemDataExtractor to construct a database of 10,641 compounds

Advantages of This Work

  1. Data Quality: Advanced LLM improves extraction accuracy
  2. Structural Information: First systematic inclusion of crystal structure, space groups, and other information
  3. Data Identification: Clear distinction between experimental and theoretical data
  4. Continuous Updates: Establishes scalable automated workflow

Conclusions and Discussion

Main Conclusions

  1. Successfully constructed one of the most comprehensive thermoelectric materials databases to date, containing 7,123 compounds
  2. GPTArticleExtractor demonstrates the effectiveness of LLMs in scientific data extraction
  3. Database covers a wide range of materials from low-performance to high-performance (ZT ≈ 4)
  4. Inclusion of structural information lays the foundation for future machine learning applications

Limitations

  1. Data Completeness: Not all compounds have complete property data
  2. Source Restrictions: Limited to Elsevier journals, potentially introducing publication bias
  3. Quality Control: While LLM improves accuracy, manual verification remains necessary
  4. Dynamic Updates: Requires continuous maintenance to include latest research findings

Future Directions

  1. Expand to more journals and data sources
  2. Develop machine learning models based on this database
  3. Integrate graph neural networks to leverage structural information
  4. Establish community contribution mechanisms

In-Depth Evaluation

Strengths

  1. Technical Innovation: Application of LLMs to scientific data extraction significantly improves automation and accuracy
  2. Data Value: Addresses the lack of large-scale experimental databases in the thermoelectric materials field
  3. Practicality: Open access and a standardized format facilitate use by the research community
  4. Forward-Looking: Inclusion of structural information paves the way for applying advanced machine learning methods
  5. Reproducibility: Detailed workflow description supports reproducibility

Weaknesses

  1. Verification Mechanism: Lacks systematic manual verification to quantify extraction accuracy
  2. Bias Issues: Using only Elsevier journals may introduce publication and selection bias
  3. Data Quality Assessment: No quantitative comparison of data quality from different sources
  4. Update Mechanism: Long-term maintenance and update strategy not detailed

Impact

  1. Academic Value: Provides important resources for data-driven thermoelectric materials research
  2. Methodological Exemplar: GPTArticleExtractor workflow can be extended to other materials science domains
  3. Industrial Application: Supports industrial development and optimization of thermoelectric devices
  4. Educational Value: Provides standardized datasets for relevant courses and research

Applicable Scenarios

  1. Machine Learning Research: Training models to predict thermoelectric properties
  2. Materials Screening: Rapidly identifying candidate materials with specific properties
  3. Structure-Property Relationship Studies: Exploring design principles using structural information
  4. Benchmark Testing: Providing validation datasets for new computational methods
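As a sketch of the materials screening scenario, the function below filters records by data source and a minimum ZT, then ranks the hits. The record keys, the threshold, and the sample entries are illustrative assumptions; the actual database fields may be named differently.

```python
def screen(records, min_zt=1.0, source="experimental"):
    """Return records from the given source with ZT >= min_zt, best first."""
    hits = [
        rec for rec in records
        if rec.get("data_source") == source
        and isinstance(rec.get("zt"), (int, float))
        and rec["zt"] >= min_zt
    ]
    return sorted(hits, key=lambda rec: rec["zt"], reverse=True)


# Hypothetical entries with placeholder compound names.
candidates = [
    {"formula": "compound_A", "data_source": "experimental", "zt": 1.3},
    {"formula": "compound_B", "data_source": "theoretical", "zt": 3.9},
    {"formula": "compound_C", "data_source": "experimental", "zt": 0.6},
    {"formula": "compound_D", "data_source": "experimental", "zt": 1.8},
    {"formula": "compound_E", "data_source": "experimental", "zt": None},
]
for rec in screen(candidates):
    print(rec["formula"], rec["zt"])
```

Filtering on the experimental label first reflects the observation above that the highest reported values come mainly from computational studies.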

References

The paper cites 40 relevant references covering fundamental thermoelectric theory, computational methods, existing databases, and machine learning applications, providing solid theoretical foundation and comprehensive background research.


Overall Assessment: This is a high-quality interdisciplinary research paper that applies large language models to materials science data management and provides a valuable resource to the thermoelectric materials research community. Despite some limitations, its methodology and practical contributions give it significant academic and practical value.