
Generating CodeMeta using declarative mapping rules: An open-ended approach using ShExML

García-González
Nowadays, software is one of the cornerstones when conducting research in several scientific fields which employ computer-based methodologies to answer new research questions. However, for these experiments to be completely reproducible, research software should comply with the FAIR principles, yet its metadata can be represented following different data models and spread across different locations. In order to bring some cohesion to the field, CodeMeta was proposed as a vocabulary to represent research software metadata in a unified and standardised manner. While existing tools can help users to generate CodeMeta files for some specific use cases, they fall short on flexibility and adaptability. Hence, in this work, I propose the use of declarative mapping rules to generate CodeMeta files, illustrated through the implementation of three crosswalks in ShExML which are then expanded and merged to cover the generation of CodeMeta files for two existing research software artefacts. Moreover, the outputs are validated using SHACL and ShEx and the whole generation workflow is automated requiring minimal user intervention upon a new version release. This work can, therefore, be used as an example upon which other developers can include a CodeMeta generation workflow in their repositories, facilitating the adoption of CodeMeta and, ultimately, increasing research software FAIRness.

Basic Information

  • Paper ID: 2510.09172
  • Title: Generating CodeMeta using declarative mapping rules: An open-ended approach using ShExML
  • Author: Herminio García-González (Kazerne Dossin, Mechelen, Belgium)
  • Classification: cs.DL (Digital Libraries), cs.SE (Software Engineering)
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09172v1

Abstract

Software is now a cornerstone of research in many scientific fields that use computational methods to answer new research questions. For these experiments to be fully reproducible, research software should comply with the FAIR principles, yet its metadata can follow different data models and be spread across different locations. To bring some cohesion to the field, CodeMeta was proposed as a vocabulary for representing research software metadata in a unified and standardized manner. Existing tools can help users generate CodeMeta files for some specific use cases, but they fall short on flexibility and adaptability. This paper therefore proposes using declarative mapping rules to generate CodeMeta files, illustrated through the implementation of three crosswalks in ShExML, which are then expanded and merged to generate CodeMeta files for two existing research software artifacts. The outputs are validated using SHACL and ShEx, and the whole generation workflow is automated, requiring minimal user intervention upon a new version release.

Research Background and Motivation

Problem Definition

  1. FAIR Compliance Issues for Research Software: Research software underpins much scientific work, but its metadata is scattered across platforms (GitHub, Zenodo, Maven, etc.) that each use a different data model, so there is no uniform representation.
  2. Limitations of Existing Tools:
    • Most tools support only one-to-one conversion (single metadata source to CodeMeta)
    • Lack of flexibility and adaptability
    • Require manual user intervention for data reconciliation
    • Insufficient automation capabilities
  3. Barriers to CodeMeta Adoption: Although CodeMeta provides a unified representation standard for research software metadata, limitations of existing tools hinder its widespread adoption.

Research Significance

  • Advancing Open Science: Research software compliant with FAIR principles is crucial for achieving open science
  • Ensuring Reproducibility: Unified metadata standards facilitate reproducibility of research results
  • Cross-Platform Interoperability: Addressing metadata format incompatibilities between different platforms

Core Contributions

  1. Proposing a Declarative Mapping Rule Approach: Creating flexible and maintainable CodeMeta generation rules using the ShExML language
  2. Implementing Three Key Crosswalks: Developing complete ShExML crosswalks from GitHub, Maven, and Zenodo metadata to CodeMeta
  3. Building a Unified Mapping Framework: Demonstrating how to merge multiple heterogeneous metadata sources to generate a single CodeMeta file
  4. Developing a Complete Automated Workflow: Including JSON-LD framing, SHACL/ShEx validation, and GitHub Actions integration
  5. Providing Practical Application Cases: Successfully deployed in ShExML Engine and DMAOG open-source projects

Methodology Details

Task Definition

Input: Data from multiple heterogeneous metadata providers (GitHub API, Maven POM files, Zenodo records, etc.)
Output: Standardized JSON-LD files compliant with the CodeMeta 3.0 specification
Constraints: Maintain semantic data integrity, support automated updates, and ensure the output passes validation

Core Method Architecture

1. ShExML Declarative Mapping Language

ShExML comprises two main components:

  • Declaration Section:
    • Prefix definitions (IRI shortcuts)
    • Data source definitions (input file locations)
    • Function definitions (extending basic functionality)
    • Iterator definitions (data extraction methods)
    • Expression definitions (merging data from different sources)
  • Generation Section:
    • Shape definitions (RDF graph generation rules)
    • Subject-predicate-object triple construction

2. Three Core Crosswalk Implementations

GitHub Mapping (example code; `ex:` is a placeholder namespace, and the `md` expression exposes the fields of the `gh` iterator to the shape):

PREFIX codemeta: <https://w3id.org/codemeta/3.0/>
PREFIX schema: <http://schema.org/>
PREFIX ex: <http://example.com/>
SOURCE repo_info <https://api.github.com/repos/herminiogg/ShExML>
ITERATOR gh <jsonpath: $> {
    FIELD id <id>
    FIELD name <name>
    FIELD description <description>
    // ... additional fields
}
EXPRESSION md <repo_info.gh>
schema:SoftwareSourceCode ex:[md.name] {
    a schema:SoftwareSourceCode ;
    schema:identifier [md.id] ;
    schema:name [md.name] ;
    // ... additional property mappings
}
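
For readers unfamiliar with ShExML, the following Python sketch imitates what the mapping above does: it reads the three fields declared in the `gh` iterator and emits a small JSON-LD document. It is only an illustration, not the paper's implementation. The sample JSON (including the numeric id), the `EX` namespace, and the function name `github_to_codemeta` are assumptions introduced here.

```python
import json

# Trimmed, made-up stand-in for a GitHub repository API response.
# The id value is illustrative only.
SAMPLE_REPO_JSON = """
{
  "id": 123456,
  "name": "ShExML",
  "description": "A heterogeneous data mapping language"
}
"""

# Placeholder subject namespace (assumption, not taken from the paper).
EX = "http://example.com/"


def github_to_codemeta(repo: dict) -> dict:
    """Map the id, name and description fields onto schema.org properties."""
    return {
        "@context": "https://w3id.org/codemeta/3.0",
        "@id": EX + repo["name"],           # subject IRI built from the name
        "@type": "SoftwareSourceCode",
        "identifier": str(repo["id"]),
        "name": repo["name"],
        "description": repo["description"],
    }


codemeta = github_to_codemeta(json.loads(SAMPLE_REPO_JSON))
print(json.dumps(codemeta, indent=2))
```

The declarative version keeps the same information in a form that can be edited without touching program logic, which is the property the paper relies on.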

Maven Mapping: Uses XPath queries on XML-formatted POM files, handling namespace and dependency relationship mappings.
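
As a rough illustration of the namespace-aware extraction described above, the Python sketch below queries a tiny, invented POM with the standard library's limited XPath support. The sample POM, the `m` prefix, the function name, and the choice of `softwareRequirements` as the target property are assumptions made for this example; the paper's ShExML crosswalk may map these fields differently.

```python
import xml.etree.ElementTree as ET

# Invented minimal POM; real POM files carry many more elements.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>org.example</groupId>
  <artifactId>demo-lib</artifactId>
  <version>1.2.3</version>
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>demo-core</artifactId>
      <version>0.9.0</version>
    </dependency>
  </dependencies>
</project>
"""

# POM elements live in this XML namespace, so every path step needs a prefix.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def text(element, path):
    """Return the stripped text of a child element, or '' if it is absent."""
    return element.findtext(path, default="", namespaces=NS).strip()


def pom_to_codemeta(pom_xml: str) -> dict:
    root = ET.fromstring(pom_xml)
    requirements = [
        {
            "@type": "SoftwareApplication",
            "identifier": f"{text(dep, 'm:groupId')}:{text(dep, 'm:artifactId')}",
            "version": text(dep, "m:version"),
        }
        for dep in root.findall("m:dependencies/m:dependency", NS)
    ]
    return {
        "@type": "SoftwareSourceCode",
        "identifier": f"{text(root, 'm:groupId')}:{text(root, 'm:artifactId')}",
        "version": text(root, "m:version"),
        "softwareRequirements": requirements,
    }


pom_metadata = pom_to_codemeta(SAMPLE_POM)
print(pom_metadata)
```

Forgetting the namespace prefix is the usual failure mode here: an unprefixed path such as `version` matches nothing in a namespaced POM.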

Zenodo Mapping: Processes nested JSON structures, including multi-level entity relationships such as authors and institutions.
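
The nesting described above can be pictured with a short Python sketch that turns creator entries into `Person` nodes with an embedded `Organization`. The record below is a simplified, invented stand-in for a Zenodo record (the DOI and ORCID are placeholders), and the helper names are hypothetical; it is meant to show the shape of the output, not the paper's actual crosswalk.

```python
# Simplified, invented stand-in for a Zenodo record; identifiers are placeholders.
SAMPLE_RECORD = {
    "doi": "10.5281/zenodo.0000000",
    "metadata": {
        "title": "Demo software",
        "creators": [
            {
                "name": "Doe, Jane",
                "affiliation": "Example University",
                "orcid": "0000-0000-0000-0000",
            },
            {"name": "Roe, Richard"},  # no ORCID or affiliation available
        ],
    },
}


def creator_to_person(creator: dict) -> dict:
    """Build a Person node, adding optional parts only when they are present."""
    person = {"@type": "Person", "name": creator["name"]}
    if "orcid" in creator:
        person["@id"] = "https://orcid.org/" + creator["orcid"]
    if "affiliation" in creator:
        person["affiliation"] = {
            "@type": "Organization",
            "name": creator["affiliation"],
        }
    return person


def zenodo_to_codemeta(record: dict) -> dict:
    meta = record["metadata"]
    return {
        "@type": "SoftwareSourceCode",
        "identifier": "https://doi.org/" + record["doi"],
        "name": meta["title"],
        "author": [creator_to_person(c) for c in meta["creators"]],
    }


zenodo_metadata = zenodo_to_codemeta(SAMPLE_RECORD)
print(zenodo_metadata)
```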

3. Unified Mapping Strategy

  • Intelligent Source Selection: When multiple sources contain the same attribute, selecting the optimal source based on semantic relevance and maintenance convenience
  • Hardcoded Value Supplementation: For data that cannot be obtained from external sources, allowing direct definition in the mapping file
  • Data Transformation Functions: Handling data cleaning tasks such as date format conversion and URL normalization
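
The three strategies above can be sketched together in a few lines of Python: a rule table picks one source per attribute, transformation functions clean the values, and hardcoded entries fill what no source provides. All sample values, rule choices, and names (`RULES`, `HARDCODED`, `merge_sources`) are assumptions for illustration; in the paper these decisions are written as ShExML rules instead.

```python
from datetime import datetime

# Invented sample values standing in for the three metadata providers.
SOURCES = {
    "github": {
        "name": "demo",
        "description": "Short repo blurb",
        "created_at": "2020-01-15T10:30:00Z",
    },
    "maven": {"name": "demo-lib", "version": "1.2.3"},
    "zenodo": {
        "description": "Longer curated description",
        "doi": "10.5281/zenodo.0000000",  # placeholder DOI
    },
}


def iso_date(timestamp: str) -> str:
    """Reduce a GitHub-style UTC timestamp to a plain ISO date."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date().isoformat()


def doi_url(doi: str) -> str:
    """Normalize a bare DOI into a resolvable URL."""
    return "https://doi.org/" + doi


# target property -> (chosen source, source field, optional transformation)
RULES = {
    "name": ("github", "name", None),
    "description": ("zenodo", "description", None),  # preferred over GitHub's
    "version": ("maven", "version", None),
    "dateCreated": ("github", "created_at", iso_date),
    "identifier": ("zenodo", "doi", doi_url),
}

# Values no external source provides are stated directly.
HARDCODED = {"programmingLanguage": "Scala"}


def merge_sources(sources: dict, rules: dict, hardcoded: dict) -> dict:
    result = {"@type": "SoftwareSourceCode"}
    for prop, (source, field, transform) in rules.items():
        value = sources[source][field]
        result[prop] = transform(value) if transform else value
    result.update(hardcoded)
    return result


merged = merge_sources(SOURCES, RULES, HARDCODED)
print(merged)
```

Because each attribute names its source explicitly, changing which provider wins for `description` is a one-line edit and does not affect any other attribute.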

Technical Innovations

  1. Multi-Source Data Fusion: Unlike existing tools' one-to-one conversion, supporting flexible merging of arbitrary numbers of heterogeneous sources
  2. Declarative Rules: Compared to programmatic approaches, providing better readability, maintainability, and shareability
  3. Fine-Grained Control: Allowing precise mapping control at the attribute level, rather than simple priority-based overrides
  4. Automated Integration: Complete CI/CD workflow integration supporting automatic updates upon version releases

Experimental Setup

Test Projects

  1. ShExML Engine: Heterogeneous data mapping tool written in Scala
  2. DMAOG Library: Scala library related to data mapping

Data Sources

  • GitHub API: Repository basic information, release records, issue tracking, etc.
  • Maven Central: Project metadata and dependency information in POM files
  • Zenodo: DOI, funding information, detailed author information, etc.

Validation Methods

  • SHACL Validation: Using SHACL, a W3C Recommendation, for structural validation
  • ShEx Validation: Using Shape Expressions for pattern validation
  • CodeMeta Generator: Using official validation tools for final confirmation
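
To give a feel for what structural validation checks, here is a hand-rolled Python stand-in. It is not SHACL or ShEx and does not use any validation library; the required properties in `REQUIRED` are hypothetical constraints chosen for the example, not the shapes used in the paper.

```python
# Hypothetical constraints: property name -> expected Python type.
REQUIRED = {"name": str, "version": str, "author": list}


def validate(doc: dict) -> list:
    """Return a list of human-readable violations (empty means valid)."""
    errors = []
    if doc.get("@type") != "SoftwareSourceCode":
        errors.append("@type must be SoftwareSourceCode")
    for prop, expected in REQUIRED.items():
        if prop not in doc:
            errors.append(f"missing property: {prop}")
        elif not isinstance(doc[prop], expected):
            errors.append(f"{prop} must be of type {expected.__name__}")
    return errors


good_doc = {
    "@type": "SoftwareSourceCode",
    "name": "demo",
    "version": "1.2.3",
    "author": [{"@type": "Person", "name": "Doe, Jane"}],
}
bad_doc = {"@type": "SoftwareSourceCode", "name": "demo", "author": "Doe, Jane"}

good_errors = validate(good_doc)
bad_errors = validate(bad_doc)
print(good_errors)
print(bad_errors)
```

A real SHACL or ShEx shape expresses the same kind of constraint declaratively and operates on the RDF graph, so it can also check IRIs, datatypes, and cardinalities that this sketch ignores.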

Automated Toolchain

  • GitHub Actions: CI/CD pipeline
  • Groovy Scripts: JSON-LD framing processing
  • Bash Scripts: Workflow orchestration

Experimental Results

Main Achievements

1. Mapping Coverage

Platform | Supported Attributes | CodeMeta Class Coverage
GitHub   | 12 core attributes   | SoftwareSourceCode, Person
Maven    | 8 core attributes    | SoftwareSourceCode, Dependencies
Zenodo   | 15 core attributes   | SoftwareSourceCode, Person, Organization

2. Automation Effectiveness

  • Maintenance Cost: Each version update requires modifying only 2 lines of code (input source URLs)
  • Processing Time: Complete workflow execution time < 2 minutes
  • Success Rate: Successfully generated valid CodeMeta files for both tested projects

3. Adaptability Verification

Adapting from ShExML Engine to DMAOG project:

  • Required modification of only 6 lines of code (4 API calls, removing 1 contributor)
  • Maintained the same technology stack support (Scala + SBT + Maven Central)
  • Generated CodeMeta files passed all validation tests

Output Quality Analysis

Generated CodeMeta files contain:

  • Basic Metadata: Name, description, version, license, etc.
  • Development Information: Programming language, runtime platform, continuous integration, etc.
  • Personnel Information: Authors, contributors, institutional affiliations, etc.
  • Associated Resources: Code repositories, download links, references, etc.
  • Dependency Relationships: Software requirements and version information

Validation Results

All generated CodeMeta files passed:

  • SHACL structural validation
  • ShEx pattern validation
  • CodeMeta Generator official validation

Classification of Existing CodeMeta Tools

1. Conversion Tools

  • Bolognese: Ruby library supporting conversion of multiple DOI metadata formats
  • codemetar: R package-specific CodeMeta generation tool
  • codemetapy: Python implementation supporting multiple package managers
  • cffconvert: Citation File Format conversion tool

2. Management Tools

  • codemeta-server: Tool directory service based on CodeMeta
  • HERMES: Research software release platform with CI/CD integration

3. Auxiliary Tools

  • CodeMeta Generator: Web-based interactive generator
  • SMECS: Software metadata extraction and curation system

Advantages of This Work

  1. Flexibility: Supporting arbitrary numbers and types of metadata sources
  2. Fine-Grained Control: Precise mapping at the attribute level, rather than simple priority-based approaches
  3. Declarative Approach: More understandable and maintainable compared to programmatic implementations
  4. Complete Automation: End-to-end automated process from generation to validation

Conclusions and Discussion

Main Conclusions

  1. Feasibility of Declarative Mapping Rules: Demonstrating the technical feasibility of generating CodeMeta using ShExML
  2. Advantages of Multi-Source Fusion: Showcasing the value and effectiveness of integrating heterogeneous metadata sources
  3. Successful Automated Deployment: Achieving low-maintenance automated workflows in real projects
  4. Scalability Verification: Proving the generalizability of the method through successful adaptation of two projects

Limitations

  1. Technology Stack Dependency: Current implementation primarily targets the Scala/JVM ecosystem
  2. Learning Curve: Users need to learn ShExML syntax and concepts
  3. Platform Coverage: Crosswalks have been implemented for only three platforms (GitHub, Maven, and Zenodo)
  4. Complex Project Adaptation: May require more customization for complex projects using multiple technology stacks

Future Directions

  1. Extending Crosswalks: Implementing crosswalks for all platforms officially supported by CodeMeta
  2. Visualization Interface: Developing a graphical mapping rule editor
  3. AI-Assisted Generation: Leveraging large language models to automatically generate mapping rules
  4. Template Library Development: Providing predefined templates for different technology stacks and project types

In-Depth Evaluation

Strengths

  1. Strong Method Innovation: First application of declarative mapping rules to CodeMeta generation, providing a novel technical pathway
  2. High Practical Value: Addressing actual pain points in research software metadata management
  3. Complete Implementation: Providing comprehensive solutions from concept to deployment
  4. Good Reproducibility: Offering detailed implementation code and deployment guidelines

Weaknesses

  1. Limited Evaluation Scope: Testing on only two similar projects, lacking large-scale validation
  2. Missing Performance Analysis: No performance assessment for handling large projects or large-scale data
  3. Insufficient Error Handling: Inadequate robustness analysis for data source unavailability or format changes
  4. Lack of User Studies: No user acceptance and usability evaluation conducted

Impact

  1. Academic Contribution: Providing new technical solutions for research software metadata management
  2. Practical Value: Directly applicable to CodeMeta adoption in open-source projects
  3. Ecosystem Promotion: Helping increase CodeMeta adoption rates in the research software community
  4. Standards Advancement: Supporting FAIR principles implementation in the research software domain

Applicable Scenarios

  1. Open-Source Software Projects: Particularly suitable for research software requiring publication across multiple platforms
  2. Academic Institutions: Applicable to institutional-level research software metadata management
  3. CI/CD Integration: Suitable for projects with existing automated release processes
  4. Metadata Standardization: Applicable to research organizations requiring unified metadata formats

References

The paper includes 37 references covering important works in FAIR principles, semantic web technologies, CodeMeta specifications, declarative mapping languages, and related fields, providing solid theoretical foundation and technical support for the research.


Overall Assessment: This is a practical paper on research software metadata management. It offers a declarative mapping method, a complete and reproducible implementation, and a credible path to wider CodeMeta adoption. The evaluation could be broader and deeper, but the technical contribution is valuable.