Generating CodeMeta using declarative mapping rules: An open-ended approach using ShExML

García-González

Nowadays, software is one of the cornerstones when conducting research in several scientific fields which employ computer-based methodologies to answer new research questions. However, for these experiments to be completely reproducible, research software should comply with the FAIR principles, yet its metadata can be represented following different data models and spread across different locations. In order to bring some cohesion to the field, CodeMeta was proposed as a vocabulary to represent research software metadata in a unified and standardised manner. While existing tools can help users to generate CodeMeta files for some specific use cases, they fall short on flexibility and adaptability. Hence, in this work, I propose the use of declarative mapping rules to generate CodeMeta files, illustrated through the implementation of three crosswalks in ShExML which are then expanded and merged to cover the generation of CodeMeta files for two existing research software artefacts. Moreover, the outputs are validated using SHACL and ShEx and the whole generation workflow is automated requiring minimal user intervention upon a new version release. This work can, therefore, be used as an example upon which other developers can include a CodeMeta generation workflow in their repositories, facilitating the adoption of CodeMeta and, ultimately, increasing research software FAIRness.

Basic Information

Paper ID: 2510.09172
Title: Generating CodeMeta using declarative mapping rules: An open-ended approach using ShExML
Author: Herminio García-González (Kazerne Dossin, Mechelen, Belgium)
Classification: cs.DL (Digital Libraries), cs.SE (Software Engineering)
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09172v1

Abstract

Today, software is a cornerstone of research in the many scientific domains that use computational methods to address new research questions. For these experiments to be fully reproducible, research software should comply with the FAIR principles, yet its metadata may follow different data models and be dispersed across various locations. To bring some coherence to the field, CodeMeta was proposed as a vocabulary for representing research software metadata in a unified and standardized manner. While existing tools can help users generate CodeMeta files for certain specific use cases, they fall short in flexibility and adaptability. This paper therefore proposes using declarative mapping rules to generate CodeMeta files, illustrated through the implementation of three crosswalks in ShExML, which are then extended and merged to generate CodeMeta files for two existing research software artifacts. The outputs are validated using SHACL and ShEx, and the entire generation workflow is automated, requiring minimal user intervention upon each new version release.

Research Background and Motivation

Problem Definition

FAIR Compliance Issues for Research Software: Research software underpins much scientific work, yet its metadata is scattered across platforms (GitHub, Zenodo, Maven, etc.) and expressed in different data models, with no uniform representation.
Limitations of Existing Tools:
- Most tools support only one-to-one conversion (single metadata source to CodeMeta)
- Lack of flexibility and adaptability
- Require manual user intervention for data reconciliation
- Insufficient automation capabilities
Barriers to CodeMeta Adoption: Although CodeMeta provides a unified representation standard for research software metadata, limitations of existing tools hinder its widespread adoption.

Research Significance

Advancing Open Science: Research software compliant with FAIR principles is crucial for achieving open science
Ensuring Reproducibility: Unified metadata standards facilitate reproducibility of research results
Cross-Platform Interoperability: Addressing metadata format incompatibilities between different platforms

Core Contributions

Proposing a Declarative Mapping Rule Approach: Creating flexible and maintainable CodeMeta generation rules using the ShExML language
Implementing Three Crosswalks: Developing complete ShExML mappings from GitHub, Maven, and Zenodo metadata to CodeMeta
Building a Unified Mapping Framework: Demonstrating how to merge multiple heterogeneous metadata sources to generate a single CodeMeta file
Developing a Complete Automated Workflow: Including JSON-LD framing, SHACL/ShEx validation, and GitHub Actions integration
Providing Practical Application Cases: Successfully deployed in ShExML Engine and DMAOG open-source projects

Methodology Details

Task Definition

Input: Data from multiple heterogeneous metadata providers (GitHub API, Maven POM files, Zenodo records, etc.)
Output: Standardized JSON-LD files compliant with the CodeMeta 3.0 specification
Constraints: Maintain semantic data integrity, support automated updates, ensure output validation passes

Core Method Architecture

1. ShExML Declarative Mapping Language

A ShExML document comprises two main sections:

Declaration Section:
- Prefix definitions (IRI shortcuts)
- Data source definitions (input file locations)
- Function definitions (extending basic functionality)
- Iterator definitions (data extraction methods)
- Expression definitions (merging data from different sources)
Generation Section:
- Shape definitions (RDF graph generation rules)
- Subject-predicate-object triple construction
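
To make the division of labour between the two sections concrete, the following Python sketch imitates it outside ShExML: a declaration-like part names the prefixes and the fields to extract, and a generation-like part turns each extracted record into subject-predicate-object triples. This is only an illustration of the idea, not how the ShExML engine is implemented; the names PREFIXES, FIELDS, iterate, expand and generate, the ex: namespace, and the sample record are all assumptions made for the example.

```python
import json

# Declaration-like part: prefixes and the fields an iterator would extract.
# The "ex" namespace is a placeholder chosen for this sketch.
PREFIXES = {"schema": "http://schema.org/", "ex": "https://example.org/"}
FIELDS = {"id": "id", "name": "name", "description": "description"}
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iterate(document, fields):
    """Extract the declared fields from one JSON object (iterator stand-in)."""
    return {alias: document.get(key) for alias, key in fields.items()}


def expand(curie):
    """Expand a prefixed name such as schema:name into a full IRI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def generate(record):
    """Generation-like part: build triples for one extracted record."""
    subject = expand("ex:" + record["name"])
    triples = [(subject, RDF_TYPE, expand("schema:SoftwareSourceCode"))]
    mapping = (
        ("id", "schema:identifier"),
        ("name", "schema:name"),
        ("description", "schema:description"),
    )
    for alias, predicate in mapping:
        value = record[alias]
        if value is not None:  # no triple for a field the source lacks
            triples.append((subject, expand(predicate), value))
    return triples


# Hypothetical, heavily trimmed repository record.
sample = json.loads('{"id": 1, "name": "ShExML", "description": null}')
triples = generate(iterate(sample, FIELDS))
for triple in triples:
    print(triple)
```

Keeping extraction and generation separate is what lets one shape be fed by several iterators, which is the property the unified mapping relies on.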

2. Three Core Crosswalk Implementations

GitHub Mapping (example code; the ex: prefix IRI is a placeholder assumption):

PREFIX codemeta: <https://w3id.org/codemeta/3.0/>
PREFIX schema: <http://schema.org/>
PREFIX ex: <https://example.org/>
SOURCE repo_info <https://api.github.com/repos/herminiogg/ShExML>
ITERATOR gh <jsonpath: $> {
    FIELD id <id>
    FIELD name <name>
    FIELD description <description>
    // ... additional fields
}
EXPRESSION md <repo_info.gh>
schema:SoftwareSourceCode ex:[md.name] {
    a schema:SoftwareSourceCode ;
    schema:identifier [md.id] ;
    schema:name [md.name] ;
    // ... additional property mappings
}

Maven Mapping: Uses XPath queries on XML-formatted POM files, handling namespace and dependency relationship mappings.
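
The namespace handling mentioned here is easy to get wrong, because every element in a POM lives in the Maven namespace and an unqualified path matches nothing. The Python sketch below shows the same kind of namespace-aware path queries using only the standard library. It is an illustration rather than the paper's mapping: the sample POM, the read_pom helper, and the choice to emit dependencies under softwareRequirements are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The POM namespace must be bound to a prefix for path queries to match.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical, heavily trimmed POM used only for this sketch.
POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>org.example</groupId>
  <artifactId>demo-tool</artifactId>
  <version>1.2.3</version>
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>demo-lib</artifactId>
      <version>0.4.0</version>
    </dependency>
  </dependencies>
</project>
"""


def _text(node, path):
    """Return the stripped text of the first match, or None if absent."""
    found = node.find(path, NS)
    return found.text.strip() if found is not None and found.text else None


def read_pom(xml_text):
    """Extract project coordinates and dependencies from a POM string."""
    root = ET.fromstring(xml_text.strip())
    requirements = []
    for dep in root.findall("m:dependencies/m:dependency", NS):
        requirements.append({
            "@type": "SoftwareApplication",
            "identifier": f"{_text(dep, 'm:groupId')}:{_text(dep, 'm:artifactId')}",
            "version": _text(dep, "m:version"),
        })
    return {
        "identifier": f"{_text(root, 'm:groupId')}:{_text(root, 'm:artifactId')}",
        "version": _text(root, "m:version"),
        "softwareRequirements": requirements,
    }


pom = read_pom(POM)
print(pom)
```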

Zenodo Mapping: Processes nested JSON structures, including multi-level entity relationships such as authors and institutions.
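
As a rough picture of what handling these nested relationships involves, the Python sketch below walks a list of creators inside a record and emits one Person node per creator, each with an optional nested Organization. The record layout (metadata.creators with name, affiliation and orcid keys), the placeholder DOI, and the creators_to_authors helper are assumptions for illustration; the paper's ShExML crosswalk may select different fields.

```python
import json

# Hypothetical record, trimmed to the nested part relevant here.
RECORD = json.loads("""
{
  "doi": "10.5281/zenodo.0000000",
  "metadata": {
    "title": "Demo tool",
    "creators": [
      {"name": "Doe, Jane", "affiliation": "Example University",
       "orcid": "0000-0000-0000-0000"},
      {"name": "Roe, Richard"}
    ]
  }
}
""")


def creators_to_authors(record):
    """Turn nested creator entries into Person nodes with affiliations."""
    authors = []
    for creator in record.get("metadata", {}).get("creators", []):
        # Names are assumed to follow the "Family, Given" convention.
        family, _, given = creator["name"].partition(", ")
        person = {"@type": "Person", "familyName": family}
        if given:
            person["givenName"] = given
        if "orcid" in creator:
            person["@id"] = "https://orcid.org/" + creator["orcid"]
        if creator.get("affiliation"):
            # The second level of nesting: an organization per person.
            person["affiliation"] = {
                "@type": "Organization",
                "name": creator["affiliation"],
            }
        authors.append(person)
    return authors


authors = creators_to_authors(RECORD)
print(json.dumps(authors, indent=2))
```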

3. Unified Mapping Strategy

Per-Attribute Source Selection: When several sources provide the same attribute, the mapping author chooses which source to use for it, based on semantic fit and ease of maintenance
Hardcoded Value Supplementation: For data that cannot be obtained from external sources, allowing direct definition in the mapping file
Data Transformation Functions: Handling data cleaning tasks such as date format conversion and URL normalization
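
The three elements of this strategy can be sketched in a few lines of Python: a rule table that names, for each target attribute, the source and field to read and an optional transformation, plus a dictionary of hardcoded values. This is a minimal sketch under stated assumptions, not the paper's unified mapping: the sample data, the particular source chosen for each attribute, and the names RULES, HARDCODED, iso_date, strip_git_suffix and build are all hypothetical.

```python
from datetime import datetime


def iso_date(timestamp):
    """Transformation: reduce an ISO 8601 timestamp to a calendar date."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00")).date().isoformat()


def strip_git_suffix(url):
    """Transformation: normalize a clone URL to the repository URL."""
    return url[:-4] if url.endswith(".git") else url


# Hypothetical extracts from each provider.
SOURCES = {
    "github": {
        "description": "Demo tool",
        "clone_url": "https://example.org/demo/tool.git",
        "created_at": "2020-01-15T09:30:00Z",
    },
    "maven": {"version": "1.2.3", "description": "demo"},
    "zenodo": {"doi": "10.5281/zenodo.0000000"},
}

# One rule per target attribute: (source, field, optional transformation).
# Which source wins for an attribute is decided here, attribute by attribute.
RULES = {
    "description": ("github", "description", None),
    "version": ("maven", "version", None),
    "identifier": ("zenodo", "doi", lambda doi: "https://doi.org/" + doi),
    "codeRepository": ("github", "clone_url", strip_git_suffix),
    "dateCreated": ("github", "created_at", iso_date),
}

# Values no provider exposes are written directly into the mapping.
HARDCODED = {"programmingLanguage": "Scala"}


def build(sources, rules, hardcoded):
    """Merge the selected fields into a single CodeMeta-style document."""
    doc = {"@context": "https://w3id.org/codemeta/3.0", "@type": "SoftwareSourceCode"}
    for target, (source, field, transform) in rules.items():
        value = sources.get(source, {}).get(field)
        if value is None:
            continue  # leave the attribute out rather than emit an empty value
        doc[target] = transform(value) if transform else value
    doc.update(hardcoded)
    return doc


codemeta = build(SOURCES, RULES, HARDCODED)
print(codemeta)
```

Both GitHub and Maven offer a description in this sample; the rule table, not a global priority order, decides that the GitHub one is used.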

Technical Innovations

Multi-Source Data Fusion: Where existing tools convert one source to CodeMeta at a time, this approach merges any number of heterogeneous sources
Declarative Rules: Compared to programmatic approaches, providing better readability, maintainability, and shareability
Fine-Grained Control: Allowing precise mapping control at the attribute level, rather than simple priority-based overrides
Automated Integration: Complete CI/CD workflow integration supporting automatic updates upon version releases

Experimental Setup

Test Projects

ShExML Engine: Heterogeneous data mapping tool written in Scala
DMAOG Library: Scala library related to data mapping

Data Sources

GitHub API: Repository basic information, release records, issue tracking, etc.
Maven Central: Project metadata and dependency information in POM files
Zenodo: DOI, funding information, detailed author information, etc.

Validation Methods

SHACL Validation: Using SHACL, a W3C Recommendation, for structural validation
ShEx Validation: Using Shape Expressions for pattern validation
CodeMeta Generator: Using official validation tools for final confirmation
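
For readers unfamiliar with shape-based validation, the sketch below shows the kind of constraint that SHACL and ShEx express: per-property cardinality and value type, checked against a generated node. It is deliberately a plain-Python stand-in and not SHACL or ShEx itself. The SHAPE table, the validate function, and the sample documents are assumptions for illustration and do not reproduce the shapes used in the paper.

```python
# Toy shape: for each property, a minimum count and an expected Python type.
SHAPE = {
    "name": {"min_count": 1, "type": str},
    "version": {"min_count": 1, "type": str},
    "author": {"min_count": 1, "type": dict},
    "license": {"min_count": 0, "type": str},
}


def validate(node, shape):
    """Return a list of human-readable violations (empty means conformant)."""
    errors = []
    for prop, rule in shape.items():
        raw = node.get(prop, [])
        values = raw if isinstance(raw, list) else [raw]
        if len(values) < rule["min_count"]:
            errors.append(
                f"{prop}: expected at least {rule['min_count']} value(s), "
                f"found {len(values)}"
            )
        for value in values:
            if not isinstance(value, rule["type"]):
                errors.append(
                    f"{prop}: expected {rule['type'].__name__}, "
                    f"found {type(value).__name__}"
                )
    return errors


good = {
    "name": "Demo tool",
    "version": "1.2.3",
    "author": [{"@type": "Person", "familyName": "Doe"}],
}
bad = {"name": 42, "author": []}

good_errors = validate(good, SHAPE)
bad_errors = validate(bad, SHAPE)
for message in bad_errors:
    print(message)
```

In a CI pipeline, a non-empty violation list would typically fail the job so that an invalid file is never published.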

Automated Toolchain

GitHub Actions: CI/CD pipeline
Groovy Scripts: JSON-LD framing processing
Bash Scripts: Workflow orchestration
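
The framing step matters because RDF serialized as JSON-LD usually comes out as a flat list of nodes that reference each other by @id, whereas a CodeMeta file is normally a single tree rooted at the software entry. The Python sketch below illustrates that reshaping with a simplified embedding routine. It is a stand-in for the idea only: it does not implement the JSON-LD framing algorithm and is unrelated to the Groovy scripts listed above. The frame function and the sample graph are assumptions made for the example.

```python
import json


def frame(graph, root_type):
    """Embed referenced nodes under every node of the given root type."""
    index = {node["@id"]: node for node in graph if "@id" in node}

    def embed(value, seen):
        if isinstance(value, list):
            return [embed(item, seen) for item in value]
        is_reference = isinstance(value, dict) and set(value) == {"@id"}
        if is_reference and value["@id"] in index and value["@id"] not in seen:
            # Replace the bare reference with the full node; "seen" stops cycles.
            return embed_node(index[value["@id"]], seen | {value["@id"]})
        return value

    def embed_node(node, seen):
        return {key: embed(value, seen) for key, value in node.items()}

    roots = [node for node in graph if node.get("@type") == root_type]
    return [embed_node(root, {root.get("@id")}) for root in roots]


# Hypothetical flat graph, as an RDF-to-JSON-LD serializer might emit it.
FLAT = [
    {"@id": "https://example.org/tool", "@type": "SoftwareSourceCode",
     "name": "Demo tool", "author": [{"@id": "https://example.org/jane"}]},
    {"@id": "https://example.org/jane", "@type": "Person",
     "givenName": "Jane", "affiliation": {"@id": "https://example.org/uni"}},
    {"@id": "https://example.org/uni", "@type": "Organization",
     "name": "Example University"},
]

framed = frame(FLAT, "SoftwareSourceCode")
print(json.dumps(framed, indent=2))
```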

Experimental Results

Main Achievements

1. Mapping Coverage

Platform	Supported Attributes	CodeMeta Class Coverage
GitHub	12 core attributes	SoftwareSourceCode, Person
Maven	8 core attributes	SoftwareSourceCode, Dependencies
Zenodo	15 core attributes	SoftwareSourceCode, Person, Organization

2. Automation Effectiveness

Maintenance Cost: Each version update requires modifying only 2 lines of code (input source URLs)
Processing Time: Complete workflow execution time < 2 minutes
Success Rate: Successfully generated valid CodeMeta files for both tested projects

3. Adaptability Verification

Adapting from ShExML Engine to DMAOG project:

Required modification of only 6 lines of code (4 API calls, removing 1 contributor)
Maintained the same technology stack support (Scala + SBT + Maven Central)
Generated CodeMeta files passed all validation tests

Output Quality Analysis

Generated CodeMeta files contain:

Basic Metadata: Name, description, version, license, etc.
Development Information: Programming language, runtime platform, continuous integration, etc.
Personnel Information: Authors, contributors, institutional affiliations, etc.
Associated Resources: Code repositories, download links, references, etc.
Dependency Relationships: Software requirements and version information

Validation Results

All generated CodeMeta files passed:

SHACL structural validation
ShEx pattern validation
CodeMeta Generator official validation

Classification of Existing CodeMeta Tools

1. Conversion Tools

Bolognese: Ruby library supporting conversion of multiple DOI metadata formats
codemetar: R package-specific CodeMeta generation tool
codemetapy: Python implementation supporting multiple package managers
cffconvert: Citation File Format conversion tool

2. Management Tools

codemeta-server: Tool directory service based on CodeMeta
HERMES: Research software release platform with CI/CD integration

3. Auxiliary Tools

CodeMeta Generator: Web-based interactive generator
SMECS: Software metadata extraction and curation system

Advantages of This Work

Flexibility: Supporting arbitrary numbers and types of metadata sources
Fine-Grained Control: Precise mapping at the attribute level, rather than simple priority-based approaches
Declarative Approach: More understandable and maintainable compared to programmatic implementations
Complete Automation: End-to-end automated process from generation to validation

Conclusions and Discussion

Main Conclusions

Feasibility of Declarative Mapping Rules: Demonstrating the technical feasibility of generating CodeMeta using ShExML
Advantages of Multi-Source Fusion: Showcasing the value and effectiveness of integrating heterogeneous metadata sources
Successful Automated Deployment: Achieving low-maintenance automated workflows in real projects
Scalability Verification: Proving the generalizability of the method through successful adaptation of two projects

Limitations

Technology Stack Dependency: Current implementation primarily targets the Scala/JVM ecosystem
Learning Curve: Users need to learn ShExML syntax and concepts
Platform Coverage: Crosswalks have been implemented for only three platforms (GitHub, Maven, and Zenodo)
Complex Project Adaptation: May require more customization for complex projects using multiple technology stacks

Future Directions

Extending the Crosswalks: Implementing crosswalks for all platforms officially supported by CodeMeta
Visualization Interface: Developing a graphical mapping rule editor
AI-Assisted Generation: Leveraging large language models to automatically generate mapping rules
Template Library Development: Providing predefined templates for different technology stacks and project types

In-Depth Evaluation

Strengths

Strong Method Innovation: First application of declarative mapping rules to CodeMeta generation, providing a novel technical pathway
High Practical Value: Addressing actual pain points in research software metadata management
Complete Implementation: Providing comprehensive solutions from concept to deployment
Good Reproducibility: Offering detailed implementation code and deployment guidelines

Weaknesses

Limited Evaluation Scope: Testing on only two similar projects, lacking large-scale validation
Missing Performance Analysis: No performance assessment for handling large projects or large-scale data
Insufficient Error Handling: Inadequate robustness analysis for data source unavailability or format changes
Lack of User Studies: No user acceptance and usability evaluation conducted

Impact

Academic Contribution: Providing new technical solutions for research software metadata management
Practical Value: Directly applicable to CodeMeta adoption in open-source projects
Ecosystem Promotion: Helping increase CodeMeta adoption rates in the research software community
Standards Advancement: Supporting FAIR principles implementation in the research software domain

Applicable Scenarios

Open-Source Software Projects: Particularly suitable for research software requiring publication across multiple platforms
Academic Institutions: Applicable to institutional-level research software metadata management
CI/CD Integration: Suitable for projects with existing automated release processes
Metadata Standardization: Applicable to research organizations requiring unified metadata formats

References

The paper includes 37 references covering important works in FAIR principles, semantic web technologies, CodeMeta specifications, declarative mapping languages, and related fields, providing solid theoretical foundation and technical support for the research.

Overall Assessment: This is a practically oriented paper in research software metadata management. It offers a declarative mapping approach, a complete and reproducible implementation, and a concrete route to wider CodeMeta adoption. Although the evaluation could be broader and deeper, the paper makes a valuable technical contribution to the field.