2025-11-23T21:58:17.757337

Towards Richer Challenge Problems for Scientific Computing Correctness

Sottile, Tekriwal, Sarracino
Correctness in scientific computing (SC) is gaining increasing attention in the formal methods (FM) and programming languages (PL) community. Existing PL/FM verification techniques struggle with the complexities of realistic SC applications. Part of the problem is a lack of a common understanding between the SC and PL/FM communities of machine-verifiable correctness challenges and dimensions of correctness in SC applications. To address this gap, we call for specialized challenge problems to inform the development and evaluation of FM/PL verification techniques for correctness in SC. These specialized challenges are intended to augment existing problems studied by FM/PL researchers for general programs to ensure the needs of SC applications can be met. We propose several dimensions of correctness relevant to scientific computing, and discuss some guidelines and criteria for designing challenge problems to evaluate correctness in scientific computing.
academic

Towards Richer Challenge Problems for Scientific Computing Correctness

Basic Information

  • Paper ID: 2510.13423
  • Title: Towards Richer Challenge Problems for Scientific Computing Correctness
  • Authors: Matthew Sottile, Mohit Tekriwal, John Sarracino (Lawrence Livermore National Laboratory)
  • Classification: cs.SE cs.MS
  • Published Conference: International Workshop on Verification of Scientific Software (VSS 2025), EPTCS 432
  • Paper Link: https://arxiv.org/abs/2510.13423

Abstract

The correctness of scientific computing (SC) has received increasing attention from the formal methods (FM) and programming languages (PL) communities. Existing PL/FM verification techniques face difficulties in handling the complexity of real-world scientific computing applications. Part of the problem lies in the lack of shared understanding between the SC and PL/FM communities regarding machine-verifiable correctness challenges and the dimensions of correctness in SC applications. To address this gap, the authors call for the establishment of specialized challenge problems to guide the development and evaluation of FM/PL verification techniques in SC. These specialized challenges aim to augment existing generic program problems studied by FM/PL researchers to ensure they meet the needs of SC applications.

Research Background and Motivation

Problems to Address

  1. Gap in Inter-community Understanding: Lack of shared understanding between the scientific computing community and the formal methods/programming languages community regarding correctness challenges
  2. Limitations of Existing Verification Techniques: Current PL/FM verification techniques struggle to handle the complexity of real-world scientific computing applications
  3. Insufficient Challenge Problems: Lack of standardized challenge problem sets specifically designed for scientific computing correctness

Problem Significance

Scientific computing applications involve complex numerical computations, parallel processing, physical modeling, and multiple other dimensions. Their correctness directly impacts the reliability of scientific research results. Traditional software verification methods often fail to adequately cover the correctness requirements unique to scientific computing.

Limitations of Existing Approaches

  • Existing formal verification challenge problems primarily target generic programs and lack the complexity specific to scientific computing
  • While the numerical verification community has relevant work, there is no unified challenge problem set
  • Existing benchmark suites primarily focus on performance rather than correctness

Research Motivation

Drawing on the successful experience of performance benchmark suites in high-performance computing (such as NAS Parallel Benchmarks and Mantevo), establish a similar challenge problem framework for scientific computing correctness.

Core Contributions

  1. Proposes Six Dimensions of Scientific Computing Correctness: Numerics, data structures, domain-modeling structures, differential equations, concurrency and parallelism, and approximation schemes
  2. Identifies Key Pitfalls in Challenge Problem Design: Over-specialization, "toy" problems, neglecting scientific computing uniqueness, etc.
  3. Establishes Distinction Between Challenge Problems and Benchmarks: Challenge problems define objectives and evaluation criteria, while benchmarks provide objective measurements
  4. Provides Design Guidance Principles: Consider uncertainty, separate mathematics from implementation, allow unchecked assumptions, etc.

Methodology Details

Task Definition

This paper is a position paper aimed at establishing a comprehensive challenge problem framework for scientific computing correctness, rather than proposing specific technical methods.

Framework Design

Correctness Dimension Classification

The authors categorize scientific computing correctness into three abstraction levels:

  • Low-level: Numerics and traditional data structures
  • Mid-level: Model-specific data structures and computations
  • High-level: Mathematical abstractions and physical system invariants

Six Core Dimensions

  1. Numerics
    • Correct correspondence between mathematical operations and hardware/software implementations
    • Precision issues in floating-point arithmetic
    • Challenges in mixed-precision algorithms
  2. Data Structures
    • Correctness of standard data structures
    • Structure transformations due to performance optimization (e.g., SOA to AOS conversion)
    • Semantic equivalence guarantees
  3. Domain-modeling Structures
    • Complex data structures such as grids and graphs
    • Satisfaction of physical system constraints
    • High-level invariants such as conservation laws
  4. Differential Equations
    • Consistency between PDEs and physical modeling
    • Numerical stability and boundary condition compatibility
    • Well-posedness
  5. Concurrency and Parallelism
    • Combinations of multiple parallel programming models
    • Shared memory, vectorization, and distributed memory parallelism
    • Balance between performance and correctness
  6. Approximation Schemes
    • Algorithmic heuristic methods
    • Interpolation methods
    • Distinction from numerical methods

Technical Innovations

  1. Multi-level Abstraction Integration: First systematic framework unifying scientific computing correctness issues from low-level implementation details to high-level physical constraints
  2. Community Bridge Role: Establishes common language between formal methods and scientific computing communities
  3. Practical Orientation: Avoids over-theorization, focusing on real-world correctness requirements

Experimental Setup

As a position paper, this work does not include traditional experimental setup but supports its arguments through analysis of existing benchmark suites and challenge problems.

Analysis Objects

  • Performance Benchmarks: NAS Parallel Benchmarks, Mantevo, Salishan problems, Shonan challenge
  • Correctness Challenges: VerifyThis, NSV-3 benchmarks, Gallery of Verified Programs
  • Specialized Benchmarks: FPbench, DataRaceBench, SPEC

Evaluation Criteria

The authors propose that challenge problems should possess the following characteristics:

  • Cover multiple correctness dimensions
  • Avoid over-specialization
  • Possess real-world relevance
  • Address unique scientific computing needs

Experimental Results

Current State Analysis

The authors' analysis reveals the following deficiencies in existing challenge problems:

  1. Insufficient Coverage: Lack of complex algorithm categories such as graph algorithms and sparse matrix representations
  2. Simple Data Structures: Inadequate coverage of complex sparse matrix representations beyond basic CSR
  3. Incomplete Mathematical Domains: Insufficient coverage of fundamental mathematical domains

Lessons from Success Cases

Evolution of performance benchmarks:

  • Linpack → Livermore Loops → NAS Parallel Benchmarks → Mantevo
  • Gradually increasing complexity, from simple linear algebra to complete application code
  • Evaluation metrics expanding from pure performance to correctness and scalability

Development of Performance Benchmarks

  • Early Benchmarks: Linpack focused on floating-point arithmetic performance
  • Loop Kernels: Livermore Loops tested specific computational patterns
  • Parallel Benchmarks: NAS Parallel Benchmarks introduced parallelism considerations
  • Application-oriented: Mantevo provides mini-apps close to real applications

Correctness Verification Challenges

  • Generic Verification: VerifyThis competitions provide foundational challenges
  • Numerical Verification: coq-num-analysis, Mathematical Components Library
  • Specialized Domains: FPbench (floating-point), DataRaceBench (data races)

Verification and Validation (V&V)

  • ASME V&V guidance principles provide verification and validation standards for engineering disciplines
  • Distinguishes between verification and validation problems

Conclusions and Discussion

Main Conclusions

  1. Scientific computing correctness requires specialized challenge problems to advance formal methods development
  2. Challenge problems must span computational abstraction levels, integrating low-level implementation and high-level domain constraints
  3. Need to avoid over-specialization and "toy" problems, focusing on real-world application relevance

Limitations

  1. Theoretical Nature: As a position paper, lacks concrete challenge problem examples
  2. Implementation Complexity: Establishing a comprehensive challenge problem set requires interdisciplinary collaboration
  3. Evaluation Standards: How to objectively assess the quality of challenge problems requires further research

Future Directions

  1. Collaborate with scientific computing and formal methods experts to design concrete challenge problems
  2. Establish standardized evaluation frameworks and metrics
  3. Consider integration of uncertainty quantification and statistical modeling
  4. Extend to verification and validation (V&V) problems

In-depth Evaluation

Strengths

  1. Accurate Problem Identification: Accurately identifies key challenges in scientific computing correctness verification
  2. Systematic Framework: The proposed correctness dimension framework demonstrates excellent systematicity and completeness
  3. Practical Orientation: Avoids pure theoretical discussion, focusing on real-world application requirements
  4. Interdisciplinary Perspective: Effectively bridges formal methods and scientific computing communities
  5. Historical Insights: Draws valuable lessons from the evolution of performance benchmarks

Weaknesses

  1. Lack of Concrete Examples: Provides no concrete challenge problem examples to validate framework feasibility
  2. Unclear Implementation Path: Lacks detailed planning for transitioning from theoretical framework to practical challenge problem sets
  3. Missing Evaluation Mechanisms: Lacks specific mechanisms for assessing challenge problem quality and effectiveness
  4. Unknown Community Acceptance: Uncertain about the two communities' acceptance and willingness to participate in this framework

Impact

  1. Academic Value: Provides important theoretical framework for scientific computing correctness research
  2. Practical Potential: May promote development of more practical formal verification techniques
  3. Community Building: May foster deeper collaboration between scientific computing and formal methods communities
  4. Long-term Significance: Provides new research directions for scientific computing software quality assurance

Applicable Scenarios

  1. Research Guidance: Provides formal methods researchers with scientific computing application research directions
  2. Tool Development: Guides design and evaluation of scientific computing verification tools
  3. Education and Training: Provides systematic framework for scientific computing correctness education
  4. Standard Development: Offers reference for scientific computing software quality standard formulation

References

The paper cites 26 important references, covering:

  • Performance Benchmarks: NAS Parallel Benchmarks 7,8, Mantevo 11, Linpack 14
  • Formal Verification: Gallery of Verified Programs 1,17, VerifyThis 20
  • Numerical Verification: coq-num-analysis 6, Mathematical Components 2
  • Specialized Benchmarks: FPbench 12, DataRaceBench 21, SPEC 13
  • V&V Standards: ASME Guidance Principles 18

Although this paper is a position paper, it provides an important theoretical framework and development direction for scientific computing correctness verification. The proposed six-dimensional correctness framework and design guidance principles are significant for advancing the field, but require subsequent work to concretely implement and validate these concepts.