2025-11-13T19:19:11.174126

Formal Verification of COO to CSR Sparse Matrix Conversion (Invited Paper)

Appel

We describe a machine-checked correctness proof of a C program that converts a coordinate-form (COO) sparse matrix to a compressed-sparse-row (CSR) matrix. The classic algorithm (sort the COO entries in lexicographic order by row,column; fill in the CSR arrays left to right) is concise but has rather intricate invariants. We illustrate a bottom-up methodology for deriving the invariants from the program.

academic

Formal Verification of COO to CSR Sparse Matrix Conversion (Invited Paper)

Basic Information

Paper ID: 2510.13412
Title: Formal Verification of COO to CSR Sparse Matrix Conversion (Invited Paper)
Author: Andrew W. Appel (Princeton University)
Classification: math.NA cs.LO cs.NA
Publication Venue/Conference: VSS 2025 (International Workshop on Verification of Scientific Software), EPTCS 432, 2025
Paper Link: https://arxiv.org/abs/2510.13412

Abstract

This paper describes a machine-checked correctness proof for verifying a C program that converts sparse matrices from Coordinate (COO) format to Compressed Sparse Row (CSR) format. Although the classical algorithm—sorting COO entries in lexicographic order by row and column, then filling CSR arrays from left to right—is concise, it exhibits considerably complex invariants. The paper demonstrates a bottom-up methodology for deriving invariants from the program.

Research Background and Motivation

Problem Definition

Core Problem: Sparse matrices are widely used in numerical computation, but existing sparse matrix conversion algorithms lack formal verification, potentially harboring difficult-to-detect errors
Significance: Sparse matrix-vector multiplication is a fundamental operation in numerical methods; errors in conversion can cause failure throughout the entire computational pipeline
Existing Limitations: Traditional verification methods rely on regression testing and similar techniques, which cannot provide mathematical correctness guarantees
Research Motivation: Through machine-checked formal proofs, ensure absolute correctness of COO to CSR conversion, establishing a foundation for verifiable numerical software

Application Context

Sparse Matrix Representations: COO format facilitates construction; CSR format facilitates multiplication operations
Finite Element Analysis: Each internal vertex in a mesh generates multiple coordinate tuples, naturally producing duplicate entries
Numerical Precision: The non-associativity of floating-point arithmetic makes the summation order of duplicate entries affect results

Core Contributions

First Machine-Checked Correctness Proof for COO to CSR Conversion: Using the Verifiable Software Toolchain (VST) and Coq proof assistant
Innovative Loop Invariant Derivation Methodology: Proposes a bottom-up approach to derive complex loop invariants from properties the program must satisfy
Separation of Data Structure and Approximation Reasoning: Decouples data structure reasoning in C programs from floating-point approximation reasoning, simplifying verification complexity
Composable Verification Components: Provides verified sparse matrix conversion modules reusable in larger verification systems
Practical Error Discovery: Identified and fixed 5 program errors during proof (4 off-by-one errors and 1 complex unsigned integer handling error)

Methodology Details

Task Definition

Input: COO sparse matrix - containing dimensions rows×cols and n coordinate triples (row_indk, col_indk, valk) Output: CSR sparse matrix - containing three one-dimensional arrays (val, col_ind, row_ptr) Constraints: Maintain mathematical matrix semantic equivalence; correctly handle floating-point summation of duplicate entries

Core Algorithm

struct csr_matrix *coo_to_csr_matrix(struct coo_matrix *p) {
    // 1. Sort COO entries in lexicographic order by (row, column)
    coo_quicksort(p, 0, n);
    
    // 2. Count the number of unique coordinate pairs
    k = coo_count(p);
    
    // 3. Allocate CSR array space
    // 4. Traverse sorted COO entries and construct CSR structure
    for (i=0; i<n; i++) {
        // Handle duplicate entries, new columns, new rows, etc.
    }
}

Formal Specification Architecture

1. Mathematical Object Definitions

Record coo_matrix (t: type) := {
    coo_rows: Z;
    coo_cols: Z;
    coo_entries: list (Z * Z * ftype t)
}.

Record csr_matrix {t: type} := {
    csr_cols: Z;
    csr_vals: list (ftype t);
    csr_col_ind: list Z;
    csr_row_ptr: list Z;
    csr_rows: Z := Zlength (csr_row_ptr) - 1
}.

2. Memory Representation Relations

coo_rep (sh: share) (coo: coo_matrix Tdouble) (p: val) : mpred
csr_rep (sh: share) (csr: csr_matrix Tdouble) (p: val) : mpred

3. Matrix Semantic Relations

Definition coo_to_matrix {t: type} (coo: coo_matrix t) (m: matrix t) : Prop :=
    coo_rows coo = matrix_rows m ∧
    matrix_cols m (coo_cols coo) ∧
    ∀ i j, sum_any (list of duplicate entry values) (matrix_index m i j).

Technical Innovations

1. Non-Deterministic Modeling of Floating-Point Summation

Inductive sum_any {t}: list (ftype t) → ftype t → Prop :=
| Sum_Any_0: sum_any nil (Zconst t 0)
| Sum_Any_1: ∀ x, sum_any [x] x  
| Sum_Any_split: ∀ al bl a b, sum_any al a → sum_any bl b →
    sum_any (al++bl) (BPLUS a b)
| Sum_Any_perm: ∀ al bl s, Permutation al bl → sum_any al s → sum_any bl s.

2. Partial CSR Relation Definition

A key innovation is the partial_CSR i r coo ROWPTR COLIND VAL relation, representing the partial CSR state when processing the i-th COO entry at row r.

3. Systematic Loop Invariant Derivation

By analyzing critical state transition points in the program, systematically derive required invariant properties:

partial_CSR_0: Initial state
partial_CSR_duplicate: Processing duplicate entries
partial_CSR_newcol: New column entry
partial_CSR_newrow: New row entry
partial_CSR_skiprow: Skipping empty rows

Experimental Setup

Verification Toolchain

Coq Proof Assistant: For formal specification and proof
Verifiable Software Toolchain (VST): For Hoare logic verification of C programs
Verifiable C: Program logic in VST, embedded in Coq

Verification Scale

Definitions and Lemmas: 1571 lines of Coq code for coo_csr and partial_CSR definitions and properties
VST Proof: 412 lines for main Hoare logic proof
Trusted Base: Core specification approximately 39 lines, requiring manual inspection of critical components

Verification Methodology

Layered Verification: First prove C program implements low-level specification, then prove mathematical correctness of low-level specification
Modular Design: Separate data structure reasoning from floating-point approximation reasoning
Bottom-Up Derivation: Reverse-engineer loop invariants from Hoare logic proof requirements

Experimental Results

Main Achievements

Complete Correctness Proof: Successfully proved complete correctness of COO to CSR conversion
Error Discovery: Identified 5 actual errors during verification:
- 4 off-by-one errors
- 1 complex unsigned integer initialization error (initializing r to -1)
Composability: Verified modules can be combined with other verified sparse matrix operations

Verification Coverage

Function Specifications: Complete preconditions and postconditions
Loop Invariants: Complex invariants for three nested loops
Memory Safety: Array bounds and memory allocation safety
Mathematical Correctness: Preservation of matrix semantics

Performance Characteristics

Compile-Time Verification: No runtime overhead
Trustworthiness: Machine-checked proof based on Coq kernel
Maintainability: Modular specification design facilitates subsequent modifications and extensions

Formal Verification Domain

Numerical Software Verification: This paper is an important case study in end-to-end verification of numerical algorithms
VST Toolchain: Extends existing C program verification framework to sparse matrix applications
Floating-Point Arithmetic Verification: Incorporates tools like VCFloat2 for floating-point precision analysis

Sparse Matrix Algorithms

Classical Algorithms: COO to CSR conversion is a standard algorithm from decades of practice
Numerical Linear Algebra: Consistent with algorithms in classical literature like Templates for Linear Systems
High-Performance Computing: Provides foundational components for verifiable scientific computing software

Program Verification Methodology

Loop Invariant Derivation: Proposed bottom-up method has general applicability
Separation Logic: Effectively handles complex memory data structures
Specification Engineering: Demonstrates specification design principles for large-scale verification projects

Conclusions and Discussion

Main Conclusions

Feasibility Demonstration: Complete formal verification of complex numerical algorithms is feasible
Methodological Contribution: Bottom-up invariant derivation method has broad applicability
Practical Value: Actual errors discovered during verification demonstrate the value of formal methods
Infrastructure: Establishes foundation for larger-scale scientific computing software verification

Limitations

Bit-for-Bit Reproducibility: Current specification allows different floating-point summation orders, not guaranteeing bit-level reproducibility
Performance Considerations: Verified algorithm not optimized for performance
Reassembly Functionality: Optimized version reusing CSR structures not implemented
Verification Cost: Relatively short program requires substantial verification effort

Future Directions

Stronger Specifications: Specification versions supporting bit-for-bit reproducibility
Performance Optimization: Verify high-performance sparse matrix algorithm variants
Larger Systems: Integrate this module into complete PDE solver verification
Automation: Develop better tools supporting automatic loop invariant derivation

In-Depth Evaluation

Strengths

Technical Depth: Addresses multiple technical challenges in sparse matrix algorithms, including floating-point arithmetic, memory management, and complex control flow
Methodological Innovation: Bottom-up invariant derivation method provides a replicable paradigm for similar verifications
Practical Value: Discovery of actual errors demonstrates real benefits of formal verification
Engineering Quality: Modular design and clear specification structure reflect high-quality verification engineering
Completeness: End-to-end verification from C code to mathematical specification

Weaknesses

Verification Cost: 1983 lines of verification code for relatively short C program represents high cost
Limited Generality: Specifically targets COO to CSR conversion with limited generalization capability
Insufficient Performance Consideration: Does not address performance optimization needs in practical applications
Tool Dependency: Highly dependent on VST and Coq ecosystem

Impact

Academic Contribution: Provides important case study and methodology for numerical software verification
Practical Impact: Can serve as foundational component for highly trustworthy scientific computing software
Methodology Promotion: Loop invariant derivation method applicable to verification of other complex algorithms
Tool Development: Promotes application of VST and similar verification tools in numerical computing domain

Applicable Scenarios

Critical Scientific Computing: Numerical simulations and analyses requiring high trustworthiness
Safety-Critical Systems: Software involving numerical computation in safety-critical applications
Verification Research: Reference case study for formal verification of complex algorithms
Educational Use: Demonstrates capabilities and methods of modern program verification techniques

References

Key references cited in this paper include:

Barrett et al. (1994): Templates for the Solution of Linear Systems - Classical reference for sparse matrix algorithms
Appel & Kellison (2024): VCFloat2 - Floating-point error analysis tool
Cao et al. (2018): VST-Floyd - Separation logic verification tool
Kellison et al. (2023): LAProof - Formal proof library for linear algebra programs

Summary: This paper demonstrates the feasibility of complete formal verification of complex numerical algorithms, proposes practical verification methodologies, and makes important contributions to the development of trustworthy scientific computing software. Despite high verification costs, its value in discovering actual errors and providing methodological guidance makes it an important work in the field.