2025-11-20T05:01:15.151274

LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization

Merouani, Boudaoud, Baghdadi

The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing down innovation and hindering reproducible research learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism)to a ground truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.

academic

LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization

Basic Information

Paper ID: 2510.10209
Title: LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization
Authors: Massinissa Merouani, Afif Boudaoud, Riyadh Baghdadi (New York University Abu Dhabi)
Categories: cs.PL (Programming Languages), cs.LG (Machine Learning), cs.PF (Performance)
Publication Date: October 11, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10209

Abstract

The development of machine learning-based compiler optimization in the polyhedral model has been constrained by the scarcity of large-scale public performance datasets. This data bottleneck forces researchers to conduct expensive data generation activities, slowing innovation and hindering reproducible code optimization research. To address this issue, the authors introduce LOOPerSet, a new public dataset containing 28 million labeled data points sourced from 220,000 unique synthetically generated polyhedral programs. Each data point maps a program and complex semantic-preserving transformation sequences (such as fusion, skewing, tiling, and parallelization) to real performance measurements (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learning cost models, benchmarking new model architectures, and exploring the frontiers of automatic polyhedral scheduling.

Research Background and Motivation

Core Problem

The polyhedral model provides a powerful framework for expressing and applying complex loop transformations, which is critical for optimizing scientific computing and high-performance applications. However, the main challenge lies in navigating the enormous search space of legal transformation sequences to find those that provide optimal performance on a given hardware target.

Problem Significance

Limitations of Traditional Approaches: Existing analytical cost models and heuristics, while general and tractable, struggle to capture subtle nonlinear interactions between optimizations and underlying systems
Potential of Data-Driven Methods: Machine learning approaches can develop more nuanced understanding of transformation cost-effectiveness on real hardware by training on large volumes of performance data
Data Scarcity Bottleneck: The lack of large-scale public performance datasets severely constrains data-driven compiler optimization research

Limitations of Existing Methods

High Data Generation Costs: Research teams must conduct expensive and time-consuming data generation activities
Poor Reproducibility: The absence of public datasets impedes rigorous method comparison
High Research Barriers: High data collection costs hinder potential contributors from entering the field

Core Contributions

Large-Scale Public Dataset: Construction of the LOOPerSet dataset containing 28 million labeled data points sourced from 220,000 unique synthetic polyhedral programs
Diversity Assurance: Multi-stage randomized program generator ensures structural diversity, avoiding bias toward specific benchmarks
Relevance-Guided Sampling: Employs relevance-guided transformation space sampling strategy to ensure the dataset contains practically useful optimization sequences
Rigorous Validation: Verifies dataset diversity and novelty through quantitative methods such as normalized tree edit distance
Open Access: Released under permissive licensing to promote reproducible research and lower barriers to data-driven compiler optimization

Methodology Details

Task Definition

Construct a large-scale, diverse dataset where each data point includes:

Input: Polyhedral program representation + transformation sequence
Output: Real performance measurements on hardware (execution time)
Constraints: All transformations must preserve semantic correctness

Data Generation Pipeline

1. Program Space Sampling: Synthetic Program Generator

Multi-Stage Randomization Process:

Loop Structure Generation:

Probabilistically determine the number of top-level loop nestings
Recursively construct the structure of each nesting
Generate rectangular and non-rectangular (triangular, trapezoidal) iteration domains
Loop bounds can be constants or functions of outer loop iterators

Computation Placement and Ordering:

Randomly place computations within loop nestings
Can interleave computations and sub-nestings at the same level
Assign data types (32/64-bit floating-point or integer) to each computation

Memory Access and Expression Generation:

Memory Patterns: Create diverse memory access patterns, from simple identity mappings to complex multi-dimensional templates (star, cross patterns) and constant offset accesses
Arithmetic Expressions: Construct computation logic through random combination of expression trees, incorporating memory accesses and scalar values, using common arithmetic operators and mathematical functions

Consistency and Validation Checks:

Detect and prevent trivial work (redundant loop computations, dead writes, etc.)
Ensure synthetic programs are meaningful both syntactically and computationally

2. Transformation Space Sampling: Relevance-Guided Exploration

Uses the execution-guided search mechanism of the LOOPer automatic scheduler for beam search, exploring promising sequences of key polyhedral optimizations:

Loop Fusion
Skewing
Interchange
Reversal
Tiling
Parallelization
Unrolling

Legality Verification: Uses standard polyhedral dependence analysis to ensure all transformation sequences preserve semantic correctness.

3. Performance Label Generation

Use the Tiramisu compiler framework to generate executables
Execute on a dual-socket Intel Xeon E5-2695 v2 processor system
Execute each program version up to 30 times to ensure measurement stability
Record complete execution time lists to account for system noise

Technical Innovations

Maximized Structural Diversity: Ensures broad coverage of program structures through recursive probabilistic generation process
Relevance-Guided Sampling: Avoids inefficiency of random sampling, focusing on transformation sequences that actual compilers would consider
Quantified Diversity Verification: Uses formal methods such as normalized tree edit distance to verify dataset quality
Hardware Adaptability Design: Supports pre-training and transfer learning, reducing adaptation costs for new architectures

Experimental Setup

Dataset Scale

Total Programs: Approximately 220,000 unique programs
Total Data Points: Over 28 million labeled examples
Schedules per Program: Median of 70
Data Generation Workload: Approximately 71,000 CPU hours
Speedup Range: 0.0004× to 1230×

Hardware Platform

Target Architecture: Dual-socket Intel Xeon E5-2695 v2 processor system
Measurement Method: Execute each program version up to 30 times, recording execution time distribution

Validation Methods

Structural Similarity: Measure structural similarity between programs using normalized tree edit distance (nTED)
Benchmark Comparison: Quantitative comparative analysis with PolyBench suite
Feature Space Analysis: Visualization of 20-dimensional feature space using Principal Component Analysis (PCA)

Experimental Results

Dataset Statistical Characteristics

Structural Diversity:

14% of programs contain at least one non-rectangular iteration domain
Loop depth, memory reference patterns, and branching factors exhibit long-tail distributions
Memory footprint, baseline execution time, and total iteration domain volume span multiple orders of magnitude

Performance Distribution:

Measured speedups exhibit a peaked distribution, concentrated around 1.0×
Right tail demonstrates the existence of effective transformation sequences
Left tail captures harmful schedules

Diversity Verification Results

Comparison with PolyBench:

No Duplication Confirmed: Minimum nTED distance never equals zero, with the most similar being seidel-2d (nTED=0.022)
Broader Structural Space: Median distance between synthetic programs and benchmarks (0.537) exceeds median distance within PolyBench (0.467)
Feature Space Coverage: PCA visualization shows PolyBench programs lie within the dense region of LOOPerSet's feature cloud

Distribution Comparison:

Cumulative distribution functions show the distance distribution between synthetic programs and benchmarks consistently falls below the distance distribution within benchmarks
Indicates LOOPerSet explores a broader and more diverse structural space than existing benchmarks

Polyhedral Compiler Optimization

Traditional Methods: PLUTO, PolyOpt, GRAPHITE and other analytical cost model-based approaches
Learning Methods: Tiramisu automatic scheduler, TVM/Ansor, Halide optimizer and other data-driven methods

Performance Datasets

Existing Limitations: Lack of large-scale public polyhedral optimization performance datasets
Related Resources: TpuGraphs and other tensor computation graph performance prediction datasets

Program Synthesis

Benchmarks: Limitations of standard benchmark suites like PolyBench
Synthesis Methods: Applications of random program generation in compiler research

Conclusions and Discussion

Main Conclusions

Data Bottleneck Resolution: LOOPerSet effectively addresses the data scarcity problem in polyhedral compiler optimization research
Quality Assurance: Ensures dataset quality through rigorous diversity analysis and relevance-guided sampling
Community Resource: Provides the research community with an immediately usable large-scale benchmarking platform

Limitations

Hardware Specificity: Performance labels are specific to the Intel Xeon E5-2695 v2 architecture
Synthetic Program Limitations: While diverse, may not fully cover all real-world program patterns
Transformation Space: Limited to transformation types supported by the LOOPer system

Future Directions

Cross-Architecture Extension: Generate performance labels on GPUs and other CPU microarchitectures
Transfer Learning Research: Leverage the dataset to study zero-shot or few-shot generalization
New Model Architectures: Explore new cost model architectures such as GNNs and Transformers
Interpretability Research: Analyze model failure modes to improve generalization capability

In-Depth Evaluation

Strengths

Unprecedented Scale: The scale of 28 million data points is unprecedented in this field
Rigorous Methodology: Multi-stage generation pipeline and quantitative validation methods are scientifically sound
High Practical Value: Relevance-guided sampling ensures practical applicability of the dataset
Strong Openness: CC-BY 4.0 licensing and Hugging Face platform ensure accessibility
Reproducibility: Detailed data format specifications and tool support

Weaknesses

Architecture Dependence: Performance labels are limited to a single hardware platform
Limited Validation: Lacks validation in real-world applications
Generation Bias: Synthetic programs may exhibit systematic biases
Transformation Coverage: Transformation types are limited by existing tool support

Impact

Academic Contribution: Provides infrastructure for data-driven compiler optimization research
Practical Value: Significantly lowers entry barriers for new researchers
Reproducibility: Promotes method comparison and result reproduction
Long-Term Impact: May drive the field toward a more data-driven direction

Applicable Scenarios

Cost Model Training: Train and evaluate various machine learning cost models
Architecture Comparison: Benchmark different model architectures and characterization methods
Transfer Learning: Serve as pre-training dataset supporting new architecture adaptation
Heuristic Discovery: Discover new compiler heuristics through data mining
Interpretability Research: Analyze model behavior and failure modes

Dataset Access Information

Access URL: https://huggingface.co/datasets/Mascinissa/LOOPerSet
Data Format: JSON Lines (.jsonl)
License: Creative Commons Attribution 4.0 International (CC-BY 4.0)
Version Options:
- Full version: 28 million data points
- Lite version: 10 million data points (consistent with LOOPer paper experiments)

The LOOPerSet dataset represents an important milestone in polyhedral compiler optimization research. By providing a large-scale, high-quality public dataset, it is expected to significantly advance the field and lower research barriers. Its rigorous construction methodology and open access approach make it a valuable resource for future related research.