The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing down innovation and hindering reproducible research learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism)to a ground truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.
- Paper ID: 2510.10209
- Title: LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization
- Authors: Massinissa Merouani, Afif Boudaoud, Riyadh Baghdadi (New York University Abu Dhabi)
- Categories: cs.PL (Programming Languages), cs.LG (Machine Learning), cs.PF (Performance)
- Publication Date: October 11, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.10209
The development of machine learning-based compiler optimization in the polyhedral model has been constrained by the scarcity of large-scale public performance datasets. This data bottleneck forces researchers to conduct expensive data generation activities, slowing innovation and hindering reproducible code optimization research. To address this issue, the authors introduce LOOPerSet, a new public dataset containing 28 million labeled data points sourced from 220,000 unique synthetically generated polyhedral programs. Each data point maps a program and complex semantic-preserving transformation sequences (such as fusion, skewing, tiling, and parallelization) to real performance measurements (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learning cost models, benchmarking new model architectures, and exploring the frontiers of automatic polyhedral scheduling.
The polyhedral model provides a powerful framework for expressing and applying complex loop transformations, which is critical for optimizing scientific computing and high-performance applications. However, the main challenge lies in navigating the enormous search space of legal transformation sequences to find those that provide optimal performance on a given hardware target.
- Limitations of Traditional Approaches: Existing analytical cost models and heuristics, while general and tractable, struggle to capture subtle nonlinear interactions between optimizations and underlying systems
- Potential of Data-Driven Methods: Machine learning approaches can develop more nuanced understanding of transformation cost-effectiveness on real hardware by training on large volumes of performance data
- Data Scarcity Bottleneck: The lack of large-scale public performance datasets severely constrains data-driven compiler optimization research
- High Data Generation Costs: Research teams must conduct expensive and time-consuming data generation activities
- Poor Reproducibility: The absence of public datasets impedes rigorous method comparison
- High Research Barriers: High data collection costs hinder potential contributors from entering the field
- Large-Scale Public Dataset: Construction of the LOOPerSet dataset containing 28 million labeled data points sourced from 220,000 unique synthetic polyhedral programs
- Diversity Assurance: Multi-stage randomized program generator ensures structural diversity, avoiding bias toward specific benchmarks
- Relevance-Guided Sampling: Employs relevance-guided transformation space sampling strategy to ensure the dataset contains practically useful optimization sequences
- Rigorous Validation: Verifies dataset diversity and novelty through quantitative methods such as normalized tree edit distance
- Open Access: Released under permissive licensing to promote reproducible research and lower barriers to data-driven compiler optimization
Construct a large-scale, diverse dataset where each data point includes:
- Input: Polyhedral program representation + transformation sequence
- Output: Real performance measurements on hardware (execution time)
- Constraints: All transformations must preserve semantic correctness
Multi-Stage Randomization Process:
Loop Structure Generation:
- Probabilistically determine the number of top-level loop nestings
- Recursively construct the structure of each nesting
- Generate rectangular and non-rectangular (triangular, trapezoidal) iteration domains
- Loop bounds can be constants or functions of outer loop iterators
Computation Placement and Ordering:
- Randomly place computations within loop nestings
- Can interleave computations and sub-nestings at the same level
- Assign data types (32/64-bit floating-point or integer) to each computation
Memory Access and Expression Generation:
- Memory Patterns: Create diverse memory access patterns, from simple identity mappings to complex multi-dimensional templates (star, cross patterns) and constant offset accesses
- Arithmetic Expressions: Construct computation logic through random combination of expression trees, incorporating memory accesses and scalar values, using common arithmetic operators and mathematical functions
Consistency and Validation Checks:
- Detect and prevent trivial work (redundant loop computations, dead writes, etc.)
- Ensure synthetic programs are meaningful both syntactically and computationally
Uses the execution-guided search mechanism of the LOOPer automatic scheduler for beam search, exploring promising sequences of key polyhedral optimizations:
- Loop Fusion
- Skewing
- Interchange
- Reversal
- Tiling
- Parallelization
- Unrolling
Legality Verification: Uses standard polyhedral dependence analysis to ensure all transformation sequences preserve semantic correctness.
- Use the Tiramisu compiler framework to generate executables
- Execute on a dual-socket Intel Xeon E5-2695 v2 processor system
- Execute each program version up to 30 times to ensure measurement stability
- Record complete execution time lists to account for system noise
- Maximized Structural Diversity: Ensures broad coverage of program structures through recursive probabilistic generation process
- Relevance-Guided Sampling: Avoids inefficiency of random sampling, focusing on transformation sequences that actual compilers would consider
- Quantified Diversity Verification: Uses formal methods such as normalized tree edit distance to verify dataset quality
- Hardware Adaptability Design: Supports pre-training and transfer learning, reducing adaptation costs for new architectures
- Total Programs: Approximately 220,000 unique programs
- Total Data Points: Over 28 million labeled examples
- Schedules per Program: Median of 70
- Data Generation Workload: Approximately 71,000 CPU hours
- Speedup Range: 0.0004× to 1230×
- Target Architecture: Dual-socket Intel Xeon E5-2695 v2 processor system
- Measurement Method: Execute each program version up to 30 times, recording execution time distribution
- Structural Similarity: Measure structural similarity between programs using normalized tree edit distance (nTED)
- Benchmark Comparison: Quantitative comparative analysis with PolyBench suite
- Feature Space Analysis: Visualization of 20-dimensional feature space using Principal Component Analysis (PCA)
Structural Diversity:
- 14% of programs contain at least one non-rectangular iteration domain
- Loop depth, memory reference patterns, and branching factors exhibit long-tail distributions
- Memory footprint, baseline execution time, and total iteration domain volume span multiple orders of magnitude
Performance Distribution:
- Measured speedups exhibit a peaked distribution, concentrated around 1.0×
- Right tail demonstrates the existence of effective transformation sequences
- Left tail captures harmful schedules
Comparison with PolyBench:
- No Duplication Confirmed: Minimum nTED distance never equals zero, with the most similar being seidel-2d (nTED=0.022)
- Broader Structural Space: Median distance between synthetic programs and benchmarks (0.537) exceeds median distance within PolyBench (0.467)
- Feature Space Coverage: PCA visualization shows PolyBench programs lie within the dense region of LOOPerSet's feature cloud
Distribution Comparison:
- Cumulative distribution functions show the distance distribution between synthetic programs and benchmarks consistently falls below the distance distribution within benchmarks
- Indicates LOOPerSet explores a broader and more diverse structural space than existing benchmarks
- Traditional Methods: PLUTO, PolyOpt, GRAPHITE and other analytical cost model-based approaches
- Learning Methods: Tiramisu automatic scheduler, TVM/Ansor, Halide optimizer and other data-driven methods
- Existing Limitations: Lack of large-scale public polyhedral optimization performance datasets
- Related Resources: TpuGraphs and other tensor computation graph performance prediction datasets
- Benchmarks: Limitations of standard benchmark suites like PolyBench
- Synthesis Methods: Applications of random program generation in compiler research
- Data Bottleneck Resolution: LOOPerSet effectively addresses the data scarcity problem in polyhedral compiler optimization research
- Quality Assurance: Ensures dataset quality through rigorous diversity analysis and relevance-guided sampling
- Community Resource: Provides the research community with an immediately usable large-scale benchmarking platform
- Hardware Specificity: Performance labels are specific to the Intel Xeon E5-2695 v2 architecture
- Synthetic Program Limitations: While diverse, may not fully cover all real-world program patterns
- Transformation Space: Limited to transformation types supported by the LOOPer system
- Cross-Architecture Extension: Generate performance labels on GPUs and other CPU microarchitectures
- Transfer Learning Research: Leverage the dataset to study zero-shot or few-shot generalization
- New Model Architectures: Explore new cost model architectures such as GNNs and Transformers
- Interpretability Research: Analyze model failure modes to improve generalization capability
- Unprecedented Scale: The scale of 28 million data points is unprecedented in this field
- Rigorous Methodology: Multi-stage generation pipeline and quantitative validation methods are scientifically sound
- High Practical Value: Relevance-guided sampling ensures practical applicability of the dataset
- Strong Openness: CC-BY 4.0 licensing and Hugging Face platform ensure accessibility
- Reproducibility: Detailed data format specifications and tool support
- Architecture Dependence: Performance labels are limited to a single hardware platform
- Limited Validation: Lacks validation in real-world applications
- Generation Bias: Synthetic programs may exhibit systematic biases
- Transformation Coverage: Transformation types are limited by existing tool support
- Academic Contribution: Provides infrastructure for data-driven compiler optimization research
- Practical Value: Significantly lowers entry barriers for new researchers
- Reproducibility: Promotes method comparison and result reproduction
- Long-Term Impact: May drive the field toward a more data-driven direction
- Cost Model Training: Train and evaluate various machine learning cost models
- Architecture Comparison: Benchmark different model architectures and characterization methods
- Transfer Learning: Serve as pre-training dataset supporting new architecture adaptation
- Heuristic Discovery: Discover new compiler heuristics through data mining
- Interpretability Research: Analyze model behavior and failure modes
- Access URL: https://huggingface.co/datasets/Mascinissa/LOOPerSet
- Data Format: JSON Lines (.jsonl)
- License: Creative Commons Attribution 4.0 International (CC-BY 4.0)
- Version Options:
- Full version: 28 million data points
- Lite version: 10 million data points (consistent with LOOPer paper experiments)
The LOOPerSet dataset represents an important milestone in polyhedral compiler optimization research. By providing a large-scale, high-quality public dataset, it is expected to significantly advance the field and lower research barriers. Its rigorous construction methodology and open access approach make it a valuable resource for future related research.