Due to the increasing diversity of high-performance computing architectures, researchers and practitioners are increasingly interested in comparing a code's performance and scalability across different platforms. However, there is a lack of available guidance on how to actually set up and analyze such cross-platform studies. In this paper, we contend that the natural base unit of computing for such studies is a single compute node on each platform and offer guidance in setting up, running, and analyzing node-to-node scaling studies. We propose templates for presenting scaling results of these studies and provide several case studies highlighting the benefits of this approach.
- Paper ID: 2510.12166
- Title: Comparing Cross-Platform Performance via Node-to-Node Scaling Studies
- Authors: Kenneth Weiss, Thomas M. Stitt, Daryl Hawkins, Olga Pearce, Stephanie Brink, Robert N. Rieben
- Classification: cs.DC (Distributed, Parallel, and Cluster Computing)
- Publication Date: October 15, 2025 (Preprint)
- Paper Link: https://arxiv.org/abs/2510.12166
- Growing Architectural Diversity: With the completion of the Exascale Computing Project (ECP) and successful deployment of the first exascale machines (such as Lawrence Livermore National Laboratory's El Capitan system achieving 1.7 exaflops), supercomputer node architectures have exhibited considerable diversity.
- Platform Selection Challenges: In the November 2024 Top500 list, 29.2% of systems have both GPUs and CPUs, accounting for 41.3% of total performance share. Faced with numerous computing platform choices, researchers often find it unclear how to select appropriate platforms for solving problems under practical constraints such as cluster availability and project budgets.
- Performance Portability Requirements: Large codebases must support various existing and upcoming architectures along with new features. Developing, managing, testing, and maintaining platform-specific code versions is infeasible. Many teams address this challenge through single-source performance portability using abstraction libraries such as RAJA, Kokkos, SYCL, and OpenMP.
- Lack of Guidance: The literature offers little guidance on how to compare the performance of heterogeneous systems in practice.
- Non-uniform Benchmarking Units: Traditional single-processor baselines are hard to apply when the platforms being compared use different types of compute hardware.
- Scattered Analysis Tools: Existing performance analysis tools typically focus on a single architecture or a single aspect of performance.
This paper aims to provide systematic guidance for cross-platform performance comparison, particularly in cloud computing environments where users must select from a range of compute node architectures and pay accordingly.
- Proposes Node-to-Node Comparison Paradigm: Establishes individual compute nodes as the relevant computational unit for cross-platform studies
- Systematizes Scaling Study Methodology: Describes in detail four types of node-to-node scaling study approaches
- Standardizes Visualization Templates: Proposes chart templates for analyzing and comparing cross-platform performance
- Provides Practical Workflow Guidance: Offers complete workflows for setting up, running, and analyzing node-to-node scaling studies
- Validates with Real Case Studies: Verifies method effectiveness through multiple case studies using the MARBL code
This paper sets out to establish a standardized methodology for cross-platform performance comparison. Its inputs are runs of the same computational task on different platforms; its outputs are comparable performance analyses and visualization charts.
- Definition: Maintains fixed total problem size while varying the number of computational resources
- Metric: Strong scaling speedup = t_P(1)/t_P(N), where t_P(1) is single-node runtime and t_P(N) is N-node runtime
- Ideal Case: Runtime is inversely proportional to node count, halving each time the node count doubles (a line of slope -1 in log₂-log₂ coordinates)
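The strong scaling metric and the slope criterion can be checked numerically. The following is a minimal sketch, not the paper's tooling: the function names and the runtimes are hypothetical and chosen only for illustration.

```python
import math

def strong_scaling_speedup(runtimes):
    """Map node count N to t_P(1) / t_P(N) for a fixed total problem size."""
    t1 = runtimes[1]
    return {n: t1 / t for n, t in sorted(runtimes.items())}

def loglog_slope(runtimes):
    """Least-squares slope of log2(runtime) against log2(node count)."""
    xs = [math.log2(n) for n in runtimes]
    ys = [math.log2(t) for t in runtimes.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical runtimes in seconds, keyed by node count (not from the paper).
ideal = {1: 800.0, 2: 400.0, 4: 200.0, 8: 100.0}
measured = {1: 800.0, 2: 420.0, 4: 230.0, 8: 140.0}

ideal_speedup = strong_scaling_speedup(ideal)
measured_speedup = strong_scaling_speedup(measured)
```

For the `ideal` series the fitted slope is -1 and the 8-node speedup is 8; the `measured` series yields a shallower slope, which is how sub-ideal strong scaling shows up on a log₂-log₂ chart.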
- Definition: Maintains fixed local problem size per compute node while increasing total problem size with node count
- Metric: Weak scaling efficiency = t_P(1)/t_P(N)
- Ideal Case: Runtime remains constant (slope of 0 in log₂-log₂ coordinates)
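Weak scaling efficiency can be computed the same way. This is a minimal sketch with hypothetical runtimes; the `weak_scaling_slowdown` helper is an illustrative addition that reports the reciprocal of the efficiency, which is the form in which weak scaling results are often quoted.

```python
def weak_scaling_efficiency(runtimes):
    """Map node count N to t_P(1) / t_P(N) for a fixed per-node problem size."""
    t1 = runtimes[1]
    return {n: t1 / t for n, t in sorted(runtimes.items())}

def weak_scaling_slowdown(runtimes):
    """Reciprocal view: how many times slower the N-node run is than 1 node."""
    t1 = runtimes[1]
    return {n: t / t1 for n, t in sorted(runtimes.items())}

# Hypothetical runtimes in seconds, keyed by node count (not from the paper).
runtimes = {1: 100.0, 8: 110.0, 64: 125.0}

efficiency = weak_scaling_efficiency(runtimes)
slowdown = weak_scaling_slowdown(runtimes)
```

An efficiency of 1.0 corresponds to the ideal flat line; in this made-up series the 64-node run has an efficiency of 0.8, equivalent to a 1.25x slowdown.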
- Definition: Displays both strong and weak scaling results in a single chart
- Purpose: Helps determine the "optimal point" for running computations
- Visualization: Solid lines connect strong scaling data points; dashed lines connect weak scaling data points
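One way to prepare data for such a combined chart is to group the same set of runs twice: by total problem size (the strong scaling series, drawn solid) and by problem size per node (the weak scaling series, drawn dashed). The sketch below assumes each run is recorded as a hypothetical `(nodes, total_elements, runtime)` tuple; the values are invented for illustration.

```python
from collections import defaultdict

def group_series(runs):
    """Split runs into strong series (same total size) and weak series
    (same size per node). Each run is (nodes, total_elements, runtime)."""
    strong = defaultdict(list)  # key: total problem size
    weak = defaultdict(list)    # key: elements per node
    for nodes, elements, runtime in runs:
        strong[elements].append((nodes, runtime))
        weak[elements // nodes].append((nodes, runtime))
    strong_sorted = {k: sorted(v) for k, v in strong.items()}
    weak_sorted = {k: sorted(v) for k, v in weak.items()}
    return strong_sorted, weak_sorted

# Hypothetical runs: one 3D uniform refinement multiplies the element count
# by 8, so the matching weak scaling step multiplies the node count by 8.
runs = [
    (1, 4096, 10.0),
    (8, 4096, 2.0),
    (1, 32768, 75.0),
    (8, 32768, 11.0),
]

strong, weak = group_series(runs)
```

Here `strong[32768]` holds the points joined by a solid line, while `weak[4096]` holds the 1-node and 8-node runs that share 4096 elements per node and would be joined by a dashed line.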
- Definition: Compares per-node throughput on fixed resources while varying the number of degrees of freedom in the problem
- Metric: Throughput = (DOFs processed / compute node) × (cycles / second)
- Objective: Identifies resource saturation points and performance bottlenecks
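The throughput metric and a simple saturation check can be sketched as follows. The timings are hypothetical, and the 90% threshold used to call a node "saturated" is an assumption made for this example, not a criterion taken from the paper.

```python
def throughput_per_node(total_dofs, nodes, cycles, seconds):
    """(DOFs processed / compute node) x (cycles / second)."""
    return (total_dofs / nodes) * (cycles / seconds)

def saturation_size(samples, fraction=0.9):
    """Smallest problem size whose throughput reaches `fraction` of the peak.

    samples is a list of (dofs_per_node, throughput) pairs. The default
    fraction of 0.9 is an illustrative choice.
    """
    peak = max(tp for _, tp in samples)
    for dofs, tp in sorted(samples):
        if tp >= fraction * peak:
            return dofs

# Hypothetical single-node timings: DOFs -> seconds for 100 cycles.
timings = {10_000: 0.5, 100_000: 1.0, 1_000_000: 4.0, 10_000_000: 38.0}
samples = [(d, throughput_per_node(d, 1, 100, s)) for d, s in timings.items()]

knee = saturation_size(samples)
```

In this made-up series throughput climbs steeply for small problems and flattens beyond about one million DOFs per node, which is the kind of knee a throughput study is meant to locate.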
- Unified Benchmarking Unit: Uses compute nodes as the basic comparison unit, effectively normalizing differences across different node architectures
- Standardized Visualization: Employs log₂-log₂ coordinates, making ideal scaling performance appear as lines with specific slopes
- Cross-Platform Analysis: A vertical line through the chart compares the relative performance of platforms at the same node count; a horizontal line compares the number of nodes each platform needs to reach the same runtime
- Comprehensive Evaluation Framework: Combines multiple scaling types to provide comprehensive performance profiles
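The vertical and horizontal comparisons can also be done numerically on the underlying data. The sketch below is illustrative only: the function names and runtimes are hypothetical, and interpolating linearly in log₂-log₂ space between measured points is an assumption of this example rather than a procedure described in the paper.

```python
import math

def relative_speed(runtimes_a, runtimes_b, nodes):
    """Vertical comparison: how many times faster A is than B at `nodes`."""
    return runtimes_b[nodes] / runtimes_a[nodes]

def nodes_to_match(runtimes, target_runtime):
    """Horizontal comparison: estimated node count at which the platform
    reaches target_runtime, interpolated in log2-log2 space.
    Returns None when the target lies outside the measured range."""
    points = sorted(runtimes.items())
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        if t1 <= target_runtime <= t0:
            if t0 == t1:
                return float(n0)
            frac = (math.log2(t0) - math.log2(target_runtime)) / (
                math.log2(t0) - math.log2(t1)
            )
            return 2 ** (math.log2(n0) + frac * (math.log2(n1) - math.log2(n0)))
    return None

# Hypothetical strong scaling runtimes in seconds, keyed by node count.
gpu = {1: 100.0, 2: 55.0, 4: 32.0}
cpu = {1: 1600.0, 2: 800.0, 4: 400.0, 8: 200.0, 16: 100.0}

single_node_ratio = relative_speed(gpu, cpu, 1)
cpu_nodes_for_gpu_node = nodes_to_match(cpu, gpu[1])
```

With these invented numbers, the GPU platform is 16 times faster than the CPU platform on one node, and the CPU platform needs 16 nodes to match the single-node GPU runtime.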
- Sierra (ATS-2): 125 petaflop system with 4,320 compute nodes, each equipped with two 20-core POWER9 processors, four NVIDIA Volta V100 16GB GPUs, and 256GB memory
- Astra: 2.3 petaflop system with 2,592 compute nodes, each equipped with two 28-core Cavium ThunderX2 ARM processors and 128GB memory
- CTS-1: Commodity system with 1,302 compute nodes, each with dual 18-core Intel Xeon E5-2695 processors and 128GB memory
- CTS-2: Commodity system with 1,496 compute nodes, each with dual 56-core Intel Xeon Platinum 8480+ processors and 256GB memory
- EAS-3: El Capitan Early Access System with 36 compute nodes, single 64-core AMD Trento processor, four AMD MI-250X 128GB GPUs, 512GB memory
The case studies use MARBL (Multiphysics on Advanced Platforms), a next-generation, performance-portable multiphysics simulation code developed at Lawrence Livermore National Laboratory for simulating high-energy-density physics (HEDP).
- Maestro: Orchestrates scaling study execution
- Caliper and Adiak: Perform code annotation and metadata collection
- Thicket: Reads and filters Caliper data, generates scaling charts
In the Triple-Pt 3D hydrodynamics benchmark:
- Strong Scaling Performance: GPU platform Sierra achieves approximately 15x speedup over CPU platforms on single nodes, but advantages gradually diminish with increasing node count (approximately 8x at 8 nodes, 4x at 32 nodes)
- Weak Scaling Performance: Astra exhibits excellent weak scaling (only 1.49x slowdown at 2,048 nodes), and Sierra shows reasonable weak scaling (1.8x slowdown)
- CPU Platform Limitations: CTS-1 and CTS-2 saturate quickly with relatively flat throughput curves
- GPU Platform Advantages: ATS-2 and EAS-3 achieve significantly higher throughput
- Memory Capacity Impact: EAS-3 nodes can run problems an order of magnitude larger than ATS-2
- Polynomial Order Effects: Across all platforms, throughput increases as polynomial order increases from linear to quadratic to cubic
In the Shaped-Charge 3D problem:
- Memory Pool Sharing Benefits: On GPU platforms, having the host code MARBL and the equation-of-state library LEOS share pre-allocated memory pools is 2x-4x faster than letting each allocate memory independently, and the advantage holds at all scales tested
- Minimal Performance Loss: Containerized MARBL (cMARBL) shows negligible performance loss compared to native MARBL binaries
- Cloud Deployment Feasibility: Provides opportunities to leverage cloud resources for various MARBL workloads
Traditional strong and weak scaling studies typically use single processors as baselines, an approach that encounters difficulties when comparing across heterogeneous computing types. This paper's node-to-node approach provides a more practical foundation for cross-platform comparison.
Existing tools such as PAPI counters, ARM Forge, Intel VTune, and NVIDIA Nsight typically focus on single architectures. In contrast, the Ubiquitous Performance Analysis paradigm and related tools (Caliper, Adiak, Hatchet, Thicket) provide better support for cross-platform performance analysis.
Tools such as Maestro, Merlin, and Ramble help manage simulation collections, but not all have built-in support for running simulations across different clusters and comparing results.
- Validity of Node-Level Comparison: Individual compute nodes serve as a reasonable and practical fundamental unit for cross-platform comparison
- Value of Standardized Visualization: The proposed chart templates effectively display different types of scaling performance
- Success in Practical Application: Multiple real case studies validate the method's effectiveness and practicality
- Intra-Node Communication Costs: Node-to-node scaling studies incorporate some intra-node communication costs into initial single-node measurements
- High Manual Effort: Actual setup of these studies and tracking data/metadata across runs requires substantial manual work
- Limited Data Points: Weak scaling using uniform refinement results in few data points
- Framework Development: Develop frameworks that make setting up such studies easier
- Cloud Computing Exploration: Explore more "what-if" scenarios using diverse compute nodes in cloud clusters
- Energy Analysis: Extend to cross-platform comparison of energy/power consumption
- High Practicality: The proposed method directly addresses practical problems faced by the HPC community
- Systematic Completeness: Provides complete coverage from theoretical framework to practical workflows
- Sufficient Validation: Verifies method effectiveness through multiple real large-scale case studies
- Clear Visualization: Proposed chart templates are intuitive and easy to understand for analysis and comparison
- Tool Support: Provides complete tool chain support
- Limited Theoretical Depth: Primarily methodology and practical guidance with limited deep theoretical analysis
- Generalizability Needs Verification: Mainly based on MARBL code case studies; applicability to other application types requires further verification
- Low Automation Level: Current workflows still require substantial manual configuration and management
- Fills a Gap: Provides systematic solutions to the HPC community's lack of cross-platform performance comparison guidance
- Standardization Potential: Proposed methods and visualization templates have potential to become community standards
- High Practical Value: Important for actual decisions such as system procurement and cloud computing resource selection
- System Procurement Evaluation: Helps decision-makers compare performance of different architecture systems
- Cloud Computing Resource Selection: Guides users in selecting the most suitable compute instance types in cloud environments
- Code Porting Evaluation: Helps developers assess code porting effectiveness across different platforms
- Performance Optimization Guidance: Provides benchmarks and targets for performance optimization efforts
This paper cites 52 references covering HPC scaling studies, performance analysis tools, workflow management, and related applications, which give the work a solid theoretical and technical foundation.
This paper provides much-needed cross-platform performance comparison guidance for the HPC community with strong practical value. While relatively limited in theoretical innovation, its systematic methodology and comprehensive experimental validation make it an important contribution to the field.