Comparing Cross-Platform Performance via Node-to-Node Scaling Studies

Weiss, Stitt, Hawkins et al.
Due to the increasing diversity of high-performance computing architectures, researchers and practitioners are increasingly interested in comparing a code's performance and scalability across different platforms. However, there is a lack of available guidance on how to actually set up and analyze such cross-platform studies. In this paper, we contend that the natural base unit of computing for such studies is a single compute node on each platform and offer guidance in setting up, running, and analyzing node-to-node scaling studies. We propose templates for presenting scaling results of these studies and provide several case studies highlighting the benefits of this approach.

Basic Information

  • Paper ID: 2510.12166
  • Title: Comparing Cross-Platform Performance via Node-to-Node Scaling Studies
  • Authors: Kenneth Weiss, Thomas M. Stitt, Daryl Hawkins, Olga Pearce, Stephanie Brink, Robert N. Rieben
  • Classification: cs.DC (Distributed, Parallel, and Cluster Computing)
  • Publication Date: October 15, 2025 (Preprint)
  • Paper Link: https://arxiv.org/abs/2510.12166

Abstract

As high-performance computing architectures grow more diverse, researchers and practitioners increasingly want to compare a code's performance and scalability across different platforms. However, there is little guidance available on how to set up and analyze such cross-platform studies. This paper argues that the natural base unit of computing for such studies is a single compute node on each platform, and provides guidance for setting up, running, and analyzing node-to-node scaling studies. The authors propose templates for presenting scaling results from these studies and provide several case studies that highlight the advantages of this approach.

Research Background and Motivation

Problem Context

  1. Growing Architectural Diversity: With the completion of the Exascale Computing Project (ECP) and the successful deployment of the first exascale machines (such as Lawrence Livermore National Laboratory's El Capitan system, which achieved 1.7 exaflops), supercomputer node architectures have become considerably more diverse.
  2. Platform Selection Challenges: In the November 2024 Top500 list, 29.2% of systems have both GPUs and CPUs, accounting for 41.3% of total performance share. Faced with numerous computing platform choices, researchers often find it unclear how to select appropriate platforms for solving problems under practical constraints such as cluster availability and project budgets.
  3. Performance Portability Requirements: Large codebases must support various existing and upcoming architectures along with new features. Developing, managing, testing, and maintaining platform-specific code versions is infeasible. Many teams address this challenge through single-source performance portability using abstraction libraries such as RAJA, Kokkos, SYCL, and OpenMP.

Limitations of Existing Approaches

  1. Lack of Guidance: The literature offers little guidance on how to compare performance across heterogeneous systems in practice
  2. Non-uniform Benchmarking Units: Traditional scaling studies use a single processor as the baseline, which is hard to apply when the platforms being compared have different processor types
  3. Scattered Analysis Tools: Existing performance analysis tools typically focus on a single architecture or a single aspect of performance

Research Motivation

This paper aims to provide systematic guidance for cross-platform performance comparison, particularly in cloud computing environments where users must select from a range of compute node architectures and pay accordingly.

Core Contributions

  1. Proposes Node-to-Node Comparison Paradigm: Establishes individual compute nodes as the relevant computational unit for cross-platform studies
  2. Systematizes Scaling Study Methodology: Describes in detail four types of node-to-node scaling study approaches
  3. Standardizes Visualization Templates: Proposes chart templates for analyzing and comparing cross-platform performance
  4. Provides Practical Workflow Guidance: Offers complete workflows for setting up, running, and analyzing node-to-node scaling studies
  5. Validates with Real Case Studies: Verifies method effectiveness through multiple case studies using the MARBL code

Methodology Details

Task Definition

The paper sets out a standardized methodology for cross-platform performance comparison. The input is the same computational task run on each platform at a range of node counts or problem sizes; the output is a set of comparable performance measurements and the charts used to analyze them.

Types of Node-to-Node Scaling Studies

1. Strong Scaling Studies

  • Definition: Maintains fixed total problem size while varying the number of computational resources
  • Metric: Strong scaling speedup = t_P(1)/t_P(N), where t_P(1) is single-node runtime and t_P(N) is N-node runtime
  • Ideal Case: Runtime decreases linearly with node count (slope of -1 in log₂-log₂ coordinates)
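
The speedup metric and the slope criterion above can be checked numerically. The following is a minimal sketch, assuming runtimes are stored in a dictionary keyed by node count; the function names and the timing values are hypothetical and are not taken from the paper.

```python
import math

def strong_scaling_speedup(times_by_nodes):
    """Map node count N to t_P(1) / t_P(N) for one platform P."""
    t1 = times_by_nodes[1]
    return {n: t1 / t for n, t in sorted(times_by_nodes.items())}

def log2_slopes(times_by_nodes):
    """Slope between consecutive points in log2(nodes) vs. log2(time) space."""
    pts = sorted(times_by_nodes.items())
    return [
        (math.log2(t_b) - math.log2(t_a)) / (math.log2(n_b) - math.log2(n_a))
        for (n_a, t_a), (n_b, t_b) in zip(pts, pts[1:])
    ]

# Hypothetical runtimes in seconds (illustrative only, not measured data).
times = {1: 64.0, 2: 32.0, 4: 16.0, 8: 10.0}
speedup = strong_scaling_speedup(times)
slopes = log2_slopes(times)
```

With these made-up numbers the first two segments have a slope of -1 (ideal strong scaling), while the last segment is shallower, which is how a loss of scaling efficiency appears on the chart.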

2. Weak Scaling Studies

  • Definition: Maintains fixed local problem size per compute node while increasing total problem size with node count
  • Metric: Weak scaling efficiency = t_P(1)/t_P(N)
  • Ideal Case: Runtime remains constant (slope of 0 in log₂-log₂ coordinates)
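
A weak scaling series can be tabulated in the same way. This is a small sketch under stated assumptions: the per-node problem size, the runtimes, and the helper name `weak_scaling_table` are all hypothetical.

```python
def weak_scaling_table(times_by_nodes, local_size):
    """Return rows of (nodes, total_size, efficiency, slowdown) for one platform.

    efficiency = t_P(1) / t_P(N); slowdown is its reciprocal, t_P(N) / t_P(1).
    The total problem size grows with node count so the per-node size stays fixed.
    """
    t1 = times_by_nodes[1]
    rows = []
    for nodes, t in sorted(times_by_nodes.items()):
        rows.append((nodes, local_size * nodes, t1 / t, t / t1))
    return rows

# Hypothetical runtimes in seconds (illustrative only, not measured data).
times = {1: 100.0, 8: 110.0, 64: 125.0}
rows = weak_scaling_table(times, local_size=1_000_000)
```

Because slowdown is the reciprocal of efficiency, a result reported as a "1.25x slowdown" is the same statement as a weak scaling efficiency of 0.8.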

3. Strong-Weak Scaling Studies

  • Definition: Displays both strong and weak scaling results in a single chart
  • Purpose: Helps determine the "optimal point" for running computations
  • Visualization: Solid lines connect strong scaling data points; dashed lines connect weak scaling data points
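
One way to prepare data for such a chart is to group the same set of runs twice: once by total problem size (the solid strong scaling lines) and once by problem size per node (the dashed weak scaling lines). The sketch below assumes each run is recorded as a `(nodes, total_elements, seconds)` tuple and that the total size divides evenly by the node count; the record layout and the values are hypothetical.

```python
from collections import defaultdict

def group_series(runs):
    """Split runs into strong and weak scaling series.

    runs: iterable of (nodes, total_elements, seconds).
    Returns (strong, weak), each a dict mapping a series key to a
    list of (nodes, seconds) points sorted by node count.
    """
    strong = defaultdict(list)  # key: total problem size  -> solid line
    weak = defaultdict(list)    # key: elements per node    -> dashed line
    for nodes, total, seconds in runs:
        strong[total].append((nodes, seconds))
        weak[total // nodes].append((nodes, seconds))
    for series in (strong, weak):
        for points in series.values():
            points.sort()
    return dict(strong), dict(weak)

# Hypothetical runs (illustrative only, not measured data).
runs = [
    (1, 4096, 80.0), (2, 4096, 42.0), (4, 4096, 23.0),
    (2, 8192, 82.0), (4, 8192, 44.0),
    (4, 16384, 85.0),
]
strong, weak = group_series(runs)
```

Each run appears on exactly one solid line and one dashed line, which is what lets a single chart show both kinds of scaling.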

4. Throughput Scaling Studies

  • Definition: Compares per-node throughput on fixed resources while varying the number of degrees of freedom in the problem
  • Metric: Throughput = (DOFs processed / compute node) × (cycles / second)
  • Objective: Identifies resource saturation points and performance bottlenecks
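
The throughput metric and a simple saturation check can be sketched as follows. The 90% threshold used to call a platform "saturated" is an assumption made for illustration, not a rule from the paper, and the run data are hypothetical.

```python
def node_throughput(dofs, nodes, cycles, seconds):
    """Per-node throughput: (DOFs processed / compute node) * (cycles / second)."""
    return (dofs / nodes) * (cycles / seconds)

def saturation_point(samples, fraction=0.9):
    """Smallest problem size whose throughput reaches `fraction` of the peak.

    samples: list of (dofs, throughput) pairs for one platform.
    The 0.9 default is an illustrative choice, not a value from the paper.
    """
    peak = max(tp for _, tp in samples)
    for dofs, tp in sorted(samples):
        if tp >= fraction * peak:
            return dofs
    return None

# Hypothetical single-node runs: (dofs, nodes, cycles, seconds).
runs = [
    (10_000, 1, 100, 2.0),
    (100_000, 1, 100, 5.0),
    (1_000_000, 1, 100, 25.0),
    (10_000_000, 1, 100, 240.0),
]
samples = [(d, node_throughput(d, n, c, s)) for d, n, c, s in runs]
saturated_at = saturation_point(samples)
```

In this made-up series, throughput climbs steeply for small problems and flattens once the node has enough work, so the saturation point falls at the third problem size.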

Technical Innovations

  1. Unified Benchmarking Unit: Uses compute nodes as the basic comparison unit, effectively normalizing differences across different node architectures
  2. Standardized Visualization: Employs log₂-log₂ coordinates, making ideal scaling performance appear as lines with specific slopes
  3. Cross-Platform Analysis: A vertical line through the chart compares the platforms' runtimes at the same node count; a horizontal line shows how many nodes each platform needs to reach the same runtime
  4. Comprehensive Evaluation Framework: Combines multiple scaling types to provide comprehensive performance profiles
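
The vertical-line and horizontal-line readings of a scaling chart can also be computed directly from the data. The sketch below is one possible approach: it interpolates linearly in log₂-log₂ space between measured points, which is an assumption of this example rather than a procedure described in the paper. All names and timings are hypothetical.

```python
import math

def relative_performance(times_a, times_b, nodes):
    """Vertical-line reading: how many times faster A is than B at one node count."""
    return times_b[nodes] / times_a[nodes]

def nodes_to_match(times, target_seconds):
    """Horizontal-line reading: node count at which a runtime curve reaches
    target_seconds, interpolated linearly in log2-log2 space.
    Returns None if the measured range never reaches the target."""
    pts = sorted(times.items())
    if pts[0][1] <= target_seconds:
        return float(pts[0][0])
    for (n_a, t_a), (n_b, t_b) in zip(pts, pts[1:]):
        if t_a > target_seconds >= t_b:
            frac = (math.log2(t_a) - math.log2(target_seconds)) / (
                math.log2(t_a) - math.log2(t_b)
            )
            return 2 ** (math.log2(n_a) + frac * (math.log2(n_b) - math.log2(n_a)))
    return None

# Hypothetical runtimes in seconds (illustrative only, not measured data).
gpu = {1: 10.0, 2: 5.0, 4: 2.5}
cpu = {1: 160.0, 2: 80.0, 4: 40.0, 8: 20.0, 16: 10.0}

advantage_at_1 = relative_performance(gpu, cpu, 1)
cpu_nodes_for_gpu_node = nodes_to_match(cpu, gpu[1])
```

With these numbers, one GPU node is 16 times faster than one CPU node, and the CPU platform needs 16 nodes to match the runtime of a single GPU node. The two readings agree here only because the made-up CPU curve scales ideally; on real data they generally differ.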

Experimental Setup

Test Platforms

  1. Sierra (ATS-2): 125 petaflop system with 4,320 compute nodes, each equipped with two 20-core POWER9 processors, four NVIDIA Volta V100 16GB GPUs, and 256GB memory
  2. Astra: 2.3 petaflop system with 2,592 compute nodes, each equipped with two 28-core Cavium ThunderX2 ARM processors and 128GB memory
  3. CTS-1: Commodity Technology System with 1,302 compute nodes, each equipped with dual 18-core Intel Xeon E5-2695 processors and 128GB memory
  4. CTS-2: Commodity Technology System with 1,496 compute nodes, each equipped with dual 56-core Intel Xeon Platinum 8480+ processors and 256GB memory
  5. EAS-3: El Capitan Early Access System with 36 compute nodes, single 64-core AMD Trento processor, four AMD MI-250X 128GB GPUs, 512GB memory

Test Code

The case studies use MARBL (Multiphysics on Advanced Platforms), a next-generation, performance-portable multiphysics simulation code developed at Lawrence Livermore National Laboratory for simulating high-energy-density physics (HEDP).

Workflow Tools

  • Maestro: Orchestrates scaling study execution
  • Caliper and Adiak: Perform code annotation and metadata collection
  • Thicket: Reads and filters Caliper data, generates scaling charts

Experimental Results

Case Study 1: FY20 Project Milestone

In the Triple-Pt 3D hydrodynamics benchmark:

  • Strong Scaling Performance: GPU platform Sierra achieves approximately 15x speedup over CPU platforms on single nodes, but advantages gradually diminish with increasing node count (approximately 8x at 8 nodes, 4x at 32 nodes)
  • Weak Scaling Performance: Astra exhibits excellent weak scaling (only 1.49x slowdown at 2,048 nodes), and Sierra shows reasonable weak scaling (1.8x slowdown)
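
The shrinking GPU advantage in the strong scaling result says something about how the two platforms scale relative to each other. If A(N) = t_CPU(N)/t_GPU(N) is the GPU advantage at N nodes, then A(N)/A(1) equals the ratio of the GPU's strong scaling speedup to the CPU's. The sketch below applies this identity to the approximate figures quoted above, treating the CPU platforms as a single reference curve, which is a simplification made for this example.

```python
def relative_speedup(advantage_by_nodes):
    """Return A(N) / A(1) for each node count N.

    Since A(N) = t_CPU(N) / t_GPU(N), this ratio equals
    S_GPU(N) / S_CPU(N), where S is the strong scaling speedup t(1) / t(N).
    """
    base = advantage_by_nodes[1]
    return {n: a / base for n, a in sorted(advantage_by_nodes.items())}

# Approximate GPU-over-CPU advantages from the strong scaling bullet above.
advantage = {1: 15.0, 8: 8.0, 32: 4.0}
ratio = relative_speedup(advantage)
```

Under this reading, the GPU platform's speedup is roughly half of the CPU platform's at 8 nodes (8/15) and roughly a quarter at 32 nodes (4/15). That is consistent with a fixed-size problem leaving each GPU node with too little work as the node count grows.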

Case Study 2: Node-to-Node Throughput Study for High-Order Runs

  • CPU Platform Limitations: CTS-1 and CTS-2 saturate quickly with relatively flat throughput curves
  • GPU Platform Advantages: ATS-2 and EAS-3 achieve significantly higher throughput
  • Memory Capacity Impact: EAS-3 nodes can run problems an order of magnitude larger than ATS-2
  • Polynomial Order Effects: Across all platforms, throughput increases as polynomial order increases from linear to quadratic to cubic

Case Study 3: Cross-Platform Comparison of Different Library Characteristics

In the Shaped-Charge 3D problem:

  • Memory Pool Sharing Benefits: On GPU platforms, having the host code (MARBL) and the equation-of-state library (LEOS) share a pre-allocated memory pool is 2x-4x faster at all scales than letting each allocate memory independently

Case Study 4: Containerized MARBL Performance Comparison

  • Minimal Performance Loss: Containerized MARBL (cMARBL) shows negligible performance loss compared to native MARBL binaries
  • Cloud Deployment Feasibility: Provides opportunities to leverage cloud resources for various MARBL workloads

Related Work

Traditional Scaling Studies

Traditional strong and weak scaling studies typically use single processors as baselines, an approach that encounters difficulties when comparing across heterogeneous computing types. This paper's node-to-node approach provides a more practical foundation for cross-platform comparison.

Performance Analysis Tools

Existing tools such as PAPI counters, ARM Forge, Intel VTune, and NVIDIA Nsight typically focus on single architectures. In contrast, the Ubiquitous Performance Analysis paradigm and related tools (Caliper, Adiak, Hatchet, Thicket) provide better support for cross-platform performance analysis.

Workflow Management

Tools such as Maestro, Merlin, and Ramble help manage simulation collections, but not all have built-in support for running simulations across different clusters and comparing results.

Conclusions and Discussion

Main Conclusions

  1. Validity of Node-Level Comparison: Individual compute nodes serve as a reasonable and practical fundamental unit for cross-platform comparison
  2. Value of Standardized Visualization: The proposed chart templates effectively display different types of scaling performance
  3. Success in Practical Application: Multiple real case studies validate the method's effectiveness and practicality

Limitations

  1. Intra-Node Communication Costs: Node-to-node scaling studies incorporate some intra-node communication costs into initial single-node measurements
  2. High Manual Effort: Actual setup of these studies and tracking data/metadata across runs requires substantial manual work
  3. Limited Data Points: Weak scaling using uniform refinement results in few data points

Future Directions

  1. Framework Development: Develop frameworks that make setting up such studies easier
  2. Cloud Computing Exploration: Explore more "what-if" scenarios using diverse compute nodes in cloud clusters
  3. Energy Analysis: Extend to cross-platform comparison of energy/power consumption

In-Depth Evaluation

Strengths

  1. High Practicality: The proposed method directly addresses practical problems faced by the HPC community
  2. Systematic Completeness: Provides complete coverage from theoretical framework to practical workflows
  3. Sufficient Validation: Verifies method effectiveness through multiple real large-scale case studies
  4. Clear Visualization: Proposed chart templates are intuitive and easy to understand for analysis and comparison
  5. Tool Support: Provides complete tool chain support

Weaknesses

  1. Limited Theoretical Depth: Primarily methodology and practical guidance with limited deep theoretical analysis
  2. Generalizability Needs Verification: Mainly based on MARBL code case studies; applicability to other application types requires further verification
  3. Low Automation Level: Current workflows still require substantial manual configuration and management

Impact

  1. Fills a Gap: Provides systematic solutions to the HPC community's lack of cross-platform performance comparison guidance
  2. Standardization Potential: Proposed methods and visualization templates have potential to become community standards
  3. High Practical Value: Important for actual decisions such as system procurement and cloud computing resource selection

Applicable Scenarios

  1. System Procurement Evaluation: Helps decision-makers compare performance of different architecture systems
  2. Cloud Computing Resource Selection: Guides users in selecting the most suitable compute instance types in cloud environments
  3. Code Porting Evaluation: Helps developers assess code porting effectiveness across different platforms
  4. Performance Optimization Guidance: Provides benchmarks and targets for performance optimization efforts

References

This paper cites 52 relevant references covering multiple aspects including HPC scaling studies, performance analysis tools, workflow management, and related applications, providing solid theoretical foundation and technical support for the research.


This paper provides much-needed cross-platform performance comparison guidance for the HPC community with strong practical value. While relatively limited in theoretical innovation, its systematic methodology and comprehensive experimental validation make it an important contribution to the field.