
Adapting Atmospheric Chemistry Components for Efficient GPU Accelerators

Ruiz, Dawson, Acosta et al.
Atmospheric models demand a lot of computational power, and solving the chemical processes is one of their most computationally intensive components. This work shows how to improve the computational performance of the Multiscale Online Nonhydrostatic AtmospheRe CHemistry model (MONARCH), a chemical weather prediction system developed by the Barcelona Supercomputing Center. The model implements the new flexible external package Chemistry Across Multiple Phases (CAMP) for solving gas- and aerosol-phase chemical processes, which allows multiple chemical processes to be solved simultaneously as a single system. We introduce a novel strategy to simultaneously solve multiple instances of a chemical mechanism, represented in the model as grid cells, obtaining a speedup of up to 9x using thousands of cells. In addition, we present a GPU strategy for the most time-consuming function of CAMP. The GPU version achieves up to a 1.2x speedup compared to the CPU. We also optimize memory access on the GPU, increasing its speedup to up to 1.7x.

Basic Information

  • Paper ID: 2501.00011
  • Title: Adapting Atmospheric Chemistry Components for Efficient GPU Accelerators
  • Authors: Christian Guzman Ruiz, Matthew Dawson, Mario C. Acosta, Oriol Jorba, Eduardo Cesar Galobardes, Carlos Pérez García-Pando, Kim Serradell
  • Classification: physics.comp-ph cs.AR
  • Publication Date: December 13, 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.00011

Abstract

Atmospheric models require substantial computational resources, and solving the chemical processes is one of their most computationally intensive components. This study demonstrates how to improve the computational performance of the MONARCH (Multiscale Online Nonhydrostatic AtmospheRe CHemistry) model developed at the Barcelona Supercomputing Center. The model implements a new flexible external package, "Chemistry Across Multiple Phases" (CAMP), to solve gas-phase and aerosol-phase chemical processes, allowing multiple chemical processes to be solved simultaneously as a single system. The research proposes a novel strategy to simultaneously solve multiple instances of a chemical mechanism (represented as grid cells in the model), achieving up to 9× speedup using thousands of cells. Additionally, a GPU strategy is proposed for CAMP's most time-consuming function; the GPU version achieves up to 1.2× speedup compared to the CPU implementation, and optimizing GPU memory access raises this to up to 1.7×.

Research Background and Motivation

Problem Definition

  1. Computational Challenges: Atmospheric models are mathematical representations of atmospheric dynamics, physics, chemistry, and radiation processes; their complexity results in enormous computational costs
  2. Chemical Process Bottleneck: Chemical process solving can consume 80% of model execution time, representing a performance bottleneck
  3. Parallelization Requirements: Existing models parallelize through domain decomposition, but individual chemical solvers remain serial

Significance

  • Atmospheric chemistry modeling is crucial for climate prediction and air quality forecasting applications
  • Improved computational efficiency enables higher-resolution and more complex atmospheric chemistry simulations
  • GPU acceleration can significantly reduce computation time and costs

Limitations of Existing Approaches

  1. CPU-based Solvers: Parallelization through domain decomposition requires thousands of grid cells for significant acceleration
  2. GPU-specific Methods: While offering better performance (e.g., 59× speedup), they are difficult to adapt to atmospheric models and typically target only specific types of chemical equations
  3. Data Transfer Overhead: CPU-GPU data transfer becomes a performance bottleneck in GPU implementations

Core Contributions

  1. Multi-cells Strategy: Proposes a novel method for simultaneously solving multiple grid cells, avoiding repeated ODE solver initialization, achieving up to 9× speedup
  2. GPU Chemical Solving: Develops GPU implementation of the Derivative function in the CAMP framework, achieving 1.2× speedup
  3. Memory Access Optimization: Reorganizes reaction data structures to improve GPU memory access patterns, increasing speedup to 1.7×
  4. Hybrid Parallel Strategy: Combines CPU-based solvers with GPU-specific techniques

Methodology Details

Task Definition

  • Input: Chemical species concentrations, temperature, pressure, and other state variables for multiple atmospheric grid cells
  • Output: Predicted future chemical species concentrations
  • Constraints: Maintain conservation laws for chemical equations and ensure numerical stability

MONARCH-CAMP Architecture

System Components

  1. MONARCH: Multiscale Online Nonhydrostatic AtmospheRe CHemistry model
  2. CAMP: Chemistry Across Multiple Phases framework, handling gas-phase and aerosol-phase reactions
  3. CVODE: External ODE solver using sparse Jacobian matrices

Chemical Reaction Modeling

General form of chemical reactions:

c₁y₁ + ⋯ + cₘyₘ ↔ cₘ₊₁yₘ₊₁ + ⋯ + cₙyₙ

Rate of change for participating species yᵢ relative to reaction j:

(dyᵢ/dt)ⱼ = {
  -cᵢrⱼ(y,T,P,…)  for i ≤ m
   cᵢrⱼ(y,T,P,…)  for m < i ≤ n
}

Overall rate of change:

fᵢ ≡ dyᵢ/dt = Σⱼ(dyᵢ/dt)ⱼ
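
The equations above can be illustrated with a minimal sketch. It is not CAMP's implementation: the `Reaction` class, the mass-action rate law, the two-step chain A → B → C, and the rate constants are all assumptions chosen for illustration. Each reaction subtracts cᵢrⱼ from its reactants and adds cᵢrⱼ to its products, and the contributions are summed into fᵢ.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    reactants: dict       # species index -> stoichiometric coefficient c_i
    products: dict        # species index -> stoichiometric coefficient c_i
    rate_constant: float  # stands in for the T- and P-dependent part of r_j


def reaction_rate(rxn, y):
    """Assumed mass-action rate: r_j = k * prod(y_i ** c_i) over the reactants."""
    rate = rxn.rate_constant
    for i, c in rxn.reactants.items():
        rate *= y[i] ** c
    return rate


def derivative(reactions, y):
    """f_i = sum over reactions j of (dy_i/dt)_j."""
    f = [0.0] * len(y)
    for rxn in reactions:
        r = reaction_rate(rxn, y)
        for i, c in rxn.reactants.items():
            f[i] -= c * r  # reactants (i <= m) are consumed
        for i, c in rxn.products.items():
            f[i] += c * r  # products (m < i <= n) are produced
    return f


# Hypothetical chain A -> B -> C with made-up rate constants.
reactions = [
    Reaction(reactants={0: 1}, products={1: 1}, rate_constant=2.0),
    Reaction(reactants={1: 1}, products={2: 1}, rate_constant=0.5),
]
f = derivative(reactions, [1.0, 0.5, 0.0])
```

With these made-up values the entries of `f` sum to zero, which reflects the mass conservation listed under Constraints.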

Multi-cells Implementation

Core Concept

  • Combines data from multiple grid cells into a single data structure for computation
  • Avoids repeated initialization overhead from calling the solver for each cell individually
  • Moves the cell loop inside CAMP's internal solving functions

Updated Equations

fᵢₖ ≡ dyᵢₖ/dt = Σⱼ(dyᵢₖ/dt)ⱼ

where yᵢₖ represents species yᵢ from cell k
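
A minimal sketch of the multi-cells idea, under stated assumptions: the cell-major state layout, the function names, the rate constants, and the initial values are hypothetical, and no ODE solver is included. The point is only that the cell loop sits inside the derivative function, so a solver would be set up and called once on the combined state rather than once per cell.

```python
N_SPECIES = 3


def derivative_one_cell(y, k1=2.0, k2=0.5):
    """Derivative for one cell of a hypothetical chain A -> B -> C."""
    a, b, _c = y
    return [-k1 * a, k1 * a - k2 * b, k2 * b]


def derivative_multi_cells(state, n_cells):
    """Derivative for all cells at once; entry k * N_SPECIES + i holds y_ik."""
    f = [0.0] * len(state)
    for k in range(n_cells):  # the cell loop now lives inside the derivative
        lo = k * N_SPECIES
        f[lo:lo + N_SPECIES] = derivative_one_cell(state[lo:lo + N_SPECIES])
    return f


# Two cells with slightly different (made-up) initial concentrations.
cells = [[1.0, 0.0, 0.0], [1.1, 0.0, 0.0]]

# Combine the per-cell data into a single state vector.
state = [y for cell in cells for y in cell]

f_multi = derivative_multi_cells(state, len(cells))
f_single = [v for cell in cells for v in derivative_one_cell(cell)]
```

Both paths produce the same derivative values; the difference is how many times the surrounding solver has to be initialized and invoked.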

GPU Implementation Strategy

Parallelization Scheme

  • Parallel Units: Each reaction's data package is processed by its own GPU thread
  • Thread Configuration: GPU thread count equals reaction count, with a maximum of 1024 threads per block
  • Synchronization Mechanism: Uses CUDA's atomicAdd operation so that threads updating the same value do not overwrite each other
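
The scheme above can be sketched in plain Python as a sequential emulation, not CUDA. The helper names are hypothetical, and treating the thread count as reactions times cells is an assumption. `launch_config` derives a block count from the 1024-thread limit, and `accumulate` marks where an atomic add would be needed because several reactions contribute to the same species.

```python
MAX_THREADS_PER_BLOCK = 1024


def launch_config(n_threads_needed):
    """Return (blocks, threads_per_block) for one thread per reaction (n > 0)."""
    threads = min(n_threads_needed, MAX_THREADS_PER_BLOCK)
    blocks = (n_threads_needed + threads - 1) // threads  # ceiling division
    return blocks, threads


def accumulate(contributions, n_species):
    """Sum per-thread contributions into the derivative array.

    Each (species, value) pair stands for the write of one GPU thread.
    On the GPU, different threads may target the same species, so the
    update would be atomicAdd(&f[species], value) instead of a plain +=.
    """
    f = [0.0] * n_species
    for species, value in contributions:
        f[species] += value
    return f


# Assumed example: 2 reactions in each of 10,000 cells, one thread each.
blocks, threads = launch_config(2 * 10_000)

# Two reactions both touch species 1, which is where conflicts would arise.
f = accumulate([(0, -2.0), (1, 2.0), (1, -0.25), (2, 0.25)], 3)
```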

Memory Management

  1. Reaction Data: Stored in global memory
  2. State Arrays:
    • Small data volumes: Passed through constant memory
    • Large data volumes: Transferred directly to global memory

Data Structure Optimization

  • Problem: Original structure causes GPU threads to access non-contiguous memory
  • Solution: Reorganize reaction data structure to enable sequential data access by GPU threads
  • Effect: Improves memory access patterns, giving a 1.3× performance improvement over the basic GPU version
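
One common way to obtain sequential access is to switch from a reaction-major to a parameter-major layout. The summary does not say which layout CAMP uses, so the sketch below is an assumption that only illustrates the access pattern; the function names are hypothetical. It lists the array offsets touched when consecutive threads (one per reaction) all read their first parameter.

```python
def reaction_major_index(r, p, n_reactions, n_params):
    """All parameters of reaction r are stored together."""
    return r * n_params + p


def parameter_major_index(r, p, n_reactions, n_params):
    """Parameter p of every reaction is stored together."""
    return p * n_reactions + r


n_reactions, n_params = 4, 3

# Offsets read by threads 0..3 when each fetches parameter 0 of its reaction.
before = [reaction_major_index(r, 0, n_reactions, n_params)
          for r in range(n_reactions)]
after = [parameter_major_index(r, 0, n_reactions, n_params)
         for r in range(n_reactions)]
```

In the reaction-major layout neighbouring threads are `n_params` elements apart, while in the parameter-major layout they read adjacent elements, which is the pattern GPU memory hardware can coalesce.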

Experimental Setup

Hardware Environment

  • Cluster: CTE-POWER (Barcelona Supercomputing Center)
  • CPU: IBM Power9 8335-GTH @ 2.4GHz
  • GPU: NVIDIA V100 (Volta) 16GB HBM2
  • Compilers: GCC 6.4.0, NVCC 9.1

Test Configuration

  • Chemical Mechanism: Basic mechanism with 3 species (A → B + C)
  • Reactions: 2 Arrhenius reactions
  • Initial Conditions:
    • Species A: 1.0
    • Species B, C: 0.0
    • 0.1 concentration offset per cell
  • Grid Cell Count: From small scale to 10,000 cells

Evaluation Metrics

  • Speedup: GPU performance improvement relative to CPU
  • Iteration Count: Number of iterations by ODE solver
  • Execution Time: Total computation time and component-wise time

Experimental Results

Multi-cells Performance

  • Speedup: Achieves approximately 8× speedup across various cell counts, reaching a maximum of 9×
  • Iteration Optimization:
    • Single-cell method: Iteration count grows linearly with cell count (6×10⁶ iterations for 10,000 cells)
    • Multi-cells method: Iteration count independent of cell count (approximately 700 iterations)
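
As a quick check of the iteration figures above, the short calculation below uses only the numbers quoted in this section; the variable names are illustrative.

```python
n_cells = 10_000
single_cell_total_iterations = 6_000_000  # summed over all cells
multi_cells_iterations = 700              # approximate, one combined system

# Average solver iterations spent on each cell in the single-cell method.
iterations_per_cell = single_cell_total_iterations / n_cells

# Ratio of total iteration counts between the two methods.
iteration_ratio = single_cell_total_iterations / multi_cells_iterations
```

The single-cell method therefore spends about 600 iterations per cell. The iteration ratio (roughly 8,600) is far larger than the measured speedup of 8× to 9×, which is expected because each multi-cells iteration works on the species of every cell at once and so costs much more than a single-cell iteration.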

GPU Implementation Results

  • Basic GPU Version: Achieves 1.2× speedup for 10,000 cells
  • Optimized Version: Memory access optimization yields a further 1.3× improvement, for a total speedup of 1.7×
  • Scale Dependency: The GPU version is slower than the CPU version for fewer than 10,000 cells

Data Transfer Analysis

  • Bottleneck Identification: CPU-GPU data transfer accounts for 90% of GPU execution time
  • Computational Performance: GPU computation alone, excluding data transfer, is 3.5× faster than a 40-process MPI run
  • Overall Performance: Including data transfer, the GPU version is 3× slower than the MPI run
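
The three figures above are mutually consistent, as the following back-of-the-envelope calculation suggests. It assumes that the 90% refers to the share of total GPU execution time and that the remaining 10% is pure computation; times are normalized to the MPI run.

```python
t_mpi = 1.0               # 40-process MPI time, normalized
compute_speedup = 3.5     # GPU computation alone vs MPI
transfer_fraction = 0.90  # share of GPU time spent on CPU-GPU transfers

# GPU time spent on computation only.
t_gpu_compute = t_mpi / compute_speedup

# If computation is the remaining 10%, the total GPU time follows.
t_gpu_total = t_gpu_compute / (1.0 - transfer_fraction)

# How much slower the GPU run is than MPI once transfers are included.
slowdown = t_gpu_total / t_mpi
```

The result is a slowdown of about 2.9, matching the reported "3× slower" figure.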

Related Work

GPU Chemical Kinetics Research

  1. EMAC Model: CUDA version of KPP library achieves 20.4× speedup
  2. Specialized Solvers: RKCK and RKC methods achieve 59× speedup
  3. Parallelization Strategies:
    • Domain decomposition: Each GPU thread solves independent small systems
    • Equation parallelization: Direct parallelization of chemical equation solving

Innovations in This Work

  • Hybrid approach combining CPU-based solvers with GPU-specific techniques
  • Multi-cells strategy reduces repeated solver initialization
  • Customized optimization for CAMP framework

Conclusions and Discussion

Main Conclusions

  1. Multi-cells Strategy is Effective: Achieves significant acceleration by reducing repeated solver calls
  2. GPU Parallelization is Feasible: GPU implementation outperforms CPU at sufficient scale
  3. Data Transfer is Critical Bottleneck: Requires further optimization to fully leverage GPU potential

Limitations

  1. Scale Dependency: GPU advantages only appear for large-scale problems (around 10,000 cells or more)
  2. Data Transfer Overhead: Limits actual GPU performance gains
  3. Partial GPU Implementation: Only Derivative function optimized; other components remain on CPU

Future Directions

  1. Extended GPU Implementation: Port Jacobian and ODE solver to GPU
  2. Asynchronous Communication: Implement CPU-GPU work overlap to hide data transfer latency
  3. Load Balancing: Explore CPU-GPU cooperative computing strategies
  4. MONARCH Integration: Evaluate GPU chemical solver in complete atmospheric model

In-Depth Evaluation

Strengths

  1. High Practical Value: Performance optimization targeting real atmospheric chemistry models
  2. Methodological Innovation: Multi-cells strategy is simple, effective, and easy to implement
  3. Systematic Analysis: Comprehensive optimization from algorithm to memory access levels
  4. Detailed Performance Analysis: Clear identification of performance bottlenecks and improvement directions

Weaknesses

  1. Limited GPU Utilization: Only the Derivative function runs on the GPU, so the GPU's potential is not fully exploited
  2. Simplified Test Cases: Uses only 3-species basic mechanism; actual applications are more complex
  3. Data Transfer Issues: Critical performance bottleneck not fundamentally resolved
  4. Scalability Limitations: GPU advantages require large-scale problems to manifest

Impact

  1. Academic Contribution: Provides practical methods for GPU acceleration of atmospheric chemistry models
  2. Practical Application: Directly applicable to operational models like MONARCH
  3. Technical Demonstration: Showcases GPU porting strategies for traditional scientific computing code
  4. Foundation for Future Work: Establishes basis for further GPU optimization efforts

Applicable Scenarios

  1. Large-scale Atmospheric Simulation: Suitable for applications requiring thousands of grid cells
  2. Chemical Weather Forecasting: Applicable to operational air quality forecasting systems
  3. Climate Modeling: Supports chemical process computation in long-term climate change research
  4. Scientific Computing Optimization: Provides reference for other ODE-intensive scientific applications

References

The paper cites 12 related references, primarily including:

  • Technical documentation for CAMP framework and MONARCH model
  • Prior research on GPU-accelerated chemical kinetics
  • Foundational literature on atmospheric modeling and parallel computing
  • Technical resources for numerical solving libraries such as CVODE

Overall Assessment: This is a high-quality technical paper targeting practical applications. The proposed multi-cells strategy is simple and effective; while GPU implementation is limited by data transfer, it demonstrates good computational potential. The research provides valuable technical pathways for performance optimization of atmospheric chemistry models with significant practical value.