2025-11-20T19:31:15.361383

Domain decomposition of the modified Born series approach for large-scale wave propagation simulations

Mache, Vellekoop
The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In the current implementation of the MBS, the simulation size is limited by the working memory of a single computer or graphics processing unit (GPU). Here, we present a domain decomposition method that enhances the scalability of the MBS by distributing the computations over multiple GPUs, while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. With this new method, the computations can be performed in parallel, and a larger simulation size is possible as it is no longer limited to the memory size of a single computer or GPU. We show how to decompose large problems over subdomains and demonstrate our approach by solving the Helmholtz problem for a complex structure of $3.28\cdot 10^7$ cubic wavelengths ($320 \times 320 \times 320$ wavelengths) in just $45$ minutes with a dual-GPU simulation.
academic

Domain decomposition of the modified Born series approach for large-scale wave propagation simulations

Basic Information

  • Paper ID: 2410.02395
  • Title: Domain decomposition of the modified Born series approach for large-scale wave propagation simulations
  • Authors: Swapnil Mache, Ivo M. Vellekoop (University of Twente)
  • Classification: physics.comp-ph
  • Publication Date: October 2024 (arXiv v3: October 16, 2025)
  • Paper Link: https://arxiv.org/abs/2410.02395

Abstract

The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In current MBS implementations, simulation scale is limited by the working memory of a single computer or graphics processing unit (GPU). This paper proposes a domain decomposition method that enhances MBS scalability by distributing computation across multiple GPUs while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. Using this novel approach, computations can be executed in parallel, enabling larger simulation scales no longer constrained by single-machine or GPU memory. The authors demonstrate how large problems can be decomposed onto subdomains and solve a Helmholtz problem for a complex structure of 3.28×1073.28 \times 10^7 cubic wavelengths (320×320×320320 \times 320 \times 320 wavelengths) in just 45 minutes using dual-GPU simulation.

Research Background and Motivation

Problem Background

  1. Importance of Wave Propagation Simulation: Wave propagation simulation has broad applications from nanophotonics to geophysics, but computing accurate solutions to wave equations in large heterogeneous media is computationally intensive.
  2. Limitations of Existing Methods:
    • FDTD Method: Relies on finite difference approximations, introducing cumulative errors with phase velocity errors reaching several percent
    • PSTD Method: Cumulative errors in time derivatives limit simulation distances to far less than 100 wavelengths
    • Traditional MBS: While highly accurate with fast convergence, it is limited by single GPU memory capacity
  3. Advantages of MBS:
    • Does not depend on finite difference approximations, avoiding numerical dispersion
    • Only requires satisfaction of Nyquist sampling limits
    • Possesses "pseudo-propagation" characteristics, traversing multiple wavelengths per iteration
    • More than three orders of magnitude faster than FDTD

Research Motivation

Although GPUs provide significant performance improvements, their limited working memory severely constrains simulation scale. Existing FDTD methods have addressed this through domain decomposition, but MBS lacks comparable parallelization schemes.

Core Contributions

  1. Proposed Domain Decomposition for MBS: Developed a non-overlapping domain decomposition strategy directly based on block operator decomposition of the Helmholtz equation
  2. Preserved Key MBS Advantages: Maintained low memory usage, high accuracy, and guaranteed monotonic convergence
  3. Eliminated Boundary Condition Dependency: No need to explicitly specify subdomain boundary conditions, avoiding complexity of traditional methods
  4. Enabled Large-Scale Parallel Computing: Demonstrated 3D simulation of 3.27×1073.27 \times 10^7 cubic wavelengths, increasing single-GPU maximum capacity by 1.95 times
  5. Provided Open-Source Implementation: Python open-source implementation available on GitHub

Methodology Details

Problem Definition

Solve the inhomogeneous Helmholtz equation: (2+k2)ψ=S(\nabla^2 + k^2)\psi = -S

where 2\nabla^2 is the Laplacian operator, kk is the spatially-varying wavenumber, ψ\psi is the field, and SS is the source term.

Model Architecture

1. Fundamental MBS Method

Decompose operator A:=c(2+k2)A := c(\nabla^2 + k^2) into A=L+VA = L + V, where:

  • L:=c[2+k02]L := c[\nabla^2 + k_0^2]: Wave propagation in homogeneous medium
  • V=c[k2k02]V = c[k^2 - k_0^2]: Scattering potential

Using preconditioned Richardson iteration: x(n+1)=x(n)+αΓ1(yAx(n))x^{(n+1)} = x^{(n)} + \alpha\Gamma^{-1}(y - Ax^{(n)})

2. Domain Decomposition Strategy

For 1D problem decomposed into two subdomains, the block decomposition of the operator is: [A11A12A21A22][x1x2]=[y1y2]\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}

Key innovation in redefining the decomposition: L=[L1100L22],V=[V11A12A21V22]L = \begin{bmatrix} L_{11} & 0 \\ 0 & L_{22} \end{bmatrix}, \quad V = \begin{bmatrix} V_{11} & A_{12} \\ A_{21} & V_{22} \end{bmatrix}

3. Off-Diagonal Block Treatment

  • Communication Blocks A12,A21A_{12}, A_{21}: Represent inter-subdomain communication, computed through differences in angular spectrum kernels
  • Truncation Strategy: Retain only tNt \ll N points near boundaries, significantly reducing computational overhead
  • Wraparound Artifact Elimination: Automatically eliminates wraparound artifacts produced by FFT convolution

Technical Innovations

  1. Flexibility in Operator Decomposition: Exploits MBS's freedom to choose arbitrary A=L+VA = L + V decomposition
  2. Implicit Boundary Condition Handling: Avoids explicit boundary conditions by ensuring L+VL + V exactly equals the original system
  3. Truncation Optimization: Leverages fast decay properties of kernel functions to dramatically reduce communication overhead
  4. Scaling Factor Adjustment: c=0.95ik2k02+(d=13ad)A12c = -\frac{0.95i}{\|k^2 - k_0^2\|_\infty + \left(\sum_{d=1}^3 a_d\right)\|A_{12}\|}

Experimental Setup

Simulation Configuration

  • Structure: Densely packed spheres with refractive index 1.33 + 0.01i, randomly distributed in medium with refractive index 1
  • Sampling: 4 samples per wavelength
  • Boundary Conditions: 5-wavelength-thick absorbing boundary in x-direction, periodic boundaries in y and z directions
  • Convergence Criterion: Relative residual 10610^{-6}
  • Truncation Parameter: t=8t = 8 (default value)

Computing Platform

  • CPU: Dual Silver-4216 2.10 GHz, 128 GB RAM
  • GPU: Four A40 48GB GPUs
  • Software: Python open-source implementation

Evaluation Metrics

  1. Accuracy: Relative error xxref22/xref22\|x - x_{ref}\|_2^2 / \|x_{ref}\|_2^2 compared to single-domain simulation
  2. Convergence: Iteration count and monotonic convergence property
  3. Performance: Simulation time and memory usage
  4. Scalability: Performance with different numbers of GPUs

Experimental Results

Main Results

1. Method Verification (50×50×50 wavelengths)

  • Accuracy: Domain decomposition relative error to single-domain simulation only 2×1042 \times 10^{-4}
  • Convergence: Maintains monotonic convergence property
  • Iteration Overhead: 3-domain decomposition requires 1751 iterations vs. 584 for single-domain (3× increase)

2. Large-Scale Simulation (320×320×320 wavelengths)

  • Simulation Scale: 3.27×1073.27 \times 10^7 cubic wavelengths, 2.16 Gigavoxels
  • Dual-GPU Performance: Completed in 45 minutes, 4697 iterations
  • CPU Comparison: Single-domain CPU requires 15.5 hours, 1316 iterations
  • Speedup: 20× performance improvement
  • Accuracy: Relative error 2.9×1042.9 \times 10^{-4}

3. Scalability Analysis

GPU CountTime (seconds)Total GPU Time (seconds)IterationsSpeedup
2273054604697Baseline
32022606646971.35×
41600640046971.71×

Ablation Studies

1. Truncation Parameter Impact

  • Accuracy: Relative error below 0.1% at t=4t = 4
  • Computational Overhead: Iteration count independent of tt, but communication time grows linearly with tt
  • Recommended Value: t=8t = 8 achieves good balance between accuracy and efficiency

2. Subdomain Count Impact

  • Iteration Count: Increases only when adding subdomains along new axes; increasing subdomains along same axis does not affect convergence
  • Communication Overhead: Increases with subdomain count, but with limited growth
  • Memory Overhead: Approximately 128 bytes/voxel per subdomain interface

Experimental Findings

  1. Convergence Preservation: Domain decomposition does not affect MBS's monotonic convergence
  2. Excellent Scalability: Iteration count independent of subdomain count, consistent with scalability definition
  3. Memory Efficiency: Domain decomposition overhead occupies only ~0.2% of total memory
  4. Activation Strategy: On-demand subdomain activation provides additional 12% performance improvement

Main Research Directions

  1. Traditional Methods: FDTD, PSTD and other finite difference-based methods
  2. Frequency Domain Methods: Various Helmholtz equation solvers
  3. Parallelization Techniques: Traditional domain decomposition methods (Schwarz methods, etc.)
  4. GPU Acceleration: Various GPU implementations of wave propagation simulation

Advantages of This Work

  1. Accuracy Advantage: Does not depend on finite difference approximations; accuracy limited only by machine precision
  2. Efficiency Advantage: Three orders of magnitude faster than FDTD; pseudo-propagation distance can span multiple wavelengths
  3. Memory Advantage: Only 40 bytes per voxel, far below traditional methods
  4. Boundary Treatment: No explicit boundary conditions required, simplifying implementation

Conclusions and Discussion

Main Conclusions

  1. Successfully implemented domain decomposition parallelization for MBS, preserving all original method advantages
  2. Achieved unprecedented 3203320^3 wavelength scale simulation in just 45 minutes
  3. Method demonstrates good scalability, supporting parallel computation with arbitrary numbers of GPUs
  4. Establishes foundation for optical simulation reaching cubic millimeter scales

Limitations

  1. Iteration Overhead: Domain decomposition increases iteration count 3-4 fold
  2. Communication Overhead: GPU synchronization and data transfer introduce ~40% time overhead
  3. Lockstep Execution: Requires waiting for all GPUs to complete before proceeding to next step
  4. Memory Constraints: Still limited by single GPU memory; requires reasonable subdomain partitioning

Future Directions

  1. Algorithm Optimization: Further reduce iteration and communication overhead
  2. Extended Applications: Generalize to Maxwell equations and birefringent media
  3. Cluster Computing: Extend to multi-node computing clusters
  4. Hardware Development: Leverage next-generation GPU hardware with larger memory and computational capacity

In-Depth Evaluation

Strengths

  1. Strong Technical Innovation: First effective parallelization of MBS; novel technical approach
  2. Solid Theoretical Foundation: Based on rigorous mathematical derivations, ensuring method correctness
  3. Comprehensive Experiments: From small-scale verification to large-scale demonstration; well-designed experiments
  4. High Engineering Value: Significantly expands simulable problem scale; clear practical value
  5. Open-Source Contribution: Complete open-source implementation promotes field development

Weaknesses

  1. Convergence Speed: Iteration count increase from domain decomposition is significant drawback
  2. Communication Overhead: GPU inter-communication becomes performance bottleneck, limiting further scaling
  3. Applicable Scope: Primarily suitable for GPU cluster environments; single-machine applications limited
  4. Parameter Tuning: Truncation parameters require adjustment based on specific problems

Impact

  1. Academic Contribution: Provides new perspective for wave propagation simulation parallelization
  2. Application Prospects: Broad application potential in nanophotonics, seismology, and other fields
  3. Technology Advancement: Promotes migration of large-scale scientific computing to GPU clusters
  4. Reproducibility: Open-source implementation ensures method reproducibility and generalizability

Applicable Scenarios

  1. Large-Scale Optical Simulation: Particularly suitable for complex optical devices and metamaterial design
  2. Seismic Wave Propagation: Applicable to large-scale seismic wave propagation simulation
  3. Acoustic Modeling: Suitable for complex acoustic environment modeling
  4. GPU Cluster Computing: Environments requiring multi-GPU or GPU cluster high-performance computing

References

The paper cites 55 important references covering multiple domains including wave propagation simulation, domain decomposition methods, and GPU parallel computing, providing solid theoretical foundation and technical support for this research.


Overall Assessment: This is a high-quality computational physics paper with outstanding contributions in technical innovation, experimental verification, and engineering applications. Despite some performance overhead, its pioneering parallelization scheme and significant scale improvements make it valuable in the wave propagation simulation field.