2025-11-20T19:31:15.361383

Domain decomposition of the modified Born series approach for large-scale wave propagation simulations

Mache, Vellekoop

The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In the current implementation of the MBS, the simulation size is limited by the working memory of a single computer or graphics processing unit (GPU). Here, we present a domain decomposition method that enhances the scalability of the MBS by distributing the computations over multiple GPUs, while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. With this new method, the computations can be performed in parallel, and a larger simulation size is possible as it is no longer limited to the memory size of a single computer or GPU. We show how to decompose large problems over subdomains and demonstrate our approach by solving the Helmholtz problem for a complex structure of $3.28\cdot 10^7$ cubic wavelengths ($320 \times 320 \times 320$ wavelengths) in just $45$ minutes with a dual-GPU simulation.

academic

Domain decomposition of the modified Born series approach for large-scale wave propagation simulations

Basic Information

Paper ID: 2410.02395
Title: Domain decomposition of the modified Born series approach for large-scale wave propagation simulations
Authors: Swapnil Mache, Ivo M. Vellekoop (University of Twente)
Classification: physics.comp-ph
Publication Date: October 2024 (arXiv v3: October 16, 2025)
Paper Link: https://arxiv.org/abs/2410.02395

Abstract

The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In current MBS implementations, simulation scale is limited by the working memory of a single computer or graphics processing unit (GPU). This paper proposes a domain decomposition method that enhances MBS scalability by distributing computation across multiple GPUs while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. Using this novel approach, computations can be executed in parallel, enabling larger simulation scales no longer constrained by single-machine or GPU memory. The authors demonstrate how large problems can be decomposed onto subdomains and solve a Helmholtz problem for a complex structure of $3.28 \times 10^7$ cubic wavelengths ( $320 \times 320 \times 320$ wavelengths) in just 45 minutes using dual-GPU simulation.

Research Background and Motivation

Problem Background

Importance of Wave Propagation Simulation: Wave propagation simulation has broad applications from nanophotonics to geophysics, but computing accurate solutions to wave equations in large heterogeneous media is computationally intensive.
Limitations of Existing Methods:
- FDTD Method: Relies on finite difference approximations, introducing cumulative errors with phase velocity errors reaching several percent
- PSTD Method: Cumulative errors in time derivatives limit simulation distances to far less than 100 wavelengths
- Traditional MBS: While highly accurate with fast convergence, it is limited by single GPU memory capacity
Advantages of MBS:
- Does not depend on finite difference approximations, avoiding numerical dispersion
- Only requires satisfaction of Nyquist sampling limits
- Possesses "pseudo-propagation" characteristics, traversing multiple wavelengths per iteration
- More than three orders of magnitude faster than FDTD

Research Motivation

Although GPUs provide significant performance improvements, their limited working memory severely constrains simulation scale. Existing FDTD methods have addressed this through domain decomposition, but MBS lacks comparable parallelization schemes.

Core Contributions

Proposed Domain Decomposition for MBS: Developed a non-overlapping domain decomposition strategy directly based on block operator decomposition of the Helmholtz equation
Preserved Key MBS Advantages: Maintained low memory usage, high accuracy, and guaranteed monotonic convergence
Eliminated Boundary Condition Dependency: No need to explicitly specify subdomain boundary conditions, avoiding complexity of traditional methods
Enabled Large-Scale Parallel Computing: Demonstrated 3D simulation of $3.27 \times 10^7$ cubic wavelengths, increasing single-GPU maximum capacity by 1.95 times
Provided Open-Source Implementation: Python open-source implementation available on GitHub

Methodology Details

Problem Definition

Solve the inhomogeneous Helmholtz equation: $(\nabla^2 + k^2)\psi = -S$

where $\nabla^2$ is the Laplacian operator, $k$ is the spatially-varying wavenumber, $\psi$ is the field, and $S$ is the source term.

Model Architecture

1. Fundamental MBS Method

Decompose operator $A := c(\nabla^2 + k^2)$ into $A = L + V$ , where:

$L := c[\nabla^2 + k_0^2]$ : Wave propagation in homogeneous medium
$V = c[k^2 - k_0^2]$ : Scattering potential

Using preconditioned Richardson iteration: $x^{(n+1)} = x^{(n)} + \alpha\Gamma^{-1}(y - Ax^{(n)})$

2. Domain Decomposition Strategy

For 1D problem decomposed into two subdomains, the block decomposition of the operator is: $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$

Key innovation in redefining the decomposition: $L = \begin{bmatrix} L_{11} & 0 \\ 0 & L_{22} \end{bmatrix}, \quad V = \begin{bmatrix} V_{11} & A_{12} \\ A_{21} & V_{22} \end{bmatrix}$

3. Off-Diagonal Block Treatment

Communication Blocks $A_{12}, A_{21}$ : Represent inter-subdomain communication, computed through differences in angular spectrum kernels
Truncation Strategy: Retain only $t \ll N$ points near boundaries, significantly reducing computational overhead
Wraparound Artifact Elimination: Automatically eliminates wraparound artifacts produced by FFT convolution

Technical Innovations

Flexibility in Operator Decomposition: Exploits MBS's freedom to choose arbitrary $A = L + V$ decomposition
Implicit Boundary Condition Handling: Avoids explicit boundary conditions by ensuring $L + V$ exactly equals the original system
Truncation Optimization: Leverages fast decay properties of kernel functions to dramatically reduce communication overhead
Scaling Factor Adjustment: $c = -\frac{0.95i}{\|k^2 - k_0^2\|_\infty + \left(\sum_{d=1}^3 a_d\right)\|A_{12}\|}$

Experimental Setup

Simulation Configuration

Structure: Densely packed spheres with refractive index 1.33 + 0.01i, randomly distributed in medium with refractive index 1
Sampling: 4 samples per wavelength
Boundary Conditions: 5-wavelength-thick absorbing boundary in x-direction, periodic boundaries in y and z directions
Convergence Criterion: Relative residual $10^{-6}$
Truncation Parameter: $t = 8$ (default value)

Computing Platform

CPU: Dual Silver-4216 2.10 GHz, 128 GB RAM
GPU: Four A40 48GB GPUs
Software: Python open-source implementation

Evaluation Metrics

Accuracy: Relative error $\|x - x_{ref}\|_2^2 / \|x_{ref}\|_2^2$ compared to single-domain simulation
Convergence: Iteration count and monotonic convergence property
Performance: Simulation time and memory usage
Scalability: Performance with different numbers of GPUs

Experimental Results

Main Results

1. Method Verification (50×50×50 wavelengths)

Accuracy: Domain decomposition relative error to single-domain simulation only $2 \times 10^{-4}$
Convergence: Maintains monotonic convergence property
Iteration Overhead: 3-domain decomposition requires 1751 iterations vs. 584 for single-domain (3× increase)

2. Large-Scale Simulation (320×320×320 wavelengths)

Simulation Scale: $3.27 \times 10^7$ cubic wavelengths, 2.16 Gigavoxels
Dual-GPU Performance: Completed in 45 minutes, 4697 iterations
CPU Comparison: Single-domain CPU requires 15.5 hours, 1316 iterations
Speedup: 20× performance improvement
Accuracy: Relative error $2.9 \times 10^{-4}$

3. Scalability Analysis

GPU Count	Time (seconds)	Total GPU Time (seconds)	Iterations	Speedup
2	2730	5460	4697	Baseline
3	2022	6066	4697	1.35×
4	1600	6400	4697	1.71×

Ablation Studies

1. Truncation Parameter Impact

Accuracy: Relative error below 0.1% at $t = 4$
Computational Overhead: Iteration count independent of $t$ , but communication time grows linearly with $t$
Recommended Value: $t = 8$ achieves good balance between accuracy and efficiency

2. Subdomain Count Impact

Iteration Count: Increases only when adding subdomains along new axes; increasing subdomains along same axis does not affect convergence
Communication Overhead: Increases with subdomain count, but with limited growth
Memory Overhead: Approximately 128 bytes/voxel per subdomain interface

Experimental Findings

Convergence Preservation: Domain decomposition does not affect MBS's monotonic convergence
Excellent Scalability: Iteration count independent of subdomain count, consistent with scalability definition
Memory Efficiency: Domain decomposition overhead occupies only ~0.2% of total memory
Activation Strategy: On-demand subdomain activation provides additional 12% performance improvement

Main Research Directions

Traditional Methods: FDTD, PSTD and other finite difference-based methods
Frequency Domain Methods: Various Helmholtz equation solvers
Parallelization Techniques: Traditional domain decomposition methods (Schwarz methods, etc.)
GPU Acceleration: Various GPU implementations of wave propagation simulation

Advantages of This Work

Accuracy Advantage: Does not depend on finite difference approximations; accuracy limited only by machine precision
Efficiency Advantage: Three orders of magnitude faster than FDTD; pseudo-propagation distance can span multiple wavelengths
Memory Advantage: Only 40 bytes per voxel, far below traditional methods
Boundary Treatment: No explicit boundary conditions required, simplifying implementation

Conclusions and Discussion

Main Conclusions

Successfully implemented domain decomposition parallelization for MBS, preserving all original method advantages
Achieved unprecedented $320^3$ wavelength scale simulation in just 45 minutes
Method demonstrates good scalability, supporting parallel computation with arbitrary numbers of GPUs
Establishes foundation for optical simulation reaching cubic millimeter scales

Limitations

Iteration Overhead: Domain decomposition increases iteration count 3-4 fold
Communication Overhead: GPU synchronization and data transfer introduce ~40% time overhead
Lockstep Execution: Requires waiting for all GPUs to complete before proceeding to next step
Memory Constraints: Still limited by single GPU memory; requires reasonable subdomain partitioning

Future Directions

Algorithm Optimization: Further reduce iteration and communication overhead
Extended Applications: Generalize to Maxwell equations and birefringent media
Cluster Computing: Extend to multi-node computing clusters
Hardware Development: Leverage next-generation GPU hardware with larger memory and computational capacity

In-Depth Evaluation

Strengths

Strong Technical Innovation: First effective parallelization of MBS; novel technical approach
Solid Theoretical Foundation: Based on rigorous mathematical derivations, ensuring method correctness
Comprehensive Experiments: From small-scale verification to large-scale demonstration; well-designed experiments
High Engineering Value: Significantly expands simulable problem scale; clear practical value
Open-Source Contribution: Complete open-source implementation promotes field development

Weaknesses

Convergence Speed: Iteration count increase from domain decomposition is significant drawback
Communication Overhead: GPU inter-communication becomes performance bottleneck, limiting further scaling
Applicable Scope: Primarily suitable for GPU cluster environments; single-machine applications limited
Parameter Tuning: Truncation parameters require adjustment based on specific problems

Impact

Academic Contribution: Provides new perspective for wave propagation simulation parallelization
Application Prospects: Broad application potential in nanophotonics, seismology, and other fields
Technology Advancement: Promotes migration of large-scale scientific computing to GPU clusters
Reproducibility: Open-source implementation ensures method reproducibility and generalizability

Applicable Scenarios

Large-Scale Optical Simulation: Particularly suitable for complex optical devices and metamaterial design
Seismic Wave Propagation: Applicable to large-scale seismic wave propagation simulation
Acoustic Modeling: Suitable for complex acoustic environment modeling
GPU Cluster Computing: Environments requiring multi-GPU or GPU cluster high-performance computing

References

The paper cites 55 important references covering multiple domains including wave propagation simulation, domain decomposition methods, and GPU parallel computing, providing solid theoretical foundation and technical support for this research.

Overall Assessment: This is a high-quality computational physics paper with outstanding contributions in technical innovation, experimental verification, and engineering applications. Despite some performance overhead, its pioneering parallelization scheme and significant scale improvements make it valuable in the wave propagation simulation field.