The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In the current implementation of the MBS, the simulation size is limited by the working memory of a single computer or graphics processing unit (GPU). Here, we present a domain decomposition method that enhances the scalability of the MBS by distributing the computations over multiple GPUs, while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. With this new method, the computations can be performed in parallel, and a larger simulation size is possible as it is no longer limited to the memory size of a single computer or GPU. We show how to decompose large problems over subdomains and demonstrate our approach by solving the Helmholtz problem for a complex structure of $3.28\cdot 10^7$ cubic wavelengths ($320 \times 320 \times 320$ wavelengths) in just $45$ minutes with a dual-GPU simulation.
Domain decomposition of the modified Born series approach for large-scale wave propagation simulations
- Paper ID: 2410.02395
- Title: Domain decomposition of the modified Born series approach for large-scale wave propagation simulations
- Authors: Swapnil Mache, Ivo M. Vellekoop (University of Twente)
- Classification: physics.comp-ph
- Publication Date: October 2024 (arXiv v3: October 16, 2025)
- Paper Link: https://arxiv.org/abs/2410.02395
The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In current MBS implementations, simulation scale is limited by the working memory of a single computer or graphics processing unit (GPU). This paper proposes a domain decomposition method that enhances MBS scalability by distributing computation across multiple GPUs while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. Using this novel approach, computations can be executed in parallel, enabling larger simulation scales no longer constrained by single-machine or GPU memory. The authors demonstrate how large problems can be decomposed onto subdomains and solve a Helmholtz problem for a complex structure of 3.28×107 cubic wavelengths (320×320×320 wavelengths) in just 45 minutes using dual-GPU simulation.
- Importance of Wave Propagation Simulation: Wave propagation simulation has broad applications from nanophotonics to geophysics, but computing accurate solutions to wave equations in large heterogeneous media is computationally intensive.
- Limitations of Existing Methods:
- FDTD Method: Relies on finite difference approximations, introducing cumulative errors with phase velocity errors reaching several percent
- PSTD Method: Cumulative errors in time derivatives limit simulation distances to far less than 100 wavelengths
- Traditional MBS: While highly accurate with fast convergence, it is limited by single GPU memory capacity
- Advantages of MBS:
- Does not depend on finite difference approximations, avoiding numerical dispersion
- Only requires satisfaction of Nyquist sampling limits
- Possesses "pseudo-propagation" characteristics, traversing multiple wavelengths per iteration
- More than three orders of magnitude faster than FDTD
Although GPUs provide significant performance improvements, their limited working memory severely constrains simulation scale. Existing FDTD methods have addressed this through domain decomposition, but MBS lacks comparable parallelization schemes.
- Proposed Domain Decomposition for MBS: Developed a non-overlapping domain decomposition strategy directly based on block operator decomposition of the Helmholtz equation
- Preserved Key MBS Advantages: Maintained low memory usage, high accuracy, and guaranteed monotonic convergence
- Eliminated Boundary Condition Dependency: No need to explicitly specify subdomain boundary conditions, avoiding complexity of traditional methods
- Enabled Large-Scale Parallel Computing: Demonstrated 3D simulation of 3.27×107 cubic wavelengths, increasing single-GPU maximum capacity by 1.95 times
- Provided Open-Source Implementation: Python open-source implementation available on GitHub
Solve the inhomogeneous Helmholtz equation:
(∇2+k2)ψ=−S
where ∇2 is the Laplacian operator, k is the spatially-varying wavenumber, ψ is the field, and S is the source term.
Decompose operator A:=c(∇2+k2) into A=L+V, where:
- L:=c[∇2+k02]: Wave propagation in homogeneous medium
- V=c[k2−k02]: Scattering potential
Using preconditioned Richardson iteration:
x(n+1)=x(n)+αΓ−1(y−Ax(n))
For 1D problem decomposed into two subdomains, the block decomposition of the operator is:
[A11A21A12A22][x1x2]=[y1y2]
Key innovation in redefining the decomposition:
L=[L1100L22],V=[V11A21A12V22]
- Communication Blocks A12,A21: Represent inter-subdomain communication, computed through differences in angular spectrum kernels
- Truncation Strategy: Retain only t≪N points near boundaries, significantly reducing computational overhead
- Wraparound Artifact Elimination: Automatically eliminates wraparound artifacts produced by FFT convolution
- Flexibility in Operator Decomposition: Exploits MBS's freedom to choose arbitrary A=L+V decomposition
- Implicit Boundary Condition Handling: Avoids explicit boundary conditions by ensuring L+V exactly equals the original system
- Truncation Optimization: Leverages fast decay properties of kernel functions to dramatically reduce communication overhead
- Scaling Factor Adjustment:
c=−∥k2−k02∥∞+(∑d=13ad)∥A12∥0.95i
- Structure: Densely packed spheres with refractive index 1.33 + 0.01i, randomly distributed in medium with refractive index 1
- Sampling: 4 samples per wavelength
- Boundary Conditions: 5-wavelength-thick absorbing boundary in x-direction, periodic boundaries in y and z directions
- Convergence Criterion: Relative residual 10−6
- Truncation Parameter: t=8 (default value)
- CPU: Dual Silver-4216 2.10 GHz, 128 GB RAM
- GPU: Four A40 48GB GPUs
- Software: Python open-source implementation
- Accuracy: Relative error ∥x−xref∥22/∥xref∥22 compared to single-domain simulation
- Convergence: Iteration count and monotonic convergence property
- Performance: Simulation time and memory usage
- Scalability: Performance with different numbers of GPUs
- Accuracy: Domain decomposition relative error to single-domain simulation only 2×10−4
- Convergence: Maintains monotonic convergence property
- Iteration Overhead: 3-domain decomposition requires 1751 iterations vs. 584 for single-domain (3× increase)
- Simulation Scale: 3.27×107 cubic wavelengths, 2.16 Gigavoxels
- Dual-GPU Performance: Completed in 45 minutes, 4697 iterations
- CPU Comparison: Single-domain CPU requires 15.5 hours, 1316 iterations
- Speedup: 20× performance improvement
- Accuracy: Relative error 2.9×10−4
| GPU Count | Time (seconds) | Total GPU Time (seconds) | Iterations | Speedup |
|---|
| 2 | 2730 | 5460 | 4697 | Baseline |
| 3 | 2022 | 6066 | 4697 | 1.35× |
| 4 | 1600 | 6400 | 4697 | 1.71× |
- Accuracy: Relative error below 0.1% at t=4
- Computational Overhead: Iteration count independent of t, but communication time grows linearly with t
- Recommended Value: t=8 achieves good balance between accuracy and efficiency
- Iteration Count: Increases only when adding subdomains along new axes; increasing subdomains along same axis does not affect convergence
- Communication Overhead: Increases with subdomain count, but with limited growth
- Memory Overhead: Approximately 128 bytes/voxel per subdomain interface
- Convergence Preservation: Domain decomposition does not affect MBS's monotonic convergence
- Excellent Scalability: Iteration count independent of subdomain count, consistent with scalability definition
- Memory Efficiency: Domain decomposition overhead occupies only ~0.2% of total memory
- Activation Strategy: On-demand subdomain activation provides additional 12% performance improvement
- Traditional Methods: FDTD, PSTD and other finite difference-based methods
- Frequency Domain Methods: Various Helmholtz equation solvers
- Parallelization Techniques: Traditional domain decomposition methods (Schwarz methods, etc.)
- GPU Acceleration: Various GPU implementations of wave propagation simulation
- Accuracy Advantage: Does not depend on finite difference approximations; accuracy limited only by machine precision
- Efficiency Advantage: Three orders of magnitude faster than FDTD; pseudo-propagation distance can span multiple wavelengths
- Memory Advantage: Only 40 bytes per voxel, far below traditional methods
- Boundary Treatment: No explicit boundary conditions required, simplifying implementation
- Successfully implemented domain decomposition parallelization for MBS, preserving all original method advantages
- Achieved unprecedented 3203 wavelength scale simulation in just 45 minutes
- Method demonstrates good scalability, supporting parallel computation with arbitrary numbers of GPUs
- Establishes foundation for optical simulation reaching cubic millimeter scales
- Iteration Overhead: Domain decomposition increases iteration count 3-4 fold
- Communication Overhead: GPU synchronization and data transfer introduce ~40% time overhead
- Lockstep Execution: Requires waiting for all GPUs to complete before proceeding to next step
- Memory Constraints: Still limited by single GPU memory; requires reasonable subdomain partitioning
- Algorithm Optimization: Further reduce iteration and communication overhead
- Extended Applications: Generalize to Maxwell equations and birefringent media
- Cluster Computing: Extend to multi-node computing clusters
- Hardware Development: Leverage next-generation GPU hardware with larger memory and computational capacity
- Strong Technical Innovation: First effective parallelization of MBS; novel technical approach
- Solid Theoretical Foundation: Based on rigorous mathematical derivations, ensuring method correctness
- Comprehensive Experiments: From small-scale verification to large-scale demonstration; well-designed experiments
- High Engineering Value: Significantly expands simulable problem scale; clear practical value
- Open-Source Contribution: Complete open-source implementation promotes field development
- Convergence Speed: Iteration count increase from domain decomposition is significant drawback
- Communication Overhead: GPU inter-communication becomes performance bottleneck, limiting further scaling
- Applicable Scope: Primarily suitable for GPU cluster environments; single-machine applications limited
- Parameter Tuning: Truncation parameters require adjustment based on specific problems
- Academic Contribution: Provides new perspective for wave propagation simulation parallelization
- Application Prospects: Broad application potential in nanophotonics, seismology, and other fields
- Technology Advancement: Promotes migration of large-scale scientific computing to GPU clusters
- Reproducibility: Open-source implementation ensures method reproducibility and generalizability
- Large-Scale Optical Simulation: Particularly suitable for complex optical devices and metamaterial design
- Seismic Wave Propagation: Applicable to large-scale seismic wave propagation simulation
- Acoustic Modeling: Suitable for complex acoustic environment modeling
- GPU Cluster Computing: Environments requiring multi-GPU or GPU cluster high-performance computing
The paper cites 55 important references covering multiple domains including wave propagation simulation, domain decomposition methods, and GPU parallel computing, providing solid theoretical foundation and technical support for this research.
Overall Assessment: This is a high-quality computational physics paper with outstanding contributions in technical innovation, experimental verification, and engineering applications. Despite some performance overhead, its pioneering parallelization scheme and significant scale improvements make it valuable in the wave propagation simulation field.