The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among $N$ tokens by projecting the input sequence onto a fixed length latent sequence of $M \ll N$ tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at $O(NM)$ cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
Core Issue: The self-attention mechanism in traditional Transformers exhibits O(N²) time and memory complexity, severely limiting its application on large-scale unstructured meshes (such as point clouds and grids in physical simulations).
Application Significance: In partial differential equation (PDE) surrogate modeling, each point in a 3D point cloud is treated as a token containing geometric and physical quantities (such as coordinates, normal vectors, material properties). High-fidelity physical system simulations are computationally expensive, and machine learning surrogate models provide fast approximation alternatives.
Limitations of Existing Methods:
PerceiverIO: Performs only a single encoding and a single decoding step, so the latent bottleneck may limit accuracy.
Transolver: Shares projection weights across heads and cannot leverage existing GPU kernels for scaled dot-product attention.
LNO: Applies only a single projection, which limits the capacity of deep models.
Research Motivation: Develop an attention mechanism that maintains global communication capability while achieving linear complexity, enabling Transformers to handle geometries with millions of points.
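To make the gap between quadratic and linear scaling concrete, the sketch below counts attention-score entries per head for full self-attention (N x N) and for routing through M latent tokens (two N x M products, one to encode and one to decode). The function names and the latent length M = 64 are illustrative assumptions, not values taken from the paper, and the count is only a rough proxy for compute.

```python
# Rough count of attention-score entries per head, per layer.
# N = number of mesh points (tokens); M = latent sequence length.
# M = 64 is an illustrative assumption, not a value from the paper.


def score_entries_full(n: int) -> int:
    """Entries in the N x N score matrix of standard self-attention."""
    return n * n


def score_entries_routed(n: int, m: int) -> int:
    """Entries in the two N x M score matrices (encode + decode) of latent routing."""
    return 2 * n * m


M = 64
for n in (10_000, 100_000, 1_000_000):
    full = score_entries_full(n)
    routed = score_entries_routed(n, M)
    print(f"N={n:>9,}  full={full:.2e}  routed={routed:.2e}  ratio={full / routed:,.0f}x")
```

Because the routed count grows as N·M rather than N², the advantage widens in proportion to N when M is held fixed.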
Linear-Complexity Token Mixing: Proposes the FLARE self-attention mechanism, which achieves linear complexity by replacing full self-attention with a low-rank projection onto a latent sequence followed by a reconstruction back to the input tokens.
Superior Accuracy: Achieves prediction accuracy superior to leading neural surrogate models across multiple PDE benchmarks with fewer parameters and lower computational complexity.
Unprecedented Scalability: FLARE is built entirely on standard fused attention primitives, ensuring high GPU utilization and supporting end-to-end training on unstructured meshes with millions of points.
New Benchmark Dataset: Releases a large-scale high-resolution metal additive manufacturing dataset for residual displacement prediction research.
Given an input sequence X ∈ R^(N×C), where N is the number of tokens and C is the feature dimension, FLARE aims to learn a linear-complexity attention mechanism that enables efficient global token communication.
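One way to write the routing for a single head is as two softmax attention steps. This is a sketch under assumptions: the symbols $Q \in \mathbb{R}^{M \times D}$ (learnable latent queries), $K, V \in \mathbb{R}^{N \times D}$ (keys and values computed from X), $Z$, $Y$, the head dimension $D$, and the $1/\sqrt{D}$ scaling are introduced here for illustration and may differ from the paper's exact formulation.

```latex
\begin{aligned}
Z &= \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{D}}\right) V
   = W_{\mathrm{enc}}\, V,
   \qquad W_{\mathrm{enc}} \in \mathbb{R}^{M \times N}, \\
Y &= \operatorname{softmax}\!\left(\frac{K Q^\top}{\sqrt{D}}\right) Z
   = W_{\mathrm{dec}}\, W_{\mathrm{enc}}\, V,
   \qquad W_{\mathrm{dec}} \in \mathbb{R}^{N \times M}.
\end{aligned}
```

Under these assumptions, the effective $N \times N$ mixing matrix $W_{\mathrm{dec}} W_{\mathrm{enc}}$ has rank at most $M$ and is never formed explicitly. Only the two $N \times M$ factors are applied, which gives the $O(NM)$ cost.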
Per-Head Independent Projection: Unlike Transolver's shared projection weights, FLARE assigns different latent token slices to each head, allowing each head to learn independent attention relationships.
Deep Residual MLP: Uses deep residual MLPs for the key and value projections, which can capture higher-order feature interactions than simple linear layers.
Symmetric Encoder-Decoder Design: The symmetry of encoding and decoding operations promotes stable information flow.
Fused Kernel Compatibility: Built entirely on standard SDPA operations, compatible with optimized algorithms like Flash Attention.
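The design points above can be sketched as two scaled dot-product attention calls per head: one that encodes N tokens into M latents and one that decodes them back. This is a minimal NumPy illustration, not the authors' implementation. The names `sdpa` and `routed_attention`, the tensor sizes, and the use of random arrays in place of learned queries and MLP-projected keys and values are all assumptions made for the example.

```python
import numpy as np


def sdpa(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v, batched over heads."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def routed_attention(latent_q, k, v):
    """Encode N tokens into M latents, then decode back to N tokens.

    latent_q: (H, M, D) latent queries, a separate slice for each head
    k, v:     (H, N, D) per-head keys and values computed from the input
    returns:  (H, N, D)
    """
    z = sdpa(latent_q, k, v)     # encode: (H, M, D), score matrix is M x N
    return sdpa(k, latent_q, z)  # decode: (H, N, D), score matrix is N x M


# Illustrative sizes; random arrays stand in for learned parameters.
rng = np.random.default_rng(0)
H, N, M, D = 4, 1000, 16, 8
latent_q = rng.standard_normal((H, M, D))
k = rng.standard_normal((H, N, D))
v = rng.standard_normal((H, N, D))

y = routed_attention(latent_q, k, v)
print(y.shape)  # (4, 1000, 8)
```

Both steps are ordinary attention calls with swapped query and key roles, so in PyTorch each `sdpa` call could plausibly be replaced by `torch.nn.functional.scaled_dot_product_attention` to pick up fused kernels. No N x N matrix is ever built.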
FLARE trains successfully on the million-point DrivAerML dataset on a single H100 GPU, making it the first attention-based neural surrogate model to handle million-point scales without memory offloading or distributed computing.
The paper cites important works in Transformers, neural operators, and efficient attention mechanisms, providing solid theoretical foundations and comparison baselines.
Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the Transformer scalability problem. The FLARE method is not only theoretically elegant with its low-rank factorization interpretation but also demonstrates excellent practical performance. The paper features comprehensive experimental design, rigorous theoretical analysis, and significant implications for advancing large-scale geometric deep learning and scientific computing.