Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
Oncescu, Purandare, Idreos et al.
While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to $7.8\times$ end-to-end improvement over standard inference by improving $110\times$ within the position-mixing part.
academic
Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
This paper addresses the quadratic time complexity problem in the inference phase of long convolutional sequence models (LCSMs), proposing the Flash Inference framework that reduces the time complexity of exact inference to near-linear O(Llog2L). The method is inspired by relaxed polynomial interpolation and employs a tiling strategy to reduce memory movement and share computation. Experiments on the Hyena architecture demonstrate end-to-end inference speedup of 7.8×, with 110× speedup in the position mixing component.
Although Transformers have achieved tremendous success in sequence generation models, their computational cost grows quadratically with sequence length (O(L2)), becoming a bottleneck in both training and inference phases. To address this, researchers have proposed various subquadratic architectures, including state space models (SSMs) and long convolutional sequence models (LCSMs, such as Hyena).
Training efficiency resolved: LCSMs can achieve O(LlogL) complexity during training via FFT
Inference efficiency unresolved: During autoregressive inference, since input sequences are generated incrementally, FFT cannot be directly applied, causing complexity to degrade to O(L2)
Long context demands: As large language models process increasingly longer contexts, inference efficiency becomes more critical
Approximate methods (Massaroli et al. 2024): Project convolution filters to low-dimensional LTI SSMs, but this is only approximate and requires expensive distillation precomputation, without supporting data-dependent filters
Recursive perspective: May be efficient for low-dimensional SSMs but remains inefficient for high-dimensional SSMs (dimensions close to sequence length)
Structure-exploiting methods: Require filters to have specific structures (e.g., low-rank LTI SSM), limiting model expressiveness
This work aims to provide an exact and general-purpose inference acceleration framework that does not depend on specific filter structures while supporting data-dependent filters.
First near-linear exact inference algorithm: Proposes an O(Llog2L) time complexity exact inference algorithm for LCSMs, achieving exact simulation compared to previous approximate methods
General framework identification: Identifies key architectural properties enabling fast inference (contribution-based, query-independent), proposing the Flash Inference framework applicable to a broader class of architectures
Cross-layer parallelization: Leverages tiling strategy to enable nearly complete cross-layer parallel computation of position mixing components
Memory optimization: Significantly reduces data movement from Ω(L2) to O(LlogL) through tiling, saving 2× activation storage for data-independent filters
Empirical validation: Achieves 7.8× end-to-end speedup on Hyena architecture with 110× speedup in convolution components
Autoregressive sequence generation: Given a prompt sequence x1,…,xp, the model generates subsequent tokens one by one. At each position i, the model computes activations ai[1,M] through all layers, finally sampling xi+1 from aiM.
Core computational bottleneck: For each layer ℓ and each dimension, computing:
zt=∑i=1tyi⋅ρt−i
where y is the input sequence and ρ is a convolution filter of length L. Naive implementation requires Ω(L2) time.
for i = 1 to L-1:
U ← largest power of 2 dividing i
z_i += y_i * ρ_0 # Red cell: direct dependency
z[i+1:i+U] += τ(y, [i-U+1, i], ρ, [i+1, i+U]) # Gray block: eager computation
return z_i
unlock y_{i+1}
Key characteristics:
At iteration i, compute gray block with edge length U (largest power of 2 dividing i)
Red cell handles direct dependency at current position
Extends Algorithm 1 to multiple layers and dimensions:
for i = 1 to L-1:
U ← largest power of 2 dividing i
for ℓ = 1 to M: # iterate through layers
b^ℓ_i += a^{ℓ-1}_i ⊙ ρ^ℓ_0 # Red cell
a^ℓ_i = block^ℓ(b^ℓ_i)
b^ℓ[i+1:i+U] += τ(a^{ℓ-1}, [i-U+1, i], ρ^ℓ, [i+1, i+U]) # Gray block
a^0_{i+1} = sampler(a^M_i)
van der Hoeven, J. (1997). Lazy multiplication of formal power series. ISSAC. Theoretical foundation
Poli, M. et al. (2023). Hyena hierarchy: Towards larger convolutional language models. Primary application target
Massaroli, S. et al. (2024). Laughing hyena distillery: Extracting compact recurrences from convolutions. NeurIPS. Approximate method comparison
Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. SSM related work
Fu, D. Y. et al. (2023). FlashFFTConv: Efficient convolutions for long sequences with tensor cores. Implementation foundation
Agarwal, N. et al. (2024). FutureFill: Fast generation from convolutional sequence models. Concurrent work
Overall Assessment: This is an excellent paper tightly combining theory and practice. Theoretically, it provides the first near-linear exact algorithm for LCSM inference and abstracts a general framework; practically, it achieves significant speedup through system-level optimization. Main limitations are that LCSMs themselves are less prevalent than Transformers in practice, and data-dependent filter experimental validation is insufficient. This work provides new perspectives on sequence model inference optimization, particularly valuable for guiding future architecture design. Recommended for researchers interested in model efficiency, sequence modeling, and system optimization.