2025-11-24T16:10:17.960735

Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Oncescu, Purandare, Idreos et al.

While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to $7.8\times$ end-to-end improvement over standard inference by improving $110\times$ within the position-mixing part.

academic

Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Basic Information

Paper ID: 2410.12982
Title: Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
Authors: Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade (Harvard University)
Classification: cs.LG, cs.AI
Publication Date: arXiv preprint, submitted October 2024, updated November 2025 (v2)
Paper Link: https://arxiv.org/abs/2410.12982

Abstract

This paper addresses the quadratic time complexity problem in the inference phase of long convolutional sequence models (LCSMs), proposing the Flash Inference framework that reduces the time complexity of exact inference to near-linear $O(L\log^2L)$ . The method is inspired by relaxed polynomial interpolation and employs a tiling strategy to reduce memory movement and share computation. Experiments on the Hyena architecture demonstrate end-to-end inference speedup of 7.8×, with 110× speedup in the position mixing component.

Research Background and Motivation

1. Core Problem

Although Transformers have achieved tremendous success in sequence generation models, their computational cost grows quadratically with sequence length ( $O(L^2)$ ), becoming a bottleneck in both training and inference phases. To address this, researchers have proposed various subquadratic architectures, including state space models (SSMs) and long convolutional sequence models (LCSMs, such as Hyena).

2. Problem Significance

Training efficiency resolved: LCSMs can achieve $O(L\log L)$ complexity during training via FFT
Inference efficiency unresolved: During autoregressive inference, since input sequences are generated incrementally, FFT cannot be directly applied, causing complexity to degrade to $O(L^2)$
Long context demands: As large language models process increasingly longer contexts, inference efficiency becomes more critical

3. Limitations of Existing Methods

Approximate methods (Massaroli et al. 2024): Project convolution filters to low-dimensional LTI SSMs, but this is only approximate and requires expensive distillation precomputation, without supporting data-dependent filters
Recursive perspective: May be efficient for low-dimensional SSMs but remains inefficient for high-dimensional SSMs (dimensions close to sequence length)
Structure-exploiting methods: Require filters to have specific structures (e.g., low-rank LTI SSM), limiting model expressiveness

4. Research Motivation

This work aims to provide an exact and general-purpose inference acceleration framework that does not depend on specific filter structures while supporting data-dependent filters.

Core Contributions

First near-linear exact inference algorithm: Proposes an $O(L\log^2 L)$ time complexity exact inference algorithm for LCSMs, achieving exact simulation compared to previous approximate methods
General framework identification: Identifies key architectural properties enabling fast inference (contribution-based, query-independent), proposing the Flash Inference framework applicable to a broader class of architectures
Cross-layer parallelization: Leverages tiling strategy to enable nearly complete cross-layer parallel computation of position mixing components
Memory optimization: Significantly reduces data movement from $\Omega(L^2)$ to $O(L\log L)$ through tiling, saving 2× activation storage for data-independent filters
Empirical validation: Achieves 7.8× end-to-end speedup on Hyena architecture with 110× speedup in convolution components

Detailed Methodology

Task Definition

Autoregressive sequence generation: Given a prompt sequence $x_1, \ldots, x_p$ , the model generates subsequent tokens one by one. At each position $i$ , the model computes activations $a^{[1,M]}_i$ through all layers, finally sampling $x_{i+1}$ from $a^M_i$ .

Core computational bottleneck: For each layer $\ell$ and each dimension, computing: $z_t = \sum_{i=1}^{t} y_i \cdot \rho_{t-i}$

where $y$ is the input sequence and $\rho$ is a convolution filter of length $L$ . Naive implementation requires $\Omega(L^2)$ time.

Model Architecture

1. General Architecture Definition

The model consists of $M$ layers, each containing:

Position mixer module: $\text{mixer}^\ell: \mathbb{R}^{L\times D} \to \mathbb{R}^{L\times D}$ , enabling interaction between different position embeddings
Feature mixing module: $\text{block}^\ell: \mathbb{R}^D \to \mathbb{R}^D$ , including MLP, layer normalization, etc.

Activation computation: $a^\ell(x)_i = \text{block}^\ell(\text{mixer}^\ell(a^{\ell-1}(x))_i)$

2. LCSM-Specific Definition

For LCSMs, the mixer is implemented via convolution: $\text{mixer}^\ell(y)_t = \sum_{i=1}^{t} y_i \odot \rho^\ell_{t-i}$

where $\odot$ is the Hadamard product and $\rho^\ell \in \mathbb{R}^{L\times D}$ is the filter (typically generated from low-dimensional parameters $\theta$ : $\rho = f(\theta)$ ).

Core Algorithm: Relaxed Polynomial Interpolation

1. Three Computation Strategies

Lazy method:

Compute $z_t = \sum_{i=1}^{t} y_i \cdot \rho_{t-i}$ only when needed
Each position requires $O(t)$ operations, total complexity $O(L^2)$

Eager method:

When $y_t$ becomes available, immediately compute its contribution to all future positions
The $t$ -th iteration requires $O(L-t)$ operations, total complexity still $O(L^2)$

Relaxed method (proposed in this work):

Partition contribution space into blocks, using FFT to efficiently compute intra-block contributions
Key innovation: balanced rectangular tiling rather than thin strips

2. Contribution Aggregation Definition

Define $\tau(y, [l,r], \rho, [l',r'])$ as the aggregated contribution of $y_{[l,r]}$ to $z_{[l',r']}$ : $\tau(y, [l,r], \rho, [l',r'])_t = \sum_{i=l}^{r} y_i \cdot \rho_{t-i}, \quad \forall l' \leq t \leq r'$

Lemma 1: There exists an FFT-based algorithm computing $\tau$ in $O((L_1+L_2)\log(L_1+L_2))$ time, where $L_1 = r-l+1$ and $L_2 = r'-l'+1$ .

3. Tiling Strategy (Algorithm 1)

for i = 1 to L-1:
    U ← largest power of 2 dividing i
    z_i += y_i * ρ_0  # Red cell: direct dependency
    z[i+1:i+U] += τ(y, [i-U+1, i], ρ, [i+1, i+U])  # Gray block: eager computation
    return z_i
    unlock y_{i+1}

Key characteristics:

At iteration $i$ , compute gray block with edge length $U$ (largest power of 2 dividing $i$ )
Red cell handles direct dependency at current position
Gray block precomputes partial future contributions

Complexity analysis (Proposition 1):

For blocks of length $2^q$ , there are $2^{P-1-q}$ invocations (where $L=2^P$ )
Total time: $\sum_{q=0}^{P-1} 2^{P-1-q} \cdot O(2^q \log 2^q) = O(L\log^2 L)$
Memory: $O(L)$ (peak determined by largest block)

LCSM Inference Algorithm (Algorithm 2)

Extends Algorithm 1 to multiple layers and dimensions:

for i = 1 to L-1:
    U ← largest power of 2 dividing i
    for ℓ = 1 to M:  # iterate through layers
        b^ℓ_i += a^{ℓ-1}_i ⊙ ρ^ℓ_0  # Red cell
        a^ℓ_i = block^ℓ(b^ℓ_i)
        b^ℓ[i+1:i+U] += τ(a^{ℓ-1}, [i-U+1, i], ρ^ℓ, [i+1, i+U])  # Gray block
    a^0_{i+1} = sampler(a^M_i)

Complexity (Proposition 2):

Mixer component: $O(MDL\log^2 L)$
Block component: $LM$ invocations (typically $O(MLD^2)$ )
Activation storage: $O(MLD)$

Technical Innovations

1. Cross-layer Parallelization (Algorithm 3)

Gray block computation can be parallelized across all layers:

for i = 1 to L-1:
    for ℓ = 1 to M:
        Process red cells (must be sequential)
    parallel for ℓ = 1 to M:
        Process gray blocks (can be parallelized)

Advantages:

Small blocks (87.5% of blocks have size ≤4) are typically memory-latency limited; parallelization can saturate memory bandwidth
Large blocks use FFT implementation and are compute-intensive; parallelization improves throughput

2. Memory Optimization

Data movement: Reduced from $\Omega(L^2)$ to $O(L\log L)$ (average $\log L$ positions accessed per iteration)
Activation reuse: Store $b^\ell_i$ in the space of $a^\ell_i$ (no longer needed afterward)
FFT precomputation: Precompute DFT of convolution kernels for $\log L$ different block sizes, saving 1.5× computation

3. Circular Convolution Trick

Standard FFT convolution requires FFT of length 4U (output length 3U-1)
This work only needs circular convolution of length 2U (interested output range $[U, 2U-1]$ unaffected by circularity)

4. Data-Dependent Filter Extension (Appendix B)

Supports cases where $\rho$ depends on data by modifying tiling strategy (Algorithm 5), with 2× computational cost.

General Framework: Flash Inference

Architectural Properties

P.1 Contribution-based: Mixer operates through contribution aggregation: $\text{mixer}(y)_i = \text{read}(\text{agg}(\text{cont}(y,1,i), \ldots, \text{cont}(y,i,i)))$

where:

$\text{cont}: \mathbb{R}^D \times \mathbb{N} \times \mathbb{N} \to \mathcal{X}$ : contribution function
$\text{agg}: \mathcal{X}^* \to \mathcal{X}$ : associative aggregation function
$\text{read}: \mathcal{X} \to \mathbb{R}^D$ : read function

Examples:

LCSMs: $\mathcal{X}=\mathbb{R}^D$ , $\text{agg}=\sum$ , $\text{cont}(y,i,j)=y_i\odot\rho_{j-i}$
Self-attention: $\mathcal{X}=\mathbb{R}^D\times\mathbb{R}$ , $\text{cont}(y,i,j)=(v_i\cdot e^{\langle k_i,q_j\rangle}, e^{\langle k_i,q_j\rangle})$ , $\text{read}(v,w)=v/w$

P.2 Query-independent: $\text{cont}(y,i,j)$ does not depend on $y_{[i+1,L]}$ (LCSMs satisfy this; Transformers do not)

General Algorithm (Algorithm 4)

Assume algorithm $\mathcal{A}$ can compute block contributions in $T(L_1, L_2)$ time: $\mathcal{A}(y, [l,r], [l',r']) = \text{agg}(\text{cont}(y,l,p), \ldots, \text{cont}(y,r,p))$

Theorem 2: Under P.1 and P.2, per layer execution:

$L-1$ invocations of $\mathcal{A}$ ( $2^{P-1-q}$ invocations of length $2^q$ )
Total time: $\sum_{q=0}^{P-1} 2^{P-1-q} T(2^q, 2^q)$
Cross-layer parallelization: gray blocks have no data dependencies, can be parallelized

Experimental Setup

Datasets and Configuration

Two experimental settings:

Hyena architecture: Real LCSM model
Synthetic setting: Simplified LCSM (blocks are MLP+GELU, sampler adds noise)

Hyperparameter sweep:

Batch size $B \in \{1,2,4,8\}$
Number of layers $M \in \{18, 36\}$
Embedding dimension $D \in \{256, 768, 864\}$
Sequence length $L$ : largest power of 2 fitting in memory ( $2^{15}$ to $2^{18}$ )

Hardware: NVIDIA H100 and A100 GPUs

Warmup and averaging: 2 warmup runs, average of 4 runs

Comparison Methods

Baselines:

Lazy: Naive per-position computation
Eager: Precompute all future contributions
Lazy NP / Eager NP: Non-parallel versions (no cross-layer parallelization)

Seven implementations of $\tau$ (four on Pareto frontier):

Conv1D: PyTorch default 1D convolution kernel (requires explicit padding)
Flash Conv1D: FlashFFTConv fused kernel
FFT: PyTorch native FFT convolution (DFT→pointwise multiply→IDFT)
FlashFFT: FlashFFTConv fused FFT kernel
Hybrid: Dynamically select optimal implementation based on block size

Evaluation Metrics

End-to-end time: Total time to generate all $L$ tokens
Mixer cumulative time: Time for position mixing component only
Per-token time: Average generation time per token
Speedup ratio: Improvement factor relative to Lazy (parallel version)

Implementation Details

Engineering optimizations:

CUDA Graphs: Record all kernel scheduling for single-token generation as graph, replay subsequently to reduce CPU overhead (10-20% improvement)
FFT precomputation: Precompute DFT of convolution kernels for $\log_2(L)-1$ block sizes
FlashFFT preconfiguration: Pre-initialize configurations for different block sizes to maximize hardware performance
Right padding: Use right padding instead of left padding, reducing computation by half
Circular convolution: Exploit circular convolution properties to halve FFT length

Experimental Results

Main Results

1. Hyena Architecture (Table 1, Figure 2)

Mixer component speedup (relative to Lazy):

Maximum 110×: $B=1, M=18, D=864, L=2^{17}$
Average 64-110×: Consistent significant speedup across configurations
Eager/Lazy baselines: Only 0.54× (actually slower due to lack of optimization)

End-to-end speedup (Table 2):

Maximum 7.8×: $B=8, M=18, D=864, L=2^{15}$
Average 3-8×: End-to-end improvement limited by non-mixer components (MLPs, etc.)
Time decomposition (Figure 2a): Mixer shifts from dominant to secondary component

Per-token response time (Figure 2c):

Low variance: 93.75% of tokens use block size ≤8, stable time
Occasional spikes: Occur during large block computation (but infrequent)

2. Synthetic Setting (Tables 3-4, Figure 3)

Mixer speedup:

Hybrid: 80-124×
Single implementations: Flash Conv1D (5.5-6.5×), FlashFFT (31-56×), FFT (74-119×)
Conv1D (quadratic complexity): Still achieves 5-6× speedup (validates arithmetic intensity improvement from tiling)

End-to-end speedup:

Hybrid: 3.8-11.6×
CUDA Graphs effect: Without CUDA Graphs, end-to-end only 1.6×; with it, reaches 8×

Pareto optimal curve (Figure 3a):

Different $\tau$ implementations optimal for different block sizes
Small blocks (U≤4): Flash Conv1D optimal (memory-latency limited)
Medium blocks (4<U≤64): FlashFFT optimal
Large blocks (U>64): FFT optimal (compute-intensive)

Ablation Studies

1. Cross-layer Parallelization Effect

Lazy NP vs Lazy: 0.76-0.91× (parallelization provides 10-30% improvement)
Eager NP vs Eager: 0.49-0.53× (parallelization provides nearly 2× improvement)
Proposed method: Small blocks dominate, parallelization effect significant

2. $\tau$ Implementation Comparison (Figure 3b)

Hybrid always optimal or near-optimal
FFT near-optimal in most cases (gap <20% from Hybrid)
Flash Conv1D despite $O(L^2)$ complexity, still 5× faster than Lazy/Eager (memory-friendly)

3. Time Decomposition (Figures 3c, 4)

Non-convolution components: Consistent across all methods (CUDA Graphs ensures this)
Convolution components: Hybrid significantly outperforms all baselines

Case Analysis

Cumulative mixer time curves (Figures 2b, 3b):

Lazy/Eager: Linear growth (constant slope)
Proposed method: Logarithmic growth (decreasing slope)
Crossover point: Around 100-1000 tokens, after which advantage becomes significant

Experimental Findings

Theory-practice alignment: $O(L\log^2 L)$ complexity manifests as significant speedup in experiments
Memory bandwidth importance: Flash Conv1D despite quadratic complexity achieves 5× speedup through memory access optimization
Dynamic selection necessity: No single $\tau$ implementation optimal for all block sizes; Hybrid strategy is critical
CPU overhead non-negligible: CUDA Graphs elevates end-to-end speedup from 1.6× to 8×
Parallelization benefits: Small blocks dominate (87.5%), cross-layer parallelization effect significant

1. Transformer Alternative Architectures

SSMs: Mamba (selective SSM), S4 (structured SSM)
LCSMs: Hyena, H3, CKConv, FlexConv
Others: Mega (moving average gated attention)

2. Fast Inference Methods

Recursive perspective: Exploit SSM recursive form, time $O(LD')$ ( $D'$ is state dimension)
Approximate methods:
- Massaroli et al. 2024: Distill to low-dimensional LTI SSM (approximate, no data-dependent support)
- This work: Exact, supports data-dependent filters
Structure exploitation:
- Dilated convolutions (Paine et al. 2016): Linear time, requires specific structure
- This work: No structure assumptions

3. Concurrent Work

Agarwal et al. 2024 (FutureFill): Similar algorithm, focuses on linear dynamical systems
This work advantages: Supports data-dependent filters, more complete system-level optimization

4. FFT and Convolution

van der Hoeven 1997: Theoretical foundation of relaxed polynomial interpolation
FlashFFTConv: Efficient FFT convolution implementation

Conclusions and Discussion

Main Conclusions

Theoretical contribution: First $O(L\log^2 L)$ exact inference algorithm for LCSMs
General framework: Identifies key properties (contribution-based, query-independent), applicable to broader architecture classes
Empirical validation: 7.8× end-to-end speedup on Hyena, 110× speedup in mixer component
System optimization: Cross-layer parallelization, memory optimization, dynamic implementation selection and other engineering contributions

Limitations

Data-dependent filters: While theoretically supported, requires 2× computation; experiments lack thorough verification
Memory requirements: Still requires storing all activations $O(MLD)$ (vs recursive perspective's $O(MD')$ )
Applicability scope:
- Not applicable to Transformers (violates query-independence)
- For extremely low-dimensional SSMs ( $D' \ll \log^2 L$ ), recursive perspective may be superior
Prompt phase: Long prompts still dominate time with prefilling; relative benefit of autoregressive optimization limited
Hardware dependence: Speedup effects depend on GPU memory bandwidth characteristics

Future Directions

Architecture design: Design new architectures satisfying Flash Inference requirements with high quality
Causal data-dependent filters: How to make filters data-dependent while maintaining causality (Arora et al., Karami & Ghodsi show promise)
Hybrid approaches: Combine recursive perspective (small state dimension) with convolution perspective (large state dimension)
More architectures: Extend to other models satisfying framework properties (e.g., certain attention variants)
Distributed inference: Optimization for multi-GPU/multi-node scenarios

In-Depth Evaluation

Strengths

1. Theoretical Rigor

Complete complexity analysis: Clear proof chain from Lemma 1 to Theorem 2
Appropriate abstraction: Properties P.1 and P.2 well-chosen, encompassing LCSMs while excluding inapplicable cases (e.g., Transformers)
Clever tool application: Relaxed polynomial interpolation theory applied ingeniously

2. Method Novelty

Tiling strategy: Balanced rectangular tiling (vs thin strips) is key insight
Cross-layer parallelization: Recognizes gray blocks have no dependencies, breaking traditional layer-sequential execution
Dynamic implementation selection: Hybrid strategy reflects deep understanding of hardware characteristics

3. Experimental Sufficiency

Multi-dimensional evaluation: End-to-end, mixer, per-token time
Comprehensive parameter sweep: 21 configuration combinations (B, M, D, L)
Thorough ablations: 7 $\tau$ implementations, parallel vs non-parallel, CUDA Graphs effects
Two settings: Real Hyena + synthetic (isolates relevant factors)

4. Engineering Contributions

System-level optimization: CUDA Graphs, FFT precomputation, circular convolution and other practical techniques
Open-source potential: Detailed algorithm descriptions facilitate reproduction
Memory analysis: Appendices D/E provide meticulous memory usage discussion

5. Writing Clarity

Excellent visualization: Figure 1's tiling illustration is intuitive
Consistent notation: Clear symbol system throughout
Complete appendices: Extended discussions, proofs, additional experiments well-organized

Weaknesses

1. Experimental Limitations

No real model training: Uses randomly initialized weights, doesn't verify impact on model quality
Missing end-to-end comparisons: No comparison with other efficient architectures like Mamba
Insufficient prompt phase analysis: Limited exploration of actual benefits in long-prompt scenarios
Unverified data-dependent filters: Algorithm 5 only theoretical, no experimental validation

2. Method Limitations

Memory overhead: $O(MLD)$ activation storage can still be bottleneck for long sequences/many layers
Peak memory: Largest block requires additional $O(LD)$ space (though mitigable through sequential processing)
Limited applicability:
- Inapplicable to Transformers (mainstream architecture)
- LCSMs themselves may have lower quality than Transformers
- Requires architecture satisfying specific properties

3. Theoretical Analysis

Constant factors: Constants in $O(L\log^2 L)$ may be large (experiments show small blocks where FFT isn't optimal)
Optimality: Doesn't prove whether $\log^2 L$ is a lower bound
Time-memory tradeoff: Insufficient analysis of time-memory Pareto frontier

4. Insufficient Comparisons

Approximate methods: No experimental comparison of quality-speed tradeoff with Massaroli et al.
Recursive perspective: Limited quantitative analysis of when recursive perspective is superior (only qualitative discussion of $D' \in O(\log^2 L)$ )
Structure exploitation: No comparison with dilated convolution and other structure-specific methods

Impact

1. Academic Contribution

Pioneering: First near-linear exact inference for LCSMs
Theoretical depth: Connects relaxed polynomial interpolation with sequence model inference
Framework value: General property identification can guide future architecture design

2. Practical Value

Immediately applicable: Existing models like Hyena can directly apply
Engineering insights: System optimization techniques (CUDA Graphs, etc.) transferable
Limited scope: LCSMs less prevalent than Transformers in practice, limiting direct impact

3. Reproducibility

Clear algorithms: Detailed pseudocode facilitates implementation
Explicit experimental details: Hyperparameters, hardware configuration specified
Open-source potential: Though code release not mentioned, description sufficient for reproduction
Hardware dependence: Requires high-end GPUs (H100/A100) to verify all results

Applicable Scenarios

1. Ideal Scenarios

Long sequence generation: $L > 10^4$ , complexity advantage pronounced
Autoregressive-dominated: Generated tokens far exceed prompt length
LCSM architectures: Pre-trained Hyena-like models
High-end hardware: GPU with high memory bandwidth supporting parallelization

2. Inapplicable Scenarios

Short sequences: $L < 1000$ , constant overhead may offset advantages
Long prompt, short generation: Prefilling dominates, autoregressive optimization benefit limited
Transformer models: Violates query-independence property
Extremely low-dimensional SSMs: $D' \ll \log^2 L$ , recursive perspective superior

3. Potential Extensions

Hybrid architectures: Transformer + LCSM layers (apply this method to subset of layers)
Approximate variants: Combine this exact method with low-rank approximation
Other modalities: Audio, video generation (convolutions more common)

Key References

van der Hoeven, J. (1997). Lazy multiplication of formal power series. ISSAC. Theoretical foundation
Poli, M. et al. (2023). Hyena hierarchy: Towards larger convolutional language models. Primary application target
Massaroli, S. et al. (2024). Laughing hyena distillery: Extracting compact recurrences from convolutions. NeurIPS. Approximate method comparison
Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. SSM related work
Fu, D. Y. et al. (2023). FlashFFTConv: Efficient convolutions for long sequences with tensor cores. Implementation foundation
Agarwal, N. et al. (2024). FutureFill: Fast generation from convolutional sequence models. Concurrent work

Overall Assessment: This is an excellent paper tightly combining theory and practice. Theoretically, it provides the first near-linear exact algorithm for LCSM inference and abstracts a general framework; practically, it achieves significant speedup through system-level optimization. Main limitations are that LCSMs themselves are less prevalent than Transformers in practice, and data-dependent filter experimental validation is insufficient. This work provides new perspectives on sequence model inference optimization, particularly valuable for guiding future architecture design. Recommended for researchers interested in model efficiency, sequence modeling, and system optimization.