2025-11-24T16:10:17.960735

Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Oncescu, Purandare, Idreos et al.
While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to $7.8\times$ end-to-end improvement over standard inference by improving $110\times$ within the position-mixing part.
academic

Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Basic Information

  • Paper ID: 2410.12982
  • Title: Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
  • Authors: Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade (Harvard University)
  • Classification: cs.LG, cs.AI
  • Publication Date: arXiv preprint, submitted October 2024, updated November 2025 (v2)
  • Paper Link: https://arxiv.org/abs/2410.12982

Abstract

This paper addresses the quadratic time complexity problem in the inference phase of long convolutional sequence models (LCSMs), proposing the Flash Inference framework that reduces the time complexity of exact inference to near-linear O(Llog2L)O(L\log^2L). The method is inspired by relaxed polynomial interpolation and employs a tiling strategy to reduce memory movement and share computation. Experiments on the Hyena architecture demonstrate end-to-end inference speedup of 7.8×, with 110× speedup in the position mixing component.

Research Background and Motivation

1. Core Problem

Although Transformers have achieved tremendous success in sequence generation models, their computational cost grows quadratically with sequence length (O(L2)O(L^2)), becoming a bottleneck in both training and inference phases. To address this, researchers have proposed various subquadratic architectures, including state space models (SSMs) and long convolutional sequence models (LCSMs, such as Hyena).

2. Problem Significance

  • Training efficiency resolved: LCSMs can achieve O(LlogL)O(L\log L) complexity during training via FFT
  • Inference efficiency unresolved: During autoregressive inference, since input sequences are generated incrementally, FFT cannot be directly applied, causing complexity to degrade to O(L2)O(L^2)
  • Long context demands: As large language models process increasingly longer contexts, inference efficiency becomes more critical

3. Limitations of Existing Methods

  • Approximate methods (Massaroli et al. 2024): Project convolution filters to low-dimensional LTI SSMs, but this is only approximate and requires expensive distillation precomputation, without supporting data-dependent filters
  • Recursive perspective: May be efficient for low-dimensional SSMs but remains inefficient for high-dimensional SSMs (dimensions close to sequence length)
  • Structure-exploiting methods: Require filters to have specific structures (e.g., low-rank LTI SSM), limiting model expressiveness

4. Research Motivation

This work aims to provide an exact and general-purpose inference acceleration framework that does not depend on specific filter structures while supporting data-dependent filters.

Core Contributions

  1. First near-linear exact inference algorithm: Proposes an O(Llog2L)O(L\log^2 L) time complexity exact inference algorithm for LCSMs, achieving exact simulation compared to previous approximate methods
  2. General framework identification: Identifies key architectural properties enabling fast inference (contribution-based, query-independent), proposing the Flash Inference framework applicable to a broader class of architectures
  3. Cross-layer parallelization: Leverages tiling strategy to enable nearly complete cross-layer parallel computation of position mixing components
  4. Memory optimization: Significantly reduces data movement from Ω(L2)\Omega(L^2) to O(LlogL)O(L\log L) through tiling, saving 2× activation storage for data-independent filters
  5. Empirical validation: Achieves 7.8× end-to-end speedup on Hyena architecture with 110× speedup in convolution components

Detailed Methodology

Task Definition

Autoregressive sequence generation: Given a prompt sequence x1,,xpx_1, \ldots, x_p, the model generates subsequent tokens one by one. At each position ii, the model computes activations ai[1,M]a^{[1,M]}_i through all layers, finally sampling xi+1x_{i+1} from aiMa^M_i.

Core computational bottleneck: For each layer \ell and each dimension, computing: zt=i=1tyiρtiz_t = \sum_{i=1}^{t} y_i \cdot \rho_{t-i}

where yy is the input sequence and ρ\rho is a convolution filter of length LL. Naive implementation requires Ω(L2)\Omega(L^2) time.

Model Architecture

1. General Architecture Definition

The model consists of MM layers, each containing:

  • Position mixer module: mixer:RL×DRL×D\text{mixer}^\ell: \mathbb{R}^{L\times D} \to \mathbb{R}^{L\times D}, enabling interaction between different position embeddings
  • Feature mixing module: block:RDRD\text{block}^\ell: \mathbb{R}^D \to \mathbb{R}^D, including MLP, layer normalization, etc.

Activation computation: a(x)i=block(mixer(a1(x))i)a^\ell(x)_i = \text{block}^\ell(\text{mixer}^\ell(a^{\ell-1}(x))_i)

2. LCSM-Specific Definition

For LCSMs, the mixer is implemented via convolution: mixer(y)t=i=1tyiρti\text{mixer}^\ell(y)_t = \sum_{i=1}^{t} y_i \odot \rho^\ell_{t-i}

where \odot is the Hadamard product and ρRL×D\rho^\ell \in \mathbb{R}^{L\times D} is the filter (typically generated from low-dimensional parameters θ\theta: ρ=f(θ)\rho = f(\theta)).

Core Algorithm: Relaxed Polynomial Interpolation

1. Three Computation Strategies

Lazy method:

  • Compute zt=i=1tyiρtiz_t = \sum_{i=1}^{t} y_i \cdot \rho_{t-i} only when needed
  • Each position requires O(t)O(t) operations, total complexity O(L2)O(L^2)

Eager method:

  • When yty_t becomes available, immediately compute its contribution to all future positions
  • The tt-th iteration requires O(Lt)O(L-t) operations, total complexity still O(L2)O(L^2)

Relaxed method (proposed in this work):

  • Partition contribution space into blocks, using FFT to efficiently compute intra-block contributions
  • Key innovation: balanced rectangular tiling rather than thin strips

2. Contribution Aggregation Definition

Define τ(y,[l,r],ρ,[l,r])\tau(y, [l,r], \rho, [l',r']) as the aggregated contribution of y[l,r]y_{[l,r]} to z[l,r]z_{[l',r']}: τ(y,[l,r],ρ,[l,r])t=i=lryiρti,ltr\tau(y, [l,r], \rho, [l',r'])_t = \sum_{i=l}^{r} y_i \cdot \rho_{t-i}, \quad \forall l' \leq t \leq r'

Lemma 1: There exists an FFT-based algorithm computing τ\tau in O((L1+L2)log(L1+L2))O((L_1+L_2)\log(L_1+L_2)) time, where L1=rl+1L_1 = r-l+1 and L2=rl+1L_2 = r'-l'+1.

3. Tiling Strategy (Algorithm 1)

for i = 1 to L-1:
    U ← largest power of 2 dividing i
    z_i += y_i * ρ_0  # Red cell: direct dependency
    z[i+1:i+U] += τ(y, [i-U+1, i], ρ, [i+1, i+U])  # Gray block: eager computation
    return z_i
    unlock y_{i+1}

Key characteristics:

  • At iteration ii, compute gray block with edge length UU (largest power of 2 dividing ii)
  • Red cell handles direct dependency at current position
  • Gray block precomputes partial future contributions

Complexity analysis (Proposition 1):

  • For blocks of length 2q2^q, there are 2P1q2^{P-1-q} invocations (where L=2PL=2^P)
  • Total time: q=0P12P1qO(2qlog2q)=O(Llog2L)\sum_{q=0}^{P-1} 2^{P-1-q} \cdot O(2^q \log 2^q) = O(L\log^2 L)
  • Memory: O(L)O(L) (peak determined by largest block)

LCSM Inference Algorithm (Algorithm 2)

Extends Algorithm 1 to multiple layers and dimensions:

for i = 1 to L-1:
    U ← largest power of 2 dividing i
    for ℓ = 1 to M:  # iterate through layers
        b^ℓ_i += a^{ℓ-1}_i ⊙ ρ^ℓ_0  # Red cell
        a^ℓ_i = block^ℓ(b^ℓ_i)
        b^ℓ[i+1:i+U] += τ(a^{ℓ-1}, [i-U+1, i], ρ^ℓ, [i+1, i+U])  # Gray block
    a^0_{i+1} = sampler(a^M_i)

Complexity (Proposition 2):

  • Mixer component: O(MDLlog2L)O(MDL\log^2 L)
  • Block component: LMLM invocations (typically O(MLD2)O(MLD^2))
  • Activation storage: O(MLD)O(MLD)

Technical Innovations

1. Cross-layer Parallelization (Algorithm 3)

Gray block computation can be parallelized across all layers:

for i = 1 to L-1:
    for ℓ = 1 to M:
        Process red cells (must be sequential)
    parallel for ℓ = 1 to M:
        Process gray blocks (can be parallelized)

Advantages:

  • Small blocks (87.5% of blocks have size ≤4) are typically memory-latency limited; parallelization can saturate memory bandwidth
  • Large blocks use FFT implementation and are compute-intensive; parallelization improves throughput

2. Memory Optimization

  • Data movement: Reduced from Ω(L2)\Omega(L^2) to O(LlogL)O(L\log L) (average logL\log L positions accessed per iteration)
  • Activation reuse: Store bib^\ell_i in the space of aia^\ell_i (no longer needed afterward)
  • FFT precomputation: Precompute DFT of convolution kernels for logL\log L different block sizes, saving 1.5× computation

3. Circular Convolution Trick

  • Standard FFT convolution requires FFT of length 4U (output length 3U-1)
  • This work only needs circular convolution of length 2U (interested output range [U,2U1][U, 2U-1] unaffected by circularity)

4. Data-Dependent Filter Extension (Appendix B)

Supports cases where ρ\rho depends on data by modifying tiling strategy (Algorithm 5), with 2× computational cost.

General Framework: Flash Inference

Architectural Properties

P.1 Contribution-based: Mixer operates through contribution aggregation: mixer(y)i=read(agg(cont(y,1,i),,cont(y,i,i)))\text{mixer}(y)_i = \text{read}(\text{agg}(\text{cont}(y,1,i), \ldots, \text{cont}(y,i,i)))

where:

  • cont:RD×N×NX\text{cont}: \mathbb{R}^D \times \mathbb{N} \times \mathbb{N} \to \mathcal{X}: contribution function
  • agg:XX\text{agg}: \mathcal{X}^* \to \mathcal{X}: associative aggregation function
  • read:XRD\text{read}: \mathcal{X} \to \mathbb{R}^D: read function

Examples:

  • LCSMs: X=RD\mathcal{X}=\mathbb{R}^D, agg=\text{agg}=\sum, cont(y,i,j)=yiρji\text{cont}(y,i,j)=y_i\odot\rho_{j-i}
  • Self-attention: X=RD×R\mathcal{X}=\mathbb{R}^D\times\mathbb{R}, cont(y,i,j)=(vieki,qj,eki,qj)\text{cont}(y,i,j)=(v_i\cdot e^{\langle k_i,q_j\rangle}, e^{\langle k_i,q_j\rangle}), read(v,w)=v/w\text{read}(v,w)=v/w

P.2 Query-independent: cont(y,i,j)\text{cont}(y,i,j) does not depend on y[i+1,L]y_{[i+1,L]} (LCSMs satisfy this; Transformers do not)

General Algorithm (Algorithm 4)

Assume algorithm A\mathcal{A} can compute block contributions in T(L1,L2)T(L_1, L_2) time: A(y,[l,r],[l,r])=agg(cont(y,l,p),,cont(y,r,p))\mathcal{A}(y, [l,r], [l',r']) = \text{agg}(\text{cont}(y,l,p), \ldots, \text{cont}(y,r,p))

Theorem 2: Under P.1 and P.2, per layer execution:

  • L1L-1 invocations of A\mathcal{A} (2P1q2^{P-1-q} invocations of length 2q2^q)
  • Total time: q=0P12P1qT(2q,2q)\sum_{q=0}^{P-1} 2^{P-1-q} T(2^q, 2^q)
  • Cross-layer parallelization: gray blocks have no data dependencies, can be parallelized

Experimental Setup

Datasets and Configuration

Two experimental settings:

  1. Hyena architecture: Real LCSM model
  2. Synthetic setting: Simplified LCSM (blocks are MLP+GELU, sampler adds noise)

Hyperparameter sweep:

  • Batch size B{1,2,4,8}B \in \{1,2,4,8\}
  • Number of layers M{18,36}M \in \{18, 36\}
  • Embedding dimension D{256,768,864}D \in \{256, 768, 864\}
  • Sequence length LL: largest power of 2 fitting in memory (2152^{15} to 2182^{18})

Hardware: NVIDIA H100 and A100 GPUs

Warmup and averaging: 2 warmup runs, average of 4 runs

Comparison Methods

Baselines:

  1. Lazy: Naive per-position computation
  2. Eager: Precompute all future contributions
  3. Lazy NP / Eager NP: Non-parallel versions (no cross-layer parallelization)

Seven implementations of τ\tau (four on Pareto frontier):

  1. Conv1D: PyTorch default 1D convolution kernel (requires explicit padding)
  2. Flash Conv1D: FlashFFTConv fused kernel
  3. FFT: PyTorch native FFT convolution (DFT→pointwise multiply→IDFT)
  4. FlashFFT: FlashFFTConv fused FFT kernel
  5. Hybrid: Dynamically select optimal implementation based on block size

Evaluation Metrics

  • End-to-end time: Total time to generate all LL tokens
  • Mixer cumulative time: Time for position mixing component only
  • Per-token time: Average generation time per token
  • Speedup ratio: Improvement factor relative to Lazy (parallel version)

Implementation Details

Engineering optimizations:

  1. CUDA Graphs: Record all kernel scheduling for single-token generation as graph, replay subsequently to reduce CPU overhead (10-20% improvement)
  2. FFT precomputation: Precompute DFT of convolution kernels for log2(L)1\log_2(L)-1 block sizes
  3. FlashFFT preconfiguration: Pre-initialize configurations for different block sizes to maximize hardware performance
  4. Right padding: Use right padding instead of left padding, reducing computation by half
  5. Circular convolution: Exploit circular convolution properties to halve FFT length

Experimental Results

Main Results

1. Hyena Architecture (Table 1, Figure 2)

Mixer component speedup (relative to Lazy):

  • Maximum 110×: B=1,M=18,D=864,L=217B=1, M=18, D=864, L=2^{17}
  • Average 64-110×: Consistent significant speedup across configurations
  • Eager/Lazy baselines: Only 0.54× (actually slower due to lack of optimization)

End-to-end speedup (Table 2):

  • Maximum 7.8×: B=8,M=18,D=864,L=215B=8, M=18, D=864, L=2^{15}
  • Average 3-8×: End-to-end improvement limited by non-mixer components (MLPs, etc.)
  • Time decomposition (Figure 2a): Mixer shifts from dominant to secondary component

Per-token response time (Figure 2c):

  • Low variance: 93.75% of tokens use block size ≤8, stable time
  • Occasional spikes: Occur during large block computation (but infrequent)

2. Synthetic Setting (Tables 3-4, Figure 3)

Mixer speedup:

  • Hybrid: 80-124×
  • Single implementations: Flash Conv1D (5.5-6.5×), FlashFFT (31-56×), FFT (74-119×)
  • Conv1D (quadratic complexity): Still achieves 5-6× speedup (validates arithmetic intensity improvement from tiling)

End-to-end speedup:

  • Hybrid: 3.8-11.6×
  • CUDA Graphs effect: Without CUDA Graphs, end-to-end only 1.6×; with it, reaches 8×

Pareto optimal curve (Figure 3a):

  • Different τ\tau implementations optimal for different block sizes
  • Small blocks (U≤4): Flash Conv1D optimal (memory-latency limited)
  • Medium blocks (4<U≤64): FlashFFT optimal
  • Large blocks (U>64): FFT optimal (compute-intensive)

Ablation Studies

1. Cross-layer Parallelization Effect

  • Lazy NP vs Lazy: 0.76-0.91× (parallelization provides 10-30% improvement)
  • Eager NP vs Eager: 0.49-0.53× (parallelization provides nearly 2× improvement)
  • Proposed method: Small blocks dominate, parallelization effect significant

2. τ\tau Implementation Comparison (Figure 3b)

  • Hybrid always optimal or near-optimal
  • FFT near-optimal in most cases (gap <20% from Hybrid)
  • Flash Conv1D despite O(L2)O(L^2) complexity, still 5× faster than Lazy/Eager (memory-friendly)

3. Time Decomposition (Figures 3c, 4)

  • Non-convolution components: Consistent across all methods (CUDA Graphs ensures this)
  • Convolution components: Hybrid significantly outperforms all baselines

Case Analysis

Cumulative mixer time curves (Figures 2b, 3b):

  • Lazy/Eager: Linear growth (constant slope)
  • Proposed method: Logarithmic growth (decreasing slope)
  • Crossover point: Around 100-1000 tokens, after which advantage becomes significant

Experimental Findings

  1. Theory-practice alignment: O(Llog2L)O(L\log^2 L) complexity manifests as significant speedup in experiments
  2. Memory bandwidth importance: Flash Conv1D despite quadratic complexity achieves 5× speedup through memory access optimization
  3. Dynamic selection necessity: No single τ\tau implementation optimal for all block sizes; Hybrid strategy is critical
  4. CPU overhead non-negligible: CUDA Graphs elevates end-to-end speedup from 1.6× to 8×
  5. Parallelization benefits: Small blocks dominate (87.5%), cross-layer parallelization effect significant

1. Transformer Alternative Architectures

  • SSMs: Mamba (selective SSM), S4 (structured SSM)
  • LCSMs: Hyena, H3, CKConv, FlexConv
  • Others: Mega (moving average gated attention)

2. Fast Inference Methods

  • Recursive perspective: Exploit SSM recursive form, time O(LD)O(LD') (DD' is state dimension)
  • Approximate methods:
    • Massaroli et al. 2024: Distill to low-dimensional LTI SSM (approximate, no data-dependent support)
    • This work: Exact, supports data-dependent filters
  • Structure exploitation:
    • Dilated convolutions (Paine et al. 2016): Linear time, requires specific structure
    • This work: No structure assumptions

3. Concurrent Work

  • Agarwal et al. 2024 (FutureFill): Similar algorithm, focuses on linear dynamical systems
  • This work advantages: Supports data-dependent filters, more complete system-level optimization

4. FFT and Convolution

  • van der Hoeven 1997: Theoretical foundation of relaxed polynomial interpolation
  • FlashFFTConv: Efficient FFT convolution implementation

Conclusions and Discussion

Main Conclusions

  1. Theoretical contribution: First O(Llog2L)O(L\log^2 L) exact inference algorithm for LCSMs
  2. General framework: Identifies key properties (contribution-based, query-independent), applicable to broader architecture classes
  3. Empirical validation: 7.8× end-to-end speedup on Hyena, 110× speedup in mixer component
  4. System optimization: Cross-layer parallelization, memory optimization, dynamic implementation selection and other engineering contributions

Limitations

  1. Data-dependent filters: While theoretically supported, requires 2× computation; experiments lack thorough verification
  2. Memory requirements: Still requires storing all activations O(MLD)O(MLD) (vs recursive perspective's O(MD)O(MD'))
  3. Applicability scope:
    • Not applicable to Transformers (violates query-independence)
    • For extremely low-dimensional SSMs (Dlog2LD' \ll \log^2 L), recursive perspective may be superior
  4. Prompt phase: Long prompts still dominate time with prefilling; relative benefit of autoregressive optimization limited
  5. Hardware dependence: Speedup effects depend on GPU memory bandwidth characteristics

Future Directions

  1. Architecture design: Design new architectures satisfying Flash Inference requirements with high quality
  2. Causal data-dependent filters: How to make filters data-dependent while maintaining causality (Arora et al., Karami & Ghodsi show promise)
  3. Hybrid approaches: Combine recursive perspective (small state dimension) with convolution perspective (large state dimension)
  4. More architectures: Extend to other models satisfying framework properties (e.g., certain attention variants)
  5. Distributed inference: Optimization for multi-GPU/multi-node scenarios

In-Depth Evaluation

Strengths

1. Theoretical Rigor

  • Complete complexity analysis: Clear proof chain from Lemma 1 to Theorem 2
  • Appropriate abstraction: Properties P.1 and P.2 well-chosen, encompassing LCSMs while excluding inapplicable cases (e.g., Transformers)
  • Clever tool application: Relaxed polynomial interpolation theory applied ingeniously

2. Method Novelty

  • Tiling strategy: Balanced rectangular tiling (vs thin strips) is key insight
  • Cross-layer parallelization: Recognizes gray blocks have no dependencies, breaking traditional layer-sequential execution
  • Dynamic implementation selection: Hybrid strategy reflects deep understanding of hardware characteristics

3. Experimental Sufficiency

  • Multi-dimensional evaluation: End-to-end, mixer, per-token time
  • Comprehensive parameter sweep: 21 configuration combinations (B, M, D, L)
  • Thorough ablations: 7 τ\tau implementations, parallel vs non-parallel, CUDA Graphs effects
  • Two settings: Real Hyena + synthetic (isolates relevant factors)

4. Engineering Contributions

  • System-level optimization: CUDA Graphs, FFT precomputation, circular convolution and other practical techniques
  • Open-source potential: Detailed algorithm descriptions facilitate reproduction
  • Memory analysis: Appendices D/E provide meticulous memory usage discussion

5. Writing Clarity

  • Excellent visualization: Figure 1's tiling illustration is intuitive
  • Consistent notation: Clear symbol system throughout
  • Complete appendices: Extended discussions, proofs, additional experiments well-organized

Weaknesses

1. Experimental Limitations

  • No real model training: Uses randomly initialized weights, doesn't verify impact on model quality
  • Missing end-to-end comparisons: No comparison with other efficient architectures like Mamba
  • Insufficient prompt phase analysis: Limited exploration of actual benefits in long-prompt scenarios
  • Unverified data-dependent filters: Algorithm 5 only theoretical, no experimental validation

2. Method Limitations

  • Memory overhead: O(MLD)O(MLD) activation storage can still be bottleneck for long sequences/many layers
  • Peak memory: Largest block requires additional O(LD)O(LD) space (though mitigable through sequential processing)
  • Limited applicability:
    • Inapplicable to Transformers (mainstream architecture)
    • LCSMs themselves may have lower quality than Transformers
    • Requires architecture satisfying specific properties

3. Theoretical Analysis

  • Constant factors: Constants in O(Llog2L)O(L\log^2 L) may be large (experiments show small blocks where FFT isn't optimal)
  • Optimality: Doesn't prove whether log2L\log^2 L is a lower bound
  • Time-memory tradeoff: Insufficient analysis of time-memory Pareto frontier

4. Insufficient Comparisons

  • Approximate methods: No experimental comparison of quality-speed tradeoff with Massaroli et al.
  • Recursive perspective: Limited quantitative analysis of when recursive perspective is superior (only qualitative discussion of DO(log2L)D' \in O(\log^2 L))
  • Structure exploitation: No comparison with dilated convolution and other structure-specific methods

Impact

1. Academic Contribution

  • Pioneering: First near-linear exact inference for LCSMs
  • Theoretical depth: Connects relaxed polynomial interpolation with sequence model inference
  • Framework value: General property identification can guide future architecture design

2. Practical Value

  • Immediately applicable: Existing models like Hyena can directly apply
  • Engineering insights: System optimization techniques (CUDA Graphs, etc.) transferable
  • Limited scope: LCSMs less prevalent than Transformers in practice, limiting direct impact

3. Reproducibility

  • Clear algorithms: Detailed pseudocode facilitates implementation
  • Explicit experimental details: Hyperparameters, hardware configuration specified
  • Open-source potential: Though code release not mentioned, description sufficient for reproduction
  • Hardware dependence: Requires high-end GPUs (H100/A100) to verify all results

Applicable Scenarios

1. Ideal Scenarios

  • Long sequence generation: L>104L > 10^4, complexity advantage pronounced
  • Autoregressive-dominated: Generated tokens far exceed prompt length
  • LCSM architectures: Pre-trained Hyena-like models
  • High-end hardware: GPU with high memory bandwidth supporting parallelization

2. Inapplicable Scenarios

  • Short sequences: L<1000L < 1000, constant overhead may offset advantages
  • Long prompt, short generation: Prefilling dominates, autoregressive optimization benefit limited
  • Transformer models: Violates query-independence property
  • Extremely low-dimensional SSMs: Dlog2LD' \ll \log^2 L, recursive perspective superior

3. Potential Extensions

  • Hybrid architectures: Transformer + LCSM layers (apply this method to subset of layers)
  • Approximate variants: Combine this exact method with low-rank approximation
  • Other modalities: Audio, video generation (convolutions more common)

Key References

  1. van der Hoeven, J. (1997). Lazy multiplication of formal power series. ISSAC. Theoretical foundation
  2. Poli, M. et al. (2023). Hyena hierarchy: Towards larger convolutional language models. Primary application target
  3. Massaroli, S. et al. (2024). Laughing hyena distillery: Extracting compact recurrences from convolutions. NeurIPS. Approximate method comparison
  4. Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. SSM related work
  5. Fu, D. Y. et al. (2023). FlashFFTConv: Efficient convolutions for long sequences with tensor cores. Implementation foundation
  6. Agarwal, N. et al. (2024). FutureFill: Fast generation from convolutional sequence models. Concurrent work

Overall Assessment: This is an excellent paper tightly combining theory and practice. Theoretically, it provides the first near-linear exact algorithm for LCSM inference and abstracts a general framework; practically, it achieves significant speedup through system-level optimization. Main limitations are that LCSMs themselves are less prevalent than Transformers in practice, and data-dependent filter experimental validation is insufficient. This work provides new perspectives on sequence model inference optimization, particularly valuable for guiding future architecture design. Recommended for researchers interested in model efficiency, sequence modeling, and system optimization.