2025-11-12T14:19:10.228100

State-Space Models for Tabular Prior-Data Fitted Networks

Koch, Wever, Raisch et al.

Basic Information

  • Paper ID: 2510.14573
  • Title: State-Space Models for Tabular Prior-Data Fitted Networks
  • Authors: Felix Koch, Marcel Wever, Fabian Raisch, Benjamin Tischler
  • Classification: cs.LG
  • Publication Time/Venue: Proceedings of the 1st ICML Workshop on Foundation Models for Structured Data, Vancouver, Canada. 2025
  • Paper Link: https://arxiv.org/abs/2510.14573

Abstract

Recent advancements in foundation models for tabular data, such as TabPFN, demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSMs' inherent sensitivity to the order of input tokens - an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order-dependence, achieving predictive performance competitive with the original TabPFN model.

Research Background and Motivation

  1. Problem to be Addressed: This research targets the computational cost of Transformer architectures in foundation models for tabular data, in particular the O(n²) complexity in the number of rows n, which limits scalability to large datasets.
  2. Problem Significance: TabPFN, a foundation model for tabular data, delivers strong predictive performance and approximates Bayesian inference within milliseconds. However, its Transformer-based architecture hits memory and compute bottlenecks on large datasets.
  3. Limitations of Existing Methods:
    • Transformer self-attention has quadratic complexity in sequence length
    • Directly replacing the Transformer with Mamba makes predictions sensitive to the order of the input sequence
    • Row order in tabular data is semantically meaningless, which conflicts with the causal design of SSMs
  4. Research Motivation: To explore structured state space models (SSMs) as alternatives to Transformers, keeping the efficiency of linear complexity while reducing input order dependence through bidirectional processing.
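
To make the scaling argument concrete, the following back-of-the-envelope sketch compares the memory of one naively materialized n × n attention score matrix with one state vector per row. The 4-byte float size and the state width of 64 are illustrative assumptions, not values from the paper, and optimized attention kernels avoid storing the full matrix.

```python
# Back-of-the-envelope comparison. BYTES_PER_FLOAT and STATE_DIM are
# illustrative assumptions, not values reported in the paper.
BYTES_PER_FLOAT = 4
STATE_DIM = 64


def attention_matrix_bytes(n_rows: int) -> int:
    """Memory for one dense n x n attention score matrix (quadratic in n)."""
    return n_rows * n_rows * BYTES_PER_FLOAT


def linear_scan_bytes(n_rows: int, state_dim: int = STATE_DIM) -> int:
    """Memory for one state vector per row (linear in n)."""
    return n_rows * state_dim * BYTES_PER_FLOAT


for exp in (10, 15, 17):
    n = 2 ** exp
    attn_gib = attention_matrix_bytes(n) / 2 ** 30
    scan_gib = linear_scan_bytes(n) / 2 ** 30
    print(f"n=2^{exp}: attention {attn_gib:.3f} GiB, linear scan {scan_gib:.3f} GiB")
```

Doubling the number of rows quadruples the first quantity but only doubles the second, which is the gap the linear-time SSM is meant to exploit.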

Core Contributions

  1. Proposed Hydra-based TabPFN Architecture: Integrated the bidirectional structured state space model Hydra into TabPFN, achieving linear-time complexity for tabular data processing.
  2. Introduced Repeated Context Permutation (RCP) Technique: Further reduced the SSM's sensitivity to sequence order by running inference on multiple random permutations of the input rows and averaging the predictions.
  3. Improved Scalability: Compared to the original TabPFN, the new method handles larger datasets, with the row limit rising from 2¹⁵ to 2¹⁷ (a 4× increase).
  4. Maintained Competitive Predictive Performance: On the OpenML CC-18 benchmark, the accuracy of Hydra-based TabPFN is only 1.1% lower than that of the original model.

Methodology Details

Task Definition

This paper studies tabular classification tasks where:

  • Input: Complete tabular dataset containing training and test samples
  • Output: Class probability predictions for test samples
  • Constraint: Inference must be completed in a single forward pass without gradient updates or fine-tuning

Model Architecture

1. Hydra Architecture Replacement

  • Core Design: Replace Transformer encoder with stacked Hydra layers
  • Bidirectional Processing: Utilize quasi-separable matrix mixers for bidirectional state space modeling
  • Layer Structure: Each Hydra layer contains bidirectional state space mixing followed by feedforward transformation
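
The difference between causal and bidirectional mixing can be illustrated with a toy scalar recurrence. This is only a sketch: Hydra itself uses quasi-separable matrix mixers, and the functions `causal_scan` and `bidirectional_scan` and the decay value 0.5 are hypothetical names and values chosen for illustration.

```python
def causal_scan(xs, decay=0.5):
    """h_t = decay * h_{t-1} + x_t, so position t only sees rows 0..t."""
    out, h = [], 0.0
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_scan(xs, decay=0.5):
    """Sum of a forward and a backward scan.

    Both scans include x_t itself, so it is subtracted once.
    The result weights row j by decay ** |t - j| at position t.
    """
    fwd = causal_scan(xs, decay)
    bwd = causal_scan(xs[::-1], decay)[::-1]
    return [f + b - x for f, b, x in zip(fwd, bwd, xs)]


rows = [1.0, 2.0, 4.0, 8.0]
print(causal_scan(rows))         # the first position ignores all later rows
print(bidirectional_scan(rows))  # every position aggregates all rows
```

In this toy setting, reversing the rows simply reverses the bidirectional output, whereas the causal output changes entirely. The bidirectional scan is still not permutation-invariant, which is consistent with the summary's claim that bidirectionality reduces rather than removes order dependence.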

2. Embedding Strategy Preservation

  • Retain original TabPFN's data embedding methodology
  • Each input represented as concatenation of feature values and class labels
  • Handle unlabeled data during inference through marginalization over all possible label assignments

3. Repeated Context Permutation (RCP)

Algorithm flow:

Input: Number of permutations r, context D, test sample xtest
Output: Averaged class prediction for xtest
Initialize empty list: outputs ← []
for i = 1 to r do
    Shuffle rows of D: Dp ← shuffle(D)
    Append xtest to Dp: Din ← Dp ∪ {xtest}
    Predict: append PFN.predict(Din) to outputs
end for
Return average of outputs
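
The RCP loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `rcp_predict`, `toy_predictor`, and the `(features, label)` row layout are hypothetical, and `toy_predictor` merely stands in for the pretrained PFN with an artificially order-sensitive model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature values, class label)
Predictor = Callable[[List[Row], Sequence[float]], List[float]]


def rcp_predict(predict_proba: Predictor, context: List[Row],
                x_test: Sequence[float], r: int, seed: int = 0) -> List[float]:
    """Average class probabilities over r random row orders of the context."""
    if r < 1:
        raise ValueError("r must be at least 1")
    rng = random.Random(seed)
    sums: List[float] = []
    for _ in range(r):
        shuffled = list(context)  # copy so the caller's context is untouched
        rng.shuffle(shuffled)
        probs = predict_proba(shuffled, x_test)
        sums = list(probs) if not sums else [s + p for s, p in zip(sums, probs)]
    return [s / r for s in sums]


def toy_predictor(context: List[Row], x_test: Sequence[float]) -> List[float]:
    """Hypothetical order-sensitive model: later rows get more weight."""
    scores = [0.0, 0.0]  # two classes, for illustration only
    weight = 1.0
    for _, label in reversed(context):
        scores[label] += weight
        weight *= 0.5
    total = sum(scores)
    return [s / total for s in scores]


context = [([0.1], 0), ([0.9], 1), ([0.2], 0), ([0.8], 1)]
print(rcp_predict(toy_predictor, context, [0.5], r=1))
print(rcp_predict(toy_predictor, context, [0.5], r=64))
```

Each call to the predictor is a full forward pass, so the cost of this loop grows linearly with r.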

Technical Innovations

  1. Bidirectionality Addresses Order Sensitivity: Unlike unidirectional Mamba, Hydra processes the sequence in both directions, which enables symmetric context aggregation and reduces dependence on input order.
  2. Linear Complexity: Quasi-separable matrix multiplication gives O(n) complexity in sequence length, compared with the Transformer's O(n²).
  3. RCP Strategy: Averaging predictions over multiple random permutations further reduces order sensitivity, a design tailored to the unordered rows of tabular data.

Experimental Setup

Datasets

  • Primary Dataset: OpenML CC-18 benchmark suite
  • Filtering Criteria: ≤2000 rows, ≤100 features, ≤10 classes
  • Final Datasets: 30 multiclass classification datasets
  • Data Splitting: Each dataset randomly split into train/test sets 16 times
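
The filtering and repeated-splitting protocol above might be sketched as follows. The helper names `passes_filter` and `repeated_splits`, the 50/50 split ratio, and the seed are assumptions; the summary states only the size limits and the 16 repetitions.

```python
import random
from typing import List, Tuple


def passes_filter(n_rows: int, n_features: int, n_classes: int) -> bool:
    """Size limits stated in the filtering criteria."""
    return n_rows <= 2000 and n_features <= 100 and n_classes <= 10


def repeated_splits(n_rows: int, n_repeats: int = 16,
                    test_fraction: float = 0.5,  # assumed, not from the paper
                    seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return n_repeats random (train_indices, test_indices) pairs."""
    if n_rows < 2:
        raise ValueError("need at least two rows to split")
    rng = random.Random(seed)
    n_test = min(n_rows - 1, max(1, round(n_rows * test_fraction)))
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        splits.append((sorted(idx[n_test:]), sorted(idx[:n_test])))
    return splits


if passes_filter(n_rows=150, n_features=4, n_classes=3):
    splits = repeated_splits(150)
    print(len(splits), len(splits[0][0]), len(splits[0][1]))
```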

Evaluation Metrics

  1. Accuracy: Classification correctness rate
  2. AUC OvO: One-vs-One multiclass AUC
  3. KL Divergence: Measures prediction distribution differences across different input permutations, assessing order sensitivity
  4. Inference Time: Computational time across different input scales
  5. Memory Usage: Maximum dataset size that can be processed
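
The order-sensitivity metric can be illustrated with a small KL divergence helper. This sketch assumes the predictions for one test sample under several row permutations are compared pairwise and averaged; the summary does not state how the paper aggregates the divergences, and the names `kl_divergence` and `mean_pairwise_kl` are hypothetical.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(predictions: List[Sequence[float]]) -> float:
    """Average KL over all ordered pairs of predictions (assumed aggregation)."""
    if len(predictions) < 2:
        raise ValueError("need at least two predictions")
    pairs = [(p, q)
             for i, p in enumerate(predictions)
             for j, q in enumerate(predictions) if i != j]
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Class probabilities for one test sample under three row permutations.
order_sensitive = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
order_invariant = [[0.6, 0.4], [0.6, 0.4], [0.6, 0.4]]
print(mean_pairwise_kl(order_sensitive))
print(mean_pairwise_kl(order_invariant))
```

A value of zero means the prediction is identical for every permutation, and larger values indicate stronger order dependence.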

Comparison Methods

  • Transformer-based TabPFN: Original baseline model
  • Mamba-based TabPFN: Unidirectional SSM replacement approach
  • Hydra-based TabPFN: Proposed bidirectional SSM approach

Implementation Details

  • Training Hardware: NVIDIA A40 GPU (48GB)
  • Testing Hardware: NVIDIA H100 GPU (80GB)
  • Training Time: Transformer 48 hours, Mamba 52 hours, Hydra 134 hours
  • Key Hyperparameters:
    • Learning rate: 0.0001
    • SSM layers: 24 (2× Transformer layers)
    • Embedding dimension: 1024

Experimental Results

Main Results

1. Scalability Comparison

  • Transformer Limit: 2¹⁵ rows (constrained by 80GB GPU memory)
  • Hydra Limit: 2¹⁷ rows (constrained by PyTorch 32-bit indexing, not hardware)
  • Scale Improvement: 2¹⁷ / 2¹⁵ = 4× increase in the number of processable rows

2. Predictive Performance Comparison

  • Hydra vs Transformer: Average accuracy difference -1.1%, AUC difference -1.1%
  • Hydra vs Mamba: Hydra's average accuracy is 3.6% higher
  • Variance Analysis: Hydra exhibits lower performance variance than Mamba

3. Order Sensitivity Analysis

Measured by KL divergence:

  • KL divergence significantly decreases with increasing RCP iterations
  • Hydra exhibits lower order sensitivity than Mamba
  • RCP strategy effectively mitigates effects of anomalous permutations

Ablation Studies

Impact of RCP Iterations

  • Accuracy: Improves with increasing RCP iterations, but improvement magnitude is relatively modest
  • KL Divergence: Significantly decreases, indicating reduced order dependence
  • Computational Cost: Inference time grows linearly with the number of permutations, by a factor of r

Architecture Comparison

  • Unidirectional vs Bidirectional: Hydra's bidirectional mechanism significantly outperforms Mamba's unidirectional processing
  • Layer Configuration: Following the Mamba paper's recommendation, the SSM variants use twice as many layers as the Transformer

Experimental Findings

  1. Importance of Bidirectionality: Bidirectional processing is crucial for the unordered nature of tabular data
  2. Efficiency-Performance Balance: Achieved significant efficiency improvements while maintaining competitive performance
  3. RCP Effectiveness: Multiple permutation averaging strategy effectively reduces order sensitivity
  4. Hardware Limits: Hydra is not bound by the GPU memory limit that caps the Transformer on large inputs

Related Work

Tabular Foundation Models

  • TabPFN: Pioneering Transformer model for tabular data
  • TabFlex: Extension using linear attention
  • Mambular: Tabular deep learning model based on Mamba

State Space Models

  • Mamba: Selective state space model achieving linear complexity
  • Hydra: Bidirectional SSM extension supporting non-causal modeling
  • S4: Foundational work on structured state space sequence models

Efficiency Optimization Methods

  • FlashAttention: Reduces Transformer memory requirements through IO optimization
  • Linear Attention: Linear complexity attention mechanism alternatives

Conclusions and Discussion

Main Conclusions

  1. Hydra eases TabPFN's scalability problem, raising the processable row limit from 2¹⁵ to 2¹⁷
  2. Bidirectional SSM is more suitable for the unordered nature of tabular data than unidirectional SSM
  3. RCP strategy is an effective method for reducing SSM order sensitivity
  4. Achieved competitive performance with Transformers while maintaining linear complexity

Limitations

  1. Retraining Requirement: Due to architectural differences, the entire model requires retraining
  2. Context Limitation: Experiments remain limited to at most 1000 rows, so large-scale scenarios are not sufficiently explored
  3. RCP Overhead: Multiple permutations increase inference time by r×
  4. Order Optimization: Insufficient investigation of optimal permutation strategies

Future Directions

  1. Large-Scale Validation: Test SSM-based TabPFN on datasets with >10k rows
  2. Optimal Permutation: Research optimal row permutation strategies for SSMs
  3. Architecture Optimization: Explore more efficient bidirectional SSM architectures
  4. Theoretical Analysis: Deepen understanding of bidirectionality's theoretical foundations for tabular data modeling

In-Depth Evaluation

Strengths

  1. Clear Problem Definition: Accurately identifies the core bottlenecks of TabPFN and proposes targeted solutions
  2. Sound Technical Choices: Hydra's bidirectional processing matches the unordered nature of tabular data well
  3. Comprehensive Experimental Design: Evaluates performance, efficiency, and order sensitivity
  4. Convincing Results: Improves scalability while maintaining competitive performance
  5. Practical Method: The RCP strategy is simple, effective, and easy to implement and deploy

Weaknesses

  1. Limited Novelty: Primarily combines existing techniques rather than introducing a fundamentally new method
  2. Insufficient Theoretical Analysis: Lacks a deeper theoretical explanation of why bidirectionality reduces order sensitivity
  3. Limited Experimental Scale: Still constrained to relatively small datasets, so large-scale processing capability is not sufficiently demonstrated
  4. Incomplete Comparisons: Lacks a direct comparison with other linear-complexity methods (e.g., Linear Attention)
  5. Limited Hyperparameter Analysis: High training costs prevented thorough hyperparameter optimization

Impact

  1. Academic Contribution: Provides new insights and empirical evidence for efficiency optimization of tabular foundation models
  2. Practical Value: Addresses real-world scalability problems with high practical utility
  3. Influence on Future Work: Demonstrates the potential of SSMs in structured data modeling, which is likely to motivate further research
  4. Reproducibility: Code is publicly available with detailed experimental settings, ensuring good reproducibility

Applicable Scenarios

  1. Large-Scale Tabular Classification: Particularly suitable for tabular classification tasks requiring processing of large sample volumes
  2. Real-Time Inference Scenarios: Linear complexity makes it suitable for applications with strict inference speed requirements
  3. Resource-Constrained Environments: Requires less memory and computational resources compared to Transformers
  4. Few-Shot Learning: Maintains TabPFN's advantages in few-shot scenarios

References

Main references include:

  1. Hollmann et al. (2023) - Original TabPFN paper
  2. Gu & Dao (2023) - Mamba architecture
  3. Hwang et al. (2024) - Hydra bidirectional SSM
  4. Dao et al. (2022) - FlashAttention optimization techniques
  5. Zeng et al. (2024) - TabFlex linear attention method

This paper makes a valuable contribution to addressing scalability issues in tabular foundation models. By combining a bidirectional SSM with repeated context permutation, it balances efficiency against predictive performance. While its theoretical novelty is limited, its practical value and its relevance for future research merit recognition.