2025-11-12T14:19:10.228100

State-Space Models for Tabular Prior-Data Fitted Networks

Koch, Wever, Raisch et al.

Recent advancements in foundation models for tabular data, such as TabPFN, have demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSMs' inherent sensitivity to the order of input tokens, an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order dependence, achieving predictive performance competitive with the original TabPFN model.

Basic Information

Paper ID: 2510.14573
Title: State-Space Models for Tabular Prior-Data Fitted Networks
Authors: Felix Koch, Marcel Wever, Fabian Raisch, Benjamin Tischler
Classification: cs.LG
Publication Time/Venue: Proceedings of the 1st ICML Workshop on Foundation Models for Structured Data, Vancouver, Canada. 2025
Paper Link: https://arxiv.org/abs/2510.14573

Research Background and Motivation

Problem to be Addressed: This research targets the computational efficiency problem of Transformer architectures in foundation models for tabular data, particularly the O(n²) complexity that limits scalability on large-scale datasets.
Problem Significance: TabPFN as a foundation model for tabular data demonstrates excellent performance, enabling Bayesian inference approximation within milliseconds. However, its Transformer-based architecture faces memory and computational bottlenecks when processing large-scale data.
Limitations of Existing Methods:
- Transformer self-attention mechanisms have quadratic complexity
- Direct replacement of Transformer with Mamba introduces sensitivity to input sequence order
- Row order in tabular data is semantically meaningless, conflicting with SSM's causal design
Research Motivation: To explore structured state space models (SSMs) as alternatives to Transformers, maintaining the efficiency advantages of linear complexity while reducing input order dependence through bidirectional processing mechanisms.

Core Contributions

Proposed Hydra-based TabPFN Architecture: Integrated bidirectional structured state space model Hydra into TabPFN, achieving linear-time complexity for tabular data processing.
Introduced Repeated Context Permutation (RCP) Technique: Further reduced SSM's sensitivity to sequence order by performing multiple random permutations of inputs and averaging prediction results.
Achieved Significant Scalability Improvements: Compared to the original TabPFN, the new method handles datasets with 4× as many rows (from 2¹⁵ to 2¹⁷ rows, i.e. two powers of two).
Maintained Competitive Predictive Performance: In OpenML CC-18 benchmark testing, Hydra-based TabPFN's accuracy is only 1.1% lower than the original model.

Methodology Details

Task Definition

This paper studies tabular classification tasks where:

Input: Complete tabular dataset containing training and test samples
Output: Class probability predictions for test samples
Constraint: Inference must be completed in a single forward pass without gradient updates or fine-tuning

Model Architecture

1. Hydra Architecture Replacement

Core Design: Replace Transformer encoder with stacked Hydra layers
Bidirectional Processing: Utilize quasi-separable matrix mixers for bidirectional state space modeling
Layer Structure: Each Hydra layer contains bidirectional state space mixing followed by feedforward transformation
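
As a rough illustration of why bidirectional mixing helps, the sketch below contrasts a causal scan with a bidirectional one on a toy scalar linear recurrence. It is not Hydra's actual quasi-separable mixer: the functions `causal_scan` and `bidirectional_scan` and the parameters `a` and `b` are hypothetical simplifications chosen for this example.

```python
from typing import List


def causal_scan(x: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Toy causal recurrence h_t = a * h_{t-1} + b * x_t.

    The output at position t depends only on x_1..x_t, so rows that come
    later in the sequence cannot influence earlier positions.
    """
    h = 0.0
    out = []
    for value in x:
        h = a * h + b * value
        out.append(h)
    return out


def bidirectional_scan(x: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Toy bidirectional mixer: forward scan plus backward scan.

    Both scans include the current token, so b * x_t is subtracted once.
    The resulting mixing matrix has entries b * a**|i - j| (assumed toy form).
    """
    forward = causal_scan(x, a, b)
    backward = causal_scan(x[::-1], a, b)[::-1]
    return [f + g - b * v for f, g, v in zip(forward, backward, x)]


x = [1.0, 2.0, 3.0, 4.0]
x_changed = [1.0, 2.0, 3.0, 40.0]  # only the last row differs

print(causal_scan(x)[0], causal_scan(x_changed)[0])
print(bidirectional_scan(x)[0], bidirectional_scan(x_changed)[0])
```

In this toy setting the first output of the causal scan ignores every later row, while each bidirectional output mixes information from both sides. The combined operator is still not invariant under arbitrary permutations, which is consistent with the claim that bidirectionality reduces rather than removes order dependence.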

2. Embedding Strategy Preservation

Retain original TabPFN's data embedding methodology
Each input represented as concatenation of feature values and class labels
Handle unlabeled data during inference through marginalization over all possible label assignments

3. Repeated Context Permutation (RCP)

Algorithm flow:

Input: Number of permutations r, context D, test sample x_test
Output: Prediction for x_test, averaged over r permutations
Initialize empty list: outputs ← []
for i = 1 to r do
    Shuffle rows of D: D_p ← shuffle(D)
    Append x_test to D_p: D_in ← D_p ∪ {x_test}
    Predict: append PFN.predict(D_in) to outputs
end for
Return average of outputs
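
The algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `rcp_predict` is a hypothetical name, and `toy_order_sensitive_predictor` is a made-up stand-in for the PFN that deliberately weights later context rows more heavily so that the effect of averaging is visible.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature values, class label)
Predictor = Callable[[List[Row], Sequence[float]], List[float]]


def rcp_predict(predict_proba: Predictor, context: List[Row],
                x_test: Sequence[float], r: int, seed: int = 0) -> List[float]:
    """Average class probabilities over r random permutations of the context."""
    if r < 1:
        raise ValueError("r must be at least 1")
    rng = random.Random(seed)
    totals: List[float] = []
    for _ in range(r):
        permuted = list(context)   # copy so the caller's context is untouched
        rng.shuffle(permuted)      # shuffle rows of D
        probs = predict_proba(permuted, x_test)
        if not totals:
            totals = [0.0] * len(probs)
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / r for t in totals]


def toy_order_sensitive_predictor(context: List[Row], x_test: Sequence[float],
                                  n_classes: int = 2) -> List[float]:
    """Hypothetical stand-in for the PFN: recency-weighted label frequencies."""
    weights = [0.0] * n_classes
    for position, (_, label) in enumerate(context, start=1):
        weights[label] += position  # later rows count more (order-sensitive)
    total = sum(weights)
    return [w / total for w in weights]


context = [([0.0], 0), ([0.1], 0), ([0.9], 1), ([1.0], 1)]
single = toy_order_sensitive_predictor(context, [0.5])
averaged = rcp_predict(toy_order_sensitive_predictor, context, [0.5], r=200)
print(single, averaged)
```

With balanced labels, the single prediction is skewed by row order, whereas the averaged prediction moves toward the order-independent value. The cost is r forward passes instead of one.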

Technical Innovations

Bidirectionality Addresses Order Sensitivity: Compared to unidirectional Mamba, Hydra's bidirectional processing enables symmetric context aggregation, reducing input order dependence.
Linear Complexity: Achieves O(n) complexity through quasi-separable matrix multiplication, providing significant advantages over Transformer's O(n²).
RCP Strategy: Reduces order sensitivity by averaging predictions over multiple random row permutations, a design tailored to the unordered nature of tabular data.

Experimental Setup

Datasets

Primary Dataset: OpenML CC-18 benchmark suite
Filtering Criteria: ≤2000 rows, ≤100 features, ≤10 classes
Final Datasets: 30 multiclass classification datasets
Data Splitting: Each dataset randomly split into train/test sets 16 times
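
The selection and splitting protocol can be sketched as below. The thresholds and the 16 repetitions come from the setup above; everything else is an assumption for illustration, including the `DatasetInfo` record, the example dataset names, and the 50/50 train/test ratio, which the summary does not specify.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetInfo:
    name: str
    n_rows: int
    n_features: int
    n_classes: int


def passes_filter(info: DatasetInfo, max_rows: int = 2000,
                  max_features: int = 100, max_classes: int = 10) -> bool:
    """Apply the size limits used to select benchmark datasets."""
    return (info.n_rows <= max_rows
            and info.n_features <= max_features
            and info.n_classes <= max_classes)


def repeated_splits(n_rows: int, n_repeats: int = 16,
                    train_fraction: float = 0.5,  # assumed ratio, not from the source
                    seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return n_repeats random (train_indices, test_indices) partitions."""
    rng = random.Random(seed)
    n_train = int(n_rows * train_fraction)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


# Hypothetical metadata records, not actual OpenML CC-18 entries.
candidates = [
    DatasetInfo("small_example", 1500, 20, 3),
    DatasetInfo("too_many_rows", 50000, 20, 3),
    DatasetInfo("too_many_features", 1000, 500, 2),
]
selected = [d.name for d in candidates if passes_filter(d)]
splits = repeated_splits(n_rows=1500)
print(selected, len(splits))
```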

Evaluation Metrics

Accuracy: Classification correctness rate
AUC OvO: One-vs-One multiclass AUC
KL Divergence: Measures prediction distribution differences across different input permutations, assessing order sensitivity
Inference Time: Computational time across different input scales
Memory Usage: Maximum dataset size that can be processed
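
One plausible way to turn the KL-divergence metric into a single order-sensitivity score is to average the divergence over all ordered pairs of predictions obtained from different permutations of the same context. The sketch below follows that assumption; the paper's exact aggregation is not given in this summary, and the function names and example distributions are hypothetical.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def mean_pairwise_kl(predictions: List[Sequence[float]]) -> float:
    """Average KL divergence over all ordered pairs of distinct predictions."""
    n = len(predictions)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += kl_divergence(predictions[i], predictions[j])
    return total / (n * (n - 1))


# Made-up class-probability outputs for three permutations of one context.
stable = [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]]      # order-insensitive model
unstable = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]    # order-sensitive model
print(mean_pairwise_kl(stable), mean_pairwise_kl(unstable))
```

A score of zero means every permutation produced the same predictive distribution; larger values indicate stronger dependence on row order.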

Comparison Methods

Transformer-based TabPFN: Original baseline model
Mamba-based TabPFN: Unidirectional SSM replacement approach
Hydra-based TabPFN: Proposed bidirectional SSM approach

Implementation Details

Training Hardware: NVIDIA A40 GPU (48GB)
Testing Hardware: NVIDIA H100 80GB
Training Time: Transformer 48 hours, Mamba 52 hours, Hydra 134 hours
Key Hyperparameters:
- Learning rate: 0.0001
- SSM layers: 24 (2× Transformer layers)
- Embedding dimension: 1024

Experimental Results

Main Results

1. Scalability Comparison

Transformer Limit: 2¹⁵ rows (constrained by 80GB GPU memory)
Hydra Limit: 2¹⁷ rows (constrained by PyTorch 32-bit indexing, not hardware)
Scale Improvement: 4× increase in processable rows (2¹⁷ / 2¹⁵ = 4)
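
To give a rough sense of why quadratic attention runs into memory limits at these sizes, the back-of-the-envelope sketch below compares the size of one fully materialized n × n attention score matrix with one n × d activation tensor. The assumptions are loud ones: float32 values, a single head and layer, naive materialization of the full score matrix, and d = 1024 taken from the embedding dimension listed above. Real implementations differ, so these are not measurements of the paper's models.

```python
GIB = 2 ** 30


def attention_scores_bytes(n_rows: int, bytes_per_value: int = 4) -> int:
    """Bytes for one naive n x n attention score matrix (grows quadratically)."""
    return n_rows * n_rows * bytes_per_value


def activation_bytes(n_rows: int, embedding_dim: int = 1024,
                     bytes_per_value: int = 4) -> int:
    """Bytes for one n x d activation tensor (grows linearly in n)."""
    return n_rows * embedding_dim * bytes_per_value


for exponent in (15, 17):
    n = 2 ** exponent
    scores_gib = attention_scores_bytes(n) / GIB
    acts_gib = activation_bytes(n) / GIB
    print(f"n = 2^{exponent}: scores {scores_gib:.1f} GiB, activations {acts_gib:.3f} GiB")
```

Under these assumptions, quadrupling the number of rows multiplies the score-matrix size by 16 but the activation size only by 4.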

2. Predictive Performance Comparison

Hydra vs Transformer: Average accuracy difference -1.1%, AUC difference -1.1%
Hydra vs Mamba: Hydra accuracy average 3.6% higher
Variance Analysis: Hydra exhibits lower performance variance than Mamba

3. Order Sensitivity Analysis

Measured by KL divergence:

KL divergence significantly decreases with increasing RCP iterations
Hydra exhibits lower order sensitivity than Mamba
RCP strategy effectively mitigates effects of anomalous permutations

Ablation Studies

Impact of RCP Iterations

Accuracy: Improves with increasing RCP iterations, but improvement magnitude is relatively modest
KL Divergence: Significantly decreases, indicating reduced order dependence
Computational Cost: Inference time grows linearly with r, since each permutation requires a separate forward pass

Architecture Comparison

Unidirectional vs Bidirectional: Hydra's bidirectional mechanism significantly outperforms Mamba's unidirectional processing
Layer Configuration: Following the Mamba paper's recommendation, the SSM variants use twice as many layers as the Transformer

Experimental Findings

Importance of Bidirectionality: Bidirectional processing is crucial for the unordered nature of tabular data
Efficiency-Performance Balance: Achieved significant efficiency improvements while maintaining competitive performance
RCP Effectiveness: Multiple permutation averaging strategy effectively reduces order sensitivity
Hardware Limitation Breakthrough: Successfully overcame Transformer's memory limitations on large-scale data

Related Work

Tabular Foundation Models

TabPFN: Pioneering Transformer model for tabular data
TabFlex: Extension using linear attention
Mambular: Tabular deep learning model based on Mamba

State Space Models

Mamba: Selective state space model achieving linear complexity
Hydra: Bidirectional SSM extension supporting non-causal modeling
S4: Foundational work on structured state space sequence models

Efficiency Optimization Methods

FlashAttention: Reduces Transformer memory requirements through IO optimization
Linear Attention: Linear complexity attention mechanism alternatives

Conclusions and Discussion

Main Conclusions

Hydra addresses TabPFN's scalability problem, raising the maximum processable context from 2¹⁵ to 2¹⁷ rows (a 4× increase)
Bidirectional SSMs are better suited to the unordered nature of tabular data than unidirectional SSMs
The RCP strategy is an effective method for reducing SSM order sensitivity
Hydra-based TabPFN achieves performance competitive with the Transformer baseline while maintaining linear complexity

Limitations

Retraining Requirement: Due to architectural differences, the entire model requires retraining
Context Limitation: Experiments remain limited to at most 1000 rows, so large-scale scenarios are insufficiently explored
RCP Overhead: Multiple permutations increase inference time by a factor of r
Order Optimization: Insufficient investigation of optimal permutation strategies

Future Directions

Large-Scale Validation: Test SSM-based TabPFN on datasets with >10k rows
Optimal Permutation: Research optimal row permutation strategies for SSMs
Architecture Optimization: Explore more efficient bidirectional SSM architectures
Theoretical Analysis: Deepen understanding of bidirectionality's theoretical foundations for tabular data modeling

In-Depth Evaluation

Strengths

Clear Problem Definition: Accurately identifies core bottlenecks of TabPFN and proposes targeted solutions
Sound Technical Choices: Hydra's bidirectional characteristics match the unordered nature of tabular data
Comprehensive Experimental Design: Includes multidimensional evaluation of performance, efficiency, and order sensitivity
Convincing Results: Achieves significant scalability improvements while maintaining performance
High Practicality: The RCP strategy is simple, effective, and easy to implement and deploy

Weaknesses

Limited Novelty: Primarily combines existing techniques, lacking fundamental innovation
Insufficient Theoretical Analysis: Lacks a deep theoretical explanation of why bidirectionality reduces order sensitivity
Limited Experimental Scale: Still constrained to relatively small datasets, with insufficient demonstration of large-scale processing capability
Incomplete Comparisons: Lacks direct comparison with other linear-complexity methods (e.g., Linear Attention)
Limited Hyperparameter Analysis: Due to high training costs, little hyperparameter optimization was performed

Impact

Academic Contribution: Provides new insights and empirical evidence for efficiency optimization of tabular foundation models
Practical Value: Addresses real-world scalability problems with high practical utility
Influence on Future Work: Demonstrates the potential of SSMs in structured data modeling and is likely to inspire further research
Reproducibility: Code is publicly available with detailed experimental settings, ensuring good reproducibility

Applicable Scenarios

Large-Scale Tabular Classification: Particularly suitable for tabular classification tasks requiring processing of large sample volumes
Real-Time Inference Scenarios: Linear complexity makes it suitable for applications with strict inference speed requirements
Resource-Constrained Environments: Requires less memory and computational resources compared to Transformers
Few-Shot Learning: Maintains TabPFN's advantages in few-shot scenarios

References

Main references include:

Hollmann et al. (2023) - Original TabPFN paper
Gu & Dao (2023) - Mamba architecture
Hwang et al. (2024) - Hydra bidirectional SSM
Dao et al. (2022) - FlashAttention optimization techniques
Zeng et al. (2024) - TabFlex linear attention method

This paper makes valuable contributions to addressing scalability issues in tabular foundation models. By combining bidirectional SSMs with a repeated permutation strategy, it balances efficiency and predictive performance. While somewhat limited in theoretical innovation, its practical value and its potential to inspire future research merit recognition.