TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification

Dissanayake, Dutta

Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.

Basic Information

Paper ID: 2511.05704
Title: TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
Authors: Pasan Dissanayake, Sanghamitra Dutta (University of Maryland, College Park)
Categories: cs.LG cs.AI cs.CL
Publication Date: November 7, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.05704

Abstract

Transformer-based models have demonstrated promising performance on tabular data compared to classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They leverage pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, commonly referred to as the few-shot regime. However, the performance gains in the few-shot regime come at the expense of significantly increased complexity and parameter count. To circumvent this trade-off, we introduce TabDistill, a novel strategy to distill pre-trained knowledge from complex transformer-based models into simpler neural networks for effective tabular data classification. Our framework achieves the best of both worlds: parameter efficiency while maintaining strong performance with limited training data. The distilled neural networks surpass classical baselines such as standard neural networks, XGBoost, and logistic regression under equal training data conditions, and in some cases, even exceed the performance of the original transformer-based models from which they were distilled.

Research Background and Motivation

Problem Definition

This research addresses a core tension in tabular data classification: Transformer-based models perform well in few-shot scenarios, but their large parameter counts and high computational cost make them difficult to deploy in practice.

Problem Significance

Practical Application Needs: In high-risk domains such as finance, healthcare, and manufacturing, scarce labeled data is a common challenge, exemplified by rare disease diagnosis and prediction of once-in-a-century natural phenomena
Data Annotation Costs: Financial applications involve expensive data annotation with inherent subjectivity, annotation errors, and lack of consensus
Deployment Constraints: Real-world applications require parameter-efficient and scalable models to accommodate varying infrastructure capabilities

Limitations of Existing Approaches

Traditional Methods: XGBoost, CatBoost, and LightGBM excel with abundant data but show significant performance degradation in few-shot scenarios
Transformer Methods: TabPFN and TabLLM demonstrate excellent few-shot performance but require millions to billions of parameters, resulting in prohibitive inference costs
Efficiency-Performance Trade-off: Lack of solutions that maintain few-shot performance while achieving parameter efficiency

Research Motivation

The authors pose a central question: "Can we achieve both objectives simultaneously—maintaining parameter efficiency while performing well with limited training data?"

Core Contributions

Proposes TabDistill Framework: A novel strategy for distilling Transformer model knowledge into neural networks, achieving parameter-efficient tabular data classification
Dual Model Instantiation: Framework implementation based on TabPFN (~11M parameters) and BigScience T0pp (~11B parameters), distilled into MLPs with approximately 1,000 parameters
Experimental Validation: Verification on 5 tabular datasets demonstrating that distilled MLPs surpass classical baselines and, in certain cases, even exceed the original Transformer models
Innovative Training Strategy: Introduction of permutation-based training techniques to prevent overfitting on extremely small training sets

Methodology Details

Task Definition

Given a small tabular dataset $D_N = \{(x_n, y_n) : x_n \in X,\ y_n \in \{0,1\},\ n = 1, \dots, N\}$, where $N \sim 10$, the objective is to generate a simple MLP $h_\theta(x): X \to \{0,1\}$ by leveraging knowledge from a pre-trained Transformer model $f$.

Model Architecture

Overall Framework

TabDistill comprises two stages:

Stage 1: Fine-tuning the base Transformer model to generate a quality MLP
Stage 2: Optional additional MLP fine-tuning

Core Components

Base Model Decomposition:
- Encoder: $f_E(s): S \to Z$
- Decoder: $f_D(z): Z \to \{0,1\}$
MLP Architecture:
```
h_θ(x) = ReLU(W_R ReLU(···ReLU(W_2 ReLU(W_1 x + b_1) + b_2)···) + b_R)
```
where R denotes the number of layers and L denotes hidden layer width
Linear Mapping:
```
m_η(z) = LayerNorm(Az + b)
```
where $A \in \mathbb{R}^{\dim(\Theta) \times \dim(Z)}$ and $\eta = (A, b)$
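
To make the hypernetwork step concrete, here is a minimal NumPy sketch of how an embedding $z$ could be mapped to a flat parameter vector by $m_\eta$ and then unpacked into MLP weights. The embedding size, input dimension, two-value output, random initialization, and the helper names (`layer_norm`, `layer_shapes`, `unpack`, `mlp_forward`) are illustrative assumptions rather than the authors' implementation. The layer norm here also omits any learned scale or shift.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def layer_shapes(in_dim, width, n_layers, out_dim):
    """Weight shapes for an MLP with n_layers weight matrices (assumed convention)."""
    dims = [in_dim] + [width] * (n_layers - 1) + [out_dim]
    return [(dims[i + 1], dims[i]) for i in range(n_layers)]

def unpack(theta, shapes):
    """Split a flat parameter vector into a list of (W_r, b_r) pairs."""
    params, pos = [], 0
    for out_d, in_d in shapes:
        W = theta[pos:pos + out_d * in_d].reshape(out_d, in_d)
        pos += out_d * in_d
        b = theta[pos:pos + out_d]
        pos += out_d
        params.append((W, b))
    return params

def mlp_forward(x, params):
    """Apply ReLU(W_r h + b_r) at every layer, following the formula above."""
    h = x
    for W, b in params:
        h = np.maximum(W @ h + b, 0.0)
    return h

rng = np.random.default_rng(0)
shapes = layer_shapes(in_dim=4, width=10, n_layers=4, out_dim=2)
dim_theta = sum(o * i + o for o, i in shapes)  # total number of MLP parameters
dim_z = 16                                     # hypothetical embedding size

A = rng.normal(scale=0.1, size=(dim_theta, dim_z))  # eta = (A, b)
b = np.zeros(dim_theta)
z = rng.normal(size=dim_z)                     # stands in for f_E(g(D_N))

theta = layer_norm(A @ z + b)                  # m_eta(z) = LayerNorm(Az + b)
outputs = mlp_forward(rng.normal(size=4), unpack(theta, shapes))
```

In this sketch only `A` and `b` would be trainable. The MLP itself has no free parameters of its own during Stage 1, because its weights are always read off `theta`.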

Training Procedure

Stage 1 Loss Function:

$L(\eta; D_N) = -\sum_{n=1}^{N} \left[ y_n \log \sigma(h_\theta(x_n))_1 + (1 - y_n) \log \sigma(h_\theta(x_n))_0 \right]$

where $θ = m_η(f_E(g(D_N)))$
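
As a rough illustration, the Stage 1 objective can be evaluated as a summed binary cross-entropy. The sketch below assumes that $\sigma$ is a softmax over two MLP outputs and writes the loss as a negative log-likelihood so that lower is better. The function names `softmax` and `stage1_loss` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def stage1_loss(logits, y):
    """Summed cross-entropy over N examples.

    logits: array of shape (N, 2), the MLP outputs h_theta(x_n)
    y:      integer array of shape (N,) with labels in {0, 1}
    """
    probs = softmax(logits)
    # Pick the probability assigned to the true class of each example.
    picked = probs[np.arange(len(y)), y]
    return -np.log(picked).sum()

# Uninformative outputs give log(2) per example.
loss_uniform = stage1_loss(np.zeros((2, 2)), np.array([0, 1]))
```

In the full method the gradient of this loss would flow back through the generated weights into the mapping parameters $\eta$. That step needs an autodiff framework and is left out here.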

Technical Innovations

Hypernetwork Paradigm: Leveraging insights from computer vision, the Transformer serves as a hypernetwork generating neural network weights
Permutation Augmentation: Random feature reordering at each training epoch prevents overfitting
Parameter-Efficient Fine-tuning: Only linear mapping parameters $η$ are fine-tuned while base model parameters remain frozen
Two-Stage Design: Sequential distillation followed by fine-tuning fully exploits pre-trained knowledge
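
The permutation augmentation can be sketched as a fresh column shuffle at each epoch. This is a toy illustration under assumptions: the helper name `permute_features` and the data are made up, and the source does not spell out exactly how the permuted data is fed to the base model and to the generated MLP.

```python
import numpy as np

def permute_features(X, rng):
    """Shuffle the columns of X; return the shuffled copy and the permutation."""
    perm = rng.permutation(X.shape[1])
    return X[:, perm], perm

rng = np.random.default_rng(0)
X = np.arange(12, dtype=float).reshape(3, 4)  # 3 rows, 4 features (toy data)

history = []
for epoch in range(3):
    X_perm, perm = permute_features(X, rng)
    # In a real training loop, X_perm would be passed to the base model and
    # the generated MLP would be scored on inputs in the same column order
    # (an assumption about the pipeline, not a detail taken from the paper).
    history.append(perm)
```

Each shuffle keeps every row's values and only changes their order, so the labels stay valid while the base model sees a different feature ordering each epoch.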

Specific Instantiations

TabDistill + TabPFN

Direct use of tabular data with $g(x) = x$ (identity transformation)
Encoder output dimension: $192N$
Mapping matrix dimension: $dim(Θ) × 192N$

TabDistill + T0pp

Text serialization: "The <column name> is <value>"
Encoder output dimension: 4096
Mapping matrix dimension: $dim(Θ) × 4096$
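
A text serialization of this kind might look like the sketch below. The template string comes from the description above, while the trailing period, the space-joined sentences, the column names, and the function name `serialize_row` are assumptions made for illustration.

```python
def serialize_row(row):
    """Render a {column name: value} mapping as 'The <column name> is <value>.' sentences."""
    return " ".join(f"The {name} is {value}." for name, value in row.items())

# Hypothetical row; the column names are not taken from the paper.
example = {"age": 42, "job": "technician", "balance": 1500}
text = serialize_row(example)
# text == "The age is 42. The job is technician. The balance is 1500."
```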

Experimental Setup

Datasets

Five public tabular datasets are employed:

Bank (UCI Bank Marketing): Predicting customer subscription to term deposits
Blood (UCI Blood Transfusion): Predicting blood donation likelihood
Calhousing (California Housing): Predicting housing value categories
Heart (UCI Heart Disease): Predicting heart disease presence
Income (Census Income): Predicting annual income exceeding 50K

Evaluation Metrics

ROC-AUC is the primary evaluation metric for classification performance in the few-shot setting.
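
For reference, ROC-AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. It is not the evaluation code used in the paper, and the function name `roc_auc` is hypothetical.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC from pairwise comparisons of positive and negative scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
```

The pairwise form costs O(P·N) comparisons for P positives and N negatives, which is fine for small test sets. A rank-based implementation scales better.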

Comparison Methods

Classical Baselines: Logistic regression, XGBoost, independently trained MLP
Base Models: TabPFN, T0pp (TabLLM)
Distilled Models: TabDistill + TabPFN, TabDistill + T0pp

Implementation Details

MLP Architecture: 4 layers with 10 neurons per layer (~1,000 parameters)
Training Configuration: Stage 1 fine-tuning for 300 epochs, Stage 2 for additional 100 epochs
Hyperparameter Optimization: Grid search via Weights & Biases
Sample Scales: N ∈ {4, 8, 16, 32, 64}
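
One way to draw the few-shot training sets for each sample scale is sketched below. The source does not describe the sampling protocol, so the class-balanced draw, the placeholder labels, and the function name `few_shot_subset` are all assumptions.

```python
import numpy as np

def few_shot_subset(y, n, rng):
    """Pick n row indices, half from each class (assumes binary labels, even n)."""
    y = np.asarray(y)
    idx0 = rng.choice(np.flatnonzero(y == 0), size=n // 2, replace=False)
    idx1 = rng.choice(np.flatnonzero(y == 1), size=n - n // 2, replace=False)
    return np.concatenate([idx0, idx1])

rng = np.random.default_rng(0)
y_all = rng.integers(0, 2, size=1000)  # placeholder labels, not real data
subsets = {n: few_shot_subset(y_all, n, rng) for n in (4, 8, 16, 32, 64)}
```

Balancing the classes guarantees that even the N=4 subset contains both labels. An unbalanced random draw is an equally plausible reading of the setup.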

Experimental Results

Primary Results

Based on ROC-AUC results from Table 1:

Extreme Few-Shot Scenario (N=4)

TabDistill + TabPFN achieves 0.72 on the Bank dataset, significantly surpassing all classical baselines
TabDistill + T0pp demonstrates excellent performance across multiple datasets, such as Calhousing (0.67) and Income (0.70)

Performance Trends

Performance Improvement with Sample Increase: All methods generally improve as N increases
Baseline Method Variation: No single classical method universally dominates across all datasets
Model Selection Differences: TabDistill + TabPFN generally outperforms TabDistill + T0pp, with the reverse observed on the Income dataset

Comparison with Base Models

Table 3 reveals surprising results:

In certain cases, distilled MLPs exceed the performance of original Transformer models
For example, on Bank dataset with N=4: TabDistill + TabPFN (0.72) > TabPFN (0.62)
This suggests that distillation can do more than compress the model: in some settings it may also improve performance

Ablation Studies

Model Complexity Impact (Table 2)

Tests the influence of the layer count $R$ on performance
Performance degrades beyond a certain complexity threshold
The 4-layer architecture achieves the best performance in most cases

Feature Attribution Analysis (Figure 3)

Using SHAP for feature importance analysis:

Distilled models maintain consistency with classical baselines in feature importance
Models correctly identify important features even after feature permutation
Demonstrates that base models correctly learn associations between MLP weights and feature ordering

Experimental Findings

Significant Distillation Effects: Distilled models substantially outperform classical methods in extreme few-shot scenarios
Parameter Efficiency: Compression from millions/billions to thousands of parameters yields enormous efficiency gains
Effective Knowledge Transfer: Pre-trained knowledge successfully transfers to simple MLPs
Strong Robustness: Permutation augmentation strategy effectively prevents overfitting

Related Work

Classical Tabular Data Algorithms

Traditional Advantages: XGBoost, LightGBM, and CatBoost have long dominated the tabular data domain
Few-Shot Limitations: From-scratch classical models show significant performance degradation in few-shot scenarios

Transformer Applications to Tabular Data

SAINT: Employs attention mechanisms to model row-column interactions with self-supervised pre-training
TabPFN: Pre-trained on extensive synthetic tabular data, enabling direct prediction on new tasks without additional training
TabLLM Series: Serializes tabular data as text, leveraging LLMs for classification

Meta-Learning and Hypernetworks

Meta-Learning Connection: Transformers excel at in-context learning, analogous to meta-learning paradigms
Hypernetwork Applications: Computer vision has established precedent for using Transformers to generate neural network weights
Novel Contribution: First application of this paradigm to the tabular data domain

Knowledge Distillation

Traditional Distillation: Aligns student model outputs with teacher model outputs through loss functions
Distinction from Prior Work: Directly extracts neural networks from Transformers without loss alignment

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: TabDistill successfully achieves balance between parameter efficiency and few-shot performance
Performance Advantages: Distilled MLPs surpass classical baselines in most cases, and in certain scenarios even exceed original Transformer models
Practical Value: Provides a practically deployable solution accommodating diverse infrastructure requirements

Limitations

The authors acknowledge the following limitations:

Large-Sample Performance: Performance gains are limited as the number of training samples grows
Simple Mapping Function: Current linear mapping may constrain performance ceiling
Bias Inheritance: Distilled models may inherit biases from base models
Application Scope: Currently validated only on binary classification tasks

Future Directions

Mapping Function Improvements: Exploring more complex mapping functions to enhance performance
Application Extension: Extending to natural language inference, instruction tuning, and other few-shot tasks
Bias Mitigation: Reducing base model biases through Stage 2 MLP fine-tuning
Multi-Task Learning: Investigating simultaneous handling of multiple tabular tasks

In-Depth Evaluation

Strengths

Strong Problem Targeting: Accurately identifies and addresses core contradictions in practical applications
Methodological Innovation: First application of hypernetwork paradigm to tabular data distillation
Comprehensive Experimental Design:
- Multi-dataset validation
- Sufficient baseline comparisons
- Detailed ablation studies
- Feature attribution analysis
Convincing Results: Not only achieves intended objectives but discovers the interesting phenomenon of distilled models exceeding original models
High Practical Value: Provides directly applicable solutions

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for why distilled models can exceed original models
Limited Dataset Scale: Validation on only 5 relatively small-scale datasets
Single Task Type: Considers only binary classification, excluding regression or multi-class tasks
Limited Base Model Coverage: Tests only two base models with limited scope
Incomplete Computational Cost Analysis: Lacks detailed comparison of actual training and inference computational costs

Impact

Academic Contributions:
- Pioneering new direction in tabular data Transformer distillation
- Providing novel perspectives for few-shot learning
- Bridging hypernetworks and knowledge distillation research domains
Practical Value:
- Addresses important deployment challenges
- Provides feasible solutions for resource-constrained environments
- Directly applicable to industrial scenarios
Reproducibility:
- Provides detailed implementation details
- Open-source commitment enhances reproducibility
- Clear and repeatable experimental setup

Applicable Scenarios

Resource-Constrained Environments: Mobile devices, edge computing scenarios
Few-Shot Applications: Medical diagnosis, financial risk control, quality inspection with scarce data
Real-Time Inference Requirements: Online services requiring rapid response
Model Interpretability Requirements: Simple MLPs offer greater interpretability compared to complex Transformers

References

The paper cites extensive related work, primarily including:

Classical tabular data methods: XGBoost, LightGBM, CatBoost, etc.
Transformer tabular applications: TabPFN, SAINT, TabLLM series
Knowledge distillation: Classical works by Hinton et al.
Hypernetworks: Related applications in computer vision
Meta-learning: Transformer in-context learning research

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to practical problems with comprehensive experimental validation, demonstrating significant academic and practical value. While certain limitations exist, it makes important contributions to the advancement of related fields.