
TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification

Dissanayake, Dutta
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.

Basic Information

  • Paper ID: 2511.05704
  • Title: TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
  • Authors: Pasan Dissanayake, Sanghamitra Dutta (University of Maryland, College Park)
  • Categories: cs.LG cs.AI cs.CL
  • Publication Date: November 7, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2511.05704

Abstract

Transformer-based models have demonstrated promising performance on tabular data compared to classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They leverage pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, commonly referred to as the few-shot regime. However, the performance gains in the few-shot regime come at the expense of significantly increased complexity and parameter count. To circumvent this trade-off, we introduce TabDistill, a novel strategy to distill pre-trained knowledge from complex transformer-based models into simpler neural networks for effective tabular data classification. Our framework achieves the best of both worlds: parameter efficiency while maintaining strong performance with limited training data. The distilled neural networks surpass classical baselines such as standard neural networks, XGBoost, and logistic regression under equal training data conditions, and in some cases, even exceed the performance of the original transformer-based models from which they were distilled.

Research Background and Motivation

Problem Definition

This research addresses a fundamental tension in tabular data classification: Transformer-based models perform well in few-shot scenarios, but their large parameter counts and high computational cost make them difficult to deploy in practice.

Problem Significance

  1. Practical Application Needs: In high-risk domains such as finance, healthcare, and manufacturing, scarce labeled data is a common challenge, exemplified by rare disease diagnosis and prediction of once-in-a-century natural phenomena
  2. Data Annotation Costs: Financial applications involve expensive data annotation with inherent subjectivity, annotation errors, and lack of consensus
  3. Deployment Constraints: Real-world applications require parameter-efficient and scalable models to accommodate varying infrastructure capabilities

Limitations of Existing Approaches

  1. Traditional Methods: XGBoost, CatBoost, and LightGBM excel with abundant data but show significant performance degradation in few-shot scenarios
  2. Transformer Methods: TabPFN and TabLLM demonstrate excellent few-shot performance but require millions to billions of parameters, resulting in prohibitive inference costs
  3. Efficiency-Performance Trade-off: Lack of solutions that maintain few-shot performance while achieving parameter efficiency

Research Motivation

The authors pose a central question: can both objectives be achieved at once, that is, parameter efficiency together with strong performance under limited training data?

Core Contributions

  1. Proposes TabDistill Framework: A novel strategy for distilling Transformer model knowledge into neural networks, achieving parameter-efficient tabular data classification
  2. Dual Model Instantiation: Framework implementation based on TabPFN (~11M parameters) and BigScience T0pp (~11B parameters), distilled into MLPs with approximately 1,000 parameters
  3. Experimental Validation: Verification on 5 tabular datasets demonstrating that distilled MLPs surpass classical baselines and, in certain cases, even exceed the original Transformer models
  4. Innovative Training Strategy: Introduction of permutation-based training techniques to prevent overfitting on extremely small training sets

Methodology Details

Task Definition

Given a small-scale tabular dataset $D_N = \{(x_n, y_n) : x_n \in X,\ y_n \in \{0,1\},\ n = 1, \dots, N\}$, where $N \sim 10$, the objective is to generate a simple MLP $h_\theta(x): X \to \{0,1\}$ by leveraging knowledge from a pre-trained Transformer model $f$.

Model Architecture

Overall Framework

TabDistill comprises two stages:

  • Stage 1: Fine-tuning the base Transformer model so that it outputs the weights of a well-performing MLP
  • Stage 2: Optional additional fine-tuning of the generated MLP

Core Components

  1. Base Model Decomposition:
    • Encoder: $f_E(s): S \to Z$
    • Decoder: $f_D(z): Z \to \{0,1\}$
  2. MLP Architecture:
    h_θ(x) = ReLU(W_R ReLU(···ReLU(W_2 ReLU(W_1 x + b_1) + b_2)···) + b_R)
    

    where R denotes the number of layers and L denotes hidden layer width
  3. Linear Mapping:
    m_η(z) = LayerNorm(Az + b)
    

    where $A \in \mathbb{R}^{\dim(\Theta) \times \dim(Z)}$ and $\eta = (A, b)$
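
The components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoder output z is replaced by a random vector, the layer normalization has no learned scale or shift, the MLP has two outputs, and all sizes and helper names (layer_norm, mlp_shapes, unpack, mlp_forward) are hypothetical.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def mlp_shapes(d_in, width, depth, d_out=2):
    """Weight and bias shapes for an MLP with `depth` layers in total."""
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    return [((dims[i + 1], dims[i]), (dims[i + 1],)) for i in range(depth)]

def unpack(theta, shapes):
    """Split the flat parameter vector theta into per-layer (W, b) pairs."""
    params, pos = [], 0
    for w_shape, b_shape in shapes:
        w_size, b_size = int(np.prod(w_shape)), int(np.prod(b_shape))
        W = theta[pos:pos + w_size].reshape(w_shape)
        pos += w_size
        b = theta[pos:pos + b_size]
        pos += b_size
        params.append((W, b))
    return params

def mlp_forward(x, params):
    """h_theta(x) with a ReLU after every layer, as in the MLP formula above."""
    h = x
    for W, b in params:
        h = np.maximum(W @ h + b, 0.0)
    return h

rng = np.random.default_rng(0)
d_in, width, depth, dim_z = 5, 10, 4, 32          # illustrative sizes only
shapes = mlp_shapes(d_in, width, depth)
dim_theta = sum(int(np.prod(w)) + int(np.prod(b)) for w, b in shapes)

z = rng.normal(size=dim_z)                        # stand-in for the encoder output f_E(g(D_N))
A = rng.normal(size=(dim_theta, dim_z)) * 0.1     # eta = (A, b_map), the trainable mapping
b_map = np.zeros(dim_theta)

theta = layer_norm(A @ z + b_map)                 # m_eta(z) = LayerNorm(Az + b)
params = unpack(theta, shapes)
out = mlp_forward(rng.normal(size=d_in), params)  # two non-negative outputs
```

The point of the sketch is that the MLP has no parameters of its own during Stage 1: every weight is a slice of theta, which is a function of the encoder output and the mapping (A, b).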

Training Procedure

Stage 1 Loss Function:

L(η; D_N) = -Σ_n [y_n log(σ(h_θ(x_n))[1]) + (1 - y_n) log(σ(h_θ(x_n))[0])]

where $\theta = m_\eta(f_E(g(D_N)))$
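
As a worked illustration of this loss, the sketch below computes it with NumPy. It assumes that σ is a softmax over the MLP's two outputs, that the bracketed index selects the probability of class 0 or 1, and that the usual negative log-likelihood sign convention applies so that lower values are better. The function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def stage1_loss(outputs, y):
    """Binary cross-entropy summed over the N training rows.

    outputs: (N, 2) array holding h_theta(x_n) for each row
    y:       (N,) integer array of 0/1 labels
    """
    probs = softmax(outputs)
    picked = probs[np.arange(len(y)), y]   # probability assigned to the true class
    return -np.log(picked).sum()

# With uninformative outputs every row contributes log(2).
loss = stage1_loss(np.zeros((4, 2)), np.array([0, 1, 1, 0]))
```

In Stage 1 the outputs depend on η only through θ, so the gradient of this loss flows back through the generated weights into the mapping parameters.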

Technical Innovations

  1. Hypernetwork Paradigm: Leveraging insights from computer vision, the Transformer serves as a hypernetwork generating neural network weights
  2. Permutation Augmentation: Random feature reordering at each training epoch prevents overfitting
  3. Parameter-Efficient Fine-tuning: Only linear mapping parameters $\eta$ are fine-tuned while base model parameters remain frozen
  4. Two-Stage Design: Sequential distillation followed by fine-tuning fully exploits pre-trained knowledge
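
The permutation augmentation in item 2 can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: it simply reorders the feature columns (and their names) with a fresh random permutation each epoch, and the column names are made up.

```python
import numpy as np

def permute_features(X, column_names, rng):
    """Return X with its columns shuffled, the matching names, and the permutation used."""
    perm = rng.permutation(X.shape[1])
    return X[:, perm], [column_names[i] for i in perm], perm

rng = np.random.default_rng(0)
X = np.arange(12, dtype=float).reshape(4, 3)   # 4 rows, 3 features
names = ["age", "balance", "duration"]         # hypothetical column names

for epoch in range(3):
    X_perm, names_perm, perm = permute_features(X, names, rng)
    # X_perm (with names_perm) would be passed to the frozen encoder here.
    # The generated MLP must then be evaluated on inputs in the same column order.
```

Labels are untouched because only columns move; each row keeps its values and its target.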

Specific Instantiations

TabDistill + TabPFN

  • Direct use of tabular data with $g(x) = x$ (identity transformation)
  • Encoder output dimension: $192N$
  • Mapping matrix dimension: $\dim(\Theta) \times 192N$

TabDistill + T0pp

  • Text serialization: "The <column name> is <value>"
  • Encoder output dimension: 4096
  • Mapping matrix dimension: $\dim(\Theta) \times 4096$
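
The text serialization used for T0pp can be illustrated with a short helper. Only the template "The <column name> is <value>" comes from the summary above; the trailing period, the space separator between sentences, the function name, and the example row are assumptions.

```python
def serialize_row(row):
    """Turn one table row (a dict of column name -> value) into a sentence list."""
    return " ".join(f"The {name} is {value}." for name, value in row.items())

example_row = {"age": 42, "job": "technician", "balance": 1500}  # hypothetical row
text = serialize_row(example_row)
# text == "The age is 42. The job is technician. The balance is 1500."
```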

Experimental Setup

Datasets

Five public tabular datasets are employed:

  1. Bank (UCI Bank Marketing): Predicting customer subscription to term deposits
  2. Blood (UCI Blood Transfusion): Predicting blood donation likelihood
  3. Calhousing (California Housing): Predicting housing value categories
  4. Heart (UCI Heart Disease): Predicting heart disease presence
  5. Income (Census Income): Predicting annual income exceeding 50K

Evaluation Metrics

ROC-AUC is the primary evaluation metric for classification performance in the few-shot setting.
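
For readers unfamiliar with the metric, ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; it is an illustration, not the evaluation code used in the paper, where a library routine such as scikit-learn's roc_auc_score would be the more likely choice.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]            # every positive/negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 3 of 4 pairs ranked correctly
```

Because the metric depends only on the ranking of scores, it stays meaningful when the predicted probabilities of a tiny model are poorly calibrated.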

Comparison Methods

  1. Classical Baselines: Logistic regression, XGBoost, independently trained MLP
  2. Base Models: TabPFN, T0pp (TabLLM)
  3. Distilled Models: TabDistill + TabPFN, TabDistill + T0pp

Implementation Details

  • MLP Architecture: 4 layers with 10 neurons per layer (~1,000 parameters)
  • Training Configuration: Stage 1 fine-tuning for 300 epochs, Stage 2 for additional 100 epochs
  • Hyperparameter Optimization: Grid search via Weights & Biases
  • Sample Scales: N ∈ {4, 8, 16, 32, 64}
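
To give a sense of scale for the trainable mapping η = (A, b), the arithmetic below combines figures quoted in this summary: an MLP of roughly 1,000 parameters, a T0pp encoder output of dimension 4096, and a TabPFN encoder output of dimension 192N. The exact value of dim(Θ) is not given here, so 1,000 is used as a stand-in and the results are order-of-magnitude estimates only.

```python
def mapping_param_count(dim_theta, dim_z):
    """Entries in A (dim_theta x dim_z) plus entries in the bias b (dim_theta)."""
    return dim_theta * dim_z + dim_theta

dim_theta = 1000                                        # approximate MLP size (stand-in)
eta_t0pp = mapping_param_count(dim_theta, 4096)         # 4,097,000
eta_tabpfn = mapping_param_count(dim_theta, 192 * 4)    # N = 4 gives 769,000
```

Under these assumptions the mapping is far larger than the MLP it generates but much smaller than the frozen base model, and only the MLP is needed at inference time.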

Experimental Results

Primary Results

Based on ROC-AUC results from Table 1:

Extreme Few-Shot Scenario (N=4)

  • TabDistill + TabPFN achieves 0.72 on the Bank dataset, significantly surpassing all classical baselines
  • TabDistill + T0pp demonstrates excellent performance across multiple datasets, such as Calhousing (0.67) and Income (0.70)

Overall Trends

  1. Performance Improvement with Sample Increase: All methods show general performance improvement as N increases
  2. Baseline Method Variation: No single classical method universally dominates across all datasets
  3. Model Selection Differences: TabDistill + TabPFN generally outperforms TabDistill + T0pp, with the reverse observed on the Income dataset

Comparison with Base Models

Table 3 reveals surprising results:

  • In certain cases, distilled MLPs exceed the performance of original Transformer models
  • For example, on Bank dataset with N=4: TabDistill + TabPFN (0.72) > TabPFN (0.62)
  • This indicates that distillation not only compresses the model but may also enhance performance

Ablation Studies

Model Complexity Impact (Table 2)

  • Tests the influence of the layer count R on performance
  • Performance degrades beyond a certain complexity threshold
  • The 4-layer architecture achieves the best performance in most cases

Feature Attribution Analysis (Figure 3)

Using SHAP for feature importance analysis:

  • Distilled models maintain consistency with classical baselines in feature importance
  • Models correctly identify important features even after feature permutation
  • Demonstrates that base models correctly learn associations between MLP weights and feature ordering

Experimental Findings

  1. Significant Distillation Effects: Distilled models substantially outperform classical methods in extreme few-shot scenarios
  2. Parameter Efficiency: Compression from millions/billions to thousands of parameters yields enormous efficiency gains
  3. Effective Knowledge Transfer: Pre-trained knowledge successfully transfers to simple MLPs
  4. Strong Robustness: Permutation augmentation strategy effectively prevents overfitting

Related Work

Classical Tabular Data Algorithms

  • Traditional Advantages: XGBoost, LightGBM, and CatBoost have long dominated the tabular data domain
  • Few-Shot Limitations: From-scratch classical models show significant performance degradation in few-shot scenarios

Transformer Applications to Tabular Data

  • SAINT: Employs attention mechanisms to model row-column interactions with self-supervised pre-training
  • TabPFN: Pre-trained on extensive synthetic tabular data, enabling direct prediction on new tasks without additional training
  • TabLLM Series: Serializes tabular data as text, leveraging LLMs for classification

Meta-Learning and Hypernetworks

  • Meta-Learning Connection: Transformers excel at in-context learning, analogous to meta-learning paradigms
  • Hypernetwork Applications: Computer vision has established precedent for using Transformers to generate neural network weights
  • Novel Contribution: First application of this paradigm to the tabular data domain

Knowledge Distillation

  • Traditional Distillation: Aligns student model outputs with teacher model outputs through loss functions
  • Distinction from Prior Work: TabDistill has the Transformer generate the MLP's weights directly, rather than aligning student and teacher outputs through a distillation loss
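
For contrast, the traditional output-alignment approach can be sketched as a temperature-scaled KL divergence between teacher and student predictions, in the style of Hinton et al. This is a generic illustration of classical distillation, not part of TabDistill, and the temperature value is an arbitrary choice.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=-1)
    return (temperature ** 2) * kl.mean()

teacher = np.array([[2.0, -1.0], [0.5, 1.5]])
same = distillation_loss(teacher, teacher)            # identical outputs give zero loss
different = distillation_loss(np.zeros((2, 2)), teacher)
```

In this classical setup the student is trained by gradient descent on its own weights, whereas in TabDistill the student's weights are an output of the teacher-side pipeline.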

Conclusions and Discussion

Main Conclusions

  1. Effectiveness Validation: TabDistill successfully achieves balance between parameter efficiency and few-shot performance
  2. Performance Advantages: Distilled MLPs surpass classical baselines in most cases, and in certain scenarios even exceed original Transformer models
  3. Practical Value: Provides a practically deployable solution accommodating diverse infrastructure requirements

Limitations

The authors acknowledge the following limitations:

  1. Large-Sample Performance: Limited performance improvements when training samples increase
  2. Simple Mapping Function: Current linear mapping may constrain performance ceiling
  3. Bias Inheritance: Distilled models may inherit biases from base models
  4. Application Scope: Currently validated only on binary classification tasks

Future Directions

  1. Mapping Function Improvements: Exploring more complex mapping functions to enhance performance
  2. Application Extension: Extending to natural language inference, instruction tuning, and other few-shot tasks
  3. Bias Mitigation: Reducing base model biases through Stage 2 MLP fine-tuning
  4. Multi-Task Learning: Investigating simultaneous handling of multiple tabular tasks

In-Depth Evaluation

Strengths

  1. Strong Problem Targeting: Accurately identifies and addresses core contradictions in practical applications
  2. Methodological Innovation: First application of hypernetwork paradigm to tabular data distillation
  3. Comprehensive Experimental Design:
    • Multi-dataset validation
    • Sufficient baseline comparisons
    • Detailed ablation studies
    • Feature attribution analysis
  4. Convincing Results: Not only achieves intended objectives but discovers the interesting phenomenon of distilled models exceeding original models
  5. High Practical Value: Provides directly applicable solutions

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for why distilled models can exceed original models
  2. Limited Dataset Scale: Validation on only 5 relatively small-scale datasets
  3. Single Task Type: Considers only binary classification, excluding regression or multi-class tasks
  4. Limited Base Model Coverage: Tests only two base models with limited scope
  5. Incomplete Computational Cost Analysis: Lacks detailed comparison of actual training and inference computational costs

Impact

  1. Academic Contributions:
    • Pioneering new direction in tabular data Transformer distillation
    • Providing novel perspectives for few-shot learning
    • Bridging hypernetworks and knowledge distillation research domains
  2. Practical Value:
    • Addresses important deployment challenges
    • Provides feasible solutions for resource-constrained environments
    • Directly applicable to industrial scenarios
  3. Reproducibility:
    • Provides detailed implementation details
    • Open-source commitment enhances reproducibility
    • Clear and repeatable experimental setup

Applicable Scenarios

  1. Resource-Constrained Environments: Mobile devices, edge computing scenarios
  2. Few-Shot Applications: Medical diagnosis, financial risk control, quality inspection with scarce data
  3. Real-Time Inference Requirements: Online services requiring rapid response
  4. Model Interpretability Requirements: Simple MLPs offer greater interpretability compared to complex Transformers

References

The paper cites extensive related work, primarily including:

  • Classical tabular data methods: XGBoost, LightGBM, CatBoost, etc.
  • Transformer tabular applications: TabPFN, SAINT, TabLLM series
  • Knowledge distillation: Classical works by Hinton et al.
  • Hypernetworks: Related applications in computer vision
  • Meta-learning: Transformer in-context learning research

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to practical problems with comprehensive experimental validation, demonstrating significant academic and practical value. While certain limitations exist, it makes important contributions to the advancement of related fields.