TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
Dissanayake, Dutta
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
Transformer-based models have demonstrated promising performance on tabular data compared to classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They leverage pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, commonly referred to as the few-shot regime. However, the performance gains in the few-shot regime come at the expense of significantly increased complexity and parameter count. To circumvent this trade-off, we introduce TabDistill, a novel strategy to distill pre-trained knowledge from complex transformer-based models into simpler neural networks for effective tabular data classification. Our framework achieves the best of both worlds: parameter efficiency while maintaining strong performance with limited training data. The distilled neural networks surpass classical baselines such as standard neural networks, XGBoost, and logistic regression under equal training data conditions, and in some cases, even exceed the performance of the original transformer-based models from which they were distilled.
This research addresses a fundamental contradiction in tabular data classification: while Transformer-based models demonstrate superior performance in few-shot scenarios, they suffer from enormous parameter counts and high computational complexity, making them difficult to deploy in practical applications.
Practical Application Needs: In high-risk domains such as finance, healthcare, and manufacturing, scarce labeled data is a common challenge, exemplified by rare disease diagnosis and the prediction of once-in-a-century natural phenomena.
Data Annotation Costs: Financial applications involve expensive data annotation with inherent subjectivity, annotation errors, and lack of consensus
Deployment Constraints: Real-world applications require parameter-efficient and scalable models to accommodate varying infrastructure capabilities
Traditional Methods: XGBoost, CatBoost, and LightGBM excel with abundant data but show significant performance degradation in few-shot scenarios
Transformer Methods: TabPFN and TabLLM demonstrate excellent few-shot performance but require millions to billions of parameters, resulting in prohibitive inference costs
Efficiency-Performance Trade-off: Existing approaches do not maintain few-shot performance while also remaining parameter-efficient
The authors pose a central question: "Can we achieve both objectives simultaneously—maintaining parameter efficiency while performing well with limited training data?"
Proposes TabDistill Framework: A novel strategy for distilling Transformer model knowledge into neural networks, achieving parameter-efficient tabular data classification
Dual Model Instantiation: Framework implementation based on TabPFN (~11M parameters) and BigScience T0pp (~11B parameters), distilled into MLPs with approximately 1,000 parameters
Experimental Validation: Evaluation on five tabular datasets demonstrating that distilled MLPs surpass classical baselines and, in certain cases, even exceed the original Transformer models
Innovative Training Strategy: Introduction of permutation-based training techniques to prevent overfitting on extremely small training sets
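To give a sense of scale for the "approximately 1,000 parameters" figure, the following minimal sketch counts the weights and biases of a small fully connected MLP and compares the total with the teacher sizes quoted above. The layer sizes (10 input features, hidden widths 32 and 16, one output logit) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def mlp_param_count(layer_sizes):
    """Count the weights and biases of a fully connected MLP.

    layer_sizes lists the width of every layer, input first and output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


# Hypothetical student architecture (an assumption, not the paper's configuration).
student_sizes = [10, 32, 16, 1]
student_params = mlp_param_count(student_sizes)
print(student_params)  # 897

# Approximate teacher sizes as quoted in the summary above.
tabpfn_params = 11_000_000
t0pp_params = 11_000_000_000
print(tabpfn_params // student_params)  # rough size ratio, TabPFN vs. student
print(t0pp_params // student_params)    # rough size ratio, T0pp vs. student
```

Under these assumed layer sizes, the student is about four orders of magnitude smaller than TabPFN and about seven orders of magnitude smaller than T0pp.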
Given a small-scale tabular dataset $D_N = \{(x_n, y_n) : x_n \in \mathcal{X},\ y_n \in \{0, 1\},\ n = 1, \dots, N\}$, where $N \sim 10$, the objective is to generate a simple MLP $h_\theta(x): \mathcal{X} \to \{0, 1\}$ by leveraging knowledge from a pre-trained Transformer model $f$.
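As a concrete reading of this notation, the sketch below implements a classifier h_θ that maps feature vectors to labels in {0, 1} from a flat parameter vector θ. It only illustrates the shape of the target model: the layer sizes, the ReLU activation, and the random θ are assumptions made for illustration, and the sketch does not show how TabDistill obtains θ from the Transformer f.

```python
import numpy as np


def unpack(theta, layer_sizes):
    """Split a flat parameter vector into per-layer (weights, bias) pairs."""
    layers, offset = [], 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = theta[offset:offset + fan_in * fan_out].reshape(fan_in, fan_out)
        offset += fan_in * fan_out
        b = theta[offset:offset + fan_out]
        offset += fan_out
        layers.append((W, b))
    return layers


def h_theta(theta, X, layer_sizes):
    """Binary MLP classifier: ReLU hidden layers, then a thresholded logit."""
    layers = unpack(theta, layer_sizes)
    a = X
    for W, b in layers[:-1]:
        a = np.maximum(a @ W + b, 0.0)  # ReLU (assumed activation)
    W, b = layers[-1]
    logits = (a @ W + b).ravel()
    # sigmoid(logit) > 0.5 is equivalent to logit > 0
    return (logits > 0).astype(int)


# Toy usage: N = 10 examples with 10 features each (hypothetical sizes).
rng = np.random.default_rng(0)
layer_sizes = [10, 32, 16, 1]
X = rng.normal(size=(10, 10))
theta = rng.normal(size=897)  # placeholder values, not distilled parameters
preds = h_theta(theta, X, layer_sizes)
print(preds.shape)
```

Keeping θ as a single flat vector makes the parameter budget explicit: with the assumed layer sizes, the whole classifier is described by 897 numbers.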
The paper cites extensive related work, primarily including:
Classical tabular data methods: XGBoost, LightGBM, CatBoost, etc.
Transformer tabular applications: TabPFN, SAINT, TabLLM series
Knowledge distillation: Classical works by Hinton et al.
Hypernetworks: Related applications in computer vision
Meta-learning: Transformer in-context learning research
Overall Assessment: This is a high-quality research paper that proposes innovative solutions to practical problems with comprehensive experimental validation, demonstrating significant academic and practical value. While certain limitations exist, it makes important contributions to the advancement of related fields.