impl/paper-tabpfn


impl/research

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
Hollmann et al., ICLR 2023, arXiv:2207.01848

Problem

On small tabular datasets, most methods demand expensive hyperparameter search, and deep learning approaches still often lose to boosted trees. TabPFN asks: can tuning be eliminated entirely while still matching AutoML?

Core Method

TabPFN is a Transformer pre-trained offline on millions of synthetic datasets sampled from structural causal models. At inference, the full training set is passed as context, with no gradient updates. The model receives (X_train, y_train, X_test) as one sequence and outputs predictions for X_test in a single forward pass (in-context learning).
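
To make the "training set as context" idea concrete, here is a toy sketch in plain Python. It is not TabPFN: a hand-written distance-based attention rule stands in for the pre-trained Transformer, and the function name `attention_predict` and the `temperature` parameter are made up for illustration. What it shares with the method above is the shape of the computation: labelled rows and test rows go in together, and class probabilities come out in one pass with no fitting step.

```python
import math
from typing import List, Sequence


def attention_predict(
    X_train: Sequence[Sequence[float]],
    y_train: Sequence[int],
    X_test: Sequence[Sequence[float]],
    n_classes: int,
    temperature: float = 1.0,  # hypothetical knob, not a TabPFN parameter
) -> List[List[float]]:
    """Toy in-context classifier: each test row attends over the labelled
    training rows and returns attention-weighted class probabilities."""
    all_probs = []
    for query in X_test:
        # Attention score: negative squared distance to each training row.
        scores = [
            -sum((q - x) ** 2 for q, x in zip(query, row)) / temperature
            for row in X_train
        ]
        # Numerically stable softmax over the training rows only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Accumulate the attention mass that lands on each class label.
        probs = [0.0] * n_classes
        for w, label in zip(weights, y_train):
            probs[label] += w / total
        all_probs.append(probs)
    return all_probs


X_train = [[0.0], [0.1], [5.0], [5.1]]
y_train = [0, 0, 1, 1]
X_test = [[0.05], [5.05]]
probs = attention_predict(X_train, y_train, X_test, n_classes=2)
```

Test rows only read from the training rows and never from each other, so each prediction depends on the context alone. In the real model the attention weights are learned during pre-training rather than fixed by a distance rule.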

Evaluation

18 OpenML-CC18 datasets + 67 small numerical OpenML datasets
Up to 1,000 training points, 100 features, 10 classes
Compared against AutoML systems (Auto-sklearn), XGBoost, random forests
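
The size limits above are easy to turn into a pre-check before adding a dataset to a comparison. The sketch below is only a convenience helper for these notes; `DatasetShape` and `within_eval_limits` are hypothetical names, not part of any TabPFN package.

```python
from dataclasses import dataclass

# Limits copied from the evaluation setup listed above.
MAX_TRAIN_POINTS = 1000
MAX_FEATURES = 100
MAX_CLASSES = 10


@dataclass(frozen=True)
class DatasetShape:
    """Hypothetical container for the three sizes that matter here."""
    n_train: int
    n_features: int
    n_classes: int


def within_eval_limits(shape: DatasetShape) -> bool:
    """Return True if the dataset fits all three evaluation limits."""
    return (
        shape.n_train <= MAX_TRAIN_POINTS
        and shape.n_features <= MAX_FEATURES
        and shape.n_classes <= MAX_CLASSES
    )


small = DatasetShape(n_train=800, n_features=20, n_classes=2)
too_wide = DatasetShape(n_train=800, n_features=250, n_classes=2)
```

Here `within_eval_limits(small)` is true and `within_eval_limits(too_wide)` is false, because 250 features exceeds the 100-feature limit.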

Key Results

Outperforms boosted trees on small datasets; matches top AutoML
Up to 230× faster than AutoML baselines; 5,700× with a GPU
No hyperparameter tuning required

Limitations

v1: numerical features only, no missing values, max ~1,000 training samples
TabPFN v2 (2025, arXiv:2502.17361) lifts most constraints — handles larger datasets, mixed types, missing values

Relevance to ROLL

impl/paper-beyond-rebalancing identifies TabPFN as the best-performing classifier on imbalanced tabular data without rebalancing — it is the current bar to beat
ROLL's niche (optimizing TPR at a specific FPR threshold) is orthogonal: TabPFN uses no custom loss or ROC objective
If evaluating on small KEEL datasets (≤1,000 samples), TabPFN is the strongest baseline to include
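
For comparing any baseline against ROLL on its own terms, the metric of interest is TPR at a fixed FPR budget. A minimal sketch of that metric in plain Python follows, assuming binary labels (1 = positive) and scores where higher means more positive. The function name `tpr_at_fpr` is illustrative, and this is only one reasonable definition (the best TPR over score thresholds whose FPR stays within budget), not necessarily the exact one ROLL uses.

```python
from typing import Sequence


def tpr_at_fpr(y_true: Sequence[int], scores: Sequence[float], max_fpr: float) -> float:
    """Highest TPR reachable by a score threshold whose FPR is <= max_fpr."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Walk thresholds from the highest score downwards.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples tied at this score: a threshold cannot split them.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold drops further
        best_tpr = max(best_tpr, tp / n_pos)
    return best_tpr


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
strict = tpr_at_fpr(y_true, scores, max_fpr=0.0)
relaxed = tpr_at_fpr(y_true, scores, max_fpr=0.34)
```

With no false positives allowed, only the two top-scored positives are recovered (TPR 2/3). Allowing one false positive out of three negatives recovers all three positives (TPR 1.0). The same function can score the predicted probabilities of any baseline, TabPFN included.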
