backup: 2026-06-19 18:09

2026-06-19 18:09:39 +03:00
parent 4806acf57d
commit 16c7ccac8a
7 changed files with 299 additions and 30 deletions
@@ -0,0 +1,37 @@
+* impl/paper-tabpfn
+
+[[id:151d5686-6f40-4158-a59a-b0be94cdc969][impl/research]]
+
+*TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second*
+Hollmann et al. — ICLR 2023 — arXiv:2207.01848
+
+** Problem
+
+On small tabular datasets, most learned models need expensive hyperparameter search and still often lose to boosted trees. TabPFN asks: can you eliminate tuning entirely while matching AutoML?
+
+** Core Method
+
+A Transformer pre-trained *offline* on millions of synthetic datasets sampled from structural causal models. At inference, the full training set is passed as context — no gradient updates. The model receives =(X_train, y_train, X_test)= as one sequence and outputs predictions in a single forward pass (in-context learning).
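+
+The "training set as context, no gradient updates" idea can be sketched with a toy stand-in. This is *not* the TabPFN architecture or its API: the hypothetical =in_context_predict= below swaps the pre-trained Transformer for a single fixed softmax-attention step over the training rows, and the =temperature= parameter is an assumption made for illustration. It only shows the interface shape: labeled rows go in next to the query rows, class probabilities come out, and nothing is fitted in between.

```python
import math


def in_context_predict(X_train, y_train, X_test, n_classes, temperature=1.0):
    """Toy in-context classifier (illustrative stand-in, not TabPFN).

    Each test row attends over all training rows with softmax weights based
    on negative squared Euclidean distance, then averages the training labels
    under those weights. No parameters are learned or updated.
    """
    if len(X_train) != len(y_train) or not X_train:
        raise ValueError("X_train and y_train must be non-empty and aligned")

    all_probs = []
    for x in X_test:
        # Attention scores: closer training rows get higher scores.
        scores = [
            -sum((a - b) ** 2 for a, b in zip(x, row)) / temperature
            for row in X_train
        ]
        # Numerically stable softmax over the training rows.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)

        # Attention-weighted vote over the training labels.
        probs = [0.0] * n_classes
        for w, label in zip(weights, y_train):
            probs[label] += w / total
        all_probs.append(probs)
    return all_probs


if __name__ == "__main__":
    X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    y_train = [0, 0, 1, 1]
    X_test = [[0.2, 0.1], [5.1, 5.4]]
    for row, probs in zip(X_test, in_context_predict(X_train, y_train, X_test, 2)):
        print(row, [round(p, 3) for p in probs])
```

+In TabPFN itself the mapping from context to predictions is learned during the offline pre-training on synthetic datasets; the fixed distance kernel here only stands in for that learned mapping.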
+
+** Evaluation
+
+- 18 OpenML-CC18 datasets + 67 small numerical OpenML datasets
+- Up to 1,000 training points, 100 features, 10 classes
+- Compared against AutoML systems (Auto-sklearn), XGBoost, random forests
+
+** Key Results
+
+- Outperforms boosted trees on small datasets; performs on par with state-of-the-art AutoML systems
+- Up to 230× speedup over AutoML baselines; up to 5,700× when a GPU is available
+- No hyperparameter tuning required
+
+** Limitations
+
+- v1: numerical features only, no missing values, max ~1,000 training samples
+- TabPFN v2 (2025, arXiv:2502.17361) lifts most constraints — handles larger datasets, mixed types, missing values
+
+** Relevance to ROLL
+
+- [[id:8f59b736-04ea-4d11-9195-30d125a127f8][impl/paper-beyond-rebalancing]] identifies TabPFN as the best-performing classifier on imbalanced tabular data without rebalancing — it is the current bar to beat
+- ROLL's niche (optimizing TPR at a specific FPR threshold) is orthogonal: TabPFN uses no custom loss or ROC objective
+- If evaluating on small KEEL datasets (≤1,000 samples), TabPFN is the strongest baseline to include
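+
+For comparing a TabPFN baseline against ROLL on the metric named above, TPR at a fixed FPR budget can be computed from any classifier's scores. The helper below is a minimal sketch under assumptions: the name =tpr_at_fpr=, the binary 0/1 label convention, and the tie handling (equal scores are thresholded together) are illustrative choices, not taken from ROLL or from the TabPFN paper.

```python
def tpr_at_fpr(y_true, scores, max_fpr):
    """Highest TPR reachable by a score threshold whose FPR is <= max_fpr.

    y_true: iterable of 0/1 labels (1 = positive class).
    scores: higher score means "more likely positive".
    """
    y_true = list(y_true)
    scores = list(scores)
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")

    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Sweep thresholds from the highest score downwards.
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(pairs):
        # Examples with identical scores cross the threshold together.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold drops further
        best_tpr = tp / n_pos
        i = j
    return best_tpr


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    print(tpr_at_fpr(labels, model_scores, max_fpr=0.25))
```

+Scoring both methods with the same function on the same held-out split keeps the comparison about the operating point rather than overall accuracy.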