backup: 2026-06-19 18:09

2026-06-19 18:09:39 +03:00
parent 4806acf57d
commit 16c7ccac8a
7 changed files with 299 additions and 30 deletions
@@ -0,0 +1,73 @@
+* impl/research
+
+Survey of academic literature on class imbalance in deep learning, relevant to ROLL's thesis positioning.
+
+** Key Papers
+
+| Paper | Venue | Node |
+|-------|-------|------|
+| CLIMB (arXiv:2505.17451) | NeurIPS 2025 | — |
+| [[id:8f59b736-04ea-4d11-9195-30d125a127f8][impl/paper-beyond-rebalancing]] | 2024 | detailed node |
+| Simplifying NN Training Under Class Imbalance (arXiv:2312.02517) | 2023 | — |
+| Investigating Group DRO (arXiv:2303.02505) | 2023 | — |
+| [[id:bf0fc08a-e806-48df-b188-7a2c4c41c693][impl/paper-tabpfn]] | ICLR 2023 | detailed node |
+| Survey on Imbalanced Learning (Springer 2024) | Springer AI Review | — |
+| Rethinking Class Imbalance (arXiv:2305.03900) | 2023 | — |
+
+** Competing Strategies
+
+Methods the literature benchmarks against (relevant as ROLL baselines):
+
+- *Resampling*: SMOTE, ADASYN, CSMOUTE, BorderlineSMOTE, ROSE
+- *Cost-sensitive*: class weighting, focal loss, asymmetric loss
+- *Ensemble*: BalancedBagging, EasyEnsemble, RUSBoost, BalancedRandomForest
+- *Threshold moving*: post-hoc adjustment of the decision threshold
+- *DL-specific*: LDAM-DRW, M2m, MiSLAS, BBN (mostly image long-tail)
+- *Tabular baselines (GBDT and DL)*: XGBoost, LightGBM, CatBoost, MLP, ResNet, FT-Transformer, [[id:bf0fc08a-e806-48df-b188-7a2c4c41c693][impl/paper-tabpfn]]
+- *CLIMB finding*: ensembles dominate; naive rebalancing (SMOTE alone) often underperforms
+
+Metrics used: AUC-ROC, G-Mean, F1, Precision/Recall. AUC-ROC and G-Mean are the most common choices for imbalanced evaluation.
+ROLL's TPR-at-FPR framing (true-positive rate at a fixed false-positive rate) is non-standard but more practically useful; position this as an advantage.
+
+** Dataset Coverage vs Literature
+
+*** Well Covered by ROLL
+- All glass variants (glass0–6) — standard KEEL
+- Yeast3, ecoli-0-1_vs_5, wisconsin, cleveland, pima, haberman, iris0, vowel0, vehicle2, page-blocks, new-thyroid1, led7digit
+- Adult, Forest Cover, Bank Marketing (medium tabular)
+- Credit Card Fraud (~285K, IR 577:1) — common in fraud literature
+
+*** Gaps vs Literature (datasets used in papers that ROLL lacks)
+
+| Dataset | IR | Samples | Appears In |
+|---------|----|---------|------------|
+| Abalone9-18 | ~130 | 731 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]], CLIMB |
+| Annthyroid | 7.2 | 6916 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]], many UCI surveys |
+| Satellite | 22 | 6435 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]] |
+| Segment | 6 | 2310 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]] |
+| Yeast4/5/6 | 8–33 | ~1484 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]], CLIMB |
+| Ecoli4 | 15.8 | 336 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]] |
+| KC1/KC2/PC1/CM1 (software) | 5–13 | 415–1783 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]] |
+| Pen-local/Pen-global | 9–671 | 7291 | [[id:8f59b736-04ea-4d11-9195-30d125a127f8][Beyond Rebalancing]] |
+
+*** Non-Standard or Unusual in ROLL
+- *Higgs*: ROLL uses a 500K-sample balanced (50/50) subset, so it is not a standard imbalanced benchmark; physics ML context
+- *Home Credit*: Kaggle competition dataset; rare in academic imbalance papers
+- *CIFAR-10 binary* (class 1 vs rest, IR ~9): DL imbalance papers use a long-tailed formulation instead, so results are not directly comparable to LDAM/MiSLAS tables
+
+** Recommendations for Baseline Strengthening
+
+Priority additions (available in KEEL, low effort):
+1. Yeast4, Yeast5, Yeast6 — stress-test high IR range
+2. Annthyroid — one of the most cited UCI imbalanced datasets
+3. Abalone9-18 — extreme IR (130:1), covers the hard regime
+4. Ecoli4 — rounds out ecoli coverage at IR 15.8
+
+Lower priority (useful if sweeping many baselines):
+5. Satellite, Segment, Pen-local — common in full KEEL sweeps
+6. KC1/PC1 — software metrics datasets; different domain from biology/finance
+
+** Paper Subnodes
+
+- [[id:bf0fc08a-e806-48df-b188-7a2c4c41c693][impl/paper-tabpfn]] — TabPFN: in-context learning for small tabular classification (ICLR 2023)
+- [[id:8f59b736-04ea-4d11-9195-30d125a127f8][impl/paper-beyond-rebalancing]] — benchmark of 12 classifiers under imbalance, no rebalancing (2024)
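
For the TPR-at-FPR framing noted under Competing Strategies, the following is a minimal sketch of how such a metric could be computed next to G-Mean. It is an illustration under stated assumptions, not ROLL's actual evaluation code: the function names `tpr_at_fpr` and `g_mean`, the strict `score > threshold` decision rule, and the floor-based false-positive budget are all hypothetical choices made here.

```python
from math import inf, sqrt


def tpr_at_fpr(y_true, scores, max_fpr):
    """TPR at the most permissive threshold whose FPR stays <= max_fpr.

    Assumption: labels are 0/1 and a sample is predicted positive
    when its score is strictly greater than the threshold.
    """
    neg = sorted((s for y, s in zip(y_true, scores) if y == 0), reverse=True)
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    if not neg or not pos:
        raise ValueError("both classes must be present")

    # Number of false positives the budget allows (floor, with a small
    # epsilon to absorb floating-point error in the product).
    allowed_fp = int(max_fpr * len(neg) + 1e-9)

    # Placing the threshold at the (allowed_fp + 1)-th highest negative
    # score leaves at most allowed_fp negatives strictly above it.
    threshold = -inf if allowed_fp >= len(neg) else neg[allowed_fp]

    true_positives = sum(1 for s in pos if s > threshold)
    return true_positives / len(pos)


def g_mean(y_true, y_pred):
    """Geometric mean of TPR (sensitivity) and TNR (specificity)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
```

The contrast between the two is that G-Mean scores a single fixed operating point, while `tpr_at_fpr` first picks the operating point from a false-positive budget and then reports recall there, which is why the two numbers are not interchangeable when comparing against published tables.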