Why do tree-based models still outperform deep learning on tabular data?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • srbench

    A living benchmark framework for symbolic regression

  • A great paper and an important result.

    However, it does not cite the highly relevant SRBench paper from 2021, which also carefully curates a suitable set of regression benchmarks and shows that Genetic Programming approaches also tend to outperform deep learning.

    https://github.com/cavalab/srbench

    cc u/optimalsolver

  • Spearmint

    Spearmint Bayesian optimization codebase

  • It occurs to me that a system, trained on peer-reviewed applied-machine-learning literature and Kaggle winners, that generates candidates for structured feature-engineering specifications, based on plaintext descriptions of columns' real-world meaning, should be considered a requisite part of the "meta" here.

    Ah, and then you could iterate within the resulting feature-engineering-suggestion space as a hyper-parameter between experiments, which could be optimized with e.g. https://github.com/HIPS/Spearmint . The papers write themselves!
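
As a rough illustration of the idea above, candidate feature-engineering specifications can be treated as one more hyperparameter and scored on a validation split. The sketch below is not Spearmint: it swaps in an exhaustive search over three hypothetical candidates (`raw_a`, `raw_b`, `product`) on synthetic data, where a Bayesian optimizer would instead propose which candidate to evaluate next. All names and the data generator are assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x > 0 else 0.0
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical candidate specs: each maps a raw row (a, b) to a single feature.
CANDIDATES = {
    "raw_a": lambda row: row[0],
    "raw_b": lambda row: row[1],
    "product": lambda row: row[0] * row[1],
}

def make_data(n, rng):
    # Synthetic target that depends on the interaction a * b (an assumption).
    rows = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    targets = [3.0 * a * b + rng.gauss(0, 0.05) for a, b in rows]
    return rows, targets

def search(train, valid):
    """Score every candidate spec on the validation split and return the best."""
    (train_rows, train_y), (valid_rows, valid_y) = train, valid
    scores = {}
    for name, spec in CANDIDATES.items():
        slope, intercept = fit_line([spec(r) for r in train_rows], train_y)
        scores[name] = mse([spec(r) for r in valid_rows], valid_y, slope, intercept)
    best = min(scores, key=scores.get)
    return best, scores

rng = random.Random(0)
best, scores = search(make_data(400, rng), make_data(200, rng))
print(best, scores)
```

With a real optimizer, the loop in `search` would be replaced by a propose-evaluate-update cycle, and the candidate space would come from the generated feature-engineering suggestions rather than a fixed dictionary.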

  • decision-forests

    A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.

  • I can't explain it, but I help maintain TensorFlow Decision Forests [1] and Yggdrasil Decision Forests [2], and in an AutoML system at work that trains models on data from many different users, decision forest models get selected as best (after AutoML tries various model types and hyperparameters) somewhere between 20% and 40% of the time, systematically. It's pretty interesting. The other model types considered are neural networks, linear models (with automatic feature-cross generation), and a couple of other variations.

    [1] https://github.com/tensorflow/decision-forests

  • yggdrasil-decision-forests

    A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.

  • Oh, you touched on my favorite topic: whole-dataset training.

    Take a look at [1] and go straight to page 8, Figure 2(b).

    [1] http://proceedings.mlr.press/v48/taylor16.pdf

    The paper discusses whole-dataset training, and one of the datasets used is HIGGS [2]. Figure 2(b) compares two whole-dataset training approaches (L-BFGS and ADMM) against SGD. SGD basically tops out at the accuracy where both whole-dataset approaches start.

    [2] https://archive.ics.uci.edu/ml/datasets/HIGGS#

    HIGGS is a strange dataset. It is narrow, with only 28 features (29 columns including the class label). It is also relatively long, about 11M samples (10M to train, 0.5M to validate, and the last 0.5M to test). It is also hard to get right with SGD.

    But if you perform whole-dataset optimization, even a plain linear model (logistic regression) can get you good accuracy [3] (some experiments of mine).

    [3] https://github.com/thesz/higgs-logistic-regression
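
To make the distinction concrete, here is a minimal sketch of whole-dataset (full-batch) training for logistic regression, where every update uses the gradient over all samples rather than a single sample or a mini-batch. It uses plain gradient descent on synthetic data as a stand-in: it is not the L-BFGS or ADMM method from the paper, it does not use HIGGS, and it is not the code from the linked repository. The data generator, learning rate, and step count are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(weights, bias, x):
    """Predicted probability of class 1 for one sample."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def full_batch_step(weights, bias, xs, ys, lr):
    """One gradient-descent update using the gradient over the whole dataset."""
    n, d = len(xs), len(weights)
    grad_w = [0.0] * d
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = predict(weights, bias, x) - y  # derivative of log loss w.r.t. the logit
        for j in range(d):
            grad_w[j] += err * x[j]
        grad_b += err
    new_w = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return new_w, bias - lr * grad_b / n

def accuracy(weights, bias, xs, ys):
    hits = sum((predict(weights, bias, x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def make_data(n, rng):
    # Synthetic, nearly linearly separable data (an assumption, not HIGGS).
    true_w = [2.0, -1.0, 0.5]
    xs = [[rng.gauss(0, 1) for _ in true_w] for _ in range(n)]
    ys = [
        1 if sum(w * xi for w, xi in zip(true_w, x)) + rng.gauss(0, 0.3) > 0 else 0
        for x in xs
    ]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(1000, rng)
test_x, test_y = make_data(500, rng)

weights, bias = [0.0, 0.0, 0.0], 0.0
for _ in range(200):
    weights, bias = full_batch_step(weights, bias, train_x, train_y, lr=0.5)

test_accuracy = accuracy(weights, bias, test_x, test_y)
print(f"test accuracy: {test_accuracy:.3f}")
```

An SGD variant would call the same update on one sample (or a small batch) at a time; the full-batch version trades more computation per step for a noise-free gradient, which is what makes second-order or quasi-Newton methods such as L-BFGS applicable.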
