Why do tree-based models still outperform deep learning on tabular data?

srbench

2 194 9.1 Python

A living benchmark framework for symbolic regression

A great paper and an important result.
However, it does not cite the highly relevant SRBench paper from 2021, which also carefully curates a suitable set of regression benchmarks and shows that genetic programming approaches likewise tend to outperform deep learning.
https://github.com/cavalab/srbench
cc u/optimalsolver

Spearmint

2 1,529 0.0 Python

Spearmint Bayesian optimization codebase

It occurs to me that the "meta" here should include a system, trained on peer-reviewed applied machine learning literature and Kaggle-winning solutions, that generates candidate structured feature-engineering specifications from plaintext descriptions of each column's real-world meaning.
Ah, and then you could treat the choice within the resulting space of feature-engineering suggestions as a hyperparameter to vary between experiments, which could be optimized with e.g. https://github.com/HIPS/Spearmint . The papers write themselves!
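A minimal sketch of the idea of treating the feature-engineering choice as a hyperparameter, under loud assumptions: the candidate transforms are hard-coded here rather than generated from column descriptions, the data is a synthetic toy, and a plain exhaustive search stands in for a Bayesian optimizer such as Spearmint (no Spearmint API is used). All names below are hypothetical.

```python
# Toy search over a feature-engineering "hyperparameter".
# Assumption: candidates are hand-written; the comment imagines them being generated.
TRANSFORMS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "abs": abs,
}


def fit_line(us, ys):
    """Ordinary least squares for y ~ a*u + b with a single feature u."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    var = sum((u - mean_u) ** 2 for u in us)
    cov = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    a = cov / var if var else 0.0
    return a, mean_y - a * mean_u


def validation_mse(name, train, valid):
    """Fit on the transformed training feature, score on the validation split."""
    f = TRANSFORMS[name]
    a, b = fit_line([f(x) for x, _ in train], [y for _, y in train])
    return sum((a * f(x) + b - y) ** 2 for x, y in valid) / len(valid)


# Synthetic data where the target depends on x squared.
data = [(x, 3 * x * x + 1) for x in range(-6, 7)]
train, valid = data[::2], data[1::2]

# Exhaustive search over the candidates (stand-in for Bayesian optimization).
scores = {name: validation_mse(name, train, valid) for name in TRANSFORMS}
best = min(scores, key=scores.get)
print(best, scores)
```

With only three candidates an exhaustive loop is enough; a Bayesian optimizer becomes interesting when each evaluation is expensive and the candidate space is large.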

decision-forests

1 651 8.3 Python

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.

I can't explain it, but I help maintain TensorFlow Decision Forests [1] and Yggdrasil Decision Forests [2], and in an AutoML system at work that trains models on data from many different users, decision forest models get selected as the best (after AutoML tries various model types and hyperparameters) somewhere between 20% and 40% of the time, systematically. It's pretty interesting. The other model types considered are neural networks, linear models (with automatic feature-cross generation), and a couple of other variations.
[1] https://github.com/tensorflow/decision-forests
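The selection step described above (train several model types, keep whichever scores best on held-out data) can be sketched roughly as follows. This is not the commenter's AutoML system: the two candidates are toy stand-ins (a majority-class baseline and a one-split decision stump), the data is synthetic, and every name is hypothetical.

```python
# Toy model selection: fit each candidate, score on a validation split, keep the best.

def fit_majority(xs, ys):
    """Baseline that always predicts the most common training label."""
    label = 1 if sum(ys) * 2 >= len(ys) else 0
    return lambda x: label


def fit_stump(xs, ys):
    """One-split tree: predict 1 when x > threshold, threshold chosen on training error."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x: 1 if x > best_t else 0


def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def select_best(candidates, train, valid):
    """Return (best_name, scores) using validation accuracy as the criterion."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = accuracy(model, *valid)
    return max(scores, key=scores.get), scores


# Synthetic one-feature dataset with a single decision boundary.
xs = list(range(20))
ys = [1 if x >= 7 else 0 for x in xs]
train = (xs[::2], ys[::2])
valid = (xs[1::2], ys[1::2])

best_name, scores = select_best({"majority": fit_majority, "stump": fit_stump}, train, valid)
print(best_name, scores)
```

A real system would also sweep hyperparameters per model type and guard against overfitting the validation split; the sketch only shows the comparison itself.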

yggdrasil-decision-forests

4 429 9.5 C++

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.

higgs-logistic-regression

2 1 3.6 Haskell

Oh, you touched on my favorite topic: whole-dataset training.
Take a look at [1] and go straight to page 8, figure 2(b).
[1] http://proceedings.mlr.press/v48/taylor16.pdf
The paper discusses whole-dataset training, and one of the datasets used is HIGGS [2]. Figure 2(b) compares two whole-dataset training approaches (L-BFGS and ADMM) against SGD. SGD tops out at roughly the accuracy at which both whole-dataset approaches start.
[2] https://archive.ics.uci.edu/ml/datasets/HIGGS#
HIGGS is a strange dataset. It is narrow, with only 28 features (29 columns counting the label). It is also relatively long, about 11M samples (10M to train, 0.5M to validate, and the last 0.5M to test). It is also hard to get right with SGD.
But if you perform whole-dataset optimization, even a linear model such as logistic regression can get you good accuracy [3] (some experiments of mine).
[3] https://github.com/thesz/higgs-logistic-regression
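To make "whole-dataset optimization" concrete, here is a minimal sketch of logistic regression in which every update uses the gradient over all samples. It is an illustration only: it uses plain full-batch gradient descent rather than the L-BFGS or ADMM methods from the paper, it is not the code from the linked repository, and the data is a small synthetic set rather than HIGGS. SGD would differ by computing each step from one sample or a small mini-batch.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def log_loss(w, b, xs, ys):
    """Mean negative log-likelihood over the whole dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def full_batch_step(w, b, xs, ys, lr):
    """One gradient-descent step using the gradient averaged over all samples."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = predict_proba(w, b, x) - y
        for j, xj in enumerate(x):
            grad_w[j] += err * xj
        grad_b += err
    n = len(xs)
    new_w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return new_w, b - lr * grad_b / n


def accuracy(w, b, xs, ys):
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if predict_proba(w, b, x) >= 0.5 else 0
        hits += pred == y
    return hits / len(xs)


def make_toy_data(n, seed=0):
    """Synthetic, linearly separable stand-in for a real dataset (assumption)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    ys = [1 if x[0] + x[1] > 0 else 0 for x in xs]
    return xs, ys


xs, ys = make_toy_data(500)
w, b = [0.0, 0.0], 0.0
initial_loss = log_loss(w, b, xs, ys)
for _ in range(200):
    w, b = full_batch_step(w, b, xs, ys, lr=0.5)
final_loss = log_loss(w, b, xs, ys)
acc = accuracy(w, b, xs, ys)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, accuracy {acc:.3f}")
```

The toy problem is far easier than HIGGS, so it says nothing about how SGD and full-batch methods compare there; it only shows what a step over the whole dataset looks like.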

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Machine Learning Tensorflow random-forest ML decision-trees
Post date: 3 Aug 2022
