A great paper and an important result.
However, it fails to cite the highly relevant SRBench paper from 2021, which also carefully curates a set of regression benchmarks and likewise shows that genetic programming approaches tend to outperform deep learning.
https://github.com/cavalab/srbench
cc u/optimalsolver
It occurs to me that a system trained on the peer-reviewed applied-machine-learning literature and Kaggle winners, which generates candidate structured feature-engineering specifications from plaintext descriptions of what each column means in the real world, should be considered a requisite part of the "meta" here.
Ah, and then you could treat the resulting feature-engineering-suggestion space as a hyperparameter to iterate over between experiments, which could be optimized with e.g. https://github.com/HIPS/Spearmint . The papers write themselves!
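A minimal sketch of the idea, using plain cross-validated search rather than Spearmint's Bayesian optimization: each candidate feature-engineering "spec" (the names here are hypothetical, standing in for what a generator might propose) is treated as one point in a discrete hyperparameter space and scored.

```python
# Illustrative only: search a small space of candidate feature-engineering
# specs as if they were hyperparameters, keeping the best-scoring one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Each "spec" is one candidate feature-engineering function.
specs = {
    "identity": lambda X: X,
    "with_squares": lambda X: np.hstack([X, X**2]),
    "with_pairwise_products": lambda X: np.hstack(
        [X] + [X[:, [i]] * X[:, [j]] for i in range(X.shape[1]) for j in range(i)]
    ),
}

scores = {
    name: cross_val_score(LogisticRegression(max_iter=1000), f(X), y, cv=3).mean()
    for name, f in specs.items()
}
best = max(scores, key=scores.get)
print(best, scores[best])
```

A Bayesian optimizer would replace the exhaustive loop with a model-guided choice of which spec to evaluate next, which matters once the suggestion space gets large.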
I can't explain it, but I help maintain TensorFlow Decision Forests [1] and Yggdrasil Decision Forests [2], and in an AutoML system at work that trains models on data from many different users, decision forest models get selected as best (after AutoML tries various model types and hyperparameters) somewhere between 20% and 40% of the time, consistently. It's pretty interesting. The other model types considered are NNs, linear models (with automatic feature-cross generation), and a couple of other variations.
[1] https://github.com/tensorflow/decision-forests
[2] https://github.com/google/yggdrasil-decision-forests
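A toy version of the model-selection loop described above (not the production AutoML system): try a few model families with a couple of hyperparameter settings each and keep whichever scores best on held-out data. With scikit-learn stand-ins:

```python
# Tiny AutoML-style sweep: several model families, pick best by validation score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

candidates = (
    [RandomForestClassifier(n_estimators=n, random_state=0) for n in (50, 200)]
    + [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 1.0)]
    + [MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)]
)

# Fit every candidate and keep the one with the best validation accuracy.
best = max(candidates, key=lambda m: m.fit(X_tr, y_tr).score(X_val, y_val))
print(type(best).__name__, best.score(X_val, y_val))
```

On real heterogeneous user data the winner varies per dataset, which is exactly the 20-40% decision-forest win rate the comment describes.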
Oh, you've touched on my favorite topic: whole-dataset training.
Take a look at [1] and go straight to page 8, figure 2(b).
[1] http://proceedings.mlr.press/v48/taylor16.pdf
The paper discusses whole-dataset training, and one of the datasets used is HIGGS [2]. Figure 2(b) compares two whole-dataset training approaches (L-BFGS and ADMM) against SGD. SGD basically tops out at the accuracy at which both whole-dataset approaches start.
[2] https://archive.ics.uci.edu/ml/datasets/HIGGS#
HIGGS is a strange dataset. It is narrow, with only 29 features. It is also relatively long, at about 11M samples (10M to train, 0.5M to validate, and the last 0.5M to test). It is also hard to get right with SGD.
But if you perform whole-dataset optimization, even logistic regression can get you good accuracy [3] (some experiments of mine).
[3] https://github.com/thesz/higgs-logistic-regression