[P] Which Machine Learning Classifiers are best for small datasets? An empirical study

Empirical_Study_of_Ensemble_Learning_Methods

Mentions: 1 | Stars: 10 | Activity: 0.0 | Language: R

Training ensemble machine learning classifiers, with flexible templates for repeated cross-validation and parameter tuning

I've made the same kind of graph before. In this image, each point is the average of the 5 out-of-fold accuracy scores from one trial of k-fold cross-validation (k = 5). I repeated the procedure 40 times to visualize the out-of-fold accuracy on the Wisconsin Diagnostic Breast Cancer data set (569 observations on 30 numeric variables). I evaluated 14 models for classification.
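The repeated cross-validation procedure described in this comment can be sketched roughly as follows. This is a minimal illustration, not the commenter's code: it uses only the Python standard library, synthetic two-class data instead of the breast cancer data set, and a nearest-centroid classifier as a stand-in for the 14 models. All function names (`make_data`, `fit_centroids`, `one_trial`, `repeated_cv`) are hypothetical, and the folds are shuffled but not stratified.

```python
import random
import statistics

def make_data(n=200, seed=0):
    """Synthetic two-class data (illustrative only, not the real data set)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        # Feature 0 separates the classes; feature 1 is pure noise.
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def one_trial(X, y, k, rng):
    """One trial of k-fold CV: the mean of the k out-of-fold accuracies."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return statistics.fmean(scores)

def repeated_cv(X, y, k=5, repeats=40, seed=1):
    """Repeat k-fold CV with a fresh shuffle each time; one score per trial."""
    rng = random.Random(seed)
    return [one_trial(X, y, k, rng) for _ in range(repeats)]

X, y = make_data()
trial_means = repeated_cv(X, y)  # 40 values, one plotted point per trial
print(len(trial_means), min(trial_means), max(trial_means))
```

Plotting the 40 values in `trial_means` for each model side by side gives the kind of graph the comment describes: the spread of the points shows how much the accuracy estimate depends on the particular fold assignment.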

pyGAM

Mentions: 2 | Stars: 838 | Activity: 2.6 | Language: Python

[HELP REQUESTED] Generalized Additive Models in Python

Ah they haven't quite gotten around to supporting multiclass classification yet! https://github.com/dswah/pyGAM/pull/213

mljar-supervised

Mentions: 51 | Stars: 2,929 | Activity: 8.5 | Language: Python

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

What machine have you used for comparison? I would like to check the performance of AutoML that I'm working on.

optuna

Mentions: 34 | Stars: 9,640 | Activity: 9.9 | Language: Python

A hyperparameter optimization framework

I don't know anything about scikit-optimize. Optuna doesn't have less constrained (unbounded) parameter distributions such as normal or log-normal, which are useful when approaching a new problem. It also doesn't implement the constant liar algorithm for TPE. The latter is easy to fix; the former can be worked around if you carefully observe the ranges of good parameters and do a re-run or two.
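The "observe the ranges and re-run" workaround mentioned in this comment might look something like the sketch below. It is an assumption-laden illustration: plain log-uniform random search stands in for Optuna's TPE sampler (no Optuna calls are used), the `objective` is a made-up loss whose minimum lies outside the first search range, and the names `log_uniform_search` and `refine_bounds` are hypothetical.

```python
import math
import random

def objective(lr):
    """Hypothetical loss, minimal at lr = 1e-4 (stands in for a validation loss)."""
    return (math.log10(lr) + 4.0) ** 2

def log_uniform_search(low, high, n_trials, rng):
    """Sample values log-uniformly from [low, high]; return (loss, value) best-first."""
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(math.log10(low), math.log10(high))
        trials.append((objective(lr), lr))
    return sorted(trials)

def refine_bounds(trials, low, high, top=5, factor=100.0):
    """Derive the next search range from where the best trials landed."""
    best = [value for _, value in trials[:top]]
    new_low, new_high = min(best), max(best)
    # If the best values sit within `factor` of an edge, the optimum may lie
    # beyond it, so extend that edge outward instead of trimming it.
    if new_low <= low * factor:
        new_low = low / factor
    if new_high >= high / factor:
        new_high = high * factor
    return new_low, new_high

rng = random.Random(0)

# First run: a guessed range that happens to miss the optimum at 1e-4.
low1, high1 = 1e-3, 1.0
run1 = log_uniform_search(low1, high1, 30, rng)

# Inspect the good trials, adjust the bounds, and re-run.
low2, high2 = refine_bounds(run1, low1, high1)
run2 = log_uniform_search(low2, high2, 30, rng)

print("run 1 best loss:", run1[0][0], "range:", (low1, high1))
print("run 2 best loss:", run2[0][0], "range:", (low2, high2))
```

With an unbounded distribution the first run could have wandered below the guessed lower bound on its own; with bounded ranges, that step has to be done by hand between runs, which is the cost the comment is pointing at.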

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
