[P] Which Machine Learning Classifiers are best for small datasets? An empirical study

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • Empirical_Study_of_Ensemble_Learning_Methods

    Training ensemble machine learning classifiers, with flexible templates for repeated cross-validation and parameter tuning

  • I've actually made the same kind of graph before. In this image, each point is the average out-of-fold accuracy over the 5 folds of one trial of k-fold cross-validation. I repeated the procedure 40 times to visualize the out-of-fold accuracy on the Wisconsin diagnostic breast cancer data set (560 observations on 30 numeric variables). I evaluated 14 models for classification:
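
The procedure described in this comment (5-fold cross-validation, repeated 40 times, one averaged accuracy per trial) can be sketched roughly as follows. This is a minimal illustration, not the commenter's code: it assumes scikit-learn, uses the copy of the Wisconsin diagnostic data bundled with scikit-learn (which has 569 observations), and substitutes a single logistic-regression pipeline for the 14 models, which are not listed on this page.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

N_SPLITS = 5    # folds per trial, as in the comment
N_REPEATS = 40  # number of repeated trials, as in the comment

# scikit-learn's bundled copy of the Wisconsin diagnostic data set
X, y = load_breast_cancer(return_X_y=True)

# Stand-in model (assumption): the models evaluated in the post are not listed here
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each repetition reshuffles the data and produces a fresh 5-fold split
cv = RepeatedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=0)
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Splits are generated one repetition at a time, so each row is one trial
trial_means = fold_scores.reshape(N_REPEATS, N_SPLITS).mean(axis=1)

print(f"{len(trial_means)} trials, "
      f"mean accuracy {trial_means.mean():.3f}, "
      f"std across trials {trial_means.std(ddof=1):.4f}")
```

Plotting `trial_means` for each candidate model (for example as a strip plot or box plot) would give one point per trial, which is the kind of graph the comment describes.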

  • pyGAM

    [HELP REQUESTED] Generalized Additive Models in Python

  • Ah, they haven't quite gotten around to supporting multiclass classification yet! https://github.com/dswah/pyGAM/pull/213

  • mljar-supervised

    Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

  • What machine have you used for comparison? I would like to check the performance of AutoML that I'm working on.

  • optuna

    A hyperparameter optimization framework

  • I don't know anything about scikit-optimize. Optuna doesn't offer less-constrained parameter distributions such as normal/log-normal, which are useful when approaching a new problem. It also doesn't implement the constant liar algorithm for TPE. The latter is easy to fix; the former can be worked around if you carefully observe the ranges of good parameters and do a re-run or two.
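
To make the distinction in the last comment concrete, here is a small standard-library sketch contrasting a bounded log-uniform prior with an unbounded log-normal one, followed by the "observe the good range and re-run" workaround. It does not use Optuna or any other tuning library; the learning-rate bounds, the toy objective, and the `refit_bounds` helper are all hypothetical and chosen only for illustration.

```python
import math
import random

rng = random.Random(0)

def sample_log_uniform(low, high):
    """Bounded prior: draws are spread evenly in log space between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_log_normal(median, sigma):
    """Less constrained prior: centered on `median`, with no hard bounds."""
    return rng.lognormvariate(math.log(median), sigma)

# Hypothetical learning-rate search; the bounds and median are illustrative guesses
bounded = [sample_log_uniform(1e-4, 1e-1) for _ in range(1000)]
unbounded = [sample_log_normal(1e-2, 2.0) for _ in range(1000)]

print(f"log-uniform range: {min(bounded):.2e} .. {max(bounded):.2e}")
print(f"log-normal range:  {min(unbounded):.2e} .. {max(unbounded):.2e}")

def toy_score(lr, best_lr=3e-3):
    """Toy objective (assumption): highest when lr is close to best_lr."""
    return -(math.log10(lr) - math.log10(best_lr)) ** 2

def refit_bounds(values, scores, top_k=10, margin=2.0):
    """Hypothetical helper: widen the span of the top_k trials by `margin`
    on each side to get bounds for a follow-up run."""
    ranked = sorted(zip(scores, values), reverse=True)
    best = [value for _, value in ranked[:top_k]]
    return min(best) / margin, max(best) * margin

scores = [toy_score(lr) for lr in bounded]
new_low, new_high = refit_bounds(bounded, scores)
print(f"bounds for the re-run: {new_low:.2e} .. {new_high:.2e}")
```

With a bounded prior, a poor initial guess for the range can hide the good region entirely, which is why the comment suggests inspecting the first run and re-running with adjusted bounds. An unbounded prior avoids that at the cost of occasionally sampling extreme values.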

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
