What’s your approach to highly imbalanced data sets?

This page summarizes the projects mentioned and recommended in the original post on /r/datascience

  • ydata-synthetic

    Synthetic data generators for tabular and time-series data

  • There's a plethora of undersampling and oversampling techniques you can try out. To avoid removing information from the dataset, you can focus on oversampling. You can try imbalanced-learn or smote-variants. Given enough data, using fully synthetic data is also an option; you can check ydata-synthetic for that. Let us know how it turns out!

  • general_class_balancer

    Data matching algorithm for categorical and continuous variables

  • Multivariate data matching. I wrote a function to do this in grad school: https://github.com/mleming/general_class_balancer

  • deodel

    A mixed attributes predictive algorithm implemented in Python.

  • Just to mention that there is also a newer algorithm that is claimed to be robust to class imbalance. A Python implementation is available at: https://github.com/c4pub/deodel

  • imbalanced-learn

    A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

  • There's a plethora of undersampling and oversampling techniques you can try out. To avoid removing information from the dataset, you can focus on oversampling. You can try imbalanced-learn or smote-variants. Given enough data, using fully synthetic data is also an option; you can check ydata-synthetic for that. Let us know how it turns out!
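
To make the oversampling idea from the comments concrete, here is a minimal sketch of plain random oversampling using only the Python standard library. It is not the API of imbalanced-learn, smote-variants, or ydata-synthetic; the function name `random_oversample` and the toy data are hypothetical, chosen only for illustration. Random oversampling just duplicates existing minority rows, whereas SMOTE-style methods generate new interpolated points.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen rows of each smaller
    class until every class is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from the full original data so no information is removed.
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement to fill the gap up to the majority size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (made up): four samples of class 0, one sample of class 1.
rows = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.4, 0.8], [5.0, 5.0]]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Whatever oversampling method is used, it is generally applied to the training split only, so that duplicated or synthetic rows do not leak into the evaluation data.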

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
