[P] We are building a curated list of open source tooling for data-centric AI workflows, looking for contributions.

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning.

  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

  • How about cleanlab? It works for any data you can train a classifier or get embeddings on (text, tabular, image, audio, etc). We just released some new features as well. Currently, cleanlab can automatically:

  • refinery

    The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.

  • You definitely forgot https://www.kern.ai/ :)

  • grape

    🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations (by AnacletoLAB)

  • For graph embeddings, there are quite a few. I'd recommend this one, but there's also this one (disclaimer: I'm the author) or this one, which is more of a DGL library.

  • nodevectors

    Fastest network node embeddings in the west

  • dgl

    Python package built to ease deep learning on graphs, on top of existing DL frameworks.

  • optuna

    A hyperparameter optimization framework

  • Keras Tuner, Optuna (https://github.com/optuna/optuna)?

  • awesome-production-machine-learning

    A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning

  • There is a cool, gigantic list for MLOps that I can recommend: https://github.com/EthicalML/awesome-production-machine-learning

  • dcai-lab

    Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻

  • Thanks for the kind words! Make sure to check out the current open MIT course if you are just starting out: https://dcai.csail.mit.edu/

  • snorkel

    A system for quickly generating training data with weak supervision

  • The paid product came out of an open source tool: https://github.com/snorkel-team/snorkel

  • deodel

    A mixed attributes predictive algorithm implemented in Python.

  • The deodel classifier can act as a quick dataset evaluation tool. If your data is available in table format, you can check its potential for prediction/classification by feeding it to deodel. It accepts mixed attributes without any preliminary curation: attribute values expressed as floats (dot decimal) are treated as continuous, and everything else as categorical. It even accepts a mix of continuous and categorical values within the same attribute column.
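
The float-versus-categorical rule described in the comment above can be illustrated with a small standalone sketch. This is not deodel's code or API: the helper names `classify_value` and `summarize_column` are hypothetical, and the sketch assumes that "expressed as floats (dot decimal)" means either a Python float or a string that contains a dot and parses as a number. Integers and all other values are assumed to be categorical here.

```python
from typing import Any, Dict, List


def classify_value(value: Any) -> str:
    """Label one table cell as 'continuous' or 'categorical'.

    Assumption (not taken from deodel itself): only Python floats and
    dot-decimal strings such as "2.75" count as continuous.
    """
    if isinstance(value, float):
        return "continuous"
    if isinstance(value, str) and "." in value:
        try:
            float(value)
        except ValueError:
            # Contains a dot but is not a number, e.g. "n.a."
            return "categorical"
        return "continuous"
    return "categorical"


def summarize_column(values: List[Any]) -> Dict[str, int]:
    """Count continuous and categorical cells in a single column."""
    counts = {"continuous": 0, "categorical": 0}
    for value in values:
        counts[classify_value(value)] += 1
    return counts


# A single column mixing both kinds of values, as the comment describes.
column = [1.5, "2.75", "red", 3, "blue", "n.a."]
summary = summarize_column(column)
print(summary)
```

A per-value check like this is one way a column can hold both kinds of values without a cleaning pass beforehand; how deodel actually handles such columns internally may differ.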

  • BotLibre

    An open platform for artificial intelligence, chat bots, virtual agents, social media automation, and live chat automation.
