[P] We are building a curated list of open source tooling for data-centric AI workflows, looking for contributions.

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/MachineLearning

  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    How about cleanlab? It works for any data you can train a classifier on or get embeddings for (text, tabular, image, audio, etc.). We just released some new features as well. Currently, cleanlab can automatically:

  • refinery

    The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.

    You definitely forgot https://www.kern.ai/ :)

  • grape

    🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations (by AnacletoLAB)

    For graph embeddings, there are quite a few. I'd recommend this one, but there's also this one (disclaimer: I'm the author) or this one, more of a DGL library.

  • nodevectors

    Fastest network node embeddings in the west

    For graph embeddings, there are quite a few. I'd recommend this one, but there's also this one (disclaimer: I'm the author) or this one, more of a DGL library.

  • dgl

    Python package built to ease deep learning on graph, on top of existing DL frameworks.

    For graph embeddings, there are quite a few. I'd recommend this one, but there's also this one (disclaimer: I'm the author) or this one, more of a DGL library.

  • optuna

    A hyperparameter optimization framework

    Keras Tuner, Optuna (https://github.com/optuna/optuna)?

  • awesome-production-machine-learning

    A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning

    There is a cool, gigantic list for MLOps that I can recommend: https://github.com/EthicalML/awesome-production-machine-learning

  • dcai-lab

    Lab assignments for Introduction to Data-Centric AI, MIT IAP 2023 👩🏽‍💻

    Thanks for the kind words! Make sure to check out the current open MIT course if you are just starting out: https://dcai.csail.mit.edu/

  • snorkel

    A system for quickly generating training data with weak supervision

    The paid product came out of an open source tool: https://github.com/snorkel-team/snorkel

  • deodel

    A mixed attributes classifier algorithm implemented in Python.

    The deodel classifier can act as a quick dataset evaluation tool. If your data is available in table format, you can check its potential for prediction/classification by feeding it to deodel. It accepts mixed attributes without any preliminary curation: attribute values expressed as floats (dot decimal) are treated as continuous, and a single attribute column may even mix continuous and categorical values.

  • BotLibre

    An open platform for artificial intelligence, chat bots, virtual agents, social media automation, and live chat automation.
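
The deodel entry above says that values written as dot-decimal floats are treated as continuous, and that one column may mix continuous and categorical values. The following is a minimal sketch of that idea only, not deodel's actual code or API: the function names `is_continuous` and `split_column` are hypothetical, and treating integer-looking strings such as "7" as categorical is an assumption made for illustration.

```python
def is_continuous(value):
    """Return True if the value looks like a dot-decimal float.

    Assumption for this sketch: only Python floats and strings that
    contain a dot and parse as a float count as continuous.
    """
    if isinstance(value, bool):
        return False
    if isinstance(value, float):
        return True
    if isinstance(value, str):
        text = value.strip()
        if "." not in text:
            return False
        try:
            float(text)
        except ValueError:
            return False
        return True
    return False


def split_column(values):
    """Split one mixed column into continuous and categorical values."""
    continuous, categorical = [], []
    for value in values:
        if is_continuous(value):
            continuous.append(float(value))
        else:
            categorical.append(value)
    return continuous, categorical


# A single attribute column mixing both kinds of values.
column = ["3.5", "red", 2.0, "7", "n/a", "0.25"]
continuous, categorical = split_column(column)
print(continuous)
print(categorical)
```

A real classifier would then have to decide how to compare the two kinds of values; that step is left out here because the post does not describe it.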

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
