Ask HN: Data Management for AI Training

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • dvc

    🦉 ML Experiments and Data Management with Git

  • * User interface for less tech-savvy people (e.g. just a git-like command line is fine for engineers but not for field personnel who are not in IT)

    I know of tools like https://dvc.org/ but a) they are just layers on top of git, b) they break apart on huge datasets without a folder hierarchy (git tree objects just don't work for linear lists of items), c) they are only usable by IT personnel, and d) they require checking out at least a part of the dataset.

    Our datasets would be 100,000,000 x 100 MB = 10 PB of raw data. Training data should be delivered to training nodes via the network, etc. We just can't have a full checkout of that data...
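
The size estimate above can be checked with a short back-of-the-envelope sketch. The item count and item size come from the post; the decimal (SI) unit convention and the 10 Gbit/s link speed are assumptions added here only to illustrate why a full checkout is impractical.

```python
# Back-of-the-envelope check of the dataset size quoted in the post.
# Assumption: decimal (SI) units, i.e. 1 PB = 10**9 MB = 10**15 bytes.
ITEM_COUNT = 100_000_000       # number of items, from the post
ITEM_SIZE_MB = 100             # size of one item in MB, from the post
MB_PER_PB = 1_000_000_000      # 10**9 MB per PB (SI prefixes)
SECONDS_PER_DAY = 86_400


def total_petabytes(item_count: int, item_size_mb: int) -> float:
    """Return the raw dataset size in petabytes."""
    return item_count * item_size_mb / MB_PER_PB


def full_copy_days(total_pb: float, link_gbit_per_s: float) -> float:
    """Days needed to copy the whole dataset once over a saturated link.

    The link speed is a hypothetical parameter, not a figure from the post.
    """
    total_bits = total_pb * 1e15 * 8
    seconds = total_bits / (link_gbit_per_s * 1e9)
    return seconds / SECONDS_PER_DAY


size_pb = total_petabytes(ITEM_COUNT, ITEM_SIZE_MB)
print(f"raw data: {size_pb} PB")
print(f"one full copy at 10 Gbit/s: {full_copy_days(size_pb, 10.0):.1f} days")
```

Under these assumptions a single full copy would take roughly three months per node, which is consistent with the poster's point that data has to be streamed to training nodes rather than checked out.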

  • oxen-release

    Lightning fast data version control system for structured and unstructured machine learning datasets. We aim to make versioning datasets as easy as versioning code.

  • We have been working on a data version control tool called Oxen that is tackling many of your needs. Feel free to check it out here:

    https://github.com/Oxen-AI/oxen-release#-oxen

    Going down your list of requirements, Oxen has:

    * Data versioning, similar paradigm to git, but built from the ground up for large ML datasets

  • dolt

    Dolt – Git for Data

  • If you are just looking for data versioning there is Dolt:

    https://github.com/dolthub/dolt

    And that has a user-friendly UI in DoltHub:

    https://www.dolthub.com/

    You wouldn't store the images themselves in Dolt; those would likely be links to S3, but all the labels and surrounding metadata could be stored in Dolt.

    DISCLAIMER: I'm the CEO of DoltHub so this is self-promotion.
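
The layout suggested in this comment (images in object storage, labels and metadata in a versioned SQL table) might look roughly like the sketch below. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL database, not Dolt itself, and the table name, column names, bucket, and sample rows are all hypothetical.

```python
import sqlite3

# Stand-in only: an in-memory SQLite database, NOT Dolt.
# Table name, column names, bucket name, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE image_labels (
        image_id TEXT PRIMARY KEY,
        s3_uri   TEXT NOT NULL,   -- link to the image, not the image bytes
        label    TEXT NOT NULL
    )
    """
)

rows = [
    ("img-0001", "s3://example-bucket/images/img-0001.png", "cat"),
    ("img-0002", "s3://example-bucket/images/img-0002.png", "dog"),
]
conn.executemany("INSERT INTO image_labels VALUES (?, ?, ?)", rows)
conn.commit()


def uris_for_label(connection: sqlite3.Connection, label: str) -> list:
    """Return the storage links of all images carrying the given label."""
    cursor = connection.execute(
        "SELECT s3_uri FROM image_labels WHERE label = ? ORDER BY image_id",
        (label,),
    )
    return [row[0] for row in cursor.fetchall()]


print(uris_for_label(conn, "cat"))
```

Only the small metadata rows would be versioned in this arrangement; a training node would resolve the links and fetch the image bytes from object storage on demand.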

