* User interface for less tech-savvy people (e.g. a git-like command line is fine for engineers but not for field personnel who are not in IT)
I know of tools like https://dvc.org/ but they a) are just layers on top of git, b) break apart on huge datasets without a folder hierarchy (git tree objects just don't work for linear lists of items), c) are only usable by IT personnel, and d) require checking out at least a part of the dataset.
Our datasets would be 100,000,000 x 100 MB = 10 PB of raw data. Training data should be delivered to training nodes via network etc. We just can't have a full checkout of that data...
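As a quick sanity check on the size estimate above, here is a minimal sketch of the arithmetic. It assumes decimal (SI) units, i.e. 1 MB = 10^6 bytes and 1 PB = 10^15 bytes; the variable names are only illustrative.

```python
# Back-of-the-envelope dataset size, assuming decimal (SI) units.
NUM_ITEMS = 100_000_000   # number of items, as stated in the post
ITEM_SIZE_MB = 100        # size of one item in megabytes, as stated in the post

MB = 10**6                # bytes per megabyte (assumption: SI, not MiB)
PB = 10**15               # bytes per petabyte (assumption: SI, not PiB)

total_bytes = NUM_ITEMS * ITEM_SIZE_MB * MB
total_pb = total_bytes / PB

print(f"{total_pb:.0f} PB of raw data")
```

With binary units (MiB/PiB) the figure would come out slightly lower, but the order of magnitude is the same.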
We have been working on a data version control tool called Oxen that is tackling many of your needs. Feel free to check it out here:
https://github.com/Oxen-AI/oxen-release#-oxen
Going down your list of requirements, Oxen has:
* Data versioning, similar paradigm to git, but built from the ground up for large ML datasets
If you are just looking for data versioning there is Dolt:
https://github.com/dolthub/dolt
And that has a user-friendly UI in DoltHub:
https://www.dolthub.com/
You wouldn't store the images themselves in Dolt; those would likely be links to S3, but all the labels and surrounding metadata could be stored in Dolt.
DISCLAIMER: I'm the CEO of DoltHub so this is self-promotion.
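The layout suggested in this reply (image bytes in S3, labels and metadata in a versioned SQL table) could look roughly like the sketch below. This is not Dolt's API: it uses Python's built-in sqlite3 purely as a stand-in for a SQL database, and the table name, column names, bucket name, and helper function are all hypothetical.

```python
import sqlite3

# In-memory SQLite is only a stand-in here; Dolt itself is not used.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per image, pointing at the object in S3.
conn.execute(
    """
    CREATE TABLE images (
        image_id INTEGER PRIMARY KEY,
        s3_uri   TEXT NOT NULL,  -- link to the image bytes, stored outside the table
        label    TEXT NOT NULL,  -- annotation for the image
        source   TEXT            -- example of surrounding metadata
    )
    """
)

# Made-up example rows; the bucket and paths are placeholders.
rows = [
    (1, "s3://example-bucket/images/000001.png", "cat", "camera-a"),
    (2, "s3://example-bucket/images/000002.png", "dog", "camera-b"),
]
conn.executemany("INSERT INTO images VALUES (?, ?, ?, ?)", rows)


def uris_for_label(connection, label):
    """Return the S3 URIs of all images carrying the given label."""
    cur = connection.execute(
        "SELECT s3_uri FROM images WHERE label = ? ORDER BY image_id",
        (label,),
    )
    return [row[0] for row in cur.fetchall()]


cat_uris = uris_for_label(conn, "cat")
print(cat_uris)
```

A training node would then query only the small metadata table and fetch the listed objects over the network, so nothing requires a full checkout of the raw data.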