A Critical Field Guide for Working with Machine Learning Datasets

qdrant

Mentions: 139 | Stars: 17,718 | Activity: 9.9 | Language: Rust

Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

If you want to perform vector search over your data, then Qdrant (https://qdrant.tech) is worth checking out.

oxen-release

Mentions: 22 | Stars: 826 | Activity: 9.1 | Language: Python

Lightning fast data version control system for structured and unstructured machine learning datasets. We aim to make versioning datasets as easy as versioning code.

We've been working on an open-source tool called Oxen to help store large ML datasets. It's optimized for large sets of unstructured data (images, video, audio, text) as well as Parquet- or Arrow-style DataFrames.
Would love to get some feedback on it!
https://github.com/Oxen-AI/oxen-release#-oxen

Milvus

Mentions: 104 | Stars: 26,645 | Activity: 10.0 | Language: Go

A cloud-native vector database, storage for next generation AI applications

Vector databases in general are good for storing large amounts of unstructured data by first converting it into embeddings via ML models. There are also feature stores, which store and organize features for later use in model training or predictive analytics. Feature stores generally come in _before_ models get trained, while vector databases generally come _after_ (i.e., they use trained models).
Milvus (https://milvus.io) and Feast (https://feast.dev/) are two of the best-known vector databases and feature stores, respectively.
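
To make the before/after distinction a bit more concrete, here is a rough Python sketch using only the standard library. It is not how Milvus or Feast work internally and it uses neither project's API. The names `toy_embed`, `ToyVectorStore`, `VOCAB`, `DOCS`, and `feature_store` are all hypothetical, and the bag-of-words "embedding" is just a stand-in for a real trained model.

```python
import math
from collections import Counter

# --- Feature store side (used BEFORE training) ---
# Hypothetical toy: precomputed features keyed by entity ID,
# looked up later to assemble training rows.
feature_store = {
    "user_1": {"age": 34, "purchases": 5},
    "user_2": {"age": 27, "purchases": 12},
}
training_row = feature_store["user_2"]

# --- Vector database side (used AFTER a model exists) ---
def toy_embed(text, vocab):
    """Stand-in for a trained embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Brute-force nearest-neighbor search; real vector databases use approximate indexes."""
    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self._items.append((doc_id, vector))

    def search(self, query, k=1):
        ranked = sorted(self._items, key=lambda item: cosine(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

VOCAB = ["cat", "dog", "stock", "market"]
DOCS = {"pets": "cat dog cat", "finance": "stock market stock"}

store = ToyVectorStore()
for doc_id, text in DOCS.items():
    store.add(doc_id, toy_embed(text, VOCAB))

result = store.search(toy_embed("dog", VOCAB), k=1)
print(result)
```

The feature lookup feeds data into training, while the vector search only makes sense once something (here `toy_embed`) can turn raw text into vectors.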

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives, so a higher number indicates a more popular project.
