A Critical Field Guide for Working with Machine Learning Datasets

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • qdrant

    Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

  • If you want to perform vector search over your data, then Qdrant (https://qdrant.tech) is worth checking out.

  • oxen-release

    Lightning fast data version control system for structured and unstructured machine learning datasets. We aim to make versioning datasets as easy as versioning code.

  • We've been working on an open source tool called Oxen to help store large ML datasets. It's optimized for large sets of unstructured data (images, video, audio, text) as well as Parquet- or Arrow-style DataFrames.

    Would love to get some feedback on it!

    https://github.com/Oxen-AI/oxen-release#-oxen

  • Milvus

    A cloud-native vector database, storage for next generation AI applications

  • Vector databases in general are good for storing large amounts of unstructured data by first converting them into embeddings via ML models. There are also feature stores, which store and organize features for later use in model training or predictive analytics. Feature stores generally come in _before_ models get trained, while vector databases generally come _after_ (i.e., they use trained models).

    Milvus (https://milvus.io) and Feast (https://feast.dev/) are two of the best-known vector databases and feature stores, respectively.
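
To make the vector database idea above concrete, here is a minimal sketch in plain Python: items are converted to vectors, stored, and then retrieved by similarity to a query vector. Everything here is an illustrative assumption: `embed` is a toy bag-of-words stand-in for a trained embedding model, and `TinyVectorIndex` is a hypothetical in-memory class. It does not use the Milvus or Qdrant APIs, and real systems use approximate nearest-neighbor indexes rather than the brute-force scan shown.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a trained embedding model (assumption):
    a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorIndex:
    """Hypothetical in-memory index: stores (id, vector) pairs and
    answers queries with a brute-force similarity scan."""

    def __init__(self):
        self._items = []

    def add(self, doc_id: str, text: str) -> None:
        # Embed once at insert time; only the vector is kept.
        self._items.append((doc_id, embed(text)))

    def search(self, query: str, k: int = 1) -> list:
        # Embed the query with the same model, then rank by similarity.
        q = embed(query)
        scored = [(cosine(q, vec), doc_id) for doc_id, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


index = TinyVectorIndex()
index.add("cats", "a photo of a cat sleeping on a sofa")
index.add("dogs", "a video of a dog chasing a ball")
index.add("audio", "an audio clip of rain on a window")

top = index.search("dog with a ball", k=1)
print(top)
```

The key design point is that the same embedding function is applied to both stored items and queries, which is why a vector database depends on an already trained model.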

