Easy way to load, create, version, query and visualize computer vision datasets

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • Activeloop Hub

    Discontinued Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake] (by activeloopai)

  • Hi HN,

    In machine learning, we are faced with tensor-based computations (that's the language that ML models think in). I've recently discovered a project that helps you make it much easier to set up and conduct machine learning projects, and enables you to create and store datasets in deep learning-native format.

    Hub by Activeloop (https://github.com/activeloopai/Hub) is an open-source Python package that arranges data in Numpy-like arrays. It integrates smoothly with deep learning frameworks such as TensorFlow and PyTorch for faster GPU processing and training. In addition, one can update the data stored in the cloud, create machine learning pipelines using Hub API and interact with datasets (e.g. visualize) in Activeloop platform (https://app.activeloop.ai). The real benefit for me is that, I can stream my datasets without the need to store them on my machine (my datasets can be up to 10GB+ big, but it works just as well with 100GB+ datasets like ImageNet (https://docs.activeloop.ai/datasets/imagenet-dataset), for instance).

    Hub allows us to store images, audio, video data in a way that can be accessed at lightning speed. The data can be stored on GCS/S3 buckets, local storage, or on Activeloop cloud. The data can directly be used in the training TensorFlow/ PyTorch models so that you don't need to set up data pipelines. The package also comes with data version control, dataset search queries, and distributed workloads.

    For me, personally the simplicity of the API stands out, for instance:

    Loading datasets in seconds

      import hub ds = hub.load("hub://activeloop/cifar10-train")

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts