Launch HN: Activeloop (YC S18) – Data lake for deep learning

    Hi HN, I'm Davit, the CEO of Activeloop (https://activeloop.ai). We've made a “data lake” (industry jargon for a large data store with lots of heterogeneous data) that’s optimized for deep learning. Keeping your data in an AI-optimized format means you can ship AI models faster, without having to build complex data infrastructure for image, audio, and video data (check out our GitHub here: https://github.com/activeloopai/deeplake).

    Deep Lake stores complex data such as images, audio, videos, annotations/labels, and tabular data, in the form of tensors—a type of data structure used in linear algebra, which AI systems like to consume.
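
    For a feel of the API, here is a minimal sketch using the open-source Python package (the dataset path, tensor names, and class names are placeholders, and exact signatures may differ between releases):

        import deeplake

        # Create a dataset; each column ("tensor") is typed via an htype,
        # which tells Deep Lake how to encode and compress the data.
        ds = deeplake.empty("s3://my-bucket/my-dataset")  # or a local path

        ds.create_tensor("images", htype="image", sample_compression="jpeg")
        ds.create_tensor("labels", htype="class_label", class_names=["cat", "dog"])

        # deeplake.read lazily reads the file; it is stored as a compressed tensor.
        with ds:
            ds.append({"images": deeplake.read("cat.jpg"), "labels": 0})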

    We then rapidly stream the data into three destinations: (a) a SQL-like language (Tensor Query Language) that you can use to query your data; (b) an in-browser engine that you can use to visualize your data; and (c) deep learning frameworks, letting you do AI magic on your data while fully utilizing your GPUs. Here’s a 10-minute demo: https://www.youtube.com/watch?v=SxsofpSIw3k&t.
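
    In code, destinations (a) and (c) look roughly like this (mnist-train is one of our public datasets; the query string is illustrative TQL, and the closed-source engines sit behind these Python calls):

        import deeplake

        ds = deeplake.load("hub://activeloop/mnist-train")

        # (c) Stream into PyTorch: loader workers fetch and decompress
        # batches asynchronously so the GPU stays busy.
        loader = ds.pytorch(batch_size=64, num_workers=4, shuffle=True)
        for batch in loader:
            images, labels = batch["images"], batch["labels"]
            # ... forward/backward pass on the GPU ...

        # (a) Tensor Query Language, via the Python interface:
        fives = ds.query("select * where labels == 5")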

    Back in 2016, I started my Ph.D. research in deep learning and witnessed datasets grow from gigabytes to terabytes, and then to petabytes. To run our models at scale, we needed to rethink how we handled data. One of the ways we optimized our workflows was to stream the data while asynchronously running the computation on GPUs. This served as the inspiration for creating Activeloop.

    When you want to use unstructured data for deep learning purposes, you’ll encounter the following options:

    - Storing metadata (pointers to the unstructured data) in a regular database and the images themselves in object storage. Querying the metadata table and then fetching images one by one from object storage is inefficient for high-throughput workloads (see the sketch after this list).

    - Storing images inside the database itself. This typically explodes storage costs: for example, storing images in MongoDB and using them to train a model would cost 20x more than an equivalent Deep Lake setup [2].

    - Extending Parquet or Arrow to store images. On the plus side, you can keep using existing analytical tools such as Spark, Kafka, and even DuckDB, but even major self-driving car companies have failed on this path.

    - Building custom in-house infrastructure aligned with your data. Assuming you have the money and access to 10 solid data engineers with Ph.D.-level knowledge, this still takes time (~2.5+ years), is difficult to extend beyond the initial vertical, is hard to maintain, and defocuses your data scientists.

    Whatever the case, you'll get slow iteration cycles, under-utilized GPUs, and lots of ML engineer busywork (thus high costs).
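
    To make the first option's bottleneck concrete, here is a sketch of that pattern (the bucket name and table schema are hypothetical): every sample costs a separate round-trip to object storage, and at training throughput those GETs dominate while the GPU sits idle.

        import sqlite3

        import boto3
        from torch.utils.data import Dataset

        s3 = boto3.client("s3")

        class MetadataS3Dataset(Dataset):
            """Option 1: metadata in a database, images in object storage."""

            def __init__(self, db_path, bucket):
                self.rows = sqlite3.connect(db_path).execute(
                    "SELECT s3_key, label FROM samples").fetchall()
                self.bucket = bucket

            def __len__(self):
                return len(self.rows)

            def __getitem__(self, idx):
                key, label = self.rows[idx]
                # One networked GET per sample; no batching, no prefetching.
                body = s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()
                return body, label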

    Your unstructured data already sits in a data lake such as S3 or a distributed file system (e.g., Lustre), and you probably don’t want to change this. Deep Lake keeps everything that makes a regular data lake great: it helps you version-control your data, run SQL-like queries, ingest billions of rows efficiently, and visualize terabyte-scale datasets in your browser or notebook. But there is one key difference from traditional data lakes: we store complex data, such as images, audio, videos, annotations/labels, and tabular data, in a tensorial form that is optimized for deep learning and GPU utilization.
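
    The version-control part works git-style on the data itself; here is a sketch with the open-source API (the dataset path is a placeholder):

        import deeplake

        ds = deeplake.load("s3://my-bucket/my-dataset")

        ds.commit("initial ingest")
        ds.checkout("cleanup", create=True)   # branch off, like git
        ds.labels[0] = 1                      # fix a bad annotation in place
        ds.commit("relabel sample 0")
        ds.log()                              # lineage: print the commit history
        ds.checkout("main")                   # the original data is untouched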

    Some stats/benchmarks since our launch:

    - In a third-party benchmark by Yale University [3], Deep Lake provided the fastest data loader for PyTorch, especially when it comes to networked loading;

    - Deep Lake handles scale and long distances: we trained a 1B-parameter CLIP model on 16x A100 GPUs on a single machine, on the LAION-400M dataset, streaming the data from US-EAST (AWS) to US-CENTRAL (GCP) [4] [5];

    - You can access datasets as large as 200M image-text pairs in seconds (compared to the 100+ hours it takes via traditional methods) with one line of code [6].
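
    That one line is below (the dataset path comes from our docs; loading is lazy, so samples are fetched only when you index into them):

        import deeplake

        ds = deeplake.load("hub://activeloop/coco-train")  # seconds, not hours

        print(len(ds))               # metadata is available immediately
        img = ds.images[0].numpy()   # only now is this single image fetched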

    What's free and what's not: the data format, the Python dataloader, version control, and data lineage (a log of how the data came to its current state) with the Python API are open-source [7]. The query language, fast streaming, and visualization engines are built in C++ and are closed-source for the time being, but are accessible via a Python interface. Users can store up to 300GB of their data with us for free. Our growth plan is $995/month and includes an optimized query engine, the fast data loader, and features like analytics. If you're an academic, you can get this plan for free. Finally, we have an enterprise plan including role-based access control, security, integrations, and more than 10 TB of managed data [8].

    Teams at Intel, Google, & MILA use Deep Lake. If you want to read more, we have an enterprise-y whitepaper at https://www.deeplake.ai/whitepaper, an academic paper at https://arxiv.org/abs/2209.10785, and a launch blog post with a deep dive into the features at https://www.activeloop.ai/resources/introducing-deep-lake-th....

    I would love to hear your thoughts on this, especially anything about how you manage your deep learning data and what issues you run into with your infra. I look forward to all your comments. Thanks a lot!

    [1] https://www.activeloop.ai/resources/introducing-deep-lake-th...

    [2] https://imgur.com/a/AZtWSkA

    [3] https://arxiv.org/abs/2209.13705

    [4] https://imgur.com/a/POtHklM

    [5] https://github.com/activeloopai/deeplake-laion-clip

    [6] https://datasets.activeloop.ai/docs/ml/datasets/coco-dataset...

    [7] https://github.com/activeloopai/deeplake

    [8] https://app.activeloop.ai/pricing
