Similar projects and alternatives to deeplake
Library for reading and writing large multi-dimensional arrays.
Artificial intelligence software for MapleStory that uses various machine learning and computer vision techniques to navigate challenging in-game environments
A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:
⚡ Building applications with LLMs through composability ⚡
Modern columnar data format for ML implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
deeplake reviews and mentions
Build ChatGPT for Financial Documents with LangChain + Deep Lake
2 projects | reddit.com/r/learnmachinelearning | 2 Mar 2023
As the world is increasingly generating vast amounts of financial data, the need for advanced tools to analyze and make sense of it has never been greater. This is where LangChain and Deep Lake come in, offering a powerful combination of technology to help build a question-answering tool based on financial data. After participating in a LangChain hackathon last week, I created a way to use Deep Lake, the data lake for deep learning (a package my team and I are building) with LangChain. I decided to put together a guide of sorts on how you can approach building your own question-answering tools with LangChain and Deep Lake as the data store.
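As a rough sketch of the pattern described above, here is what wiring Deep Lake into LangChain as a vector store might look like. This assumes the `langchain` and `deeplake` packages and the `DeepLake` vector store class as they existed around early 2023 (class paths may have moved in later releases), plus an OpenAI API key; the documents and question are placeholders:

```python
# Hedged sketch: Deep Lake as a LangChain vector store for question answering.
# Requires `pip install langchain deeplake openai` and OPENAI_API_KEY set.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Embed and store a few toy "financial" snippets in a local Deep Lake dataset.
embeddings = OpenAIEmbeddings()
db = DeepLake(dataset_path="./financial_docs", embedding_function=embeddings)
db.add_texts([
    "Q4 revenue grew 12% year over year.",
    "Operating margin for the quarter was 21%.",
])

# Build a retrieval-augmented QA chain on top of the vector store.
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=db.as_retriever())
print(qa.run("How much did revenue grow in Q4?"))
```

The same `dataset_path` could point at `s3://` or `hub://` storage instead of a local directory, so the index itself lives wherever the rest of your data does.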
Launch HN: Activeloop (YC S18) – Data lake for deep learning
Hi HN, I'm Davit, the CEO of Activeloop (https://activeloop.ai). We've made a “data lake” (industry jargon for a large data store with lots of heterogeneous data) that’s optimized for deep learning. Keeping your data in an AI-optimized format means you can ship AI models faster, without having to build complex data infrastructure for image, audio, and video data (check out our GitHub here: https://github.com/activeloopai/deeplake).
Deep Lake stores complex data such as images, audio, videos, annotations/labels, and tabular data, in the form of tensors—a type of data structure used in linear algebra, which AI systems like to consume.
We then rapidly stream the data into three destinations: (a) a SQL-like language (Tensor Query Language) that you can use to query your data; (b) an in-browser engine that you can use to visualize your data; and (c) deep learning frameworks, letting you do AI magic on your data while fully utilizing your GPUs. Here’s a 10-minute demo: https://www.youtube.com/watch?v=SxsofpSIw3k&t.
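As an illustration of destination (c), streaming a dataset into PyTorch with the open-source `deeplake` package might look like the following. The dataset path and method names reflect the docs at the time and may differ across versions; a network connection is required to stream the public dataset:

```python
# Hedged sketch: stream a public Deep Lake dataset into a PyTorch training loop.
import deeplake

# Load lazily: samples stream over the network on access,
# rather than being downloaded up front.
ds = deeplake.load("hub://activeloop/cifar10-train")

# Wrap the dataset in a PyTorch-compatible dataloader.
dataloader = ds.pytorch(num_workers=2, batch_size=32, shuffle=True)

for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    break  # inspect a single streamed batch
```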
Back in 2016, I started my Ph.D. research in deep learning and witnessed the transition from gigabyte- to terabyte- and then petabyte-scale datasets. To run our models at scale, we needed to rethink how we handled data. One of the ways we optimized our workflows was streaming the data while asynchronously running the computation on GPUs. This served as the inspiration for creating Activeloop.
When you want to use unstructured data for deep learning purposes, you’ll encounter the following options:
- Store metadata (pointers to the unstructured data) in a regular database, and images in object storage. For high-throughput workloads, it is inefficient to query the metadata table and then fetch images from object storage.
- Store images inside a database. This typically explodes storage costs: storing images in MongoDB and using them to train a model, for example, would cost roughly 20x more than a Deep Lake setup.
- Extend Parquet or Arrow to store images. On the plus side, you can now use existing analytical tools such as Spark, Kafka, and even DuckDB. But even major self-driving car companies failed on this path.
- Build custom infrastructure aligned with your data in-house. Assuming you have the money and access to 10 solid data engineers with PhD-level knowledge, this still takes time (~2.5+ years), is difficult to extend beyond the initial vertical, will be hard to maintain, and will defocus your data scientists.
Whatever the case, you'll get slow iteration cycles, under-utilized GPUs, and lots of ML engineer busywork (thus high costs).
Your unstructured data already sits in a data lake such as S3 or a distributed file system (e.g., Lustre) and you probably don’t want to change this. Deep Lake keeps everything that a regular data lake makes great. It helps you version-control, run SQL queries, ingest billion-row data efficiently, and visualize terabyte-scale datasets in your browser or notebook. But there is one key difference from traditional data lakes: we store complex data, such as images, audio, videos, annotations/labels, and tabular data, in a tensorial form that is optimized for deep learning and GPU utilization.
Some stats/benchmarks since our launch:
- In a third-party benchmark by Yale University, Deep Lake provided the fastest data loader for PyTorch, especially when it comes to networked loading;
- Deep Lake handles scale and long distance: we trained a 1B-parameter CLIP model on a single machine with 16xA100 GPUs on the LAION-400M dataset, streaming the data from US-EAST (AWS) to US-CENTRAL (GCP);
- You can access datasets as large as 200M samples of image-text pairs in seconds (compared to the 100+ hours it takes via traditional methods) with one line of code.
What's free and what's not: the data format, the Python dataloader, version control, and data lineage (a log of how the data came to its current state) with the Python API are open source. The query language, fast streaming, and visualization engines are built in C++ and are closed-source for the time being, but are accessible via a Python interface. Users can store up to 300GB of their data with us for free. Our growth plan is $995/month and includes an optimized query engine, the fast data loader, and features like analytics. If you're an academic, you can get this plan for free. Finally, we have an enterprise plan including role-based access control, security, integrations, and more than 10 TB of managed data.
Teams at Intel, Google, & MILA use Deep Lake. If you want to read more, we have an enterprise-y whitepaper at https://www.deeplake.ai/whitepaper, an academic paper at https://arxiv.org/abs/2209.10785, and a launch blog post with deep dive into features at https://www.activeloop.ai/resources/introducing-deep-lake-th....
I would love to hear your thoughts on this, especially anything about how you manage your deep learning data and what issues you run into with your infra. I look forward to all your comments. Thanks a lot!
Re: HF - we know them and admire their work (primarily, until very recently, focused on NLP, while we focus mostly on CV). As mentioned in the post, a large part of Deep Lake, including the Python-based dataloader and dataset format, is open source as well - https://github.com/activeloopai/deeplake.
Likewise, we curate a list of large open source datasets here -> https://datasets.activeloop.ai/docs/ml/, but our main thing isn't aggregating datasets (focus for HF datasets), but rather providing people with a way to manage their data efficiently. That being said, all of the 125+ public datasets we have are available in seconds with one line of code. :)
We haven't benchmarked against HF datasets in a while, but Deep Lake's dataloader is much, much faster in third-party benchmarks (see https://arxiv.org/pdf/2209.13705; for an older benchmark of a previous version that was much slower than what we have now, see https://pasteboard.co/la3DmCUR2iFb.png). HF under the hood uses Git-LFS (to the best of my knowledge) and is not opinionated on formats, so LAION just dumps Parquet files on their storage.
While your setup would work for a few TBs, scaling to PBs would be tricky, including maintaining your own infrastructure. And yep, as you said, NAS/NFS wouldn't be able to handle the scale (especially writes with 1k workers). I am also slightly curious about your use of mmap files with compressed image/video data (as zero-copy won't happen) unless you decompress inside the GPU ;), but would love to learn more from you! Re: pricing, thanks for the feedback; storage is one component and is custom-priced for PB-scale workloads.
You can store your data either remotely or locally (see here on how https://docs.activeloop.ai/getting-started/creating-datasets...).
You can then visualize your datasets if they're stored on our cloud or in AWS/GCP, or you can drag and drop a local dataset in Deep Lake format into our UI (https://docs.activeloop.ai/dataset-visualization).
We do! Version control, the Python-based dataloader, and the dataset format are all open source. Please check out https://github.com/activeloopai/deeplake.
[N] Google releases TensorStore for High-Performance, Scalable Array Storage
3 projects | reddit.com/r/MachineLearning | 22 Sep 2022
This is very similar to what Activeloop is doing with their work.
activeloopai/deeplake is an open source project licensed under Mozilla Public License 2.0, which is an OSI-approved license.