| | delta | dvc |
|---|---|---|
| Mentions | 74 | 121 |
| Stars | 8,023 | 14,454 |
| Growth | 1.4% | 1.0% |
| Activity | 9.8 | 8.9 |
| Latest commit | 3 days ago | 5 days ago |
| Language | Scala | Python |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
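The exact formula behind the activity number isn't published here; purely as an illustrative sketch (the half-life and scaling below are assumptions, not the tracking site's actual method), a time-decayed weighting captures the idea that recent commits count for more than older ones:

```python
from datetime import datetime, timezone

# Illustrative only: exponential decay so a commit made today counts as 1.0
# and a commit made HALF_LIFE_DAYS ago counts as 0.5. The real activity
# score above is computed by the tracking site, not by this code.
HALF_LIFE_DAYS = 30.0

def activity_weight(commit_dates, now=None):
    now = now or datetime.now(timezone.utc)
    total = 0.0
    for committed_at in commit_dates:
        age_days = (now - committed_at).total_seconds() / 86400.0
        total += 0.5 ** (age_days / HALF_LIFE_DAYS)
    return total
```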
delta
-
Twitter's 600-Tweet Daily Limit Crisis: Soaring GCP Costs and the Open Source Fix Elon Musk Ignored
Delta Lake: Delta Lake is an open-source storage layer that provides ACID transactions, scalable metadata management, and data versioning on top of existing data lakes. It aims to bring reliability and performance optimizations to big data workloads while ensuring data integrity and consistency.
-
Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead
When it comes to stream processing systems, Iceberg support varies across vendors. Databricks, which oversees Spark Streaming, focuses on Delta Lake. Apache Flink, heavily influenced by Alibaba’s contributions, promotes Paimon, an alternative to Iceberg. RisingWave, on the other hand, fully embraces Iceberg. Rather than focusing solely on one table format, RisingWave aims to support various catalog services, including AWS Glue Catalog, Polaris, and Unity Catalog.
-
Apache Iceberg
Hidden partitioning is the most interesting Iceberg feature, because most very large datasets are time-series fact tables.
I don't remember seeing it in Delta Lake [1], probably because the industry-standard benchmarks join date as a dimension table rather than filtering on timestamp ranges.
[1] - https://github.com/delta-io/delta/issues/490
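For readers who haven't used the feature the comment refers to: hidden partitioning lets Iceberg partition a table by a transform of the timestamp column itself, so queries that filter on the raw timestamp still get partition pruning without a separate, user-maintained date column. A minimal PySpark sketch, assuming an Iceberg catalog named `demo` is already configured (the table and column names are made up):

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.demo is configured as an Iceberg SparkCatalog.
spark = SparkSession.builder.appName("iceberg-hidden-partitioning").getOrCreate()

# The table is partitioned by a transform of event_ts; no derived date column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Filtering on the raw timestamp is enough for Iceberg to prune partitions.
spark.sql("""
    SELECT count(*) FROM demo.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
      AND event_ts <  TIMESTAMP '2024-01-08 00:00:00'
""").show()
```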
-
25 Open Source AI Tools to Cut Your Development Time in Half
Delta Lake is a storage layer framework that provides reliability to data lakes. It addresses the challenges of managing large-scale data in lakehouse architectures, where data is stored in an open format and used for various purposes, like machine learning (ML). Data engineers can build real-time pipelines or ML applications using Delta Lake because it supports both batch and streaming data processing. It also brings ACID (atomicity, consistency, isolation, durability) transactions to data lakes, ensuring data integrity even with concurrent reads and writes from multiple pipelines.
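To make the batch-plus-streaming claim concrete, here is a minimal PySpark sketch; it assumes a Spark session already configured for Delta Lake (delta-spark installed, SQL extensions enabled), and the paths are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-batch-and-stream").getOrCreate()

path = "/tmp/delta/events"  # hypothetical table location

# Batch append: committed as one ACID transaction, so concurrent readers
# never observe a partially written version of the table.
spark.range(0, 1000).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("append").save(path)

# The same table can be consumed as a stream; each new commit arrives
# as a micro-batch in the downstream query.
stream = (
    spark.readStream.format("delta").load(path)
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
         .start()
)
```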
-
Make Rust Object Oriented with the dual-trait pattern
There is a neat example of how a third-party project belonging to the Linux Foundation implements UserDefinedLogicalNodeCore: MetricObserver in delta-rs. The developer only had to add #[derive(Debug, Hash, Eq, PartialEq)] to get dyn_eq and dyn_hash implemented.
-
Delta Lake vs. Parquet: A Comparison
Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.
I think the website is here: https://delta.io
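The upsert the commenter is describing is Delta's MERGE operation. A minimal sketch with the delta-spark Python API, assuming an existing Delta table at a made-up path:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

# Assumes a Delta table already exists at this hypothetical path.
target = DeltaTable.forPath(spark, "/tmp/delta/customers")

updates = spark.createDataFrame(
    [(1, "alice@example.com"), (42, "new.user@example.com")],
    ["customer_id", "email"],
)

# MERGE runs the upsert as one ACID transaction: matching rows are
# updated in place, unmatched rows are inserted.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```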
-
Understanding Parquet, Iceberg and Data Lakehouses
I often hear references to Apache Iceberg and Delta Lake as if they’re two peas in the Open Table Formats pod. Yet…
Here’s the Apache Iceberg table format specification:
https://iceberg.apache.org/spec/
As they like to say in patent law, anyone “skilled in the art” of database systems could use this to build and query Iceberg tables without too much difficulty.
This is nominally the Delta Lake equivalent:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md
I defy anyone to even scope out what level of effort would be required to fully implement the current spec, let alone what would be involved in keeping up to date as this beast evolves.
Frankly, the Delta Lake spec reads like a reverse engineering of whatever implementation tradeoffs Databricks is making as they race to build out a lakehouse for every Fortune 1000 company burned by Hadoop (which is to say, most of them).
My point is that I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!
-
Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog
Apache Iceberg is one of the three major open table formats used to build lakehouses; the other two are Apache Hudi and Delta Lake.
-
[D] Is there other better data format for LLM to generate structured data?
The Apache Spark / Databricks community prefers Apache Parquet or the Linux Foundation's delta.io over JSON.
-
Delta vs Iceberg: make love not war
Delta 3.0 extends an olive branch. https://github.com/delta-io/delta/releases/tag/v3.0.0rc1
dvc
- Ask HN: What is the simplest data orchestration tool you've worked with?
-
10 Must-Know Open Source Platform Engineering Tools for AI/ML Workflows
Data Version Control is a powerful version control tool tailored for ML workflows. It ensures reproducibility by tracking and sharing data, pipelines, experiments, and models. With its Git-like interface, it integrates seamlessly with existing Git repositories. It supports various cloud storage backends, such as AWS S3 and Azure Blob Storage, enabling versioning of large datasets without bloating your Git repositories.
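For a sense of how that versioning is consumed from code, here is a minimal sketch with DVC's Python API; the repository URL, file path, and tag below are hypothetical, and the repo is assumed to have a DVC remote (e.g. an S3 bucket) configured:

```python
import dvc.api

REPO = "https://github.com/example/ml-project"  # hypothetical repo

# Stream a specific version of a tracked dataset straight from the remote,
# without cloning the repository or pulling the whole cache.
with dvc.api.open("data/train.csv", repo=REPO, rev="v1.2.0") as f:
    header = f.readline()

# Or just resolve where that exact version lives in remote storage.
print(dvc.api.get_url("data/train.csv", repo=REPO, rev="v1.2.0"))
```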
-
Top 10 MLOps Tools for 2025
9. DVC - Data Version Control
-
S3 as a Git remote and LFS server
I hadn't heard of dvc, so I had to google it, which took me to: https://dvc.org/
But I'm still confused as to what dvc is after a cursory glance at their homepage.
-
serverless-registry: A Docker registry backed by Workers and R2
I’m self-hosting Gitea just for its private Docker registry. LFS is actually slow for heavy deep-learning workflows with millions of small files. I’m using DVC [1] instead.
[1]: https://dvc.org
- GitOps ML Experiments, data versioning, model registry
-
25 Open Source AI Tools to Cut Your Development Time in Half
Version control for machine learning projects entails managing not only code but also datasets, ML models, performance metrics, and other development artifacts. DVC's purpose is to bring best practices from software engineering, like version control and reproducibility, to data science and machine learning. It lets data scientists and ML engineers track changes to data and models the way Git does for code, and it runs on top of any Git repository. It also supports managing model experiments.
-
Essential Deep Learning Checklist: Best Practices Unveiled
Tool: Consider using Data Version Control (DVC) to manage your datasets, models, and their respective versions. DVC integrates with Git, allowing you to handle large data files and model binaries without cluttering your repository. It also makes it easy to version your training datasets and models, ensuring you can always match a model back to its exact training environment.
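One hedged way to realize that "match a model back to its exact training environment" point is through DVC's Python API, which can fetch both the artifact and the parameters recorded at a given Git tag (the repo URL, tag, and paths below are made up, and the repo is assumed to track its params file and model outputs with DVC):

```python
import dvc.api

REPO = "https://github.com/example/ml-project"  # hypothetical repo
REV = "train-2024-05-01"                        # hypothetical Git tag for one training run

# Hyperparameters as recorded (e.g. in params.yaml) at that revision.
params = dvc.api.params_show(repo=REPO, rev=REV)
print(params)

# Download the exact model artifact produced by that run.
fs = dvc.api.DVCFileSystem(REPO, rev=REV)
fs.get_file("models/model.pkl", "model.pkl")
```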
-
10 Open Source Tools for Building MLOps Pipelines
Just as Git gives you versioned code and the ability to roll back to previous versions of a repository, DVC has built-in support for tracking your data and models. This helps machine learning teams reproduce each other's experiments and facilitates collaboration. DVC is based on the principles of Git and is easy to learn, since its commands are similar to Git's. Other benefits of using DVC include:
What are some alternatives?
lakeFS - Data version control for your data lake | Git for data
MLflow - Open source platform for the machine learning lifecycle
delta-rs - A native Rust library for Delta Lake, with bindings into Python
LakeSoul - LakeSoul is an end-to-end, real-time, cloud-native lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.
git-lfs - Git extension for versioning large files