delta vs dvc

Compare delta and dvc to see how they differ.

delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs (by delta-io)

dvc

🦉Data Version Control | Git for Data & Models | ML Experiments Management (by iterative)
                 delta                 dvc
Mentions         44                    84
Stars            5,433                 10,720
Growth           1.8%                  1.4%
Activity         9.9                   9.9
Latest commit    8 days ago            1 day ago
Language         Scala                 Python
License          Apache License 2.0    Apache License 2.0
Mentions - the total number of mentions that we've tracked, plus the number of user-suggested alternatives.
Stars - the number of stars a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

delta

Posts with mentions or reviews of delta. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-10-24.
  • The Evolution of the Data Engineer Role
    2 projects | news.ycombinator.com | 24 Oct 2022
    FACT table (sale $, order quantity, order ID, product ID)

    customer, account_types, etc. are dimensions to filter your low-level transactional data. The schema looks like a snowflake when you add enough dimensions, hence the name.

    The FACT table makes "measures" available to the user. Example: Count of Orders. These are based on the values in the FACT table (your big table of IDs that link to dimensions and low-level transactional data).

    You can then slice and dice your count of orders by fields in the dimensions.

    You could then add Sum of Sale ($) as an additional measure. "Abstract" measures like Average Sale ($) per Order can also be added in the OLAP backend engine.

    End users will often be using Excel or Tableau to create their own dashboards / graphs / reports. This pattern makes sense in that case --> user can explore the heavily structured business data according to all the pre-existing business rules.
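
To make the measures-and-dimensions idea concrete, here is a small illustrative sketch (not from the post above; the table names, columns, and values are invented) using Python's standard-library sqlite3 module: a FACT table joined to one dimension, with Count of Orders and Sum of Sale sliced by a dimension field.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales  (order_id INTEGER, customer_id INTEGER,
                          product_id INTEGER, sale_amount REAL);
INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'APAC');
INSERT INTO fact_sales VALUES (100, 1, 7, 25.0), (101, 1, 8, 40.0),
                              (102, 2, 7, 25.0);
""")

-- measures (Count of Orders, Sum of Sale) sliced by a dimension field (region)
for row in con.execute("""
    SELECT d.region,
           COUNT(f.order_id)  AS order_count,
           SUM(f.sale_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_customer d USING (customer_id)
    GROUP BY d.region
"""):
    print(row)  # e.g. ('EMEA', 2, 65.0)
```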

    Pros:

    - Great for enterprise businesses with existing application databases

    - Highly structured and transaction support (ACID compliance)

    - Ease of use for end business user (create a new pivot table in Excel)

    - Easy to query (basically a bunch of SQL queries)

    - Encapsulates all your business rules in one place -- a.k.a. single source of truth.

    Cons

    - Massive start up cost (have to work out the schema before you even write any code)

    - Slow to change (imagine if the raw transaction amounts suddenly changed to £ after a certain date!)

    - Massive nightly ETL jobs (these break fairly often)

    - Usually proprietary tooling / storage (think MS SQL Server)

    ---

    2. Data Lake

    Throw everything into an S3 bucket. Database table? Throw it into the S3 bucket. Image data? Throw it into the S3 bucket. Kitchen sink? Throw it into the S3 bucket.

    Process your data when you're ready to process it. Read in your data from S3, process it, write back to S3 as an "output file" for downstream consumption.

    Pros:

    - Easy to set up

    - Fairly simple and standardised I/O (S3 APIs work with pandas and PySpark DataFrames, etc.)

    - Can store data remotely until ready to process it

    - Highly flexible as mostly unstructured (create new S3 keys -- a.k.a. directories -- on the fly)

    - Cheap storage

    Cons:

    - Doesn't scale -- turns into a "data swamp"

    - Not always ACID compliant (looking at you Digital Ocean)

    - Very easy to duplicate data

    ---

    3. Data Lakehouse

    Essentially a data lake with some additional bits.

    A. Delta Lake Storage Format a.k.a. Delta Tables

    https://delta.io

    Versioned files acting like versioned tables. Writing to a file will create a new version of the file, with previous versions stored for a set number of updates. Appending to the file creates a new version of the file in the same way (e.g. add a new order streamed in from the ordering application).

    Every file -- a.k.a. delta table -- becomes ACID compliant. You can roll the table back to last week's state and replay from there, e.g. because change X caused bug Y.

    AWS does allow you to do this, but it was a right ol' pain in the arse whenever I had to deal with massively partitioned Parquet files. Delta Lake makes versioning the outputs, and rolling them back, much easier.
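
As a rough illustration of the versioning and rollback described above -- a sketch only, following the documented delta-spark quickstart configuration, with made-up paths and version numbers:

```python
import pyspark
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    pyspark.sql.SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/orders_delta"  # illustrative path

# Each write adds a new version to the table's transaction log.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
spark.range(5, 10).write.format("delta").mode("append").save(path)

# Inspect the version history, then read the table as of an earlier version.
DeltaTable.forPath(spark, path).history().show()
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Roll back to version 0 if a later change caused a bug.
DeltaTable.forPath(spark, path).restoreToVersion(0)
```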

    B. Data Storage Layout

    Enforce a schema based on processing stages to get some performance & data governance benefits.

    Example processing stage schema: DATA IN -> EXTRACT -> TRANSFORM -> AGGREGATE -> REPORTABLE

    Or the "medallion" schema: Bronze -> Silver -> Gold.

    Write out the data at each processing stage to a delta lake table/file. You can now query five data sources instead of two. The table's tier -- the metal's rarity, in the medallion naming -- indicates the degree of "data enrichment" you have performed, i.e. how useful you have made the data. Want to update the codebase for the AGGREGATE stage? Just rerun from the TRANSFORM table (rather than run it all from scratch). This also acts as a caching layer: in a Data Warehouse, the entire query needs to be run from scratch each time you change a field; here, you could just deliver the REPORTABLE tables as artefacts whenever you change them.
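
A minimal sketch of that stage-by-stage layout, reusing the SparkSession from the previous sketch; the paths, column names, and aggregation are invented for illustration:

```python
from pyspark.sql import functions as F

bronze_path = "/lake/bronze/orders"      # raw data as ingested
silver_path = "/lake/silver/orders"      # cleaned / conformed
gold_path = "/lake/gold/daily_sales"     # aggregated, report-ready

# DATA IN -> BRONZE: land the raw records as a Delta table.
raw = spark.read.json("/landing/orders/*.json")
raw.write.format("delta").mode("append").save(bronze_path)

# BRONZE -> SILVER: deduplicate and fix types.
clean = (
    spark.read.format("delta").load(bronze_path)
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
)
clean.write.format("delta").mode("overwrite").save(silver_path)

# SILVER -> GOLD: aggregate. Changing this step only reruns from silver.
daily = (
    spark.read.format("delta").load(silver_path)
    .groupBy(F.to_date("order_ts").alias("day"))
    .agg(F.sum("sale_amount").alias("total_sales"),
         F.count("order_id").alias("order_count"))
)
daily.write.format("delta").mode("overwrite").save(gold_path)
```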

    C. "Metadata" Tracking

    See AWS Glue Data Catalog.

    Index files that match a specific S3 key pattern and/or file format and/or AWS S3 tag etc. throughout your S3 bucket. Store the results in a publicly accessible table. Now you can perform SQL queries against the metadata of your data. Want to find that file you were working on last week? Run a query based on last modified time. Want to find files that contain a specific column name? Run a query based on column names.
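
As a hypothetical illustration of that kind of metadata query -- here via boto3's Glue client rather than SQL against the catalog; the database name, column name, and region are assumptions, not from the post:

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Find every catalogued table that contains a "customer_id" column.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="analytics_lake"):
    for table in page["TableList"]:
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        if any(col["Name"] == "customer_id" for col in columns):
            print(table["Name"], table.get("UpdateTime"))
```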

    Pros:

    - transactional versioning -- ACID compliance and the ability to rollback data over time (I accidentally deleted an entire column of data / calculated VAT wrong yesterday)

    - processing-stage schema storage layout acts as a caching layer (only process from the stage where you need to)

    - no need for humans to remember the specific path to the files they were working on as files are all indexed

    - less chance of creating a "data swamp"

    - changes become easier to audit as you can track the changes between versions

    Cons:

    - Delta lake table format is only really available with Apache Spark / Databricks processing engines (mostly, for now)

    - Requires enforcement of the processing-stage schema (your data scientists will just ignore you when you request they start using it)

    - More setup cost than a simple data lake

    - Basically a move back towards proprietary tooling (some FOSS libs are starting to pop up for it)

    ---

    4. Data Mesh

    geoduck14's answer on this was pretty good. Basically, have a data infrastructure team, and then domain-specific teams that spring up as needed (like an infra team looking after your k8s clusters, and application teams that use the clusters). The domain-specific data teams use the data platform provided by the data infrastructure team.

    I previously worked somewhere in a "product" team which basically performed this function. They just didn't call it a "data mesh".

  • 5 Reasons Your Data Lakehouse should Embrace Dremio Cloud
    2 projects | dev.to | 9 Aug 2022
    You can query data organized in many open table formats like Apache Iceberg and Delta Lake. (Here is a good article on what a table format is and the differences between them.)
  • Delta 2.0 - The Foundation of your Data Lakehouse is Open
    2 projects | reddit.com/r/apachespark | 5 Aug 2022
    Note that the roadmap can be found at https://github.com/delta-io/delta/issues/1307 and we’re actively asking for feedback so we can prioritize the remaining items. Please chime in there so we can track and re-prioritize! Thanks!
    2 projects | reddit.com/r/apachespark | 5 Aug 2022
    Still not quite completely on par with the Databricks version, with missing features like GENERATED ALWAYS AS IDENTITY, but this is getting good.
  • Databricks platform for small data, is it worth it?
    3 projects | reddit.com/r/dataengineering | 29 Jun 2022
    Currently the infrastructure we have is some custom-made pipelines that load the data onto S3, and I use Delta Tables here and there for their convenience: ACID, time travel, merges, CDC, etc.
  • Data point versioning infrastructure for time traveling to a precise point in time?
    2 projects | reddit.com/r/dataengineering | 18 Jun 2022
    I've been playing around a bit with Delta (Table/Lake) whatever you want to call it. It has time travel so you can look back and see what the data looked like at a particular point in time. https://delta.io/
  • How-to-Guide: Contributing to Open Source
    19 projects | reddit.com/r/dataengineering | 11 Jun 2022
    Delta Lake
  • What companies/startups are using Scala (open source projects on github)?
    13 projects | reddit.com/r/scala | 24 May 2022
    There are so many of them in big data, e.g. Kafka, Spark, Flink, Delta, Snowplow, Finagle, Deequ, CMAK, OpenWhisk, Snowflake, TheHive, TVM-VTA, etc.
  • What is a Delta Table?
    2 projects | reddit.com/r/dataengineering | 20 May 2022
    Ah. I believe you are correct. As I look at the examples section for Python in the GitHub repo, these look almost identical to what I was seeing. https://github.com/delta-io/delta/blob/master/examples/python/quickstart.py
    2 projects | reddit.com/r/dataengineering | 20 May 2022
    It is a specific table format. https://delta.io/ It's an open-source project; just read their website, which will have way more info than these comments.

dvc

Posts with mentions or reviews of dvc. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-11-17.
  • How do you manage results, plots, etc.?
    4 projects | reddit.com/r/bioinformatics | 17 Nov 2022
    Bioinf has a lot of biologists who have transitioned into more technical/coding-focused roles, so you'll find there's not a lot of engineering workflow standards out there compared to DS or SWE. As others have said, snakemake is the most common, but that's just a pipeline management tool; it doesn't manage data or outputs. I personally use DVC for data and pipeline management (and include jupyter and papermill to make it all work), although I haven't yet gotten onboard with their experiments feature (which is what would manage different parameters and figures/results beyond versioning). I looked into MLflow and some other options when I was getting started (I do tool development and bioinf analysis), but I wanted data versioning to ensure experiment reproducibility (kind of a critical part of science IMO), and many of the other solutions like Airflow (common in DS industry) seemed to be overkill for smaller bioinfo projects. DVC meets the requirements and I like it in concept, although in practice there have been many updates that have been a bit of a pain to keep up with/integrate. I've got a bioinfo/DS project template on GitHub that rolls together git, conda, DVC, jupyter and papermill to ensure experiment reproducibility, and it is set up as a template that can be deployed with cookiecutter - check it out if you like.
  • [P] Stream and Upload Versioned Data
    2 projects | reddit.com/r/MachineLearning | 2 Nov 2022
    Hi r/MachineLearning, I'm an ML Team Lead at DagsHub (https://www.dagshub.com/), and I wanted to share something cool that we've been working on. As you all know, DVC (dvc.org) is an open-source CLI tool that acts as an extension to Git for large-scale data version control. A while back we integrated it into the platform, providing a built-in DVC remote.
  • Should I use GitHub with Unity if I am working by myself?
    7 projects | reddit.com/r/gamedev | 1 Nov 2022
    DVC might be useful for assets. It integrates with Git by adding tiny metadata files to the managed assets. Instead of storing those directly in the repo, the assets themselves are added to .gitignore, and you can pull or push those to an external file storage such as S3. It's technically for machine learning/data science projects, but I can see it being useful for gamedev if you don't want to pay for LFS. Versioning is supported as well.
    7 projects | reddit.com/r/gamedev | 1 Nov 2022
    Git is definitely useful for version-controlling all your scripts. For your textures and binary objects you can save them in a text format so you can commit them, or you could use DVC! It's like Git LFS, but with some significant advantages: Git LFS has a limit after which you need to pay, whereas with DVC you can simply use your own cloud service as the storage, so you are not limited and it's easier to manage because you have more control. It can feel cumbersome at first, since you have your own dvc add/pull/push commands and it creates additional files with a .dvc extension as pointers, but you get used to it! It's mainly used for machine learning projects but can be used here as well, so try it out. (A Python sketch of reading DVC-versioned files this way appears after this list of posts.)
  • Data Version Control
    8 projects | news.ycombinator.com | 1 Oct 2022
    It was definitely a bad choice. I wasn't there so I can only speculate. My guess is because it is sort of ubiquitous and thus a low-hanging fruit and devs didn't know better, or the related corollary, it's what S3 uses for ETags, so it probably seemed logical. Either way, seems like someone did it and didn't know better, no one agrees on a fix or whether it's even necessary to change, and thus it's stuck for now.

    There's an ongoing discussion about replacing/configuring the hash function, but it looks like it hasn't gone anywhere substantial.

    https://github.com/iterative/dvc/issues/3069

  • Alternative for git with big file
    2 projects | reddit.com/r/datascience | 13 Jul 2022
    If you need to manage data alongside git, especially if you want to version-control the data the way git version-controls code, you could look into something like DVC. It can sync data files over a bunch of different protocols / cloud services (SSH, AWS, etc.).
  • VS Code extension to track ML experiments
    2 projects | reddit.com/r/machinelearningnews | 27 Jun 2022
    The extension uses Data Version Control (DVC) under the hood (we are the DVC team) and gives you:
  • Eden
    16 projects | news.ycombinator.com | 12 Apr 2022
    Data can and should be versioned, but not by just `git add BLOAT`. Take a look at https://dvc.org/: blobs are uploaded to an S3-compatible blob storage, metadata is written to a config file, and that file gets versioned in Git.
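
Several of the posts above describe the same DVC workflow: assets or data files are added with dvc add, pushed to a cloud remote, and tracked in Git via small .dvc pointer files. As a small Python-side sketch of consuming such versioned data (the repository URL, file path, and tag below are placeholders):

```python
import dvc.api

# Stream one specific version of a DVC-tracked file straight from the remote,
# selected by any Git revision (branch, tag, or commit) of the repository.
with dvc.api.open(
    "data/assets/rock_diffuse.png",
    repo="https://github.com/example/my-project",
    rev="v0.3.0",
    mode="rb",
) as f:
    payload = f.read()

# Or resolve where the remote copy lives without downloading it.
url = dvc.api.get_url(
    "data/assets/rock_diffuse.png",
    repo="https://github.com/example/my-project",
    rev="v0.3.0",
)
print(len(payload), url)
```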

What are some alternatives?

When comparing delta and dvc you can also consider the following projects:

MLflow - Open source platform for the machine learning lifecycle

Activeloop Hub - Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake]

ploomber - The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

lakeFS - Git-like capabilities for your object storage

hudi - Upserts, Deletes And Incremental Processing on Big Data.

aim - Aim 💫 — easy-to-use and performant open-source ML experiment tracker.

delta-rs - A native Rust library for Delta Lake, with bindings into Python and Ruby.

iceberg - Apache Iceberg

guildai - Experiment tracking, ML developer tools

git-submodules - Git Submodule alternative with equivalent features, but easier to use and maintain.
