MLflow VS dvc

Compare MLflow vs dvc and see what their differences are.

MLflow

Open source platform for the machine learning lifecycle (by mlflow)

dvc

🦉Data Version Control | Git for Data & Models | ML Experiments Management (by iterative)
              MLflow              dvc
Mentions      40                  91
Stars         13,865              11,209
Growth        2.4%                2.0%
Activity      9.6                 9.8
Last commit   3 days ago          6 days ago
Language      Python              Python
License       Apache License 2.0  Apache License 2.0
The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

MLflow

Posts with mentions or reviews of MLflow. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-02-25.

dvc

Posts with mentions or reviews of dvc. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-02-16.
  • Oxen.ai: Fast Unstructured Data Version Control
    6 projects | news.ycombinator.com | 16 Feb 2023
    How does this compare with other systems, like DVC (https://dvc.org/) for example?
  • Career advice for getting into NLP from a Computer Science background?
    2 projects | reddit.com/r/LanguageTechnology | 10 Feb 2023
    For the data cleaning and training parts, you might have projects where you've used kaggle datasets to train models and you've done appropriate feature engineering and data exploration to help you to understand whether data might need to be under- or oversampled or cleaned in some other way. I'd give bonus points to someone who has thoughts about how training pipelines might be semi or fully automated in a production environment (e.g. use of scripts and tools like dvc to make things easy to reproduce). I'd want to see evidence of appropriate metrics (e.g. I know it's 99% accurate and that might be great, but if it's a 10-way classification on a very unbalanced dataset, what can you tell me about performance on the smallest class?).
  • ML experiment tracking with DagsHub, MLFlow, and DVC
    4 projects | dev.to | 12 Jan 2023
    Here, we’ll implement the experimentation workflow using DagsHub, Google Colab, MLflow, and data version control (DVC). We’ll focus on how to do this without diving deep into the technicalities of building or designing a workbench from scratch. Going that route might increase the complexity involved, especially if you are in the early stages of understanding ML workflows, just working on a small project, or trying to implement a proof of concept.
  • Show HN: We scaled Git to support 1 TB repos
    9 projects | news.ycombinator.com | 13 Dec 2022
    There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.

    If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create objects proportional to O(n) directory nesting depth as Xet appears to. (Xet is very much like Git in that respect.)

    The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.

    Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)

  • Is it possible to create a symbolic link to a folder to solve case sensitivity?
    5 projects | reddit.com/r/linuxquestions | 1 Dec 2022
    https://github.com/psf/black/issues/338 https://github.com/VeriorPies/ParrelSync/issues/61 https://github.com/prusa3d/PrusaSlicer/issues/5751 https://github.com/iterative/dvc/issues/2530 https://github.com/facebook/relay/issues/3647 And I know godmode9 at one point absolutely freaked when navigating into a symlink. It kinda depends on the app and what it's trying to load
  • How do you manage results, plots, etc.?
    4 projects | reddit.com/r/bioinformatics | 17 Nov 2022
    Bioinf has a lot of biologists who have transitioned into more technical/coding focused roles, so you'll find there's not a lot of engineering workflow standards out there compared to DS or SWE. As others have said, snakemake is the most common, but that's just a pipeline management tool; it doesn't manage data or outputs. I personally use DVC for data and pipeline management (and include jupyter and papermill to make it all work), although I haven't yet gotten onboard with their experiments feature (which is what would manage different parameters and figures/results beyond versioning). I looked into MLflow and some other options when I was getting started (I do tool development and bioinf analysis), but I wanted data versioning to ensure experiment reproducibility (kind of a critical part of science IMO), and many of the other solutions like Airflow (common in DS industry) seemed to be overkill for smaller bioinfo projects. DVC meets the requirements and I like it in concept, although in practice there have been many updates that have been a bit of a pain to keep up with/integrate. I've got a bioinfo/ds project template on github that rolls together git, conda, DVC, jupyter and papermill to ensure experiment reproducibility, and is set up as a template that can be deployed with cookiecutter - check it out if you like.
  • [P] Stream and Upload Versioned Data
    2 projects | reddit.com/r/MachineLearning | 2 Nov 2022
    Hi r/MachineLearning, I'm an ML Team Lead at DagsHub (https://www.dagshub.com/), and I wanted to share something cool that we've been working on. As you all know, DVC (dvc.org) is an open-source CLI tool that acts as an extension to Git for large-scale data version control. A while back we integrated it into the platform, providing a built-in DVC remote.
  • Should I use GitHub with Unity if I am working by myself?
    7 projects | reddit.com/r/gamedev | 1 Nov 2022
    DVC might be useful for assets. It integrates with Git by adding tiny metadata files to the managed assets. Instead of storing those directly in the repo, the assets themselves are added to .gitignore, and you can pull or push those to an external file storage such as S3. It's technically for machine learning/data science projects, but I can see it being useful for gamedev if you don't want to pay for LFS. Versioning is supported as well.
    7 projects | reddit.com/r/gamedev | 1 Nov 2022
    Git is definitely useful for version controlling all your scripts. For your textures and binary objects you can save them in text format so you can commit them, or you could use DVC! It's like Git LFS but it has some significant advantages! Git LFS has a limit after which you need to pay. With DVC you can simply use your cloud service as the storage, so you are not limited and it's easier to manage, as you have more control. It can feel cumbersome at first since you have your own dvc add, pull, and push commands and it creates additional files with the .dvc extension as pointers, but you get used to it! It's mainly used for machine learning projects but can be used here as well! So try it out
  • Data Version Control
    8 projects | news.ycombinator.com | 1 Oct 2022
    It was definitely a bad choice. I wasn't there so I can only speculate. My guess is because it is sort of ubiquitous and thus a low-hanging fruit and devs didn't know better, or the related corollary, it's what S3 uses for ETags, so it probably seemed logical. Either way, seems like someone did it and didn't know better, no one agrees on a fix or whether it's even necessary to change, and thus it's stuck for now.

    There's an ongoing discussion about replacing/configuring the hash function, but it looks like it hasn't gone anywhere substantial.

    https://github.com/iterative/dvc/issues/3069
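
Several of the posts above describe the same basic idea: the large file is kept out of Git, a tiny pointer file is committed in its place, and the real bytes live in external storage addressed by a content hash. The following is a minimal sketch of that idea using only the Python standard library. It is not DVC's implementation or file format: the function names (`track`, `restore`), the `.pointer` suffix, the JSON layout, and the choice of SHA-256 are all assumptions made for illustration, and a local directory stands in for remote storage such as S3.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large assets are not read into memory at once."""
    hasher = hashlib.sha256()  # illustrative choice, not necessarily what DVC uses
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def track(asset: Path, store: Path) -> Path:
    """Copy `asset` into `store` under its digest and write a small pointer file."""
    digest = file_digest(asset)
    store.mkdir(parents=True, exist_ok=True)
    stored = store / digest
    if not stored.exists():  # identical content is stored only once
        shutil.copyfile(asset, stored)
    # Hypothetical pointer format: this small file is what would be committed to Git.
    pointer = asset.with_name(asset.name + ".pointer")
    pointer.write_text(json.dumps({"path": asset.name, "sha256": digest}))
    return pointer


def restore(pointer: Path, store: Path) -> Path:
    """Recreate the asset next to its pointer by looking up the digest in the store."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(store / meta["sha256"], target)
    return target


def demo() -> bool:
    """Track a file, delete it, restore it, and report whether the bytes match."""
    payload = b"\x00\x01" * 1024
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        asset = root / "texture.bin"
        asset.write_bytes(payload)
        pointer = track(asset, root / "store")
        asset.unlink()  # simulate a fresh checkout that only has the pointer
        restored = restore(pointer, root / "store")
        return restored.read_bytes() == payload


if __name__ == "__main__":
    print(demo())
```

Because the store is keyed by content hash, tracking the same bytes twice adds nothing new to the store, and the choice of hash function determines how files are identified, which is why the thread above treats changing it as a nontrivial decision.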

What are some alternatives?

When comparing MLflow and dvc you can also consider the following projects:

clearml - ClearML - Auto-Magical CI/CD to streamline your ML workflow. Experiment Manager, MLOps and Data-Management

Sacred - Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.

zenml - ZenML 🙏: Build portable, production-ready MLOps pipelines. https://zenml.io.

guildai - Experiment tracking, ML developer tools

tensorflow - An Open Source Machine Learning Framework for Everyone

Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

neptune-client - 📒 Experiment tracking tool and model registry

H2O - H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

gensim - Topic Modelling for Humans

Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

onnxruntime - ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

dagster - An orchestration platform for the development, production, and observation of data assets.