MLflow
dvc
 | MLflow | dvc |
---|---|---|
Mentions | 40 | 91 |
Stars | 13,865 | 11,209 |
Growth | 2.4% | 2.0% |
Activity | 9.6 | 9.8 |
Last commit | 3 days ago | 6 days ago |
Language | Python | Python |
License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
MLflow
-
Any MLOps platform you use?
I have an old labmate who uses a similar setup with MLFlow and can endorse it.
MLflow - an open-source platform for managing your ML lifecycle. What’s great is that it supports popular Python libraries like TensorFlow, PyTorch, and scikit-learn, and it also works with R.
-
Selfhosted chatGPT with local content
even for people who don't have an ML background there's now a lot of very fully-featured model deployment environments that allow self-hosting (kubeflow has a good self-hosting option, as do mlflow and metaflow), handle most of the complicated stuff involved in just deploying an individual model, and work pretty well off the shelf.
-
ML experiment tracking with DagsHub, MLFlow, and DVC
Here, we’ll implement the experimentation workflow using DagsHub, Google Colab, MLflow, and data version control (DVC). We’ll focus on how to do this without diving deep into the technicalities of building or designing a workbench from scratch. Going that route might increase the complexity involved, especially if you are in the early stages of understanding ML workflows, just working on a small project, or trying to implement a proof of concept.
-
AI in DevOps?
MLflow
-
AWS re:invent 2022 wish list
I am seeing growing demand for MLflow (https://mlflow.org/), and a lot of people are looking at Databricks as a commercial offering for MLflow. Alternatively, some people are implementing something like Managing your Machine Learning lifecycle with MLflow. I think this was on my wish list last year, but I really hope AWS announces a managed MLflow service. I know version 2.X is too new, but at least 1.X would be a great start.
-
✨ 7 Best Machine Learning Experiment Logging Tools in 2022 🚀
🔗 https://mlflow.org
-
[D] Who here are convinced that they have a really good setup that keeps track of their ML experiments?
-
JBCNConf 2022: A great farewell
She mentioned MLOps and MLflow, including Vertex AI, the GCP implementation. I will post the video as soon as it is available. In the meantime, you can enjoy any other talk from Nerea Luis.
-
Keeping Your Machine Learning Models on the Right Track: Getting Started with MLflow, Part 2
In our last post, we discussed the importance of tracking Machine Learning experiments, metrics and parameters. We also showed how easy it is to get started in these topics by leveraging the power of MLflow (for those who are not aware, MLflow is currently the de-facto standard platform for machine learning experiment and model management).
dvc
-
Oxen.ai: Fast Unstructured Data Version Control
How does this compare with other systems, like DVC (https://dvc.org/) for example?
-
Career advice for getting into NLP from a Computer Science background?
For the data cleaning and training parts, you might have projects where you've used Kaggle datasets to train models and you've done appropriate feature engineering and data exploration to help you understand whether data might need to be under- or oversampled or cleaned in some other way. I'd give bonus points to someone who has thoughts about how training pipelines might be semi- or fully automated in a production environment (e.g. use of scripts and tools like dvc to make things easy to reproduce). I'd want to see evidence of appropriate metrics (e.g. I know it's 99% accurate and that might be great, but if it's a 10-way classification on a very unbalanced dataset, what can you tell me about performance on the smallest class?).
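The point about accuracy on an unbalanced dataset can be made concrete with a small worked example. This is a minimal sketch using only the standard library; the labels, class sizes, and the "always predict the majority class" model are hypothetical, and two classes are used instead of ten to keep it short.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / count for label, count in support.items()}


# Hypothetical data: 990 "common" examples and only 10 "rare" ones.
y_true = ["common"] * 990 + ["rare"] * 10
# Hypothetical model that always predicts the majority class.
y_pred = ["common"] * 1000

print(accuracy(y_true, y_pred))          # 0.99
print(per_class_recall(y_true, y_pred))  # {'common': 1.0, 'rare': 0.0}
```

The headline accuracy is 99%, yet recall on the smallest class is zero, which is the kind of gap a per-class metric exposes.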
-
ML experiment tracking with DagsHub, MLFlow, and DVC
Here, we’ll implement the experimentation workflow using DagsHub, Google Colab, MLflow, and data version control (DVC). We’ll focus on how to do this without diving deep into the technicalities of building or designing a workbench from scratch. Going that route might increase the complexity involved, especially if you are in the early stages of understanding ML workflows, just working on a small project, or trying to implement a proof of concept.
-
Show HN: We scaled Git to support 1 TB repos
There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.
If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create objects proportional to O(n) directory nesting depth as Xet appears to. (Xet is very much like Git in that respect.)
The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.
Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)
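The idea of reprocessing only new or changed data can be sketched as follows. This is an illustration under stated assumptions, not how Pachyderm or DVC actually schedule work: `content_hash`, `incremental_run`, and the in-memory `seen` index are hypothetical names, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Identify a file version by the hash of its contents."""
    return hashlib.sha256(data).hexdigest()


def incremental_run(files, seen, process):
    """Call `process` only for files that are new or whose content changed.

    `files` maps path -> bytes for the current input.
    `seen` maps path -> content hash from earlier runs; it is updated in place.
    Returns the list of paths that were actually processed.
    """
    processed = []
    for path, data in sorted(files.items()):
        digest = content_hash(data)
        if seen.get(path) != digest:
            process(path, data)
            seen[path] = digest
            processed.append(path)
    return processed


def noop(path, data):
    """Stand-in for an expensive processing job."""


seen = {}
first = incremental_run({"a.csv": b"1,2", "b.csv": b"3,4"}, seen, noop)
second = incremental_run(
    {"a.csv": b"1,2", "b.csv": b"3,5", "c.csv": b"6"}, seen, noop
)
print(first)   # ['a.csv', 'b.csv']
print(second)  # ['b.csv', 'c.csv']
```

On the second run the unchanged `a.csv` is skipped, which is where the compute savings described above would come from.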
-
Is it possible to create a symbolic link to a folder to solve case sensitivity?
https://github.com/psf/black/issues/338 https://github.com/VeriorPies/ParrelSync/issues/61 https://github.com/prusa3d/PrusaSlicer/issues/5751 https://github.com/iterative/dvc/issues/2530 https://github.com/facebook/relay/issues/3647 And I know godmode9 at one point absolutely freaked when navigating into a symlink. It kinda depends on the app and what it's trying to load
-
How do you manage results, plots, etc.?
Bioinf has a lot of biologists who have transitioned into more technical/coding focused roles, so you'll find there's not a lot of engineering workflow standards out there compared to DS or SWE. As others have said, snakemake is the most common, but that's just a pipeline management tool; it doesn't manage data or outputs. I personally use DVC for data and pipeline management (and include jupyter and papermill to make it all work), although I haven't yet gotten onboard with their experiments feature (which is what would manage different parameters and figures/results beyond versioning). I looked into MLflow and some other options when I was getting started (I do tool development and bioinf analysis), but I wanted data versioning to ensure experiment reproducibility (kind of a critical part of science IMO), and many of the other solutions like Airflow (common in DS industry) seemed to be overkill for smaller bioinfo projects. DVC meets the requirements and I like it in concept, although in practice there have been many updates that have been a bit of a pain to keep up with/integrate. I've got a bioinfo/ds project template on github that rolls together git, conda, DVC, jupyter and papermill to ensure experiment reproducibility, and is set up as a template that can be deployed with cookiecutter - check it out if you like.
-
[P] Stream and Upload Versioned Data
Hi r/MachineLearning, I'm an ML Team Lead at DagsHub (https://www.dagshub.com/), and I wanted to share something cool that we've been working on. As you all know, DVC (dvc.org) is an open-source CLI tool that acts as an extension to Git for large-scale data version control. A while back we integrated it into the platform, providing a built-in DVC remote.
-
Should I use GitHub with Unity if I am working by myself?
DVC might be useful for assets. It integrates with Git by adding tiny metadata files to the managed assets. Instead of storing those directly in the repo, the assets themselves are added to .gitignore, and you can pull or push those to an external file storage such as S3. It's technically for machine learning/data science projects, but I can see it being useful for gamedev if you don't want to pay for LFS. Versioning is supported as well.
Git is definitely useful for version controlling all your scripts. For your textures and binary objects, you can save them in text format so you can commit them, or you could use DVC! It's like Git LFS, but it has some significant advantages. Git LFS has a limit after which you need to pay. With DVC you can simply use your own cloud service as the storage, so you are not limited, and it's easier to manage since you have more control. It can feel cumbersome at first, since you have separate dvc add, pull, and push commands and it creates additional files with a .dvc extension as pointers, but you get used to it! It's mainly used for machine learning projects but can be used here as well, so try it out.
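The pointer-file idea described in these comments can be illustrated with a toy sketch. It is not DVC's implementation: the JSON pointer format, the `.pointer.json` suffix, the SHA-256 hash, and the `track`/`restore` helpers are all assumptions made for illustration, and DVC's real `.dvc` files and cache layout differ. A local directory stands in for external storage such as S3.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(asset: Path, cache_dir: Path) -> Path:
    """Copy the asset into a content-addressed cache and write a small pointer file.

    Only the pointer would be committed to Git; the asset itself would be ignored.
    """
    digest = hashlib.sha256(asset.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(asset, cache_dir / digest)
    pointer = asset.with_name(asset.name + ".pointer.json")
    meta = {"path": asset.name, "sha256": digest, "size": asset.stat().st_size}
    pointer.write_text(json.dumps(meta))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the asset next to its pointer from the cached copy."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    texture = workdir / "texture.png"
    texture.write_bytes(b"fake image bytes")

    pointer = track(texture, workdir / "cache")
    pointer_name = pointer.name
    texture.unlink()  # simulate a fresh clone that only has the pointer

    restored = restore(pointer, workdir / "cache")
    roundtrip_ok = restored.read_bytes() == b"fake image bytes"

print(pointer_name)  # texture.png.pointer.json
print(roundtrip_ok)  # True
```

Because the pointer records a content hash, each committed version of the pointer refers to exactly one version of the asset, which is what makes versioning large files alongside Git workable.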
-
Data Version Control
It was definitely a bad choice. I wasn't there so I can only speculate. My guess is because it is sort of ubiquitous and thus a low-hanging fruit and devs didn't know better, or the related corollary, it's what S3 uses for ETags, so it probably seemed logical. Either way, seems like someone did it and didn't know better, no one agrees on a fix or whether it's even necessary to change, and thus it's stuck for now.
There's an ongoing discussion about replacing/configuring the hash function, but it looks like it hasn't gone anywhere substantial.
What are some alternatives?
clearml - ClearML - Auto-Magical CI/CD to streamline your ML workflow. Experiment Manager, MLOps and Data-Management
Sacred - Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.
zenml - ZenML 🙏: Build portable, production-ready MLOps pipelines. https://zenml.io.
guildai - Experiment tracking, ML developer tools
tensorflow - An Open Source Machine Learning Framework for Everyone
Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
neptune-client - 📒 Experiment tracking tool and model registry
H2O - H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
gensim - Topic Modelling for Humans
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
onnxruntime - ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
dagster - An orchestration platform for the development, production, and observation of data assets.