[D] Tips for ML workflow on raw data
2 projects | reddit.com/r/MachineLearning | 21 Jan 2022
Machine Learning adventures with MLFlow - Deploying models from local system to Production
1 project | reddit.com/r/learnmachinelearning | 22 Dec 2021
It's a bug in MLflow: https://github.com/mlflow/mlflow/issues/3755. Keep the server running, open another terminal, and export the MLFLOW_TRACKING_URI environment variable (on Windows, set it with `set` instead of `export`). That should work.
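A minimal sketch of that workaround; the host and port here are placeholders, so adjust them to wherever your `mlflow server` is actually listening:

```shell
# Assumed workaround: point the MLflow client at the already-running server
# from a second terminal. Host/port below are placeholders.
export MLFLOW_TRACKING_URI="http://127.0.0.1:5000"   # macOS/Linux
# Windows (cmd.exe) equivalent:  set MLFLOW_TRACKING_URI=http://127.0.0.1:5000
echo "$MLFLOW_TRACKING_URI"
```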
Old guy programmer here, need to brush up on Python quickly!
13 projects | reddit.com/r/Python | 6 Dec 2021
mlflow for logging and visualizing ML model experiments
Taking on the ML pipeline challenge: why data scientists need to own their ML workflows in production
4 projects | dev.to | 6 Dec 2021
So, if you want to use MLflow to track your experiments, run the pipeline on Airflow, and then deploy a model to a Neptune model registry, ZenML will facilitate this MLOps stack for you. This decision can be made jointly by the data scientists and engineers. As ZenML is a framework, custom pieces of the puzzle can also be added here to accommodate legacy infrastructure.
[D] 5 considerations for Deploying Machine Learning Models in Production – what did I miss?
3 projects | reddit.com/r/MachineLearning | 21 Nov 2021
Consideration #2: Consider using model life-cycle development and management platforms like MLflow, DVC, Weights & Biases, or SageMaker Studio. And Ray, Ray Tune, Ray Train (formerly Ray SGD), PyTorch, and TensorFlow for distributed, compute-intensive, and deep learning ML workloads.
[P] DagYard - DVC x MLflow x Colab x Gdrive - Automatically Configured
2 projects | reddit.com/r/MachineLearning | 18 Nov 2021
MLflow tracking automates the logging process of experiments and sends live information to a local or remote server while the training is still running.
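A rough sketch of what that logging looks like in practice; the `train()` function and its "accuracy" are stand-ins, not anything from MLflow itself, and the file-based tracking URI is an assumption for when no server is configured:

```python
# Minimal MLflow tracking sketch. train() is a placeholder training loop;
# mlflow.log_param / log_metric send values to the configured tracking store.
import os

try:
    import mlflow
except ImportError:  # mlflow not installed: fall back to plain printing
    mlflow = None

def train(lr: float) -> float:
    """Stand-in for a real training loop; returns a fake accuracy."""
    return 1.0 - lr * 0.5

lr = 0.1
acc = train(lr)

if mlflow is not None:
    # Defaults to a local file store if MLFLOW_TRACKING_URI is unset.
    mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "file:./mlruns"))
    with mlflow.start_run():
        mlflow.log_param("lr", lr)
        mlflow.log_metric("accuracy", acc)
else:
    print(f"lr={lr} accuracy={acc}")
```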
Data Science toolset summary from 2021
13 projects | dev.to | 13 Nov 2021
MLflow - https://mlflow.org/
How to store preprocessing and feature engineering pipeline?
1 project | reddit.com/r/datascience | 21 Oct 2021
MLOps project based template
4 projects | reddit.com/r/mlops | 11 Oct 2021
ML workflow - MLflow
[D] Facebook Visdom vs Google Tensorboard for Pytorch
5 projects | reddit.com/r/MachineLearning | 26 Sep 2021
Oh, I think most of the paid tracking solutions have auto-refresh. As for the free ones? At clear.ml we've had it for quite a while; for MLflow there is an open feature request: https://github.com/mlflow/mlflow/issues/2099
[D] Tips for ML workflow on raw data
2 projects | reddit.com/r/MachineLearning | 21 Jan 2022
Try to use a version controls tool for ML such as DVC
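For the unfamiliar, the basic DVC flow looks roughly like this (a sketch assuming `git` and `dvc` are on PATH; the directory and file names are illustrative):

```shell
# Hedged sketch: versioning a dataset with DVC alongside Git.
if command -v dvc >/dev/null 2>&1; then
  mkdir -p dvc-demo && cd dvc-demo
  git init -q
  dvc init -q                     # writes .dvc/ metadata, tracked by Git
  echo "id,label" > data.csv
  dvc add data.csv                # caches the file, writes a data.csv.dvc pointer
  git add data.csv.dvc .gitignore
  git -c user.email=ci@example.com -c user.name=ci commit -q -m "track dataset"
  cd ..
else
  DVC_MISSING=1                   # dvc not installed; commands above follow the DVC docs
fi
```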
Git-annex – Managing large files with Git
2 projects | news.ycombinator.com | 15 Jan 2022
IPFS for a shitty cause
3 projects | reddit.com/r/DataHoarder | 6 Jan 2022
Also, if anyone has ideas on how to better handle scientific data with IPFS I'd love to dig more into it. I'm fairly interested in getting DVC to work with IPFS: https://github.com/iterative/dvc/discussions/6777. I think git+dvc+ipfs would be a big step forward for fields that intersect with dataset storage / machine learning (which is a lot of them).
HPC Rocket - A tool to run Slurm jobs from CI pipelines
4 projects | reddit.com/r/Python | 3 Jan 2022
This looks really interesting! I have a similar scenario but haven't looked into it yet. Have you looked at dvc.org? I'm planning on using it together with Slurm and what they call CML for my projects. In that context I also wrote a tool that makes DVC more pythonic, https://github.com/zincware/ZnTrack, although I'm currently restructuring it a bit with backwards compatibility in mind.
Unstructured Data Governance for ML
4 projects | reddit.com/r/dataengineering | 31 Dec 2021
Pre-commit: framework for managing/maintaining multi-language pre-commit hooks
9 projects | news.ycombinator.com | 20 Dec 2021
Here's our setup, which is the result of several iterations and ergonomics refinements. Note: our stack is 90% python, with TS for frontend. Also 95% devs use mac (there's one data scientist on windows, he uses WSL).
We install enough utilities with `brew` to get pyenv working, use that to build all python versions. Then iirc `brew install pipx`, maybe it's `pip3 install --user pipx`. Anyway, that's the only python library binary installed outside a venv.
Pipx installs isort, black, dvc, and pre-commit.
Every repo has a Makefile. This drives all the common operations. Pyproject.toml (/eslint.json?) set the config for isort and black (or eslint). `make format` runs isort and black on python, eslint on js. `make lint` just verifies.
Pre-commit only runs the lint, it doesn't format. It also runs some scripts to ensure you aren't accidentally committing large files. Pre-commit also runs several DVC actions (the default dvc hooks) on commit, push, and checkout. These run in a venv managed by pre-commit. We just pin the version.
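For anyone wiring this up, a `.pre-commit-config.yaml` along these lines covers the DVC hooks mentioned above; the hook ids follow what `dvc install` generates for the pre-commit tool, and the `rev` is a placeholder you should pin yourself:

```yaml
repos:
  - repo: https://github.com/iterative/dvc
    rev: 2.9.3            # placeholder: pin to the DVC version you actually use
    hooks:
      - id: dvc-pre-commit
        stages: [commit]
      - id: dvc-pre-push
        stages: [push]
      - id: dvc-post-checkout
        stages: [post-checkout]
        always_run: true
```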
Github actions has a dedicated lint.yaml which runs a python linter action. We use the black version here to define which black pipx installs. We use `act` if we wanna see how an action runs without sending a commit just to trigger jobs.
As an aside, I'm still fiddling with the dvc `pre-commit` post-checkout hooks. They don't always pull the files when they ought to.
Most of the actual unit/integration tests run in containers, but they can run in a venv with the same logic, thanks to makefile. We use a dvc action to sync files in CI.
So yeah there's technically 2 copies of black and dvc, but we just use pinning. In practice, we've only had one issue with discrepancies in behavior locally vs CI, which was local black not catching a rule to avoid ''' for docstrings; using """ fixed it. On the whole, pre-commit saves against a lot of annoying goofs, but CI system is law, so we largely harmonize against that.
IMHO, this is the least egregious "double accounting" we have in local vs staging CI vs production CI (I lost that battle, manager would rather keep staging.yaml and production.yaml, rather than parameterize. Shrug.gif).
Running Collaborative Machine Learning Experiments with DVC and Git - Tutorial
1 project | reddit.com/r/GitOps | 13 Dec 2021
The following tutorial explains how you can bundle your data and code changes for each ML experiment and push those to a remote for your team to check out using DVC and Git: Running Collaborative ML Experiments
Don't Just Track Your ML Experiments, Version Them - Managing Machine Learning Experiments as Code with Git and DVC Open Source Tools
1 project | reddit.com/r/opensource | 10 Dec 2021
The following guide explains how ML experiment versioning with the DVC (Data Version Control) open source tool brings together the benefits of traditional code versioning and modern-day experiment tracking: Don't Just Track Your ML Experiments, Version Them
Managing Your Machine Learning Experiments as Code with Git and DVC
1 project | reddit.com/r/github | 9 Dec 2021
Experiment versioning treats experiments as code. It saves all metrics, hyperparameters, and artifact information in text files that can be versioned by Git, which becomes a store for experiment meta-information. The article above shows how, with the DVC tool, you can push experiments just like Git branches, giving you the flexibility to share the experiments you choose.
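The sharing flow described above boils down to a few `dvc exp` commands. A sketch per the DVC experiments docs; "origin" and "my-exp" are placeholder names, and the commands are wrapped in a function since they only make sense inside an initialized DVC repo:

```shell
# Hedged sketch of the experiment-sharing workflow with DVC.
run_dvc_exp_flow() {
  dvc exp run                  # execute the pipeline, record an experiment
  dvc exp show                 # tabulate metrics/params across experiments
  dvc exp push origin my-exp   # share the chosen experiment via the Git remote
  dvc exp pull origin my-exp   # a teammate pulls it from the remote
}
command -v dvc >/dev/null 2>&1 && echo "dvc available" || echo "dvc not installed"
```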
DVC (DataVersionControl) - Managing Machine Learning Experiments as Code with Git and DVC
1 project | reddit.com/r/githubprojects | 9 Dec 2021
What are some alternatives?
Sacred - Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.
clearml - ClearML - Auto-Magical CI/CD to streamline your ML workflow. Experiment Manager, MLOps and Data-Management
zenml - ZenML 🙏: MLOps framework to create reproducible pipelines.
Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
H2O - H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
tensorflow - An Open Source Machine Learning Framework for Everyone
guildai - Experiment tracking, ML developer tools
neptune-client - Neptune client library - integrate your Python scripts with Neptune
gensim - Topic Modelling for Humans
Activeloop Hub - Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai
scikit-learn - scikit-learn: machine learning in Python
onnxruntime - ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator