| | mandala | dvc |
|---|---|---|
| | 8 | 109 |
| Stars | 228 | 13,139 |
| Growth | - | 0.8% |
| Activity | 6.3 | 9.6 |
| | about 2 months ago | 6 days ago |
| Language | Python | Python |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
mandala
-
Mandala: A little playground for testing pixel logic patterns
I was so confused, expecting this to be some trickery related to the computational-graph-memoization-and-exploration tool "mandala" https://github.com/amakelov/mandala
- Mandala: Notebook memoization on steroids, used by Anthropic
-
Improve Jupyter Notebook Reruns by Caching Cells
This is neat and self-contained! But as someone running experiments with a high degree of interactivity, I often have an orthogonal requirement: add more computations to the same cell without recomputing previous computations done in the cell (or in other cells).
For a concrete example, often in an ML project you want to study how several quantities vary across several parameters. A straightforward workflow for this is: write some nested loops, collect results in python dictionaries, finally put everything together in a dataframe and compare (by plotting or otherwise).
However, after looking at the results, you may spot a trend and wonder whether it continues for a new value of one of the parameters; of course, you also want to keep the previous values and bring everything together in the same plot(s). You now have a problem: either re-run the cell (thus losing previous work, which is annoying even if you only have to wait 1 minute - you know it's a wasted minute!), or write the new computation in a new cell, possibly with a lot of redundancy (which over time makes the notebook hard to navigate and keep consistent).
So, this and other considerations eventually convinced me that the function is more natural than the cell as an interface/boundary at which caching should be implemented, at least for my use cases (coming from ML research). I wrote a framework based on this idea, with lots of other features (some quite experimental/unusual) to turn this into a feasible experiment management tool - check it out at https://github.com/amakelov/mandala
P.S.: I notice you use `pickle` for the hashing - `joblib.dump` is faster with objects containing numpy arrays, which covers a lot of useful ML things
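The workflow described in this comment can be sketched in plain Python. This is a minimal illustration of caching at the function boundary, not mandala's actual API: `memoize`, `train_and_eval`, and `sweep` are hypothetical names, and the cache lives only in memory. The key is a hash of the pickled arguments; `pickle` is used here only to keep the sketch dependency-free.

```python
import functools
import hashlib
import pickle


def memoize(fn):
    """Cache results of `fn`, keyed by a hash of its pickled arguments."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Sort kwargs so the key does not depend on keyword order.
        payload = pickle.dumps((args, sorted(kwargs.items())))
        key = hashlib.sha256(payload).hexdigest()
        if key not in cache:
            wrapper.misses += 1  # count real computations
            cache[key] = fn(*args, **kwargs)
        return cache[key]

    wrapper.misses = 0
    return wrapper


@memoize
def train_and_eval(lr, width):
    # Hypothetical stand-in for an expensive experiment.
    return round(width * lr, 6)


def sweep(lrs, widths):
    """Nested loops over the parameter grid, collected into rows."""
    return [
        {"lr": lr, "width": w, "score": train_and_eval(lr, w)}
        for lr in lrs
        for w in widths
    ]


first = sweep([0.1, 0.01], [64, 128])          # computes 4 combinations
second = sweep([0.1, 0.01, 0.001], [64, 128])  # computes only the 2 new ones
```

Because the cache sits on the function rather than the cell, widening the grid and re-running the same loop only pays for the combinations that have not been seen before.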
-
ML Experiments Management with Git
Another option, which manages versioning of your computational graph and its results and provides extremely elegant queryable memoization, is Mandala https://github.com/amakelov/mandala
It is a much simpler and much more magical piece of software that truly expanded how I think about writing, exploring, and experimenting with code. Even if you never use it, you probably would really enjoy reading the blog posts the author wrote about the design of the tool https://amakelov.github.io/blog/pl/
-
Snakemake – A framework for reproducible data analysis
You might like mandala (https://github.com/amakelov/mandala) - it is not a build recipe tool; rather, it is a tool that tracks the history of how your builds / computational graph have changed, and ties it to what the data looked like at each such step.
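One way to picture tying results to the version of the code that produced them is to include a fingerprint of the function in the cache key. The sketch below is an illustration under stated assumptions, not how mandala is implemented: `code_fingerprint` and `VersionedCache` are hypothetical names, and hashing compiled bytecode is a crude stand-in for real dependency tracking.

```python
import hashlib


def code_fingerprint(fn):
    """Hash a function's bytecode and constants as a crude version id."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


class VersionedCache:
    """Store results keyed by (function name, code version, arguments)."""

    def __init__(self):
        self.results = {}

    def call(self, fn, *args):
        key = (fn.__name__, code_fingerprint(fn), args)
        if key not in self.results:
            self.results[key] = fn(*args)
        return self.results[key]

    def history(self, name):
        """All (version, args, result) entries recorded under a function name."""
        return [(v, a, r) for (n, v, a), r in self.results.items() if n == name]


cache = VersionedCache()


def normalize(x):
    return x / 10


old = cache.call(normalize, 5)


def normalize(x):  # redefined: same name, different body
    return x / 100


new = cache.call(normalize, 5)
```

After the redefinition, `cache.history("normalize")` holds one entry per code version, so the older result stays available next to the newer one instead of being overwritten.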
-
Piper: A proposal for a graphy pipe-based build system
u/rust4yy: I've been building mandala, a Python framework for (among other things) incremental computing. One way to think of it is "a build system for Python objects", except the units of computation are Python functions.
dvc
-
My Favorite DevTools to Build AI/ML Applications!
Collaboration and version control are crucial in AI/ML development projects due to the iterative nature of model development and the need for reproducibility. GitHub is the leading platform for source code management, allowing teams to collaborate on code, track issues, and manage project milestones. DVC (Data Version Control) complements Git by handling large data files, data sets, and machine learning models that Git can't manage effectively, enabling version control for the data and model files used in AI projects.
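The idea of keeping large files out of Git while still versioning them can be illustrated with a small content-addressing sketch. This is a simplified model, not DVC's implementation: `track`, `restore`, the cache layout, and the pointer file format are all hypothetical. The only assumption is that Git would store the small pointer while the data itself lives in a separate cache.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and write a small pointer."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    pointer = data_file.parent / (data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file named by a pointer from the cache."""
    meta = json.loads(pointer.read_text())
    digest = meta["sha256"]
    target = pointer.parent / meta["path"]
    shutil.copy2(cache_dir / digest[:2] / digest[2:], target)
    return target


workdir = Path(tempfile.mkdtemp())
cache_dir = workdir / "cache"
dataset = workdir / "train.csv"
dataset.write_text("x,y\n1,2\n")

pointer = track(dataset, cache_dir)     # the pointer is what Git would commit
dataset.unlink()                        # simulate a checkout without the data
restored = restore(pointer, cache_dir)  # data comes back from the cache
```

The pointer stays a few bytes long no matter how large the dataset is, which is why it can be committed alongside the code.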
-
Why bad scientific code beats code following "best practices"
What you’re describing sounds like DVC (at a higher-ish—80%-solution level).
https://dvc.org/
See pachyderm too.
-
First 15 Open Source Advent projects
10. DVC by Iterative | Github | tutorial
-
Exploring Open-Source Alternatives to Landing AI for Robust MLOps
Platforms such as MLflow monitor the development stages of machine learning models. In parallel, Data Version Control (DVC) brings version control system-like functions to the realm of data sets and models.
- ML Experiments Management with Git
-
Git Version Controlled Datasets in S3
I used DVC (https://dvc.org/) for some time to help solve this, but the storage connections were getting hard to manage and I ran into cache issues a lot. This solves it using git-lfs itself.
- Ask HN: How do your ML teams version datasets and models?
-
Exploring MLOps Tools and Frameworks: Enhancing Machine Learning Operations
DVC (Data Version Control):
- Evaluate and Track Your LLM Experiments: Introducing TruLens for LLMs
-
[D] Is there a tool to keep track of my ML experiments?
I have been using DVC and MLflow since the days when DVC had only data tracking and MLflow only model tracking. I can say both are awesome now. The only factor I would mention is that, IMO, MLflow is a bit harder to learn, while DVC is practically just Git.
What are some alternatives?
oxen-release - Lightning fast data version control system for structured and unstructured machine learning datasets. We aim to make versioning datasets as easy as versioning code.
MLflow - Open source platform for the machine learning lifecycle
snakemake-wrappers - This is the development home of the Snakemake wrapper repository, see
lakeFS - lakeFS - Data version control for your data lake | Git for data
beaver - Simple, but capable build system and command runner for any project
Activeloop Hub - Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake]
aim - Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
delta - An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
sdk - Metadata store for Production ML
ploomber - The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
make-booster - Utility routines to simplify using GNU make and Python