DVC Alternatives
Similar projects and alternatives to DVC
- Activeloop Hub: Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data in real time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake] (by activeloopai)
- ploomber: The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
- aim: Aim 💫 — An easy-to-use & supercharged open-source AI metadata tracker (experiment tracking, AI agents tracing)
- delta: An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
- dud: A lightweight CLI tool for versioning data alongside source code and building data pipelines.
- git-submodules: A Git submodule alternative with equivalent features, but easier to use and maintain.
- EdenSCM: A scalable, user-friendly source control system. [Moved to: https://github.com/facebook/sapling]
- spock: A framework that helps manage complex parameter configurations during the research and development of Python applications. (by fidelity)
- Airflow: Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.
- pre-commit: A framework for managing and maintaining multi-language pre-commit hooks.
DVC reviews and mentions
- Ask HN: How do your ML teams version datasets and models?
- Exploring MLOps Tools and Frameworks: Enhancing Machine Learning Operations
  DVC (Data Version Control):
- Evaluate and Track Your LLM Experiments: Introducing TruLens for LLMs
- [D] Is there a tool to keep track of my ML experiments?
  I have been using DVC and MLflow since the days when DVC had only data tracking and MLflow only model tracking. I can say both are awesome now; the only factor I would mention is that, IMO, MLflow is a bit harder to learn, while DVC is practically just Git.
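The "DVC is practically just Git" observation reflects DVC's core mechanism: large files go into a content-addressed cache, and only small pointer files are committed to Git. A minimal stdlib sketch of that idea (illustrative only, not DVC's actual code or API; file and directory names are hypothetical):

```python
import hashlib
import json
import shutil
from pathlib import Path

def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy a data file into a content-addressed cache (DVC-style)
    and write a small pointer file suitable for committing to Git."""
    md5 = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Cache layout: first two hex chars as a subdirectory, similar to DVC's .dvc/cache
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cached)
    # The pointer file is tiny, so Git can version it cheaply
    pointer = data_file.parent / (data_file.name + ".dvc")
    pointer.write_text(json.dumps({"md5": md5, "path": data_file.name}))
    return pointer

# Usage sketch:
# pointer = add_to_cache(Path("data.csv"), Path(".dvc/cache"))
# -> commit the pointer file to Git; push the cache to remote storage separately
```

Because Git only ever sees the pointer files, the usual `git add` / `git commit` workflow stays intact regardless of how large the underlying data is.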
- Ask HN: Data Management for AI Training
  * User interface for less tech-savvy people (e.g., a Git-like command line is fine for engineers but not for field personnel who are not in IT)
  I know of tools like https://dvc.org/, but a) they are just layers on top of Git, b) they break apart on huge datasets without a folder hierarchy (Git tree objects just don't work for linear lists of items), are only usable by IT personnel, and require checking out at least part of the dataset.
  Our datasets would be 100,000,000 x 100 MB = 10 PB of raw data. Training data should be delivered to training nodes via the network, etc. We just can't have a full checkout of that data...
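The scale estimate in that comment checks out, assuming decimal (SI) units where 1 PB = 10^15 bytes:

```python
# Back-of-envelope check of the commenter's estimate, in decimal (SI) units
samples = 100_000_000           # items in the dataset
bytes_per_sample = 100 * 10**6  # 100 MB each
total_bytes = samples * bytes_per_sample
petabytes = total_bytes / 10**15
print(petabytes)  # 10.0
```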
- Do you wonder why MLOps is not at the same level as DevOps?
  Hey, great find! However, it only explains concepts, not how to actually use any tool. I personally use DVC, but it's more focused on the model development/engineering phase. The different phases of ML are also done independently, which makes it even more difficult for an individual to gain exposure to all the different areas. Moreover, the lack of standard tools and best practices makes it difficult, as does the fact that every ML problem is different.
- Oxen.ai: Fast Unstructured Data Version Control
  How does this compare with other systems, like DVC (https://dvc.org/), for example?
- Career advice for getting into NLP from a Computer Science background?
  For the data cleaning and training parts, you might have projects where you've used Kaggle datasets to train models and have done appropriate feature engineering and data exploration to understand whether data might need to be under- or over-sampled, or cleaned in some other way. I'd give bonus points to someone who has thought about how training pipelines might be semi- or fully automated in a production environment (e.g., use of scripts and tools like DVC to make things easy to reproduce). I'd want to see evidence of appropriate metrics (e.g., I know it's 99% accurate, and that might be great, but if it's a 10-way classification on a very unbalanced dataset, what can you tell me about performance on the smallest class?).
- ML experiment tracking with DagsHub, MLflow, and DVC
  Here, we'll implement the experimentation workflow using DagsHub, Google Colab, MLflow, and Data Version Control (DVC). We'll focus on how to do this without diving deep into the technicalities of building or designing a workbench from scratch. Going that route might increase the complexity involved, especially if you are in the early stages of understanding ML workflows, working on a small project, or trying to implement a proof of concept.
- Show HN: We scaled Git to support 1 TB repos
  There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.
  If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create objects proportional to O(n) directory nesting depth, as Xet appears to. (Xet is very much like Git in that respect.)
  The data versioning system enables us to run pipelines based on changes to your data: the pipelines declare which files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.
  Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)
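The change-driven scheduling that comment describes rests on a simple primitive: diff two content snapshots of the data and reprocess only what changed. A minimal stdlib sketch of that idea (illustrative only; not Pachyderm's or DVC's actual implementation, and the function names are hypothetical):

```python
import hashlib
from pathlib import Path

def snapshot(data_dir: Path) -> dict:
    """Map each file's relative path to a hash of its contents."""
    return {
        str(p.relative_to(data_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }

def changed_files(old: dict, new: dict) -> list:
    """Files that are new or whose contents changed since the last snapshot."""
    return [path for path, digest in new.items() if old.get(path) != digest]

# A pipeline scheduler would run the processing step only on
# changed_files(previous_snapshot, current_snapshot), reusing the
# previous run's results for all unchanged files.
```

Over a multi-petabyte dataset where only a sliver of files change between runs, skipping the unchanged files is where the compute savings in the quoted comment come from.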
Stats
iterative/dvc is an open-source project licensed under the Apache License 2.0, which is an OSI-approved license.
The primary programming language of dvc is Python.