Data Version Control

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • dvc

    🦉 ML Experiments and Data Management with Git

  • It was definitely a bad choice. I wasn't there, so I can only speculate. My guess is that it was picked because it is ubiquitous and thus low-hanging fruit, and the devs didn't know better; or, as a related corollary, because it's what S3 uses for ETags, so it probably seemed logical. Either way, it seems someone chose it without knowing better, no one agrees on a fix or on whether a change is even necessary, and so it's stuck for now.

    There's an ongoing discussion about replacing/configuring the hash function, but it looks like it hasn't gone anywhere substantial.

    https://github.com/iterative/dvc/issues/3069
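
To make the "configurable hash function" idea concrete, here is a minimal sketch in Python. It assumes the hash under discussion is MD5 (the comment does not name it, but S3 ETags for simple, non-multipart uploads are MD5 digests). The function name `hash_file` and its parameters are hypothetical and are not part of DVC.

```python
import hashlib


def hash_file(path, algorithm="md5", block_size=1 << 20):
    """Return the hex digest of a file, read in fixed-size blocks.

    `algorithm` is any name accepted by hashlib.new(), e.g. "md5" or
    "sha256". Hypothetical helper for illustration, not DVC's API.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read block by block so large files are never loaded whole.
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()
```

Calling `hash_file("data.bin", "sha256")` instead of the default would switch algorithms for one file. The hard part in a real tool is presumably not the hashing itself but migrating caches and remotes that are already keyed by the old digests.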

  • telemetry-python

    Common library to send usage telemetry

  • VS Code, etc

    > I think the challenge I have is that since you’re getting IP address that will be an opportunity to abuse.

    Yes! And we are migrating to the new package / infrastructure because of this (https://github.com/iterative/telemetry-python). DVC's sister tool MLEM is already on it: it does not send or save IP addresses and does not use GA or any other third-party tools. Data is saved into BigQuery, and eventually we'll make it publicly accessible (https://mlem.ai/doc/user-guide/analytics) to be fully GDPR compatible. What DVC has in place is a legacy system. There was no intention to use those IP addresses in any way.

    > I think perhaps the only other way would be to support an automated distro that doesn’t include it so users are at least able to easily choose a version.

    Thanks. To some extent, a brew-like policy (not sending anything significant before there is a chance to disable it, and showing a clear, explicit message) should mitigate this, but I'll check whether it works this way now and whether it can be improved.

  • Estranged.Lfs

    A Git LFS server implementation in C# designed to run in a serverless environment.

  • Did some more research to see if anything had changed in this space. I found two interesting projects (haven't used them myself yet though):

    One in C# (with support for auth)

    https://github.com/alanedwardes/Estranged.Lfs

    One in Rust (but no auth; you have to run a reverse proxy)

    https://github.com/jasonwhite/rudolfs

    Both seem interesting. Anyone use these?

  • rudolfs

    A high-performance, caching Git LFS server with an AWS S3 and local storage back-end.

  • dupver

    Deduplicating VCS for large binary files in Go

  • I work with a lot of uncompressed structured binary files so I finally broke down and wrote my own system based on the Restic chunker: https://github.com/akbarnes/dupver
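
For readers unfamiliar with chunk-based deduplication, here is a rough Python sketch of the general idea. It is not the Restic chunker (which uses Rabin fingerprints) and not dupver's code: it swaps in a simple gear-style rolling hash, and every name in it (`GEAR`, `split_chunks`, `store`, `restore`) is hypothetical. Real chunkers also enforce minimum and maximum chunk sizes, which this sketch leaves out.

```python
import hashlib
import random

# Hypothetical gear table: 256 pseudo-random 32-bit values. The seed is
# fixed so that chunk boundaries are reproducible across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data: bytes, bits: int = 8) -> list[bytes]:
    """Cut wherever the top `bits` bits of a rolling gear hash are zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each step shifts older bytes left; after 32 steps they fall out
        # of the 32-bit state, so h depends only on the last 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> (32 - bits) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def store(data: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Store each chunk once under its SHA-256 digest; return the recipe."""
    recipe = []
    for chunk in split_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(key, chunk)  # duplicate chunks are skipped
        recipe.append(key)
    return recipe


def restore(recipe: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Rebuild the original bytes from a recipe of chunk digests."""
    return b"".join(chunk_store[key] for key in recipe)
```

Because cut points depend on content rather than on byte offsets, inserting a few bytes near the start of a file changes only the first chunk or two; later chunks hash to the same keys and are not stored again.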

  • dud

    A lightweight CLI tool for versioning data alongside source code and building data pipelines.

  • Cura

    3D printer / slicing GUI built on top of the Uranium framework

  • I wonder what the GDPR implications of this are. I note that other projects (e.g. Cura) switched their telemetry to opt-in.

    https://github.com/Ultimaker/Cura/issues/2810
