Show HN: We scaled Git to support 1 TB repos

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • sso-wall-of-shame

    A list of vendors that treat single sign-on as a luxury feature, not a core security requirement.

  • Please consider https://sso.tax/ before making that an "enterprise" feature.

  • pachyderm

    Data-Centric Pipelines and Data Versioning

  • There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.

    If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create a number of objects proportional to directory nesting depth, as Xet's appears to. (Xet is very much like Git in that respect.)

    The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.

    Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)
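
The incremental-processing idea in the comment above can be sketched roughly as follows. This is a minimal illustration, not Pachyderm's implementation: the names `fingerprint`, `changed_paths`, and `run_incremental` are hypothetical, and a "commit" is assumed to be a plain mapping from path to file bytes.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fingerprint(data: bytes) -> str:
    """Content hash used to tell whether a file changed between versions."""
    return hashlib.sha256(data).hexdigest()


def changed_paths(previous: Dict[str, str], files: Dict[str, bytes]) -> List[str]:
    """Return paths that are new or whose content differs from the previous version."""
    return sorted(
        path for path, data in files.items()
        if previous.get(path) != fingerprint(data)
    )


def run_incremental(
    files: Dict[str, bytes],
    previous: Dict[str, str],
    cache: Dict[str, object],
    process: Callable[[bytes], object],
) -> Tuple[Dict[str, str], List[str]]:
    """Reprocess only new or changed files; reuse cached results for the rest.

    `cache` is updated in place so that it always holds the full set of
    results, as if every file had been reprocessed.
    """
    todo = changed_paths(previous, files)
    for path in todo:
        cache[path] = process(files[path])
    # Drop results for files that no longer exist in this version.
    for path in list(cache):
        if path not in files:
            del cache[path]
    fingerprints = {path: fingerprint(data) for path, data in files.items()}
    return fingerprints, todo


if __name__ == "__main__":
    cache: Dict[str, object] = {}
    v1 = {"a.csv": b"1,2", "b.csv": b"3,4"}
    state, ran = run_incremental(v1, {}, cache, len)
    print(ran)  # both files are processed on the first run

    v2 = {"a.csv": b"1,2", "b.csv": b"3,4,5", "c.csv": b"6"}
    state, ran = run_incremental(v2, state, cache, len)
    print(ran)  # only the changed and the new file are processed
```

The compute saving comes from the second call: the unchanged file is skipped, yet `cache` still contains a result for every file in the current version.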

  • dvc

    🦉 ML Experiments and Data Management with Git

  • dolt

    Dolt – Git for Data

  • Founder of DoltHub here. One of my team pointed me at this thread. Congrats on the launch. Great to see more folks tackling the data versioning problem.

    Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.

    https://github.com/dolthub/dolt

    Dolt also scales to the 1TB range and offers you full SQL query capabilities on your data and diffs.

  • sapling

    A Scalable, User-Friendly Source Control System.

  • Bup

    Very efficient backup system based on the git packfile format, providing fast incremental saves and global deduplication (among and within files, including virtual machine images). Please post problems or patches to the mailing list for discussion (see the end of the README below).
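
To make "deduplication among and within files" concrete, here is a small sketch of a content-addressed chunk store. It is a simplification under stated assumptions: it splits data into fixed-size chunks, whereas bup splits on boundaries chosen by a rolling checksum, and the `ChunkStore` class and `CHUNK_SIZE` value are hypothetical names used only for illustration.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow


class ChunkStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> List[str]:
        """Split data into fixed-size chunks and return the list of digests."""
        recipe = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk already seen (in this file or another) is not stored again.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe: List[str]) -> bytes:
        """Reassemble the original bytes from a list of digests."""
        return b"".join(self.chunks[digest] for digest in recipe)


if __name__ == "__main__":
    store = ChunkStore()
    first = store.put(b"AAAABBBBCCCC")
    second = store.put(b"AAAABBBBDDDD")  # shares two chunks with the first file
    print(len(store.chunks))             # distinct chunks stored so far
    print(store.get(second))
```

Fixed-size chunks stop matching as soon as bytes are inserted near the start of a file, which is the usual motivation for content-defined chunk boundaries.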

  • cesium-unreal-samples

    Getting Started Sample Project for Cesium for Unreal

  • Most of my examples of that size are AAA game source that I can't share. However, I think this project, which is based on Unreal, uses similar files and should show whether there is any benefit: https://github.com/CesiumGS/cesium-unreal-samples, where the .umap binaries have been updated, and this example, where the .uasset blueprints have been updated: https://github.com/renhaiyizhigou/Unreal-Blueprint-Project

  • Unreal-Blueprint-Project

    Replica of PlayerUnknown's Battlegrounds Game created using UE4's Blueprint Editor

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
