Make your monorepo feel small with Git’s sparse index

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • git-branchless

    High-velocity, monorepo-scale workflow for Git

  • Thanks for the feedback. I also received this request today to document a relevant workflow: https://github.com/arxanas/git-branchless/issues/210. If you want to be notified when I write the documentation (hopefully today?), then you can watch that issue.

    There's a decent discussion here on "stacked changes": https://docs.graphite.dev/getting-started/why-use-stacked-ch..., with references to other articles. This workflow is sometimes called "patch stack" or "stacked diffs" development. But it's not the full workflow that git-branchless enables.

    I use git-branchless 1) simply to scale to a monorepo, because `git move` is a lot faster than `git rebase`, and 2) to do highly speculative work and jump between many different approaches to the same problem (a kind of "breadth-first" search). I always had this problem with Git where I wanted to make many speculative changes, but branch and stash management got in the way. (For example, it's hard to update a commit which is a common ancestor of two or more branches. `git move` solves this.) The branchless workflow lets me be more nimble and update the commit graph more deftly, so that I can do experimental work much more easily.

  • libgit2

    A cross-platform, linkable library implementation of Git that you can use in your application.

  • The index as a data structure is really starting to show its age, especially as developers adapt Git to monorepo scale. It's really fast for repositories up to a certain size, but repositories at big tech organizations grow quickly and start to suffer performance issues. At some point, you can't afford to use a data structure that scales with the size of the repo, and have to switch to one that scales with the size of the user's change.

    I spent a good chunk of time working around the lack of sparse indexes in libgit2, which produced speedups on the order of 500x for certain operations, because reading and writing the entire index is unnecessary for most users of a monorepo: https://github.com/libgit2/libgit2/issues/6036. I'm excited to see sparse indexes make their way into Git proper.
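
To make the idea concrete, here is a minimal sketch of what a sparse index buys you. It is not Git's or libgit2's implementation: `sparsify` is a hypothetical helper, paths stand in for full index entries, and only top-level directories are collapsed, which is a simplification of cone-mode sparse-checkout.

```python
from typing import List, Set


def sparsify(paths: List[str], cone: Set[str]) -> List[str]:
    """Collapse every top-level directory outside `cone` into one entry.

    Hypothetical simplification: a real index entry carries much more than
    a path, and real cone patterns can be nested.
    """
    result: List[str] = []
    seen_dirs: Set[str] = set()
    for path in sorted(paths):
        top, sep, _ = path.partition("/")
        if not sep or top in cone:
            # Root-level file, or a file inside the cone: keep the full entry.
            result.append(path)
        elif top not in seen_dirs:
            # First file seen in an out-of-cone directory: emit a single
            # directory entry that stands in for the whole subtree.
            seen_dirs.add(top)
            result.append(top + "/")
    return result


full_index = [
    "README.md",
    "docs/a.md",
    "docs/b.md",
    "services/api/main.py",
    "services/web/app.js",
    "tools/lint.py",
]
sparse_index = sparsify(full_index, cone={"tools"})
```

With this input, the sparse list has one entry per out-of-cone directory instead of one per file, so its size tracks the part of the repository the user works in rather than the whole repository.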

    Shameless plug: I'm working on improving monorepo-scale Git tooling at https://github.com/arxanas/git-branchless, such as with in-memory rebases: https://blog.waleedkhan.name/in-memory-rebases/. Try it out if you work in a Git monorepo.

  • git

    A fork of Git containing Microsoft-specific patches. (by microsoft)

  • This is well written and deserves my upvote, because sparse-checkout is part of git and knowing how it works is useful.

    That said, there's absolutely no reason to structure your code in a monorepo.

    Here's what I think GitHub is doing:

    1) Encourage monorepo adoption

    2) Build tooling for monorepos

    3) Sell tooling to developers stranded in monorepos

    Microsoft, which owns GitHub, created the microsoft/git fork linked in the article, and they explain their justification here: https://github.com/microsoft/git#why-is-this-fork-needed

    > Well, because Git is a distributed version control system, each Git repository has a copy of all files in the entire history. As large repositories, aka monorepos grow, Git can struggle to manage all that data. As Git commands like status and fetch get slower, developers stop waiting and start switching context. And context switches harm developer productivity.

    I believe that Google's brand is so big that it led to this mass cognitive dissonance, which is being exploited by GitHub.

    To be clear, here are the two ideas in conflict:

    * Git is decentralized and fast, and Google famously doesn't use it.

    * Companies want to use "industry standard" tech, and Google is the standard for success.

    Now apply those observations to a world where your engineers only use "git".

    The result is market demand to misuse git for monorepos, which Microsoft is pouring huge amounts of resources into enabling via GitHub.

    It makes great sense that GitHub wants to lean into this. More centralization and being more reliant on GitHub's custom tooling is obviously better for GitHub.

    It just so happens that GitHub is building tools to enable monorepos, essentially normalizing their usage.

    Then GitHub can sell tools to deal with your enormous monorepo, because your traditional tools will feel slow and worse than GitHub's tools.

    In other words, GitHub is propping up the failed monorepo idea as a strategy to get people in the pipeline for things like CodeSpaces: https://github.com/features/codespaces

    Because if you have 100 projects and they're all separate, you can do development locally for each and it's fast and sensible. But if all your projects are in one repo, the tools grind to a halt, and suddenly you need to buy a solution that just works to meet your business goals.

  • Git

    Git Source Code Mirror - This is a publish-only repository but pull requests can be turned into patches to the mailing list via GitGitGadget (https://gitgitgadget.github.io/). Please follow Documentation/SubmittingPatches procedure for any of your improvements.

  • It makes more sense if you think of the index as a structure meant specifically to speed up `git status` operations. (It was originally called the "dircache"! See https://github.com/git/git/commit/5adf317b31729707fad4967c1a...) We desperately want to reduce the number of file accesses we have to make, so directly using the object database and a tree object would more than double the number of file accesses we have to make.

    There's performance-related metadata in the index which isn't in tree objects. For example, the modified-time of a given file exists in its index entry, which can be used to avoid reading the file from disk if it seems to be unmodified. If you have to do a disk lookup to decide whether to read a file from disk, then the overhead is as much as the operation itself.
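
The modified-time check can be sketched roughly as follows. This is an illustration only: `IndexEntry`, `make_entry`, and `is_modified` are hypothetical names, the entry stores far fewer fields than Git's real index entries, and the sketch ignores the racy-timestamp cases that Git handles.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical, simplified entry: cached stat data plus a content hash."""
    mtime_ns: int
    size: int
    digest: str


def hash_file(path: str) -> str:
    """Read the whole file and hash it (the expensive step we want to skip)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def make_entry(path: str) -> IndexEntry:
    """Record the file's current stat data and content hash."""
    st = os.stat(path)
    return IndexEntry(st.st_mtime_ns, st.st_size, hash_file(path))


def is_modified(path: str, entry: IndexEntry) -> bool:
    """Return True if the file's content differs from what `entry` recorded."""
    st = os.stat(path)
    if st.st_mtime_ns == entry.mtime_ns and st.st_size == entry.size:
        # Stat data matches the cached values: assume the file is unchanged
        # and skip reading it entirely.
        return False
    # Stat data changed: fall back to reading and hashing the content.
    return hash_file(path) != entry.digest
```

The common case (an untouched file) costs one `stat` call and no read. Only files whose cached stat data no longer matches are opened.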

    There's also semantic metadata, such as which stage the file is in (for merge conflict resolution).

    It's worth noting that you can turn on the cache tree extension (https://git-scm.com/docs/index-format#_cache_tree) in order to speed up commit operations. It doesn't replace objects in the index with trees, but it does keep ranges of the index cached, if they're known to correspond to a tree.
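
A rough sketch of the caching idea, under loud assumptions: `CacheTree` here is a hypothetical class that caches one tree id per directory and rescans the entry list, whereas Git's cache tree extension records ranges of index entries. The tree "hash" is also invented for illustration and is not Git's tree object format.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, str]  # (path, blob id); an index keeps these sorted by path


class CacheTree:
    """Hypothetical simplification: directory -> cached tree id (None = stale)."""

    def __init__(self) -> None:
        self.tree_ids: Dict[str, Optional[str]] = {}
        self.hash_calls = 0  # how many trees were actually recomputed

    def invalidate(self, path: str) -> None:
        """Mark every directory containing `path`, including the root, as stale."""
        parts = path.split("/")[:-1]
        self.tree_ids[""] = None
        for i in range(1, len(parts) + 1):
            self.tree_ids["/".join(parts[:i])] = None

    def write_tree(self, entries: List[Entry], directory: str = "") -> str:
        """Return the tree id for `directory`, reusing cached ids where valid."""
        cached = self.tree_ids.get(directory)
        if cached is not None:
            return cached
        prefix = directory + "/" if directory else ""
        children: Dict[str, str] = {}
        for path, blob_id in entries:
            if not path.startswith(prefix):
                continue
            name, sep, _ = path[len(prefix):].partition("/")
            if sep:
                # Entry lives in a subdirectory: compute that subtree once.
                if name not in children:
                    children[name] = self.write_tree(entries, prefix + name)
            else:
                children[name] = blob_id
        self.hash_calls += 1
        payload = "\n".join(f"{n} {i}" for n, i in sorted(children.items()))
        tree_id = hashlib.sha1(payload.encode()).hexdigest()
        self.tree_ids[directory] = tree_id
        return tree_id
```

After one file changes, only the directories on its path back to the root are recomputed. Sibling directories keep their cached ids, which is why a commit in a large repository does not need to rehash every tree.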

  • gecko-dev

    Read-only Git mirror of the Mercurial gecko repositories at https://hg.mozilla.org. How to contribute: https://firefox-source-docs.mozilla.org/contributing/contribution_quickref.html

  • There are decentralized monorepos, such as gecko-dev (https://github.com/mozilla/gecko-dev), which presumably has several forks in products like Iceweasel.

    I think the monorepo workflows which Git isn't good at are things like branching, code review, and feature/version management. But there's no reason that Git should have to be slow just because a repository is large. It's more things like "merge commits don't scale in a monorepo", which I would agree with.
