Elfshaker: GiB – 100 MiB, with 1s access time

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • elfshaker

    elfshaker stores binary objects efficiently

  • We've just added an applicability section, which explains in more detail what we do. We don't have any ELF-specific heuristics.

    https://github.com/elfshaker/elfshaker#applicability

    In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions: the addresses are section-relative, so the damage from any address changing is contained within its section. Somewhat surprisingly, this works well enough when building many revisions of clang/llvm. As you go from commit to commit, many commits have bit-identical object files, even though the build system often wants to rebuild them because some input has changed.

    elfshaker packs use a heuristic of sorting all unique objects by size before concatenating them and compressing the result with zstandard. This gives us an amortized cost per commit of something like 40 KiB after compression.
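
The heuristic described in the comment above (deduplicate objects, sort the unique ones by size, concatenate, compress) can be sketched roughly as follows. This is an illustration only, not elfshaker's actual pack format or code: the names `build_pack` and `extract` are hypothetical, and the standard-library `zlib` stands in for zstandard so the example has no third-party dependency.

```python
import hashlib
import zlib


def build_pack(snapshots):
    """Pack several snapshots (name -> {path: bytes}) into one compressed blob.

    Hypothetical sketch: zlib is used here in place of zstandard.
    """
    unique = {}  # content digest -> content, one entry per distinct object
    index = {}   # snapshot name -> {path: content digest}
    for name, files in snapshots.items():
        index[name] = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            unique.setdefault(digest, content)
            index[name][path] = digest

    # Sort unique objects by size (digest breaks ties deterministically),
    # so that similarly sized objects end up next to each other.
    ordered = sorted(unique.items(), key=lambda kv: (len(kv[1]), kv[0]))

    offsets = {}  # content digest -> (offset, length) in the uncompressed blob
    blob = bytearray()
    for digest, content in ordered:
        offsets[digest] = (len(blob), len(content))
        blob += content

    return zlib.compress(bytes(blob), 9), offsets, index


def extract(pack, offsets, index, snapshot):
    """Rebuild the {path: bytes} mapping of one snapshot from the pack."""
    blob = zlib.decompress(pack)
    files = {}
    for path, digest in index[snapshot].items():
        offset, length = offsets[digest]
        files[path] = blob[offset:offset + length]
    return files
```

In this sketch an object file that is bit-identical across commits is stored once, no matter how many snapshots refer to it, which is the effect the comment attributes to the many unchanged object files between commits.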

  • manyclangs

    Repository hosting unofficial binary pack files for many commits of LLVM

  • Author here. To our knowledge, elfshaker itself does not depend on any particular architecture. We support the architectures we have immediate use for.

    manyclangs provides binary pack files for aarch64 because that's what we have immediate use for. If elfshaker and manyclangs prove useful to people, I would love to see resources invested to make them more widely useful.

    You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.

    [0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...

  • nixpkgs

    Nix Packages collection & NixOS

  • I experimented with something similar with a Linux distribution's package binary cache.

    Using `bup` (deduplicating backup tool using git packfile format) I deduplicated 4 Chromium builds into the size of 1.

    Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.

    Details: https://github.com/NixOS/nixpkgs/issues/89380
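
To illustrate why deduplication can shrink several near-identical builds to roughly the size of one, here is a toy content-defined chunking sketch. bup splits data on a rolling checksum; the version below uses a much simpler rolling byte sum and is not bup's algorithm or storage format. The names `split`, `store`, and `restore`, and the `WINDOW` and `MASK` values, are assumptions made for the example.

```python
import hashlib

WINDOW = 16  # number of trailing bytes covered by the rolling sum
MASK = 0x3F  # cut a chunk when the low 6 bits of the sum are all ones


def split(data):
    """Yield chunks whose boundaries depend only on nearby content."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte leaving the window
        if (rolling & MASK) == MASK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing chunk without a boundary


def store(data, chunk_store):
    """Add data to chunk_store (digest -> chunk) and return its recipe."""
    recipe = []
    for chunk in split(data):
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # identical chunks stored once
        recipe.append(digest)
    return recipe


def restore(recipe, chunk_store):
    """Reassemble the original bytes from a recipe of chunk digests."""
    return b"".join(chunk_store[digest] for digest in recipe)
```

Because a boundary depends only on the last `WINDOW` bytes, inserting a few bytes into a file changes the chunks around the insertion point while later chunks line up again, so a second, slightly different build adds only a small number of new chunks to the store.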

  • dwarfs

    A fast high compression read-only file system for Linux, Windows and macOS

  • Somewhat related (and definitely born out of a very similar use case): https://github.com/mhx/dwarfs

    I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.
