Xz format considered inadequate for long-term archiving

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • dwarfs

    A fast high compression read-only file system for Linux, Windows and macOS

  • https://github.com/mhx/dwarfs

    "DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources."

  • pixz

    Parallel, indexed xz compressor

  • pixz (https://github.com/vasi/pixz) is a nice parallel xz that additionally creates an index of tar files so you can decompress individual files. I wonder if dpkg could be extended to do something similar.
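The idea of an index that allows individual files to be decompressed can be sketched roughly as follows. This is only an illustration of the principle, not pixz's actual on-disk layout: the `pack` and `extract` helpers and the in-memory index are hypothetical names, and each member is simply compressed as its own xz stream with Python's standard `lzma` module.

```python
import lzma


def pack(members):
    """Compress each (name, data) pair as an independent xz stream.

    Returns the concatenated streams plus an index that maps each name
    to the (offset, length) of its stream. The index layout is an
    assumption made for this sketch, not pixz's format.
    """
    blob = bytearray()
    index = {}
    for name, data in members:
        compressed = lzma.compress(data)  # one self-contained .xz stream
        index[name] = (len(blob), len(compressed))
        blob += compressed
    return bytes(blob), index


def extract(blob, index, name):
    """Decompress a single member without touching the others."""
    offset, length = index[name]
    return lzma.decompress(blob[offset:offset + length])


if __name__ == "__main__":
    members = [("a.txt", b"alpha " * 1000), ("b.txt", b"beta " * 1000)]
    blob, index = pack(members)
    # Only the bytes belonging to b.txt are decompressed here.
    print(len(extract(blob, index, "b.txt")))
```

Because concatenated xz streams still form a valid xz file, the whole blob can also be decompressed front to back by an ordinary xz decoder. The cost is a somewhat worse compression ratio, since each member is compressed without context from its neighbours.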

  • zstd

    Zstandard - Fast real-time compression algorithm

  • zstd still doesn't have a seekable format as part of the official standard (I wish it did): https://github.com/facebook/zstd/issues/395#issuecomment-535...

  • zpaqlpy

    Compiles a zpaqlpy source file (a Python-subset) to a ZPAQ configuration file for usage with zpaqd

  • ZPAQ is the name of both the tool and the container format it uses. ZPAQ embeds the decompression algorithm in the archive. One could store zstd-compressed blocks in ZPAQ archives as soon as a zpaql decompressor exists (e.g., for brotli there is a slow one implemented in a Python subset and compiled to zpaql: https://github.com/pothos/zpaqlpy).

    I don't know exactly whether other formats are better for seeking and streaming. But since the baseline is tar, ZPAQ (in the 2.0 spec) is already better: it supports deduplication, archives can be updated in an append-only fashion, and the compression is well integrated rather than an afterthought wrapped around the container.

  • zfec

    zfec -- an efficient, portable erasure coding tool

  • I disagree with the premise of the article. Archive formats are all inadequate for long-term resilience and making them adequate would be a violation of the “do one thing and do it right” principle.

    To support resilience, you don’t need an alternative to xz, you need hashes and forward error correction. Specifically, compress your file using xz for high compression ratio, optionally encrypt it, then take a SHA-256 hash to be used for detecting errors, then generate parity files using PAR[1] or zfec[2] to correct errors.

    [1] https://wiki.archlinux.org/title/Parchive

    [2] https://github.com/tahoe-lafs/zfec
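The compress, hash, then add parity workflow described in this comment can be sketched as below. This is a toy under stated assumptions: a single XOR parity block stands in for PAR2 or zfec (which use far stronger codes and can recover more than one lost block), the optional encryption step is skipped, and `BLOCK`, `make_archive`, `repair`, and `restore` are hypothetical names chosen for the illustration. The SHA-256 hash only detects damage; the sketch assumes the position of the damaged block is already known.

```python
import hashlib
import lzma

BLOCK = 64  # hypothetical block size, kept small so the demo has many blocks


def split_blocks(data, size=BLOCK):
    """Split data into fixed-size blocks, zero-padding the last one."""
    return [data[i:i + size].ljust(size, b"\0") for i in range(0, len(data), size)]


def xor_parity(blocks):
    """XOR equally sized blocks together byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def make_archive(payload):
    """Compress with xz, hash the compressed bytes, then add one parity block."""
    compressed = lzma.compress(payload)
    blocks = split_blocks(compressed)
    return {
        "length": len(compressed),
        "sha256": hashlib.sha256(compressed).hexdigest(),
        "blocks": blocks,
        "parity": xor_parity(blocks),
    }


def repair(archive, lost_index):
    """Rebuild one block at a known position from the parity and the survivors."""
    survivors = [b for i, b in enumerate(archive["blocks"]) if i != lost_index]
    archive["blocks"][lost_index] = xor_parity(survivors + [archive["parity"]])


def restore(archive):
    """Verify the hash of the compressed data before decompressing it."""
    compressed = b"".join(archive["blocks"])[:archive["length"]]
    if hashlib.sha256(compressed).hexdigest() != archive["sha256"]:
        raise ValueError("SHA-256 mismatch: archive is damaged")
    return lzma.decompress(compressed)


if __name__ == "__main__":
    payload = "".join(str(i) for i in range(5000)).encode()
    archive = make_archive(payload)
    archive["blocks"][0] = bytes(BLOCK)  # simulate a damaged first block
    repair(archive, 0)
    print(restore(archive) == payload)
```

Keeping the three concerns separate, as the comment argues, means the compressor never needs to know about error correction: the hash and the parity data are computed over the finished compressed file.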

  • BorgBackup

    Deduplicating archiver with compression and authenticated encryption.

  • Both Borg [0] and Restic [1] have long-standing open issues for error correction, but seem to consider it outside their strategy. I find that decision rather strange, since to me the whole purpose of a backup solution is to restore your system to a correct state after any kind of incident.

    My current solution is a collection of shell scripts that combine borg with par2, but I'm rather unhappy with it. For one, I have only limited trust in my home-brewed solution (similar to `don't roll your own crypto`, I think there should be an adage `don't roll your own back-up solutions`). In addition, I think an error-correcting mechanism should also be available to less technology-savvy users.

    [0]: https://github.com/borgbackup/borg/issues/225

    [1]: https://github.com/restic/restic/issues/256

  • restic

    Fast, secure, efficient backup program
