How Much Faster Is Making a Tar Archive Without Gzip?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • zstd

    Zstandard - Fast real-time compression algorithm

    For anyone who wants to try this, zstd -T0 uses all your threads to compress, and https://github.com/facebook/zstd has a lot more detail. Brotli, https://github.com/google/brotli, is another modern format with some good features for high compression levels and Content-Encoding support in web browsers. You might also want to play with the compression level (gzip takes -1 through -9, brotli's quality goes up to 11, zstd goes up to -19, and zstd's --fast=N trades ratio for even more speed); a short command-line sketch follows at the end of this entry.

    One reason these modern compressors do better is not any particular mistake made in defining DEFLATE in the 90s, but that the new algorithms use a few MB of recently seen data as context instead of 32 KB, and do other things that were impractical in the 90s but are reasonable on modern hardware. The new algorithms also contain lots of smart ideas and have fine-tuned implementations, but that core difference seems important to note.
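
    To make the comparison in the title concrete, here is a minimal shell sketch (assuming GNU tar and the zstd CLI are installed; the directory name is illustrative):

      # Baseline from the title: a plain, uncompressed tar
      tar -cf data.tar data/

      # Traditional tar + gzip
      tar -czf data.tar.gz data/

      # tar piped into zstd on all cores (-T0) at the default level 3
      tar -cf - data/ | zstd -T0 -3 -o data.tar.zst

      # Unpack again: decompress to stdout and untar
      zstd -dc data.tar.zst | tar -xf -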

  • brotli

    Brotli compression format
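
    A minimal sketch of the brotli CLI for comparison (assuming the brotli command built from this repo; quality runs from 0 to 11):

      # Compress a tarball at maximum quality (small but slow)
      brotli -q 11 -o data.tar.br data.tar

      # Decompress again (-f overwrites an existing output file)
      brotli -d -f -o data.tar data.tar.br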

  • isa-l

    Intelligent Storage Acceleration Library

    igzip (https://github.com/intel/isa-l) is much faster than gzip or pigz when it comes to decompression, 2-3x in my experience. There is also a Python module (isal) that provides a GzipFile-like wrapper class, for an easy speed-up of Python scripts that read gzipped files.

    However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip. You also need to make sure to use the latest version if you are going to use it for bioinformatics, since older versions choke on the concatenated gzip files that are common in that field.
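
    A minimal sketch of the command-line side (assuming the igzip binary that ships with isa-l; igzip mimics the gzip interface, but double-check the flags on your version):

      # Compress at igzip's maximum level (3); writes file.gz
      igzip -3 file

      # Decompress (2-3x faster than gzip, per the comment above)
      igzip -d file.gz

      # Assumption: the isal Python module also exposes a gzip-like CLI
      python -m isal.igzip -d file.gz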

  • rapidgzip

    Gzip Decompression and Random Access for Modern Multi-Core Machines

  • libslz

    Stateless, zlib-compatible, and very fast compression library -- http://libslz.org

  • indexed_gzip

    Fast random access of gzip files in Python

    Pragzip actually decompresses in parallel and also offers random access. I did a Show HN here: https://news.ycombinator.com/item?id=32366959

    indexed_gzip https://github.com/pauldmccarthy/indexed_gzip can also do random access but is not parallel.

    Both have to do a linear scan first, though. The implementations can, however, do the linear scan on demand, i.e., they scan only as far as needed.

    bzip2 works very well with this approach. xz only works with it when the file was compressed with multiple blocks, and the same is true for zstd.

    For zstd, there also exists a seekable variant, which stores the block index as metadata at the end of the file to avoid the linear scan. indexed_zstd offers random access to such files: https://github.com/martinellimarco/indexed_zstd

    I wrote pragzip and also combined all of the other random-access compression backends in ratarmount to offer random access to TAR files that is orders of magnitude faster than archivemount: https://github.com/mxmlnkn/ratarmount
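
    As a rough sketch of that ratarmount workflow (assuming it is installed via pip and FUSE is available; archive and path names are illustrative):

      # Mount a compressed TAR as a read-only filesystem
      pip install ratarmount
      ratarmount big.tar.gz mounted/

      # Random access through ordinary file operations
      ls mounted/
      tail mounted/some/file.txt

      # Unmount when done
      fusermount -u mounted/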

  • indexed_zstd

    A bridge for libzstd-seek to Python. Based on mxmlnkn/indexed_bzip2

  • ratarmount

    Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
