How Much Faster Is Making a Tar Archive Without Gzip?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • zstd

    Zstandard - Fast real-time compression algorithm

  • For anyone who wants to try this, zstd -T0 uses all your threads to compress, and https://github.com/facebook/zstd has a lot more description. Brotli, https://github.com/google/brotli, is another modern format with some good features for high compression levels and Content-Encoding support in web browsers. You might also want to play with the compression level (zstd's -1 through -19 or --fast=N, brotli's quality 0 through 11).

    One reason these modern compressors do better is not any particular mistake made defining DEFLATE in the 90s, but that the new algorithms use a few MB of recently seen data as context instead of 32 KB, and do other things that were impractical in the 90s but are reasonable on modern hardware. They also contain lots of smart ideas and have fine-tuned implementations, but that core difference seems important to note.
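The context-size point is easy to demonstrate with Python's standard-library zlib (a DEFLATE implementation). This is a minimal sketch, not from the original thread: a repeated block whose second copy lies farther back than DEFLATE's 32 KiB window cannot be deduplicated, while a repeat within the window nearly vanishes.

```python
import os
import zlib

# DEFLATE (zlib/gzip) can only reference matches within a 32 KiB window.
far = os.urandom(200_000)   # incompressible block; repeat distance 200 KB
near = os.urandom(16_000)   # repeat distance 16 KB, inside the window

far_twice = zlib.compress(far + far, 9)
near_twice = zlib.compress(near + near, 9)

# The distant repeat is invisible: output is roughly twice the block size.
assert len(far_twice) > 1.8 * len(far)
# The nearby repeat is found: the second copy costs almost nothing.
assert len(near_twice) < 1.2 * len(near)
```

Formats like zstd and brotli raise this window to megabytes (and zstd's --long mode raises it further still), which is exactly the context-size difference described above.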

  • brotli

    Brotli compression format

  • isa-l

    Intelligent Storage Acceleration Library

  • igzip (https://github.com/intel/isa-l) is much faster than gzip or pigz when it comes to decompression, 2-3x in my experience. There is also a Python module (isal) that provides a GzipFile-like wrapper class, for an easy speed-up of Python scripts that read gzipped files.

    However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip. You also need to make sure to use the latest version if you are going to use it in the context of bioinformatics, since older versions choke on concatenated gzip files common in that field.
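The concatenated-member behaviour is part of the gzip format itself, and can be checked with Python's standard-library gzip module (no isal required); a spec-conforming reader must return the payloads of all members joined together. A small sketch:

```python
import gzip
import io

# Two independent gzip members back to back, as produced by e.g.
# `cat a.gz b.gz > both.gz` or by bgzip in bioinformatics pipelines.
buf = io.BytesIO()
buf.write(gzip.compress(b"first member\n"))
buf.write(gzip.compress(b"second member\n"))
buf.seek(0)

# A conforming reader returns the members' payloads concatenated.
with gzip.open(buf, "rb") as f:
    data = f.read()
assert data == b"first member\nsecond member\n"
```

Older isa-l/isal releases reportedly stopped after the first member on such files, which is why the comment above recommends using the latest version.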

  • rapidgzip

    Gzip Decompression and Random Access for Modern Multi-Core Machines

  • libslz

    Stateless, zlib-compatible, and very fast compression library -- http://libslz.org

  • indexed_gzip

    Fast random access of gzip files in Python

  • Pragzip actually decompresses in parallel and also offers random access. I did a Show HN here: https://news.ycombinator.com/item?id=32366959

    indexed_gzip https://github.com/pauldmccarthy/indexed_gzip can also do random access but is not parallel.

    Both have to do a linear scan first, though. The implementations can, however, do the scan on demand, i.e., they only scan as far as needed.

    bzip2 works very well with this approach. xz only works with it when compressed with multiple blocks; the same is true for zstd.

    For zstd, there also exists a seekable variant, which stores the block index at the end as metadata to avoid the linear scan. indexed_zstd offers random access to those files https://github.com/martinellimarco/indexed_zstd

    I wrote pragzip and also combined all of the other random-access compression backends in ratarmount to offer random access to TAR files that is orders of magnitude faster than archivemount: https://github.com/mxmlnkn/ratarmount
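The cost that these indexes avoid can be seen with the standard-library gzip module, which supports seek() but implements it by re-decompressing from the start of the stream up to the target offset; indexed_gzip and pragzip instead store restart points so a seek no longer pays that linear cost. A stdlib-only illustration of the baseline behaviour:

```python
import gzip
import io

payload = bytes(range(256)) * 4096          # 1 MiB of known data
compressed = io.BytesIO(gzip.compress(payload))

# seek() works on a gzip stream, but under the hood the stream is
# decompressed from the beginning up to the offset -- O(offset) per seek.
with gzip.open(compressed, "rb") as f:
    f.seek(512 * 1024)
    chunk = f.read(16)

assert chunk == payload[512 * 1024 : 512 * 1024 + 16]
```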

  • indexed_zstd

    A bridge for libzstd-seek to python. Based on mxmlnkn/indexed_bzip2

  • ratarmount

    Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Ratarmount: Random Access Tar Mount

    1 project | news.ycombinator.com | 14 May 2023
  • Ratarmount: Access large archives as a filesystem efficiently

    1 project | news.ycombinator.com | 10 Apr 2024
  • Ratarmount – Fast transparent access to archives through FUSE

    2 projects | news.ycombinator.com | 10 Mar 2022
  • Is there a way to accelerate extracting .tar contents?

    1 project | /r/linuxquestions | 29 Jun 2021
  • Show HN: Rapidgzip – Parallel Gzip Decompressing with 10 GB/s

    3 projects | news.ycombinator.com | 4 Sep 2023