How Much Faster Is Making a Tar Archive Without Gzip?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • zstd

    Zstandard - Fast real-time compression algorithm

    For anyone who wants to try this, zstd -T0 uses all your threads to compress, and https://github.com/facebook/zstd has a lot more detail. Brotli, https://github.com/google/brotli, is another modern format with some good features for high compression levels and Content-Encoding support in web browsers. You might also want to play with the compression level (gzip takes -1 through -9, brotli's quality goes up to 11, zstd goes up to -19, and zstd's --fast=N trades ratio for even more speed); a short command-line sketch follows at the end of this entry.

    One reason these modern compressors do better is not any particular mistake made in defining DEFLATE in the 90s, but that the new algorithms use a few MB of recently seen data as context instead of 32 KB, and do other things that were impractical in the 90s but are reasonable on modern hardware. The new algorithms also contain lots of smart ideas and have fine-tuned implementations, but that core difference seems important to note.
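
    To make the comparison in the title concrete, here is a minimal shell sketch (assuming GNU tar and the zstd CLI are installed; the directory name is illustrative):

      # Baseline from the title: a plain, uncompressed tar
      tar -cf data.tar data/

      # Traditional tar + gzip
      tar -czf data.tar.gz data/

      # tar piped into zstd on all cores (-T0) at the default level 3
      tar -cf - data/ | zstd -T0 -3 -o data.tar.zst

      # Unpack again: decompress to stdout and untar
      zstd -dc data.tar.zst | tar -xf -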

  • brotli

    Brotli compression format
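
    A minimal sketch of the brotli CLI for comparison (assuming the brotli command built from this repo; quality runs from 0 to 11):

      # Compress a tarball at maximum quality (small but slow)
      brotli -q 11 -o data.tar.br data.tar

      # Decompress again (-f overwrites an existing output file)
      brotli -d -f -o data.tar data.tar.br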

  • isa-l

    Intelligent Storage Acceleration Library

    igzip (https://github.com/intel/isa-l) is much faster than gzip or pigz when it comes to decompression, 2-3x in my experience. There is also a Python module (isal) that provides a GzipFile-like wrapper class, for an easy speed-up of Python scripts that read gzipped files.

    However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip. You also need to make sure to use the latest version if you are going to use it for bioinformatics, since older versions choke on the concatenated gzip files that are common in that field.
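
    A minimal sketch of the command-line side (assuming the igzip binary that ships with isa-l; igzip mimics the gzip interface, but double-check the flags on your version):

      # Compress at igzip's maximum level (3); writes file.gz
      igzip -3 file

      # Decompress (2-3x faster than gzip, per the comment above)
      igzip -d file.gz

      # Assumption: the isal Python module also exposes a gzip-like CLI
      python -m isal.igzip -d file.gz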

  • rapidgzip

    Gzip Decompression and Random Access for Modern Multi-Core Machines

  • libslz

    Stateless, zlib-compatible, and very fast compression library -- http://libslz.org

  • indexed_gzip

    Fast random access of gzip files in Python

    Pragzip actually decompresses in parallel and also offers random access. I did a Show HN here: https://news.ycombinator.com/item?id=32366959

    indexed_gzip https://github.com/pauldmccarthy/indexed_gzip can also do random access but is not parallel.

    Both have to do a linear scan first, though. The implementations can, however, do the linear scan on demand, i.e., they scan only as far as needed.

    bzip2 works very well with this approach. xz only works with it when the file was compressed with multiple blocks, and the same is true for zstd.

    For zstd, there also exists a seekable variant, which stores the block index as metadata at the end of the file to avoid the linear scan. indexed_zstd offers random access to such files: https://github.com/martinellimarco/indexed_zstd

    I wrote pragzip and also combined all of the other random-access compression backends in ratarmount to offer random access to TAR files that is orders of magnitude faster than archivemount: https://github.com/mxmlnkn/ratarmount
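
    As a rough sketch of that ratarmount workflow (assuming it is installed via pip and FUSE is available; archive and path names are illustrative):

      # Mount a compressed TAR as a read-only filesystem
      pip install ratarmount
      ratarmount big.tar.gz mounted/

      # Random access through ordinary file operations
      ls mounted/
      tail mounted/some/file.txt

      # Unmount when done
      fusermount -u mounted/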

  • indexed_zstd

    A bridge for libzstd-seek to Python. Based on mxmlnkn/indexed_bzip2

  • ratarmount

    Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
