For anyone who wants to try this: zstd -T0 uses all your threads to compress, and https://github.com/facebook/zstd has a lot more description. Brotli, https://github.com/google/brotli, is another modern format with some good features for high compression levels and Content-Encoding support in web browsers. You might also want to play with the compression level (zstd goes from -1 up to -19, plus --fast=n for even faster modes; brotli's quality runs up to 11).
One reason these modern compressors do better is not any particular mistake made when DEFLATE was defined in the 90s, but that the new algorithms use a few MB of recently seen data as context instead of 32 KB, and do other things that were impractical in the 90s but are reasonable on modern hardware. The new algorithms also contain lots of smart ideas and have fine-tuned implementations, but that core difference seems important to note.
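A quick shell sketch of trying these trade-offs yourself (the file name is a placeholder; brotli is guarded since it may not be installed):

```shell
# Make a compressible sample file (placeholder data)
seq 1 100000 > sample.txt

# zstd: -T0 uses all cores, -19 is a high level, --fast=3 trades ratio for speed
zstd -T0 -3   -k -f -o sample.txt.zst      sample.txt
zstd -T0 -19  -k -f -o sample.txt.max.zst  sample.txt
zstd --fast=3 -k -f -o sample.txt.fast.zst sample.txt

# brotli (if installed): quality runs 0 to 11; -q 11 is the slow, high-ratio setting
command -v brotli >/dev/null && brotli -q 11 -f -o sample.txt.br sample.txt

# Compare the resulting sizes
ls -l sample.txt*
```

On text like this, the higher levels should produce visibly smaller files at the cost of compression time; decompression speed stays roughly constant across zstd levels.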
-
igzip (https://github.com/intel/isa-l) is much faster than gzip or pigz when it comes to decompression, 2-3x in my experience. There is also a Python module (isal) that provides a GzipFile-like wrapper class, for an easy speed-up of Python scripts that read gzipped files.
However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip. You also need to make sure to use the latest version if you are going to use it in the context of bioinformatics, since older versions choke on concatenated gzip files common in that field.
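The concatenated-gzip case is easy to reproduce with plain gzip: a compliant decompressor has to emit every member, not just the first (file names here are placeholders):

```shell
# Two gzip members appended into one file -- a valid gzip stream per RFC 1952.
# This layout is common in bioinformatics, e.g. merged FASTQ archives.
printf 'part1\n' | gzip >  multi.gz
printf 'part2\n' | gzip >> multi.gz

# gzip itself decodes both members:
gunzip -c multi.gz            # prints part1 then part2

# igzip should do the same on recent isa-l versions; older versions stopped
# after the first member:
# igzip -d -c multi.gz
```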
-
Pragzip actually decompresses in parallel and also allows random access. I did a Show HN here: https://news.ycombinator.com/item?id=32366959
indexed_gzip (https://github.com/pauldmccarthy/indexed_gzip) can also do random access but is not parallel.
Both have to do a linear scan first, though. The implementations can do this scan on demand, i.e., they scan only as far as needed.
bzip2 works very well with this approach. xz only works with it when compressed with multiple blocks, and similar is true for zstd.
For zstd there also exists a seekable variant, which stores the block index at the end as metadata to avoid the linear scan. indexed_zstd (https://github.com/martinellimarco/indexed_zstd) offers random access to those files.
I wrote pragzip and also combined all of the other random-access compression backends in ratarmount to offer random access to TAR files that is orders of magnitude faster than archivemount: https://github.com/mxmlnkn/ratarmount
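The multi-block requirement for xz is visible straight from the CLI: --block-size makes xz start a new, independently decompressible block at a fixed interval, which is what seekable and parallel tools need (the file name is a placeholder):

```shell
# Single-threaded xz normally produces one big block, which rules out
# random access. --block-size forces independent blocks; -T0 also splits
# into blocks when it actually uses multiple threads.
seq 1 500000 > data.txt
xz -f -k -T0 --block-size=1MiB data.txt

# Inspect the block layout; more than one block means seekable/parallel
# consumers can jump to a block boundary instead of scanning from the start.
xz --list --verbose data.txt.xz
```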