Sorting chunks by similarity is something commonly used tools don't do: most archive tools sort only by file type.
I wrote a tool that chunks the data into variable-sized blocks (so that chunking re-syncs when multiple files share content but have prefixes of different lengths, but that's another story), and then sorts the chunks by LSH (locality-sensitive hash). LSH is used by search engines to detect similar text. This can compress directories that contain multiple versions of e.g. source code (trunk, branches) very well: https://github.com/h2database/h2database/blob/master/h2/src/...
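To make the pipeline concrete, here is a minimal, self-contained sketch of the general idea, not the actual H2 code: the class name, the toy rolling hash, the 4-byte shingles, and the signature length of 8 are all illustrative choices. It does content-defined chunking with a rolling hash, computes a small MinHash signature per chunk (a simple form of LSH), and sorts the chunks by signature so near-duplicates end up adjacent before a conventional compressor runs over the stream.

    import java.util.*;

    // Sketch of the pipeline described above (illustrative, not the H2 code).
    public class ChunkSortSketch {

        // Content-defined chunking: cut where the low 13 bits of a rolling
        // hash are zero, giving ~8 KiB average chunks with a 1 KiB minimum.
        // Cut points depend only on local content, so identical regions
        // re-align even when one file has a longer prefix than another.
        static List<byte[]> chunk(byte[] data) {
            List<byte[]> chunks = new ArrayList<>();
            int start = 0, hash = 0;
            for (int i = 0; i < data.length; i++) {
                hash = (hash << 1) + (data[i] & 0xff); // toy rolling hash
                if ((hash & 0x1fff) == 0 && i + 1 - start >= 1024) {
                    chunks.add(Arrays.copyOfRange(data, start, i + 1));
                    start = i + 1;
                }
            }
            if (start < data.length)
                chunks.add(Arrays.copyOfRange(data, start, data.length));
            return chunks;
        }

        // MinHash over byte 4-grams: for each of k seeded hash functions,
        // keep the minimum hashed shingle. Similar chunks share most
        // shingles, so their signatures agree in most positions (the LSH
        // property that makes sorting group them together).
        static long[] minHash(byte[] c, int k) {
            long[] sig = new long[k];
            Arrays.fill(sig, Long.MAX_VALUE);
            for (int i = 0; i + 4 <= c.length; i++) {
                long shingle = ((c[i] & 0xffL) << 24) | ((c[i + 1] & 0xffL) << 16)
                        | ((c[i + 2] & 0xffL) << 8) | (c[i + 3] & 0xffL);
                for (int j = 0; j < k; j++) {
                    long h = (shingle + j) * 0x9E3779B97F4A7C15L; // seeded mix
                    h ^= h >>> 31;
                    sig[j] = Math.min(sig[j], h);
                }
            }
            return sig;
        }

        public static void main(String[] args) {
            byte[] data = new byte[1 << 20];
            new Random(42).nextBytes(data);
            List<byte[]> chunks = chunk(data);
            // Precompute signatures, then sort chunks by signature so that
            // near-duplicate chunks become neighbors in the output stream.
            Map<byte[], long[]> sigs = new IdentityHashMap<>();
            for (byte[] c : chunks) sigs.put(c, minHash(c, 8));
            chunks.sort((a, b) -> Arrays.compare(sigs.get(a), sigs.get(b)));
            System.out.println(chunks.size() + " chunks, similar ones adjacent");
        }
    }

A real tool would of course also have to record the original chunk order alongside the sorted data, so the permutation can be undone at decompression time.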
I discussed this approach with a researcher in this area in January 2020. AFAIK there is active research here, especially on compressing DNA sequences. But he also wasn't aware of papers or research on using it for general-purpose data compression.
So, I think this area is largely uncharted. I would be interested in helping (as a hobby side project) if somebody else is interested.