Why are tar.xz files 15x smaller when using Python's tar compared to macOS tar?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • mangodb

    A database that operates at CLOUD SCALE (by dcramer)

  • H2

    H2 is an embeddable RDBMS written in Java.

  • Sorting chunks by similarity: commonly used tools don't do that. Most archive tools only sort by file type.

    I wrote a tool that chunks the data (into variable-sized blocks, to re-sync if there are multiple files that have different length prefixes, but that's another story), and then sorts the chunks by LSH (locality sensitive hash). LSH is used by search engines to detect similar text. It can compress directories that contain multiple version of e.g. source code very well (e.g. trunk, branches). https://github.com/h2database/h2database/blob/master/h2/src/...

    I discussed this approach with a researcher in this area in January 2020. AFAIK there is active research in this area, specially to compress DNA sequences. But he also wasn't aware of papers or research in this area for general-purpose data compression.

    So, I think this area is largely uncharted. I would be interested (as a hobby side project) to help, if somebody is interested.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts