Hop: 25x faster than unzip and 10x faster than tar at reading individual files

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • hop

  • asar

    Simple extensive tar-like archive format with indexing

  • Since Hop doesn't do compression, the most appropriate comparison would be to asar

    https://github.com/electron/asar

    It's not hard being faster than zip if you are not compressing/uncompressing.

  • tarindexer

    python module for indexing tar files for fast access

  • There exists a utility called tarindexer [0] that can be used for random access to tar files. An index text file is created (one time) that is used to record the position of the files in the tar archive. Random reads are done by loading the index file and then seeking to the location of the file in question.

    For random access to gzip'd files, bgzip [1] can be used. bgzip also uses an index file (one time creation) that is used to record key points for random access.

    [0] https://github.com/devsnd/tarindexer

    [1] http://www.htslib.org/doc/bgzip.html
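The index-then-seek approach described above can be sketched with Python's standard `tarfile` module. This is a minimal illustration of the idea, not tarindexer's actual code or index format; the function names `build_index` and `read_member` are hypothetical, and the sketch assumes an uncompressed tar containing ordinary (non-sparse) files.

```python
import tarfile


def build_index(tar_path):
    """Scan the archive once and record where each regular file's data lives.

    Returns a dict mapping member name -> (data offset, size in bytes).
    Hypothetical helper; tarindexer stores its index in a text file instead.
    """
    index = {}
    # Mode "r:" opens uncompressed tar archives only, so offsets are raw file positions.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Read one file by seeking straight to its data, without scanning the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

The index only has to be built once per archive; after that, each lookup costs one seek and one read, regardless of where the file sits in the tar.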

  • peechy

    A fork of the Kiwi Message Format

  • Came to mention Bun when I saw this hit the front page. I’ve been following Jarred’s Twitter since I heard about Bun and it’s quite impressive (albeit incomplete). To folks wondering why another bundler/what makes Bun special:

    - Faster than ESBuild/SWC

    - Fast build-time macros written as JSX (likely friendlier to develop than say a Babel plugin/macro). These open up a lot of possibilities that could benefit end users too, by performing more work on build/server and less client side.

    - Targeting ecosystem compatibility (eg will probably support the new JSX transform, which ESBuild does not and may not in the future)

    - Support for integration with frameworks, eg Next.js

    - Other cool performance-focused tools like Hop and Peechy[1] (though that’s a fork of ESBuild creator’s project Kiwi)

    This focus on performance is good for the JS ecosystem and for the web generally.

    1: https://github.com/Jarred-Sumner/peechy

  • pixz

    Parallel, indexed xz compressor

  • Also relevant is pixz [1] which can do parallel LZMA/XZ decompression as well as tar file indexing.

    [1] https://github.com/vasi/pixz
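The general idea behind indexed compression can be sketched as follows: compress the data in independent blocks and record where each block starts, so that a read only has to decompress the blocks it touches. This is an illustration of the technique using Python's standard `lzma` module, not the on-disk format of pixz or bgzip; the names `compress_indexed` and `read_range` and the 64 KiB block size are assumptions made for the example.

```python
import lzma

CHUNK = 64 * 1024  # hypothetical block size, chosen only for illustration


def compress_indexed(data, chunk_size=CHUNK):
    """Compress data as independent xz streams and build a block index.

    Each index entry is (uncompressed start, compressed offset, compressed length).
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        block = lzma.compress(data[start:start + chunk_size])
        index.append((start, len(blob), len(block)))
        blob += block
    return bytes(blob), index


def read_range(blob, index, offset, length, chunk_size=CHUNK):
    """Return data[offset:offset + length], decompressing only the blocks it overlaps."""
    out = bytearray()
    end = offset + length
    first = offset // chunk_size
    last = (end - 1) // chunk_size
    for start, comp_off, comp_len in index[first:last + 1]:
        block = lzma.decompress(blob[comp_off:comp_off + comp_len])
        lo = max(offset - start, 0)
        hi = min(end - start, len(block))
        out += block[lo:hi]
    return bytes(out)
```

Because the blocks are independent, they could also be decompressed in parallel. The trade-off is a somewhat worse compression ratio, since each block starts with an empty dictionary. `lzma.decompress` accepts concatenated streams, so the whole blob can still be decoded in one call.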

  • fd

    A simple, fast and user-friendly alternative to 'find'

  • ouch

    Painless compression and decompression in the terminal

  • ratarmount

    Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives

  • I've recently been looking into this same issue because I analyse a lot of data like sosreports or other tar/compressed data from customer systems. Currently I untar these onto my ZFS filesystem, which works out OK because it has zstd compression enabled, but I end up decompressing and recompressing, which is quite expensive as the files are often GBs or more compressed.

    But I've started using a tool called "ratarmount" (https://github.com/mxmlnkn/ratarmount), which creates an index once (something I could automate our upload system to generate in advance, though you can also just build it locally) and then lets you FUSE-mount the file. This works pretty well, with the one exception that I can't create scratch files inside the directory layout, which I have wanted to do in the past.

    I was surprised how hard it is to get a bundle file format that is indexable and compressed with a good, fast compression algorithm, which mostly boils down to zstd at this point.

    While it works quite well, especially with gzip and bzip2, sadly zstd and xz (and some other compression formats) don't allow decompressing only parts of a file by default; it is possible, but the default tools don't do it. The nitty gritty details are summarised here:

  • mozilla-central-old

    Unofficial import of Mozilla's mozilla-central hg repository using hg-git

  • Yes, it's in python :)

    https://github.com/humphd/mozilla-central-old/blob/9d4d9f265...
