The tar archive format, and why GNU tar extracts in quadratic time

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • unix-history-repo

    Continuous Unix commit history from 1970 until today

  • While GNU tar inherits this from the original tar, it doesn't seem to be something intrinsic to the PDP-11 or DEC given that the earlier Unix 'tap' archival system uses 'plain binary numbers ("base 256")', not octals.

    Here's the V7 tar.c: https://github.com/dspinellis/unix-history-repo/blob/Researc...

    and its documentation: https://github.com/dspinellis/unix-history-repo/blob/Researc...

    The C code clearly shows the octal (I've never used "%o" myself!):

    sscanf(dblock.dbuf.mode, "%o", &i);
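
    For comparison, here is a minimal sketch of reading such a field without sscanf, so that it does not depend on the field being NUL-terminated. The helper name parse_octal_field and its error convention are my own assumptions, not anything from the V7 or GNU sources:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* parse_octal_field is a hypothetical helper, not code from V7 or GNU tar.
 * It reads an octal number from a fixed-width field that may be padded with
 * leading spaces and ends at a NUL, a space, or the end of the field.
 * Returns 0 and stores the value in *out, or -1 on a non-octal character.
 * Assumes len is small (tar fields are 8 or 12 bytes), so the value cannot
 * overflow 64 bits. */
static int parse_octal_field(const char *field, size_t len, uint64_t *out)
{
    assert(field != NULL && out != NULL);

    size_t i = 0;
    uint64_t value = 0;

    while (i < len && field[i] == ' ')      /* skip leading padding */
        i++;
    for (; i < len && field[i] != '\0' && field[i] != ' '; i++) {
        if (field[i] < '0' || field[i] > '7')
            return -1;                      /* not an octal digit */
        value = (value << 3) | (uint64_t)(field[i] - '0');
    }
    *out = value;
    return 0;
}
```

    With a header laid out like the V7 one, the call would look something like parse_octal_field(dblock.dbuf.mode, sizeof dblock.dbuf.mode, &mode).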

  • unxip

    A fast Xcode unarchiver

  • > You may think that the numeric values (file_mode, file_size, file_mtime, ...) would be encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256"). But no, they're actually encoded as octal strings (with a NUL terminator, or sometimes a space terminator). Tar is the only file format I know of which uses base 8 to encode numbers.

    Tar’s “competitor”, cpio, does this as well (at least in one of the popular implementations). The Xcode XIP file, if you’re familiar with that particular format, is a couple layers of wrapping on top of a cpio archive, so deep inside my tool to expand these there’s a spot where I read these all out in a similar fashion: https://github.com/saagarjha/unxip/blob/5dcfe2c0f9578bc43070...

    More generally, though, neither tar nor cpio has an index, because they don’t need one: they’re meant to be decompressed in a streaming fashion where this information isn’t necessary. It’s kind of inconvenient if you want to extract a single file, but for large archives you really want to compress files together, and maintaining an index for that is both expensive and nontrivial, so I guess these formats are just “good enough”.
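
    To make the "no index" point concrete for tar, here is a rough sketch of what finding a single member looks like: every header before it has to be visited, because the only pointer to the next header is the current member's size. It assumes the classic 512-byte header with the name at offset 0 and the octal size at offset 124, works on an in-memory buffer for simplicity, and ignores checksums, long names, and pax/GNU extensions. find_member and octal_size are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout of the classic 512-byte tar header (only the fields used here). */
#define TAR_BLOCK    512u
#define TAR_NAME_LEN 100u
#define TAR_SIZE_OFF 124u
#define TAR_SIZE_LEN 12u

/* Decode an octal size field, stopping at the first non-octal byte. */
static uint64_t octal_size(const unsigned char *p, size_t len)
{
    uint64_t v = 0;
    size_t i = 0;

    while (i < len && p[i] == ' ')
        i++;
    for (; i < len && p[i] >= '0' && p[i] <= '7'; i++)
        v = (v << 3) | (uint64_t)(p[i] - '0');
    return v;
}

/* find_member is a hypothetical helper. It scans the archive from the start
 * and returns 1 with the member's data offset and size, or 0 if the name is
 * not found before the end marker or the end of the buffer. */
static int find_member(const unsigned char *buf, size_t len, const char *name,
                       size_t *data_off, uint64_t *size)
{
    assert(buf != NULL && name != NULL && data_off != NULL && size != NULL);

    size_t off = 0;
    while (len - off >= TAR_BLOCK) {
        const unsigned char *hdr = buf + off;
        if (hdr[0] == '\0')
            break;                  /* treat a zeroed block as end of archive */

        uint64_t sz = octal_size(hdr + TAR_SIZE_OFF, TAR_SIZE_LEN);
        if (strncmp((const char *)hdr, name, TAR_NAME_LEN) == 0) {
            *data_off = off + TAR_BLOCK;
            *size = sz;
            return 1;
        }

        /* No index: the only way to reach the next header is to skip this
         * member's data, which is padded to a whole number of blocks. */
        uint64_t padded = (sz + TAR_BLOCK - 1) / TAR_BLOCK * TAR_BLOCK;
        if (padded > len - off - TAR_BLOCK)
            break;                  /* truncated archive */
        off += TAR_BLOCK + (size_t)padded;
    }
    return 0;
}
```

    On a compressed stream the skipped data still has to be decompressed, which is why pulling one file out of a large archive costs about as much as reading everything before it.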

  • libarchive

    Multi-format archive and compression library

  • "In any case, the result is that a tar extractor which wants to support both pax tar files and GNU tar files needs to support 5 different ways of reading the file path, 5 different ways of reading the link path, and 3 different ways of reading the file size."

    https://github.com/libarchive/libarchive
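
    The quote does not list the three ways of reading the size, so treat the following as my assumption: the traditional octal string, the GNU base-256 form (flagged by the high bit of the first byte), and a size record in a pax extended header. A minimal sketch of the first two, which share the same 12-byte header field, might look like this. decode_size_field is a hypothetical name, the pax case is not shown, and negative base-256 values are not handled:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIZE_FIELD_LEN 12u

/* decode_size_field is a hypothetical helper, not libarchive's API.
 * It decodes a 12-byte tar size field that is either an octal string or a
 * GNU-style base-256 number. Returns 0 on success, -1 on malformed input
 * or if the value does not fit in 64 bits. */
static int decode_size_field(const unsigned char *field, uint64_t *out)
{
    assert(field != NULL && out != NULL);

    uint64_t v = 0;
    size_t i;

    if (field[0] & 0x80) {
        /* Base-256: the top bit of the first byte is a flag, and the
         * remaining bits form a big-endian binary number. */
        v = field[0] & 0x7f;
        for (i = 1; i < SIZE_FIELD_LEN; i++) {
            if (v > (UINT64_MAX >> 8))
                return -1;                  /* would overflow 64 bits */
            v = (v << 8) | field[i];
        }
        *out = v;
        return 0;
    }

    /* Traditional encoding: octal digits, space- or NUL-terminated. */
    i = 0;
    while (i < SIZE_FIELD_LEN && field[i] == ' ')
        i++;
    for (; i < SIZE_FIELD_LEN && field[i] != '\0' && field[i] != ' '; i++) {
        if (field[i] < '0' || field[i] > '7')
            return -1;
        v = (v << 3) | (uint64_t)(field[i] - '0');
    }
    *out = v;
    return 0;
}
```

    Eleven octal digits top out just below 8 GiB, which is where the binary form becomes necessary: a field of 0x80 followed by the big-endian bytes of 2^33 decodes to exactly 8 GiB.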

  • libpathrs

    C-friendly API to make path resolution safer on Linux.

  • Author of openat2 here, yes it would (as well as protecting against races where the target directory is having path components changed to symlinks during extraction). RESOLVE_IN_ROOT would be more akin to extracting in a chroot(2).

    I've been working on a userspace library[1] that would make it safer to write userspace programs that interact with these kinds of dangerous paths (though it's been on the backburner recently).

    [1]: https://github.com/openSUSE/libpathrs
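
    As a rough illustration of the chroot-like behaviour of RESOLVE_IN_ROOT, here is a minimal sketch of calling openat2 directly. It assumes Linux 5.6 or later with kernel headers that provide <linux/openat2.h>, and it goes through syscall(2) on the assumption that the libc has no openat2 wrapper. open_in_root is a hypothetical helper name and is not part of libpathrs:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/openat2.h>   /* struct open_how, RESOLVE_IN_ROOT */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* open_in_root is a hypothetical wrapper. It opens path read-only with every
 * component (including "..", absolute paths, and absolute symlinks) resolved
 * as if root_fd were "/". Returns a file descriptor, or -1 with errno set,
 * for example ENOSYS on kernels without openat2. */
static int open_in_root(int root_fd, const char *path)
{
    assert(path != NULL);

    struct open_how how;
    memset(&how, 0, sizeof how);            /* unused fields must be zero */
    how.flags = O_RDONLY | O_CLOEXEC;
    how.resolve = RESOLVE_IN_ROOT;

    return (int)syscall(SYS_openat2, root_fd, path, &how, sizeof how);
}
```

    With root_fd opened on an extraction directory, open_in_root(root_fd, "/etc/passwd") or open_in_root(root_fd, "../../etc/passwd") would look for etc/passwd inside that directory rather than on the host.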
