The tar archive format, and why GNU tar extracts in quadratic time

Our great sponsors

InfluxDB - Power Real-Time Data Analytics at Scale

WorkOS - The modern identity platform for B2B SaaS

SaaSHub - Software Alternatives and Reviews

Our great sponsors

unix-history-repo

51 6,434 0.0 Assembly

Continuous Unix commit history from 1970 until today

While GNU tar inherits this from the original tar, it doesn't seem to be something intrinsic to the PDP-11 or DEC given that the earlier Unix 'tap' archival system uses 'plain binary numbers ("base 256")', not octals.
Here's the V7 tar.c: https://github.com/dspinellis/unix-history-repo/blob/Researc...
and it's documentation: https://github.com/dspinellis/unix-history-repo/blob/Researc...
The C code clearly shows the octal (I've never used "%o" myself!):
sscanf(dblock.dbuf.mode, "%o", &i);

unxip

5 820 7.0 Swift

A fast Xcode unarchiver

> You may think that the numeric values (file_mode, file_size, file_mtime, ...) would be encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256"). But no, they're actually encoded as octal strings (with a NUL terminator, or sometimes a space terminator). Tar is the only file format I know of which uses base 8 to encode numbers.
Tar’s “competitor”, cpio, does this as well (at least in one of the popular implementations). The Xcode XIP file, if you’re familiar with that particular format, is a couple layers of wrapping on top of a cpio archive, so deep inside my tool to expand these there’s a spot where I read these all out in a similar fashion: https://github.com/saagarjha/unxip/blob/5dcfe2c0f9578bc43070...
More generally, though, both tar and cpio doesn’t have an index because they don’t need them, they’re meant to be decompressed in a streaming fashion where this information isn’t necessary. It’s kind of inconvenient if you want to extract a single file, but for large archives you really want to compress files together and maintaining an index for this is both expensive and nontrivial, so I guess these formats are just “good enough”.

InfluxDB

www.influxdata.com sponsored

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
libarchive

33 2,878 9.1 C

Multi-format archive and compression library

"In any case, the result is that a tar extractor which wants to support both pax tar files and GNU tar files needs to support 5 different ways of reading the file path, 5 different ways of reading the link path, and 3 different ways of reading the file size."
https://github.com/libarchive/libarchive

libpathrs

1 66 0.0 Rust

C-friendly API to make path resolution safer on Linux.

Author of openat2 here, yes it would (as well as protecting against races where the target directory is having path components changed to symlinks during extraction). RESOLVE_IN_ROOT would be more akin to extracting in a chroot(2).
I've been working on a userspace library[1] which would make writing userspace programs that interact with these kinds of dangerous paths more safe (though it's been on the backburner recently).
[1]: https://github.com/openSUSE/libpathrs

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project