I once foolishly thought I'd write a tar parser because, "how hard can it be?" [1].
I simply tried to follow the tar(5) man page[2], and got a reference test set from another website posted previously on HN[3].
Along the way I discovered that NetBSD pax apparently cannot handle the PAX format[3]. My parser also inadvertently uncovered that git-archive was getting the checksums wrong; nobody had noticed because other tar parsers were more lax about it[4].
As the article describes (as does the man page), tar is actually a really simple format, but there are just so many variants to choose from.
Turns out, if you strive for maximum compatibility, it's easiest to stick to what GNU tar does. If you think about it, IMO in many ways the GNU project ended up doing "embrace, extend, extinguish" with Unix.
[1] https://github.com/AgentD/squashfs-tools-ng/tree/master/lib/...
[2] https://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5
[3] https://mgorny.pl/articles/portability-of-tar-features.html
[4] https://www.spinics.net/lists/git/msg363049.html
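For anyone curious what the checksum check looks like, here is a minimal sketch in Python based on the tar(5) description: the checksum is the sum of all 512 header bytes, with the 8-byte checksum field (offset 148) counted as if it were filled with spaces, and stored as octal text. The helper names are made up for illustration, and the lenient field parsing is just one possible choice. It is not the code from [1] and does not reproduce the specific issue discussed in [4].

```python
import io
import tarfile

BLOCK_SIZE = 512      # every tar header occupies one 512-byte block
CHKSUM_OFFSET = 148   # position of the checksum field in the header
CHKSUM_LEN = 8        # width of the checksum field


def header_checksum(block: bytes) -> int:
    """Sum all header bytes, treating the checksum field as 8 spaces."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("a tar header block is exactly 512 bytes")
    blanked = (
        block[:CHKSUM_OFFSET]
        + b" " * CHKSUM_LEN
        + block[CHKSUM_OFFSET + CHKSUM_LEN:]
    )
    return sum(blanked)


def stored_checksum(block: bytes) -> int:
    """Parse the octal checksum field, tolerating NUL or space padding."""
    field = block[CHKSUM_OFFSET:CHKSUM_OFFSET + CHKSUM_LEN]
    return int(field.strip(b" \0") or b"0", 8)


def make_sample_header() -> bytes:
    """Build a one-file archive in memory and return its first header block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        data = b"hi\n"
        info = tarfile.TarInfo("hello.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()[:BLOCK_SIZE]


if __name__ == "__main__":
    header = make_sample_header()
    print(header_checksum(header) == stored_checksum(header))
```

A stricter parser could additionally insist on a particular layout of the checksum field (digit count, terminator), which is where lax and strict implementations can disagree.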
I have also written my own tar parser, in https://github.com/bestouff/genext2fs - oh boy, it's not easy! My parser is of course incomplete and won't handle corner cases, of which there are plenty. In the end I added libarchive as an alternative to my hand-rolled code because that's what works best.
(source: https://unix.stackexchange.com/a/266090; also mentioned there is a skipcpio tool: https://github.com/dracutdevs/dracut/blob/master/skipcpio/sk...)
This hack would also work for extracting concatenated tarballs (without GNU tar's --ignore-zeros option). One annoyance is the warning about the end of the archive.
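To illustrate why concatenated tarballs need special handling at all, here is a small Python sketch using the standard tarfile module, whose ignore_zeros option plays a similar role to GNU tar's --ignore-zeros. It is not the hack described above, and the helper names are invented for the example. A reader normally stops at the zero-filled blocks that end the first archive, so members of the second archive are only seen when those blocks are skipped.

```python
import io
import tarfile


def make_tar(name: str, data: bytes) -> bytes:
    """Return an in-memory, uncompressed tar archive holding one file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members(blob: bytes, ignore_zeros: bool) -> list:
    """List member names, optionally skipping end-of-archive zero blocks."""
    with tarfile.open(
        fileobj=io.BytesIO(blob), mode="r:", ignore_zeros=ignore_zeros
    ) as tar:
        return tar.getnames()


# Two complete archives glued together, as "cat a.tar b.tar" would produce.
combined = make_tar("a.txt", b"first\n") + make_tar("b.txt", b"second\n")

if __name__ == "__main__":
    print(list_members(combined, ignore_zeros=False))
    print(list_members(combined, ignore_zeros=True))
```

With ignore_zeros disabled, only the first archive's member is listed; with it enabled, both are.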