An Introduction to the WARC File

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

  • API is coming soon (relatively, it's still a one-man project)! Stay tuned https://github.com/ArchiveBox/ArchiveBox/issues/496

    I have an event-sourcing refactor in progress to let us pluginize functionality like the API (similar to Home Assistant, with a plugin app store). It will take a month or two. Next up is the REST API, built on the new plugin system.

  • linkwarden

    ⚡️⚡️⚡️Self-hosted collaborative bookmark manager to collect, organize, and preserve webpages and articles.

  • monolith

    ⬛️ CLI tool for saving complete web pages as a single HTML file

  • I have never used monolith, so I can't say anything about it with certainty, but two things in your description are worth highlighting about the goals of WARC versus the umpteen bazillion "save this one page I'm looking at as a single file" type projects:

    1. WARC is designed, as a goal, to archive the request-response handshake. It does not get into the business of trying to make it easy for a browser to subsequently display that content, since that's a browser's problem

    2. Using your cited project specifically, observe the number of "well, save it but ..." options <https://github.com/Y2Z/monolith#options> which is in stark contrast to the archiving goals I just spoke about. It's not a good snapshot of history if the server responded with `content-type: text/html;charset=iso-8859-1` back in the 90s but "modern tools" want everything to be UTF-8 so we'll just convert it, shall we? Bah, I don't like JavaScript, so we'll just toss that out, shall we? And so on

    For 100% clarity: monolith, and similar tools, may work fantastically for any individual's workflow, and I'm not here to yuck anyone's yum. But I do want to highlight that, all things being equal, it should always be possible to derive monolith files from WARC files, because WARC files have (or at least aim for) perfect fidelity to what the exchange was. I would guess only pcap files would be of higher fidelity, but they also carry a lot more extraneous or potentially privacy-violating detail.
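
To make the fidelity point concrete, here is a minimal sketch of what "archive the response as-is" means. It hand-builds one uncompressed WARC/1.0 `response` record using only the Python standard library and reads the block back. The helper names (`build_warc_response`, `read_warc_block`) and the sample response are made up for illustration; real archives are normally written with a dedicated WARC library and include further records (request, warcinfo) and usually gzip compression.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response(target_uri: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes in a minimal WARC/1.0 'response' record."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The block is stored byte-for-byte; the record is terminated by two CRLFs.
    return head + http_response + b"\r\n\r\n"


def read_warc_block(record: bytes) -> bytes:
    """Return the block of a single uncompressed WARC record (sketch only)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    length = None
    for line in head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("record has no Content-Length header")
    return rest[:length]


# Hypothetical 1990s-style response: a Latin-1 body, declared as such by the server.
body = "<html><body>café</body></html>".encode("iso-8859-1")
raw_http = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html;charset=iso-8859-1\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body
)

record = build_warc_response("http://example.com/", raw_http)
recovered = read_warc_block(record)
```

Here `recovered` is identical to `raw_http`: the Latin-1 bytes and the server's original `Content-Type` header survive untouched, and any conversion to UTF-8 is left to whatever tool later renders the record.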

  • warc

    Python library for reading and writing warc files (by internetarchive)

  • I wrote a library to handle these and the older ARC files, for use with an archiving proxy that a friend and I built for the Internet Archive: https://github.com/internetarchive/warc. He wrote the WARC parts while I did the ARC parts, working from this format description: https://archive.org/web/researcher/ArcFileFormat.php

    Good memories.

  • WarcDB

    WarcDB: Web crawl data as SQLite databases.

  • full-text-tabs-forever

    Full text search all your browsing history

  • A bit of a late response, but yes, I've been storing the full text of every website I visit, and it's excellent for finding stuff again.

    The idea is to index pages as you visit them using a browser extension, thus avoiding all the pitfalls of being treated like a bot.

    Here's the project: https://github.com/iansinnott/full-text-tabs-forever
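
The "index pages as you visit them" idea can be sketched with a tiny in-memory inverted index. This is not how full-text-tabs-forever is implemented; the `PageIndex` class and its methods are hypothetical, and the browser-extension side (which would call `add_page` with each visited page's URL and extracted text) is assumed rather than shown.

```python
import re
from collections import defaultdict


class PageIndex:
    """Tiny in-memory inverted index: token -> set of URLs containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase and split on non-word characters.
        return set(re.findall(r"\w+", text.lower()))

    def add_page(self, url, text):
        """Index a page at visit time."""
        for token in self._tokens(text):
            self._postings[token].add(url)

    def search(self, query):
        """Return the URLs whose text contains every token in the query."""
        tokens = self._tokens(query)
        if not tokens:
            return set()
        # .get() avoids creating empty entries for unseen tokens.
        matches = [self._postings.get(token, set()) for token in tokens]
        return set.intersection(*matches)


index = PageIndex()
index.add_page("https://example.com/warc", "An introduction to the WARC file format")
index.add_page("https://example.com/sqlite", "Storing crawl data in SQLite")
hits = index.search("warc format")
```

Because the text comes from pages the user's own browser has already rendered, nothing here has to crawl the site, which is the point the comment makes about not being treated like a bot.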

