An Introduction to the WARC File

ArchiveBox

248 19,737 9.7 Python

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

API is coming soon (relatively, it's still a one-man project)! Stay tuned https://github.com/ArchiveBox/ArchiveBox/issues/496
I have an event-sourcing refactor in progress now that will let us pluginize functionality like the API (similar to Home Assistant with its plugin app store); it will take a month or two. Next up is the REST API, built on the new plugin system.

linkwarden

18 6,045 9.8 TypeScript

⚡️⚡️⚡️Self-hosted collaborative bookmark manager to collect, organize, and preserve webpages and articles.

monolith

23 9,870 6.9 Rust

⬛️ CLI tool for saving complete web pages as a single HTML file

I have never used monolith, so I can't say anything about it with certainty, but two things in your description are worth highlighting about the goals of WARC versus the umpteen bazillion "save this one page I'm looking at as a single file" type projects:
1. WARC is designed, as a goal, to archive the request-response handshake. It does not get into the business of trying to make it easy for a browser to subsequently display that content, since that's a browser's problem.
2. Using your cited project specifically, observe the number of "well, save it but ..." options <https://github.com/Y2Z/monolith#options>, which is in stark contrast to the archiving goals I just spoke about. It's not a good snapshot of history if the server responded with `content-type: text/html;charset=iso-8859-1` back in the 90s but "modern tools" want everything to be UTF-8 so we'll just convert it, shall we? Bah, I don't like JavaScript, so we'll just toss that out, shall we? And so on.
For 100% clarity: monolith and similar tools may work fantastically for any individual's workflow, and I'm not here to yuck anyone's yum. But I do want to highlight that, all things being equal, it should always be possible to derive monolith files from WARC files, because WARC files are (or at least aim to be) a perfect-fidelity record of what the exchange was. I would guess only pcap files would be of higher fidelity, but they also carry a lot more extraneous or potentially privacy-violating details.
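To make point 1 concrete, here is a minimal sketch of what a WARC `response` record looks like on disk, assuming the WARC/1.0 layout (a version line, named header fields, a blank line, then the captured bytes). The helper name `build_warc_response_record` and the sample response are hypothetical, and a real writer would add further fields such as digests. The point is that the server's bytes, charset header included, are stored untouched:

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, raw_http_response: bytes,
                               when: datetime) -> bytes:
    """Wrap raw HTTP response bytes in a WARC/1.0 'response' record.

    Hypothetical helper for illustration only; not taken from any library.
    """
    warc_date = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(raw_http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The block is the HTTP response exactly as received; two CRLFs end the record.
    return head + raw_http_response + b"\r\n\r\n"


# Made-up 1990s-style response: the body stays ISO-8859-1, never converted.
body = "<html><body>caf\u00e9</body></html>".encode("iso-8859-1")
raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html;charset=iso-8859-1\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body
)

record = build_warc_response_record(
    "http://example.com/", raw, datetime(2024, 1, 29, tzinfo=timezone.utc)
)
```

Because the record keeps the original bytes, a UTF-8 single-file export could later be derived from it, while the reverse conversion would not recover what the server actually sent.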

warc

1 230 10.0 Python

Python library for reading and writing warc files (by internetarchive)

I wrote a library to handle these and the older ARC files, for use with an archiving proxy that a friend and I built for the Internet Archive: https://github.com/internetarchive/warc. He wrote the WARC parts while I did the ARC parts, working from this format description: https://archive.org/web/researcher/ArcFileFormat.php
Good memories.
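For a feel of what reading such a file involves, here is a small standard-library sketch that walks the records of an uncompressed WARC stream. It is not the API of the library linked above; the function name `iter_warc_records` and the sample data are assumptions made for illustration, and gzip-compressed `.warc.gz` files are not handled.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            # Split on the first colon only, so URLs in values stay intact.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says exactly how many bytes of block follow.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


# Two identical made-up records, back to back.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
) * 2

records = list(iter_warc_records(io.BytesIO(sample)))
```

Each item in `records` is a `(headers, block)` pair, so `records[0][0]["WARC-Target-URI"]` gives the captured URL and `records[0][1]` the raw bytes.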

WarcDB

7 383 7.2 Python

WarcDB: Web crawl data as SQLite databases.

full-text-tabs-forever

4 52 8.5 TypeScript

Full text search all your browsing history

A bit of a late response, but yes I've been storing full text of every website I visit and it's excellent for finding stuff again.
The idea is to index pages as you visit them using a browser extension, thus avoiding all the pitfalls of being treated like a bot.
Here's the project: https://github.com/iansinnott/full-text-tabs-forever
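The "index pages as you visit them" idea can be sketched with a toy inverted index. This is only an illustration of the concept in Python, not how the extension is implemented; the class name `VisitIndex`, its methods, and the sample URLs are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class VisitIndex:
    """Toy inverted index: word -> set of URLs whose text contains it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add_visit(self, url: str, page_text: str) -> None:
        """Record the text of a page at the moment it is visited."""
        for word in re.findall(r"\w+", page_text.lower()):
            self._postings[word].add(url)

    def search(self, query: str) -> List[str]:
        """Return URLs containing every word of the query, sorted by URL."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits)


index = VisitIndex()
index.add_visit("https://example.com/warc", "An introduction to the WARC file format")
index.add_visit("https://example.com/sqlite", "SQLite file format notes")

both = index.search("file format")
only_warc = index.search("warc")
nothing = index.search("missing")
```

Since the text is captured from pages the user already loaded, no separate crawler has to fetch anything, which is the bot-detection pitfall the comment mentions avoiding.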

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


Related posts

Ask HN: How can I back up an old vBulletin forum without admin access? (9 projects | news.ycombinator.com | 29 Jan 2024)
Best practices for archiving websites (2 projects | /r/datacurator | 6 Dec 2023)
An easy solution to save entire websites for my dad? (7 projects | /r/DataHoarder | 17 Oct 2021)
Looking for open source software to scrape webpages but also make them searchable with a webui. (locally hosted) (4 projects | /r/DataHoarder | 16 Jan 2021)
Alternative to HTTrack (website copier) as of 2023? (4 projects | /r/DataHoarder | 10 Feb 2023)


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
warc web-archiving Archiving and Digital Preservation (DP) save-the-internet Crawling
Post date: 29 Jan 2024
