Top 3 Python internet-archiving Projects

ArchiveBox

248 19,672 9.7 Python

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07

Two projects I greatly appreciate, allowing me to easily archive my bandcamp and GOG purchases (after the initial setup anyways):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox

waybackpy

6 405 0.0 Python

Wayback Machine API interface & a command-line tool

Project mention: download all captures of a page in archive.org | /r/Archiveteam | 2023-06-05

I ended up using waybackpy python module to retrieve archived URLs, it worked well. I think the feature you want for this is the "snapshots", but I didn't test this myself

InfluxDB

www.influxdata.com sponsored

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
forum-dl

4 59 9.0 Python

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29

You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl
It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.

NOTE: The open source projects on this list are ordered by number of github stars. The number of mentions indicates repo mentiontions in the last 12 Months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-07.