Top 3 Python internet-archiving Projects
-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
forum-dl
Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
Two projects I greatly appreciate, allowing me to easily archive my bandcamp and GOG purchases (after the initial setup anyways):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
I ended up using the waybackpy Python module to retrieve archived URLs, and it worked well. I think the feature you want for this is "snapshots", but I didn't test it myself.
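For readers who want to see what retrieving archived URLs involves, here is a minimal sketch that queries the Wayback Machine CDX endpoint directly with the standard library instead of going through waybackpy. The function names (`build_cdx_query`, `snapshot_urls`, `fetch_snapshot_urls`) and the sample payload are made up for illustration, and the code assumes the JSON output format in which the first row lists the field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine CDX search endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, limit=None):
    """Build a CDX query URL that lists captures of page_url as JSON."""
    params = {"url": page_url, "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_json_text):
    """Turn a CDX JSON response into a list of Wayback snapshot URLs.

    Assumes the first row is a header of field names and every later
    row is one capture.
    """
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, *captures = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in captures]


def fetch_snapshot_urls(page_url, limit=None):
    """Hypothetical helper: perform the request (needs network access)."""
    with urlopen(build_cdx_query(page_url, limit)) as response:
        return snapshot_urls(response.read().decode("utf-8"))


# Made-up sample response, used only to show the shape of the data.
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20200101000000", "http://example.com/", "text/html", "200", "AAAA", "1234"],
    ["com,example)/", "20210101000000", "http://example.com/", "text/html", "200", "BBBB", "1250"],
])

if __name__ == "__main__":
    for url in snapshot_urls(SAMPLE):
        print(url)
```

Calling `fetch_snapshot_urls("example.com", limit=5)` would perform a live request; the sample payload keeps the sketch runnable offline.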
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl
It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.
Python internet-archiving related posts
- download all captures of a page in archive.org
- Well worth the price
- any way to archive all my bookmarks on archive.org?
- Comex update 9/27/2022
- Is there a way to download all the files Internet Archive has captured for a domain? I am trying to recover tweets from a suspended twitter account, but the account as a whole was never captured in the Wayback Machine, just some individual tweets and json files.
- Running a simple script that uses the Wayback Machine API to batch-archive censorship-defying answers to Zhihu questions
- The Good Karma Kit: A Docker bundle you can run on servers with spare capacity to help good causes (Tor, ArchiveWarrior, BOINC, etc.)
Index
What are some of the best open-source internet-archiving projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ArchiveBox | 19,672 |
2 | waybackpy | 405 |
3 | forum-dl | 59 |