SaaSHub helps you find the best software and product alternatives Learn more →
Top 8 internet-archiving Open-Source Projects
-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
wikipedia-mirror
🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump
-
good-karma-kit
😇 A Docker Compose bundle to run on servers with spare CPU, RAM, disk, and bandwidth to help the world. Includes Tor, ArchiveWarrior, BOINC, and more...
-
archivebox-browser-extension
Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.
-
forum-dl
Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
readability-extractor
Javascript/Node wrapper around Mozilla's Readability library so that ArchiveBox can call it as a oneshot CLI command to extract each page's article text.
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07Two projects I greatly appreciate, allowing me to easily archive my bandcamp and GOG purchases (after the initial setup anyways):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
I ended up using waybackpy python module to retrieve archived URLs, it worked well. I think the feature you want for this is the "snapshots", but I didn't test this myself
https://chromewebstore.google.com/detail/habonpimjphpdnmcfka... (or https://github.com/tjhorner/archivebox-exporter for source)
Pushes your history to ArchiveBox, which does the heavy lifting storing/processing the content.
Alas, might not work with Epiphany because there's no complete extension support.
But IIRC, it stores its urls in $XDG_DATA_HOME/epiphany/ephy-history.db - so a bit of sqlite and ArchiveBox might do the trick for you.
Note: I'm running something similar, but find that I'd rather not rely on my history, I tend to click on a lot of garbage ;) You might want to curate a bit.
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl
It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.
internet-archiving related posts
- download all captures of a page in archive.org
- Well worth the price
- any way to archive all my bookmarks on archive.org?
- Comex update 9/27/2022
- Is there a way to download all the files Internet Archive has captured for a domain? I am trying to recover tweets from a suspended twitter account, but the account as a whole was never captured in the Wayback Machine, just some individual tweets and json files.
- Vandal – Browser extension that helps you navigate back in time using Wayback Machine without leaving your current tab
- Vandal – Browser extension that helps you navigate back in time using Wayback Machine without leaving your current tab
-
A note from our sponsor - SaaSHub
www.saashub.com | 28 Apr 2024
Index
What are some of the best open-source internet-archiving projects? This list will help you:
Project | Stars | |
---|---|---|
1 | ArchiveBox | 19,737 |
2 | waybackpy | 405 |
3 | wikipedia-mirror | 324 |
4 | good-karma-kit | 295 |
5 | archivebox-browser-extension | 158 |
6 | vandal | 150 |
7 | forum-dl | 59 |
8 | readability-extractor | 32 |
Sponsored