Top 18 web-archiving Open-Source Projects
-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
archivebox-browser-extension
Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.
-
cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
-
browsertrix
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
-
sandcrawler
Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
Two projects I greatly appreciate, allowing me to easily archive my Bandcamp and GOG purchases (after the initial setup anyways):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
Project mention: YaCy, a distributed Web Search Engine, based on a peer-to-peer network | news.ycombinator.com | 2024-03-05
Project mention: Webrecorder: Capture interactive websites and replay them at a later time | news.ycombinator.com | 2023-08-01
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems though with wanting to browse interconnected links in a forum format, but try this as a first step.
Another option, though definitely a bit more work: once you have all the WARC files downloaded, you can open them in Python using the warctools module, perhaps together with BeautifulSoup, and parse/extract the data embedded in the WARC archives into your own "fresh" HTML webserver.
https://github.com/internetarchive/warctools
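The extraction step described in that comment can be sketched roughly as follows. This is a minimal, standard-library-only illustration and does not use the warctools API; the function names `iter_warc_records` and `html_responses` are hypothetical. It assumes well-formed WARC records and HTTP responses whose bodies are neither chunked nor compressed, which real crawls often violate, so a dedicated WARC library is likely the safer choice in practice.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, content block) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line in WARC stream: {line[:40]!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def html_responses(path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, HTML body) for every text/html response record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        for headers, block in iter_warc_records(stream):
            if headers.get("warc-type") != "response":
                continue
            # The content block holds the raw HTTP response: head, blank line, body.
            http_head, _, body = block.partition(b"\r\n\r\n")
            if b"text/html" in http_head.lower():
                yield headers.get("warc-target-uri", ""), body
```

Each yielded body could then be handed to an HTML parser such as BeautifulSoup to pull out posts and rewrite links before serving the pages from a fresh webserver.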
I ended up using the waybackpy Python module to retrieve archived URLs, and it worked well. I think the feature you want for this is "snapshots", but I didn't test this myself.
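To illustrate what retrieving archived URLs involves, here is a small sketch that queries the Wayback Machine CDX server directly with the standard library rather than through waybackpy. The helper names (`build_cdx_query`, `snapshot_urls`, `fetch_snapshots`) are hypothetical, and the code assumes the CDX JSON output whose first row is a header naming the `timestamp` and `original` fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit=50):
    """Build a CDX query URL asking for JSON rows describing captures of `url`."""
    params = {"url": url, "output": "json", "limit": limit}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(rows):
    """Turn CDX JSON rows (header row first) into Wayback replay URLs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


def fetch_snapshots(url, limit=50):
    """Network call: list replay URLs for the captures of `url`."""
    with urlopen(build_cdx_query(url, limit), timeout=30) as resp:
        return snapshot_urls(json.load(resp))
```

Calling `fetch_snapshots("example.com", limit=5)` would return up to five replay URLs, assuming the service is reachable. Keeping the parsing in `snapshot_urls` separate from the network call makes it easy to check against saved responses.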
https://chromewebstore.google.com/detail/habonpimjphpdnmcfka... (or https://github.com/tjhorner/archivebox-exporter for source)
Pushes your history to ArchiveBox, which does the heavy lifting storing/processing the content.
Alas, might not work with Epiphany because there's no complete extension support.
But IIRC, it stores its URLs in $XDG_DATA_HOME/epiphany/ephy-history.db, so a bit of SQLite and ArchiveBox might do the trick for you.
Note: I'm running something similar, but I find that I'd rather not rely on my history, since I tend to click on a lot of garbage ;) You might want to curate a bit.
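The "bit of SQLite" suggested in that comment might look like the sketch below. It assumes the history database has a `urls` table with a `url` column, which should be verified against the actual file (for example with `.schema` in the `sqlite3` shell). The function names and the blocklist approach to curation are illustrative assumptions, not part of Epiphany or ArchiveBox.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from urllib.parse import urlsplit


def read_history_urls(db_path, query="SELECT DISTINCT url FROM urls"):
    """Read URLs from a browser history database (table/column names are assumed)."""
    # Open read-only so the browser's database is never modified.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        return [row[0] for row in conn.execute(query) if row[0]]


def curate(urls, blocked_hosts=()):
    """Keep only http(s) URLs whose host is not in the blocklist."""
    kept = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme in ("http", "https") and parts.hostname not in blocked_hosts:
            kept.append(url)
    return kept


def write_url_list(urls, out_path):
    """Write one URL per line, a format suitable for piping into an archiver."""
    Path(out_path).write_text("".join(f"{u}\n" for u in urls), encoding="utf-8")
```

The resulting file could then be fed to ArchiveBox, for instance with `archivebox add < urls.txt`. It is probably safest to run this against a copy of the database while the browser is closed.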
web-archiving related posts
-
YaCy, a distributed Web Search Engine, based on a peer-to-peer network
-
Ask HN: How can I back up an old vBulletin forum without admin access?
-
An Introduction to the WARC File
-
Mozilla "MemoryCache" Local AI
-
Best practices for archiving websites
-
Webrecorder: Capture interactive websites and replay them at a later time
-
phpBB3 forum owner dead. Webhost purging soon. Need to quickly archive a site
-
Index
What are some of the best open-source web-archiving projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | ArchiveBox | 19,861 |
2 | conifer | 1,456 |
3 | pywb | 1,303 |
4 | archiveweb.page | 737 |
5 | replayweb.page | 624 |
6 | ipwb | 590 |
7 | waybackpy | 405 |
8 | archivenow | 391 |
9 | WarcDB | 384 |
10 | warcio | 345 |
11 | archivebox-browser-extension | 159 |
12 | cdx_toolkit | 152 |
13 | browsertrix | 123 |
14 | warc-parquet | 99 |
15 | warrick | 80 |
16 | Collect | 75 |
17 | web-snap | 26 |
18 | sandcrawler | 23 |