Our great sponsors
-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
22120
Discontinued 💾 Diskernet - Your preferred backup solution. It's like you're still online! Full text search archive from your browsing and bookmarks. Weclome! to the Diskernet: an internet on yer disk. Disconnect with Diskernet, an internet for the post-online apocalypse. Or the airplane WiFi. Or the site goes down. Or ... You get the picture. Get Diskernet. 80s logo. Formerly 22120 (project codename) ;P ;) xx;p [Moved to: https://github.com/i5ik/Diskernet]
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
ArchiveBox
Discontinued 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more... [Moved to: https://github.com/ArchiveBox/ArchiveBox] (by pirate)
Yeup, just the reason why we expose the USER_AGENT options in ArchiveBox config ;)
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...
I don't want to officially endorse using the Google bot user agent, but you're welcome to try it on your own and see if it improves the experience.
I'm working in that in my "self host the internet offline from your browsing history" project
https://github.com/c9fe/22120
It makes a web archive from everything your browse, and lately I've been working on the full text search
it doesn't show in the Screenshot in the article, but ArchiveBox in Aug 2020 implemented the "readability article text extractor", see description in the release notes: https://github.com/pirate/ArchiveBox/releases/tag/v0.4.14 and the module that does the work https://github.com/pirate/readability-extractor
By only extracting text and article images you could go deep into an archive. If you skip images, much more so