You can use ArchiveBox. Set it to grab URLs from your browser history database, and it will archive them all to disk in whatever formats you want. You can then use whatever tools you like on those local files.
https://archivebox.io/
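A minimal sketch of that flow, hedged: the `init` and `add` subcommands come from ArchiveBox's README, but the directory layout, the function name, and the idea of feeding it a pre-exported URL list are assumptions for illustration.

```shell
# Hypothetical helper: initialize an ArchiveBox data folder and feed it a
# list of URLs (one per line) exported from your browser history.
# `archivebox init` / `archivebox add` are from the ArchiveBox README;
# everything else here is an assumption.
archive_history() {
  archive_dir="$1"   # where the ArchiveBox data folder should live
  url_list="$2"      # text file with one URL per line
  mkdir -p "$archive_dir" && cd "$archive_dir" || return 1
  archivebox init
  archivebox add < "$url_list"
}
```

Usage would look like `archive_history ~/archive urls.txt`, after exporting `urls.txt` from your history database.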
Not exactly what you're asking for, but you can set up SingleFile [1] to automatically save each page you visit. There's also ArchiveBox [2], which can convert your browser history into various formats.
[1] https://github.com/gildas-lormeau/SingleFile
[2] https://archivebox.io/
In Chrome, you can install Falcon from source:
https://github.com/CennoxX/falcon#transparent-installation
As for Firefox, you can right-click the "Add to Firefox" button, save the extension file, and inspect it.
Hey. My project Diskernet does this: full-text search over browser history.
Put it in "save" mode when using Chrome (Linux is fine) and it automatically saves every page you browse (so you can read it offline) and also indexes it for full-text search. It's a work in progress and there are bugs, so my advice is to initialize a git repo in your archive directory and sync regularly to a remote in case of failure; that also gives you a nice snapshotted archive.
Anyway, best of luck to you! :)
Diskernet: https://github.com/crisdosyago/Diskernet
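The git-snapshot safety net suggested above can be sketched like this; the archive path is whatever directory your tool writes to, and the "origin" remote is an assumption you would set up yourself.

```shell
# Minimal sketch: version an archive directory with git and commit a snapshot.
# The directory argument and the remote name are assumptions for illustration.
snapshot_archive() {
  dir="$1"                       # e.g. your Diskernet archive directory
  cd "$dir" || return 1
  git init -q                    # no-op if the repo already exists
  git add -A
  git commit -q -m "archive snapshot $(date +%F)"
  # git push origin main         # uncomment once a remote is configured
}
```

Run on a schedule (cron, systemd timer), this gives you both an off-machine backup and a browsable history of snapshots.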
If it's just the metadata you want, you can use ActivityWatch [1]. They have a browser plug-in.
[1] https://activitywatch.net/
Nyxt browser is doing this pretty well! https://nyxt.atlas.engineer
Chromium and Firefox both store all your history in a SQLite database.
I have a script that extracts the last visited website from Chrome, for example: https://github.com/BarbUk/dotfiles/blob/master/bin/chrome_hi...
For Firefox, you can use something like:
sqlite3 ~/.mozilla/firefox/.[dD]efault/places.sqlite \
  "SELECT strftime('%d.%m.%Y %H:%M:%S', visit_date/1000000, 'unixepoch', 'localtime'), url
   FROM moz_places, moz_historyvisits
   WHERE moz_places.id = moz_historyvisits.place_id
   ORDER BY visit_date;"
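A hedged Chrome counterpart to the Firefox query above: the assumption is that Chrome's History database has a `urls` table whose `last_visit_time` counts microseconds since 1601-01-01 (the WebKit epoch), hence the 11644473600-second offset; Chrome also locks the live file while running, so query a copy.

```shell
# Assumed Chrome schema: urls(id, url, ..., last_visit_time), with
# last_visit_time in microseconds since the WebKit epoch (1601-01-01).
dump_chrome_history() {
  db="$1"   # e.g. a copy of ~/.config/chromium/Default/History
  sqlite3 "$db" "SELECT datetime(last_visit_time/1000000 - 11644473600, 'unixepoch', 'localtime'), url FROM urls ORDER BY last_visit_time;"
}
```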
That's the whole purpose behind ZAP, and I use it for archiving pages all the time (it uses HSQLDB as the file format). It works fantastically for that purpose, but it does, as you correctly pointed out, require MITM-ing the browser to trust its locally generated CA: https://github.com/zaproxy/zaproxy#readme
I've had a lot of success running HTML pages through Mozilla's Readability [0] tool (actually the Go port of it [1]) before indexing them.
[0]: https://github.com/mozilla/readability
[1]: https://github.com/go-shiori/go-readability
You can pipe the URLs through something like monolith[1].
[1] https://github.com/Y2Z/monolith
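That pipeline could look roughly like this; `-o` is monolith's output-file option, while the `urls.txt` input file and the numbered-filename scheme are assumptions for illustration.

```shell
# Sketch: save each URL in urls.txt as a single self-contained HTML file
# with monolith. Filenames (page-1.html, page-2.html, ...) are an
# illustrative choice, not anything monolith prescribes.
save_all() {
  n=0
  while IFS= read -r url; do
    n=$((n + 1))
    monolith "$url" -o "page-$n.html"
  done < urls.txt
}
```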