-
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
-
ArchiveBox
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
You're using it wrong, rtfm, wget is still the standard. It's also extensible beyond the base feature set; take, for example, wget-lua, ArchiveTeam's well-maintained go-to for nearly all of the group's scraping projects.
I wonder if that's a job for RancherOS, since everything in RancherOS is a Docker container (https://rancher.com/docs/os/v1.x/en/). Or is there some better compact OS?
ArchiveBox is a no-go for my needs because I often want to crawl entire domains, and as far as I can tell, they don't support that: https://github.com/ArchiveBox/ArchiveBox/issues/191
I have started using the tools from https://webrecorder.net, like Browsertrix Crawler, and they have been working great. The web archive format is open source and very portable. The crawler even crawls and saves YouTube videos embedded on pages, which is awesome.