You could check out some of the options ArchiveBox has to offer if you like running servers.

Welcome. Since ArchiveBox doesn't crawl pages itself, you might also be interested in something like fetchurls, a bash script that spiders a site, follows links, and writes the fetched URLs (with built-in filtering) to a text file.
YaCy is also a decent web crawler. It's probably overkill in most cases, since it also wants to build a search index, but it can export a plain-text list of URLs that can then be fed into ArchiveBox. (Note that since YaCy is peer-to-peer, you need to opt out in its settings if you don't want your node storing index data for other people's sites.)
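To sketch the crawl-then-archive workflow described above: both fetchurls and YaCy can produce a one-URL-per-line text file, which you can clean up and pipe into `archivebox add` (which accepts URLs on stdin). The URLs and filenames below are made-up placeholders; the `grep` filter is just one example of trimming assets you don't want archived.

```shell
# Fake the output of a crawler (fetchurls / YaCy export):
# one URL per line, possibly with duplicates and asset files.
printf '%s\n' \
  'https://example.com/a' \
  'https://example.com/a' \
  'https://example.com/assets/logo.png' \
  'https://example.com/b' > urls.txt

# Drop image assets and duplicates before archiving.
grep -v '\.png$' urls.txt | sort -u > pages.txt

# Feed the cleaned list into ArchiveBox (run inside an initialized
# ArchiveBox data directory; reads one URL per line from stdin):
# archivebox add < pages.txt
```

The actual `archivebox add` call is left commented out since it needs an initialized ArchiveBox collection to run against.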