I'm not in the pirate archivist space, but sections 3 and 5 are relevant to my interests. I've had great luck with ZAP (https://github.com/zaproxy/zaproxy#readme) glued to a copy of Firefox (because Firefox lets you monkey with the _browser_'s proxy without having to alter the system-wide one, as other browsers require) for archiving all content seen while surfing around a site. It even achieves the stated goal of preserving the HTML (etc.) in a database, since ZAP uses hsqldb.
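To make the "glue" concrete, here is a rough sketch (not my exact setup): point a throwaway Firefox profile at ZAP's proxy, which listens on 127.0.0.1:8080 by default, without touching the system proxy. The paths and port are assumptions; adjust them to your ZAP configuration.

```python
import pathlib
import subprocess
import tempfile

# Create a disposable profile whose user.js routes all traffic through ZAP.
profile = pathlib.Path(tempfile.mkdtemp(prefix="zap-profile-"))
(profile / "user.js").write_text("\n".join([
    'user_pref("network.proxy.type", 1);',               # 1 = manual proxy config
    'user_pref("network.proxy.http", "127.0.0.1");',
    'user_pref("network.proxy.http_port", 8080);',        # ZAP's default listen port
    'user_pref("network.proxy.ssl", "127.0.0.1");',
    'user_pref("network.proxy.ssl_port", 8080);',
]))

# -no-remote keeps this instance separate from any Firefox you already have open.
subprocess.run(["firefox", "-profile", str(profile), "-no-remote"])
```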
Then, section 5 reads like an advertisement for Scrapy, since it is just stellar at following all pagination links and then either emitting the extracted payload as your own data structure or telling Scrapy you want to download some media as-is. It will, by default, put the local content in a directory of your choice and hash the URL to make the local filename. A separate JSON file serves as the "accounting" between the things it downloaded and their hashed on-disk filenames.
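A minimal sketch of what I mean (selectors and the start URL are placeholders, not from any real site): a spider that follows "next page" links and hands document URLs to Scrapy's built-in FilesPipeline, which stores each download under a hash of its URL and records the URL-to-path mapping in the exported item feed.

```python
import scrapy


class ArchiveSpider(scrapy.Spider):
    name = "archive"
    start_urls = ["https://example.org/catalog"]  # placeholder

    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "./downloads",                 # where the hashed files land
        "FEEDS": {"items.json": {"format": "json"}},  # the "accounting" file
    }

    def parse(self, response):
        # One item per document link; FilesPipeline downloads every URL in
        # file_urls and adds a "files" field mapping URL -> hashed local path.
        for href in response.css("a.document::attr(href)").getall():
            yield {"page": response.url, "file_urls": [response.urljoin(href)]}

        # Keep following pagination until there is no next link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```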
Scrapy is also able to glue sections 3 and 5 together, because it has a pluggable (everything, heh) dupe-detection hook and also HTTP cache support that can be backed by anything, including the aforementioned hsqldb operating in network mode. Scrapy is also very test-friendly, since each method accepts a well-known Python object and emits either a follow-on request, zero or more extracted objects, or nothing if pagination has ended.
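Both points, sketched under some assumptions: DUPEFILTER_CLASS and the HTTPCACHE_* names are real Scrapy settings, while the hsqldb-backed cache storage class is hypothetical (you would write it yourself against ZAP's database). The test shows the "each method accepts a well-known object" part by feeding an in-memory response to the ArchiveSpider callback from the sketch above, with no network or crawler engine involved.

```python
from scrapy.http import HtmlResponse

# Pluggability is just a settings swap; the storage class here is hypothetical.
PLUGGABLE_SETTINGS = {
    "DUPEFILTER_CLASS": "scrapy.dupefilters.RFPDupeFilter",    # the default; replace with your own
    "HTTPCACHE_ENABLED": True,
    "HTTPCACHE_STORAGE": "myproject.cache.HsqldbCacheStorage",  # hypothetical hsqldb-backed cache
}


def test_parse_emits_item_and_follows_next_page():
    # Build a fake response in memory, call the callback directly, and
    # assert on what it yields.
    html = b'<a class="document" href="/doc1.pdf"></a><a class="next" href="/page/2"></a>'
    response = HtmlResponse("https://example.org/catalog", body=html, encoding="utf-8")
    results = list(ArchiveSpider().parse(response))  # ArchiveSpider from the sketch above

    items = [r for r in results if isinstance(r, dict)]
    requests = [r for r in results if not isinstance(r, dict)]
    assert items[0]["file_urls"] == ["https://example.org/doc1.pdf"]
    assert requests[0].url == "https://example.org/page/2"
```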
I can appreciate there may be other scraping frameworks, but of the ones I've tried, Scrapy makes everything I've asked of it simple and transparent.