How to Become a Pirate Archivist

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ZAP

    The ZAP core project

  • I'm not in the pirate archivist space, but sections 3 and 5 are relevant to my interests. I've had great luck with ZAP (https://github.com/zaproxy/zaproxy#readme) glued to a copy of Firefox (because Firefox lets you change the _browser_'s own proxy settings without having to alter the system proxy, as other browsers require) for archiving all content seen while surfing around a site. It even achieves the stated goal of preserving the HTML (etc.) in a database, since ZAP stores its session in hsqldb. (A minimal sketch of driving Firefox through ZAP appears after the project list below.)

    Then, section 5 reads like an advertisement for Scrapy, since it is just stellar at following all pagination links and then either emitting the extracted payload as your own data structure, downloading media as-is when you ask it to, or both. By default it puts the downloaded content in a directory of your choice and hashes the URL to make the local filename; a separate JSON file serves as the "accounting" between the things it downloaded and their hashed on-disk filenames. (A spider sketch illustrating this appears after the list below.)

    Scrapy is also able to glue sections 3 and 5 together, because it has a pluggable (everything is pluggable, heh) duplicate-detection hook and HTTP cache support that can be backed by anything, including the aforementioned hsqldb operating in network mode (see the settings sketch below). Scrapy is also very test-friendly, since each callback accepts a well-known Python object and emits either a follow-on request, zero or more extracted objects, or nothing if pagination has ended.

    I can appreciate that there may be other scraping frameworks, but of the ones I've tried, Scrapy makes everything I've asked it to do simple and transparent.

  • content-seeder
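
A minimal sketch of the ZAP-plus-Firefox setup described in the comment above. The original only describes changing Firefox's proxy settings by hand; using Selenium to do it, and ZAP's default listen address of 127.0.0.1:8080, are assumptions for illustration.

```python
from selenium import webdriver

# Point Firefox's own proxy settings (not the system proxy) at ZAP.
# 127.0.0.1:8080 is ZAP's default local proxy; adjust if yours differs.
options = webdriver.FirefoxOptions()
options.set_preference("network.proxy.type", 1)            # 1 = manual proxy configuration
options.set_preference("network.proxy.http", "127.0.0.1")
options.set_preference("network.proxy.http_port", 8080)
options.set_preference("network.proxy.ssl", "127.0.0.1")
options.set_preference("network.proxy.ssl_port", 8080)

driver = webdriver.Firefox(options=options)
driver.get("https://example.org/")  # everything the browser fetches from here on lands in ZAP's history
```

With this in place, every response the browser sees passes through ZAP, which records it in its hsqldb-backed session.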
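
The spider below is a sketch of the pagination-plus-media pattern from the comment: follow "next" links, emit items, and let Scrapy's FilesPipeline download attachments into FILES_STORE under URL-hash filenames. The site URL, CSS selectors, and field names are hypothetical.

```python
import scrapy

class ArchiveSpider(scrapy.Spider):
    # Hypothetical target and selectors -- only the Scrapy mechanics are the point here.
    name = "archive"
    start_urls = ["https://example.org/listing"]
    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "./downloads",  # files land here, named by a hash of their URL
    }

    def parse(self, response):
        # Emit one item per entry; "file_urls" tells the FilesPipeline what to download as-is.
        for row in response.css("article.entry"):
            yield {
                "title": row.css("h2::text").get(),
                "file_urls": row.css("a.attachment::attr(href)").getall(),
            }
        # Keep following pagination until there is no "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running it with a feed export, e.g. `scrapy runspider archive_spider.py -O items.json`, produces the JSON "accounting" file: each exported item gains a `files` field mapping the original URLs to their hashed on-disk paths.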
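
And a sketch of the settings that glue the duplicate-detection and caching pieces together. The filesystem cache storage shown is Scrapy's default backend; the custom dupefilter class path is a hypothetical placeholder for something backed by your own store (hsqldb or otherwise).

```python
# settings.py (sketch)

# HTTP cache: responses are stored locally and replayed on re-crawl.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_EXPIRATION_SECS = 0  # 0 = cached responses never expire
# Default backend; swap in your own storage class to back the cache with another store.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Duplicate detection is pluggable too; this class path is hypothetical.
DUPEFILTER_CLASS = "myproject.dupefilters.SeenInArchiveDupeFilter"
```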

