-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
archivy
Archivy is a self-hostable knowledge repository that allows you to learn and retain information in your own personal and extensible wiki.
-
savepagenow
A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
Run a small server with some storage at his house (or at yours, if you make it public facing) that runs ArchiveBox. It's basically a locally hosted Archive.org clone. https://github.com/ArchiveBox/ArchiveBox
github.com/Archiveteam/grab-site is quite simple, and you could probably whip up a script around it easily. It saves to WARC, but there's a very good site called https://replayweb.page that renders most pages well ... The catch is that grab-site doesn't run JavaScript, so sites that need JS to load their images will probably be saved without the images.
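A minimal sketch of such a script, assuming grab-site is already installed and on your PATH and that a plain `grab-site URL` call is all you need. The helper names `grab_site_command` and `crawl_all` are made up for illustration; they are not part of grab-site.

```python
import shutil
import subprocess


def grab_site_command(url):
    """Build the argument list for a plain grab-site crawl of one URL."""
    # Hypothetical helper: no extra options are passed, only the start URL.
    return ["grab-site", url]


def crawl_all(urls):
    """Run grab-site once per URL, one crawl after another."""
    if shutil.which("grab-site") is None:
        raise RuntimeError("grab-site is not installed or not on PATH")
    for url in urls:
        # check=True stops the loop if a crawl exits with an error.
        subprocess.run(grab_site_command(url), check=True)


# Example (this starts real crawls, so it is left commented out):
# crawl_all(["https://example.com/forum/"])
```

Each crawl produces WARC output that you can then load into https://replayweb.page to browse.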
Archivy might be to your liking!
You can use Firefox to download Web pages as HTML easily: press "F10" and a menu should appear at the top, then click "File" and then "Save as" to save the page wherever you want. This doesn't do any crawling, but because it's very quick, you could save each link manually. Another option, which does do crawling, would be to save the pages in the Wayback Machine. It doesn't save the pages on your computer, but it makes them available for everyone to see.
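If you have a list of links, the Wayback Machine part can be scripted too. Here is a rough sketch using only the Python standard library, assuming the public "Save Page Now" address `https://web.archive.org/save/<url>` accepts a plain GET request. The function names and the User-Agent string are made up for illustration, and the service may rate-limit or refuse requests.

```python
import urllib.request

# Assumed "Save Page Now" prefix; the page URL is appended to it as-is.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_request_url(page_url):
    """Return the address that asks the Wayback Machine to capture page_url."""
    return SAVE_ENDPOINT + page_url


def request_capture(page_url, timeout=60):
    """Ask for a capture and return the URL the request finally landed on."""
    request = urllib.request.Request(
        save_request_url(page_url),
        headers={"User-Agent": "forum-backup-sketch"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


# Example (makes a real network request, so it is left commented out):
# print(request_capture("https://example.com/"))
```

The savepagenow package listed above wraps the same service, so it is probably the better choice if you don't mind installing a dependency.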