-
You might also be interested in this list. The alternatives on it are really great, often better than my program, and some support the WARC format (which my program doesn't).
-
I created Collect a few years ago and still use it today.
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
I use grab-site to crawl a website and pack it into a WARC archive, then feed that archive into pywb.
-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
I landed on an open-source project called ArchiveBox. It's pretty amazing (basically a locally hosted Wayback Machine and crawler): https://github.com/ArchiveBox/ArchiveBox. It also captures pages in different ways to ensure data integrity, and it can schedule captures! Thanks everyone for your input and the apps for me to research!