Is there a good list of up-to-date data archiving tools for different websites?

This page summarizes the projects mentioned and recommended in the original post on /r/datacurator

  • reddit-save

    A Python tool for backing up your saved and upvoted posts on reddit to your computer.

  • I'm mostly on reddit and I use reddit-save. It works well! Biggest issue is I'd like to be able to archive a thread to an arbitrary length.

  • TumblThree

    A Tumblr and Twitter Blog Backup Application

  • For Tumblr, I've found, but not tried, TumblThree. It looks like it's built for Windows.

  • monolith

    ⬛️ CLI tool for saving complete web pages as a single HTML file

  • Besides wget, for single pages I use monolith: https://github.com/Y2Z/monolith

  • orgdown

  • Information management based on text files is quite common. There are cloud-based solutions, which I loathe for reasons. Desktop wikis are another potential candidate, and I used some myself for a couple of years until I found the solution that has the most features, the greatest flexibility and a very large community: GNU Emacs with its Org-mode. The chosen file format will then be Orgdown.

  • Memacs

    What did I do on February 14th 2007? Visualize your (digital) life in Org-mode

  • Back to the original question. In order to get as much content as possible into a common format to be displayed in a common temporal view, I've created a framework that consists of some general functionality and a set of modules that deal with different input sources and formats. This project is called Memacs. You can also read a whitepaper about it.

  • SingleFileZ

    Web Extension to save a faithful copy of an entire web page in a self-extracting ZIP file

  • If you do have files whose names begin with an ISO 8601 compliant date- or timestamp, the filenametimestamps module will do the trick. This way, I index all photographs, all web downloads, emails, usenet postings, ... just by choosing a specific file name prefix format. The same holds true for web pages, which are automatically saved using SingleFileZ to files matching that file name prefix format. There you go, this is how I solve your original question.

  • I already use orgmode a bit - mostly through org-roam! I have a pretty inefficient set-up where I save a webpage via SingleFileZ, then use readability-cli on it, then convert the readable output to an orgmode file. Definitely not efficient because I need to manually complete each step, but I haven't bothered to try to automate it yet.
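
The file name prefix convention mentioned above can be illustrated with a small sketch. This is not the Memacs filenametimestamps module and makes no claim about the formats that module accepts. It assumes prefixes of the form `YYYY-MM-DD`, optionally followed by `THH.MM` or `THH.MM.SS`, and the helper names (`parse_prefix`, `org_timestamp`, `org_index`) are made up for illustration. It turns each matching file name into an Org-mode style timestamp and sorts the entries by time.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# Assumed prefix forms (illustrative only, not taken from Memacs):
#   2007-02-14 notes.txt
#   2007-02-14T10.30 page.html
#   2007-02-14T10.30.59 page.html
PREFIX = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:T(?P<h>\d{2})[.:](?P<m>\d{2})(?:[.:](?P<s>\d{2}))?)?"
)

# Fixed English day names so the output does not depend on the locale.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def parse_prefix(filename: str) -> Optional[Tuple[datetime, bool]]:
    """Return (timestamp, has_time) for a date-prefixed name, else None."""
    match = PREFIX.match(filename)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match["date"], "%Y-%m-%d")
        if match["h"] is None:
            return stamp, False
        stamp = stamp.replace(
            hour=int(match["h"]),
            minute=int(match["m"]),
            second=int(match["s"] or 0),
        )
        return stamp, True
    except ValueError:
        # Looks like a date prefix but is not a valid date or time.
        return None


def org_timestamp(filename: str) -> Optional[str]:
    """Format the prefix as an Org-mode style active timestamp."""
    parsed = parse_prefix(filename)
    if parsed is None:
        return None
    stamp, has_time = parsed
    text = f"{stamp:%Y-%m-%d} {DAY_NAMES[stamp.weekday()]}"
    if has_time:
        text += f" {stamp:%H:%M}"
    return f"<{text}>"


def org_index(filenames: Iterable[str]) -> List[str]:
    """Build chronologically sorted Org headings; skip names without a prefix."""
    entries = []
    for name in filenames:
        parsed = parse_prefix(name)
        if parsed is not None:
            entries.append((parsed[0], f"** {org_timestamp(name)} {name}"))
    return [line for _, line in sorted(entries)]
```

For example, `org_timestamp("2007-02-14T10.30 page.html")` gives `<2007-02-14 Wed 10:30>`, and `org_index(os.listdir(some_dir))` would list only the files that follow the naming convention, oldest first.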

