Starting my own Data Hoarding

This page summarizes the projects mentioned and recommended in the original post on /r/DataHoarder.

  • reddit-save

    A Python tool for backing up your saved and upvoted posts on reddit to your computer.

  • Reddit: yes, we are on reddit, and there is a considerable amount of science and art there. I am already archiving all saved posts using reddit-save, and some low-traffic subreddits in their entirety (tool still unknown). A rough sketch of the saved-posts backup appears after this list.

  • youtube-dl

    Command-line program to download videos from YouTube.com and other video sites

  • YouTube: a bunch of science there (Cody et al.). The idea is to first download all my personal playlists using youtube-dl in archive mode, and then perhaps start downloading select channels in their entirety at low resolution. Many videos have already been deleted from my playlists. YouTube comments should also be saved, but I am not sure how yet. A hedged youtube-dl sketch appears after this list.

  • floccus

    Sync your bookmarks privately across browsers and devices

  • Bookmarks: these are at less risk, because archive.org has been doing a good job of keeping sites alive, but still. Bookmarks will be synchronized to a self-hosted WebDAV server using the floccus add-on and then archived with ArchiveBox; a rough pipeline sketch appears after this list. There are a few notes on this:

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

  • awesome-web-archiving

    An Awesome List for getting started with web archiving

  • Website crawls: some personal websites are so packed with info that it is worth saving them in their entirety. But on more complex sites, a naive crawl would produce an ungodly amount of duplicate data due to CGI parameters like pagination and sorting. I have only crawled about 3 websites with wget so far. This will require more thought and reading; a hedged wget sketch appears after this list.

  • wikiteam

    Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2023, WikiTeam has preserved more than 350,000 wikis.

  • Wikis - these look easy to handle with WikiTeam; a short sketch follows below.
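
For the Reddit item above: reddit-save already covers saved and upvoted posts, so the following is only an illustration of what that backup amounts to. It is a minimal sketch using PRAW rather than reddit-save itself; the credentials and output filename are placeholders.

```python
import json

import praw  # pip install praw

# Placeholder credentials for a "script" app created at reddit.com/prefs/apps.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
    user_agent="saved-posts-backup/0.1",
)

items = []
for item in reddit.user.me().saved(limit=None):
    items.append({
        "id": item.id,
        "permalink": item.permalink,
        "subreddit": str(item.subreddit),
        "created_utc": item.created_utc,
        # Submissions carry a title/url, comments carry a body.
        "title": getattr(item, "title", None),
        "url": getattr(item, "url", None),
        "body": getattr(item, "body", None),
    })

with open("reddit-saved.json", "w", encoding="utf-8") as f:
    json.dump(items, f, indent=2, ensure_ascii=False)
```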
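
For the YouTube item above: youtube-dl's archive mode is the download-archive option, which records finished video IDs in a text file so repeated runs only fetch what is new. Below is a minimal sketch via the embedding API; the playlist URL, output template, and low-res format string are assumptions to adjust.

```python
import youtube_dl  # pip install youtube_dl

# Placeholder playlist URLs; private playlists need cookies/authentication.
PLAYLISTS = [
    "https://www.youtube.com/playlist?list=PLxxxxxxxxxxxxxxxx",
]

opts = {
    "format": "worst[height>=360]/worst",  # low-res to save space; tune to taste
    "download_archive": "yt-archive.txt",  # skip video IDs already recorded here
    "outtmpl": "youtube/%(playlist)s/%(title)s-%(id)s.%(ext)s",
    "ignoreerrors": True,                  # keep going past deleted/private videos
}

with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(PLAYLISTS)
```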
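
For the bookmarks item above: floccus can sync to an XBEL file on a WebDAV share, and ArchiveBox accepts URLs on stdin via `archivebox add`. The sketch below glues the two together; the XBEL path is an assumption, and the command must be run from inside an initialized ArchiveBox data directory.

```python
import subprocess
import xml.etree.ElementTree as ET

# Assumed location of the floccus-synced XBEL file (e.g. a mounted WebDAV share).
XBEL_PATH = "/mnt/webdav/bookmarks.xbel"

# Collect every bookmark URL, folders included, and de-duplicate.
tree = ET.parse(XBEL_PATH)
urls = sorted({b.get("href") for b in tree.iter("bookmark") if b.get("href")})

# Feed the URLs to ArchiveBox on stdin (run from the ArchiveBox data directory).
subprocess.run(["archivebox", "add"], input="\n".join(urls), text=True, check=True)
```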
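
For the website-crawls item above: one way to keep a naive mirror from exploding on pagination and sorting parameters is wget's --reject-regex, which drops matching URLs before download. The target site, output directory, and regex below are placeholders; the flag set is a starting point, not a recipe.

```python
import subprocess

SITE = "https://example.org/"  # placeholder target site

subprocess.run([
    "wget",
    "--mirror",                # recursive download with timestamping
    "--convert-links",         # rewrite links for offline browsing
    "--adjust-extension",      # save text/html pages with an .html extension
    "--page-requisites",       # grab CSS/JS/images needed to render pages
    "--no-parent",             # never ascend above the starting directory
    # Skip the query-string permutations that cause most of the duplication.
    "--reject-regex", r".*[?&](sort|order|page|replytocom)=.*",
    "--directory-prefix", "crawls/example.org",
    SITE,
], check=True)
```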
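
For the wikis item above: WikiTeam's dumpgenerator is itself a Python script, wrapped in subprocess here for consistency. The wiki URL is a placeholder and the flags are the commonly documented ones (--xml for full page history, --images for uploads); check the project's README for the current interface.

```python
import subprocess

# Placeholder wiki; point at the wiki's index.php (or api.php) as appropriate.
WIKI_URL = "https://wiki.example.org/index.php"

# --xml grabs the complete page history, --images grabs uploaded files.
subprocess.run(
    ["python", "dumpgenerator.py", WIKI_URL, "--xml", "--images"],
    check=True,
)
```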
