Python Archiving

Open-source Python projects categorized as Archiving

Top 12 Python Archiving Projects

  1. paperless-ngx

    A community-supported supercharged version of paperless: scan, index and archive all your physical documents

    Project mention: Is there a way to run an LLM as a better local search engine? | news.ycombinator.com | 2025-06-18

    Interesting, I had found Paperless:

    https://github.com/icereed/paperless-gpt

    https://docs.paperless-ngx.com/#features

    These options seem far from... user friendly. Another concern is resource usage; I wonder how low LLMs will go (especially as far as RAM and GPU requirements are concerned).

  3. wal-e

    Continuous Archiving for Postgres

    Project mention: Ask HN: What do you use to backup your VMs? | news.ycombinator.com | 2024-09-27

    For me, it's case-by-case. I don't back up the VMs directly, just the data of the stateful applications running on the VMs (or bare metal servers, I do identical stuff for them).

    For postgres, I used to just have a systemd timer that would `pg_dumpall` and throw it in s3.

    Now I use https://github.com/wal-e/wal-e to back up my postgresql databases.

    For other local files, I use borg backup for personal files and services I just run for myself, and I use restic to back up server files to s3.

    The operating system's configuration is all stored in git via the magic of NixOS, so I don't have to worry about files in /etc, they all are 100% reproducible from my NixOS configuration.

  4. grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    Project mention: ArchiveBox is evolving: the future of self-hosted internet archives | news.ycombinator.com | 2024-10-16

    https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives, put them in object storage (whether that is IA, S3, Backblaze B2, etc), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred).

  5. URS

    Universal Reddit Scraper - A comprehensive Reddit scraping/archival command-line tool.

  6. ArchiveBot

    ArchiveBot, an IRC bot for archiving websites

    Project mention: ArchiveBox is evolving: the future of self-hosted internet archives | news.ycombinator.com | 2024-10-16

    grab-site is a cleaned up version of https://github.com/ArchiveTeam/ArchiveBot. I argue ArchiveBot and grab-site are superior, but I am biased as an ArchiveTeam participant.

  7. archiveis

    A simple Python wrapper for the archive.is capturing service

    Project mention: New York Times shut down Tor Onion service | news.ycombinator.com | 2025-03-14

    "https://archive.is is still the de-facto way to read NYT articles..."

    It's popular on HN for sure, but archive.is only works where the NYT article is archived

    Lighter weight, open source, local solutions like Bypass Paywalls Clean work on any NYT article regardless of whether archived

    archive.is may be blocked for some internet subscribers in some countries

    archive.is webpages are quite large and contain telemetry

    For example,

    https://[USER-IP-ADDRESS].us.TBT3.379913754.pixel.archive.is... type="text/javascript">

  8. savepagenow

    A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service

  10. reddit_export_userdata

    Export userdata from your reddit accounts. Submissions, comments, saved, upvoted contents are supported.

  11. wattpad-archiver

    Downloads your wattpad library to disk

  12. Archivist-Tools

    These Python scripts help you manage a large number of files.

  13. archive-to-images

    Python CLI to transform archives into images and reverse.

  14. VideoCheck

    Automated tool to check consistency of your video library.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
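
For the savepagenow entry above, here is a minimal sketch of the first step such a wrapper has to perform: building the capture request URL. It assumes the "Save Page Now" service accepts requests at https://web.archive.org/save/ followed by the target address. The names SAVE_ENDPOINT and build_save_url are hypothetical and are not part of the savepagenow package. No network request is made.

```python
from urllib.parse import quote

# Assumption: Save Page Now accepts capture requests at this URL prefix.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(target_url: str) -> str:
    """Return the capture request URL for target_url (hypothetical helper)."""
    target = target_url.strip()
    if not target.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    # Keep URL delimiters intact; percent-encode only unsafe characters such as spaces.
    return SAVE_ENDPOINT + quote(target, safe=":/?&=#%+@,;~")


if __name__ == "__main__":
    print(build_save_url("https://example.com/some page"))
```

Sending the resulting URL with an HTTP client would trigger the capture. The real package may handle errors, caching, and authentication differently, so treat this only as an illustration of the idea.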

Python Archiving related posts

  • New York Times shut down Tor Onion service

    3 projects | news.ycombinator.com | 14 Mar 2025
  • Automattic's "nuclear war" over WordPress access sparks potential class action

    2 projects | news.ycombinator.com | 27 Feb 2025
  • It's the Most Indispensable Machine in the World–and It Depends on This Woman

    1 project | news.ycombinator.com | 3 Jan 2025
  • Ask HN: What's a good alternative to The Verge now it's login-gated?

    1 project | news.ycombinator.com | 14 Dec 2024
  • ArchiveBox is evolving: the future of self-hosted internet archives

    14 projects | news.ycombinator.com | 16 Oct 2024
  • The FBI created a coin to investigate crypto pump-and-dump schemes

    1 project | news.ycombinator.com | 11 Oct 2024
  • We're losing our digital history. Can the Internet Archive save it?

    1 project | news.ycombinator.com | 18 Sep 2024

Index

What are some of the best open-source Archiving projects in Python? This list will help you:

#   Project                  Stars
1   paperless-ngx            28,086
2   wal-e                    3,469
3   grab-site                1,489
4   URS                      875
5   ArchiveBot               386
6   archiveis                203
7   savepagenow              180
8   reddit_export_userdata   21
9   wattpad-archiver         7
10  Archivist-Tools          6
11  archive-to-images        6
12  VideoCheck               2

