ArchiveBox is evolving: the future of self-hosted internet archives

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

    I love ArchiveTeam warrior, it's such a good idea! We run several instances ourselves, and it's part of our Good Karma Kit for computers with spare capacity: https://github.com/ArchiveBox/good-karma-kit

    There are a bunch of other alternatives, like Readeck, listed on our wiki too; we encourage people to check them out!

    https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

  3. grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives, put them in object storage (whether that is the Internet Archive, S3, Backblaze B2, etc.), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred).

  4. ArchiveBot

    ArchiveBot, an IRC bot for archiving websites

    grab-site is a cleaned up version of https://github.com/ArchiveTeam/ArchiveBot. I argue ArchiveBot and grab-site are superior, but I am biased as an ArchiveTeam participant.

  5. good-karma-kit

    😇 A Docker Compose bundle to run on servers with spare CPU, RAM, disk, and bandwidth to help the world. Includes Tor, ArchiveWarrior, BOINC, and more...

  6. python-opentimestamps

    You really should add timestamping to ArchiveBox. The easiest way to do that would be via my OpenTimestamps protocol (https://opentimestamps.org). It's open source and free to use, and uses Bitcoin for the actual timestamps. Users do not need to make Bitcoin transactions themselves, as a set of community calendar servers does that for them. You also don't need a Bitcoin node to create an OTS timestamp, and you can validate one without a Bitcoin node by trusting someone else to do that for you.

    The big thing that ArchiveBox can't do, and the Internet Archive can, is attest to the accuracy of the archive. Being at least able to prove that the archive was created in the past, prior to there being a reason to tamper with it, is the best we can realistically do with current cryptography. So it'd be really good if support for timestamping was added.

    IIUC ArchiveBox is written in Python; OTS has a Python library that should work fine for you: https://github.com/opentimestamps/python-opentimestamps
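
    A timestamp of this kind commits to a hash of the archived data rather than to the data itself. As a rough sketch of that first step only, the following uses Python's standard hashlib to compute SHA-256 digests for every file in a snapshot directory. The function names and the directory layout are hypothetical, and nothing here calls the OpenTimestamps library; the stamping itself would be handed to an OTS client (for example its `ots stamp` command).

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_manifest(snapshot_dir):
    """Map each file under snapshot_dir (by relative path) to its digest.

    The directory layout is an assumption for illustration; it is not
    ArchiveBox's actual on-disk format.
    """
    root = Path(snapshot_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

    Writing the manifest to a single file and timestamping that one file would cover the whole snapshot: changing any archived file afterwards changes its digest and no longer matches the stamped manifest.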

  7. hunter-dkim

    Discusses how to verify DKIM signatures in old emails, namely one of the Hunter Biden emails in the news

    > OpenTimestamps alone can not currently prove anything because TLS session keys are symmetric.

    Timestamps can prove that the data existed prior to there being a known reason to modify it. While that's not as good as direct signing, that's often still enough to be very useful. The statement that OTS "can not currently prove anything" is simply wrong.

    A really good example of this is the Hunter Biden email verification. I used OpenTimestamps to prove that the DKIM key that signed the email was in fact used by Google at the time, by providing a Google-signed email that had been timestamped years ago: https://github.com/robertdavidgraham/hunter-dkim/tree/main/o...

    That's convincing evidence, because it's highly implausible that I would have been working to fake Hunter's emails years before they even came up as an election issue.

  8. urlwatch

    Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.

    I recommend urlwatch: you run it from cron on your local system and get an email when something changes.

    https://thp.io/2008/urlwatch/
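
    The underlying idea of change monitoring is to keep the previous snapshot of a page and compare it with a fresh one. The sketch below shows that comparison using Python's standard difflib. It is an illustration of the general technique, not urlwatch's implementation or API, and the function name and labels are hypothetical.

```python
import difflib


def diff_snapshots(old_text, new_text, name="page"):
    """Return a unified diff between two snapshots, or "" if unchanged."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"{name} (previous)",
        tofile=f"{name} (current)",
        lineterm="",
    )
    return "\n".join(lines)
```

    A cron job could fetch the page, call `diff_snapshots` against the stored copy, and send a notification only when the result is non-empty.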

  10. vectordb

    A minimal Python package for storing and retrieving text using chunking, embeddings, and vector search. (by kagisearch)

    How did you build it?

    I can imagine an architecture where I throw everything into ArchiveBox, then run VectorDB as a plugin with Gradio or some such as the client.

    https://vectordb.com/
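
    To make the "chunking, embeddings, and vector search" pipeline concrete, here is a toy sketch of the three steps using only the standard library. It is not the vectordb package's API: word-count vectors stand in for real embeddings, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=50):
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(chunks, query, top_n=3):
    """Return up to top_n chunks with non-zero similarity, best first."""
    q = embed(query)
    scored = [(cosine(embed(c), q), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_n] if score > 0]
```

    In the architecture described above, the extracted article text of each archived page would be passed through `chunk_text`, and a client would call `search` with the user's query.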

  11. abx-dl

    ⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...
