-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
I love ArchiveTeam Warrior; it's such a good idea! We run several instances ourselves, and it's part of our Good Karma Kit for computers with spare capacity: https://github.com/ArchiveBox/good-karma-kit
There are a bunch of other alternatives like ReadDeck listed on our wiki too, and we encourage people to check them out!
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives, put them in object storage (whether that is IA, S3, Backblaze B2, etc.), and then keep them in cold storage or serve them up via HTTPS or a torrent (preferably a mutable one).
-
grab-site is a cleaned up version of https://github.com/ArchiveTeam/ArchiveBot. I argue ArchiveBot and grab-site are superior, but I am biased as an ArchiveTeam participant.
-
good-karma-kit
😇 A Docker Compose bundle to run on servers with spare CPU, RAM, disk, and bandwidth to help the world. Includes Tor, ArchiveWarrior, BOINC, and more...
-
You really should add timestamping to ArchiveBox. The easiest way to do that would be via my OpenTimestamps protocol (https://opentimestamps.org). It's open source and free to use, and uses Bitcoin for the actual timestamps. Users do not need to make Bitcoin transactions themselves, as a set of community calendar servers does that for them. You also don't need a Bitcoin node to create an OTS timestamp, and you can validate an OTS timestamp without a Bitcoin node as well by trusting someone else to do that for you.
The big thing that ArchiveBox can't do, and the Internet Archive can, is attest to the accuracy of the archive. Being able to at least prove that the archive was created in the past, before there was a reason to tamper with it, is the best we can realistically do with current cryptography. So it'd be really good if support for timestamping was added.
IIUC ArchiveBox is written in Python; OTS has a Python library that should work fine for you: https://github.com/opentimestamps/python-opentimestamps
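As a rough sketch of what would actually get timestamped: a timestamp commits to a digest, so one option is to reduce a whole snapshot folder to a single SHA-256 value and stamp that. The code below uses only Python's standard library. `file_sha256`, `snapshot_digest`, and the manifest layout are hypothetical names made up for illustration; they are not part of ArchiveBox or python-opentimestamps.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 64 KiB blocks so large media files never sit fully in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def snapshot_digest(snapshot_dir: Path) -> str:
    """Hypothetical helper: one digest for a whole snapshot folder.

    Builds a manifest of "<sha256>  <relative path>" lines, sorted by path,
    then hashes the manifest. Changing, adding, removing, or renaming any
    file changes the result.
    """
    rel_paths = sorted(
        p.relative_to(snapshot_dir).as_posix()
        for p in snapshot_dir.rglob("*")
        if p.is_file()
    )
    lines = [f"{file_sha256(snapshot_dir / rel)}  {rel}" for rel in rel_paths]
    manifest = "\n".join(lines) + "\n"
    return hashlib.sha256(manifest.encode("utf-8")).hexdigest()
```

Saving the manifest next to the snapshot and running the OpenTimestamps client's `ots stamp` on it would be one plausible way to get the proof file; how ArchiveBox would wire that in is an open design question.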
-
hunter-dkim
Discusses how to verify DKIM signatures in old emails, namely one of the Hunter Biden emails in the news
> OpenTimestamps alone can not currently prove anything because TLS session keys are symmetric.
Timestamps can prove that the data existed prior to there being a known reason to modify it. While that's not as good as direct signing, that's often still enough to be very useful. The statement that OTS "can not currently prove anything" is simply wrong.
A really good example of this is the Hunter Biden email verification. I used OpenTimestamps to prove that the DKIM key that signed the email was in fact used by Google at the time, by providing a Google-signed email that had been timestamped years ago: https://github.com/robertdavidgraham/hunter-dkim/tree/main/o...
That's convincing evidence, because it's highly implausible that I would have been working to fake Hunter's emails years before they even came up as an election issue.
-
urlwatch
Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
I recommend urlwatch: you run it from cron on your local system and get an email when something changes.
https://thp.io/2008/urlwatch/
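The core idea behind a watcher like this can be sketched in a few lines of Python: keep the last copy of the page text, and when a new copy differs, produce a diff to send as the notification. This is not urlwatch's own code. `check_for_change` and the state-file layout are hypothetical, and fetching the page and sending the email are left out.

```python
import difflib
from pathlib import Path
from typing import Optional


def check_for_change(name: str, new_text: str, state_dir: Path) -> Optional[str]:
    """Compare new_text with the copy saved on the previous run.

    Returns a unified diff if the text changed, or None on the first run
    or when nothing changed. The new text is always saved for next time.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    state_file = state_dir / f"{name}.txt"
    old_text = state_file.read_text(encoding="utf-8") if state_file.exists() else None
    state_file.write_text(new_text, encoding="utf-8")
    if old_text is None or old_text == new_text:
        return None
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)
```

A script built around this would be called from cron on whatever schedule suits the page, and would mail the returned diff only when it is not None.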
-
vectordb
A minimal Python package for storing and retrieving text using chunking, embeddings, and vector search. (by kagisearch)
How did you build it?
I can imagine an architecture where I throw everything into ArchiveBox, then run VectorDB as a plugin with Gradio or some such as the client.
https://vectordb.com/
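To make that pipeline concrete, here is a toy version of the chunk, embed, and search steps using only the standard library. Word-count vectors stand in for real embeddings, and none of the names below (`chunk`, `embed`, `cosine`, `search`) come from vectordb or ArchiveBox; a real setup would swap in a learned embedding model and a proper index.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk(text: str, size: int = 40) -> List[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Stand-in 'embedding': lowercase word counts (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: List[str], top_n: int = 3) -> List[Tuple[float, str]]:
    """Return up to top_n (score, chunk) pairs, most similar first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]
```

In the architecture described above, the article text saved for each archived page would be passed through `chunk`, and a small web client would call something like `search` on the user's query.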
-
abx-dl
⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...