Top 15 wayback-machine Open-Source Projects
-
ArchiveBox
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
web-archives
Browser extension for viewing archived and cached versions of web pages, available for Chrome, Edge and Safari
-
wayback-machine-scraper
A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
-
Discord-Recon
Discord bot created to automate bug bounty recon, automated scans and information gathering via a discord server
-
hn_bot
A Bot that searches for and posts links to archived versions of articles after scanning all of HackerNews' top articles for those that contain a link to a site that requires a subscription.
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
Two projects I greatly appreciate, allowing me to easily archive my Bandcamp and GOG purchases (after the initial setup, anyway):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
I wish there were an alternative to the Internet Archive with collaborative curation: you share files, and the people who tag and sort them into albums can download them. If it were federated, it could be just as extensive as the Internet Archive by searching files on many instances at the same time. Sadly, the closest things are ArchiveBox and wayback, which won't replace the Internet Archive.
Project mention: An Ugly Single-Page Website Makes $5k a Month with Affiliate Marketing | news.ycombinator.com | 2024-01-20
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems browsing interconnected links in a forum format, but try this as a first step.
One potential option, though definitely a bit more work: once you have all the WARC files downloaded, you can open them in Python using the warctools module (and maybe BeautifulSoup) and parse/extract the data embedded in the WARC archives into your own "fresh" HTML webserver.
https://github.com/internetarchive/warctools
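To illustrate the idea, here is a minimal standard-library sketch of pulling target URLs and payloads out of an (uncompressed) WARC file. This is a simplified toy, not warctools itself: real WARC parsing should use warctools or warcio, response payloads also contain HTTP headers that this sketch leaves in place, and `parse_warc_records` is a hypothetical helper name.

```python
def parse_warc_records(text: str):
    """Return (target_uri, payload) pairs for each 'response' record.

    Simplified: assumes uncompressed WARC 1.0 text and splits naively on
    the version line. The payload of a real response record still begins
    with the captured HTTP headers.
    """
    records = []
    for chunk in text.split("WARC/1.0\r\n"):
        if not chunk.strip():
            continue
        # WARC header block and payload are separated by a blank line.
        head, _, payload = chunk.partition("\r\n\r\n")
        headers = {}
        for line in head.split("\r\n"):
            if ": " in line:
                key, value = line.split(": ", 1)
                headers[key] = value
        if headers.get("WARC-Type") == "response":
            records.append((headers.get("WARC-Target-URI"), payload.strip()))
    return records
```

From there, each payload could be handed to BeautifulSoup to extract the page content for re-serving.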
Project mention: wayback-machine-scraper: NEW Data - star count:380.0 | /r/algoprojects | 2023-12-10
I ended up using the waybackpy Python module to retrieve archived URLs, and it worked well. I think the feature you want for this is "snapshots", but I didn't test that myself.
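Under the hood, tools like waybackpy query the Wayback Machine's public CDX server. A small standard-library sketch of that idea (the endpoint is the real documented CDX API; `cdx_query_url` and `parse_snapshots` are hypothetical helper names):

```python
import json
from urllib.parse import urlencode

# Public Wayback CDX server endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url: str, limit: int = 10) -> str:
    """Build a CDX query URL that returns snapshot rows as JSON."""
    params = {"url": page_url, "output": "json", "limit": limit}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_snapshots(cdx_json: str):
    """Turn a CDX JSON response (header row + data rows) into replay URLs."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in data]
```

Fetching `cdx_query_url("example.com")` with any HTTP client and feeding the body to `parse_snapshots` yields the archived capture URLs.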
Project mention: Ask HN: What products other than Obsidian share the file over app philosophy? | news.ycombinator.com | 2024-04-03
There are flat-file CMSes (content management systems) like Grav: https://getgrav.org/
I guess, in some vague/broad sense, config-as-code systems also implement something similar? Maybe even OpenAPI schemas could count to some degree...?
In the old days, the "semantic web" movement was an attempt to make more webpages both human- and machine-readable indefinitely by tagging them with proper schema: https://en.wikipedia.org/wiki/Resource_Description_Framework. Even Google was on board for a while, but it never saw much uptake. As far as I can tell it's basically dead now, partly because of non-semantic HTML (everything is a React div), partly general laziness, and partly because LLMs can parse things loosely anyway.
-------------
Side thoughts...
Philosophically, I don't know that capturing raw data alone as files is really sufficient to capture the nuances of any particular experience, or the overall zeitgeist of an era. You can archive Geocities pages, but that doesn't really capture the novelty and indie-ness of that era. Similarly, you can save TikTok videos, but absent the cultural environment that created them (and a faithful recreation of the recommendation algorithm), they wouldn't really show future archaeologists how teenagers today lived.
I worked for a natural history museum for a while, and one of the interesting questions (well, to me anyway) was whether our web content was in and of itself worth preserving as a cultural artifact -- both so that future generations could see what exhibits were interesting/apropos for the cultures of our time, and so they could see how our generation found out about those exhibitions to begin with (who knows what the Web will morph into 50 years from now). It wasn't enough to simply save the HTML of our web pages, both because they tied into various other APIs and databases (like zoological collections) and because some were interactive experiences, like games designed to be played with a mouse (before phones were popular), or phone chatbots featuring some of our specimens. To really capture the experience authentically would've required emulating not just our tech stacks but also our devices, among other things.
Like for the earlier Geocities example, sure you could just save the old HTML and render it with a modern browser, but that's not the same as something like https://oldweb.today/?browser=ns3-mac#http://geocities.com/ , which emulates the whole OS and browser too. And that still isn't the same as having to sit in front of a tiny CRT and wait minutes for everything to download over a 14.4k modem, only to be interrupted when mom had to make a call.
I guess that's a long-winded way of critiquing "file over app": it only makes sense for things that are files/documents to begin with. Much of our lives now are not flat docs but "experiences" that take much more thought and effort to archive. If the goal is truly to preserve them for posterity, it's not enough to just archive the raw data; we have to develop ways to record and later emulate entire experiences, both technological and cultural. It ain't easy!
I've developed yet another solution that can help you extract data from web archives :) You can use it as a standalone tool or import it into your Go project. GitHub: https://github.com/karust/gogetcrawl
wayback-machine related posts
-
Ask HN: It's 1997, how do you build a website?
-
Ask HN: How can I back up an old vBulletin forum without admin access?
-
wayback-machine-scraper: NEW Data - star count:380.0
-
Best practices for archiving websites
-
Is there a website that shows chronological versions of websites/apps/platforms?
-
download all captures of a page in archive.org
-
phpBB3 forum owner dead. Webhost purging soon. Need to quickly archive a site
Index
What are some of the best open-source wayback-machine projects? This list will help you:
# | Project | Stars
---|---|---
1 | ArchiveBox | 19,790 |
2 | gau | 3,557 |
3 | wayback | 1,643 |
4 | web-archives | 1,067 |
5 | replayweb.page | 620 |
6 | wayback-machine-scraper | 406 |
7 | waybackpy | 405 |
8 | cancel-culture | 395 |
9 | oldweb-today | 234 |
10 | vandal | 150 |
11 | gogetcrawl | 126 |
12 | Discord-Recon | 69 |
13 | wayback_archiver | 55 |
14 | hn_bot | 21 |
15 | on-heroku | 5 |