Top 15 wayback-machine Open-Source Projects
-
ArchiveBox
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
web-archives
Browser extension for viewing archived and cached versions of web pages, available for Chrome, Edge and Safari
-
wayback-machine-scraper
A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
-
Discord-Recon
Discord bot created to automate bug bounty recon, automated scans and information gathering via a discord server
-
hn_bot
A Bot that searches for and posts links to archived versions of articles after scanning all of HackerNews' top articles for those that contain a link to a site that requires a subscription.
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
Two projects I greatly appreciate, allowing me to easily archive my Bandcamp and GOG purchases (after the initial setup, anyway):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
I wish there were an alternative to the Internet Archive with collaborative curation: you share files, and the people who tag and sort them into albums can download them. If it were federated, it could be just as extensive as the Internet Archive by searching files on many instances at the same time. Sadly, the closest things are ArchiveBox and wayback, which won't replace the Internet Archive.
Project mention: An Ugly Single-Page Website Makes $5k a Month with Affiliate Marketing | news.ycombinator.com | 2024-01-20
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems browsing interconnected links in a forum format, but try this as a first step.
One potential option, though definitely a bit more work: once you have all the WARC files downloaded, you can open them in Python using the warctools module (and maybe BeautifulSoup) and parse/extract the data embedded in the WARC archives into your own "fresh" HTML webserver.
https://github.com/internetarchive/warctools
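To illustrate the idea, here is a minimal standard-library sketch of pulling target URLs and payloads out of an (uncompressed) WARC file. This is a simplified toy, not warctools itself: real WARC parsing should use warctools or warcio, response payloads also contain HTTP headers that this sketch leaves in place, and `parse_warc_records` is a hypothetical helper name.

```python
def parse_warc_records(text: str):
    """Return (target_uri, payload) pairs for each 'response' record.

    Simplified: assumes uncompressed WARC 1.0 text and splits naively on
    the version line. The payload of a real response record still begins
    with the captured HTTP headers.
    """
    records = []
    for chunk in text.split("WARC/1.0\r\n"):
        if not chunk.strip():
            continue
        # WARC header block and payload are separated by a blank line.
        head, _, payload = chunk.partition("\r\n\r\n")
        headers = {}
        for line in head.split("\r\n"):
            if ": " in line:
                key, value = line.split(": ", 1)
                headers[key] = value
        if headers.get("WARC-Type") == "response":
            records.append((headers.get("WARC-Target-URI"), payload.strip()))
    return records
```

From there, each payload could be handed to BeautifulSoup to extract the page content for re-serving.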
Project mention: wayback-machine-scraper: NEW Data - star count:380.0 | /r/algoprojects | 2023-12-10
I ended up using the waybackpy Python module to retrieve archived URLs, and it worked well. I think the feature you want for this is "snapshots", but I didn't test that myself.
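Under the hood, tools like waybackpy query the Wayback Machine's public CDX server. A small standard-library sketch of that idea (the endpoint is the real documented CDX API; `cdx_query_url` and `parse_snapshots` are hypothetical helper names):

```python
import json
from urllib.parse import urlencode

# Public Wayback CDX server endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url: str, limit: int = 10) -> str:
    """Build a CDX query URL that returns snapshot rows as JSON."""
    params = {"url": page_url, "output": "json", "limit": limit}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_snapshots(cdx_json: str):
    """Turn a CDX JSON response (header row + data rows) into replay URLs."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in data]
```

Fetching `cdx_query_url("example.com")` with any HTTP client and feeding the body to `parse_snapshots` yields the archived capture URLs.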
Project mention: Ask HN: What products other than Obsidian share the file over app philosophy? | news.ycombinator.com | 2024-04-03
There are flat-file CMSes (content management systems) like Grav: https://getgrav.org/
I guess, in some vague/broad sense, config-as-code systems also implement something similar? Maybe even OpenAPI schemas could count to some degree...?
In the old days, the "semantic web" movement was an attempt to make more webpages both human- and machine-readable indefinitely by tagging them with proper schema: https://en.wikipedia.org/wiki/Resource_Description_Framework. Even Google was on board for a while, but it never saw much uptake. As far as I can tell it's basically dead now, partly because of non-semantic HTML (everything is a React div), partly general laziness, and partly because LLMs can parse things loosely anyway.
-------------
Side thoughts...
Philosophically, I don't know that capturing raw data alone as files is really sufficient to capture the nuances of any particular experience, or the overall zeitgeist of an era. You can archive Geocities pages, but that doesn't really capture the novelty and indie-ness of that era. Similarly, you can save TikTok videos, but absent the cultural environment that created them (and a faithful recreation of the recommendation algorithm), they wouldn't really show future archaeologists how teenagers today lived.
I worked for a natural history museum for a while, and one of the interesting questions (well, to me anyway) was whether our web content was in and of itself worth preserving as a cultural artifact -- both so that future generations could see what exhibits were interesting/apropos for the cultures of our time, and so they could see how our generation found out about those exhibitions to begin with (who knows what the Web will morph into 50 years from now). It wasn't enough to simply save the HTML of our web pages, both because they tied into various other APIs and databases (like zoological collections) and because some were interactive experiences, like games designed to be played with a mouse (before phones were popular), or phone chatbots featuring some of our specimens. To really capture the experience authentically would've required emulating not just our tech stacks but also our devices, among other things.
Like for the earlier Geocities example, sure you could just save the old HTML and render it with a modern browser, but that's not the same as something like https://oldweb.today/?browser=ns3-mac#http://geocities.com/ , which emulates the whole OS and browser too. And that still isn't the same as having to sit in front of a tiny CRT and wait minutes for everything to download over a 14.4k modem, only to be interrupted when mom had to make a call.
I guess that's a long-winded way of critiquing "file over app": it only makes sense for things that are files/documents to begin with. Much of our lives now are not flat docs but "experiences" that take much more thought and effort to archive. If the goal is truly to preserve them for posterity, it's not enough to just archive the raw data; we have to develop ways to record and later emulate entire experiences, both technological and cultural. It ain't easy!
I've developed yet another solution that can help you extract data from web archives :) You can use it as a standalone tool or import it into your Go project. GitHub: https://github.com/karust/gogetcrawl
wayback-machine related posts
-
Ask HN: It's 1997, how do you build a website?
-
Ask HN: How can I back up an old vBulletin forum without admin access?
-
wayback-machine-scraper: NEW Data - star count:380.0
-
Best practices for archiving websites
-
Is there a website that shows chronological versions of websites/apps/platforms?
-
download all captures of a page in archive.org
-
phpBB3 forum owner dead. Webhost purging soon. Need to quickly archive a site
Index
What are some of the best open-source wayback-machine projects? This list will help you:
# | Project | Stars
---|---|---
1 | ArchiveBox | 19,790 |
2 | gau | 3,557 |
3 | wayback | 1,643 |
4 | web-archives | 1,067 |
5 | replayweb.page | 620 |
6 | wayback-machine-scraper | 406 |
7 | waybackpy | 405 |
8 | cancel-culture | 395 |
9 | oldweb-today | 234 |
10 | vandal | 150 |
11 | gogetcrawl | 126 |
12 | Discord-Recon | 69 |
13 | wayback_archiver | 55 |
14 | hn_bot | 21 |
15 | on-heroku | 5 |