wayback-machine

Top 15 wayback-machine Open-Source Projects

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

  • Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07

    Two projects I greatly appreciate, allowing me to easily archive my bandcamp and GOG purchases (after the initial setup anyways):

    https://github.com/easlice/bandcamp-downloader

    https://github.com/Kalanyr/gogrepoc

    And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox

  • gau

    Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl.

  • wayback

    A bot for Telegram, Mastodon, Slack, and other messaging platforms that archives webpages.

  • Project mention: If we lose the Internet Archive, we’re screwed | /r/opensource | 2023-05-14

    I wish there was an alternative to the Internet Archive with collaborative curation. You share files and people who tag and sort them into albums can download them. And if it was federated it could be just as extensive as the Internet Archive by searching files on many instances at the same time. Sadly the closest things are ArchiveBox and wayback, which won't replace the Internet Archive.

  • web-archives

    Browser extension for viewing archived and cached versions of web pages, available for Chrome, Edge and Safari

  • Project mention: An Ugly Single-Page Website Makes $5k a Month with Affiliate Marketing | news.ycombinator.com | 2024-01-20
  • replayweb.page

    Serverless replay of web archives directly in the browser

  • Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29

    You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems though with wanting to browse interconnected links in a forum format, but try this as a first step.

    One potential option, though definitely a bit more work: once you have all the WARC files downloaded, you can open them all in Python using the warctools module and maybe BeautifulSoup, and potentially parse/extract all of the data embedded in the WARC archives into your own "fresh" HTML webserver.

    https://github.com/internetarchive/warctools

  • wayback-machine-scraper

    A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

  • Project mention: wayback-machine-scraper: NEW Data - star count:380.0 | /r/algoprojects | 2023-12-10
  • waybackpy

    Wayback Machine API interface & a command-line tool

  • Project mention: download all captures of a page in archive.org | /r/Archiveteam | 2023-06-05

    I ended up using the waybackpy Python module to retrieve archived URLs; it worked well. I think the feature you want for this is "snapshots", but I didn't test this myself.

  • cancel-culture

    Tools for fighting abuse on Twitter

  • oldweb-today

    Browse emulated browsers connected to old web sites in your browser!

  • Project mention: Ask HN: What products other than Obsidian share the file over app philosophy? | news.ycombinator.com | 2024-04-03

    There are flat-file CMSes (content management systems) like Grav: https://getgrav.org/

    I guess, in some vague/broad sense, config-as-code systems also implement something similar? Maybe even OpenAPI schemas could count to some degree...?

    In the old days, the "semantic web" movement was an attempt to make more webpages both human- and machine-readable indefinitely by tagging them with proper schema: https://en.wikipedia.org/wiki/Resource_Description_Framework. Even Google was on board for a while, but I guess it never saw much uptake. As far as I can tell it's basically dead now, because of non-semantic HTML (everything as a React div), general laziness, and LLMs being able to parse things loosely.

    -------------

    Side thoughts...

    Philosophically, I don't know that capturing raw data alone as files is really sufficient to capture the nuances of any particular experience, or the overall zeitgeist of an era. You can archive Geocities pages, but that doesn't really capture the novelty and indie-ness of that era. Similarly, you can save TikTok videos, but absent the cultural environment that created them (and a faithful recreation of the recommendation algorithm), they wouldn't really show future archaeologists how teenagers today lived.

    I worked for a natural history museum for a while, and while we were there, one of the interesting questions (well, to me anyway) was whether our web content was in and of itself worth preserving as a cultural artifact -- both so that future generations can see what exhibits were interesting/apropos for the cultures of our times, but also so they could see how our generation found out about those exhibitions to begin with (who knows what the Web will morph into 50 years later). It wasn't enough to simply save the HTML of our web pages, both because they tie into various other APIs and databases (like zoological collections) and because some were interactive experiences, like games designed to be played with a mouse (before phones were popular), or phone chatbots with some of our specimens. To really capture the experience authentically would've required emulating our tech stacks and devices, among other things.

    Like for the earlier Geocities example, sure you could just save the old HTML and render it with a modern browser, but that's not the same as something like https://oldweb.today/?browser=ns3-mac#http://geocities.com/ , which emulates the whole OS and browser too. And that still isn't the same as having to sit in front of a tiny CRT and wait minutes for everything to download over a 14.4k modem, only to be interrupted when mom had to make a call.

    I guess that's a longwinded way of critiquing "file over app": It only makes sense for things that are originally files/documents to begin with. Much of our lives now are not flat docs but "experiences" that take much more thought and effort to archive. If the goal is truly to preserve that for posterity, it's not enough to just archive their raw data, but to develop ways to record and later emulate entire experiences, both technological and cultural. It ain't easy!

  • vandal

    Navigator for Web Archive

  • gogetcrawl

    Extract web archive data using Wayback Machine and Common Crawl

  • Project mention: A tool/package for Web Archive data extraction | /r/golang | 2023-05-31

    I've developed yet another solution that can help you extract data from web archives :) You can use it as a separate tool, or import it into your Go project. Github: https://github.com/karust/gogetcrawl

  • Discord-Recon

    Discord bot created to automate bug bounty recon, automated scans and information gathering via a Discord server

  • wayback_archiver

    Ruby gem to send URLs to Wayback Machine

  • hn_bot

    A Bot that searches for and posts links to archived versions of articles after scanning all of HackerNews' top articles for those that contain a link to a site that requires a subscription.

  • on-heroku

    Deploy and maintain the wayback service as a Heroku app easily and quickly.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
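
One of the mentions above asks how to download all captures of a page from archive.org. As a rough illustration of the idea, and not the code of any project listed here, the sketch below builds a query for the Wayback Machine's public CDX endpoint and turns its JSON output (a header row followed by one row per capture) into snapshot URLs. The function names `build_cdx_query` and `snapshot_urls` are hypothetical helpers invented for this example; only the Python standard library is used.

```python
import json
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit=None):
    """Return a CDX query URL asking for captures of `url` as JSON.

    Hypothetical helper for illustration; `limit` caps the number of rows.
    """
    params = {"url": url, "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_json_text):
    """Convert a CDX JSON response body into a list of snapshot URLs.

    The JSON output is a list of rows; the first row names the fields
    (including "timestamp" and "original"), the rest are captures.
    """
    rows = json.loads(cdx_json_text)
    if not rows:
        return []  # no captures were returned
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]
```

To try it against the live service, the query URL can be fetched with `urllib.request.urlopen(build_cdx_query("example.com", limit=5))` and the decoded response body passed to `snapshot_urls`. Libraries such as waybackpy wrap this kind of request for you, so the sketch is mainly useful for seeing what happens underneath.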

wayback-machine related posts

  • Ask HN: It's 1997, how do you build a website?

    1 project | news.ycombinator.com | 6 Feb 2024
  • Ask HN: How can I back up an old vBulletin forum without admin access?

    9 projects | news.ycombinator.com | 29 Jan 2024
  • wayback-machine-scraper: NEW Data - star count:380.0

    1 project | /r/algoprojects | 10 Dec 2023
  • Best practices for archiving websites

    2 projects | /r/datacurator | 6 Dec 2023
  • Is there a website that shows chronological versions of websites/apps/aplatforms?

    1 project | /r/UI_Design | 14 Jul 2023
  • download all captures of a page in archive.org

    2 projects | /r/Archiveteam | 5 Jun 2023
  • phpBB3 forum owner dead. Webhost purging soon. Need to quickly archive a site

    1 project | /r/DataHoarder | 23 May 2023

Index

What are some of the best open-source wayback-machine projects? This list will help you:

#   Project                  Stars
1   ArchiveBox               19,790
2   gau                      3,557
3   wayback                  1,643
4   web-archives             1,067
5   replayweb.page           620
6   wayback-machine-scraper  406
7   waybackpy                405
8   cancel-culture           395
9   oldweb-today             234
10  vandal                   150
11  gogetcrawl               126
12  Discord-Recon            69
13  wayback_archiver         55
14  hn_bot                   21
15  on-heroku                5
