wayback-machine-scraper VS waybackpy

Compare wayback-machine-scraper vs waybackpy and see what are their differences.

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
wayback-machine-scraper waybackpy
6 6
405 405
- -
0.0 0.0
2 months ago over 1 year ago
Python Python
ISC License MIT License
The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

wayback-machine-scraper

Posts with mentions or reviews of wayback-machine-scraper. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-05-20.

waybackpy

Posts with mentions or reviews of waybackpy. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-05.
  • download all captures of a page in archive.org
    2 projects | /r/Archiveteam | 5 Jun 2023
    I ended up using waybackpy python module to retrieve archived URLs, it worked well. I think the feature you want for this is the "snapshots", but I didn't test this myself
  • Well worth the price
    1 project | /r/OSINT | 9 Apr 2023
    I have read it as someone told me that it has a section for using waybackpy, a tool/library that I wrote and maintain.
  • any way to archive all my bookmarks on archive.org?
    1 project | /r/DataHoarder | 6 Oct 2022
  • Comex update 9/27/2022
    1 project | /r/wallstreetplatinum | 27 Sep 2022
    import requests from datetime import datetime from pathlib import Path from waybackpy import WaybackMachineSaveAPI #https://github.com/akamhy/waybackpy ARCHIVE = False LOCAL_SAVE = True #https://www.cmegroup.com/clearing/operations-and-deliveries/nymex-delivery-notices.html urls = [ #COMEX & NYMEX Metal Delivery Notices "https://www.cmegroup.com/delivery_reports/MetalsIssuesAndStopsReport.pdf", "https://www.cmegroup.com/delivery_reports/MetalsIssuesAndStopsMTDReport.pdf", "https://www.cmegroup.com/delivery_reports/MetalsIssuesAndStopsYTDReport.pdf", #NYMEX Energy Delivery Notice "https://www.cmegroup.com/delivery_reports/EnergiesIssuesAndStopsReport.pdf", "https://www.cmegroup.com/delivery_reports/EnergiesIssuesAndStopsYTDReport.pdf", #Warehouse & Depository Stocks "https://www.cmegroup.com/delivery_reports/Gold_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Gold_Kilo_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Silver_stocks.xls", "https://www.cmegroup.com/delivery_reports/Copper_Stocks.xls", "https://www.cmegroup.com/delivery_reports/PA-PL_Stck_Rprt.xls", "https://www.cmegroup.com/delivery_reports/Aluminum_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Zinc_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Lead_Stocks.xls" ] user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36' #required for both wayback and cmegroup.com headers = {'User-Agent': user_agent} #present yourself as an updated Chrome browser if ARCHIVE: for url in urls: filename = url.split("/")[-1] print(f"Archiving {filename} on Wayback Machine...") save_api = WaybackMachineSaveAPI(url, user_agent) #limited to 15 requests / minute / IP. My VPN IP was already throttled :( Couldn't even get this to work with normal IP. Returned 429 error.... res = save_api.save() print(f"Res: {res}") if LOCAL_SAVE: datestr = datetime.now().strftime('%m-%d-%Y') datedir = Path(datestr) datedir.mkdir(exist_ok=True) for url in urls: filename = url.split("/")[-1] print(f"Fetching {filename}...") try: resp = requests.get(url, timeout=3, allow_redirects=True, headers=headers) if resp.ok: filepath = datedir / filename if not filepath.exists(): with open(filepath, mode="wb") as f: f.write(resp.content) else: print(f"ERROR: Filepath already exists: {filepath}") else: print(f"ERROR: response for {filename}: {resp}") except requests.ReadTimeout: print("timeout")
  • Is there a way to download all the files Internet Archive has captured for a domain? I am trying to recover tweets from a suspended twitter account, but the account as a whole was never captured in the Wayback Machine, just some individual tweets and json files.
    1 project | /r/DataHoarder | 27 Jun 2021
  • 简单run个脚本使用 wayback machine 接口批量备份知乎问题冲塔回答
    1 project | /r/CLTV | 29 Apr 2021
    一个封装 wayback machine 接口的 package, github地址:https://github.com/akamhy/waybackpy

What are some alternatives?

When comparing wayback-machine-scraper and waybackpy you can also consider the following projects:

cancel-culture - Tools for fighting abuse on Twitter

TikUp - An auto downloader and uploader for TikTok videos.

autoscraper - A Smart, Automatic, Fast and Lightweight Web Scraper for Python

wayback - A bot for Telegram, Mastodon, Slack, and other messaging platforms archives webpages.

ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

WordPress - WordPress, Git-ified. This repository is just a mirror of the WordPress subversion repository. Please do not send pull requests. Submit pull requests to https://github.com/WordPress/wordpress-develop and patches to https://core.trac.wordpress.org/ instead.

wayback_archiver - Ruby gem to send URLs to Wayback Machine

ArchiveBox - 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more... [Moved to: https://github.com/ArchiveBox/ArchiveBox]

pywb - Core Python Web Archiving Toolkit for replay and recording of web archives