wayback-machine-scraper
waybackpy
Our great sponsors
wayback-machine-scraper | waybackpy | |
---|---|---|
6 | 6 | |
405 | 405 | |
- | - | |
0.0 | 0.0 | |
2 months ago | over 1 year ago | |
Python | Python | |
ISC License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
wayback-machine-scraper
- wayback-machine-scraper: NEW Data - star count:380.0
- Anyone have a simple useful guide so I can get this scraper working .
- wayback-machine-scraper: NEW Data - star count:295.0
- Anyone have a simple useful guide so I can get this scraper working?
-
Retrieving images from archived pages?
You can very easily scrape pages from the web archive with this small package: wayback-machine-scraper. Getting historic snapshots of a webpage becomes a matter of a one-liner like:
-
How can I get my old blog back (Wordpress)?
There are tools that allow you to scrape websites archived by the Wayback Machine. Like this one for example: https://github.com/sangaline/wayback-machine-scraper
waybackpy
-
download all captures of a page in archive.org
I ended up using waybackpy python module to retrieve archived URLs, it worked well. I think the feature you want for this is the "snapshots", but I didn't test this myself
-
Well worth the price
I have read it as someone told me that it has a section for using waybackpy, a tool/library that I wrote and maintain.
- any way to archive all my bookmarks on archive.org?
-
Comex update 9/27/2022
import requests from datetime import datetime from pathlib import Path from waybackpy import WaybackMachineSaveAPI #https://github.com/akamhy/waybackpy ARCHIVE = False LOCAL_SAVE = True #https://www.cmegroup.com/clearing/operations-and-deliveries/nymex-delivery-notices.html urls = [ #COMEX & NYMEX Metal Delivery Notices "https://www.cmegroup.com/delivery_reports/MetalsIssuesAndStopsReport.pdf", "https://www.cmegroup.com/delivery_reports/MetalsIssuesAndStopsMTDReport.pdf", "https://www.cmegroup.com/delivery_reports/MetalsIssuesAndStopsYTDReport.pdf", #NYMEX Energy Delivery Notice "https://www.cmegroup.com/delivery_reports/EnergiesIssuesAndStopsReport.pdf", "https://www.cmegroup.com/delivery_reports/EnergiesIssuesAndStopsYTDReport.pdf", #Warehouse & Depository Stocks "https://www.cmegroup.com/delivery_reports/Gold_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Gold_Kilo_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Silver_stocks.xls", "https://www.cmegroup.com/delivery_reports/Copper_Stocks.xls", "https://www.cmegroup.com/delivery_reports/PA-PL_Stck_Rprt.xls", "https://www.cmegroup.com/delivery_reports/Aluminum_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Zinc_Stocks.xls", "https://www.cmegroup.com/delivery_reports/Lead_Stocks.xls" ] user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36' #required for both wayback and cmegroup.com headers = {'User-Agent': user_agent} #present yourself as an updated Chrome browser if ARCHIVE: for url in urls: filename = url.split("/")[-1] print(f"Archiving {filename} on Wayback Machine...") save_api = WaybackMachineSaveAPI(url, user_agent) #limited to 15 requests / minute / IP. My VPN IP was already throttled :( Couldn't even get this to work with normal IP. Returned 429 error.... res = save_api.save() print(f"Res: {res}") if LOCAL_SAVE: datestr = datetime.now().strftime('%m-%d-%Y') datedir = Path(datestr) datedir.mkdir(exist_ok=True) for url in urls: filename = url.split("/")[-1] print(f"Fetching {filename}...") try: resp = requests.get(url, timeout=3, allow_redirects=True, headers=headers) if resp.ok: filepath = datedir / filename if not filepath.exists(): with open(filepath, mode="wb") as f: f.write(resp.content) else: print(f"ERROR: Filepath already exists: {filepath}") else: print(f"ERROR: response for {filename}: {resp}") except requests.ReadTimeout: print("timeout")
- Is there a way to download all the files Internet Archive has captured for a domain? I am trying to recover tweets from a suspended twitter account, but the account as a whole was never captured in the Wayback Machine, just some individual tweets and json files.
-
简单run个脚本使用 wayback machine 接口批量备份知乎问题冲塔回答
一个封装 wayback machine 接口的 package, github地址:https://github.com/akamhy/waybackpy
What are some alternatives?
cancel-culture - Tools for fighting abuse on Twitter
TikUp - An auto downloader and uploader for TikTok videos.
autoscraper - A Smart, Automatic, Fast and Lightweight Web Scraper for Python
wayback - A bot for Telegram, Mastodon, Slack, and other messaging platforms archives webpages.
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
WordPress - WordPress, Git-ified. This repository is just a mirror of the WordPress subversion repository. Please do not send pull requests. Submit pull requests to https://github.com/WordPress/wordpress-develop and patches to https://core.trac.wordpress.org/ instead.
wayback_archiver - Ruby gem to send URLs to Wayback Machine
ArchiveBox - 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more... [Moved to: https://github.com/ArchiveBox/ArchiveBox]
pywb - Core Python Web Archiving Toolkit for replay and recording of web archives