wayback-machine-scraper vs ArchiveBox

wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine. (by sangaline)

sangaline.com

ArchiveBox

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more... [Moved to: https://github.com/ArchiveBox/ArchiveBox] (by pirate)

DISCONTINUED

Project         wayback-machine-scraper   ArchiveBox
Mentions        6                         2
Stars           405                       8,085
Growth          -                         -
Activity        0.0                       9.7
Latest Commit   2 months ago              over 3 years ago
Language        Python                    Python
License         ISC License               MIT License

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

wayback-machine-scraper

Posts with mentions or reviews of wayback-machine-scraper. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-05-20.

wayback-machine-scraper: NEW Data - star count:380.0
1 project | /r/algoprojects | 10 Dec 2023
Anyone have a simple useful guide so I can get this scraper working?
1 project | /r/github | 6 Nov 2022
wayback-machine-scraper: NEW Data - star count:295.0
1 project | /r/algoprojects | 26 Jun 2022
Anyone have a simple useful guide so I can get this scraper working?
1 project | /r/github | 6 Nov 2021
Retrieving images from archived pages?
1 project | /r/webscraping | 26 Jun 2021

You can very easily scrape pages from the web archive with this small package: wayback-machine-scraper. Getting historic snapshots of a webpage becomes a matter of a one-liner like:

How can I get my old blog back (Wordpress)?
2 projects | /r/Wordpress | 20 May 2021

There are tools that allow you to scrape websites archived by the Wayback Machine. Like this one for example: https://github.com/sangaline/wayback-machine-scraper
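
For context on what such tools do under the hood, here is a minimal sketch of looking up the archived copy of a page closest to a given date. It assumes the Internet Archive's public availability endpoint and its documented JSON response shape; it is not wayback-machine-scraper's own interface, and the helper names (build_query, closest_snapshot, fetch_closest_snapshot) are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Internet Archive's public availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the archived URL from a decoded response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


def fetch_closest_snapshot(page_url, timestamp=None):
    """Query the endpoint over the network and return the closest snapshot URL."""
    with urlopen(build_query(page_url, timestamp), timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Calling fetch_closest_snapshot("example.com", "20210520") would ask for the snapshot nearest to 20 May 2021. Recovering a whole site, as in the blog question above, would mean repeating this per page, which is the part dedicated scrapers automate.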

ArchiveBox

Posts with mentions or reviews of ArchiveBox. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-04-12.

An Emacs wallabag client - the Emacser way to manage web pages!
13 projects | /r/emacs | 12 Apr 2021
Make Your Own Internet Archive with Archive Box
9 projects | news.ycombinator.com | 19 Jan 2021

It doesn't show in the screenshot in the article, but in Aug 2020 ArchiveBox implemented the "readability article text extractor"; see the description in the release notes: https://github.com/pirate/ArchiveBox/releases/tag/v0.4.14 and the module that does the work: https://github.com/pirate/readability-extractor
By extracting only text and article images you could go deep into an archive. If you skip images, much more so.

What are some alternatives?

When comparing wayback-machine-scraper and ArchiveBox you can also consider the following projects:

waybackpy - Wayback Machine API interface & a command-line tool

Wallabag - wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.

cancel-culture - Tools for fighting abuse on Twitter

youtube-dl-webui - Another webui for youtube-dl powered by Flask.

autoscraper - A Smart, Automatic, Fast and Lightweight Web Scraper for Python

archivy - Archivy is a self-hostable knowledge repository that allows you to learn and retain information in your own personal and extensible wiki.

ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

pinboard-notes-backup - Back up the notes you’ve saved to Pinboard

WordPress - WordPress, Git-ified. This repository is just a mirror of the WordPress subversion repository. Please do not send pull requests. Submit pull requests to https://github.com/WordPress/wordpress-develop and patches to https://core.trac.wordpress.org/ instead.

promnesia - Another piece of your extended mind

grasp - A reliable org-capture browser extension for Chrome/Firefox

wallabag.el - Emacs wallabag client - A Read It Later/Web Archiving Solution in Emacs.
