| | cdx_toolkit | ArchiveBox |
|---|---|---|
| Mentions | 1 | 2 |
| Stars | 153 | 8,085 |
| Growth | 3.9% | - |
| Activity | 0.0 | 9.7 |
| Last update | 3 months ago | over 3 years ago |
| Language | Python | Python |
| License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
cdx_toolkit
- How to extract particular domain webpage from CommonCrawl dataset efficiently?
The easiest way is to use cdx_toolkit, which lets you query the CommonCrawl index and download the matching WARC records from the command line.
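
To make the index-then-download idea concrete, here is a minimal standard-library sketch. It is not cdx_toolkit's own API. It assumes the public Common Crawl index server at index.commoncrawl.org and the data host data.commoncrawl.org, uses CC-MAIN-2023-50 purely as an example crawl ID, and all helper names are hypothetical. Nothing is fetched here; the functions only build the index query and the byte-range request for a single capture.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoints; adjust if Common Crawl changes its hosts.
INDEX_SERVER = "https://index.commoncrawl.org"
DATA_SERVER = "https://data.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Return the index URL listing captures that match url_pattern."""
    params = {"url": url_pattern, "output": "json"}
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def parse_index_lines(text):
    """Parse an index response, which holds one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def warc_range_request(capture):
    """Return (url, headers) that fetch only this capture's WARC record."""
    offset = int(capture["offset"])
    length = int(capture["length"])
    # HTTP byte ranges are inclusive, hence the -1 on the end position.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return f"{DATA_SERVER}/{capture['filename']}", headers


if __name__ == "__main__":
    # Hypothetical index line, shaped like a real response entry.
    sample = (
        '{"url": "https://example.com/", '
        '"filename": "crawl-data/example.warc.gz", '
        '"offset": "100", "length": "50"}'
    )
    capture = parse_index_lines(sample)[0]
    print(build_index_query("CC-MAIN-2023-50", "example.com/*"))
    print(warc_range_request(capture))
```

Requesting only the byte range from the index entry avoids downloading the whole WARC file, which is what makes per-domain extraction practical. The bytes returned by such a request should form a single gzip member containing one WARC record.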
ArchiveBox
- An Emacs wallabag client - the Emacser way to manage web pages!
- Make Your Own Internet Archive with Archive Box
It doesn't show in the screenshot in the article, but in Aug 2020 ArchiveBox implemented the "readability article text extractor". See the description in the release notes: https://github.com/pirate/ArchiveBox/releases/tag/v0.4.14 and the module that does the work: https://github.com/pirate/readability-extractor
By extracting only the text and article images, you could go deep into an archive. If you skip images, much more so.
What are some alternatives?
ipwb - InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Wallabag - wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
conifer - Collect and revisit web pages.
youtube-dl-webui - Another webui for youtube-dl powered by Flask.
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
archivy - Archivy is a self-hostable knowledge repository that allows you to learn and retain information in your own personal and extensible wiki.
comcrawl - A python utility for downloading Common Crawl data
pinboard-notes-backup - Back up the notes you’ve saved to Pinboard
promnesia - Another piece of your extended mind
grasp - A reliable org-capture browser extension for Chrome/Firefox
wallabag.el - Emacs wallabag client - A Read It Later/Web Archiving Solution in Emacs.
22120