sandcrawler
warrick
 | sandcrawler | warrick |
---|---|---|
Mentions | 2 | 2 |
Stars | 23 | 80 |
Growth | - | - |
Activity | 0.0 | 0.0 |
Latest commit | over 1 year ago | about 3 years ago |
Language | HTML | HTML |
License | - | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
sandcrawler
- Internet Archive Infrastructure
Check out this experiment they did: https://github.com/internetarchive/sandcrawler/blob/master/p... . This was just one part of the infrastructure.
The point is that Ceph and friends have a lot of overhead. For example, in Ceph, a file in the S3 layer is by default split into 4 MB chunks, and each of those chunks is replicated or erasure-coded. Using the same erasure coding as Wasabi and B2 Cloud, which is 16+4=20 (or 17+3=20), each of those 4 MB chunks is split into 20 shards of ~200 KB each. Each of those shards ends up with ~512 B to 4 KB of metadata.
So that is 10 KB to 80 KB of metadata for a single 4 MB chunk.
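The arithmetic in the comment above can be checked with a short sketch. The function name and its parameters are hypothetical, chosen only for illustration; it is not Ceph code and assumes nothing beyond the numbers quoted in the comment (4 MB chunks, 16+4 erasure coding, 512 B to 4 KB of metadata per shard). Note that dividing a 4 MiB chunk across the 16 data shards gives 256 KiB per shard, so the "~200 KB" in the comment reads as a rough figure.

```python
KIB = 1024
MIB = 1024 * KIB


def shard_metadata_overhead(chunk_bytes, data_shards, parity_shards,
                            metadata_bytes_per_shard):
    """Estimate the metadata attached to one erasure-coded chunk.

    Hypothetical helper: it only multiplies out the figures quoted in
    the comment and does not model any real storage system.
    """
    total_shards = data_shards + parity_shards
    metadata_bytes = total_shards * metadata_bytes_per_shard
    return {
        "total_shards": total_shards,
        # Payload is spread over the data shards only; parity shards
        # have the same size but carry redundancy, not payload.
        "shard_bytes": chunk_bytes / data_shards,
        "metadata_bytes": metadata_bytes,
        "metadata_fraction": metadata_bytes / chunk_bytes,
    }


if __name__ == "__main__":
    # Low and high per-shard metadata estimates from the comment.
    for per_shard in (512, 4 * KIB):
        est = shard_metadata_overhead(4 * MIB, 16, 4, per_shard)
        print(
            f"{per_shard} B per shard: "
            f"{est['metadata_bytes'] / KIB:.0f} KiB of metadata, "
            f"{est['metadata_fraction']:.2%} of the chunk"
        )
```

With 20 shards, the low estimate works out to 20 × 512 B = 10 KiB and the high estimate to 20 × 4 KiB = 80 KiB, matching the range given in the comment.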
- Fast and easily scalable self-hosted storage solution
warrick
- Any good tools/API to facilitate downloads from the Wayback Machine at archive.org?
I've used warrick, but it hasn't been updated in 10 years.
- Wayback Machine Downloader - Download an Entire Website from the Wayback Machine
What are some alternatives?
pywb - Core Python Web Archiving Toolkit for replay and recording of web archives
wayback-machine-downloader - Download an entire website from the Wayback Machine.
Seaweed File System - SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding. [Moved to: https://github.com/seaweedfs/seaweedfs]
savepagenow - A simple Python wrapper and command-line interface for archive.org's "Save Page Now" capturing service
ArchiveBox - Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
wayback-machine-spn-scripts - Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
waybackpack - Download the entire Wayback Machine archive for a given URL.
wayback - IA's public Wayback Machine (moved from SourceForge)
ipwb - InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS