Sandcrawler Alternatives
Similar projects and alternatives to sandcrawler based on common topics and language
-
Seaweed File System
Discontinued. SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding. [Moved to: https://github.com/seaweedfs/seaweedfs] (by chrislusf)
-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
ArchiveBox
Discontinued. 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more... [Moved to: https://github.com/ArchiveBox/ArchiveBox] (by pirate)
sandcrawler reviews and mentions
-
Internet Archive Infrastructure
See this experiment they ran: https://github.com/internetarchive/sandcrawler/blob/master/p... . It covered just one part of the infrastructure.
The point is that Ceph and similar systems carry a lot of overhead. For example, Ceph by default splits a file in the S3 layer into 4MB chunks, and each chunk is either replicated or erasure-coded. With the same erasure coding as Wasabi and B2 Cloud, which is 16+4=20 (or 17+3=20), each 4MB chunk is split into 20 shards of roughly 250KB each (4MB divided across the 16 or 17 data shards). Each shard ends up with ~512B to 4KB of metadata.
So that is 10KB to 80KB of metadata for a single 4MB chunk (20 shards × 512B to 4KB each).
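The arithmetic in the comment above can be checked with a short sketch. This is only an illustration under the comment's own figures (4MB chunks, a 16+4 layout, 512B to 4KB of metadata per shard); the function name `metadata_overhead` and its parameters are hypothetical and do not come from Ceph or any other storage system.

```python
# Rough estimate of per-chunk metadata overhead under erasure coding.
# All figures are taken from the comment above; nothing here queries a real cluster.

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunk, as described for the S3 layer


def metadata_overhead(chunk_size, data_shards, parity_shards, meta_per_shard):
    """Return (shard_size_bytes, metadata_bytes, metadata_fraction) for one chunk.

    Hypothetical helper: assumes every shard (data and parity) carries
    the same fixed amount of metadata.
    """
    total_shards = data_shards + parity_shards
    # Each data shard holds an equal slice of the chunk; parity shards match that size.
    shard_size = chunk_size / data_shards
    metadata = total_shards * meta_per_shard
    return shard_size, metadata, metadata / chunk_size


if __name__ == "__main__":
    for meta in (512, 4096):  # low and high per-shard metadata estimates
        shard, total, frac = metadata_overhead(CHUNK_SIZE, 16, 4, meta)
        print(f"{meta} B/shard: {total // 1024} KiB metadata ({frac * 100:.2f}% of chunk)")
```

Under these assumptions the low estimate works out to 10 KiB (about 0.24% of the chunk) and the high estimate to 80 KiB (about 1.95%), matching the range quoted in the comment.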
- Fast and easily scalable self-hosted storage solution
Stats
The primary programming language of sandcrawler is HTML.