Sandcrawler Alternatives

Similar projects and alternatives to sandcrawler based on common topics and language

Seaweed File System

Mentions: 49; Stars: 14,960; Activity: 9.9; Language: Go

Discontinued: SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding. [Moved to: https://github.com/seaweedfs/seaweedfs] (by chrislusf)
pywb

Mentions: 7; Stars: 1,300; Activity: 4.8; Language: JavaScript

Core Python Web Archiving Toolkit for replay and recording of web archives
ArchiveBox

Mentions: 248; Stars: 19,737; Activity: 9.7; Language: Python

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
warrick

Mentions: 2; Stars: 80; Activity: 0.0; Language: HTML

Recover lost websites from the Web Infrastructure
ArchiveBox

Mentions: 2; Stars: 8,085; Activity: 9.7; Language: Python

Discontinued: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more... [Moved to: https://github.com/ArchiveBox/ArchiveBox] (by pirate)
conifer

Mentions: 5; Stars: 1,459; Activity: 0.0; Language: Python

Collect and revisit web pages.

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a better sandcrawler alternative or higher similarity.

sandcrawler reviews and mentions

Posts with mentions or reviews of sandcrawler. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2020-12-24.

Internet Archive Infrastructure
1 project | news.ycombinator.com | 1 Mar 2021

Check this experiment they did: https://github.com/internetarchive/sandcrawler/blob/master/p... This was just a part of the infrastructure.
The point is that Ceph and friends have a lot of overhead. Example in Ceph: by default, a file in the S3 layer is split into 4MB chunks, and each of those chunks is replicated or erasure-coded. Using the same erasure coding as Wasabi and B2 Cloud, which is 16+4=20 (or 17+3=20), each of those 4MB chunks is split into 20 shards of ~200KB each. Each of those shards ends up having ~512B to 4KB of metadata.
So that is 10KB to 80KB of metadata for a single 4MB chunk (20 shards times 512B to 4KB each).
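The arithmetic in the comment above can be checked with a short sketch. It only reproduces the commenter's rough numbers (4MB chunks, 20 shards, 512B to 4KB of metadata per shard); the function name `metadata_overhead` and its parameters are hypothetical, chosen for illustration, and nothing here queries Ceph or reflects its actual on-disk layout.

```python
KIB = 1024
MIB = 1024 * KIB


def metadata_overhead(chunk_bytes=4 * MIB, data_shards=16, parity_shards=4,
                      meta_min_bytes=512, meta_max_bytes=4 * KIB):
    """Estimate per-chunk metadata for an erasure-coded chunk.

    Illustrative only: the defaults are the figures quoted in the comment,
    not values read from any storage system.
    """
    total_shards = data_shards + parity_shards
    meta_low = total_shards * meta_min_bytes
    meta_high = total_shards * meta_max_bytes
    return {
        "total_shards": total_shards,
        # The comment's rough figure: chunk size divided by all 20 shards.
        "approx_shard_kib": chunk_bytes / total_shards / KIB,
        "meta_low_kib": meta_low / KIB,
        "meta_high_kib": meta_high / KIB,
        # Metadata as a percentage of the chunk payload.
        "overhead_low_pct": 100 * meta_low / chunk_bytes,
        "overhead_high_pct": 100 * meta_high / chunk_bytes,
    }


if __name__ == "__main__":
    for name, value in metadata_overhead().items():
        print(f"{name}: {value:.2f}")
```

With the default arguments this gives 20 shards, 10 KiB to 80 KiB of metadata per chunk, and an overhead of roughly 0.24% to 1.95% of the 4 MiB payload. Passing `data_shards=17, parity_shards=3` yields the same totals, since only the shard count enters the estimate.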
Fast and easily scalable self-hosted storage solution
2 projects | /r/selfhosted | 24 Dec 2020

Stats

Basic sandcrawler repo stats

Activity: 0.0

Last Commit: over 1 year ago

The primary programming language of sandcrawler is HTML.

