sandcrawler
conifer
Metric | sandcrawler | conifer
---|---|---
Mentions | 2 | 5
Stars | 23 | 1,456
Growth | - | -0.3%
Activity | 0.0 | 0.0
Last commit | over 1 year ago | 6 months ago
Language | HTML | Python
License | - | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
sandcrawler
- Internet Archive Infrastructure
Check out this experiment they did: https://github.com/internetarchive/sandcrawler/blob/master/p... . It was just one part of the infrastructure.
The point is that Ceph and friends have a lot of overhead. For example, in Ceph a file in the S3 layer is by default split into 4MB chunks, and each of those chunks is replicated or erasure-coded. Using the same erasure coding as Wasabi and B2 Cloud, which is 16+4=20 (or 17+3=20), each of those 4MB chunks is split into 20 shards of ~200KB each. Each of those shards ends up with ~512B to 4KB of metadata.
So that is 10KB to 80KB of metadata for a single 4MB chunk.
- Fast and easily scalable self-hosted storage solution
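
The metadata figures in the Ceph comment above can be checked with a few lines of arithmetic. The following is only a rough sketch: the constant and function names are illustrative and not taken from Ceph or sandcrawler, and the inputs are simply the numbers quoted in the comment (20 shards per 4MB chunk, 512B to 4KB of metadata per shard).

```python
# Back-of-the-envelope check of the per-chunk metadata figures quoted above.
# All names are hypothetical; the numbers come from the comment itself.

CHUNK_BYTES = 4 * 1024 * 1024   # one 4MB chunk in the S3 layer
SHARDS_PER_CHUNK = 20           # 16+4 or 17+3 erasure coding
META_BYTES_LOW = 512            # lower bound of metadata per shard
META_BYTES_HIGH = 4 * 1024      # upper bound of metadata per shard


def metadata_per_chunk(meta_bytes_per_shard: int,
                       shards: int = SHARDS_PER_CHUNK) -> int:
    """Total metadata bytes for one chunk, given metadata bytes per shard."""
    return meta_bytes_per_shard * shards


low = metadata_per_chunk(META_BYTES_LOW)
high = metadata_per_chunk(META_BYTES_HIGH)

print(f"metadata per 4MB chunk: {low // 1024}KB to {high // 1024}KB")
print(f"relative to chunk size: {low / CHUNK_BYTES:.2%} to {high / CHUNK_BYTES:.2%}")
```

Under these assumptions the totals match the 10KB to 80KB range stated in the comment.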
conifer
- YaCy, a distributed Web Search Engine, based on a peer-to-peer network
- I have no idea how GitHub works. I need to download the Conifer archiving tool. Can someone explain how to do it? https://github.com/Rhizome-Conifer/conifer
- [HELP] I'm looking for a self-hosted solution where users can connect to the website, open a page, browse it and save it.
I saw Rhizome-Conifer (https://github.com/Rhizome-Conifer/conifer). It looks great, but there are a lot of open issues on GitHub, some from 2016, and I'm afraid it will not be kept up (which would be very sad).
- Looking for a tool that generates reader mode of articles
- Ask HN: What browser extensions are a must-have for HNers in 2021?
I would personally recommend
https://github.com/rhizome-conifer/conifer
The intent is a webrecorder for the internet.
It records all js libraries, loads videos, and all else.
Once stored, you can review a snapshot at that point in time.
They have a service option, webrecorder.io, but this one lets you store everything locally.
What are some alternatives?
pywb - Core Python Web Archiving Toolkit for replay and recording of web archives
Seaweed File System - SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding. [Moved to: https://github.com/seaweedfs/seaweedfs]
Wallabag - wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Reddit-Enhancement-Suite - Reddit Enhancement Suite
SingleFile - Web Extension for saving a faithful copy of a complete web page in a single HTML file
temporal-shift-module - [ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
nmigen - A refreshed Python toolbox for building complex digital hardware. See https://gitlab.com/nmigen/nmigen
qkeras - QKeras: a quantization deep learning library for Tensorflow Keras
syncthing-android - Wrapper of syncthing for Android.