cdx_toolkit vs conifer
| | cdx_toolkit | conifer |
|---|---|---|
| Mentions | 1 | 5 |
| Stars | 153 | 1,458 |
| Growth | 3.9% | -0.2% |
| Activity | 0.0 | 0.0 |
| Last commit | 3 months ago | 6 months ago |
| Language | Python | Python |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
cdx_toolkit
- How to extract particular domain webpage from CommonCrawl dataset efficiently?
  The easiest way is to use cdx_toolkit, which lets you query the CommonCrawl index and download the WARC files from a CLI.
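
As a rough illustration of what such a tool automates, the sketch below builds a query against the public Common Crawl index server and turns one index record into a byte-range request for the matching WARC record. It uses only the standard library and does not call cdx_toolkit itself. The crawl ID, the helper names `build_index_query` and `warc_range_request`, and the sample record are assumptions made for this example.

```python
from urllib.parse import urlencode

# Assumptions: public Common Crawl endpoints; the crawl ID is only illustrative.
INDEX_SERVER = "https://index.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2023-50"
DATA_PREFIX = "https://data.commoncrawl.org/"


def build_index_query(domain, crawl_id=CRAWL_ID, limit=None):
    """Return an index URL that lists every capture under `domain`."""
    params = {"url": f"{domain}/*", "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def warc_range_request(record):
    """Map one index record to (url, headers) for fetching only that WARC record.

    `record` is a dict with the `filename`, `offset` and `length` fields
    that the index returns for each capture.
    """
    offset = int(record["offset"])
    length = int(record["length"])
    url = DATA_PREFIX + record["filename"]
    # HTTP byte ranges are inclusive, hence the -1 on the end position.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


if __name__ == "__main__":
    print(build_index_query("example.com", limit=5))
    # Hypothetical record, shaped like one JSON line from the index.
    sample = {"filename": "crawl-data/x.warc.gz", "offset": "100", "length": "50"}
    print(warc_range_request(sample))
```

Each range response is a single gzip-compressed WARC record, so only the pages for the requested domain are downloaded, not entire crawl segments.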
conifer
- YaCy, a distributed Web Search Engine, based on a peer-to-peer network
- I have no idea how Github works. I need to download a Conifer archiving tool. Can someone explain how to do? https://github.com/Rhizome-Conifer/conifer
- [HELP] I'm looking for some self-hosted solution where users can connect to the website, connect to a page, browse it and save it.
  I saw Rhizome-conifer (https://github.com/Rhizome-Conifer/conifer). It looks great, but there are a lot of open issues on GitHub, some from 2016, and I'm afraid it will not be kept up (which would be very sad).
- Looking for a tool that generates reader mode of articles
- Ask HN: What browser extensions are a must-have for HNers in 2021?
  I would personally recommend https://github.com/rhizome-conifer/conifer
  The intent is a webrecorder for the internet.
  It records all JS libraries, loads videos, and all else.
  Once stored, you can review a snapshot at that point in time.
  They have a service option, webrecorder.io, but this one lets you store directly locally.
What are some alternatives?
ipwb - InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
pywb - Core Python Web Archiving Toolkit for replay and recording of web archives
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Wallabag - wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
comcrawl - A python utility for downloading Common Crawl data
Reddit-Enhancement-Suite - Reddit Enhancement Suite
SingleFile - Web Extension for saving a faithful copy of a complete web page in a single HTML file
temporal-shift-module - [ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
nmigen - A refreshed Python toolbox for building complex digital hardware. See https://gitlab.com/nmigen/nmigen
qkeras - QKeras: a quantization deep learning library for Tensorflow Keras
syncthing-android - Wrapper of syncthing for Android.