Top 10 crawl Open-Source Projects

InfoSpider

1 7,122 4.6 Python

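INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, designed to help users retrieve their own data safely and quickly. The tool's code is open source and its process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.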
grab-site

30 1,260 3.8 Python

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29

The format you want is WARC. Even the Library of Congress uses it. There are many many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.

x-crawl

8 1,016 8.8 TypeScript

x-crawl is a flexible, multifunctional crawler library for Node.js. Its flexible usage and many features help you crawl pages, interfaces, and files quickly, safely, and stably.

Project mention: Flexible Node.js AI-assisted crawler library | news.ycombinator.com | 2024-04-24

stweet

3 568 0.0 Python

Advanced Python library to scrape Twitter (tweets, users) via the unofficial API

Project mention: Failed using the new twitter API or alternatives | /r/learnpython | 2023-05-11

clipper.js

1 415 6.8 TypeScript

HTML to Markdown converter and crawler.

Project mention: Mozilla: Readability.js | news.ycombinator.com | 2024-02-25

Clipper.js is built on top of Mozilla's Readability library and uses Turndown to convert HTML to Markdown: https://github.com/philschmid/clipper.js

gospider

0 203 3.6 Go

⚡ Lightweight Golang spider framework
fetchurls

4 123 0.0 Shell

A Bash script to spider a site, follow links, and fetch URLs (with built-in filtering) into a generated text file.
wget-lua

2 81 6.1 C

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
squint

1 27 6.4 TypeScript

Makes visual reviews of web app releases easy. (by kimmobrunfeldt)
splatstats

1 2 0.0 Perl

DCSS tournament TeamSplat statistics

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).