Top 10 crawl Open-Source Projects
-
InfoSpider
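INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, designed to help users retrieve their own data safely and quickly. The code is open source and the process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.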
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
-
x-crawl
x-crawl is a flexible, multifunctional crawler library for Node.js. Its flexible usage and broad feature set help you crawl pages, APIs, and files quickly, safely, and reliably.
-
fetchurls
A Bash script that spiders a site, follows links, and writes the fetched URLs (with built-in filtering) to a generated text file.
-
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
The format you want is WARC. Even the Library of Congress uses it. There are many many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.
Clipper.js is built on top of Mozilla's Readability library and uses Turndown to convert HTML to Markdown: https://github.com/philschmid/clipper.js
crawl related posts
- Flexible Node.js AI-assisted crawler library
- Traditional crawler or AI-assisted crawler? How to choose?
- AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?
- AI combined with Node.js x-crawl crawler
- Recommend a flexible Node.js multi-functional crawler library —— x-crawl
- struggling to download websites
- Internet Archive Down, will be up and running soon (i hope).
Index
What are some of the best open-source crawl projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | InfoSpider | 7,122 |
2 | grab-site | 1,260 |
3 | x-crawl | 1,016 |
4 | stweet | 568 |
5 | clipper.js | 415 |
6 | gospider | 203 |
7 | fetchurls | 123 |
8 | wget-lua | 81 |
9 | squint | 27 |
10 | splatstats | 2 |