crawl

Open-source projects categorized as crawl

Top 10 crawl Open-Source Projects

  • InfoSpider

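    INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, designed to help users retrieve their own data safely and quickly. The tool's code is open source and its process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.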

  • grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  • Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29

    The format you want is WARC. Even the Library of Congress uses it. There are many many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.

  • x-crawl

    x-crawl is a flexible, multifunctional Node.js crawler library. Its flexible usage and broad feature set help you crawl pages, interfaces, and files quickly, safely, and reliably.

  • Project mention: Flexible Node.js AI-assisted crawler library | news.ycombinator.com | 2024-04-24
  • stweet

    Advanced Python library to scrape Twitter (tweets, users) via the unofficial API

  • Project mention: Failed using the new twitter API or alternatives | /r/learnpython | 2023-05-11
  • clipper.js

    HTML to Markdown converter and crawler.

  • Project mention: Mozilla: Readability.js | news.ycombinator.com | 2024-02-25

    Clipper.js is built on top of Mozilla's Readability library and uses Turndown to convert HTML to Markdown: https://github.com/philschmid/clipper.js

  • gospider

    ⚡ Lightweight Golang spider framework

  • fetchurls

    A bash script to spider a site, follow links, and fetch urls (with built-in filtering) into a generated text file.

  • wget-lua

    Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

  • squint

    Makes visual reviews of web app releases easy. (by kimmobrunfeldt)

  • splatstats

    DCSS tournament TeamSplat statistics

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source crawl projects? This list will help you:

#    Project       Stars
1    InfoSpider    7,122
2    grab-site     1,260
3    x-crawl       1,016
4    stweet          568
5    clipper.js      415
6    gospider        203
7    fetchurls       123
8    wget-lua         81
9    squint           27
10   splatstats        2
