Top 23 web-crawler Open-Source Projects

crawlee

29 12,044 9.8 TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project mention: How to scrape Amazon products | dev.to | 2024-04-01

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

crawlab

4 10,788 6.8 Go

Distributed web crawler admin platform for managing spiders, regardless of language or framework.
awesome-crawler

7 6,059 1.7

A collection of awesome web crawlers and spiders in different languages
Apache Nutch

3 2,807 8.0 Java

Apache Nutch is an extensible and scalable web crawler
abot

1 2,204 0.0 C#

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
PSpider

0 1,811 0.0 Python

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
storm-crawler

0 856 8.7 HTML

A scalable, mature and versatile web crawler based on Apache Storm
MarginaliaSearch

59 826 9.9 HTML

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

Project mention: Marginalia: 3 Years | news.ycombinator.com | 2024-02-25

> I think a larger concern is how you'll address the Bus Factor going forward
I can't speak to how much energy it is to go from code to serving requests, but FWIW the code is AGPLv3 and seems to be updated regularly https://github.com/MarginaliaSearch/MarginaliaSearch/blob/v2...

spidr

0 792 6.4 Ruby

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use. (by postmodern)
kochat

1 442 0.0 Python

Opensource Korean chatbot framework
ache

1 434 5.7 Java

ACHE is a web crawler for domain-specific search.
Sparkler

0 409 3.0 Java

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
spidy Web Crawler

0 323 0.0 Python

The simple, easy to use command line web crawler.
crawler

0 299 8.1 PHP

Library for Rapid (Web) Crawler and Scraper Development (by crwlrsoft)
ant

4 276 0.0 Go

A web crawler for Go (by yields)
antch

0 255 0.0 Go

Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Infinity Crawler

0 237 5.0 C#

A simple but powerful web crawler library for .NET
awesome-web-scraper

0 236 4.3

A collection of awesome web scrapers and crawlers.
crawley

8 227 6.7 Go

The unix-way web crawler (by s0rg)
firecrawl

2 1,659 7.5 TypeScript

🔥 Turn entire websites into LLM-ready markdown

Project mention: Tutorial: Extracting structured data from websites using Groq and Firecrawl | news.ycombinator.com | 2024-04-22

Ignareo-ISML-auto-voter

1 189 4.5 Python

Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)
krawler

0 130 0.0 Kotlin

A web crawling framework written in Kotlin
GoodreadsScraper

1 115 0.0 Python

Scrape data from Goodreads using Scrapy and Selenium :books:

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
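
The ordering rule in the note above can be sketched as a descending sort on star count. This is only an illustration: the `Project` shape and the `rankByStars` helper are hypothetical names, not anything the site publishes, and the sample rows reuse three entries from this list.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface Project {
  name: string;
  stars: number;
  mentions: number;
}

// Return a new array ordered by star count, highest first.
// The input array is copied so the caller's data is left untouched.
function rankByStars(projects: Project[]): Project[] {
  return [...projects].sort((a, b) => b.stars - a.stars);
}

// Three entries from the list, deliberately out of order.
const sample: Project[] = [
  { name: "awesome-crawler", stars: 6059, mentions: 7 },
  { name: "crawlee", stars: 12044, mentions: 29 },
  { name: "crawlab", stars: 10788, mentions: 4 },
];

const ranked = rankByStars(sample);
console.log(ranked.map((p, i) => `${i + 1}. ${p.name} (${p.stars})`).join("\n"));
```

Sorting a copy rather than the original keeps the raw data available for other orderings, such as by mentions.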

web-crawler related posts

Multiparadigmatic Web Scraping Tool!
1 project | /r/computerscience | 14 May 2023
Distributed Web Crawler
1 project | /r/webdev | 31 Dec 2022
How impossible is this task that's been assigned to my coworkers and I?
2 projects | /r/cscareerquestions | 14 Dec 2022
Dark web monitoring suggestions
1 project | /r/cybersecurity | 27 Jul 2021
Hardware setup for continuous scraping
1 project | /r/webscraping | 8 Jun 2021
Meet Crusty, Fast Scalable && Polite Broad Web Crawler Built With Rust!
3 projects | /r/rust | 4 Jun 2021
Show HN: A fast, feature-rich crawler for Go
1 project | news.ycombinator.com | 26 Mar 2021

Index

What are some of the best open-source web-crawler projects? This list will help you:

#	Project	Stars
1	crawlee	12,044
2	crawlab	10,788
3	awesome-crawler	6,059
4	Apache Nutch	2,807
5	abot	2,204
6	PSpider	1,811
7	storm-crawler	856
8	MarginaliaSearch	826
9	spidr	792
10	kochat	442
11	ache	434
12	Sparkler	409
13	spidy Web Crawler	323
14	crawler	299
15	ant	276
16	antch	255
17	Infinity Crawler	237
18	awesome-web-scraper	236
19	crawley	227
20	firecrawl	1,659
21	Ignareo-ISML-auto-voter	189
22	krawler	130
23	GoodreadsScraper	115