Top 23 Crawler Open-Source Projects

Scrapy

180 50,896 9.6 Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16

lux

13 25,247 7.4 Go

👾 Fast and simple video download library and CLI tool written in Go

Project mention: Bilibili download stalls at around 30-60% | /r/youtubedl | 2023-05-18

Not a fix, but I tend to use lux when downloading from bilibili. It is faster too.

colly

39 22,165 6.0 Go

Elegant Scraper and Crawler Framework for Golang

Project mention: Scraping the full snippet from Google search result | dev.to | 2024-01-01

SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use GoColly package.

pyspider

0 16,319 0.0 Python

A Powerful Spider (Web Crawler) System in Python.
newspaper

13 13,720 0.0 Python

newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
crawlee

29 12,129 9.8 TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project mention: How to scrape Amazon products | dev.to | 2024-04-01

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

webmagic

1 11,239 7.0 Java

A scalable web crawler framework for Java.
crawlab

4 10,788 6.8 Go

Distributed web crawler admin platform for managing spiders regardless of language or framework.
Photon

3 10,501 0.0 Python

Incredibly fast crawler designed for OSINT. (by s0md3v)
katana

9 8,661 9.1 Go

A next-generation crawling and spidering framework.
Pholcus

0 7,504 0.0 Go

Pholcus is distributed, high-concurrency crawler software written in pure Go.
Douyin_TikTok_Download_API

3 6,780 8.2 Python

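🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls and online batch parsing and downloading.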

Project mention: TikTok video scraper | /r/webscraping | 2023-05-23

At the moment I am working on a web scraper for TikTok. At the moment, I am able to retrieve data about the first 16 videos from a channel. The way I achieved this was to make requests to an unofficial API https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that would extract data from TikTok. I would like to ask you all, how should I go about this task. Already tried getting data from the HTML, but is not sufficient since most of it is not displayed when I use requests.get(URL). Could you please recommend some repositories that could help or some way of extracting the data? Thank you!

node-crawler

3 6,615 4.8 JavaScript

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
awesome-web-scraping

6 6,308 5.1 Makefile

List of libraries, tools and APIs for web scraping and data processing.
awesome-crawler

7 6,078 1.5

A collection of awesome web crawlers and spiders in different languages
autoscraper

9 5,937 0.0 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Ferret

0 5,616 3.1 Go

Declarative web scraping
scrapy-redis

4 5,451 5.0 Python

Redis-based components for Scrapy.

Project mention: How to make scrapy run multiple times on the same URLs? | /r/scrapy | 2023-06-26

myGPTReader

6 4,379 6.4 Python

A community-driven way to read and chat with AI bots, powered by ChatGPT.
browser-fingerprinting

8 3,830 1.0 JavaScript

Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot system 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?

Project mention: A site that tracks the price of a Big Mac in every US McDonald's | news.ycombinator.com | 2024-01-13

Yes, there is a lot written about it. Here is one link I have saved:
https://github.com/niespodd/browser-fingerprinting

ProxyBroker

0 3,714 0.0 Python

Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭
arachni

2 3,639 1.5 Ruby

Web Application Security Scanner Framework

Project mention: Self-Host Vulnerability Scanner | /r/selfhosted | 2023-07-09

toapi

0 3,462 0.0 Python

Every web site provides APIs.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Crawler related posts

Flexible Node.js AI-assisted crawler library
3 projects | news.ycombinator.com | 24 Apr 2024
Traditional crawler or AI-assisted crawler? How to choose?
1 project | dev.to | 22 Apr 2024
Tutorial: Extracting structured data from websites using Groq and Firecrawl
1 project | news.ycombinator.com | 22 Apr 2024
How to Build a Robust Web Scraper with Laravel: and Catch 'Em All
1 project | dev.to | 22 Apr 2024
AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?
1 project | dev.to | 16 Apr 2024
AI combined with Node.js x-crawl crawler
1 project | dev.to | 10 Apr 2024
Show HN: Nebula – A network agnostic DHT crawler
1 project | news.ycombinator.com | 20 Mar 2024

Index

What are some of the best open-source Crawler projects? This list will help you:

	Project	Stars
1	Scrapy	50,896
2	lux	25,247
3	colly	22,165
4	pyspider	16,319
5	newspaper	13,720
6	crawlee	12,129
7	webmagic	11,239
8	crawlab	10,788
9	Photon	10,501
10	katana	8,661
11	Pholcus	7,504
12	Douyin_TikTok_Download_API	6,780
13	node-crawler	6,615
14	awesome-web-scraping	6,308
15	awesome-crawler	6,078
16	autoscraper	5,937
17	Ferret	5,616
18	scrapy-redis	5,451
19	myGPTReader	4,379
20	browser-fingerprinting	3,830
21	ProxyBroker	3,714
22	arachni	3,639
23	toapi	3,462