Top 23 Web Crawling Open-Source Projects
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
-
FastImage
FastImage finds the size or type of an image given its URI, fetching as little data as needed.
-
Wombat
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
-
MetaInspector
Ruby gem for web scraping. It scrapes a given URL and returns its title, meta description, meta keywords, links, images...
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.
MechanicalSoup is a Python library for web scraping that combines the simplicity of Requests with the convenience of BeautifulSoup. It's particularly useful for interacting with web forms, like login pages. Here's a basic example to illustrate how you can use MechanicalSoup for web scraping:
Project mention: RSS can be used to distribute all sorts of information | news.ycombinator.com | 2023-11-20
There is JSON Feed¹ already. One of the spec writers is behind micro.blog, which is the first place I saw it (and also one of the few places I've seen it). I don't think it is a bad idea, and it doesn't take all that long to implement it.
I have long hoped it would pick up with the JSON-ify-everything crowd, just so I'd never see a non-Atom feed again. We perhaps wouldn't need so much of the magic that is wrapped up in packages like feedparser² to deal with all the brokenness of RSS in the wild.
¹ https://www.jsonfeed.org/
² https://github.com/kurtmckee/feedparser
Web Crawling related posts
- Launching Crawlee Blog: Your Node.js resource hub for web scraping and automation.
- How to scrape a website with Python (Beginner tutorial)
- Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
- Seven Python Projects to Elevate Your Coding Skills
- What is SERP? Meaning, Use Cases and Approaches
- Anything like scrapy in other languages?
- RSS can be used to distribute all sorts of information
-
A note from our sponsor - SaaSHub
www.saashub.com | 28 Apr 2024
Index
What are some of the best open-source Web Crawling projects? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Scrapy | 50,896 |
| 2 | pyspider | 16,319 |
| 3 | requests-html | 13,575 |
| 4 | crawlee | 12,129 |
| 5 | webmagic | 11,239 |
| 6 | jsoup | 10,625 |
| 7 | portia | 9,166 |
| 8 | MechanicalSoup | 4,552 |
| 9 | Crawler4j | 4,469 |
| 10 | Mechanize | 4,354 |
| 11 | RoboBrowser | 3,689 |
| 12 | Apache Nutch | 2,809 |
| 13 | Grab | 2,354 |
| 14 | gain | 2,031 |
| 15 | Scrapely | 1,846 |
| 16 | feedparser | 1,829 |
| 17 | PSpider | 1,811 |
| 18 | Upton | 1,614 |
| 19 | anemone | 1,613 |
| 20 | cola | 1,488 |
| 21 | FastImage | 1,355 |
| 22 | Wombat | 1,303 |
| 23 | MetaInspector | 1,021 |