Top 23 Crawling Open-Source Projects
-
newspaper
newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
-
crawlee
Crawlee is a web scraping and browser automation library for Node.js for building reliable crawlers, in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.
-
hakrawler
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
-
cariddi
Takes a list of domains, crawls URLs, and scans for endpoints, secrets, API keys, file extensions, tokens, and more
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use the Go Colly package.
In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.
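The core of that extraction step is selecting elements and pulling out their text. As a rough, dependency-free illustration of what Cheerio-style selectors do under the hood, here is a minimal sketch using only the Python standard library; the HTML snippet, the `productTitle` id, and the `price` class are invented for this example and do not reflect Amazon's actual markup:

```python
from html.parser import HTMLParser

# Walk the HTML, remember which element we are currently inside, and
# collect the text of the elements we care about. The markup and
# attribute names below are invented for illustration only.
class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field name we are collecting text for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "productTitle":
            self._current = "title"
        elif "price" in attrs.get("class", ""):
            self._current = "price"

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

html = """
<html><body>
  <span id="productTitle">Example Widget, Pack of 2</span>
  <span class="price">$19.99</span>
</body></html>
"""

parser = ProductParser()
parser.feed(html)
print(parser.fields)  # {'title': 'Example Widget, Pack of 2', 'price': '$19.99'}
```

Real libraries add CSS-selector syntax, request scheduling, and blocking countermeasures on top of this basic walk-and-collect loop.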
I have tried the following. 1. Log in to Okta via the browser programmatically using go-rod. I managed to do this successfully, but I'm failing to load Slack, as it gets stuck on Slack's browser loader screen. 2. I tried to authenticate via the Okta RESTful API. So far, I have managed to authenticate using {{domain}}/api/v1/authn, and then complete MFA via the verify endpoint {{domain}}/api/v1/authn/factors/{{factorID}}/verify, which returns a sessionToken. From there, I can successfully create a sessionCookie, which has proven quite useless to me. Perhaps I am doing it wrong.
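For reference, the two REST calls described in that post can be sketched offline. The endpoints come from the post itself; the payload field names (`username`, `password`, `stateToken`, `passCode`) are assumptions based on Okta's documented primary-authentication flow and may differ per org. The requests are constructed but deliberately not sent:

```python
import json
import urllib.request

JSON_HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}

def authn_request(domain: str, username: str, password: str) -> urllib.request.Request:
    # Primary authentication: POST {{domain}}/api/v1/authn.
    # Field names are assumptions based on Okta's documented API.
    return urllib.request.Request(
        f"https://{domain}/api/v1/authn",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers=JSON_HEADERS,
        method="POST",
    )

def verify_request(domain: str, factor_id: str,
                   state_token: str, pass_code: str) -> urllib.request.Request:
    # MFA verification: POST {{domain}}/api/v1/authn/factors/{{factorID}}/verify.
    # A successful response contains the sessionToken mentioned above.
    return urllib.request.Request(
        f"https://{domain}/api/v1/authn/factors/{factor_id}/verify",
        data=json.dumps({"stateToken": state_token, "passCode": pass_code}).encode(),
        headers=JSON_HEADERS,
        method="POST",
    )

req = authn_request("example.okta.com", "user@example.com", "secret")
print(req.full_url)  # https://example.okta.com/api/v1/authn
```

To actually drive a browser session from the resulting sessionToken, it typically has to be exchanged for a session cookie via Okta's sessions or redirect endpoints before Slack's SSO flow will accept it.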
PuppeteerSharp
botasaurus: The All in One Framework to build Awesome Scrapers
Crawling related posts
- How To Scrape TikTok in 2024
- Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
- Modern automated data miner (scraper)
- An Introduction to the WARC File
- New webcrawler for bug-hunters and data-miners
- π "WebPalm: Unleash Websites" π
- Real challenge for web scrapers- Need Help!
-
A note from our sponsor - WorkOS
workos.com | 23 Apr 2024
Index
What are some of the best open-source Crawling projects? This list will help you:
# | Project | Stars
---|---|---
1 | Scrapy | 50,824 |
2 | colly | 22,120 |
3 | newspaper | 13,703 |
4 | crawlee | 12,044 |
5 | awesome-web-scraping | 6,299 |
6 | Ferret | 5,616 |
7 | rod | 4,750 |
8 | hakrawler | 4,225 |
9 | PuppeteerSharp | 3,149 |
10 | Apache Nutch | 2,807 |
11 | Grab | 2,353 |
12 | awesome-puppeteer | 2,318 |
13 | cariddi | 1,351 |
14 | core | 1,319 |
15 | mlscraper | 1,219 |
16 | botasaurus | 870 |
17 | Crawly | 839 |
18 | scrapyrt | 816 |
19 | Dataflow kit | 636 |
20 | isp-data-pollution | 566 |
21 | spidermon | 510 |
22 | WarcDB | 383 |
23 | LinkedInDumper | 362 |