Top 23 Spider Open-Source Projects

colly

39 22,165 6.0 Go

Elegant Scraper and Crawler Framework for Golang

Project mention: Scraping the full snippet from Google search result | dev.to | 2024-01-01

SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use GoColly package.

crawlab

4 10,803 6.0 Go

Distributed web crawler admin platform for managing spiders regardless of language or framework.
Photon

3 10,513 0.0 Python

Incredibly fast crawler designed for OSINT. (by s0md3v)
Pholcus

0 7,504 0.0 Go

Pholcus is a distributed high-concurrency crawler software written in pure golang
InfoSpider

1 7,134 4.3 Python

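INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, designed to help users retrieve their own data safely and quickly. The tool's code is open source and its process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.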
Douyin_TikTok_Download_API

3 6,925 9.2 Python

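🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili that supports API calls as well as online batch parsing and downloading.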

Project mention: TikTok video scraper | /r/webscraping | 2023-05-23

At the moment I am working on a web scraper for TikTok. At the moment, I am able to retrieve data about the first 16 videos from a channel. The way I achieved this was to make requests to an unofficial API https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that would extract data from TikTok. I would like to ask you all, how should I go about this task. Already tried getting data from the HTML, but is not sufficient since most of it is not displayed when I use requests.get(URL). Could you please recommend some repositories that could help or some way of extracting the data? Thank you!

node-crawler

3 6,615 4.8 JavaScript

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
awesome-web-scraping

6 6,323 4.6 Makefile

List of libraries, tools and APIs for web scraping and data processing.
awesome-crawler

7 6,107 1.5

A collection of awesome web crawlers and spiders in different languages
browser-fingerprinting

8 3,830 1.0 JavaScript

Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot system 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?

Project mention: A site that tracks the price of a Big Mac in every US McDonald's | news.ycombinator.com | 2024-01-13

Yes, there is a lot written about it. Here is one link I have saved:
https://github.com/niespodd/browser-fingerprinting

toapi

0 3,462 0.0 Python

Every web site provides APIs.
Gerapy

1 3,211 6.8 Python

Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
scrapydweb

6 3,004 3.6 Python

Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
SpiderKeeper

1 2,705 0.0 Python

admin ui for scrapy/open source scrapinghub
DHT

1 2,668 0.0 Go

BitTorrent DHT Protocol && DHT Spider.
TorBot

1 2,618 8.7 Python

Dark Web OSINT Tool
Geziyor

2 2,480 0.6 Go

Geziyor, blazing fast web crawling & scraping framework for Go. Supports JS rendering.

Project mention: Show HN: I scraped 25M Shopify products to build a search engine | news.ycombinator.com | 2023-12-13

As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.

Grab

0 2,355 3.0 Python

Web Scraping Framework
abot

1 2,205 0.0 C#

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
PSpider

0 1,811 0.0 Python

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
cariddi

7 1,352 7.6 Go

Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
grab-site

30 1,261 3.8 Python

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29

The format you want is WARC. Even the Library of Congress uses it. There are many many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.

x-crawl

8 1,176 9.3 TypeScript

x-crawl is a flexible Node.js multifunctional crawler library. Flexible usage and numerous functions can help you quickly, safely, and stably crawl pages, interfaces, and files.

Project mention: Flexible Node.js AI-assisted crawler library | news.ycombinator.com | 2024-04-24

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Spider related posts

Flexible Node.js AI-assisted crawler library
3 projects | news.ycombinator.com | 24 Apr 2024

Traditional crawler or AI-assisted crawler? How to choose?
1 project | dev.to | 22 Apr 2024

AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?
1 project | dev.to | 16 Apr 2024

AI combined with Node.js x-crawl crawler
1 project | dev.to | 10 Apr 2024

Recommend a flexible Node.js multi-functional crawler library: x-crawl
1 project | dev.to | 20 Mar 2024

Differentiating between hypermarkets and supermarkets.
1 project | /r/openstreetmap | 9 Dec 2023

Meta, Microsoft and Amazon team up on maps project
1 project | news.ycombinator.com | 26 Jul 2023

Index

What are some of the best open-source Spider projects? This list will help you:

#	Project	Stars
1	colly	22,165
2	crawlab	10,803
3	Photon	10,513
4	Pholcus	7,504
5	InfoSpider	7,134
6	Douyin_TikTok_Download_API	6,925
7	node-crawler	6,615
8	awesome-web-scraping	6,323
9	awesome-crawler	6,107
10	browser-fingerprinting	3,830
11	toapi	3,462
12	Gerapy	3,211
13	scrapydweb	3,004
14	SpiderKeeper	2,705
15	DHT	2,668
16	TorBot	2,618
17	Geziyor	2,480
18	Grab	2,355
19	abot	2,205
20	PSpider	1,811
21	cariddi	1,352
22	grab-site	1,261
23	x-crawl	1,176
