Top 23 Spider Open-Source Projects
-
crawlab
Distributed web crawler admin platform for managing spiders regardless of language or framework.
-
InfoSpider
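INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, designed to help users retrieve their own data safely and quickly. The tool's code is open source and its workflow is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.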
-
Douyin_TikTok_Download_API
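🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls as well as online batch parsing and downloading.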
-
browser-fingerprinting
Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot system 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?
-
scrapydweb
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
-
abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
-
cariddi
Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
-
x-crawl
x-crawl is a flexible Node.js multifunctional crawler library. Flexible usage and numerous functions can help you quickly, safely, and stably crawl pages, interfaces, and files.
SerpApi focuses on scraping search results, so we need extra help to scrape individual sites. We'll use the GoColly package.
I am currently working on a web scraper for TikTok. So far I can retrieve data about the first 16 videos from a channel by making requests to an unofficial API, https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that extracts data from TikTok. How should I go about this task? I have already tried getting the data from the HTML, but that is not sufficient, since most of it is missing from the response when I use requests.get(URL). Could you please recommend some repositories that could help, or some other way of extracting the data? Thank you!
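When `requests.get` returns little visible markup, one thing worth checking is whether the page ships its data as JSON inside a `<script>` element that the client-side code later renders. The sketch below shows that general technique using only the Python standard library. The element id `page-state`, the sample HTML, and the payload shape are hypothetical and are not TikTok's actual markup. Whether a given site embeds data this way has to be confirmed in its page source.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the visible <div> is empty, the data sits in a JSON script.
SAMPLE_HTML = """
<html><body><div id="app"></div>
<script id="page-state" type="application/json">
{"videos": [{"id": "v1", "desc": "first"}, {"id": "v2", "desc": "second"}]}
</script>
</body></html>
"""


class JsonScriptExtractor(HTMLParser):
    """Collect the text of <script type="application/json"> elements, keyed by id."""

    def __init__(self):
        super().__init__()
        self._current_id = None
        self.blobs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/json":
            self._current_id = attrs.get("id", "")
            self.blobs[self._current_id] = ""

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._current_id is not None:
            self.blobs[self._current_id] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._current_id = None


parser = JsonScriptExtractor()
parser.feed(SAMPLE_HTML)
state = json.loads(parser.blobs["page-state"])
video_ids = [video["id"] for video in state["videos"]]
print(video_ids)
```

If no such script element exists, the data is probably fetched by a later request, and the browser's network inspector is the place to look for it.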
Project mention: A site that tracks the price of a Big Mac in every US McDonald's | news.ycombinator.com | 2024-01-13
Yes, there is a lot written about it. Here is one link I have saved:
https://github.com/niespodd/browser-fingerprinting
Project mention: Show HN: I scraped 25M Shopify products to build a search engine | news.ycombinator.com | 2023-12-13
As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
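The shared formatting described in the comment above can be illustrated with a small sketch. It assumes the storefront exposes its catalog as JSON (many Shopify stores serve a listing at `/products.json`, though not every shop does). The sample payload, the selected fields, and the `flatten_products` helper are illustrative assumptions and are not taken from the post. The example is in Python rather than Go and parses a hard-coded payload instead of making a network request.

```python
import json

# Illustrative payload shaped like a Shopify-style product listing.
SAMPLE = """
{"products": [
  {"id": 1, "title": "Blue Mug", "handle": "blue-mug", "vendor": "Acme",
   "variants": [{"id": 11, "title": "Small", "price": "9.50"},
                {"id": 12, "title": "Large", "price": "12.00"}]},
  {"id": 2, "title": "Tea Towel", "handle": "tea-towel", "vendor": "Acme",
   "variants": [{"id": 21, "title": "Default Title", "price": "4.25"}]}
]}
"""


def flatten_products(payload):
    """Turn a product listing into one flat row per variant."""
    rows = []
    for product in json.loads(payload).get("products", []):
        for variant in product.get("variants", []):
            rows.append({
                "product_id": product["id"],
                "title": product["title"],
                "variant": variant["title"],
                # Prices arrive as strings in this sample; convert for sorting.
                "price": float(variant["price"]),
            })
    return rows


rows = flatten_products(SAMPLE)
for row in rows:
    print(row)
```

Because every store in this sketch would share one schema, the same flattening function could be reused across shops, which is the property the comment credits for making such sites easy to scrape.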
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
The format you want is WARC. Even the Library of Congress uses it. There are many many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.
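For readers unfamiliar with WARC, a record is a block of named header lines, a blank line, and a payload whose size is given by `Content-Length`. The sketch below writes and reads minimal uncompressed WARC/1.0 `resource` records with the Python standard library. The helper names `build_warc_record` and `read_warc_records` are hypothetical, and the sketch does not reproduce what grab-site emits; real archives are usually gzip-compressed and also carry request, response, and metadata records.

```python
import io
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type):
    """Serialize one WARC/1.0 'resource' record as bytes."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLF pairs after the payload.
    return head + payload + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) pairs from a stream of uncompressed records."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is read by length, so it may itself contain blank lines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


archive = (
    build_warc_record("http://example.com/a", b"hello", "text/plain")
    + build_warc_record("http://example.com/b", b"line1\r\n\r\nline2", "text/plain")
)
parsed = list(read_warc_records(io.BytesIO(archive)))
for headers, payload in parsed:
    print(headers["WARC-Target-URI"], len(payload))
```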
Spider related posts
-
Flexible Node.js AI-assisted crawler library
-
Traditional crawler or AI-assisted crawler? How to choose?
-
AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?
-
AI combined with Node.js x-crawl crawler
-
Recommend a flexible Node.js multi-functional crawler library —— x-crawl
-
Differentiating between hypermarkets and supermarkets.
-
Meta, Microsoft and Amazon team up on maps project
-
A note from our sponsor - SaaSHub
www.saashub.com | 1 May 2024
Index
What are some of the best open-source Spider projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | colly | 22,165 |
2 | crawlab | 10,803 |
3 | Photon | 10,513 |
4 | Pholcus | 7,504 |
5 | InfoSpider | 7,134 |
6 | Douyin_TikTok_Download_API | 6,925 |
7 | node-crawler | 6,615 |
8 | awesome-web-scraping | 6,323 |
9 | awesome-crawler | 6,107 |
10 | browser-fingerprinting | 3,830 |
11 | toapi | 3,462 |
12 | Gerapy | 3,211 |
13 | scrapydweb | 3,004 |
14 | SpiderKeeper | 2,705 |
15 | DHT | 2,668 |
16 | TorBot | 2,618 |
17 | Geziyor | 2,480 |
18 | Grab | 2,355 |
19 | abot | 2,205 |
20 | PSpider | 1,811 |
21 | cariddi | 1,352 |
22 | grab-site | 1,261 |
23 | x-crawl | 1,176 |