Top 23 Crawler Open-Source Projects
-
newspaper
newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
-
crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
-
Douyin_TikTok_Download_API
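🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls as well as online batch parsing and downloading.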
-
browser-fingerprinting
Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot systems 👻 and get around browser fingerprinting scripts 🕵️ when scraping the web?
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
Not a fix, but I tend to use lux when downloading from bilibili. It is faster too.
SerpApi focuses on scraping search results, so we need extra help to scrape individual sites. We'll use the GoColly package.
In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.
I am currently working on a web scraper for TikTok. So far, I am able to retrieve data about the first 16 videos from a channel by making requests to an unofficial API, https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that extracts data from TikTok. How should I go about this task? I already tried getting the data from the HTML, but that is not sufficient, since most of it is missing from the response when I use requests.get(URL). Could you please recommend some repositories that could help, or some way of extracting the data? Thank you!
Project mention: A site that tracks the price of a Big Mac in every US McDonald's | news.ycombinator.com | 2024-01-13
Yes, there is a lot written about it. Here is one link I have saved:
https://github.com/niespodd/browser-fingerprinting
Crawler related posts
- Flexible Node.js AI-assisted crawler library
- Traditional crawler or AI-assisted crawler? How to choose?
- Tutorial: Extracting structured data from websites using Groq and Firecrawl
- How to Build a Robust Web Scraper with Laravel: and Catch 'Em All
- AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?
- AI combined with Node.js x-crawl crawler
- Show HN: Nebula – A network agnostic DHT crawler
Index
What are some of the best open-source Crawler projects? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | Scrapy | 50,896 |
2 | lux | 25,247 |
3 | colly | 22,165 |
4 | pyspider | 16,319 |
5 | newspaper | 13,720 |
6 | crawlee | 12,129 |
7 | webmagic | 11,239 |
8 | crawlab | 10,788 |
9 | Photon | 10,501 |
10 | katana | 8,661 |
11 | Pholcus | 7,504 |
12 | Douyin_TikTok_Download_API | 6,780 |
13 | node-crawler | 6,612 |
14 | awesome-web-scraping | 6,308 |
15 | awesome-crawler | 6,078 |
16 | autoscraper | 5,937 |
17 | Ferret | 5,616 |
18 | scrapy-redis | 5,451 |
19 | myGPTReader | 4,379 |
20 | browser-fingerprinting | 3,830 |
21 | ProxyBroker | 3,714 |
22 | arachni | 3,639 |
23 | toapi | 3,462 |