Top 23 web-crawler Open-Source Projects
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
-
crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
-
abot
Cross-platform C# web crawler framework built for speed and flexibility.
-
MarginaliaSearch
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
-
spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use. (by postmodern)
-
Ignareo-ISML-auto-voter
Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)
In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.
> I think a larger concern is how you'll address the Bus Factor going forward
I can't speak to how much energy it is to go from code to serving requests, but FWIW the code is AGPLv3 and seems to be updated regularly https://github.com/MarginaliaSearch/MarginaliaSearch/blob/v2...
Project mention: Tutorial: Extracting structured data from websites using Groq and Firecrawl | news.ycombinator.com | 2024-04-22
web-crawler related posts
- Multiparadigmatic Web Scraping Tool!
- Distributed Web Crawler
- How impossible is this task that's been assigned to my coworkers and I?
- Dark web monitoring suggestions
- Hardware setup for continuous scraping
- Meet Crusty, Fast Scalable && Polite Broad Web Crawler Built With Rust!
- Show HN: A fast, feature-rich crawler for Go
Index
What are some of the best open-source web-crawler projects? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | crawlee | 12,044 |
2 | crawlab | 10,788 |
3 | awesome-crawler | 6,059 |
4 | Apache Nutch | 2,807 |
5 | abot | 2,204 |
6 | PSpider | 1,811 |
7 | storm-crawler | 856 |
8 | MarginaliaSearch | 826 |
9 | spidr | 792 |
10 | kochat | 442 |
11 | ache | 434 |
12 | Sparkler | 409 |
13 | spidy Web Crawler | 323 |
14 | crawler | 299 |
15 | ant | 276 |
16 | antch | 255 |
17 | Infinity Crawler | 237 |
18 | awesome-web-scraper | 236 |
19 | crawley | 227 |
20 | firecrawl | 1,659 |
21 | Ignareo-ISML-auto-voter | 189 |
22 | krawler | 130 |
23 | GoodreadsScraper | 115 |