web-crawler

Open-source projects categorized as web-crawler

Top 23 web-crawler Open-Source Projects

  • crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

  • Project mention: How to scrape Amazon products | dev.to | 2024-04-01

    In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

  • crawlab

    Distributed web crawler admin platform for managing spiders regardless of language or framework.

  • awesome-crawler

    A collection of awesome web crawlers and spiders in different languages

  • Apache Nutch

    Apache Nutch is an extensible and scalable web crawler

  • abot

    Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

  • PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

  • storm-crawler

    A scalable, mature and versatile web crawler based on Apache Storm

  • MarginaliaSearch

    Internet search engine for text-oriented websites. Indexing the small, old and weird web.

  • Project mention: Marginalia: 3 Years | news.ycombinator.com | 2024-02-25

    > I think a larger concern is how you'll address the Bus Factor going forward

    I can't speak to how much energy it is to go from code to serving requests, but FWIW the code is AGPLv3 and seems to be updated regularly https://github.com/MarginaliaSearch/MarginaliaSearch/blob/v2...

  • spidr

    A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use. (by postmodern)

  • kochat

    Open-source Korean chatbot framework

  • ache

    ACHE is a web crawler for domain-specific search.

  • Sparkler

    Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

  • spidy Web Crawler

    The simple, easy to use command line web crawler.

  • crawler

    Library for Rapid (Web) Crawler and Scraper Development (by crwlrsoft)

  • ant

    A web crawler for Go (by yields)

  • antch

    Antch, a fast, powerful and extensible web crawling & scraping framework for Go

  • Infinity Crawler

    A simple but powerful web crawler library for .NET

  • awesome-web-scraper

    A collection of awesome web scrapers and crawlers.

  • crawley

    The unix-way web crawler (by s0rg)

  • firecrawl

    🔥 Turn entire websites into LLM-ready markdown

  • Project mention: Tutorial: Extracting structured data from websites using Groq and Firecrawl | news.ycombinator.com | 2024-04-22
  • Ignareo-ISML-auto-voter

    Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML = International Saimoe League; the 2022 ISML was the last ISML)

  • krawler

    A web crawling framework written in Kotlin

  • GoodreadsScraper

    Scrape data from Goodreads using Scrapy and Selenium

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source web-crawler projects? This list will help you:

Rank Project Stars
1 crawlee 12,044
2 crawlab 10,788
3 awesome-crawler 6,059
4 Apache Nutch 2,807
5 abot 2,204
6 PSpider 1,811
7 storm-crawler 856
8 MarginaliaSearch 826
9 spidr 792
10 kochat 442
11 ache 434
12 Sparkler 409
13 spidy Web Crawler 323
14 crawler 299
15 ant 276
16 antch 255
17 Infinity Crawler 237
18 awesome-web-scraper 236
19 crawley 227
20 firecrawl 1,659
21 Ignareo-ISML-auto-voter 189
22 krawler 130
23 GoodreadsScraper 115
