11 best open-source web crawlers and scrapers in 2024

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  1. crawlee

    Crawlee is a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. It can extract data for AI, LLMs, RAG, or GPTs, and download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.

    Language: Node.js, Python | GitHub: 15.4K+ stars | link

  3. Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Language: Python | GitHub: 52.9k stars | link

  4. MechanicalSoup

    A Python library for automating interaction with websites.

    Language: Python | GitHub: 4.7K+ stars | link

  5. node-crawler

    Web Crawler/Spider for NodeJS + server-side jQuery ;-)

    Language: Node.js | GitHub: 6.7K+ stars | link

  6. Selenium WebDriver

    A browser automation framework and ecosystem.

    Language: Multi-language | GitHub: 30.6K stars | link

  7. heritrix3

    Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

    Language: Java | GitHub: 2.8K+ stars | link

  8. Apache Nutch

    Apache Nutch is an extensible and scalable web crawler.

    Language: Java | GitHub: 2.9K+ stars | link

  10. webmagic

    A scalable web crawler framework for Java.

    Language: Java | GitHub: 11.4K+ stars | link

  11. Nokogiri

    Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

    Language: Ruby | GitHub: 6.1K+ stars | link

  12. Crawler4j

    Open Source Web Crawler for Java

    Language: Java | GitHub: 4.5K+ stars | link

  13. web-harvest

    A web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages. Fork of http://web-harvest.sourceforge.net

    Language: Go | GitHub: 11.1k stars | link

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives, so a higher number means a more popular project.
