Scraping

Top 23 Scraping Open-Source Projects

  1. Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Top 10 Tools for Efficient Web Scraping in 2025 | dev.to | 2025-01-16

    Scrapy is a robust and scalable open-source web crawling framework. It is highly efficient for large-scale projects and supports asynchronous scraping.

  3. Jobs_Applier_AI_Agent_AIHawk

    Jobs_Applier_AI_Agent_AIHawk aims to ease the job hunt by automating the job application process. Using artificial intelligence, it lets users apply for multiple jobs in a tailored way.

    Project mention: Jobs Applier AI Agent | news.ycombinator.com | 2024-12-08

  4. colly

    Elegant Scraper and Crawler Framework for Golang

    Project mention: Golang with Colly: Use Random Fake User-Agents When Scraping | dev.to | 2025-01-10

    https://github.com/lib4u/fake-useragent https://github.com/gocolly/colly

  5. firecrawl

    🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

    Project mention: Show HN: Get structured website data with just a prompt | news.ycombinator.com | 2025-01-20

    - Also, most of our work including /extract is open-source. Check it out here at https://github.com/mendableai/firecrawl

    That's all for now! Let us know any feedback on /extract.

  6. Scrapegraph-ai

    Python scraper based on AI

    Project mention: Revolutionizing the Search with ScrapeGraphAI | news.ycombinator.com | 2025-02-05

  7. crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Project mention: 12 tips on how to think like a web scraping expert | dev.to | 2024-11-04

  8. requests-html

    Pythonic HTML Parsing for Humans™

  10. webmagic

    A scalable web crawler framework for Java.

    Project mention: 11 best open-source web crawlers and scrapers in 2024 | dev.to | 2024-10-29

    Language: Java | GitHub: 11.4K+ stars

  11. undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)

    Project mention: Cross-platform RAT deployed by weaponized 'requests' clone | dev.to | 2024-08-30

    Failed Proof-of-Origin checks: The package claimed to be associated with the GitHub repository undetected-chromedriver, but other packages held stronger claims to this repository.

  12. tabula

    Tabula is a tool for liberating data tables trapped inside PDF files

    Project mention: Dive into Time-Series Anomaly Detection: A Decade Review | news.ycombinator.com | 2025-01-06

    Maybe you can use tabula [0] to extract the information from the PDF?

    https://github.com/tabulapdf/tabula

  13. awesome-web-scraping

    List of libraries, tools and APIs for web scraping and data processing.

  14. autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

  15. Ferret

    Declarative web scraping

  16. crawlee-python

    Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Project mention: How to scrape Crunchbase using Python in 2024 (Easy Guide) | dev.to | 2025-01-15

    By the end of this blog, we'll explore three different ways to extract data from Crunchbase using Crawlee for Python. We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us better understand how important it is to properly choose the right data source.

  17. Mechanize

    Mechanize is a Ruby library that makes automated web interaction easy.

  18. Data-science

    Collection of useful data science topics along with articles, videos, and code (by khuyentran1401)

  19. trafilatura

    Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

    Project mention: Claude is now available in Europe | news.ycombinator.com | 2024-05-14

  20. fake-useragent

    Up-to-date simple useragent faker with real world database

  21. Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE

    Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!

  22. snoop

    Snoop: a reconnaissance tool based on open data (OSINT world)

  23. Symfony Panther

    A browser testing and web crawling library for PHP and Symfony

  24. Geziyor

    Geziyor, blazing fast web crawling & scraping framework for Go. Supports JS rendering.

  25. pipet

    Swiss-army tool for scraping and extracting data from online assets, made for hackers

    Project mention: Show HN: Pipet – CLI tool for scraping and extracting data online, with pipes | news.ycombinator.com | 2024-09-30
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
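
Two entries above (the colly project mention and fake-useragent) concern sending a random User-Agent header with each request. The idea can be sketched in Python with only the standard library. This is a minimal illustration, not how colly or fake-useragent implement it: the `USER_AGENTS` pool, the `build_request` helper, and the example.com URL are all assumptions made for the example.

```python
import random
import urllib.request

# Hypothetical pool of User-Agent strings (an assumption for this sketch).
# A real scraper would load a maintained, up-to-date list instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url, rng=random):
    """Return a Request whose User-Agent is picked at random from the pool."""
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder URL; no network call is made, only the header is inspected.
    req = build_request("https://example.com/")
    print(req.get_header("User-agent"))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which can help when debugging a crawl.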

Scraping related posts

  • Revolutionizing the Search with ScrapeGraphAI

    1 project | news.ycombinator.com | 5 Feb 2025
  • How to scrape Crunchbase using Python in 2024 (Easy Guide)

    5 projects | dev.to | 15 Jan 2025
  • Golang with Colly: Use Random Fake User-Agents When Scraping

    2 projects | dev.to | 10 Jan 2025
  • Show HN: Equip Your Agents with Powerful Web Scraping Tools

    1 project | news.ycombinator.com | 10 Dec 2024
  • Browser fingerprinting tools for anonymizing scrapers

    1 project | news.ycombinator.com | 5 Dec 2024
  • What Are the Latest Scraping APIs or Services/Websites?

    1 project | news.ycombinator.com | 4 Dec 2024
  • ScrapeGraphAI: AI Scraping API

    1 project | news.ycombinator.com | 3 Dec 2024

Index

What are some of the best open-source Scraping projects? This list will help you:

# | Project | Stars
1 | Scrapy | 54,040
2 | Jobs_Applier_AI_Agent_AIHawk | 27,010
3 | colly | 23,690
4 | firecrawl | 23,500
5 | Scrapegraph-ai | 17,798
6 | crawlee | 16,707
7 | requests-html | 13,776
8 | webmagic | 11,482
9 | undetected-chromedriver | 10,572
10 | tabula | 6,900
11 | awesome-web-scraping | 6,866
12 | autoscraper | 6,611
13 | Ferret | 5,772
14 | crawlee-python | 5,221
15 | Mechanize | 4,405
16 | Data-science | 4,079
17 | trafilatura | 3,901
18 | fake-useragent | 3,772
19 | Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE | 3,183
20 | snoop | 3,164
21 | Symfony Panther | 2,970
22 | Geziyor | 2,662
23 | pipet | 2,627
