Top 23 Scraping Open-Source Projects
-
Scrapy is a robust and scalable open-source web crawling framework. It is highly efficient for large-scale projects and supports asynchronous scraping.
-
Jobs_Applier_AI_Agent_AIHawk
Jobs_Applier_AI_Agent_AIHawk aims to ease the job hunt by automating the job application process. Using artificial intelligence, it enables users to apply for multiple jobs in a tailored way.
-
https://github.com/lib4u/fake-useragent https://github.com/gocolly/colly
-
firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Project mention: Show HN: Get structured website data with just a prompt | news.ycombinator.com | 2025-01-20
Also, most of our work including /extract is open-source. Check it out here at https://github.com/mendableai/firecrawl
That's all for now! Let us know any feedback on /extract.
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
If you like the blog so far, please consider giving Crawlee a star on GitHub; it helps us reach and help more developers.
-
-
Language: Java | GitHub: 11.4K+ stars
-
undetected-chromedriver
Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)
Failed Proof-of-Origin checks: The package claimed to be associated with the GitHub repository undetected-chromedriver, but other packages held stronger claims to this repository.
-
Project mention: Dive into Time-Series Anomaly Detection: A Decade Review | news.ycombinator.com | 2025-01-06
Maybe you can use tabula [0] to extract the information from the PDF?
[0] https://github.com/tabulapdf/tabula
-
crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
In this blog, we'll explore three different ways to extract data from Crunchbase using Crawlee for Python. We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us better understand how important it is to choose the right data source.
-
Data-science
Collection of useful data science topics along with articles, videos, and code (by khuyentran1401)
-
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
-
Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE
Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!
-
Project mention: Show HN: Pipet – CLI tool for scraping and extracting data online, with pipes | news.ycombinator.com | 2024-09-30
Scraping discussion
Scraping related posts
- Revolutionizing the Search with ScrapeGraphAI
- How to scrape Crunchbase using Python in 2024 (Easy Guide)
- Golang with Colly: Use Random Fake User-Agents When Scraping
- Show HN: Equip Your Agents with Powerful Web Scraping Tools
- Browser fingerprinting tools for anonymizing scrapers
- What Are the Latest Scraping APIs or Services/Websites?
- ScrapeGraphAI: AI Scraping API
-
A note from our sponsor - CodeRabbit
coderabbit.ai | 9 Feb 2025
Index
What are some of the best open-source Scraping projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 54,040 |
2 | Jobs_Applier_AI_Agent_AIHawk | 27,010 |
3 | colly | 23,690 |
4 | firecrawl | 23,500 |
5 | Scrapegraph-ai | 17,798 |
6 | crawlee | 16,707 |
7 | requests-html | 13,776 |
8 | webmagic | 11,482 |
9 | undetected-chromedriver | 10,572 |
10 | tabula | 6,900 |
11 | awesome-web-scraping | 6,866 |
12 | autoscraper | 6,611 |
13 | Ferret | 5,772 |
14 | crawlee-python | 5,221 |
15 | Mechanize | 4,405 |
16 | Data-science | 4,079 |
17 | trafilatura | 3,901 |
18 | fake-useragent | 3,772 |
19 | Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE | 3,183 |
20 | snoop | 3,164 |
21 | Symfony Panther | 2,970 |
22 | Geziyor | 2,662 |
23 | pipet | 2,627 |