Top 23 Python Crawler Projects
-
Scrapy is a robust and scalable open-source web crawling framework. It is highly efficient for large-scale projects and supports asynchronous scraping.
-
newspaper
newspaper3k is a Python 3 library for news, full-text, and article metadata extraction.
-
Douyin_TikTok_Download_API
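🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls as well as online batch parsing and downloading.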
-
crawlee-python
Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.
By the end of this blog, we'll explore three different ways to extract data from Crunchbase using Crawlee for Python. We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us better understand how important it is to properly choose the right data source.
-
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
-
news-please
news-please - an integrated web crawler and information extractor for news that just works
-
Leaked-GPTs
Leaked GPTs prompts. Bypass the 25-message limit or try out GPTs without a Plus subscription.
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Project mention: ArchiveBox is evolving: the future of self-hosted internet archives | news.ycombinator.com | 2024-10-16
https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives, put them in object storage (whether that is IA, S3, Backblaze B2, etc.), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred).
-
google-play-scraper
Google Play scraper for Python, inspired by <facundoolano/google-play-scraper> (by JoMingyu)
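Most of the crawlers listed above automate the same basic loop: fetch a page, extract its links, and queue the ones not yet seen. The following is a minimal standard-library sketch of that loop, not the implementation used by any project on this list. The names `LinkExtractor`, `crawl`, and `FAKE_SITE` are hypothetical, and the "site" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch` maps a URL to HTML text, or None on failure."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it, keep crawling
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "https://example.com/b": '<a href="/missing">gone</a>',
}

pages = crawl("https://example.com/", FAKE_SITE.get)
print(pages)
```

To crawl a real site, `fetch` would be swapped for a function that performs HTTP requests. Production frameworks typically add what this sketch leaves out, such as concurrency, retries, rate limiting, and robots.txt handling.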
Python Crawler discussion
Python Crawler related posts
- How to scrape Google Maps data using Python and Crawlee
- How to scrape Google search results with Python
- We're losing our digital history. Can the Internet Archive save it?
- How to scrape infinite scrolling webpages with Python
- Current problems and mistakes of web scraping in Python and tricks to solve them!
- Show HN: Crawlee for Python – a web scraping and browser automation library
- How to download a copy of a website using Wget
Index
What are some of the best open-source Crawler projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 54,144 |
2 | newspaper | 14,353 |
3 | Photon | 11,315 |
4 | Douyin_TikTok_Download_API | 10,936 |
5 | autoscraper | 6,634 |
6 | scrapy-redis | 5,570 |
7 | crawlee-python | 5,287 |
8 | myGPTReader | 4,438 |
9 | trafilatura | 3,933 |
10 | ProxyBroker | 3,920 |
11 | weibo-crawler | 3,642 |
12 | toapi | 3,511 |
13 | TorBot | 3,113 |
14 | Grab | 2,400 |
15 | news-please | 2,150 |
16 | Leaked-GPTs | 2,129 |
17 | PSpider | 1,829 |
18 | grab-site | 1,443 |
19 | OpenWPM | 1,352 |
20 | mlscraper | 1,339 |
21 | XSRFProbe | 1,175 |
22 | scrapyrt | 849 |
23 | google-play-scraper | 818 |