Top 23 Python Scraping Projects
-
Project mention: Scrapy needs to have sane defaults that do no harm | news.ycombinator.com | 2025-04-23
-
Jobs_Applier_AI_Agent_AIHawk
AIHawk aims to ease the job hunt by automating the job application process. Using artificial intelligence, it enables users to apply for multiple jobs in a tailored way.
-
Project mention: Unit Testing Without Tears: How CodeBeaver Turned Testing from 'pytest run pain' to 'git push joy' 🚀 | dev.to | 2025-02-10
Let's face it, fellow devs - we'd rather debug a production outage at 3 AM than write unit tests. Okay, maybe not that extreme, but you get the point! 😅 Today, I'm going to share how the folks at ScrapegraphAI (18k stars!) solved their testing woes with a solution so smooth, it's like they found a cheat code for the matrix.
-
undetected-chromedriver
Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
Failed Proof-of-Origin checks: The package claimed to be associated with the GitHub repository undetected-chromedriver, but other packages held stronger claims to this repository.
-
crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can find answers to these and many other questions by analyzing TikTok data. However, for analysis, you need to extract the data in a convenient format. In this blog, we'll explore how to scrape TikTok using Crawlee for Python.
-
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
-
Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE
Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!
-
twikit
Twitter API Scraper | Without an API key | Twitter Internal API | Free | Twitter scraper | Twitter Bot
Project mention: Show HN: I made a free tool that analyzes SEC filings and posts detailed reports | news.ycombinator.com | 2025-04-14
Unpopular opinion here... If you tread carefully you'll most likely not succeed. I am not American and I know you guys like to sue each other for putting cats in microwaves and stuff, so maybe this is not a great opinion to have in America at the current moment.
I would go for it and put a disclaimer, or I would just incorporate in a country where there's no issues with these things.
All of this makes it hard, of course, to provide good value, but it is worthwhile.
Twitter's cost is insane right now. I had quite a few ideas for Twitter integrations, but they would easily cost thousands per month just to access their API.
I looked into https://github.com/d60/twikit - might not be suitable but you can definitely play around with it. Just don't use your official account as I got shadow banned using it unfortunately.
-
I actually built my own Playwright screenshotting software with this idea in mind too: https://shot-scraper.datasette.io/ - I wrote about using that for my project documentation here: https://simonwillison.net/2022/Oct/14/automating-screenshots...
Really it comes down to the team you are working with. If you have user-facing documentation authors who are happy with Markdown and Git you can probably get this to work.
-
cloudproxy
Hide your scraper's IP behind the cloud. Provision proxy servers across different cloud providers to improve your scraping success.
-
Yes, Parsel is open-source and freely available under the BSD license. You can find its source code on Parsel's GitHub page.
-
Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...
-
Project mention: llm-scraper VS parsera - a user suggested alternative | libhunt.com/r/llm-scraper | 2024-10-16
-
Python Scraping related posts
- Scrapy needs to have sane defaults that do no harm
- How to scrape Bluesky with Python
- AgentQL MCP Server: Structured Web Data for Claude, Cursor, Windsurf, and more
- Unit Testing Without Tears: How CodeBeaver Turned Testing from 'pytest run pain' to 'git push joy' 🚀
- Revolutionizing the Search with ScrapeGraphAI
- How to scrape Crunchbase using Python in 2024 (Easy Guide)
- Show HN: Equip Your Agents with Powerful Web Scraping Tools
-
A note from our sponsor - SaaSHub
www.saashub.com | 17 May 2025
Index
What are some of the best open-source Scraping projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 55,206 |
2 | Jobs_Applier_AI_Agent_AIHawk | 28,141 |
3 | Scrapegraph-ai | 19,567 |
4 | requests-html | 13,806 |
5 | undetected-chromedriver | 11,150 |
6 | autoscraper | 6,758 |
7 | crawlee-python | 5,638 |
8 | trafilatura | 4,235 |
9 | fake-useragent | 3,896 |
10 | snoop | 3,338 |
11 | Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE | 3,205 |
12 | facebook-scraper | 2,653 |
13 | twikit | 2,586 |
14 | Grab | 2,402 |
15 | shot-scraper | 1,929 |
16 | cloudproxy | 1,462 |
17 | mlscraper | 1,342 |
18 | DataEngineeringProject | 1,237 |
19 | parsel | 1,225 |
20 | Scweet | 1,153 |
21 | parsera | 1,081 |
22 | mov-cli | 910 |
23 | hrequests | 870 |