Top 23 Python Web Crawling Projects
Scrapy
Project mention: Scrapy needs to have sane defaults that do no harm | news.ycombinator.com | 2025-04-23
crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can find answers to these and many other questions by analyzing TikTok data. However, for analysis, you need to extract the data in a convenient format. In this blog, we'll explore how to scrape TikTok using Crawlee for Python.
Language: Python | GitHub: 4.7K+ stars
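To make the entry concrete, here is a minimal sketch of a Crawlee for Python crawler, closely following the project's quick-start pattern (import paths have shifted between releases; recent versions expose the crawler classes under `crawlee.crawlers`):

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Cap the crawl so the sketch terminates quickly.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Store scraped fields in the default dataset.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })
        # Follow links found on the page, within the request limit.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

The same router/handler structure carries over when you swap in the Playwright-based crawler for JavaScript-heavy targets such as TikTok.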
feedparser
I am using the feedparser library in Python (https://github.com/kurtmckee/feedparser/), which takes an RSS URL and standardizes it to a reasonable extent. But I have noticed that different websites still get parsed slightly differently. For example, https://beincrypto.com/feed/ has a long description (containing actual HTML) inside, while https://www.coindesk.com/arc/outboundfeeds/rss/ cuts the description out entirely. I have about 50 such websites, and they all have slight variations. So you are saying that in addition to the parsed fields I currently store (title, summary, content, author, pubdate, link, guid), I should also add an XML column and store the raw feed from each URL until I get a good hang of how each site differs?
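A minimal sketch of that suggestion, keeping the raw feed document next to the parsed fields (SQLite and the requests fetch are illustrative choices, not part of the original comment):

```python
import sqlite3

import feedparser
import requests  # illustrative: fetch the feed ourselves so we keep the raw XML

FEED_URL = 'https://beincrypto.com/feed/'

# Fetch once, then parse the same bytes, so raw and parsed data always match.
raw_xml = requests.get(FEED_URL, timeout=30).text
feed = feedparser.parse(raw_xml)

conn = sqlite3.connect('feeds.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS entries (
        guid TEXT PRIMARY KEY,
        title TEXT, summary TEXT, author TEXT,
        published TEXT, link TEXT,
        raw_xml TEXT  -- full feed document, kept for later re-parsing
    )
""")
for entry in feed.entries:
    conn.execute(
        'INSERT OR IGNORE INTO entries VALUES (?, ?, ?, ?, ?, ?, ?)',
        (
            entry.get('id'),  # feedparser maps <guid> to 'id'
            entry.get('title'),
            entry.get('summary'),
            entry.get('author'),
            entry.get('published'),
            entry.get('link'),
            raw_xml,
        ),
    )
conn.commit()
```

Storing the whole document per entry is redundant, but it lets you re-run improved parsing over historical feeds once you understand each site's quirks.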
I especially recommend paying attention to curl_cffi and the frameworks botasaurus and Crawlee for Python.
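For context, curl_cffi's main draw is browser impersonation: it mimics a real browser's TLS and HTTP/2 fingerprint, which gets past blocks that reject plain Python HTTP clients. A minimal sketch (the endpoint is just an illustrative way to inspect the fingerprint you present):

```python
from curl_cffi import requests

# impersonate='chrome' makes the TLS/JA3 and HTTP/2 fingerprints look like
# a recent Chrome build instead of a generic Python HTTP client.
resp = requests.get(
    'https://tls.browserleaks.com/json',  # illustrative: echoes your TLS fingerprint
    impersonate='chrome',
)
print(resp.status_code)
print(resp.json())
```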
FastImage
Python library that finds the size / type of an image given its URI by fetching as little as needed (by bmuller)
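The trick behind libraries like this is that image dimensions sit in the first few bytes of the file, so a ranged request is enough. A rough sketch of the idea for PNG (not FastImage's actual API; the URL is hypothetical):

```python
import struct

import requests  # illustrative; FastImage does its own minimal fetching

resp = requests.get(
    'https://example.com/image.png',   # hypothetical image URL
    headers={'Range': 'bytes=0-31'},   # servers that honor Range send only these bytes
    timeout=10,
)
data = resp.content

# PNG layout: 8-byte signature, then the IHDR chunk, whose width and height
# are big-endian 32-bit integers at byte offsets 16 and 20.
if data[:8] == b'\x89PNG\r\n\x1a\n':
    width, height = struct.unpack('>II', data[16:24])
    print(f'PNG {width}x{height}')
```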
Mariner
This is a mirror of a GitLab repository. Open your issues and pull requests there. (by radek-sprta)
Python Web Crawling related posts
- Scrapy needs to have sane defaults that do no harm
- Top 10 Tools for Efficient Web Scraping in 2025
- How to scrape Crunchbase using Python in 2024 (Easy Guide)
- Current problems and mistakes of web scraping in Python and tricks to solve them!
- Scrapy, a fast high-level web crawling and scraping framework for Python
- Automate Spider Creation in Scrapy with Jinja2 and JSON
- Show HN: Crawlee for Python – a web scraping and browser automation library
Index
What are some of the best open-source Web Crawling projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 55,119 |
2 | requests-html | 13,806 |
3 | portia | 9,398 |
4 | crawlee-python | 5,638 |
5 | MechanicalSoup | 4,752 |
6 | RoboBrowser | 3,704 |
7 | Grab | 2,402 |
8 | feedparser | 2,113 |
9 | gain | 2,039 |
10 | botasaurus | 1,889 |
11 | PSpider | 1,833 |
12 | cola | 1,505 |
13 | Sukhoi | 880 |
14 | google-search-results-python | 671 |
15 | MSpider | 347 |
16 | spidy Web Crawler | 346 |
17 | Crawley | 189 |
18 | brownant | 159 |
19 | Demiurge | 116 |
20 | Pomp | 59 |
21 | FastImage | 28 |
22 | microwler | 13 |
23 | Mariner | 2 |