Python Web Crawling

Open-source Python projects categorized as Web Crawling

Top 23 Python Web Crawling Projects

  1. Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Scrapy needs to have sane defaults that do no harm | news.ycombinator.com | 2025-04-23

  2. requests-html

    Pythonic HTML Parsing for Humans™

  3. portia

    Visual scraping for Scrapy

  4. crawlee-python

    Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Project mention: How to scrape TikTok using Python | dev.to | 2025-04-30

    Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can find answers to these and many other questions by analyzing TikTok data. However, for analysis, you need to extract the data in a convenient format. In this blog, we'll explore how to scrape TikTok using Crawlee for Python.

  5. MechanicalSoup

    A Python library for automating interaction with websites.

    Project mention: 11 best open-source web crawlers and scrapers in 2024 | dev.to | 2024-10-29

    Language: Python | GitHub: 4.7K+ stars

  6. RoboBrowser

  7. Grab

    Web Scraping Framework

  8. feedparser

    Parse feeds in Python

    Project mention: What I Wish Someone Told Me About Postgres | news.ycombinator.com | 2024-11-12

    I am using the feedparser library in Python (https://github.com/kurtmckee/feedparser/), which basically takes an RSS URL and standardizes it to a reasonable extent. But I have noticed that different websites still get parsed slightly differently. For example, look at how https://beincrypto.com/feed/ has a long description (containing actual HTML) inside, but https://www.coindesk.com/arc/outboundfeeds/rss/ completely cuts the description out. I have about 50 such websites and they all have slight variations. So you are saying that, in addition to the parsed data (title, summary, content, author, pubdate, link, guid) that I currently store, I should also add an XML column and store the raw XML from each URL until I get a good hang of how each site differs?

  9. gain

    Web crawling framework based on asyncio.

  10. botasaurus

    The All in One Framework to Build Undefeatable Scrapers

    Project mention: 12 tips on how to think like a web scraping expert | dev.to | 2024-11-04

    I especially recommend paying attention to curl_cffi and frameworks botasaurus and Crawlee for Python.

  11. PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

  12. cola

    A high-level distributed crawling framework.

  13. Sukhoi

    Minimalist and powerful Web Crawler.

  14. google-search-results-python

    Google Search Results via SERP API pip Python Package

  15. MSpider

    Spider

  16. spidy Web Crawler

    The simple, easy to use command line web crawler.

  17. Crawley

    Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.

  18. brownant

    Brownant is a web data extracting framework.

  19. Demiurge

    PyQuery-based scraping micro-framework.

  20. Pomp

    Screen scraping and web crawling framework

  21. FastImage

    Python library that finds the size / type of an image given its URI by fetching as little as needed (by bmuller)

  22. microwler

    A micro-framework for asynchronous deep crawls and web scraping with Python

  23. Mariner

    This is a mirror of the GitLab repository. Open your issues and pull requests there. (by radek-sprta)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Web Crawling related posts

  • Scrapy needs to have sane defaults that do no harm

    1 project | news.ycombinator.com | 23 Apr 2025
  • Top 10 Tools for Efficient Web Scraping in 2025

    2 projects | dev.to | 16 Jan 2025
  • How to scrape Crunchbase using Python in 2024 (Easy Guide)

    5 projects | dev.to | 15 Jan 2025
  • Current problems and mistakes of web scraping in Python and tricks to solve them!

    21 projects | dev.to | 22 Aug 2024
  • Scrapy, a fast high-level web crawling and scraping framework for Python

    1 project | news.ycombinator.com | 19 Aug 2024
  • Automate Spider Creation in Scrapy with Jinja2 and JSON

    2 projects | dev.to | 27 Jul 2024
  • Show HN: Crawlee for Python – a web scraping and browser automation library

    5 projects | news.ycombinator.com | 9 Jul 2024

Index

What are some of the best open-source Web Crawling projects in Python? This list will help you:

#    Project                        Stars
1    Scrapy                         55,119
2    requests-html                  13,806
3    portia                          9,398
4    crawlee-python                  5,638
5    MechanicalSoup                  4,752
6    RoboBrowser                     3,704
7    Grab                            2,402
8    feedparser                      2,113
9    gain                            2,039
10   botasaurus                      1,889
11   PSpider                         1,833
12   cola                            1,505
13   Sukhoi                            880
14   google-search-results-python      671
15   MSpider                           347
16   spidy Web Crawler                 346
17   Crawley                           189
18   brownant                          159
19   Demiurge                          116
20   Pomp                               59
21   FastImage                          28
22   microwler                          13
23   Mariner                             2
