Python Crawler

Open-source Python projects categorized as Crawler

Top 23 Python Crawler Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: How I used Scrapy for my ML Project | dev.to | 2023-01-27

    I wanted to invest my time and energy in learning the fastest, most efficient one, one that can scale with me as my projects get more and more complex: Scrapy. After all, I want my projects to shine so bright on my CV that they blind the recruiter's eyes.

  • pyspider

    A Powerful Spider(Web Crawler) System in Python.

  • newspaper

    News, full-text, and article metadata extraction in Python 3. Advanced docs:

    Project mention: Gathering News Headlines | reddit.com/r/Automate | 2022-08-08

  • Photon

    Incredibly fast crawler designed for OSINT. (by s0md3v)

  • scrapy-redis

    Redis-based components for Scrapy.

    Project mention: Ask HN: What are the best tools for web scraping in 2022? | news.ycombinator.com | 2022-08-10

    With some work, you can use Scrapy for distributed projects that are scraping thousands (millions) of domains. We are using https://github.com/rmax/scrapy-redis.

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: A Smart, Automatic, Fast and Lightweight Web Scraper for Python | reddit.com/r/webdev | 2022-12-02

  • toapi

    Every web site provides APIs.

  • Grab

    Web Scraping Framework

  • weibo-crawler

    Sina Weibo crawler: uses Python to crawl Sina Weibo data and download Weibo images and videos

    Project mention: Urgent help needed 🆘 backing up @我能抱起120斤 | reddit.com/r/DoubanGoosegroup | 2022-07-15

  • TorBot

    Dark Web OSINT Tool

    Project mention: List of resources | reddit.com/r/HackForUkraine | 2022-02-27
  • PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

  • news-please

    news-please - an integrated web crawler and information extractor for news that just works

    Project mention: Data extraction from news media outlets? | reddit.com/r/datasets | 2022-10-13

    Look at news-please; you can find it on GitHub. I did something similar and it was very helpful. You can hit me up if you have any questions.

  • OpenWPM

    A web privacy measurement framework

  • grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    Project mention: How are you archiving websites you visit? | reddit.com/r/selfhosted | 2023-01-07

    After a lot of searching for a similar topic, this is a tool I found which works pretty well: https://github.com/ArchiveTeam/grab-site

  • mlscraper

    🤖 Scrape data from HTML websites automatically by just providing examples

    Project mention: Pre-trained Webscraping Models | reddit.com/r/webscraping | 2022-12-04

  • XSRFProbe

    The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.

  • scrapyrt

    HTTP API for Scrapy spiders

    Project mention: New to python and scrapy stuff but need this project to work so that I can do my data research and stuff easily in the future. | reddit.com/r/scrapy | 2022-04-19

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

    Project mention: Testing fast installation in tear-down environment | reddit.com/r/learnpython | 2022-07-06

    I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura

  • bookcorpus

    Crawl BookCorpus

    Project mention: Can chat GPT overtake Google if they play their cards right? | reddit.com/r/Futurology | 2022-12-23

  • dotcommon

    What do people have in their dotfiles?

    Project mention: Dotcommon – common aliases and plugins on most updated GitHub repositories | news.ycombinator.com | 2022-08-13

  • google-play-scraper

    Google play scraper for Python inspired by <facundoolano/google-play-scraper> (by JoMingyu)

    Project mention: Report: Analysis of 2.9 millions apps on Google Play | reddit.com/r/androiddev | 2022-11-08

    It's easy. Python library: google-play-scraper.

  • freshonions-torscraper

    Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion

  • Moodle-Downloader-2

    A Moodle downloader that downloads course content fast from Moodle (e.g. lecture PDFs)

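The scrapy-redis mention in the list above says Scrapy can be used for distributed crawls "with some work". As a rough sketch of what that work might look like, the fragment below shows settings of the kind described in the scrapy-redis README for a Scrapy project's settings.py. The Redis URL is a placeholder assumption, and the values should be checked against the version of scrapy-redis you install.

```python
# settings.py fragment (a sketch; verify against the scrapy-redis docs for your version)

# Store the request queue in Redis so several crawler processes can share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all processes through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dedup set between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumption: a Redis server reachable locally on the default port.
REDIS_URL = "redis://localhost:6379"
```

With settings along these lines, each worker process runs the same spider and pulls requests from the shared Redis queue instead of keeping its own in-memory queue.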

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-01-27.

Index

What are some of the best open-source Crawler projects in Python? This list will help you:

#   Project                  Stars
1   Scrapy                  45,702
2   pyspider                15,713
3   newspaper               12,386
4   Photon                   9,334
5   scrapy-redis             5,227
6   autoscraper              4,873
7   toapi                    3,356
8   Grab                     2,257
9   weibo-crawler            2,159
10  TorBot                   1,805
11  PSpider                  1,732
12  news-please              1,489
13  OpenWPM                  1,248
14  grab-site                  959
15  mlscraper                  815
16  XSRFProbe                  791
17  scrapyrt                   758
18  trafilatura                730
19  bookcorpus                 594
20  dotcommon                  591
21  google-play-scraper        490
22  freshonions-torscraper     439
23  Moodle-Downloader-2        309