Python Crawling

Open-source Python projects categorized as Crawling

Top 18 Python Crawling Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16

  • newspaper

    News, full-text, and article metadata extraction in Python 3. Advanced docs:

  • Grab

    Web Scraping Framework

  • mlscraper

    🤖 Scrape data from HTML websites automatically by just providing examples

    Project mention: What are the best tools for web scraping and analysis of natural language to populate a dataset? | /r/datasets | 2023-04-12

    See if something like autoscraper or mlscraper suits your needs.

  • scrapyrt

    HTTP API for Scrapy spiders

  • isp-data-pollution

    ISP Data Pollution to Protect Private Browsing History with Obfuscation

  • spidermon

    Scrapy extension for monitoring spider execution.

  • botasaurus

    The All in One Framework to build Awesome Scrapers.

    Project mention: Reached 100 ⭐ for my Bot Development Framework. | /r/webscraping | 2023-07-04

    Do check it out at https://github.com/omkarcloud/bose

  • WarcDB

    WarcDB: Web crawl data as SQLite databases.

    Project mention: An Introduction to the WARC File | news.ycombinator.com | 2024-01-29

  • LinkedInDumper

    Python 3 script to dump/scrape/extract company employees from LinkedIn API

  • spidy Web Crawler

    The simple, easy to use command line web crawler.

  • telegram-crawler

    🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

  • estela

    estela, an elastic web scraping cluster 🕸

    Project mention: Struggling to scrape specific website - any advice? | /r/webscraping | 2023-04-04

    This solution uses requests; you can also do this in Scrapy, and if you plan to run more crawlers you can use estela, which is a spider management solution.

  • scrapfly-scrapers

    Web scrapers for popular targets, powered by Scrapfly.io

    Project mention: Real challenge for web scrapers- Need Help! | /r/webscraping | 2023-07-13

    Try https://scrapfly.io with JavaScript rendering enabled and see if it works. If it does, that means you can use proxies to scrape the site. Just so you know, their proxies are expensive but really fast. You get 1000 free credits to try it.

  • outlook-account-generator

    Outlook Account Generator helps you create Outlook accounts.

    Project mention: I made a Python bot that creates hundreds of Outlook Account's. Check the comments below for the GitHub! Do Star and the repository :) | /r/webscraping | 2023-07-02

    GitHub Link: https://github.com/omkarcloud/outlook-account-generator

  • sneakpeek

    Sneakpeek is a framework that helps to quickly and conveniently develop scrapers. It’s the best choice for scrapers that have some specific complex scraping logic that needs to be run on a constant basis (by flulemon)

    Project mention: Sneakpeek is a framework that helps to quickly and conveniently develop scrapers. It’s the best choice for scrapers that have some specific complex scraping logic that needs to be run on a constant basis | /r/CKsTechNews | 2023-04-30

  • XingDumper

    Python 3 script to dump/scrape/extract company employees from XING API

  • estela-cli

    estela Command Line Client 🕸

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-16.

Index

What are some of the best open-source Crawling projects in Python? This list will help you:

Rank  Project                     Stars
1     Scrapy                      50,198
2     newspaper                   13,515
3     Grab                        2,343
4     mlscraper                   1,198
5     scrapyrt                    810
6     isp-data-pollution          564
7     spidermon                   504
8     botasaurus                  403
9     WarcDB                      381
10    LinkedInDumper              341
11    spidy Web Crawler           320
12    telegram-crawler            218
13    estela                      146
14    scrapfly-scrapers           99
15    outlook-account-generator   87
16    sneakpeek                   34
17    XingDumper                  28
18    estela-cli                  4