Python Crawling

Open-source Python projects categorized as Crawling

Top 13 Python Crawling Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Show HN: SiteGPT – Create ChatGPT-like chatbots trained on your website content | news.ycombinator.com | 2023-04-01

    Not to go full "Dropbox in a weekend", but if you're technical enough to self-host, this is something you can build for yourself

    Everyone is going straight to embeddings, but it'd be easy enough to use old school NLP summarization from NLTK (https://www.nltk.org/)

    Hook that up to a web scraping library like https://scrapy.org/ and get a summary of each page.

    Then embed a site map in your system prompt and use langchain (https://github.com/hwchase17/langchain) to allow GPT to query for a specific page's summary.

    -

    The point of this isn't to say that's how OP did it, but there might be people seeing stuff like this and wondering how on earth to get into it: This is something you could build in a weekend with pretty much no understanding of AI
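The pipeline described in the comment above (summarize each crawled page, then put a site map of summaries into the prompt) could be sketched roughly as follows. This is only an illustration under stated assumptions: the frequency-based scorer is a standard-library stand-in for the NLTK summarization the comment mentions, the pages are assumed to be already fetched as plain text, and the names `summarize`, `build_site_map`, `STOPWORDS`, and the example URLs are hypothetical.

```python
import re
from collections import Counter

# Tiny stopword list for the sketch; a real setup would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "was", "is", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=1):
    """Extractive summary: keep the sentences whose words are most frequent."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


def build_site_map(pages):
    """One 'url: summary' line per page, suitable for pasting into a prompt."""
    return "\n".join(f"{url}: {summarize(text, 1)}" for url, text in pages.items())


# Hypothetical crawl output: URL -> extracted page text.
pages = {
    "https://example.com/docs": (
        "Scrapy crawls pages. Scrapy spiders follow links and Scrapy pipelines "
        "store pages. The weather was nice."
    ),
}
site_map = build_site_map(pages)
```

In a fuller version, the `pages` mapping would come from a crawler such as Scrapy, and the resulting site map would be placed in the system prompt so the model can ask for a specific page's summary.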

  • newspaper

    News, full-text, and article metadata extraction in Python 3. Advanced docs:

    Project mention: Gathering News Headlines | reddit.com/r/Automate | 2022-08-08

  • Grab

    Web Scraping Framework

  • mlscraper

    🤖 Scrape data from HTML websites automatically by just providing examples

    Project mention: Experimental library for scraping websites using OpenAI's GPT API | news.ycombinator.com | 2023-03-25

    Why GPT-based then? There are libraries that do this: You give examples, they generate the rules for you and give you a scraper object that takes any html and returns the scraped data.

    Mine: https://github.com/lorey/mlscraper
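The idea in the comment above (give an example value, get back a rule that extracts the same field from other pages) can be illustrated with a toy sketch. This is not mlscraper's API or algorithm; it uses only the standard library, learns a single (tag, class) rule from one example, and all names (`train_rule`, `scrape`, the sample pages) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the open-tag stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class _TextCollector(HTMLParser):
    """Record (tag, class attribute, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements as (tag, class) pairs
        self.nodes = []   # collected (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, css_class = self._stack[-1]
            self.nodes.append((tag, css_class, text))


def collect(html):
    """Parse HTML and return all (tag, class, text) triples."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def train_rule(html, example_value):
    """Return the (tag, class) pair of the element containing the example value."""
    for tag, css_class, text in collect(html):
        if text == example_value:
            return tag, css_class
    raise ValueError("example value not found in the sample page")


def scrape(html, rule):
    """Apply a learned rule to a new page; return the first match or None."""
    for tag, css_class, text in collect(html):
        if (tag, css_class) == rule:
            return text
    return None


sample_page = '<div><h1 class="name">Ada Lovelace</h1><p class="born">1815</p></div>'
new_page = '<div><h1 class="name">Alan Turing</h1><p class="born">1912</p></div>'

rule = train_rule(sample_page, "1815")  # learn where the birth year lives
result = scrape(new_page, rule)         # reuse the rule on an unseen page
```

A real example-based scraper would generalize across several samples and produce more robust selectors; the sketch only shows the train-then-apply shape the comment describes.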

  • scrapyrt

    HTTP API for Scrapy spiders

    Project mention: New to python and scrapy stuff but need this project to work so that I can do my data research and stuff easily in the future. | reddit.com/r/scrapy | 2022-04-19
  • isp-data-pollution

    ISP Data Pollution to Protect Private Browsing History with Obfuscation

    Project mention: A browser extension that sends periodical random search requests, from a large distributed list, to confuse the data broker industry. | reddit.com/r/CrazyIdeas | 2022-04-12

    I’ve seen multiple implementations of this idea. Here’s the first that came up in a quick search: https://github.com/essandess/isp-data-pollution/

  • spidermon

    Scrapy Extension for monitoring spiders execution.

  • WarcDB

    WarcDB: Web crawl data as SQLite databases.

    Project mention: GitHub - Florents-Tselai/WarcDB: WarcDB: Web crawl data as SQLite databases. | reddit.com/r/sqlite | 2022-06-23
  • spidy Web Crawler

    The simple, easy to use command line web crawler.

  • telegram-crawler

    🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

    Project mention: ☎️ Fragment is getting ready to sell virtual numbers | reddit.com/r/TONPlus | 2022-12-03

    According to the "Telegram Info", changes have been made to the source code of the Fragment platform, demonstrating the appearance of virtual numbers for sale soon!

  • estela

    estela, an elastic web scraping cluster 🕸

    Project mention: How to run webs scraping script every 15 minutes | reddit.com/r/webscraping | 2023-02-13

    You may want to check out [estela](https://estela.bitmaker.la/docs/), which is a spider management solution, developed by [Bitmaker](https://bitmaker.la) that allows you to run [Scrapy](https://scrapy.org) spiders.

  • XingDumper

    Python 3 script to dump company employees from XING API

    Project mention: OSINT: Enumerating Employees on LinkedIn and Xing | reddit.com/r/redteamsec | 2023-02-16
  • estela-cli

    estela Command Line Client 🕸

    Project mention: Deploying Scrapy Projects on the Cloud | reddit.com/r/webscraping | 2022-12-14

    We are currently running a closed beta of Bitmaker Cloud (free and unlimited). Bitmaker Cloud gives you easy management of scraping workloads via a web dashboard and API. Only Scrapy spiders are supported at the moment (additional languages/frameworks are on the roadmap). Bitmaker Cloud is powered by estela, an elastic web scraping cluster running on Kubernetes. estela is a modern alternative to proprietary platforms such as Scrapy Cloud, as well as OSS projects such as scrapyd. The source code of estela and estela-cli is available on GitHub.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-04-01.

Index

What are some of the best open-source Crawling projects in Python? This list will help you:

 #  Project              Stars
 1  Scrapy               46,621
 2  newspaper            12,585
 3  Grab                  2,268
 4  mlscraper               901
 5  scrapyrt                767
 6  isp-data-pollution      530
 7  spidermon               453
 8  WarcDB                  369
 9  spidy Web Crawler       306
10  telegram-crawler        129
11  estela                  110
12  XingDumper               16
13  estela-cli                3