Python Spider

Open-source Python projects categorized as Spider

Top 23 Python Spider Projects

  • Photon

    Incredibly fast crawler designed for OSINT. (by s0md3v)

  • InfoSpider

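    INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, aiming to help users retrieve their own data safely and quickly. The tool's code is open source and its process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.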

  • toapi

    Every web site provides APIs.

  • Gerapy

    Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

  • scrapydweb

    Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.

    Project mention: Best scrapydweb fork | /r/scrapy | 2022-11-16

    It seems like there are a lot of more recently updated forks: https://github.com/my8100/scrapydweb/network

  • SpiderKeeper

    Admin UI for Scrapy; an open-source Scrapinghub

  • Grab

    Web Scraping Framework

  • TorBot

    Dark Web OSINT Tool

  • PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

  • grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    Project mention: struggling to download websites | /r/DataHoarder | 2023-05-15

    You can use grab-site with --no-offsite-links and --igsets=mediawiki.
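    The suggestion above can be scripted. This is a minimal sketch, assuming grab-site is installed and on the PATH; the wiki URL and the helper name build_grab_site_command are hypothetical, and only the two options quoted in the mention are used.

```python
import shlex


def build_grab_site_command(url: str) -> list[str]:
    """Return the grab-site argument list using the two options quoted above."""
    return [
        "grab-site",
        "--no-offsite-links",   # option quoted in the mention
        "--igsets=mediawiki",   # ignore set quoted in the mention
        url,                    # hypothetical target site
    ]


if __name__ == "__main__":
    command = build_grab_site_command("https://wiki.example.org/")
    # Print the command line so it can be reviewed before running it.
    print(shlex.join(command))
    # To start the crawl, pass the list to subprocess.run(command, check=True).
```

    Building the argument list separately from running it makes the command easy to inspect or log before a long crawl starts.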

  • XSRFProbe

    The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.

  • freshonions-torscraper

    Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion

  • alltheplaces

    A set of spiders and scrapers to extract location information from places that post their location on the internet.

    Project mention: An aggressive, stealthy web spider operating from Microsoft IP space | news.ycombinator.com | 2023-01-18

    I recently made some contributions to https://github.com/alltheplaces/alltheplaces which is a set of Scrapy spiders for extracting locations of primarily franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now in shock at how hostile a lot of businesses are with allowing people the privilege of accessing their websites to find store locations, store information or to purchase an item.

    Some examples of what I found you can't do at perhaps 20% of business websites:

    * Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).

    * Plan a holiday figuring out when a shop closes for the day in another country.

    * Order something from an Internet connection in another country to arrive when you return home.

    * Look up product information whilst on a work trip overseas.

    * Access the website via a mandated workplace proxy server that happens to be hosted in a data centre (the likes of AWS and Azure included) and is blocked simply because it's not a residential ISP. A more complex check has the website verify that the time taken for a request from JavaScript code executing on the client to reach the server matches the time the server thinks it should have taken (by ICMP ping or whatever to the origin IP address).

    * Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.

    * Find the business on a price comparison website.

    * Find the business on a search engine that isn't Google.

    * Access the website without first allowing obfuscated JavaScript to execute (example: [1]).

    * Use the website if you have certain disabilities.

    * Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).

    The answer to me appears obvious: store and product information is for the most part static and can be stored in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until their next update. JSON files can be advertised freely to the public, so anyone trying to obtain information from the website could do so with a single request for a small file, with as little impact as possible.

    Of course, none of the anti-bot measures implemented by websites actually stop the bots they're most seeking to block. There are services specifically designed for scraping websites that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and it's trivial with software such as selenium-stealth to make each request look legitimate from a JavaScript perspective, as if the request was from a Chrome browser running on Windows 10 with a single 2160p screen, etc. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose, because the website will end up blocking legitimate customers and will have extremely poor performance that is worsened by forcing bots to make 10x the number of requests just to look legitimate or pass client tests.

    [1] https://www.imperva.com/products/web-application-firewall-wa...
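    The static-file idea in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name write_store_feed, the output path public/stores.json, and the sample store fields are all hypothetical and do not come from alltheplaces or any website mentioned here.

```python
import json
import os
import tempfile
from pathlib import Path


def write_store_feed(stores: list[dict], target: Path) -> None:
    """Rewrite a static JSON file with the current store data."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it over
    # the target so readers never see a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"stores": stores}, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, target)


if __name__ == "__main__":
    # Hypothetical sample data; a real job would load this from the
    # business's own store database on a regular schedule.
    sample = [{"name": "Example Store", "city": "Paris", "closes": "19:00"}]
    write_store_feed(sample, Path("public/stores.json"))
```

    A scheduled job could call write_store_feed on each cycle, and the resulting file could then be served like any other static asset.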

  • hacker-news-digest

    Let ChatGPT summarize Hacker News for you

    Project mention: Show HN: Hacker News Summary – Let ChatGPT Summarize Hacker News for You | /r/patient_hackernews | 2023-06-09

  • LinkedInDumper

    Python 3 script to dump company employees from LinkedIn API

    Project mention: OSINT: Enumerating Employees on LinkedIn and Xing | /r/redteamsec | 2023-02-16

  • estela

    estela, an elastic web scraping cluster 🕸

    Project mention: Struggling to scrape specific website - any advice? | /r/webscraping | 2023-04-04

    This solution uses requests; you can also do this in Scrapy, and if you are planning to run more crawlers you can use estela, which is a spider management solution.

  • webspot

    An intelligent web service to automatically detect web content and extract information from it.

    Project mention: Automatic Web Content Detection | news.ycombinator.com | 2023-03-23

  • telegram-groups-crawler

    A Telegram crawler made in Python to automatically search groups and channels and collect any type of data from them (+ dataset included).

  • scrapeops-scrapy-sdk

    Scrapy extension that gives you all the scraping monitoring, alerting, scheduling, and data validation you will need straight out of the box.

    Project mention: Free Python Scrapy 5-Part Mini Course | /r/scrapy | 2022-10-20

    Part 5: Deployment, Scheduling & Running Jobs - Deploying our spider on a server, and monitoring and scheduling jobs via ScrapeOps.

  • XingDumper

    Python 3 script to dump company employees from XING API

    Project mention: OSINT: Enumerating Employees on LinkedIn and Xing | /r/redteamsec | 2023-02-16

  • amazon_price_tracker

    A cool Scrapy spider that notifies you of a price drop on a product you want to buy!

  • scrapy-yle-kuntavaalit2021

    Fetch YLE kuntavaalit 2021 data

  • python-selenium

    (2021) Automate web browsing with Chrome.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-06-09.

Index

What are some of the best open-source Spider projects in Python? This list will help you:

 #  Project                      Stars
 1  Photon                       9,686
 2  InfoSpider                   6,663
 3  toapi                        3,388
 4  Gerapy                       2,993
 5  scrapydweb                   2,715
 6  SpiderKeeper                 2,644
 7  Grab                         2,287
 8  TorBot                       1,983
 9  PSpider                      1,771
10  grab-site                    1,020
11  XSRFProbe                      857
12  freshonions-torscraper         447
13  alltheplaces                   394
14  hacker-news-digest             324
15  LinkedInDumper                 171
16  estela                         117
17  webspot                         55
18  telegram-groups-crawler         50
19  scrapeops-scrapy-sdk            20
20  XingDumper                      17
21  amazon_price_tracker             6
22  scrapy-yle-kuntavaalit2021       2
23  python-selenium                  2