Python Spider

Open-source Python projects categorized as Spider

Top 23 Python Spider Projects

  • Photon

    Incredibly fast crawler designed for OSINT. (by s0md3v)

  • Douyin_TikTok_Download_API

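    🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls and online batch parsing and downloading.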

  • InfoSpider

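    INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, designed to help users retrieve their own data safely and quickly. The tool's code is open source and its workflow is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.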

  • toapi

    Every web site provides APIs.

  • Gerapy

    Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

  • scrapydweb

    Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.

  • TorBot

    Dark Web OSINT Tool

  • SpiderKeeper

    Admin UI for Scrapy; an open-source alternative to Scrapinghub

  • Grab

    Web Scraping Framework

  • PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

  • grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    Project mention: How to download a copy of a website using Wget | news.ycombinator.com | 2024-06-07
  • XSRFProbe

    The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.

  • hacker-news-digest

    📰 Let ChatGPT Summarize Hacker News for You

    Project mention: What's the fun in writing on the internet anymore? | news.ycombinator.com | 2024-02-17

    At https://hackernews.betacat.io/ they use ChatGPT to summarize HN front-page stories, and it says "Article discusses automated plagiarism and the diminishing value of authorship online. It compares today's internet to ancient texts, where authorship was less defined."

  • alltheplaces

    A set of spiders and scrapers to extract location information from places that post their location on the internet.

    Project mention: Differentiating between hypermarkets and supermarkets. | /r/openstreetmap | 2023-12-09

    Maybe a different approach? https://www.alltheplaces.xyz/ has stores grouped by name

  • freshonions-torscraper

    Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion

  • LinkedInDumper

    Python 3 script to dump/scrape/extract company employees from LinkedIn API

  • linkedIn-scraper

    A Playwright bot that scrapes LinkedIn and stores advertisement data in a database and a Telegram channel

  • estela

    estela, an elastic web scraping cluster 🕸

  • graphinder

    🕸️ Blazing fast GraphQL endpoints finder using subdomain enumeration, scripts analysis and bruteforce. 🕸️

    Project mention: Launch HN: Escape (YC W23) – Discover and secure all your APIs | news.ycombinator.com | 2024-02-01

    When I met Antoine, who had previously been a security engineer at NATO and Apple, we decided to tackle this issue together and create a modern security tool that would appeal to both developers and security people. It needed to be fast, easy to set up yet configurable, have outstanding support for securing APIs, and find what was relevant with a low false positive rate.

    The first step was to show security engineers and developers what APIs they had to secure. We needed to find an easy way to discover any organization’s exposed and internal APIs.

    To discover all APIs, we crafted a system that extracts all the API routes the organization exposes by scanning its domains, frontend websites, and SPAs. It then enriches this data by connecting to code repositories, API gateways, and API development tools to create a full list of all the exposed endpoints and the sensitivity of the data they handle. Other testing tools do not provide an inventory of all the API routes exposed by an organization, but as we mentioned above, the biggest problem security engineers face is often just finding out what it is they need to test!

    Then, we needed to provide security engineers and developers with a list of security issues in their APIs.

    Since APIs act as a business model layer, most of the critical security issues lie in the business processes underlying APIs. In security, issues obtained from breaking business processes are called Broken Object Level Authorization (BOLA), Broken Function Level Authorization (BFLA), and Broken Object Property Level Authorization (BOPLA).

    To find them, we knew we couldn’t rely on traditional techniques like fuzzing. We needed to find a way to model the Business Process underlying the API and attempt to break it.

    Doing research on this topic, we discovered that modeling API business processes in a similar way to board games, like Chess or Go, worked surprisingly well. The underlying reason is simple: a board game is a state machine on which you can execute actions that must respect rules to change the game’s state. Think about moving the pieces in a chess game: each piece has its specific moves, and the pieces' positions on the board represent the state.

    APIs are similar: they have a database, which represents the internal state, and API routes, which represent the actions you can run on the state. Of course, most APIs are more complex than a chess game because they have many more routes than there are chess pieces. In mathematics, we would say that the action space is much larger.

    But the models are similar enough for us to try applying alpha-beta pruning, Monte Carlo Tree Search, and more advanced Machine Learning techniques that have proven to work well in the context of large action space games like Go.

    Those were the foundational ideas behind our in-house algorithm, Feedback-Driven API Exploration (FDAE), which automatically identifies the underlying business processes and generates sequences of API requests especially aimed at breaking them, uncovering potential security flaws and data leaks.
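
A minimal sketch of the board-game analogy described above, under stated assumptions: it is not Escape's implementation, and the two routes (`POST /user`, `DELETE /user/:id`), the `State` and `Action` types, and every function name are hypothetical. The "database" is a set of user ids (the state), the routes are the actions, and the rules decide which actions are legal in a given state.

```python
from typing import FrozenSet, List, Set, Tuple

State = FrozenSet[int]    # ids of the users currently stored (the "board")
Action = Tuple[str, int]  # (hypothetical route name, user id argument)


def legal_actions(state: State) -> List[Action]:
    """Actions the rules allow in this state, like legal moves in chess."""
    new_id = max(state, default=0) + 1
    actions = [("POST /user", new_id)]
    # A user can only be deleted if it currently exists.
    actions += [("DELETE /user/:id", uid) for uid in sorted(state)]
    return actions


def apply_action(state: State, action: Action) -> State:
    """Return the new state produced by running one action."""
    route, uid = action
    if route == "POST /user":
        return state | {uid}
    if route == "DELETE /user/:id":
        if uid not in state:
            raise ValueError("cannot delete a user that does not exist")
        return state - {uid}
    raise ValueError(f"unknown route: {route}")


def reachable_states(start: State, depth: int) -> Set[State]:
    """Breadth-first enumeration of states reachable within `depth` actions."""
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for state in frontier:
            for action in legal_actions(state):
                new_state = apply_action(state, action)
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier.append(new_state)
        frontier = next_frontier
    return seen
```

Exhaustive breadth-first enumeration like this only works for a toy action space; the post's point is that real APIs have far more routes, which is why it mentions search techniques borrowed from games such as Go.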

    FDAE starts by ingesting the list of routes and parameters in an API. It first identifies the routes leading to sensitive data, like PII or financial information, and the parameters that have the most chances of being vulnerable to various kinds of injections and attacks.

    Often, those routes require parameters like UUIDs or domain-specific values. That’s where traditional security scanners fall short: they often fuzz the parameters randomly, hoping to find some low-hanging fruit injection, but end up blocked at the data validation layer.

    FDAE is smarter. If it detects that the route /user/:uuid might be sensitive, it will first look at all the other routes in the API and try to find one that returns a valid user UUID. Once it gets the valid user UUID, it will use it to trigger the /user/:uuid route and try to exploit it in many different ways.

    If there are no existing users in the database, but there is a route to create one, Escape’s FDAE will even be able to create a user, get its UUID, and then attempt exploiting the routes that require a user UUID.

    This process, very similar to what human penetration testers and bug hunters do, allows Escape to do extensive and deep testing of any API and its business processes. It’s particularly good at finding many access control bugs like tenant isolation problems, complex multi-step injections, and request forgeries.
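
The lookup-then-probe step described above can be illustrated with a small sketch. This is an assumption-laden toy, not FDAE itself: `ToyApi` is a hypothetical in-memory stand-in for a real HTTP API, and the routes noted in the comments, along with the helper names, are invented for illustration.

```python
import uuid


class ToyApi:
    """Hypothetical in-memory stand-in for a real HTTP API."""

    def __init__(self):
        self.users = {}

    def list_users(self):
        # Stands in for a route such as GET /users that returns user UUIDs.
        return list(self.users)

    def create_user(self, name):
        # Stands in for a route such as POST /users.
        user_id = str(uuid.uuid4())
        self.users[user_id] = {"name": name}
        return user_id

    def get_user(self, user_id):
        # Stands in for the sensitive route /user/:uuid.
        return self.users.get(user_id)


def find_valid_user_id(api):
    """Obtain a UUID the target route will accept instead of guessing one."""
    existing = api.list_users()
    if existing:
        return existing[0]
    # No users yet: create one so there is a valid UUID to work with.
    return api.create_user("probe-user")


def probe_user_route(api):
    """Call the sensitive route with a valid UUID and return what it exposes."""
    user_id = find_valid_user_id(api)
    return user_id, api.get_user(user_id)
```

By contrast, a randomly fuzzed identifier such as `api.get_user("not-a-uuid")` returns `None` here, which mirrors how random fuzzing tends to stall at the validation layer without ever reaching the logic behind the route.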

    To give a specific example, imagine you’re building an e-commerce application: Escape can detect cases where users can bypass payment steps or modify input parameters in the request to access other users' orders or private information.

    You can find a more detailed explanation of how Feedback Driven API Exploration works with graphics here: https://escape.tech/blog/feedback-driven-api-exploration/

    Escape’s entire scanning process takes minutes. It was very important to us, as former developers, to seamlessly integrate API testing in CI/CD pipelines and quickly implement relevant fixes. To verify that it was scalable, we scanned all public APIs on the internet and produced research reports on their quality: the State of GraphQL Security (https://26857953.fs1.hubspotusercontent-eu1.net/hubfs/268579...), and the State of Public APIs (https://apirank.dev/state-of-public-api-2023/).

    Apart from discovering and testing APIs in minutes, we wanted to make Escape actionable. Pinpointing a problem is one thing, but then how to fix it? Most dynamic scanners give vague remediation instructions. Escape actually generates code snippets to help developers.

    We offer a few monthly and yearly subscription plans based on the number of APIs and developers in the org, with a free 7-day trial. The pricing is accessible in the app during the trial period. Since our product is highly technical, we wanted to make sure that users can explore our features, evaluate what Escape does, and understand its value before making a decision. Users can see pricing details at a point in their trial journey where it makes the most sense, aligning with their understanding of the product. You can try us without a credit card at https://escape.tech.

    Our main SaaS product is closed source, but we publish many open source packages for security and developers on https://github.com/Escape-Technologies/ , some of them being widely used like GraphQL Armor (https://github.com/Escape-Technologies/graphql-armor/)

    The number and complexity of APIs are constantly growing, and we’re continuing to learn every day, so we would greatly appreciate and are eager for your feedback (no matter how big or small)! Thanks!

  • telegram-groups-crawler

    A Telegram crawler made in Python to automatically search groups and channels and collect any type of data from them (+ dataset included).

  • webspot

    An intelligent web service to automatically detect web content and extract information from it.

  • scrapeops-scrapy-sdk

    Scrapy extension that gives you all the scraping monitoring, alerting, scheduling, and data validation you will need straight out of the box.

  • XingDumper

    Python 3 script to dump/scrape/extract company employees from XING API

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Spider related posts

  • How to download a copy of a website using Wget

    1 project | news.ycombinator.com | 7 Jun 2024
  • Differentiating between hypermarkets and supermarkets.

    1 project | /r/openstreetmap | 9 Dec 2023
  • Meta, Microsoft and Amazon team up on maps project

    1 project | news.ycombinator.com | 26 Jul 2023
  • Distribution of gross and net salaries on r/BESalary [OC]

    1 project | /r/BESalary | 1 Jul 2023
  • struggling to download websites

    1 project | /r/DataHoarder | 15 May 2023
  • Internet Archive Down, will be up and running soon (i hope).

    1 project | /r/DataHoarder | 22 Mar 2023
  • best tool for downloading forum posts in real-time?

    1 project | /r/DataHoarder | 22 Mar 2023

Index

What are some of the best open-source Spider projects in Python? This list will help you:

Rank Project Stars
1 Photon 10,684
2 Douyin_TikTok_Download_API 7,745
3 InfoSpider 7,667
4 toapi 3,488
5 Gerapy 3,268
6 scrapydweb 3,062
7 TorBot 2,734
8 SpiderKeeper 2,705
9 Grab 2,370
10 PSpider 1,822
11 grab-site 1,309
12 XSRFProbe 915
13 hacker-news-digest 672
14 alltheplaces 587
15 freshonions-torscraper 493
16 LinkedInDumper 376
17 linkedIn-scraper 212
18 estela 161
19 graphinder 118
20 telegram-groups-crawler 94
21 webspot 81
22 scrapeops-scrapy-sdk 37
23 XingDumper 34
