Top 23 Python Spider Projects
InfoSpider
INFO-SPIDER is a crawler toolbox 🧰 that bundles many data sources in one place, designed to help users safely and quickly retrieve their own data; the code is open source and the process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Ali Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blog, OSChina blog, and Jianshu.
scrapydweb
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, auto packaging, timer tasks, monitoring & alerting, and a mobile UI.
There seem to be a number of more recently updated forks: https://github.com/my8100/scrapydweb/network
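For context, scrapydweb is a management UI on top of Scrapyd's JSON API. A minimal sketch of scheduling a job directly against a Scrapyd server; the host, project, and spider names are placeholders:

```python
import requests

# Scrapyd exposes a simple JSON API; scrapydweb manages one or more
# Scrapyd servers through it. All names below are placeholders.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```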
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
You can use grab-site with `--no-offsite-links` and `--igsets=mediawiki`, for example:
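As a sketch, assuming grab-site is on your PATH and using a placeholder target URL, the equivalent invocation from Python would be:

```python
import subprocess

# Invoke grab-site with the flags mentioned above; the URL is a placeholder.
subprocess.run(
    [
        "grab-site",
        "--no-offsite-links",   # don't follow links to other domains
        "--igsets=mediawiki",   # apply the bundled MediaWiki ignore set
        "https://wiki.example.org/",
    ],
    check=True,
)
```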
freshonions-torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
alltheplaces
A set of spiders and scrapers to extract location information from places that post their location on the internet.
Project mention: An aggressive, stealthy web spider operating from Microsoft IP space | news.ycombinator.com | 2023-01-18
I recently made some contributions to https://github.com/alltheplaces/alltheplaces, which is a set of Scrapy spiders for extracting locations of primarily franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now shocked at how hostile a lot of businesses are about letting people access their websites to find store locations, store information, or to purchase an item.
Some examples of what I found you can't do at perhaps 20% of business websites:
* Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).
* Plan a holiday figuring out when a shop closes for the day in another country.
* Order something from an Internet connection in another country to arrive when you return home.
* Look up product information whilst on a work trip overseas.
* Access the website via a mandatory workplace proxy server that happens to be hosted in a data centre (AWS and Azure included) and is blocked simply because it's not a residential ISP. A more complex check would be the website verifying that the round-trip time of a request made by JavaScript code executing on the client matches what the server expects it to be (measured by ICMP ping or similar to the origin IP address).
* Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.
* Find the business on a price comparison website.
* Find the business on a search engine that isn't Google.
* Access the website without first allowing obfuscated JavaScript to execute (example: [1]).
* Use the website if you had certain disabilities.
* Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).
The answer to me appears obvious: store and product information is for the most part static and can be kept in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until their next update. The JSON files can be advertised freely to the public, so anyone trying to obtain information from the website could do so with a single request for a small file, causing as little impact as possible.
Of course, none of the anti-bot measures implemented by websites actually stop the bots they most want to block. There are services specifically designed for scraping that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and with software such as selenium-stealth it's trivial to make each request look legitimate from a JavaScript perspective, as if it came from a Chrome browser running on Windows 10 with a single 2160p screen, and so on. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose: it ends up blocking legitimate customers and suffering extremely poor performance, made worse by forcing bots to issue 10x the number of requests just to look legitimate or pass client tests.
[1] https://www.imperva.com/products/web-application-firewall-wa...
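To make the static-JSON suggestion concrete, here is a minimal sketch of an alltheplaces-style Scrapy spider consuming a hypothetical store-locator JSON endpoint; the URL and field names are made up for illustration:

```python
import scrapy

class ExampleStoresSpider(scrapy.Spider):
    """Hypothetical alltheplaces-style spider: one small JSON request
    yields every store location."""
    name = "example_stores"
    start_urls = ["https://www.example.com/stores.json"]  # made-up endpoint

    def parse(self, response):
        for store in response.json():  # assumes the endpoint returns a JSON array
            yield {
                "ref": store.get("id"),
                "name": store.get("name"),
                "lat": store.get("latitude"),
                "lon": store.get("longitude"),
                "addr_full": store.get("address"),
            }
```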
hacker-news-digest
Project mention: Show HN: Hacker News Summary – Let ChatGPT Summarize Hacker News for You | /r/patient_hackernews | 2023-06-09
estela
This solution uses requests, but you can also do it in Scrapy; and if you plan to run more crawlers, you can use estela, which is a spider management solution. A minimal requests version might look like the sketch below.
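The plain-requests approach amounts to something like this, with a placeholder URL and CSS selector, and assuming BeautifulSoup for parsing:

```python
import requests
from bs4 import BeautifulSoup  # parsing library; an assumption here

resp = requests.get("https://example.com/listing", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# The selector is made up; adapt it to the page you're scraping.
titles = [a.get_text(strip=True) for a in soup.select("a.title")]
print(titles)
```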
webspot
An intelligent web service to automatically detect web content and extract information from it.
telegram-groups-crawler
A Telegram crawler made in Python to automatically search groups and channels and collect any type of data from them (+ dataset included).
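The repository's own code is the reference, but as an illustration of the general approach, here is a minimal sketch using Telethon (an assumption; the project may use a different client library) to list the groups and channels an account can see:

```python
from telethon import TelegramClient

# Placeholder credentials, obtained from https://my.telegram.org
api_id = 12345
api_hash = "0123456789abcdef0123456789abcdef"

client = TelegramClient("session", api_id, api_hash)

async def main():
    # Iterate over all dialogs and keep only groups and channels.
    async for dialog in client.iter_dialogs():
        if dialog.is_group or dialog.is_channel:
            print(dialog.id, dialog.name)

with client:
    client.loop.run_until_complete(main())
```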
scrapeops-scrapy-sdk
Scrapy extension that gives you all the scraping monitoring, alerting, scheduling, and data validation you will need straight out of the box.
Part 5 (Deployment, Scheduling & Running Jobs) covers deploying our spider on a server, and monitoring and scheduling jobs via ScrapeOps.
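Wiring the SDK into a Scrapy project is a settings change; a hedged sketch based on the ScrapeOps docs, where the extension path and key are assumptions to verify against the project README:

```python
# settings.py (sketch; verify paths against the scrapeops-scrapy README)
SCRAPEOPS_API_KEY = "YOUR_API_KEY"  # placeholder

EXTENSIONS = {
    # Assumed extension path for the ScrapeOps monitor.
    "scrapeops_scrapy.extension.ScrapeOpsMonitor": 500,
}
```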
Python Spider related posts
- struggling to download websites
- Internet Archive Down, will be up and running soon (i hope).
- best tool for downloading forum posts in real-time?
- Best way to back up entire website on a schedule
- An aggressive, stealthy web spider operating from Microsoft IP space
- What does Overture Map mean for the future of OpenStreetMap
- Datasette AllThePlaces
Index
What are some of the best open-source Spider projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Photon | 9,686 |
2 | InfoSpider | 6,663 |
3 | toapi | 3,388 |
4 | Gerapy | 2,993 |
5 | scrapydweb | 2,715 |
6 | SpiderKeeper | 2,644 |
7 | Grab | 2,287 |
8 | TorBot | 1,983 |
9 | PSpider | 1,771 |
10 | grab-site | 1,020 |
11 | XSRFProbe | 857 |
12 | freshonions-torscraper | 447 |
13 | alltheplaces | 394 |
14 | hacker-news-digest | 324 |
15 | LinkedInDumper | 171 |
16 | estela | 117 |
17 | webspot | 55 |
18 | telegram-groups-crawler | 50 |
19 | scrapeops-scrapy-sdk | 20 |
20 | XingDumper | 17 |
21 | amazon_price_tracker | 6 |
22 | scrapy-yle-kuntavaalit2021 | 2 |
23 | python-selenium | 2 |