Top 23 Python Spark Projects
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
Redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Project mention: The 50 best open-source alternatives to popular SaaS software | dev.to | 2024-07-10
GitHub: Redash GitHub Repository
-
ChuanhuChatGPT
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Mage AI is a data transforming and integrating framework that allows data scientists and ML engineers to build and automate data pipelines without extensive coding. Data scientists can easily connect to their data sources, ingest data, and build production-ready data pipelines within Mage notebooks.
-
Project mention: Writing Composable SQL Using Knex and Pipelines | news.ycombinator.com | 2024-11-28
You can compose SQL with https://ibis-project.org/tutorials/ibis-for-sql-users, which is using https://github.com/tobymao/sqlglot to parse the SQL under the hood.
As an alternative to parsing the SQL yourself, DuckDB's [relational API](https://duckdb.org/docs/api/python/relational_api) allows you to compose SQL expressions efficiently and lazily, which I've used when playing around with things like https://gist.github.com/ajfriend/eea0795546c7c44f1c24ab0560a...
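To make "composing SQL lazily" concrete, here is a minimal standard-library sketch. The `Relation` class, its methods, and the `events` table are all hypothetical names invented for illustration; this is not the DuckDB, Ibis, or sqlglot API, only the general idea of building a query step by step and rendering the SQL text at the end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical lazy query builder: nothing is rendered until to_sql()."""

    source: str
    columns: tuple = ("*",)
    predicates: tuple = ()

    def select(self, *columns):
        # Return a new Relation; the original stays reusable.
        return Relation(self.source, tuple(columns), self.predicates)

    def filter(self, predicate):
        # Predicates accumulate and are joined with AND when rendered.
        return Relation(self.source, self.columns, self.predicates + (predicate,))

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.source}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql


base = Relation("events")
recent = base.filter("ts >= '2024-01-01'")
query = recent.select("user_id", "ts").filter("user_id IS NOT NULL")
print(query.to_sql())
```

Because each step returns a new immutable object, intermediate relations such as `recent` can be shared between several downstream queries without affecting one another.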
-
dev-setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
-
fugue
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
-
Optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Has anyone tried to link and dedupe the various datasets using a probabilistic linkage tool like Splink?
https://moj-analytical-services.github.io/splink/
(Disclaimer: I am the lead author, but the tool is FOSS)
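For readers new to probabilistic linkage, the sketch below shows one common formulation, the Fellegi-Sunter model, in plain Python. It is not Splink's API. The column names, the m/u probabilities, and the prior are made-up values for illustration, whereas a real tool would estimate them from data.

```python
import math

# Hypothetical parameters per comparison column:
#   m = P(values agree | the two records are a true match)
#   u = P(values agree | the two records are not a match)
PARAMS = {
    "first_name": {"m": 0.90, "u": 0.010},
    "surname": {"m": 0.95, "u": 0.005},
    "dob": {"m": 0.98, "u": 0.001},
}
PRIOR = 0.001  # hypothetical probability that a random pair is a match


def match_weight(record_a, record_b, params=PARAMS):
    """Sum of log2 Bayes factors over all comparable columns."""
    total = 0.0
    for column, p in params.items():
        a, b = record_a.get(column), record_b.get(column)
        if a is None or b is None:
            continue  # missing values contribute no evidence either way
        if a == b:
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total


def match_probability(weight, prior=PRIOR):
    """Convert a match weight into a posterior match probability."""
    odds = (prior / (1 - prior)) * 2 ** weight
    return odds / (1 + odds)


alice_1 = {"first_name": "alice", "surname": "smith", "dob": "1990-01-01"}
alice_2 = {"first_name": "alice", "surname": "smith", "dob": "1990-01-01"}
bob = {"first_name": "bob", "surname": "jones", "dob": "1985-06-30"}

same = match_weight(alice_1, alice_2)
different = match_weight(alice_1, bob)
print(same > 0, different < 0)
```

Agreement on a rare value (low u) adds a large positive weight, while disagreement subtracts weight, so the final score reflects how much evidence the pair carries overall.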
-
listenbrainz-server
Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.
For people moving off Spotify, have a look at https://listenbrainz.org. You can sync your listens to there and it will give you weekly recommendations. From my experience so far they are decent.
Note they don't host songs themselves, but will auto-search youtube/bandcamp/etc. and play the closest match. So YMMV.
-
streamify
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
-
enterprise_gateway
A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
Project mention: Zasper: A Modern and Efficient Alternative to JupyterLab, Built in Go | news.ycombinator.com | 2025-01-01
https://github.com/jupyter-server/enterprise_gateway
JupyterLab supports Lumino and React widgets.
Jupyter Notebook was built on jQuery, but Notebooks is now forked from JupyterLab and there's NbClassic FWIU.
Breaking the notebook extension API from Notebook to Lab unfortunately caused re-work for progress, as I recall.
jupyter-xeus/xeus is an "Implementation of the Jupyter kernel protocol in C++".
-
flytekit
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.
-
cape-dataframes
Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.
Python Spark discussion
Python Spark related posts
-
PyTorch Library for Running LLM on Intel CPU and GPU
-
Splink: Fast, accurate, scalable probabilistic data linkage
-
FLaNK Stack Weekly 22 January 2024
-
FLaNK Stack Weekly for 12 September 2023
-
Data diffs: Algorithms for explaining what changed in a dataset (2022)
-
A platform for building Gen-AI applications on Spark
-
Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
-
A note from our sponsor - SaaSHub
www.saashub.com | 16 Feb 2025
Index
What are some of the best open-source Spark projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | data-science-ipython-notebooks | 27,837 |
2 | Redash | 26,902 |
3 | ChuanhuChatGPT | 15,364 |
4 | horovod | 14,376 |
5 | Mage | 8,133 |
6 | sqlglot | 7,123 |
7 | dev-setup | 6,130 |
8 | TensorFlowOnSpark | 3,872 |
9 | koalas | 3,346 |
10 | fugue | 2,039 |
11 | pyspark-example-project | 1,722 |
12 | Optimus | 1,491 |
13 | splink | 1,465 |
14 | sparkmagic | 1,341 |
15 | listenbrainz-server | 734 |
16 | streamify | 657 |
17 | enterprise_gateway | 631 |
18 | datacompy | 506 |
19 | popmon | 498 |
20 | flytekit | 259 |
21 | visions | 210 |
22 | cape-dataframes | 174 |
23 | emr-serverless-samples | 161 |