Python Spark

Open-source Python projects categorized as Spark

Top 23 Python Spark Projects

  1. data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  3. Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: The 50 best open-source alternatives to popular SaaS software | dev.to | 2024-07-10

    GitHub: Redash GitHub Repository

  4. ChuanhuChatGPT

    GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

  5. horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

  6. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: 25 Open Source AI Tools to Cut Your Development Time in Half | dev.to | 2024-07-11

    Mage AI is a data transforming and integrating framework that allows data scientists and ML engineers to build and automate data pipelines without extensive coding. Data scientists can easily connect to their data sources, ingest data, and build production-ready data pipelines within Mage notebooks.

  7. sqlglot

    Python SQL Parser and Transpiler

    Project mention: Writing Composable SQL Using Knex and Pipelines | news.ycombinator.com | 2024-11-28

    You can compose SQL with https://ibis-project.org/tutorials/ibis-for-sql-users, which is using https://github.com/tobymao/sqlglot to parse the SQL under the hood.

    As an alternative to parsing the SQL yourself, DuckDB's [relational API](https://duckdb.org/docs/api/python/relational_api) allows you to compose SQL expressions efficiently and lazily, which I've used when playing around with things like https://gist.github.com/ajfriend/eea0795546c7c44f1c24ab0560a...

  8. dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

  10. TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

  11. koalas

    Koalas: pandas API on Apache Spark

  12. fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

  13. pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  14. Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  15. sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

  16. listenbrainz-server

    Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

    Project mention: Goodbye, Slopify | news.ycombinator.com | 2025-01-28

    For people moving off Spotify, have a look at https://listenbrainz.org. You can sync your listens to there and it will give you weekly recommendations. From my experience so far they are decent.

    Note they don't host songs themselves, but will auto-search youtube/bandcamp/etc. and play the closest match. So YMMV.

  17. streamify

    A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

  18. enterprise_gateway

    A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.

    Project mention: Zasper: A Modern and Efficient Alternative to JupyterLab, Built in Go | news.ycombinator.com | 2025-01-01

    https://github.com/jupyter-server/enterprise_gateway

    JupyterLab supports Lumino and React widgets.

    Jupyter Notebook was built on jQuery, but Notebooks is now forked from JupyterLab and there's NbClassic FWIU.

    Breaking the notebook extension API from Notebook to Lab unfortunately caused re-work for progress, as I recall.

    jupyter-xeus/xeus is an "Implementation of the Jupyter kernel protocol in C++"

  19. datacompy

    Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

  20. popmon

    Monitor the stability of a Pandas or Spark dataframe ⚙︎

  21. flytekit

    Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

  22. visions

    Type System for Data Analysis in Python

  23. cape-dataframes

    Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.

  24. emr-serverless-samples

    Example code for running Spark and Hive jobs on EMR Serverless.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
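
The sqlglot entry above quotes a comment about composing SQL expressions lazily rather than parsing SQL by hand. As a rough illustration of that idea only, here is a minimal sketch using Python's standard library. The `LazyQuery` class and its methods are hypothetical names invented for this example; this is not the API of DuckDB, Ibis, or sqlglot, and it runs against an in-memory SQLite database. The star counts are the ones shown in the index on this page.

```python
import sqlite3
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Hypothetical helper: an immutable description of a SELECT statement."""

    table: str
    columns: Tuple[str, ...] = ("*",)
    conditions: Tuple[str, ...] = ()
    params: Tuple[object, ...] = ()

    def select(self, *columns: str) -> "LazyQuery":
        # Return a new query object; nothing is sent to the database yet.
        return LazyQuery(self.table, columns, self.conditions, self.params)

    def where(self, condition: str, *params: object) -> "LazyQuery":
        # Conditions accumulate and are joined with AND when rendered.
        # Values go through bound parameters; the condition text itself
        # is trusted input in this sketch.
        return LazyQuery(
            self.table,
            self.columns,
            self.conditions + (condition,),
            self.params + params,
        )

    def to_sql(self) -> str:
        # Render the accumulated description into a single SQL string.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return sql

    def execute(self, conn: sqlite3.Connection) -> list:
        # The SQL text is built and run only at this point.
        return conn.execute(self.to_sql(), self.params).fetchall()


# Sample data: three projects and their star counts from the index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, stars INTEGER)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [("sqlglot", 7123), ("koalas", 3346), ("visions", 210)],
)

base = LazyQuery("projects")
popular = base.where("stars > ?", 1000).select("name")
print(popular.to_sql())
print(popular.execute(conn))
```

Each call returns a new immutable object, so `base` can be reused to build other queries, and no SQL is generated until `to_sql()` or `execute()` is called.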

Python Spark related posts

  • PyTorch Library for Running LLM on Intel CPU and GPU

    1 project | news.ycombinator.com | 3 Apr 2024
  • Splink: Fast, accurate, scalable probabilistic data linkage

    1 project | news.ycombinator.com | 13 Mar 2024
  • FLaNK Stack Weekly 22 January 2024

    37 projects | dev.to | 22 Jan 2024
  • FLaNK Stack Weekly for 12 September 2023

    26 projects | dev.to | 12 Sep 2023
  • Data diffs: Algorithms for explaining what changed in a dataset (2022)

    8 projects | news.ycombinator.com | 26 Jul 2023
  • A platform for building Gen-AI applications on Spark

    2 projects | /r/datascience | 26 Jun 2023
  • Daft: A High-Performance Distributed Dataframe Library for Multimodal Data

    4 projects | news.ycombinator.com | 7 Jun 2023

Index

What are some of the best open-source Spark projects in Python? This list will help you:

#    Project                          Stars
1    data-science-ipython-notebooks   27,837
2    Redash                           26,902
3    ChuanhuChatGPT                   15,364
4    horovod                          14,376
5    Mage                             8,133
6    sqlglot                          7,123
7    dev-setup                        6,130
8    TensorFlowOnSpark                3,872
9    koalas                           3,346
10   fugue                            2,039
11   pyspark-example-project          1,722
12   Optimus                          1,491
13   splink                           1,465
14   sparkmagic                       1,341
15   listenbrainz-server              734
16   streamify                        657
17   enterprise_gateway               631
18   datacompy                        506
19   popmon                           498
20   flytekit                         259
21   visions                          210
22   cape-dataframes                  174
23   emr-serverless-samples           161
