Python Spark

Open-source Python projects categorized as Spark

Top 23 Python Spark Projects

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: FLaNK Stack 26 February 2024 | 2024-02-26

  • horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data.

    Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | 2023-06-12

    In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

  • dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

  • sqlglot

    Python SQL Parser and Transpiler

    Project mention: Build the dependency graph of your BigQuery pipelines at no cost: a Python implementation | 2024-01-11

    In the project we used the Python library networkx and a DiGraph object (directed graph). To detect a table reference in a query, we use sqlglot, a SQL parser (among other things) that works well with BigQuery.

  • TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

  • koalas

    Koalas: pandas API on Apache Spark

  • dpark

    Python clone of Spark, a MapReduce-like framework in Python

  • fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

    Project mention: FLaNK Stack Weekly 22 January 2024 | 2024-01-22

  • Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Project mention: Doing ML works in AWS. Need help installing cartopy | /r/aws | 2023-06-05

    Please file an issue at

  • listenbrainz-server

    Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

    Project mention: Analyzing Spotify Stream History | 2024-02-12

    There's also ListenBrainz, run by the MusicBrainz org, which offers similar functionality without API restrictions or other paid features that Last.FM tries to push.

    If you wish to use your scrobble data at all programmatically this is a far better tool to use.

  • popmon

    Monitor the stability of a Pandas or Spark dataframe ⚙︎

  • streamify

    A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

    Project mention: Where can I find online projects end-to-end? | /r/dataengineering | 2023-03-21

  • datacompy

    Pandas and Spark DataFrame comparison for humans and more!

    Project mention: How to Check 2 SQL Tables Are the Same | 2023-07-26

  • visions

    Type System for Data Analysis in Python

    Project mention: Complete Beginner tasked with ML at work - where do I start | /r/learnmachinelearning | 2023-06-27

    This one works pretty well:

  • flytekit

    Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

  • cape-dataframes

    Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.

    Project mention: Show HN: Cape API – Keep your sensitive data private while using GPT-4 | 2023-06-27

    - How can we mitigate hallucinations and bias so that we have higher trust in AI generated text?

    The features of the Cape API are designed to help solve these problems for developers, and we have a number of early customers using the API in production already.

    To get started, check out our docs:

    View the API reference:

    Join the discussion on our Discord:

    And of course try the CapeChat playground at

  • emr-serverless-samples

    Example code for running Spark and Hive jobs on EMR Serverless.

    Project mention: Hi, i want to convert existing EMR on EC2 CLuster into EMR Serverless. | /r/aws | 2023-03-12

    In addition to the Serverless docs, be sure to check out the emr-serverless-samples GitHub repo.

  • prosto

    Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-26.

What are some of the best open-source Spark projects in Python? This list will help you:

Rank  Project                          Stars
1     data-science-ipython-notebooks   26,176
2     Redash                           24,659
3     horovod                          13,836
4     Mage                             6,609
5     dev-setup                        6,032
6     sqlglot                          4,998
7     TensorFlowOnSpark                3,861
8     koalas                           3,313
9     dpark                            2,694
10    fugue                            1,833
11    Optimus                          1,428
12    pyspark-example-project          1,312
13    sparkmagic                       1,278
14    splink                           992
15    listenbrainz-server              624
16    popmon                           481
17    streamify                        435
18    datacompy                        348
19    visions                          194
20    flytekit                         194
21    cape-dataframes                  174
22    emr-serverless-samples           130
23    prosto                           89