Python Spark

Open-source Python projects categorized as Spark

Top 22 Python Spark Projects

  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • GitHub repo Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: Open source DW? | 2022-01-03

    Is it a bad idea to use Redash as an ETL and data warehouse? I'm not a data engineer, just looking for a low/mid-scale solution to experiment with.

  • GitHub repo horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

    Project mention: [D] PyTorch Distributed Training Libraries: What are the current options? | 2021-12-07

    Check out Horovod -
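
Horovod's central primitive is an allreduce that averages the gradients each worker computed on its own data shard. A minimal pure-Python sketch of that averaging step (names are illustrative, not Horovod's API):

```python
# Illustrative sketch of Horovod-style data-parallel averaging
# (not the Horovod API): each worker computes gradients locally,
# and an allreduce gives every worker the element-wise mean.

def allreduce_mean(worker_grads):
    """worker_grads: one gradient vector per worker.
    Returns the element-wise mean that every worker would receive."""
    n_workers = len(worker_grads)
    return [sum(vals) / n_workers for vals in zip(*worker_grads)]

# Two workers computed gradients on different data shards:
grads = allreduce_mean([[0.2, -0.4], [0.6, 0.0]])
print(grads)  # [0.4, -0.2]
```

In real Horovod this averaging happens via ring-allreduce over the network, but the training-loop contract is the same: after the exchange, all workers apply identical averaged gradients.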

  • GitHub repo dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

    Project mention: MacOS Development workspace 2021 | 2021-03-08

    donnemartin - dev setup

  • GitHub repo TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

    Project mention: [D] Plug or Integrate a GNN Pytorch code base into Spark Cluster | 2022-01-03

    Check out whether this project is useful for you.

  • GitHub repo koalas

    Koalas: pandas API on Apache Spark

    Project mention: Spark vs Pandas | 2021-02-18

    If you like excessive use of square brackets... I mean pandas, you might want to check out Koalas. Koalas is supposed to provide a pandas dataframe API implementation on top of Spark.

  • GitHub repo dpark

    Python clone of Spark, a MapReduce-alike framework in Python
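
The MapReduce model that dpark (like Spark) distributes across machines can be sketched in a few lines of plain Python (names here are illustrative, not dpark's API):

```python
# A toy, single-process MapReduce: map each record to (key, value)
# pairs, group by key, then reduce each group to one result.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# The classic word count:
lines = ["spark spark hadoop", "hadoop spark"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=sum,
)
print(counts)  # {'spark': 3, 'hadoop': 2}
```

What dpark adds on top of this model is partitioning the records across machines and running the map and reduce phases in parallel.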

  • GitHub repo feast

    Feature Store for Machine Learning

    Project mention: [P] Announcing Feast 0.10: The simplest way to serve features in production | 2021-04-15


  • GitHub repo Optimus

    :truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • GitHub repo sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Project mention: Spark is lit once again | 2021-10-29

    Things get a bit more complicated with interactive sessions. We've created a Sparkmagic-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J Gateway, and executed.
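
The session pattern described above can be sketched in plain Python (illustrative names, not Lighter's actual API): a long-running application polls a command queue and executes what it finds, so an interactive kernel can drive a remote process.

```python
# Sketch of a polling interactive session: commands arrive in a queue
# (standing in for the Java collection behind the Py4J gateway) and are
# executed in a shared namespace so state persists across commands.
import queue

class InteractiveSession:
    def __init__(self):
        self.commands = queue.Queue()
        self.namespace = {}  # state shared across commands

    def submit(self, code):
        self.commands.put(code)

    def run_pending(self):
        """One iteration of the 'infinite loop': drain queued commands."""
        while not self.commands.empty():
            code = self.commands.get()
            exec(code, self.namespace)  # execute in the shared namespace

session = InteractiveSession()
session.submit("x = 40")
session.submit("result = x + 2")
session.run_pending()
print(session.namespace["result"])  # 42
```

The real loop would also ship results and errors back over the gateway; the point here is just the poll-and-execute structure.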

  • GitHub repo Hail

    Scalable genomic data analysis.

    Project mention: Ask HN: Who is hiring? (July 2021) | 2021-07-01

    Broad Institute of MIT and Harvard | Cambridge, MA | Associate Software Engineer | Onsite

    We are seeking an associate software engineer interested in contributing to an open-source data visualization library for analyzing the biological impact of human genetic variation. You will contribute to projects like gnomAD (, the world's largest catalogue of human genetic variation, used by hundreds of thousands of researchers, and help us scale towards millions of genomes in the coming years. We are also developing next-generation tools for enabling genetic analyses of large biobanks across richly phenotyped individuals ( In this role you will gain experience developing data-intensive web applications with TypeScript, React, Python, Terraform, and Google Cloud Platform, and will make use of the scalable data analysis library Hail ( Key to our success is growing a strong team with a diverse membership who foster a culture of continual learning, and who support the growth and success of one another. Towards this end, we are committed to seeking applications from women and from underrepresented groups. We know that many excellent candidates choose not to apply despite their capabilities; please allow us to enthusiastically counter this tendency.

    Please provide a CV and links to previous work or projects, ideally with contributions visible on GitHub.

    email: [email protected]

  • GitHub repo listenbrainz-server

    Server for the ListenBrainz project, including the front-end (JavaScript/React) code that it serves and all of the data processing components that LB uses.

    Project mention: ListenBrainz Year in Music 2021 | 2021-12-17
  • GitHub repo fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask without any rewrites.

    Project mention: Pyspark now provides a native Pandas API | 2022-01-02

    There's dask-sql, but I think it is being abandoned for fugue-project. I'm actually excited for this project as it is trying to provide a backend agnostic solution, which would seem like a difficult, lofty goal. I wish them luck.
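
The backend-agnostic idea can be sketched in plain Python (illustrative, not fugue's API): the user writes one transformation function, and an engine parameter decides where it runs. Here both "engines" are local, but the calling code would look the same for a distributed backend.

```python
# Write the logic once, select the execution engine separately.
def add_tax(rows, rate):
    # Ordinary Python logic, written once, independent of the engine.
    return [{**row, "total": row["price"] * (1 + rate)} for row in rows]

ENGINES = {
    "local": lambda fn, data, **kw: fn(data, **kw),
    # A "spark" or "dask" entry would ship fn to the cluster instead.
}

def transform(data, fn, engine="local", **kw):
    return ENGINES[engine](fn, data, **kw)

result = transform([{"price": 100.0}], add_tax, engine="local", rate=0.25)
print(result)  # [{'price': 100.0, 'total': 125.0}]
```

The hard part fugue takes on is making that engine swap actually work for SQL, Pandas, and arbitrary Python across Spark and Dask semantics, which is why the goal is lofty.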

  • GitHub repo popmon

    Monitor the stability of a pandas or spark dataframe ⚙︎

    Project mention: Monitor the stability of a pandas or spark dataframe | 2021-09-15
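
A minimal sketch of what dataframe-stability monitoring means (illustrative only; popmon itself compares histogram profiles over time): check how far a new batch of a column has drifted from a reference batch, in units of the reference's standard deviation.

```python
# Flag a batch whose mean has shifted by more than a few reference
# standard deviations -- a crude stand-in for popmon's profile checks.
from statistics import mean, stdev

def drift_score(reference, batch):
    """How many reference standard deviations the batch mean has moved."""
    return abs(mean(batch) - mean(reference)) / stdev(reference)

reference = [10.0, 11.0, 9.0, 10.5, 9.5]
stable    = [10.2, 9.8, 10.1]
shifted   = [14.0, 15.0, 14.5]

print(drift_score(reference, stable) > 3)   # False
print(drift_score(reference, shifted) > 3)  # True
```
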
  • GitHub repo cape-python

    Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark

    Project mention: Anonymize your Data with a single line! | 2021-12-26

    Well, many of the features in this project are simply wrappers around other libraries like this one. Therefore, the value proposition of this project would either have to be the automation aspect or the idea that you can shield the user from the details of how the implemented techniques work. I think both approaches are risky in this setting.
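
For context, a "one line" anonymizer typically wraps something like the following (an illustrative sketch, not cape-python's API): a keyed hash turns identifiers into stable pseudonyms, so equal inputs map to equal tokens but the original value is not recoverable without the key.

```python
# Keyed-hash tokenization: a common building block behind
# one-line column anonymization.
import hashlib
import hmac

def tokenize(values, key: bytes):
    return [
        hmac.new(key, v.encode(), hashlib.sha256).hexdigest()[:12]
        for v in values
    ]

emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
tokens = tokenize(emails, key=b"secret-key")
print(tokens[0] == tokens[2])  # True  (stable pseudonym)
print(tokens[0] == tokens[1])  # False
```

The critique in the quote stands: the technique itself is a few lines, so the wrapper's value has to come from policy automation or from hiding these details safely.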

  • GitHub repo flytekit

    Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible.

    Project mention: Release the TextHTMLPress package to PyPI | 2021-11-26

    Based on references on setting up a Python project, package structure, and a production-level Python package, I refactored the package as shown below:

  • GitHub repo prosto

    Prosto is a data processing toolkit that radically changes how data is processed, relying heavily on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: No-Code Self-Service BI/Data Analytics Tool | 2021-11-13

    Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI, etc.) were conceived as replacements for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: from ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE`` we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.

    Yet, the main problem lies in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems provide different answers to this question, but all of them are highly specific and rather limited.

    Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach: the relational model works with sets (rows of tables), not with columns.

    One generic approach to working with columns in multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the *Prosto* data processing toolkit, an alternative to map-reduce and SQL:

    It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no join or groupby operations are needed, which significantly simplifies data transformations and makes them more natural.

    Moreover, now it provides *Column-SQL* which makes it even easier to define new columns in terms of other columns:
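
The column-oriented idea described above can be sketched in a few lines of plain Python (illustrative, not Prosto's API): instead of joining, a "link" column maps each row of one table to a row of another, and a new column is defined as a function applied through that link.

```python
# Two "tables" as dicts of columns.
orders   = {"product_id": [1, 2, 1]}                 # Table1
products = {"id": [1, 2], "price": [9.99, 24.50]}    # Table2

# Link column: for each order row, the index of its product row.
link = [products["id"].index(pid) for pid in orders["product_id"]]

# New column in orders, defined through the link -- no join, no groupby:
orders["price"] = [products["price"][i] for i in link]

print(orders["price"])  # [9.99, 24.5, 9.99]
```

The link plays the role of a function from one table's rows to another's, which is exactly what the ``Table1::ColumnA=FUNCTION(Table2::ColumnB, ...)`` notation expresses.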

  • GitHub repo learningOrchestra

    learningOrchestra is a distributed Machine Learning processing tool that facilitates and streamlines iterative processes in a Data Science project.

    Project mention: Someone with a good experience in python can rate my code? | 2021-02-16

  • GitHub repo openverse-catalog

    Identifies and collects data on CC-licensed content across web crawl data and public APIs.

    Project mention: Hacktoberfest Recap | 2021-10-31

    Issue, Pull Request, Blog Post

  • GitHub repo ds2ai

    The MLOps platform for innovators 🚀

    Project mention: Release: End-to-End MLOps Platform | 2021-07-13
  • GitHub repo fastdbfs

    fastdbfs - An interactive command line client for Databricks DBFS.

    Project mention: fastdbfs - An interactive command line client for Databricks DBFS | 2021-05-07

    fastdbfs is an interactive command-line client for accessing Databricks DBFS. It aims to be friendlier, faster, and more feature-rich than the official CLI tool.

  • GitHub repo etl-markup-toolkit

    ETL Markup Toolkit is a Spark-native tool for expressing ETL transformations as configuration

    Project mention: How do you serialize and save "transformations" in your pipeline? | 2021-08-31

    I have a side project (, if you're interested) that takes transformations as YAML files and outputs step-level logs about each step of the transformation. I've always felt that both artifacts could be made searchable using an ELK stack or something... Do you have similar artifacts? Or perhaps there's a way to turn SQL into a structured or semi-structured form to aid in searchability.
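
Configuration-driven ETL with per-step logs can be sketched in plain Python (illustrative; the real project uses YAML configs and runs on Spark). Each step names an operation and its parameters; the runner applies them in order and records what each step did, producing exactly the kind of artifact that could be indexed and searched.

```python
# A tiny config-driven pipeline runner with step-level logs.
OPS = {
    "filter": lambda rows, p: [r for r in rows if r[p["column"]] >= p["min"]],
    "rename": lambda rows, p: [{p.get(k, k): v for k, v in r.items()} for r in rows],
}

def run_pipeline(rows, steps):
    logs = []
    for step in steps:
        rows = OPS[step["op"]](rows, step["params"])
        logs.append({"op": step["op"], "rows_out": len(rows)})
    return rows, logs

# This dict mirrors what a YAML config file would deserialize to:
config = [
    {"op": "filter", "params": {"column": "qty", "min": 2}},
    {"op": "rename", "params": {"qty": "quantity"}},
]
data = [{"qty": 1}, {"qty": 3}, {"qty": 5}]
rows, logs = run_pipeline(data, config)
print(logs)  # [{'op': 'filter', 'rows_out': 2}, {'op': 'rename', 'rows_out': 2}]
```

Because both the config and the logs are plain structured data, either could be shipped to an ELK stack as-is.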

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-03.

Python Spark related posts


What are some of the best open-source Spark projects in Python? This list will help you:

Project Stars
1 data-science-ipython-notebooks 22,261
2 Redash 20,290
3 horovod 12,024
4 dev-setup 5,590
5 TensorFlowOnSpark 3,748
6 koalas 3,054
7 dpark 2,668
8 feast 2,656
9 Optimus 1,159
10 sparkmagic 1,076
11 Hail 772
12 listenbrainz-server 464
13 fugue 462
14 popmon 218
15 cape-python 144
16 flytekit 67
17 prosto 52
18 learningOrchestra 49
19 openverse-catalog 22
20 ds2ai 8
21 fastdbfs 4
22 etl-markup-toolkit 3