Python Spark

Open-source Python projects categorized as Spark

Top 21 Python Spark Projects

  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Project mention: Beginner in Python for Data Science | 2020-12-27

    data science ipython notebooks

  • GitHub repo Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: How often do you use SQL query tool or service in your daily work? | 2021-11-21

    Regarding the subqueries: they materialize queried data, so you can reuse a subquery multiple times.
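    One way to reuse a subquery is a `WITH` clause (common table expression). A minimal sketch using stdlib `sqlite3`; the table and column names are illustrative, not from the original post:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 30.0), (3, 50.0);
""")

# Define the subquery once in a WITH clause, then reference it twice.
rows = conn.execute("""
    WITH big_orders AS (
        SELECT id, amount FROM orders WHERE amount > 20
    )
    SELECT (SELECT COUNT(*) FROM big_orders),
           (SELECT SUM(amount) FROM big_orders)
""").fetchone()

print(rows)  # (2, 80.0)
```

    Note that SQLite decides itself whether to materialize a CTE; the point here is only that the subquery is written once and referenced repeatedly.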

  • GitHub repo horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

    Project mention: [D] GPU buying recommendation | 2021-07-17

    If you just want to run TensorFlow or PyTorch in a Jupyter notebook, setting up the environment shouldn't be difficult. I know that AWS has a marketplace of preconfigured images. However, you can go as advanced as setting up a cluster of GPU-equipped nodes running Horovod to do distributed machine learning. Yes, there's a learning curve, but you cannot acquire this skill set any other way.

  • GitHub repo dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

    Project mention: MacOS Development workspace 2021 | 2021-03-08

    donnemartin - dev setup

  • GitHub repo koalas

    Koalas: pandas API on Apache Spark

    Project mention: Spark vs Pandas | 2021-02-18

    If you like excessive use of square brackets.. I mean pandas, you might want to check out Koalas. Koalas is supposed to provide a pandas DataFrame API implementation on top of Spark.

  • GitHub repo dpark

    A Python clone of Spark: a MapReduce-like framework in Python

  • GitHub repo feast

    Feature Store for Machine Learning

    Project mention: [P] Announcing Feast 0.10: The simplest way to serve features in production | 2021-04-15


  • GitHub repo Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • GitHub repo sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Project mention: Spark is lit once again | 2021-10-29

    Things get a bit more complicated with interactive sessions. We've created a Sparkmagic-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J Gateway, and executed.
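    The command-polling pattern described above can be sketched with the stdlib alone. This is not Lighter's actual code; a plain `queue.Queue` stands in for the REST API and Py4J channel, and a sentinel replaces the real session's infinite loop:

```python
import queue

# Commands would arrive from the Sparkmagic kernel via the REST API and a
# Java collection; here a local queue stands in for that channel.
commands = queue.Queue()
commands.put("x = 2 + 3")
commands.put("result = x * 10")
commands.put(None)  # sentinel so this sketch terminates

session_ns = {}  # per-session namespace shared across commands
while True:
    cmd = commands.get()   # the real loop would block/poll for new commands
    if cmd is None:
        break
    exec(cmd, session_ns)  # run the statement in the session namespace

print(session_ns["result"])  # 50
```

    Keeping one namespace across iterations is what makes the session interactive: later commands see variables defined by earlier ones.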

  • GitHub repo Hail

    Scalable genomic data analysis.

    Project mention: Ask HN: Who is hiring? (July 2021) | 2021-07-01

    Broad Institute of MIT and Harvard | Cambridge, MA | Associate Software Engineer | Onsite

    We are seeking an associate software engineer interested in contributing to an open-source data visualization library for analyzing the biological impact of human genetic variation. You will contribute to projects like gnomAD, the world's largest catalogue of human genetic variation, used by hundreds of thousands of researchers, and help us scale towards millions of genomes in the coming years. We are also developing next-generation tools for enabling genetic analyses of large biobanks across richly phenotyped individuals. In this role you will gain experience developing data-intensive web applications with TypeScript, React, Python, Terraform, and Google Cloud Platform, and will make use of the scalable data analysis library Hail. Key to our success is growing a strong team with a diverse membership who foster a culture of continual learning, and who support the growth and success of one another. Towards this end, we are committed to seeking applications from women and from underrepresented groups. We know that many excellent candidates choose not to apply despite their capabilities; please allow us to enthusiastically counter this tendency.

    Please provide a CV and links to previous work or projects, ideally with contributions visible on GitHub.

    email: [email protected]

  • GitHub repo listenbrainz-server

    Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

    Project mention: Dislike button would improve Spotify's recommendations | 2021-10-16

    Listenbrainz[0] looks like an interesting project for building better (or at least more open) recommendation systems[1].



  • GitHub repo fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask without any rewrites. (by fugue-project)

    Project mention: FugueSQL: SQL-ish for pandas, dask, spark | 2021-10-11

    Hey, I am the author of Fugue.

    Fugue is a higher level abstraction compared to Ray. It provides unified and non-invasive interfaces for people to use Spark, Dask and Pandas. Ray/Modin is also on our roadmap.

    It provides both a Python interface (not pandas-like) and Fugue SQL (standard SQL plus extra features). Users can choose whichever they are most comfortable with as the semantic layer for distributed computing; the two are equivalent.

    With Fugue, most of your logic will be in simple Python/SQL that is framework- and scale-agnostic. From the mindset to the code, Fugue minimizes your dependency on any specific computing framework, including Fugue itself.

    Please let me know if you want to learn more. Our Slack is in the README of the Fugue repo.

    Fugue repo:
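    The framework-agnostic idea above can be sketched with the stdlib alone. This is not Fugue's actual API; the point is only that the business logic is a plain function over rows, and the "engine" is swapped in separately (the names and the markup rate are illustrative):

```python
from typing import Dict, Iterable, List

# Framework-agnostic logic: a plain function over rows, with no Spark,
# Dask, or pandas imports. The idea is that such a function could then be
# handed to whichever backend you choose.
def add_markup(rows: Iterable[Dict]) -> List[Dict]:
    return [{**row, "total": row["price"] * 1.5} for row in rows]

# A trivial "local engine" stands in for a distributed backend here.
def run_local(data, fn):
    return fn(data)

data = [{"price": 10.0}, {"price": 24.0}]
out = run_local(data, add_markup)
print(out[0]["total"])  # 15.0
```

    Because `add_markup` knows nothing about the engine, the same logic can be tested locally on a list of dicts before being scaled out.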

  • GitHub repo popmon

    Monitor the stability of a pandas or Spark dataframe ⚙︎

    Project mention: Monitor the stability of a pandas or Spark dataframe | 2021-09-15

  • GitHub repo cape-python

    Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark

    Project mention: Data Anonymization Libraries | 2021-11-10

    I was wondering what other helpful and easy-to-use libraries are out there for data anonymization, like faker and cape-python?
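    The kind of policy such libraries apply can be sketched with the stdlib alone. This is not cape-python's API; the field names, salt, and transformations are illustrative:

```python
import hashlib

SALT = b"example-salt"  # in practice, a secret kept out of the dataset

def pseudonymize(value: str) -> str:
    # Salted hash: deterministic pseudonym, not reversible to the input.
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def apply_policy(record: dict) -> dict:
    return {
        "email": pseudonymize(record["email"]),  # direct identifier: hash it
        "age": (record["age"] // 10) * 10,       # quasi-identifier: bucket it
    }

row = {"email": "alice@example.com", "age": 37}
print(apply_policy(row))
```

    Deterministic pseudonyms keep joins on the masked column working, while bucketing coarsens values that could re-identify someone in combination.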

  • GitHub repo flytekit

    Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible.

    Project mention: Release the TextHTMLPress package to PyPI | 2021-11-26

    Based on references about setting up a Python project, package structure, and a production-level Python package, I refactored the package as shown below:

  • GitHub repo prosto

    Prosto is a data processing toolkit that radically changes how data is processed by relying heavily on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: No-Code Self-Service BI/Data Analytics Tool | 2021-11-13

    Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI, etc.) were conceived as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: from ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE`` we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.

    Yet the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems have provided different answers to this question, but all of them are highly specific and rather limited.

    Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach: the relational model works with sets (rows of tables), not with columns.

    One generic approach to working with columns in multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the *Prosto* data processing toolkit, which is an alternative to map-reduce and SQL:

    It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no join and no groupby operations are needed, which significantly simplifies data transformations and makes them more natural.

    Moreover, it now provides *Column-SQL*, which makes it even easier to define new columns in terms of other columns:
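    The multi-table calculated-column idea described above can be sketched with plain dicts. This is not Prosto's API; the table and column names are illustrative. The new column is a function over a link to another table, with no join or groupby involved:

```python
# Two "tables": products is keyed by id, orders reference products.
products = {1: {"name": "pen", "price": 2.0},
            2: {"name": "book", "price": 10.0}}
orders = [{"product_id": 1, "qty": 3},
          {"product_id": 2, "qty": 1}]

# Define Orders::total as a function of Products::price -- conceptually a
# calculated column over a link column, not the result of a join.
def total(order):
    return products[order["product_id"]]["price"] * order["qty"]

for order in orders:
    order["total"] = total(order)

print([o["total"] for o in orders])  # [6.0, 10.0]
```

    Each row's value is computed by following the link and applying a function, which is the column-oriented alternative to joining the two tables first.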

  • GitHub repo learningOrchestra

    learningOrchestra is a distributed Machine Learning processing tool that facilitates and streamlines iterative processes in a Data Science project.

    Project mention: Someone with a good experience in python can rate my code? | 2021-02-16

  • GitHub repo openverse-catalog

    Identifies and collects data on CC-licensed content across web crawl data and public APIs.

    Project mention: Hacktoberfest Recap | 2021-10-31

    Issue, Pull Request, Blog Post

  • GitHub repo ds2ai

    The MLOps platform for innovators 🚀

    Project mention: Release: End-to-End MLOps Platform | 2021-07-13

  • GitHub repo fastdbfs

    fastdbfs - An interactive command line client for Databricks DBFS.

    Project mention: fastdbfs - An interactive command line client for Databricks DBFS | 2021-05-07

    fastdbfs is an interactive command line client for accessing Databricks DBFS. It aims to be much friendlier and faster than the official CLI tool, as well as more feature-rich.

  • GitHub repo etl-markup-toolkit

    ETL Markup Toolkit is a Spark-native tool for expressing ETL transformations as configuration

    Project mention: How do you serialize and save "transformations" in your pipeline? | 2021-08-31

    I have a side project, if you're interested, that takes transformations as YAML files and outputs step-level logs about each step of the transformation. I've always felt that both artifacts could be made searchable using an ELK stack or something... Do you have similar artifacts? Or perhaps there's a way to turn SQL into a structured or semi-structured form to aid in searchability?
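    The configuration-driven idea can be sketched with the stdlib alone. This is not etl-markup-toolkit's format; plain dicts stand in for the YAML files, and the operation names and step-level log lines are illustrative:

```python
# Each step names an operation and its parameters; in a real tool this
# list would be parsed from a YAML file.
config = [
    {"op": "filter", "column": "amount", "min": 20},
    {"op": "rename", "from": "amount", "to": "total"},
]

OPS = {
    "filter": lambda rows, s: [r for r in rows if r[s["column"]] >= s["min"]],
    "rename": lambda rows, s: [{**{k: v for k, v in r.items() if k != s["from"]},
                                s["to"]: r[s["from"]]} for r in rows],
}

def run(rows, config):
    for step in config:
        rows = OPS[step["op"]](rows, step)
        print(f"after {step['op']}: {len(rows)} rows")  # step-level log
    return rows

data = [{"amount": 10}, {"amount": 30}]
result = run(data, config)
print(result)  # [{'total': 30}]
```

    Because each step is data rather than code, the pipeline definition and its per-step logs are both serializable, which is what makes them indexable by something like an ELK stack.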

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-11-26.

Python Spark related posts


What are some of the best open-source Spark projects in Python? This list will help you:

Project Stars
1 data-science-ipython-notebooks 21,842
2 Redash 19,987
3 horovod 11,881
4 dev-setup 5,566
5 koalas 3,027
6 dpark 2,664
7 feast 2,490
8 Optimus 1,139
9 sparkmagic 1,067
10 Hail 761
11 listenbrainz-server 452
12 fugue 396
13 popmon 197
14 cape-python 144
15 flytekit 62
16 prosto 53
17 learningOrchestra 48
18 openverse-catalog 15
19 ds2ai 8
20 fastdbfs 4
21 etl-markup-toolkit 3