Top 23 Python PySpark Projects

ibis

23 4,074 10.0 Python

the portable Python dataframe library

Project mention: Show HN: Hashquery, a Python library for defining reusable analysis | news.ycombinator.com | 2024-04-23

I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]
0: https://ibis-project.org/

petastorm

2 1,751 3.7 Python

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Optimus

0 1,446 1.9 Python

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
pyspark-example-project

1 1,370 0.0 Python

Implementing best practices for PySpark ETL jobs and applications.
sparkmagic

4 1,284 7.6 Python

Jupyter magics and kernels for working with remote Spark clusters

Project mention: Doing ML works in AWS. Need help installing cartopy | /r/aws | 2023-06-05

Please file an issue at https://github.com/jupyter-incubator/sparkmagic

quinn

9 576 9.2 Python

pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)
chispa

12 508 6.7 Python

PySpark test helper methods with beautiful error messages

Project mention: Testing spark applications | /r/dataengineering | 2023-07-05

Unit and e2e tests using a combination of pytest and chispa (https://github.com/MrPowers/chispa). Custom library to create random test data that fits schema with optional hardcoded overrides for relevant fields to test business logic.

PySpark-Boilerplate

1 390 2.5 Python

A boilerplate for writing PySpark Jobs
tdigest

0 375 0.0 Python

t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark (by CamDavidsonPilon)
mack

5 269 5.9 Python

Delta Lake helper methods in PySpark

Project mention: Implementing and using SCD Type 2 | /r/dataengineering | 2023-07-04

There's still a library from Databricks? But I have never used it: https://github.com/MrPowers/mack

OSCI

18 150 3.1 Python

Open Source Contributor Index

Project mention: Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position. | /r/openSUSE | 2023-07-05

RH is still one of the biggest contributors to open source. Most sites I found place them in third place in terms of currently active contributors, only beaten by Google and Microsoft (companies with respectively 7x and 10x their number of employees). Not to shit on Suse (who are in 12th place on the list I found, quite impressive for a company with only about 2000 employees), but pretending RH doesn't get Open Source is just untrue.

dataproc-templates

1 111 8.7 Python

Dataproc templates and pipelines for solving simple in-cloud data tasks
cuallee

5 107 9.0 Python

Possibly the fastest DataFrame-agnostic quality check library in town.

Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11

soda-spark

1 60 0.0 Python

Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes
spark_app_twitter

3 60 0.0 Python

A data engineering project (Twitter monitor app)
ceja

1 32 2.3 Python

PySpark phonetic and string matching algorithms (by MrPowers)
pyspark-k8s-boilerplate

1 30 0.0 Python

Boilerplate for PySpark on Cloud Kubernetes
Apache-Spark-Guide

2 26 1.8 Python

Apache Spark Guide
pyspark-on-aws-emr

1 24 3.6 Python

The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing PySpark code.
covid-19-data-engineering-pipeline

1 22 5.3 Python

A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
TypedPyspark

1 14 2.4 Python

Type-annotate your spark dataframes and validate them
Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data

1 10 0.0 Python

Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.
etl-markup-toolkit

7 5 0.0 Python

ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python PySpark related posts

Show HN: Snowflake Data Quality Checks in Python
1 project | news.ycombinator.com | 11 Feb 2024
Testing spark applications
1 project | /r/dataengineering | 5 Jul 2023
Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position.
1 project | /r/openSUSE | 5 Jul 2023
Your opinion of the Red Hat's latest fiasco
2 projects | /r/Fedora | 26 Jun 2023
Trying out the new generative fill feature in Photoshop Beta
1 project | /r/singularity | 23 May 2023
Brainstorming functions to make PySpark easier
1 project | /r/apachespark | 13 Mar 2023
PySpark OSS Contribution Opportunity
3 projects | /r/apachespark | 5 Mar 2023

Index

What are some of the best open-source PySpark projects in Python? This list will help you:

#	Project	Stars
1	ibis	4,074
2	petastorm	1,751
3	Optimus	1,446
4	pyspark-example-project	1,370
5	sparkmagic	1,284
6	quinn	576
7	chispa	508
8	PySpark-Boilerplate	390
9	tdigest	375
10	mack	269
11	OSCI	150
12	dataproc-templates	111
13	cuallee	107
14	soda-spark	60
15	spark_app_twitter	60
16	ceja	32
17	pyspark-k8s-boilerplate	30
18	Apache-Spark-Guide	26
19	pyspark-on-aws-emr	24
20	covid-19-data-engineering-pipeline	22
21	TypedPyspark	14
22	Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data	10
23	etl-markup-toolkit	5