Top 23 Python PySpark Projects
-
Project mention: Modern Polars – A side-by-side comparison of the Polars and Pandas libraries | news.ycombinator.com | 2025-01-23
I just want to add an additional entry to the Other cool stuff you might like in the summary: https://ibis-project.org/
It's a portable dataframe library that defaults to a DuckDB backend, but you can also use polars and pandas (among the 20 backends that it supports).
-
petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
-
Optimus
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark (by CamDavidsonPilon)
-
Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11
-
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames
-
pyspark-on-aws-emr
This project offers a ready-to-use AWS EMR template built on Spot Fleet and On-Demand Instances, so you can focus on writing PySpark code.
-
covid-19-data-engineering-pipeline
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
Python Pyspark discussion
Python Pyspark related posts
- Show HN: Snowflake Data Quality Checks in Python
- Testing spark applications
- Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position.
- Your opinion of the Red Hat's latest fiasco
- Trying out the new generative fill feature in Photoshop Beta
- Brainstorming functions to make PySpark easier
- PySpark OSS Contribution Opportunity
Index
What are some of the best open-source PySpark projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ibis | 5,496 |
2 | petastorm | 1,801 |
3 | pyspark-example-project | 1,722 |
4 | Optimus | 1,492 |
5 | sparkmagic | 1,339 |
6 | quinn | 659 |
7 | chispa | 657 |
8 | datacompy | 502 |
9 | PySpark-Boilerplate | 396 |
10 | tdigest | 390 |
11 | mack | 315 |
12 | cuallee | 181 |
13 | OSCI | 163 |
14 | dataproc-templates | 123 |
15 | spark_app_twitter | 77 |
16 | soda-spark | 63 |
17 | pyjaws | 41 |
18 | ceja | 39 |
19 | pyspark-k8s-boilerplate | 33 |
20 | Apache-Spark-Guide | 30 |
21 | pyspark-on-aws-emr | 26 |
22 | covid-19-data-engineering-pipeline | 23 |
23 | TypedPyspark | 14 |