Top 22 Python Pyspark Projects
-
ibis
Use Ibis and you won't have to worry about migrating to a vendor-specific Python framework anymore. It connects to Snowflake, Spark, and many other engines you may want to interact with.
-
petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
-
Optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
sparkmagic
Project mention: Ask HN: Who's an open source maintainer/project that needs sponsorship or help? | news.ycombinator.com | 2023-02-28
I maintain several open source projects, most notably:
Sparkmagic (https://github.com/jupyter-incubator/sparkmagic)
Sparkmagic provides Jupyter magics and kernels for working with remote Spark clusters. It's used by thousands of developers and companies like Pinterest, Amazon, and more!
I've been maintaining for the past few years and would love help!
KSOPS (https://github.com/viaduct-ai/kustomize-sops)
KSOPS, or kustomize-SOPS, is a kustomize KRM exec plugin for SOPS encrypted resources. KSOPS can be used to decrypt any Kubernetes resource, but is most commonly used to decrypt encrypted Kubernetes Secrets and ConfigMaps. As a kustomize plugin, KSOPS allows you to manage, build, and apply encrypted manifests the same way you manage the rest of your Kubernetes manifests.
KSOPS is the most popular kustomize plugin and I'd love help maintaining and improving it from out GitOps fanatics.
-
pyspark-example-project
Example project implementing best practices for PySpark ETL jobs and applications.
https://github.com/AlexIoannides/pyspark-example-project You can use this as an example to organize your project. I have referred to this in the past.
-
quinn
Project mention: Brainstorming functions to make PySpark easier | reddit.com/r/apachespark | 2023-03-13
We're brainstorming functions to make PySpark easier, see this issue: https://github.com/MrPowers/quinn/issues/83
-
PySpark-Boilerplate
-
chispa
Here's a little README fix a user pushed to chispa.
-
tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark (by CamDavidsonPilon)
-
OSCI
Project mention: Trying out the new generative fill feature in Photoshop Beta | reddit.com/r/singularity | 2023-05-23
The top two contributors to open source have been Microsoft and Google (OSCI https://opensourceindex.io/)
-
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames.
-
spark_app_twitter
Project mention: Trying to dockerize an all python data engineering project | reddit.com/r/docker | 2022-05-28
You can see the structure of everything in my repository: https://github.com/jmcmt87/spark_app_twitter
-
cuallee
Project mention: data-diff VS cuallee - a user suggested alternative | libhunt.com/r/data-diff | 2022-11-30
Declarative data quality rules at scale
-
ceja
-
pyspark-k8s-boilerplate
-
pyspark-on-aws-emr
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.
Building Big Data Pipelines in the Cloud with AWS EMR
-
Apache-Spark-Guide
-
covid-19-data-engineering-pipeline
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via Github Actions.
Project mention: COVID-19 data pipeline on AWS feat. Glue/PySpark, Docker, Great Expectations, Airflow, and Redshift, templated in CF/CDK, deployable via Github Actions | reddit.com/r/dataengineering | 2023-04-03
I've seen amazing projects here already, which honestly were a great inspiration, and today I would like to show you my project. Some time ago, I had the idea to apply every tool I wanted to learn or try out to the same topic and since then this idea has grown into an entire pipeline: https://github.com/moritzkoerber/covid-19-data-engineering-pipeline
-
Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data
Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.
-
etl-markup-toolkit
ETL Markup Toolkit is a Spark-native tool for expressing ETL transformations as configuration.
-
TypedPyspark
Project mention: Static type hints for PySpark SQL dataframes | reddit.com/r/dataengineering | 2023-03-27
-
weather_data_pipeline
This is a PySpark-based data pipeline that fetches weather data for a few cities, performs some basic processing and transformation on the data, and then writes the processed data to a Google Cloud Storage bucket and a BigQuery table. The data is then viewed in a Looker dashboard.
Project mention: Building a Weather Data Pipeline with PySpark, Prefect, and Google Cloud | dev.to | 2023-05-01
We'll be using PySpark for distributed data processing, Prefect for workflow management, and Google Cloud Storage and BigQuery for data storage and processing. The code is available on GitHub.
Python Pyspark related posts
- Trying out the new generative fill feature in Photoshop Beta
- Brainstorming functions to make PySpark easier
- PySpark OSS Contribution Opportunity
- I’m sorry...the Fuck?
- Spark open source community is awesome
- Invitation to collaborate on open source PySpark projects
- Open Source Contributor Index
Index
What are some of the best open-source Pyspark projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ibis | 2,735 |
2 | petastorm | 1,619 |
3 | Optimus | 1,376 |
4 | sparkmagic | 1,223 |
5 | pyspark-example-project | 1,185 |
6 | quinn | 460 |
7 | PySpark-Boilerplate | 388 |
8 | chispa | 370 |
9 | tdigest | 353 |
10 | OSCI | 123 |
11 | soda-spark | 57 |
12 | spark_app_twitter | 56 |
13 | cuallee | 33 |
14 | ceja | 28 |
15 | pyspark-k8s-boilerplate | 24 |
16 | pyspark-on-aws-emr | 21 |
17 | Apache-Spark-Guide | 17 |
18 | covid-19-data-engineering-pipeline | 16 |
19 | Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data | 9 |
20 | etl-markup-toolkit | 5 |
21 | TypedPyspark | 4 |
22 | weather_data_pipeline | 0 |