Top 23 Python Pyspark Projects
-
petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
-
Optimus
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark (by CamDavidsonPilon)
-
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames.
-
pyspark-on-aws-emr
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can put to use quickly, so you can focus on writing PySpark code.
-
covid-19-data-engineering-pipeline
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
-
Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data
Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.
-
etl-markup-toolkit
ETL Markup Toolkit is a Spark-native tool for expressing ETL transformations as configuration.
Project mention: Show HN: Hashquery, a Python library for defining reusable analysis | news.ycombinator.com | 2024-04-23
I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]
0: https://ibis-project.org/
Please file an issue at https://github.com/jupyter-incubator/sparkmagic
Unit and e2e tests using a combination of pytest and chispa (https://github.com/MrPowers/chispa). Custom library to create random test data that fits schema with optional hardcoded overrides for relevant fields to test business logic.
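The custom test-data library mentioned above is not shown, so the following is only a minimal sketch of the idea: generate random rows that fit a schema, with optional hardcoded overrides for the fields that matter to the business logic under test. The schema format (column name mapped to a Python type) and the names `SCHEMA`, `random_value`, and `make_rows` are assumptions made for illustration, not the commenter's actual code. It uses plain Python so that it runs without a Spark session.

```python
import random
import string

# Hypothetical schema format: column name -> Python type.
# A real PySpark project would more likely derive this from a StructType.
SCHEMA = {"order_id": int, "customer": str, "amount": float, "is_refund": bool}


def random_value(col_type, rng):
    """Return a random value for one supported column type."""
    if col_type is bool:
        return rng.random() < 0.5
    if col_type is int:
        return rng.randint(0, 1_000_000)
    if col_type is float:
        return rng.uniform(0.0, 1_000.0)
    if col_type is str:
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    raise TypeError(f"unsupported column type: {col_type!r}")


def make_rows(schema, n, overrides=None, seed=0):
    """Build n random rows matching schema; overrides pin chosen columns."""
    overrides = overrides or {}
    unknown = set(overrides) - set(schema)
    if unknown:
        raise KeyError(f"override columns not in schema: {sorted(unknown)}")
    rng = random.Random(seed)  # seeded so test runs are reproducible
    rows = []
    for _ in range(n):
        row = {name: random_value(col_type, rng) for name, col_type in schema.items()}
        row.update(overrides)  # hardcoded values win over random ones
        rows.append(row)
    return rows


# Pin only the fields the business rule depends on; the rest stay random.
rows = make_rows(SCHEMA, n=3, overrides={"is_refund": True, "amount": -5.0})
```

In a Spark test suite, rows like these could be passed to `spark.createDataFrame` and the transformed result compared against an expected DataFrame with chispa inside a pytest test.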
There is still a library from Databricks? But I have never used it: https://github.com/MrPowers/mack
Project mention: Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position. | /r/openSUSE | 2023-07-05
RH is still one of the biggest contributors to open source. Most sites I found place them in third place in terms of currently active contributors, only beaten by Google and Microsoft (companies with respectively 7x and 10x their number of employees). Not to shit on SUSE (who are in 12th place on the list I found, quite impressive for a company with only about 2000 employees), but pretending RH doesn't get Open Source is just untrue.
Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11
Python Pyspark related posts
- Show HN: Snowflake Data Quality Checks in Python
- Testing spark applications
- Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position.
- Your opinion of the Red Hat's latest fiasco
- Trying out the new generative fill feature in Photoshop Beta
- Brainstorming functions to make PySpark easier
- PySpark OSS Contribution Opportunity
-
A note from our sponsor - WorkOS
workos.com | 26 Apr 2024
Index
What are some of the best open-source Pyspark projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ibis | 4,074 |
2 | petastorm | 1,751 |
3 | Optimus | 1,446 |
4 | pyspark-example-project | 1,370 |
5 | sparkmagic | 1,284 |
6 | quinn | 576 |
7 | chispa | 508 |
8 | PySpark-Boilerplate | 390 |
9 | tdigest | 375 |
10 | mack | 269 |
11 | OSCI | 150 |
12 | dataproc-templates | 111 |
13 | cuallee | 107 |
14 | soda-spark | 60 |
15 | spark_app_twitter | 60 |
16 | ceja | 32 |
17 | pyspark-k8s-boilerplate | 30 |
18 | Apache-Spark-Guide | 26 |
19 | pyspark-on-aws-emr | 24 |
20 | covid-19-data-engineering-pipeline | 22 |
21 | TypedPyspark | 14 |
22 | Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data | 10 |
23 | etl-markup-toolkit | 5 |