Python Pyspark

Open-source Python projects categorized as Pyspark

Top 22 Python Pyspark Projects

  • ibis

    The flexibility of Python with the scale and performance of modern SQL.

    Project mention: Thoughts About Snowpark? | reddit.com/r/dataengineering | 2023-05-16

    Use Ibis and you won't have to worry about migrating to a vendor-specific python framework anymore. It connects to Snowflake, Spark, and many other engines you may want to interact with.

  • petastorm

    The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

  • Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Project mention: Ask HN: Who's an open source maintainer/project that needs sponsorship or help? | news.ycombinator.com | 2023-02-28

    I maintain several open source projects, most notably:

    Sparkmagic (https://github.com/jupyter-incubator/sparkmagic)

    Sparkmagic provides jupyter magics and kernels for working with remote Spark clusters. It's used by thousands of developers and companies like Pinterest, Amazon, more!

    I've been maintaining for the past few years and would love help!

    KSOPS (https://github.com/viaduct-ai/kustomize-sops)

    KSOPS, or kustomize-SOPS, is a kustomize KRM exec plugin for SOPS encrypted resources. KSOPS can be used to decrypt any Kubernetes resource, but is most commonly used to decrypt encrypted Kubernetes Secrets and ConfigMaps. As a kustomize plugin, KSOPS allows you to manage, build, and apply encrypted manifests the same way you manage the rest of your Kubernetes manifests.

    KSOPS is the most popular kustomize plugin and I'd love help maintaining and improving it from out GitOps fanatics.

  • pyspark-example-project

    Example project implementing best practices for PySpark ETL jobs and applications.

    Project mention: Learning Pyspark for a new role | reddit.com/r/dataengineering | 2022-12-23

    https://github.com/AlexIoannides/pyspark-example-project You can use this as an example to organize your project. I have referred to this in the past.

  • quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

    Project mention: Brainstorming functions to make PySpark easier | reddit.com/r/apachespark | 2023-03-13

    We're brainstorming functions to make PySpark easier, see this issue: https://github.com/MrPowers/quinn/issues/83

  • PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

  • chispa

    PySpark test helper methods with beautiful error messages

    Project mention: Spark open source community is awesome | reddit.com/r/apachespark | 2022-12-29

    here's a little README fix a user pushed to chispa

  • tdigest

    t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark (by CamDavidsonPilon)

  • OSCI

    Open Source Contributor Index

    Project mention: Trying out the new generative fill feature in Photoshop Beta | reddit.com/r/singularity | 2023-05-23

    The top two contributors to open source have been Microsoft and Google (OSCI https://opensourceindex.io/)

  • soda-spark

    Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

  • spark_app_twitter

    A data engineering project (Twitter monitor app)

    Project mention: Trying to dockerize an all python data engineering project | reddit.com/r/docker | 2022-05-28

    You can see the structure of everything in my repository: https://github.com/jmcmt87/spark_app_twitter

  • cuallee

    A data quality acceleration library to get data sets verified in a friendly interface

    Project mention: data-diff VS cuallee - a user suggested alternative | libhunt.com/r/data-diff | 2022-11-30

    Declarative data quality rules at scale

  • ceja

    PySpark phonetic and string matching algorithms (by MrPowers)

  • pyspark-k8s-boilerplate

    Boilerplate for PySpark on Cloud Kubernetes

  • pyspark-on-aws-emr

    The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.

    Project mention: Data Engineering Projects for Beginners | dev.to | 2022-06-15

    Building Big Data Pipelines in the Cloud with AWS EMR

  • Apache-Spark-Guide

    Apache Spark Guide

  • covid-19-data-engineering-pipeline

    A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via Github Actions.

    Project mention: COVID-19 data pipeline on AWS feat. Glue/PySpark, Docker, Great Expectations, Airflow, and Redshift, templated in CF/CDK, deployable via Github Actions | reddit.com/r/dataengineering | 2023-04-03

    I've seen amazing projects here already, which honestly were a great inspiration, and today I would like to show you my project. Some time ago, I had the idea to apply every tool I wanted to learn or try out to the same topic and since then this idea has grown into an entire pipeline: https://github.com/moritzkoerber/covid-19-data-engineering-pipeline

  • Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data

    Mobile robot data analyzed with Apache Spark to produce five statistics: travel time, waiting time, average speed, occupancy, and density.

  • etl-markup-toolkit

    ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration

  • TypedPyspark

    Type-annotate your spark dataframes and validate them

    Project mention: Static type hints for PySpark SQL dataframes | reddit.com/r/dataengineering | 2023-03-27

  • weather_data_pipeline

    This is a PySpark-based data pipeline that fetches weather data for a few cities, performs some basic processing and transformation on the data, and then writes the processed data to a Google Cloud Storage bucket and a BigQuery table. The data is then viewed in a Looker dashboard.

    Project mention: Building a Weather Data Pipeline with PySpark, Prefect, and Google Cloud | dev.to | 2023-05-01

    We'll be using PySpark for distributed data processing, Prefect for workflow management, and Google Cloud Storage and BigQuery for data storage and processing. The code is available on GitHub.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-05-23.

Index

What are some of the best open-source Pyspark projects in Python? This list will help you:

Rank  Stars  Project
   1  2,735  ibis
   2  1,619  petastorm
   3  1,376  Optimus
   4  1,223  sparkmagic
   5  1,185  pyspark-example-project
   6    460  quinn
   7    388  PySpark-Boilerplate
   8    370  chispa
   9    353  tdigest
  10    123  OSCI
  11     57  soda-spark
  12     56  spark_app_twitter
  13     33  cuallee
  14     28  ceja
  15     24  pyspark-k8s-boilerplate
  16     21  pyspark-on-aws-emr
  17     17  Apache-Spark-Guide
  18     16  covid-19-data-engineering-pipeline
  19      9  Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data
  20      5  etl-markup-toolkit
  21      4  TypedPyspark
  22      0  weather_data_pipeline