Python PySpark

Open-source Python projects categorized as PySpark

Top 23 Python PySpark Projects

  1. ibis

    the portable Python dataframe library

    Project mention: Modern Polars – A side-by-side comparison of the Polars and Pandas libraries | news.ycombinator.com | 2025-01-23

    I just want to add an additional entry to the "Other cool stuff you might like" in the summary: https://ibis-project.org/

    It's a portable dataframe library that defaults to a DuckDB backend, but you can also use polars and pandas (among the 20 backends that it supports).

  2. petastorm

    Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

  3. pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  4. Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  5. sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

  6. quinn

    PySpark methods to enhance developer productivity 📣 👯 🎉 (by mrpowers-io)

  7. chispa

    PySpark test helper methods with beautiful error messages

  8. datacompy

    Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

  9. PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

  10. tdigest

    t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark (by CamDavidsonPilon)

  11. mack

    Delta Lake helper methods in PySpark

  12. cuallee

    Possibly the fastest DataFrame-agnostic quality check library in town.

    Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11

  13. OSCI

    Open Source Contributor Index

  14. dataproc-templates

    Dataproc templates and pipelines for solving simple in-cloud data tasks

  15. spark_app_twitter

    A data engineering project (Twitter monitor app)

  16. soda-spark

    Soda Spark is a PySpark library that helps you test your data in Spark DataFrames

  17. pyjaws

    PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows

  18. ceja

    PySpark phonetic and string matching algorithms (by MrPowers)

  19. pyspark-k8s-boilerplate

    Boilerplate for PySpark on Cloud Kubernetes

  20. Apache-Spark-Guide

    Apache Spark Guide

  21. pyspark-on-aws-emr

    The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing PySpark code.

  22. covid-19-data-engineering-pipeline

    A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.

  23. TypedPyspark

    Type-annotate your Spark DataFrames and validate them

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Pyspark related posts

  • Show HN: Snowflake Data Quality Checks in Python

    1 project | news.ycombinator.com | 11 Feb 2024
  • Testing spark applications

    1 project | /r/dataengineering | 5 Jul 2023
  • Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position.

    1 project | /r/openSUSE | 5 Jul 2023
  • Your opinion of the Red Hat's latest fiasco

    2 projects | /r/Fedora | 26 Jun 2023
  • Trying out the new generative fill feature in Photoshop Beta

    1 project | /r/singularity | 23 May 2023
  • Brainstorming functions to make PySpark easier

    1 project | /r/apachespark | 13 Mar 2023
  • PySpark OSS Contribution Opportunity

    3 projects | /r/apachespark | 5 Mar 2023

Index

What are some of the best open-source PySpark projects in Python? This list will help you:

#   Project                              Stars
1   ibis                                 5,496
2   petastorm                            1,801
3   pyspark-example-project              1,722
4   Optimus                              1,492
5   sparkmagic                           1,339
6   quinn                                  659
7   chispa                                 657
8   datacompy                              502
9   PySpark-Boilerplate                    396
10  tdigest                                390
11  mack                                   315
12  cuallee                                181
13  OSCI                                   163
14  dataproc-templates                     123
15  spark_app_twitter                       77
16  soda-spark                              63
17  pyjaws                                  41
18  ceja                                    39
19  pyspark-k8s-boilerplate                 33
20  Apache-Spark-Guide                      30
21  pyspark-on-aws-emr                      26
22  covid-19-data-engineering-pipeline      23
23  TypedPyspark                            14
