Python PySpark

Open-source Python projects categorized as PySpark

Top 23 Python PySpark Projects

  • ibis

    the portable Python dataframe library

  • Project mention: Show HN: Hashquery, a Python library for defining reusable analysis | news.ycombinator.com | 2024-04-23

    I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]

    0: https://ibis-project.org/

  • petastorm

    The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

  • Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

  • Project mention: Doing ML works in AWS. Need help installing cartopy | /r/aws | 2023-06-05

    Please file an issue at https://github.com/jupyter-incubator/sparkmagic

  • quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

  • chispa

    PySpark test helper methods with beautiful error messages

  • Project mention: Testing spark applications | /r/dataengineering | 2023-07-05

    Unit and e2e tests using a combination of pytest and chispa (https://github.com/MrPowers/chispa). Custom library to create random test data that fits schema with optional hardcoded overrides for relevant fields to test business logic.

  • PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

  • tdigest

    t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark (by CamDavidsonPilon)

  • mack

    Delta Lake helper methods in PySpark

  • Project mention: Implementing and using SCD Type 2 | /r/dataengineering | 2023-07-04

    Isn't there still a library from Databricks? I have never used it, though: https://github.com/MrPowers/mack

  • OSCI

    Open Source Contributor Index

  • Project mention: Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position. | /r/openSUSE | 2023-07-05

    RH is still one of the biggest contributors to open source. Most sites I found place them in third place in terms of currently active contributors, only beaten by Google and Microsoft (companies with respectively 7x and 10x their number of employees). Not to shit on Suse (who are in 12th place on the list I found, quite impressive for a company with only about 2000 employees), but pretending RH doesn't get Open Source is just untrue.

  • dataproc-templates

    Dataproc templates and pipelines for solving simple in-cloud data tasks

  • cuallee

    Possibly the fastest DataFrame-agnostic quality check library in town.

  • Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11
  • soda-spark

    Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

  • spark_app_twitter

    A data engineering project (Twitter monitor app)

  • ceja

    PySpark phonetic and string matching algorithms (by MrPowers)

  • pyspark-k8s-boilerplate

    Boilerplate for PySpark on Cloud Kubernetes

  • Apache-Spark-Guide

    Apache Spark Guide

  • pyspark-on-aws-emr

    The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.

  • covid-19-data-engineering-pipeline

    A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.

  • TypedPyspark

    Type-annotate your spark dataframes and validate them

  • Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data

    Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.

  • etl-markup-toolkit

    ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source PySpark projects in Python? This list will help you:

Rank Project Stars
1 ibis 4,074
2 petastorm 1,751
3 Optimus 1,446
4 pyspark-example-project 1,370
5 sparkmagic 1,284
6 quinn 576
7 chispa 508
8 PySpark-Boilerplate 390
9 tdigest 375
10 mack 269
11 OSCI 150
12 dataproc-templates 111
13 cuallee 107
14 soda-spark 60
15 spark_app_twitter 60
16 ceja 32
17 pyspark-k8s-boilerplate 30
18 Apache-Spark-Guide 26
19 pyspark-on-aws-emr 24
20 covid-19-data-engineering-pipeline 22
21 TypedPyspark 14
22 Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data 10
23 etl-markup-toolkit 5
