Python Pyspark

Open-source Python projects categorized as Pyspark

Top 23 Python Pyspark Projects

  • ibis

    the portable Python dataframe library

  • Project mention: Show HN: Hashquery, a Python library for defining reusable analysis | news.ycombinator.com | 2024-04-23

    I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]

    0: https://ibis-project.org/

  • petastorm

    Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

  • Optimus

    :truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

  • Project mention: Doing ML works in AWS. Need help installing cartopy | /r/aws | 2023-06-05

    Please file an issue at https://github.com/jupyter-incubator/sparkmagic

  • quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

  • chispa

    PySpark test helper methods with beautiful error messages

  • Project mention: Testing spark applications | /r/dataengineering | 2023-07-05

    Unit and e2e tests using a combination of pytest and chispa (https://github.com/MrPowers/chispa). Custom library to create random test data that fits schema with optional hardcoded overrides for relevant fields to test business logic.

  • datacompy

    Pandas and Spark DataFrame comparison for humans and more!

  • Project mention: How to Check 2 SQL Tables Are the Same | news.ycombinator.com | 2023-07-26
  • PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

  • tdigest

    t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark (by CamDavidsonPilon)

  • mack

    Delta Lake helper methods in PySpark

  • Project mention: Implementing and using SCD Type 2 | /r/dataengineering | 2023-07-04

    There is still a library from Databricks? But I have never used it: https://github.com/MrPowers/mack

  • OSCI

    Open Source Contributor Index

  • Project mention: Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position. | /r/openSUSE | 2023-07-05

    RH is still one of the biggest contributors to open source. Most sites I found place them in third place in terms of currently active contributors, only beaten by Google and Microsoft (companies with respectively 7x and 10x their number of employees). Not to shit on SUSE (who are in 12th place on the list I found, quite impressive for a company with only about 2000 employees), but pretending RH doesn't get Open Source is just untrue.

  • dataproc-templates

    Dataproc templates and pipelines for solving simple in-cloud data tasks

  • cuallee

    Possibly the fastest DataFrame-agnostic quality check library in town.

  • Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11
  • spark_app_twitter

    A data engineering project (Twitter monitor app)

  • soda-spark

    Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

  • ceja

    PySpark phonetic and string matching algorithms (by MrPowers)

  • pyspark-k8s-boilerplate

    Boilerplate for PySpark on Cloud Kubernetes

  • Apache-Spark-Guide

    Apache Spark Guide

  • pyspark-on-aws-emr

    The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing PySpark code.

  • covid-19-data-engineering-pipeline

    A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.

  • TypedPyspark

    Type-annotate your spark dataframes and validate them

  • Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data

    Mobile robot data were analyzed with Apache Spark to extract five statistics: travel time, waiting time, average speed, occupancy, and density.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Pyspark related posts

  • Show HN: Snowflake Data Quality Checks in Python

    1 project | news.ycombinator.com | 11 Feb 2024
  • Testing spark applications

    1 project | /r/dataengineering | 5 Jul 2023
  • Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position.

    1 project | /r/openSUSE | 5 Jul 2023
  • Your opinion of the Red Hat's latest fiasco

    2 projects | /r/Fedora | 26 Jun 2023
  • Trying out the new generative fill feature in Photoshop Beta

    1 project | /r/singularity | 23 May 2023
  • Brainstorming functions to make PySpark easier

    1 project | /r/apachespark | 13 Mar 2023
  • PySpark OSS Contribution Opportunity

    3 projects | /r/apachespark | 5 Mar 2023

Index

What are some of the best open-source Pyspark projects in Python? This list will help you:

Rank Project Stars
1 ibis 4,304
2 petastorm 1,755
3 Optimus 1,447
4 pyspark-example-project 1,370
5 sparkmagic 1,286
6 quinn 582
7 chispa 515
8 datacompy 399
9 PySpark-Boilerplate 391
10 tdigest 376
11 mack 273
12 OSCI 151
13 dataproc-templates 112
14 cuallee 111
15 spark_app_twitter 60
16 soda-spark 60
17 ceja 33
18 pyspark-k8s-boilerplate 30
19 Apache-Spark-Guide 28
20 pyspark-on-aws-emr 24
21 covid-19-data-engineering-pipeline 22
22 TypedPyspark 14
23 Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data 10
