PySpark

Top 23 PySpark Open-Source Projects

  • SynapseML

    Simple and Distributed Machine Learning

  • Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12
  • ibis

    the portable Python dataframe library

  • Project mention: Show HN: Hashquery, a Python library for defining reusable analysis | news.ycombinator.com | 2024-04-23

    I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]

    0: https://ibis-project.org/

  • spark-nlp

    State of the Art Natural Language Processing

  • Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06
  • linkis

    Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

  • petastorm

    Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

  • awesome-spark

    A curated list of awesome Apache Spark packages and resources.

  • Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

  • Project mention: Doing ML works in AWS. Need help installing cartopy | /r/aws | 2023-06-05

    Please file an issue at https://github.com/jupyter-incubator/sparkmagic

  • hopsworks

    Hopsworks - Data-Intensive AI platform with a Feature Store

  • H2O

    Sparkling Water provides H2O functionality inside Spark cluster

  • kuwala

    Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demograp

  • Project mention: Show HN: GeoSage – A ETL Webtool for Geo and Demographics Data from the Open Web | news.ycombinator.com | 2023-10-05

    --> Google Trends Data for Regions (Coming Soon)

    The tool goes beyond our previously published CLI tool (https://github.com/kuwala-io/kuwala/tree/master/kuwala) by providing a hostable solution with a user-friendly interface. We have not open-sourced it yet but a demo is available here: https://geosage.kuwala.io/.

    Urban planners can utilize movement data to analyze foot traffic in different city zones. Marketers can leverage demographic data to tailor campaigns more effectively. Developers can build their apps on top of it.

    To round it up .... GeoSage brings...

    Unified Data Management: Access data from OSM, Facebook, and soon Google, all in one place.

  • SparkLearning

    A comprehensive Spark guide collated from multiple sources that can be referred to learn more about Spark or as an interview refresher.

  • quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

  • chispa

    PySpark test helper methods with beautiful error messages

  • Project mention: Testing spark applications | /r/dataengineering | 2023-07-05

    Unit and e2e tests using a combination of pytest and chispa (https://github.com/MrPowers/chispa). Custom library to create random test data that fits schema with optional hardcoded overrides for relevant fields to test business logic.

  • PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

  • datacompy

    Pandas and Spark DataFrame comparison for humans and more!

  • Project mention: How to Check 2 SQL Tables Are the Same | news.ycombinator.com | 2023-07-26
  • tdigest

    t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark (by CamDavidsonPilon)

  • Gather-Deployment

    Gathers Python deployment, infrastructure and practices.

  • mack

    Delta Lake helper methods in PySpark

  • Project mention: Implementing and using SCD Type 2 | /r/dataengineering | 2023-07-04

    There is still a library from Databricks, but I have never used it: https://github.com/MrPowers/mack

  • incubator-graphar

    An open source, standard data file format for graph data storage and retrieval.

  • Project mention: GraphAr Release v0.5.0 | news.ycombinator.com | 2023-05-16
  • spark-extension

    A library that provides useful extensions to Apache Spark and PySpark.

  • Project mention: Data diffs: Algorithms for explaining what changed in a dataset (2022) | news.ycombinator.com | 2023-07-26

    We're doing an env migration and I've been using the Spark diff extension to reconcile data. It's amazing; we've discovered bugs in the data logic so quickly.

    Here is the extension if anyone is interested: https://github.com/G-Research/spark-extension/blob/master/DI...

  • WallStreetBets_BigDataAnalysis

    Research project aimed to classify the best stock research posts from r/WallStreetBets for you. 😏

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

PySpark related posts

  • Show HN: Snowflake Data Quality Checks in Python

    1 project | news.ycombinator.com | 11 Feb 2024
  • Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more!

    1 project | /r/Python | 6 Sep 2023
  • Testing spark applications

    1 project | /r/dataengineering | 5 Jul 2023
  • Due to Red Hat's decision to remove public access, SUSE CTO Dr. Thomas Di Giacomo shares their position.

    1 project | /r/openSUSE | 5 Jul 2023
  • Your opinion of the Red Hat's latest fiasco

    2 projects | /r/Fedora | 26 Jun 2023
  • Trying out the new generative fill feature in Photoshop Beta

    1 project | /r/singularity | 23 May 2023
  • PySpark for NLP Workshop - Materials and Jupyter Notebooks

    2 projects | /r/dataengineering | 14 May 2023

Index

What are some of the best open-source PySpark projects? This list will help you:

 #  Project                          Stars
 1  SynapseML                        4,970
 2  ibis                             4,208
 3  spark-nlp                        3,695
 4  linkis                           3,230
 5  petastorm                        1,751
 6  awesome-spark                    1,614
 7  Optimus                          1,446
 8  pyspark-example-project          1,370
 9  sparkmagic                       1,286
10  hopsworks                        1,074
11  H2O                                952
12  kuwala                             755
13  SparkLearning                      618
14  quinn                              578
15  chispa                             508
16  PySpark-Boilerplate                390
17  datacompy                          386
18  tdigest                            376
19  Gather-Deployment                  350
20  mack                               271
21  incubator-graphar                  175
22  spark-extension                    168
23  WallStreetBets_BigDataAnalysis     165
