Python Spark

Open-source Python projects categorized as Spark

Top 23 Python Spark Projects

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: Redash: Connect to data source, easily visualize, dashboard and share your data | news.ycombinator.com | 2024-03-20

  • horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

    In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

  • dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

  • sqlglot

    Python SQL Parser and Transpiler

    Project mention: Transpile Any SQL to PostgreSQL Dialect | news.ycombinator.com | 2024-03-18

    Recommend checking out https://github.com/tobymao/sqlglot if you are interested in this capability for other SQL dialects

    Tools like this are helpful for:

    - Rendering SQL in a consistent way, e.g. for snapshot testing

  • TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

  • koalas

    Koalas: pandas API on Apache Spark

  • dpark

    Python clone of Spark, a MapReduce-like framework in Python

  • fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

    Project mention: FLaNK Stack Weekly 22 January 2024 | dev.to | 2024-01-22

  • Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Project mention: Doing ML works in AWS. Need help installing cartopy | /r/aws | 2023-06-05

    Please file an issue at https://github.com/jupyter-incubator/sparkmagic

  • listenbrainz-server

    Server for the ListenBrainz project, including the front-end (JavaScript/React) code that it serves and all of the data processing components that LB uses.

    Project mention: Analyzing Spotify Stream History | news.ycombinator.com | 2024-02-12

    There's also ListenBrainz, run by the MusicBrainz org, which offers similar functionality without API restrictions or other paid features that Last.FM tries to push.

    https://listenbrainz.org/

    If you wish to use your scrobble data at all programmatically this is a far better tool to use.

  • popmon

    Monitor the stability of a Pandas or Spark dataframe ⚙︎

  • streamify

    A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

  • datacompy

    Pandas and Spark DataFrame comparison for humans and more!

    Project mention: How to Check 2 SQL Tables Are the Same | news.ycombinator.com | 2023-07-26

  • flytekit

    Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

  • visions

    Type System for Data Analysis in Python

    Project mention: Complete Beginner tasked with ML at work - where do I start | /r/learnmachinelearning | 2023-06-27

    This one works pretty well: https://github.com/dylan-profiler/visions

  • cape-dataframes

    Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.

    Project mention: Show HN: Cape API – Keep your sensitive data private while using GPT-4 | news.ycombinator.com | 2023-06-27

    - How can we mitigate hallucinations and bias so that we have higher trust in AI generated text?

    The features of the Cape API are designed to help solve these problems for developers, and we have a number of early customers using the API in production already.

    To get started, check out our docs: https://docs.capeprivacy.com/

    View the API reference: https://api.capeprivacy.com/redoc

    Join the discussion on our Discord: https://discord.gg/nQW7YxUYjh

    And of course try the CapeChat playground at https://chat.capeprivacy.com/

  • emr-serverless-samples

    Example code for running Spark and Hive jobs on EMR Serverless.

  • prosto

    Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-20.

Index

What are some of the best open-source Spark projects in Python? This list will help you:

Rank Project Stars
1 data-science-ipython-notebooks 26,307
2 Redash 24,789
3 horovod 13,899
4 Mage 6,802
5 dev-setup 6,032
6 sqlglot 5,236
7 TensorFlowOnSpark 3,862
8 koalas 3,312
9 dpark 2,691
10 fugue 1,853
11 Optimus 1,434
12 pyspark-example-project 1,312
13 sparkmagic 1,281
14 splink 1,060
15 listenbrainz-server 634
16 popmon 483
17 streamify 435
18 datacompy 352
19 flytekit 197
20 visions 194
21 cape-dataframes 174
22 emr-serverless-samples 132
23 prosto 89
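
The list is ordered by GitHub star count, highest first. As a minimal sketch of that ordering rule, the snippet below ranks a handful of rows copied from the table above. The function name `rank_by_stars` and the dictionary layout are illustrative assumptions, not part of any tool used to build this list.

```python
# Star counts copied from the table above (a small subset, deliberately
# out of order so the sort has something to do).
stars = {
    "sqlglot": 5236,
    "Redash": 24789,
    "horovod": 13899,
    "data-science-ipython-notebooks": 26307,
    "Mage": 6802,
}


def rank_by_stars(star_counts):
    """Return (rank, project, stars) tuples, highest star count first.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(star_counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, name, count)
        for rank, (name, count) in enumerate(ordered, start=1)
    ]


for rank, name, count in rank_by_stars(stars):
    # Format the count with a thousands separator, as in the table.
    print(f"{rank} {name} {count:,}")
```

Because only five projects are included, the ranks printed here are relative to this subset and will not match the full table beyond the first few rows.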