Python Spark

Open-source Python projects categorized as Spark

Top 23 Python Spark Projects

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: Recommend Django Great Projects | 2022-12-03

  • horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

    Project mention: [D] What is the recommended approach to training NN on big data set? | 2022-12-08

    And in case scaling is really important to you, may I suggest you look into Horovod?

  • dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

    Project mention: Automate Mac setup? | 2022-04-10

    Something like this at least is the most direct answer to your question, as opposed to "you're doing it wrong" which unfortunately seems to be more upvoted. An example of something like this might be

  • TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

    Project mention: [D] Speed up inference on Spark | 2022-02-18

    Currently I use the TensorFlowOnSpark framework to train models and run predictions. At prediction time, I have billions of samples to score, which is time-consuming. I wonder if there are good practices for this.

  • koalas

    Koalas: pandas API on Apache Spark

    Project mention: My new company uses Pyspark. I want to learn it before my starting date. Any advice? | 2022-11-10

    If they're using Databricks and you're familiar with pandas, koalas should be right up your alley.

  • dpark

    Python clone of Spark, a MapReduce-like framework in Python

  • ibis

    Expressive analytics in Python at any scale.

    Project mention: Why use Python over SQL? | 2022-12-25

    By the way, I was introduced to Ibis recently; I have mixed feelings about it.

  • Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

  • fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark, Dask and Ray without any rewrites.

    Project mention: Ask HN: How do you test SQL? | 2023-01-31

  • pyspark-example-project

    Example project implementing best practices for PySpark ETL jobs and applications.

    Project mention: Learning Pyspark for a new role | 2022-12-23

    You can use this as an example to organize your project. I have referred to this in the past.

  • Hail

    Scalable genomic data analysis.

    Project mention: We're wasting money by only supporting gzip for raw DNA files | 2023-01-09

  • listenbrainz-server

    Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

    Project mention: Spotify discover weekly alternative | 2023-01-22

    I don't think it's self-hosted, but listenbrainz is open source and provides recommendations based on your listening history.

  • popmon

    Monitor the stability of a Pandas or Spark dataframe ⚙︎

  • datacompy

    Pandas and Spark DataFrame comparison for humans

    Project mention: Comparing 2 CSV files | 2022-07-11

    datacompy is a package to compare 2 pandas dataframes

  • visions

    Type System for Data Analysis in Python

    Project mention: Visions – User defined data type systems | 2022-02-04

  • cape-python

    Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark

    Project mention: Secure Sentiment Analysis with Enclaves | 2022-11-22

    There are three essential components that enable this: cape encrypt, cape deploy, and cape run. The command cape encrypt encrypts inputs that can be sent into the Cape enclave for processing, cape deploy performs all needed actions for deploying a function into the enclave, and finally cape run invokes the deployed function with an input that was previously encrypted with cape encrypt. Learn more on the Cape docs.

  • flytekit

    Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

    Project mention: From Incubation to Graduation, and Beyond: FlytePath | 2022-02-07

    Modin: Speeds up Pandas

  • emr-serverless-samples

    Example code for running Spark and Hive jobs on EMR Serverless.

    Project mention: Should I study these topics or should I skip? | 2022-11-08

  • mlToolKits

    learningOrchestra is a distributed Machine Learning integration tool that facilitates and streamlines iterative processes in a Data Science project.

  • prosto

    Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: Show HN: PRQL 0.2 – Releasing a better SQL | 2022-06-27

    > Joins are what makes relational modeling interesting!

    It is the central part of RM which is difficult to model using other methods and which requires high expertise in non-trivial use cases. One alternative to how multiple tables can be analyzed without joins is proposed in the concept-oriented model [1] which relies on two equal modeling constructs: sets (like RM) and functions. In particular, it is implemented in the Prosto data processing toolkit [2] and its Column-SQL language. The idea is that links between tables are used instead of joins. A link is formally a function from one set to another set.

    [1] Joins vs. Links or Relational Join Considered Harmful

    [2] data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
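
The comment quoted under prosto describes analyzing multiple tables through links rather than joins, where a link is a function from one set to another. A minimal sketch of that idea in plain Python follows. It does not use Prosto's API or its Column-SQL language; the `customers` and `orders` tables, their columns, and the helper names are all hypothetical and chosen only for illustration.

```python
# Plain-Python sketch of "links instead of joins"; this is NOT Prosto's API.
# The tables and column names below are hypothetical.

customers = [
    {"customer_id": 1, "name": "Ada", "country": "UK"},
    {"customer_id": 2, "name": "Linus", "country": "FI"},
]

orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.0},
]


def make_link(target_table, key):
    """Return a function mapping a source row to its row in target_table."""
    index = {row[key]: row for row in target_table}

    def link(source_row):
        # Source rows without a matching target row map to None.
        return index.get(source_row[key])

    return link


# The link is a function from the set of orders to the set of customers.
order_customer = make_link(customers, "customer_id")


def order_country(order):
    """Derived column: follow the link, then read a column of the target."""
    customer = order_customer(order)
    return customer["country"] if customer is not None else None


def total_by_country(order_rows):
    """Sum order amounts per linked country without building a joined table."""
    totals = {}
    for order in order_rows:
        country = order_country(order)
        totals[country] = totals.get(country, 0.0) + order["amount"]
    return totals


totals = total_by_country(orders)
```

In this sketch no combined orders-and-customers table is ever materialized: each order reaches its customer by applying the link function, and the aggregation reads the linked column directly.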

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-01-31.

What are some of the best open-source Spark projects in Python? This list will help you:

 #  Project                          Stars
 1  data-science-ipython-notebooks  24,571
 2  Redash                          22,501
 3  horovod                         12,968
 4  dev-setup                        5,872
 5  TensorFlowOnSpark                3,841
 6  koalas                           3,247
 7  dpark                            2,688
 8  ibis                             2,361
 9  Optimus                          1,337
10  sparkmagic                       1,199
11  fugue                            1,150
12  pyspark-example-project          1,108
13  Hail                               854
14  listenbrainz-server                529
15  splink                             497
16  popmon                             412
17  datacompy                          264
18  visions                            170
19  cape-python                        155
20  flytekit                           120
21  emr-serverless-samples              74
22  mlToolKits                          71
23  prosto                              65