Top 23 Python Spark Projects
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
Redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Project mention: Redash: Connect to data source, easily visualize, dashboard and share your data | news.ycombinator.com | 2024-03-20
-
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12
In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.
-
dev-setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
-
sqlglot
Recommend checking out https://github.com/tobymao/sqlglot if you are interested in this capability for other SQL dialects.
Tools like this are helpful for:
- Rendering SQL in a consistent way, e.g. for snapshot testing
-
-
-
-
fugue
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
-
Optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
-
sparkmagic
Please file an issue at https://github.com/jupyter-incubator/sparkmagic
-
splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Project mention: Splink: Fast, accurate, scalable probabilistic data linkage | news.ycombinator.com | 2024-03-13
-
listenbrainz-server
Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.
There's also ListenBrainz, run by the MusicBrainz org, which offers similar functionality without API restrictions or other paid features that Last.FM tries to push.
If you wish to use your scrobble data at all programmatically this is a far better tool to use.
-
-
streamify
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
-
-
flytekit
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.
-
visions
Project mention: Complete Beginner tasked with ML at work - where do I start | /r/learnmachinelearning | 2023-06-27
This one works pretty well: https://github.com/dylan-profiler/visions
-
cape-dataframes
Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.
Project mention: Show HN: Cape API – Keep your sensitive data private while using GPT-4 | news.ycombinator.com | 2023-06-27
- How can we mitigate hallucinations and bias so that we have higher trust in AI-generated text?
The features of the Cape API are designed to help solve these problems for developers, and we have a number of early customers using the API in production already.
To get started, check out our docs: https://docs.capeprivacy.com/
View the API reference: https://api.capeprivacy.com/redoc
Join the discussion on our Discord: https://discord.gg/nQW7YxUYjh
And of course try the CapeChat playground at https://chat.capeprivacy.com/
-
-
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Python Spark related posts
- Splink: Fast, accurate, scalable probabilistic data linkage
- FLaNK Stack Weekly 22 January 2024
- FLaNK Stack Weekly for 12 September 2023
- Data diffs: Algorithms for explaining what changed in a dataset (2022)
- A platform for building Gen-AI applications on Spark
- Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
- Doing ML works in AWS. Need help installing cartopy
Index
What are some of the best open-source Spark projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | data-science-ipython-notebooks | 26,307 |
2 | Redash | 24,789 |
3 | horovod | 13,899 |
4 | Mage | 6,802 |
5 | dev-setup | 6,032 |
6 | sqlglot | 5,236 |
7 | TensorFlowOnSpark | 3,862 |
8 | koalas | 3,312 |
9 | dpark | 2,691 |
10 | fugue | 1,853 |
11 | Optimus | 1,434 |
12 | pyspark-example-project | 1,312 |
13 | sparkmagic | 1,281 |
14 | splink | 1,060 |
15 | listenbrainz-server | 634 |
16 | popmon | 483 |
17 | streamify | 435 |
18 | datacompy | 352 |
19 | flytekit | 197 |
20 | visions | 194 |
21 | cape-dataframes | 174 |
22 | emr-serverless-samples | 132 |
23 | prosto | 89 |