Python Spark

Open-source Python projects categorized as Spark

Top 21 Python Spark Projects

  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Project mention: Beginner in Python for Data Science | 2020-12-27

    data science ipython notebooks

  • GitHub repo Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: How often do you use SQL query tool or service in your daily work? | 2021-11-21

    Regarding the subqueries: they materialize queried data, so you can reuse a subquery multiple times.
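    One way to reuse a subquery is a `WITH` clause (common table expression). A minimal sketch using stdlib `sqlite3`; the table and column names are illustrative, not from the original post:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 30.0), (3, 50.0);
""")

# Define the subquery once in a WITH clause, then reference it twice.
rows = conn.execute("""
    WITH big_orders AS (
        SELECT id, amount FROM orders WHERE amount > 20
    )
    SELECT (SELECT COUNT(*) FROM big_orders),
           (SELECT SUM(amount) FROM big_orders)
""").fetchone()

print(rows)  # (2, 80.0)
```

    Note that SQLite decides itself whether to materialize a CTE; the point here is only that the subquery is written once and referenced repeatedly.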

  • GitHub repo horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

    Project mention: [D] GPU buying recommendation | 2021-07-17

    If you just want to run TensorFlow or PyTorch in a Jupyter notebook, setting up the environment shouldn't be difficult. I know that AWS has a marketplace of preconfigured images. However, you can go as advanced as setting up a cluster of GPU-equipped nodes running Horovod to do distributed machine learning. Yes, there's a learning curve, but you cannot acquire this skill set any other way.

  • GitHub repo dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

    Project mention: MacOS Development workspace 2021 | 2021-03-08

    donnemartin - dev setup

  • GitHub repo koalas

    Koalas: pandas API on Apache Spark

    Project mention: Spark vs Pandas | 2021-02-18

    If you like excessive use of square brackets.. I mean pandas, you might want to check out Koalas. Koalas is supposed to provide a pandas DataFrame API implementation on top of Spark.

  • GitHub repo dpark

    A Python clone of Spark: a MapReduce-like framework in Python

  • GitHub repo feast

    Feature Store for Machine Learning

    Project mention: [P] Announcing Feast 0.10: The simplest way to serve features in production | 2021-04-15


  • GitHub repo Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • GitHub repo sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Project mention: Spark is lit once again | 2021-10-29

    Things get a bit more complicated with interactive sessions. We've created a Sparkmagic-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J Gateway, and executed.
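    The command-polling pattern described above can be sketched with the stdlib alone. This is not Lighter's actual code; a plain `queue.Queue` stands in for the REST API and Py4J channel, and a sentinel replaces the real session's infinite loop:

```python
import queue

# Commands would arrive from the Sparkmagic kernel via the REST API and a
# Java collection; here a local queue stands in for that channel.
commands = queue.Queue()
commands.put("x = 2 + 3")
commands.put("result = x * 10")
commands.put(None)  # sentinel so this sketch terminates

session_ns = {}  # per-session namespace shared across commands
while True:
    cmd = commands.get()   # the real loop would block/poll for new commands
    if cmd is None:
        break
    exec(cmd, session_ns)  # run the statement in the session namespace

print(session_ns["result"])  # 50
```

    Keeping one namespace across iterations is what makes the session interactive: later commands see variables defined by earlier ones.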

  • GitHub repo Hail

    Scalable genomic data analysis.

    Project mention: Ask HN: Who is hiring? (July 2021) | 2021-07-01

    Broad Institute of MIT and Harvard | Cambridge, MA | Associate Software Engineer | Onsite

    We are seeking an associate software engineer interested in contributing to an open-source data visualization library for analyzing the biological impact of human genetic variation. You will contribute to projects like gnomAD, the world's largest catalogue of human genetic variation, used by hundreds of thousands of researchers, and help us scale towards millions of genomes in the coming years. We are also developing next-generation tools for enabling genetic analyses of large biobanks across richly phenotyped individuals. In this role you will gain experience developing data-intensive web applications with TypeScript, React, Python, Terraform, and Google Cloud Platform, and will make use of the scalable data analysis library Hail. Key to our success is growing a strong team with a diverse membership who foster a culture of continual learning, and who support the growth and success of one another. Towards this end, we are committed to seeking applications from women and from underrepresented groups. We know that many excellent candidates choose not to apply despite their capabilities; please allow us to enthusiastically counter this tendency.

    Please provide a CV and links to previous work or projects, ideally with contributions visible on GitHub.

    email: [email protected]

  • GitHub repo listenbrainz-server

    Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

    Project mention: Dislike button would improve Spotify's recommendations | 2021-10-16

    Listenbrainz[0] looks like an interesting project for building better (or at least more open) recommendation systems[1].



  • GitHub repo fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask without any rewrites. (by fugue-project)

    Project mention: FugueSQL: SQL-ish for pandas, dask, spark | 2021-10-11

    Hey, I am the author of Fugue.

    Fugue is a higher level abstraction compared to Ray. It provides unified and non-invasive interfaces for people to use Spark, Dask and Pandas. Ray/Modin is also on our roadmap.

    It provides both a Python interface (not pandas-like) and Fugue SQL (standard SQL plus extra features). Users can choose whichever they are most comfortable with as the semantic layer for distributed computing; the two are equivalent.

    With Fugue, most of your logic will be in simple Python/SQL that is framework- and scale-agnostic. From the mindset to the code, Fugue minimizes your dependency on any specific computing framework, including Fugue itself.

    Please let me know if you want to learn more. Our Slack is in the README of the Fugue repo.

    Fugue repo:
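    The framework-agnostic idea above can be sketched with the stdlib alone. This is not Fugue's actual API; the point is only that the business logic is a plain function over rows, and the "engine" is swapped in separately (the names and the markup rate are illustrative):

```python
from typing import Dict, Iterable, List

# Framework-agnostic logic: a plain function over rows, with no Spark,
# Dask, or pandas imports. The idea is that such a function could then be
# handed to whichever backend you choose.
def add_markup(rows: Iterable[Dict]) -> List[Dict]:
    return [{**row, "total": row["price"] * 1.5} for row in rows]

# A trivial "local engine" stands in for a distributed backend here.
def run_local(data, fn):
    return fn(data)

data = [{"price": 10.0}, {"price": 24.0}]
out = run_local(data, add_markup)
print(out[0]["total"])  # 15.0
```

    Because `add_markup` knows nothing about the engine, the same logic can be tested locally on a list of dicts before being scaled out.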

  • GitHub repo popmon

    Monitor the stability of a pandas or Spark dataframe ⚙︎

    Project mention: Monitor the stability of a pandas or Spark dataframe | 2021-09-15

  • GitHub repo cape-python

    Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark

    Project mention: Data Anonymization Libraries | 2021-11-10

    I was wondering what other helpful and easy-to-use libraries are out there for data anonymization, like faker and cape-python?
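    The kind of policy such libraries apply can be sketched with the stdlib alone. This is not cape-python's API; the field names, salt, and transformations are illustrative:

```python
import hashlib

SALT = b"example-salt"  # in practice, a secret kept out of the dataset

def pseudonymize(value: str) -> str:
    # Salted hash: deterministic pseudonym, not reversible to the input.
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def apply_policy(record: dict) -> dict:
    return {
        "email": pseudonymize(record["email"]),  # direct identifier: hash it
        "age": (record["age"] // 10) * 10,       # quasi-identifier: bucket it
    }

row = {"email": "alice@example.com", "age": 37}
print(apply_policy(row))
```

    Deterministic pseudonyms keep joins on the masked column working, while bucketing coarsens values that could re-identify someone in combination.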

  • GitHub repo flytekit

    Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible.

    Project mention: Release the TextHTMLPress package to PyPI | 2021-11-26

    Based on references about setting up a Python project, package structure, and a production-level Python package, I refactored the package as shown below:

  • GitHub repo prosto

    Prosto is a data processing toolkit that radically changes how data is processed by relying heavily on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: No-Code Self-Service BI/Data Analytics Tool | 2021-11-13

    Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI, etc.) were conceived as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: from ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE`` we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.

    Yet the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems have provided different answers to this question, but all of them are highly specific and rather limited.

    Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach: the relational model works with sets (rows of tables), not with columns.

    One generic approach to working with columns in multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the *Prosto* data processing toolkit, which is an alternative to map-reduce and SQL:

    It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no join and no groupby operations are needed, which significantly simplifies data transformations and makes them more natural.

    Moreover, it now provides *Column-SQL*, which makes it even easier to define new columns in terms of other columns:
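    The multi-table calculated-column idea described above can be sketched with plain dicts. This is not Prosto's API; the table and column names are illustrative. The new column is a function over a link to another table, with no join or groupby involved:

```python
# Two "tables": products is keyed by id, orders reference products.
products = {1: {"name": "pen", "price": 2.0},
            2: {"name": "book", "price": 10.0}}
orders = [{"product_id": 1, "qty": 3},
          {"product_id": 2, "qty": 1}]

# Define Orders::total as a function of Products::price -- conceptually a
# calculated column over a link column, not the result of a join.
def total(order):
    return products[order["product_id"]]["price"] * order["qty"]

for order in orders:
    order["total"] = total(order)

print([o["total"] for o in orders])  # [6.0, 10.0]
```

    Each row's value is computed by following the link and applying a function, which is the column-oriented alternative to joining the two tables first.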

  • GitHub repo learningOrchestra

    learningOrchestra is a distributed Machine Learning processing tool that facilitates and streamlines iterative processes in a Data Science project.

    Project mention: Someone with a good experience in python can rate my code? | 2021-02-16

  • GitHub repo openverse-catalog

    Identifies and collects data on CC-licensed content across web crawl data and public APIs.

    Project mention: Hacktoberfest Recap | 2021-10-31

    Issue, Pull Request, Blog Post

  • GitHub repo ds2ai

    The MLOps platform for innovators 🚀

    Project mention: Release: End-to-End MLOps Platform | 2021-07-13

  • GitHub repo fastdbfs

    fastdbfs - An interactive command line client for Databricks DBFS.

    Project mention: fastdbfs - An interactive command line client for Databricks DBFS | 2021-05-07

    fastdbfs is an interactive command line client for accessing Databricks DBFS. It aims to be much friendlier and faster than the official CLI tool, as well as more feature-rich.

  • GitHub repo etl-markup-toolkit

    ETL Markup Toolkit is a Spark-native tool for expressing ETL transformations as configuration

    Project mention: How do you serialize and save "transformations" in your pipeline? | 2021-08-31

    I have a side project, if you're interested, that takes transformations as YAML files and outputs step-level logs about each step of the transformation. I've always felt that both artifacts could be made searchable using an ELK stack or something... Do you have similar artifacts? Or perhaps there's a way to turn SQL into a structured or semi-structured form to aid in searchability?
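    The configuration-driven idea can be sketched with the stdlib alone. This is not etl-markup-toolkit's format; plain dicts stand in for the YAML files, and the operation names and step-level log lines are illustrative:

```python
# Each step names an operation and its parameters; in a real tool this
# list would be parsed from a YAML file.
config = [
    {"op": "filter", "column": "amount", "min": 20},
    {"op": "rename", "from": "amount", "to": "total"},
]

OPS = {
    "filter": lambda rows, s: [r for r in rows if r[s["column"]] >= s["min"]],
    "rename": lambda rows, s: [{**{k: v for k, v in r.items() if k != s["from"]},
                                s["to"]: r[s["from"]]} for r in rows],
}

def run(rows, config):
    for step in config:
        rows = OPS[step["op"]](rows, step)
        print(f"after {step['op']}: {len(rows)} rows")  # step-level log
    return rows

data = [{"amount": 10}, {"amount": 30}]
result = run(data, config)
print(result)  # [{'total': 30}]
```

    Because each step is data rather than code, the pipeline definition and its per-step logs are both serializable, which is what makes them indexable by something like an ELK stack.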

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-11-26.

Python Spark related posts


What are some of the best open-source Spark projects in Python? This list will help you:

Project Stars
1 data-science-ipython-notebooks 21,842
2 Redash 19,987
3 horovod 11,881
4 dev-setup 5,566
5 koalas 3,027
6 dpark 2,664
7 feast 2,490
8 Optimus 1,139
9 sparkmagic 1,067
10 Hail 761
11 listenbrainz-server 452
12 fugue 396
13 popmon 197
14 cape-python 144
15 flytekit 62
16 prosto 53
17 learningOrchestra 48
18 openverse-catalog 15
19 ds2ai 8
20 fastdbfs 4
21 etl-markup-toolkit 3