Top 23 Python Spark Projects
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Project mention: Recommend Django Great Projects | news.ycombinator.com | 2022-12-03
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Project mention: [D] What is the recommended approach to training NN on big data set? | reddit.com/r/MachineLearning | 2022-12-08
And in case scaling is really important to you, may I suggest you look into Horovod?
Something like this at least is the most direct answer to your question, as opposed to "you're doing it wrong" which unfortunately seems to be more upvoted. An example of something like this might be https://github.com/donnemartin/dev-setup
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Project mention: [D]Speed up inference on Spark | reddit.com/r/MachineLearning | 2022-02-18
Currently I use the TensorFlowOnSpark framework to train models and run predictions. At prediction time, I have billions of samples to predict, which is time-consuming. I wonder if there are some good practices for this.
Koalas: pandas API on Apache Spark
Project mention: My new company uses Pyspark. I want to learn it before my starting date. Any advice? | reddit.com/r/datascience | 2022-11-10
If they're using Databricks and you're familiar with pandas, Koalas should be right up your alley.
Python clone of Spark, a MapReduce-like framework in Python
Expressive analytics in Python at any scale.
Project mention: Why use Python over SQL? | reddit.com/r/datascience | 2022-12-25
By the way, I was introduced to Ibis https://github.com/ibis-project/ibis recently; I have mixed feelings about it.
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
Jupyter magics and kernels for working with remote Spark clusters
A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark, Dask and Ray without any rewrites.
Project mention: Ask HN: How do you test SQL? | news.ycombinator.com | 2023-01-31
Example project implementing best practices for PySpark ETL jobs and applications.
Project mention: Learning Pyspark for a new role | reddit.com/r/dataengineering | 2022-12-23
https://github.com/AlexIoannides/pyspark-example-project You can use this as an example to organize your project. I have referred to this in the past.
Scalable genomic data analysis.
Project mention: We're wasting money by only supporting gzip for raw DNA files | news.ycombinator.com | 2023-01-09
I don't think it's self-hosted, but listenbrainz is open source and provides recommendations based on your listening history.
Monitor the stability of a Pandas or Spark dataframe ⚙︎
Pandas and Spark DataFrame comparison for humans
Project mention: Comparing 2 CSV files | reddit.com/r/Python | 2022-07-11
datacompy is a package to compare 2 pandas dataframes
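To make the idea of a key-based comparison concrete, here is a minimal sketch using only the Python standard library. It is not datacompy's API: the function name `compare_csv`, the result keys, and the sample data are all hypothetical and chosen for illustration. It matches rows from two CSV documents on a key column and reports rows found on only one side plus cell-level differences.

```python
import csv
import io


def compare_csv(text_a, text_b, key):
    """Compare two CSV documents, matching rows on the `key` column.

    Hypothetical helper for illustration only; not part of datacompy.
    """
    rows_a = {row[key]: row for row in csv.DictReader(io.StringIO(text_a))}
    rows_b = {row[key]: row for row in csv.DictReader(io.StringIO(text_b))}

    # Keys present on one side only.
    only_a = sorted(rows_a.keys() - rows_b.keys())
    only_b = sorted(rows_b.keys() - rows_a.keys())

    # For keys present on both sides, collect the columns whose values differ.
    changed = {}
    for k in sorted(rows_a.keys() & rows_b.keys()):
        a, b = rows_a[k], rows_b[k]
        diffs = {
            col: (a.get(col), b.get(col))
            for col in sorted(a.keys() | b.keys())
            if a.get(col) != b.get(col)
        }
        if diffs:
            changed[k] = diffs

    return {"only_in_a": only_a, "only_in_b": only_b, "changed": changed}


# Made-up sample data: row 2 changes, row 3 disappears, row 4 is new.
old = "id,name,amount\n1,alice,10\n2,bob,20\n3,carol,30\n"
new = "id,name,amount\n1,alice,10\n2,bob,25\n4,dave,40\n"
result = compare_csv(old, new, key="id")
print(result)
```

Values are compared as strings here, so `10` and `10.0` would count as different; a dataframe-based tool can instead compare typed columns with numeric tolerances.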
Type System for Data Analysis in Python
Project mention: Visions – User defined data type systems | reddit.com/r/Python | 2022-02-04
Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark
Project mention: Secure Sentiment Analysis with Enclaves | dev.to | 2022-11-22
There are three essential components that enable this: cape encrypt, cape deploy, and cape run. The command cape encrypt encrypts inputs that can be sent into the Cape enclave for processing, cape deploy performs all needed actions for deploying a function into the enclave, and finally cape run invokes the deployed function with an input that was previously encrypted with cape encrypt. Learn more on the Cape docs.
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.
Project mention: From Incubation to Graduation, and Beyond: FlytePath | dev.to | 2022-02-07
Modin: Speeds up Pandas
Example code for running Spark and Hive jobs on EMR Serverless.
Project mention: Should I study these topics or should I skip? | reddit.com/r/AWSCertifications | 2022-11-08
learningOrchestra is a distributed Machine Learning integration tool that facilitates and streamlines iterative processes in a Data Science project.
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Project mention: Show HN: PRQL 0.2 – Releasing a better SQL | news.ycombinator.com | 2022-06-27
> Joins are what makes relational modeling interesting!
It is the central part of RM which is difficult to model using other methods and which requires high expertise in non-trivial use cases. One alternative to how multiple tables can be analyzed without joins is proposed in the concept-oriented model [1] which relies on two equal modeling constructs: sets (like RM) and functions. In particular, it is implemented in the Prosto data processing toolkit [2] and its Column-SQL language. The idea is that links between tables are used instead of joins. A link is formally a function from one set to another set.
[1] Joins vs. Links or Relational Join Considered Harmful https://www.researchgate.net/publication/301764816_Joins_vs_...
[2] https://github.com/asavinov/prosto data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
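The notion of a link as a function from one set to another can be illustrated with a small, purely conceptual sketch in plain Python. This does not use Prosto or Column-SQL; the tables, the names `product_link`, `amount`, and `order_total`, and the data are all made up for illustration. Each order item is mapped to its product by a function, and a derived column is computed by following that function rather than by materializing a joined table.

```python
# Conceptual sketch only: plain dicts and lists stand in for tables.
# Nothing here is Prosto's API; all names and data are hypothetical.

# The "products" set, keyed by product identifier.
products = {
    "p1": {"name": "beer", "price": 10.0},
    "p2": {"name": "chips", "price": 5.0},
}

# The "order_items" set; each element references a product by id.
order_items = [
    {"order": "o1", "product_id": "p1", "quantity": 2},
    {"order": "o1", "product_id": "p2", "quantity": 1},
    {"order": "o2", "product_id": "p1", "quantity": 3},
]


def product_link(item):
    """Link: a function from the order_items set to the products set."""
    return products[item["product_id"]]


def amount(item):
    """Derived column computed by following the link; no joined table is built."""
    return product_link(item)["price"] * item["quantity"]


def order_total(order_id):
    """Aggregate the derived column over the items that belong to one order."""
    return sum(amount(item) for item in order_items if item["order"] == order_id)


amounts = [amount(item) for item in order_items]
print(amounts)
print(order_total("o1"), order_total("o2"))
```

In a relational formulation the same result would typically come from joining the two tables on the product id and then grouping by order; here the link function carries that relationship directly.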
Python Spark related posts
Replacing Pandas with Polars. A Practical Guide
4 projects | news.ycombinator.com | 22 Jan 2023
Conformed Dimensions problem that keeps recurring on every project
1 project | reddit.com/r/dataengineering | 17 Jan 2023
Trying Delta Lake at home
5 projects | reddit.com/r/dataengineering | 31 Dec 2022
The hand-picked selection of the best Python libraries and tools of 2022
11 projects | reddit.com/r/Python | 26 Dec 2022
How do you join two sources with attributes that aren't identical?
1 project | reddit.com/r/dataengineering | 12 Nov 2022
My new company uses Pyspark. I want to learn it before my starting date. Any advice?
1 project | reddit.com/r/datascience | 10 Nov 2022
Should I study these topics or should I skip?
1 project | reddit.com/r/AWSCertifications | 8 Nov 2022
What are some of the best open-source Spark projects in Python? This list will help you: