Top 21 Python Spark Projects
Data science Python notebooks: deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command-line tools.
Project mention: Beginner in Python for Data Science | reddit.com/r/learnpython | 2020-12-27
data science ipython notebooks
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Project mention: How often do you use SQL query tool or service in your daily work? | reddit.com/r/SQL | 2021-11-21
Regarding the subqueries: try https://tablum.io or https://redash.io; they materialize queried data, so you can reuse a subquery multiple times.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Project mention: [D] GPU buying recommendation | reddit.com/r/MachineLearning | 2021-07-17
If you just want to run TensorFlow or PyTorch in a Jupyter notebook, setting up the environment shouldn't be difficult. I know that AWS has a marketplace of preconfigured images. However, you can go as advanced as setting up a cluster of GPU-equipped nodes and using Horovod (https://github.com/horovod/horovod) to do distributed machine learning. Yes, there's a learning curve, but you cannot acquire this skill set any other way.
donnemartin - dev setup
Koalas: pandas API on Apache Spark
Project mention: Spark vs Pandas | reddit.com/r/dataengineering | 2021-02-18
If you like excessive use of square brackets.. I mean pandas, you might want to check out Koalas. Koalas is supposed to provide a pandas DataFrame API implementation on top of Spark.
Python clone of Spark, a MapReduce-like framework in Python
Feature Store for Machine Learning
Project mention: [P] Announcing Feast 0.10: The simplest way to serve features in production | reddit.com/r/MachineLearning | 2021-04-15
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
Jupyter magics and kernels for working with remote Spark clusters
Project mention: Spark is lit once again | dev.to | 2021-10-29
Things get a bit more complicated with interactive sessions. We've created a Sparkmagic-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J gateway, and executed.
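The driver-side loop described above can be sketched in plain Python. The names here are purely illustrative: `fetch_command` and `post_result` stand in for whatever calls Lighter actually makes through the Py4J gateway.

```python
import time

def session_loop(gateway, spark):
    """Poll for Sparkmagic statements and execute them against the session.

    `gateway` is a stand-in for the Py4J entry point; `spark` is the
    SparkSession the statements run against. Names are hypothetical.
    """
    while True:
        command = gateway.fetch_command()  # next statement from the Java collection
        if command == "stop":
            break                          # session was closed
        if command is None:
            time.sleep(0.1)                # nothing queued yet; keep the session alive
            continue
        try:
            gateway.post_result(eval(command, {"spark": spark}))
        except Exception as exc:
            gateway.post_result(f"error: {exc}")
```

The essential trick is that the PySpark application never exits on its own; it idles between statements so the Spark session (and its executors) stay warm across Sparkmagic cells.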
Scalable genomic data analysis.
Project mention: Ask HN: Who is hiring? (July 2021) | news.ycombinator.com | 2021-07-01
Broad Institute of MIT and Harvard | Cambridge, MA | Associate Software Engineer | Onsite
We are seeking an associate software engineer interested in contributing to an open-source data visualization library for analyzing the biological impact of human genetic variation. You will contribute to projects like gnomAD (https://gnomad.broadinstitute.org), the world's largest catalogue of human genetic variation, used by hundreds of thousands of researchers, and help us scale towards millions of genomes in the coming years. We are also developing next-generation tools for enabling genetic analyses of large biobanks of richly phenotyped individuals (https://genebass.org). In this role you will gain experience developing data-intensive web applications with TypeScript, React, Python, Terraform, and Google Cloud Platform, and will make use of the scalable data analysis library Hail (https://hail.is). Key to our success is growing a strong team with a diverse membership who foster a culture of continual learning and who support the growth and success of one another. Towards this end, we are committed to seeking applications from women and from underrepresented groups. We know that many excellent candidates choose not to apply despite their capabilities; please allow us to enthusiastically counter this tendency.
Please provide a CV and links to previous work or projects, ideally with contributions visible on GitHub.
email: [email protected]
ListenBrainz looks like an interesting project for building better (or at least more open) recommendation systems.
A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask without any rewrites. (by fugue-project)
Project mention: FugueSQL: SQL-ish for pandas, dask, spark | news.ycombinator.com | 2021-10-11
Hey, I am the author of Fugue.
Fugue is a higher level abstraction compared to Ray. It provides unified and non-invasive interfaces for people to use Spark, Dask and Pandas. Ray/Modin is also on our roadmap.
It provides both a Python interface (not pandas-like) and Fugue SQL (standard SQL plus extra features). Users can choose whichever they are most comfortable with as the semantic layer for distributed computing; the two are equivalent.
With Fugue, most of your logic will be in simple Python/SQL that is framework- and scale-agnostic. From the mindset to the code, Fugue minimizes your dependency on any specific computing framework, including Fugue itself.
Please let me know if you want to learn more. Our Slack link is in the README of the Fugue repo.
Fugue repo: https://github.com/fugue-project/fugue
Monitor the stability of a pandas or spark dataframe ⚙︎
Project mention: Monitor the stability of a pandas or spark dataframe | news.ycombinator.com | 2021-09-15
Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark
Project mention: Data Anonymization Libraries | reddit.com/r/Python | 2021-11-10
I was wondering what other helpful and easy-to-use libraries there are for data anonymization, like Faker and cape-python?
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible.
Project mention: Release the TextHTMLPress package to PyPI | dev.to | 2021-11-26
Based on references on setting up a Python project, package structure, and a production-level Python package, I refactored the package as shown below:
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Project mention: No-Code Self-Service BI/Data Analytics Tool | news.ycombinator.com | 2021-11-13
Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI, etc.) were thought of as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE``, and we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.
Yet the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems have provided different answers to this question, but all of them are highly specific and rather limited.
Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach: the relational model works with sets (rows of tables), not with columns.
One generic approach to working with columns in multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the *Prosto* data processing toolkit, which is an alternative to map-reduce and SQL:
It defines data transformations as operations on columns in multiple tables. Since we use mathematical functions, no join and no groupby operations are needed, which significantly simplifies data transformations and makes them more natural.
Moreover, now it provides *Column-SQL* which makes it even easier to define new columns in terms of other columns:
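For contrast, here is what the two cases look like in plain pandas: a derived column inside one table is spreadsheet-simple, but as soon as a column depends on another table you need a merge, and the reverse direction needs a groupby. This is the boilerplate the column-oriented approach aims to eliminate (a plain-pandas illustration, not Prosto's own API):

```python
import pandas as pd

# Two tables linked by a "customer" column (illustrative data).
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer": ["x", "x", "y"],
                       "amount": [10, 20, 5]})
customers = pd.DataFrame({"customer": ["x", "y"], "discount": [0.1, 0.0]})

# Within one table, a derived column is trivial: ColumnA = ColumnB * constant.
orders["net"] = orders["amount"] * 0.9

# Across tables, pulling Customers::discount into Orders already needs a join...
orders = orders.merge(customers, on="customer")

# ...and the reverse direction (a per-customer total in Customers) needs a groupby
# followed by another join.
totals = orders.groupby("customer", as_index=False)["amount"].sum()
customers = customers.merge(totals.rename(columns={"amount": "total"}), on="customer")
print(customers)
```

In the column-oriented view, each of these cross-table columns would instead be a single column definition, with no explicit join or groupby in user code.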
learningOrchestra is a distributed Machine Learning processing tool that facilitates and streamlines iterative processes in a Data Science project.
Project mention: Someone with a good experience in python can rate my code? | reddit.com/r/learnpython | 2021-02-16
Identifies and collects data on CC-licensed content across web crawl data and public APIs.
Project mention: Hacktoberfest Recap | dev.to | 2021-10-31
Issue, Pull Request, Blog Post
The MLOps platform for innovators 🚀
Project mention: DS2.ai Release: End-to-End MLOps Platform | reddit.com/r/programming | 2021-07-13
fastdbfs - An interactive command line client for Databricks DBFS.
Project mention: fastdbfs - An interactive command line client for Databricks DBFS | reddit.com/r/dataengineering | 2021-05-07
fastdbfs is an interactive command line client for accessing Databricks DBFS. It aims to be much friendlier and faster than the official CLI tool, as well as more feature-rich.
ETL Markup Toolkit is a Spark-native tool for expressing ETL transformations as configuration
Project mention: How do you serialize and save "transformations" in your pipeline? | reddit.com/r/dataengineering | 2021-08-31
I have a side project (https://github.com/leozqin/etl-markup-toolkit, if you're interested) that takes transformations as YAML files and outputs step-level logs about each step of the transformation. I've always felt that both artifacts could be made searchable using an ELK stack or something... Do you have similar artifacts? Or perhaps there's a way to turn SQL into a structured or semi-structured form to aid searchability?
Python Spark related posts
No-Code Self-Service BI/Data Analytics Tool
1 project | news.ycombinator.com | 13 Nov 2021
FugueSQL: SQL-ish for pandas, dask, spark
1 project | news.ycombinator.com | 11 Oct 2021
How do you serialize and save "transformations" in your pipeline?
1 project | reddit.com/r/dataengineering | 31 Aug 2021
Alternative tools to DBT / SQL and Python for writing business logic? Trying to prevent creating a mountain of undocumented spaghetti
1 project | reddit.com/r/dataengineering | 30 Aug 2021
How to keep track of the different Transformations done in an ETL pipeline?
2 projects | reddit.com/r/dataengineering | 22 Aug 2021
Show dataengineering: beavis, a library for unit testing Pandas/Dask code
3 projects | reddit.com/r/dataengineering | 9 Aug 2021
Is Spark - The Definitive Guide outdated?
2 projects | reddit.com/r/apachespark | 1 Jul 2021
What are some of the best open-source Spark projects in Python? This list will help you: