Top 22 Python Spark Projects
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data. Project mention: Open source DW? | reddit.com/r/dataengineering | 2022-01-03
Is it a bad idea to use Redash as an ETL and data warehouse tool? I'm not a data engineer, just looking for a low/mid-scale solution for experiments.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Project mention: [D] PyTorch Distributed Training Libraries: What are the current options? | reddit.com/r/MachineLearning | 2021-12-07
Check out Horovod - https://github.com/horovod/horovod
donnemartin - dev setup
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters. Project mention: [D] Plug or Integrate a GNN Pytorch code base into Spark Cluster | reddit.com/r/MachineLearning | 2022-01-03
https://github.com/yahoo/TensorFlowOnSpark: check whether this project is useful for you.
Koalas: pandas API on Apache Spark. Project mention: Spark vs Pandas | reddit.com/r/dataengineering | 2021-02-18
If you like excessive use of square brackets.. I mean pandas, you might wanna check out Koalas. Koalas is supposed to provide a pandas DataFrame API implementation on top of Spark.
Python clone of Spark, a MapReduce-like framework in Python
Feature Store for Machine Learning. Project mention: [P] Announcing Feast 0.10: The simplest way to serve features in production | reddit.com/r/MachineLearning | 2021-04-15
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
Jupyter magics and kernels for working with remote Spark clusters. Project mention: Spark is lit once again | dev.to | 2021-10-29
Things get a bit more complicated with interactive sessions. We've created a Sparkmagic-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J gateway, and executed.
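In plain Python, that polling loop might look like the sketch below. It is purely illustrative: `run_session` and the in-process queue are made-up stand-ins for the Java collection and Py4J gateway that Lighter actually uses.

```python
import queue

def run_session(commands: "queue.Queue", results: list) -> None:
    """Toy version of the driver loop: poll for commands until a stop
    sentinel arrives, execute each one, and record its outcome."""
    namespace = {}
    while True:
        command = commands.get()   # in Lighter this goes through Py4J
        if command is None:        # stop sentinel ends the session
            break
        try:
            exec(command, namespace)               # run the user's statement
            results.append(("ok", namespace.get("_", None)))
        except Exception as exc:
            results.append(("error", repr(exc)))

# Simulate a Sparkmagic session submitting two commands, then stopping.
cmds = queue.Queue()
for c in ("x = 21", "_ = x * 2", None):
    cmds.put(c)
out = []
run_session(cmds, out)
```

The real application would block on the remote gateway rather than a local queue, but the control flow — fetch, execute, report, repeat — is the same.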
Scalable genomic data analysis. Project mention: Ask HN: Who is hiring? (July 2021) | news.ycombinator.com | 2021-07-01
Broad Institute of MIT and Harvard | Cambridge, MA | Associate Software Engineer | Onsite
We are seeking an associate software engineer interested in contributing to an open-source data visualization library for analyzing the biological impact of human genetic variation. You will contribute to projects like gnomAD (https://gnomad.broadinstitute.org), the world's largest catalogue of human genetic variation, used by hundreds of thousands of researchers, and help us scale towards millions of genomes in the coming years. We are also developing next-generation tools for enabling genetic analyses of large biobanks across richly phenotyped individuals (https://genebass.org). In this role you will gain experience developing data-intensive web applications with TypeScript, React, Python, Terraform, and Google Cloud Platform, and will make use of the scalable data analysis library Hail (https://hail.is). Key to our success is growing a strong team with a diverse membership that fosters a culture of continual learning and supports the growth and success of one another. Towards this end, we are committed to seeking applications from women and from underrepresented groups. We know that many excellent candidates choose not to apply despite their capabilities; please allow us to enthusiastically counter this tendency.
Please provide a CV and links to previous work or projects, ideally with contributions visible on GitHub.
email: [email protected]
A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask without any rewrites. Project mention: Pyspark now provides a native Pandas API | reddit.com/r/Python | 2022-01-02
There's dask-sql, but I think it is being abandoned for fugue-project. I'm actually excited for this project as it is trying to provide a backend agnostic solution, which would seem like a difficult, lofty goal. I wish them luck.
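The backend-agnostic idea can be illustrated without Spark or Dask at all: express the logic as a plain function over plain data, then hand it to interchangeable engines. This is a conceptual sketch with made-up `local_engine`/`chunked_engine` helpers, not Fugue's actual API.

```python
from typing import Callable, Iterable, List

Row = dict

def add_double(rows: Iterable[Row]) -> List[Row]:
    """Backend-independent logic: plain Python in, plain Python out."""
    return [{**r, "double": r["value"] * 2} for r in rows]

def local_engine(fn: Callable, rows: list) -> list:
    """Run the transformation eagerly in the current process."""
    return fn(rows)

def chunked_engine(fn: Callable, rows: list, chunk_size: int = 2) -> list:
    """Pretend-distributed engine: apply fn partition by partition and
    merge the results, the way Spark applies a function per partition."""
    out = []
    for i in range(0, len(rows), chunk_size):
        out.extend(fn(rows[i:i + chunk_size]))
    return out

data = [{"value": v} for v in (1, 2, 3)]
# The same function runs unchanged on either engine.
result = local_engine(add_double, data)
assert result == chunked_engine(add_double, data)
```

Fugue's real entry point for this pattern is its `transform()` function, where the engine argument selects pandas, Spark, or Dask execution.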
Monitor the stability of a pandas or spark dataframe ⚙︎ Project mention: Monitor the stability of a pandas or spark dataframe | news.ycombinator.com | 2021-09-15
Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark. Project mention: Anonymize your Data with a single line! | reddit.com/r/Python | 2021-12-26
Well, many of the features in this project are simply wrappers around other libraries like this one. Therefore, the value proposition of this project would either have to be the automation aspect or the idea that you can shield the user from the details of how the implemented techniques work. I think both approaches are risky in this setting.
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible. Project mention: Release the TextHTMLPress package to PyPI | dev.to | 2021-11-26
Based on references on setting up a Python project, package structure, and building a production-level Python package, I refactored the package as shown below:
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby. Project mention: No-Code Self-Service BI/Data Analytics Tool | news.ycombinator.com | 2021-11-13
Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI, etc.) were conceived as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: with ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE`` we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.
Yet the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems provide different answers to this question, but all of them are highly specific and rather limited.
Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach: the relational model works with sets (rows of tables), not with columns.
One generic approach to working with columns in multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander, but then I decided to implement this model in the *Prosto* data processing toolkit, which is an alternative to map-reduce and SQL:
It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no join and no groupby operations are needed, which significantly simplifies data transformations and makes them more natural.
Moreover, now it provides *Column-SQL* which makes it even easier to define new columns in terms of other columns:
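To make the column-oriented idea concrete, here is a toy sketch — not Prosto's real API; `calculate` and `aggregate` are hypothetical names — of defining a derived column inside one table, and a column in another table computed from it, without an explicit join or groupby:

```python
# Toy model: a table is a list of row dicts; a derived column is a function.
orders = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob",   "amount": 20},
    {"customer": "alice", "amount": 5},
]
customers = [{"name": "alice"}, {"name": "bob"}]

def calculate(table, column, fn):
    """Define a column within one table, like ColumnA=ColumnB+ColumnC."""
    for row in table:
        row[column] = fn(row)

def aggregate(target, source, link, column, fn):
    """Define a column in one table from rows of another table,
    using a link function instead of an explicit join/groupby."""
    for row in target:
        matching = [s for s in source if link(s) == row["name"]]
        row[column] = fn(matching)

# A within-table column, then a cross-table column.
calculate(orders, "with_tax", lambda r: r["amount"] * 1.2)
aggregate(customers, orders, lambda o: o["customer"],
          "total", lambda rows: sum(r["amount"] for r in rows))
```

The point of the sketch is that both columns are defined as functions over existing columns; the link between tables replaces the join condition.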
learningOrchestra is a distributed Machine Learning processing tool that facilitates and streamlines iterative processes in a Data Science project. Project mention: Someone with a good experience in python can rate my code? | reddit.com/r/learnpython | 2021-02-16
Identifies and collects data on CC-licensed content across web crawl data and public APIs. Project mention: Hacktoberfest Recap | dev.to | 2021-10-31
Issue, Pull Request, Blog Post
The MLOps platform for innovators 🚀 Project mention: DS2.ai Release: End-to-End MLOps Platform | reddit.com/r/programming | 2021-07-13
fastdbfs - An interactive command line client for Databricks DBFS. Project mention: fastdbfs - An interactive command line client for Databricks DBFS | reddit.com/r/dataengineering | 2021-05-07
fastdbfs is an interactive command line client for accessing Databricks DBFS. It aims to be friendlier, faster, and more feature-rich than the official CLI tool.
ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration. Project mention: How do you serialize and save "transformations" in your pipeline? | reddit.com/r/dataengineering | 2021-08-31
I have a side project (https://github.com/leozqin/etl-markup-toolkit, if you're interested) that takes transformations as YAML files and outputs step-level logs about each step of the transformation. I've always felt that both artifacts could be made searchable using an ELK stack or something... Do you have similar artifacts? Or perhaps there's a way to turn SQL into a structured or semi-structured form to aid searchability?
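The transformations-as-configuration pattern with step-level logs can be sketched in a few lines of plain Python. The step names and config shape here are invented for illustration, not ETL Markup Toolkit's actual schema:

```python
# A pipeline expressed as data, as if loaded from a YAML file, plus a
# runner that emits one structured log record per step.
config = [
    {"step": "filter", "column": "status", "equals": "active"},
    {"step": "select", "columns": ["id", "status"]},
]

STEPS = {
    "filter": lambda rows, c: [r for r in rows if r[c["column"]] == c["equals"]],
    "select": lambda rows, c: [{k: r[k] for k in c["columns"]} for r in rows],
}

def run(rows, config):
    logs = []
    for i, step in enumerate(config):
        rows = STEPS[step["step"]](rows, step)
        # A structured, searchable record per step (e.g. shippable to ELK).
        logs.append({"step_no": i, "step": step["step"], "rows_out": len(rows)})
    return rows, logs

data = [
    {"id": 1, "status": "active", "extra": "x"},
    {"id": 2, "status": "closed", "extra": "y"},
]
result, logs = run(data, config)
```

Because both the pipeline definition and the per-step logs are plain data, indexing them for search is straightforward.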
Python Spark related posts
How can you do efficient text preprocessing?
2 projects | reddit.com/r/LanguageTechnology | 6 Jan 2022
[D] Plug or Integrate a GNN Pytorch code base into Spark Cluster
2 projects | reddit.com/r/MachineLearning | 3 Jan 2022
Pyspark now provides a native Pandas API
3 projects | reddit.com/r/Python | 2 Jan 2022
No-Code Self-Service BI/Data Analytics Tool
1 project | news.ycombinator.com | 13 Nov 2021
FugueSQL: SQL-ish for pandas, dask, spark
1 project | news.ycombinator.com | 11 Oct 2021
How do you serialize and save "transformations" in your pipeline?
1 project | reddit.com/r/dataengineering | 31 Aug 2021
Alternative tools to DBT / SQL and Python for writing business logic? Trying to prevent creating a mountain of undocumented spaghetti
1 project | reddit.com/r/dataengineering | 30 Aug 2021
What are some of the best open-source Spark projects in Python? This list will help you: