Top 5 Python apache-spark Projects
Open source platform for the machine learning lifecycle
Project mention: [D] Tips for ML workflow on raw data | reddit.com/r/MachineLearning | 2022-01-21
A command-line tool for launching Apache Spark clusters.
Project mention: Why Databricks Is Winning | news.ycombinator.com | 2021-02-14
> * AWS has a managed Spark offering called EMR
There is also my rinky-dink open source project, Flintrock, that will launch open source Spark clusters on AWS for you.
It's probably not the right tool for production use (and you would be right to wonder why Flintrock exists when we have EMR), but I know of several companies that have used Flintrock at one point or other in production at large scale (like, 400+ node clusters).
A boilerplate for writing PySpark Jobs
Project mention: Packaging Pyspark Applications | reddit.com/r/apachespark | 2021-07-15
pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)
Project mention: Pyspark now provides a native Pandas API | reddit.com/r/Python | 2022-01-02
Pandas syntax is far inferior to regular PySpark in my opinion. Goes to show how much data analysts value a syntax that they're already familiar with. Pandas syntax makes it harder to reason about queries, abstract DataFrame transformations, etc. I've authored some popular PySpark libraries like quinn and chispa and am not excited to add Pandas syntax support, haha.
Train and run Pytorch models on Apache Spark.
Project mention: Spark2 + pytorch on GPU | reddit.com/r/pytorch | 2021-09-17
Was reading the documentation of sparktorch (https://github.com/dmmiller612/sparktorch) which says you need spark >= 2.4.4. But to the best of my knowledge spark2 doesn't have gpu compute capabilities. Does that mean it can only use cpu compute? Am I missing something?
Python apache-spark related posts
Pyspark now provides a native Pandas API
3 projects | reddit.com/r/Python | 2 Jan 2022
Machine Learning adventures with MLFlow - Deploying models from local system to Production
1 project | reddit.com/r/learnmachinelearning | 22 Dec 2021
How to store preprocessing and feature engineering pipeline?
1 project | reddit.com/r/datascience | 21 Oct 2021
Spark2 + pytorch on GPU
1 project | reddit.com/r/pytorch | 17 Sep 2021
Register Native Functions in PySpark
1 project | reddit.com/r/apachespark | 19 Aug 2021
[D] Looking for good refreshers on stats / ML to go back to the ML engineer interview game after 2 years doing mostly Software.
3 projects | reddit.com/r/MachineLearning | 17 Aug 2021
Packaging Pyspark Applications
1 project | reddit.com/r/apachespark | 15 Jul 2021
What are some of the best open-source apache-spark projects in Python? This list will help you: