Top 14 Python apache-spark Projects
-
MLflow
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
Project mention: DevOps, MLOps, or Platform Engineering, In 2025, who will own the pipeline? | dev.to | 2025-06-20
MLflow or Weights & Biases for experiment tracking
Project mention: Show HN: Pyper – Concurrent Python Made Simple | news.ycombinator.com | 2025-01-12
-
covid-19-data-engineering-pipeline
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
-
e2e-structured-streaming
End-to-end data pipeline that ingests, processes, and stores data. It uses Apache Airflow to schedule scripts that fetch data from an API and send it to Kafka; Spark then processes the data and writes it to Cassandra. The pipeline, built with Python and Apache Zookeeper, is containerized with Docker for easy deployment and scalability.
-
Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data
Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.
-
transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue
Stream CDC into an Amazon S3 data lake in Apache Iceberg format with AWS Glue Streaming using Amazon MSK and MSK Connect (Debezium)
Python apache-spark related posts
-
How to Use KitOps with MLflow
-
Mlflow: Open-source platform for the machine learning lifecycle
-
Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations
-
Explain me how websites like Dall-E, chatgpt, thispersondoesntexit process the user data so quickly
-
[D] What licensed software do you use for machine learning experimentation tracking?
-
[Q] Is there a tool to keep track of my ML experiments?
-
Remote file access vulnerability in `mlflow server` and `mlflow ui` CLIs
-
A note from our sponsor - Sevalla
sevalla.com | 1 Sep 2025
Index
What are some of the best open-source apache-spark projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | MLflow | 21,856 |
| 2 | quinn | 675 |
| 3 | flintrock | 647 |
| 4 | PySpark-Boilerplate | 394 |
| 5 | sparktorch | 339 |
| 6 | pysparkling | 271 |
| 7 | dataproc-templates | 132 |
| 8 | pyjaws | 43 |
| 9 | Apache-Spark-Guide | 31 |
| 10 | covid-19-data-engineering-pipeline | 23 |
| 11 | e2e-structured-streaming | 20 |
| 12 | Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data | 13 |
| 13 | transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue | 5 |
| 14 | livyc | 3 |