Top 14 Python apache-spark Projects

MLflow

1 76 21,856 10.0 Python

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

Project mention: DevOps, MLOps, or Platform Engineering, In 2025, who will own the pipeline? | dev.to | 2025-06-20

MLflow or Weights & Biases for experiment tracking
quinn

2 9 675 6.2 Python

pyspark methods to enhance developer productivity 📣 👯 🎉 (by mrpowers-io)
flintrock

3 1 647 4.7 Python

A command-line tool for launching Apache Spark clusters.
PySpark-Boilerplate

4 1 394 2.5 Python

A boilerplate for writing PySpark Jobs
sparktorch

5 1 339 2.5 Python

Train and run PyTorch models on Apache Spark.
pysparkling

6 1 271 0.0 Python

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

Project mention: Show HN: Pyper – Concurrent Python Made Simple | news.ycombinator.com | 2025-01-12
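
To illustrate what a pure Python take on an RDD-style interface can look like, here is a minimal sketch. The `MiniRDD` class and its method set are hypothetical names chosen for this example; it is not pysparkling's code or API, only a single-process toy that mimics the split between lazy transformations (`map`, `filter`) and eager actions (`collect`, `reduce`, `count`).

```python
from functools import reduce
from typing import Any, Callable, Iterable, Iterator, List


class MiniRDD:
    """Hypothetical single-process stand-in for an RDD (illustration only)."""

    def __init__(self, source: Callable[[], Iterator[Any]]) -> None:
        # Keep a factory rather than an iterator so that every action
        # re-evaluates the whole pipeline from the original data.
        self._source = source

    @classmethod
    def parallelize(cls, data: Iterable[Any]) -> "MiniRDD":
        items = list(data)  # snapshot the input once
        return cls(lambda: iter(items))

    # Transformations: build a new MiniRDD, compute nothing yet.
    def map(self, fn: Callable[[Any], Any]) -> "MiniRDD":
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "MiniRDD":
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the pipeline and return a plain Python value.
    def collect(self) -> List[Any]:
        return list(self._source())

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        return reduce(fn, self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


rdd = MiniRDD.parallelize(range(10))
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = squares_of_evens.collect()
total = squares_of_evens.reduce(lambda a, b: a + b)
```

Because each transformation only wraps the previous factory, `squares_of_evens` can be collected, reduced, and counted repeatedly without exhausting an iterator. A real implementation would also need partitioning, key-value operations, and streaming support, none of which are attempted here.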
dataproc-templates

7 1 132 7.8 Python

Dataproc templates and pipelines for solving in-cloud data tasks
pyjaws

8 4 43 0.0 Python

PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows
Apache-Spark-Guide

9 2 31 1.8 Python

Apache Spark Guide
covid-19-data-engineering-pipeline

10 1 23 5.3 Python

A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
e2e-structured-streaming

11 1 20 5.8 Python

End-to-end data pipeline that ingests, processes, and stores data. It uses Apache Airflow to schedule scripts that fetch data from an API, sends the data to Kafka, and processes it with Spark before writing to Cassandra. The pipeline, built with Python and Apache ZooKeeper, is containerized with Docker for easy deployment and scalability.
Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data

12 1 13 0.0 Python

Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.
transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue

13 1 5 4.6 Python

Stream CDC into an Amazon S3 data lake in Apache Iceberg format with AWS Glue Streaming using Amazon MSK and MSK Connect (Debezium)
livyc

14 2 3 0.0 Python

Apache Spark as a Service with Apache Livy Client

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python apache-spark related posts

How to Use KitOps with MLflow
1 project | dev.to | 29 Nov 2024

Mlflow: Open-source platform for the machine learning lifecycle
1 project | news.ycombinator.com | 16 May 2024

Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations
1 project | dev.to | 26 Apr 2024

Explain me how websites like Dall-E, chatgpt, thispersondoesntexit process the user data so quickly
1 project | /r/dataengineering | 17 Jun 2023

[D] What licensed software do you use for machine learning experimentation tracking?
1 project | /r/MachineLearning | 11 Jun 2023

[Q] Is there a tool to keep track of my ML experiments?
1 project | /r/datascience | 13 May 2023

Remote file access vulnerability in `mlflow server` and `mlflow ui` CLIs
1 project | /r/LanguageTechnology | 24 Mar 2023

Index

What are some of the best open-source apache-spark projects in Python? This list will help you:

#	Project	Stars
1	MLflow	21,856
2	quinn	675
3	flintrock	647
4	PySpark-Boilerplate	394
5	sparktorch	339
6	pysparkling	271
7	dataproc-templates	132
8	pyjaws	43
9	Apache-Spark-Guide	31
10	covid-19-data-engineering-pipeline	23
11	e2e-structured-streaming	20
12	Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data	13
13	transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue	5
14	livyc	3

Did you know that Python is the 2nd most popular programming language based on number of references?
