Python apache-spark

Open-source Python projects categorized as apache-spark

Top 14 Python apache-spark Projects

  1. MLflow

    The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

    Project mention: DevOps, MLOps, or Platform Engineering, In 2025, who will own the pipeline? | dev.to | 2025-06-20

    MLflow or Weights & Biases for experiment tracking

  2. quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by mrpowers-io)

  3. flintrock

    A command-line tool for launching Apache Spark clusters.

  4. PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

  5. sparktorch

    Train and run PyTorch models on Apache Spark.

  6. pysparkling

    A pure Python implementation of Apache Spark's RDD and DStream interfaces.

    Project mention: Show HN: Pyper – Concurrent Python Made Simple | news.ycombinator.com | 2025-01-12

  7. dataproc-templates

    Dataproc templates and pipelines for solving in-cloud data tasks

  8. pyjaws

    PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows

  9. Apache-Spark-Guide

    Apache Spark Guide

  10. covid-19-data-engineering-pipeline

    A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.

  11. e2e-structured-streaming

    End-to-end data pipeline that ingests, processes, and stores data. It uses Apache Airflow to schedule scripts that fetch data from an API, send the data to Kafka, and process it with Spark before writing to Cassandra. The pipeline, built with Python and Apache ZooKeeper, is containerized with Docker for easy deployment and scalability.

  12. Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data

    Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.

  13. transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue

    Stream CDC into an Amazon S3 data lake in Apache Iceberg format with AWS Glue Streaming using Amazon MSK and MSK Connect (Debezium)

  14. livyc

    Apache Spark as a Service with Apache Livy Client

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python apache-spark related posts

  • How to Use KitOps with MLflow

    1 project | dev.to | 29 Nov 2024
  • Mlflow: Open-source platform for the machine learning lifecycle

    1 project | news.ycombinator.com | 16 May 2024
  • Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations

    1 project | dev.to | 26 Apr 2024
  • Explain me how websites like Dall-E, chatgpt, thispersondoesntexit process the user data so quickly

    1 project | /r/dataengineering | 17 Jun 2023
  • [D] What licensed software do you use for machine learning experimentation tracking?

    1 project | /r/MachineLearning | 11 Jun 2023
  • [Q] Is there a tool to keep track of my ML experiments?

    1 project | /r/datascience | 13 May 2023
  • Remote file access vulnerability in `mlflow server` and `mlflow ui` CLIs

    1 project | /r/LanguageTechnology | 24 Mar 2023

Index

What are some of the best open-source apache-spark projects in Python? This list will help you:

# | Project | Stars
1 | MLflow | 21,856
2 | quinn | 675
3 | flintrock | 647
4 | PySpark-Boilerplate | 394
5 | sparktorch | 339
6 | pysparkling | 271
7 | dataproc-templates | 132
8 | pyjaws | 43
9 | Apache-Spark-Guide | 31
10 | covid-19-data-engineering-pipeline | 23
11 | e2e-structured-streaming | 20
12 | Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data | 13
13 | transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue | 5
14 | livyc | 3
