Python apache-spark

Open-source Python projects categorized as apache-spark

Top 15 Python apache-spark Projects

  1. MLflow

    Open source platform for the machine learning lifecycle

    Project mention: Future AI Deployment: Automating Full Lifecycle Management with Rollback Strategies and Cloud Migration | dev.to | 2025-03-15

    AI Model Lifecycle Management: MLflow Documentation

  2. quinn

    PySpark methods to enhance developer productivity 📣 👯 🎉 (by mrpowers-io)

  3. flintrock

    A command-line tool for launching Apache Spark clusters.

  4. PySpark-Boilerplate

    A boilerplate for writing PySpark jobs

  5. sparktorch

    Train and run PyTorch models on Apache Spark.

  6. pysparkling

    A pure Python implementation of Apache Spark's RDD and DStream interfaces.

    Project mention: Show HN: Pyper – Concurrent Python Made Simple | news.ycombinator.com | 2025-01-12

  7. dataproc-templates

    Dataproc templates and pipelines for solving in-cloud data tasks

  8. pyjaws

    PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows

  9. Apache-Spark-Guide

    Apache Spark Guide

  10. covid-19-data-engineering-pipeline

    A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.

  11. e2e-structured-streaming

    End-to-end data pipeline that ingests, processes, and stores data. It uses Apache Airflow to schedule scripts that fetch data from an API, sends the data to Kafka, and processes it with Spark before writing to Cassandra. The pipeline, built with Python and Apache Zookeeper, is containerized with Docker for easy deployment and scalability.

    Project mention: End-to-End Realtime Streaming Data Engineering Project | dev.to | 2024-08-07

    $ git clone https://github.com/akarce/e2e-structured-streaming.git

  12. xonai-dashboard

    A Grafana-based application to assist Big Data infrastructure optimization initiatives where Spark applications are a dominant cost driver

  13. Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data

    Mobile robot data were analyzed with Apache Spark to produce five statistics: travel time, waiting time, average speed, occupancy, and density.

  14. transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue

    Stream CDC into an Amazon S3 data lake in Apache Iceberg format with AWS Glue Streaming using Amazon MSK and MSK Connect (Debezium)

  15. livyc

    Apache Spark as a Service with Apache Livy Client

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python apache-spark related posts

  • How to Use KitOps with MLflow

    1 project | dev.to | 29 Nov 2024
  • Mlflow: Open-source platform for the machine learning lifecycle

    1 project | news.ycombinator.com | 16 May 2024
  • Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations

    1 project | dev.to | 26 Apr 2024
  • Explain me how websites like Dall-E, chatgpt, thispersondoesntexit process the user data so quickly

    1 project | /r/dataengineering | 17 Jun 2023
  • [D] What licensed software do you use for machine learning experimentation tracking?

    1 project | /r/MachineLearning | 11 Jun 2023
  • [Q] Is there a tool to keep track of my ML experiments?

    1 project | /r/datascience | 13 May 2023
  • Remote file access vulnerability in `mlflow server` and `mlflow ui` CLIs

    1 project | /r/LanguageTechnology | 24 Mar 2023

Index

What are some of the best open-source apache-spark projects in Python? This list will help you:

# | Project | Stars
1 | MLflow | 20,230
2 | quinn | 670
3 | flintrock | 642
4 | PySpark-Boilerplate | 396
5 | sparktorch | 340
6 | pysparkling | 268
7 | dataproc-templates | 127
8 | pyjaws | 43
9 | Apache-Spark-Guide | 30
10 | covid-19-data-engineering-pipeline | 23
11 | e2e-structured-streaming | 18
12 | xonai-dashboard | 14
13 | Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data | 12
14 | transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue | 5
15 | livyc | 3
