Spark

Open-source projects categorized as Spark

Top 23 Spark Open-Source Projects

  1. Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Apache Spark VS cocoindex - a user suggested alternative | libhunt.com/r/spark | 2025-04-01
  2. Judoscale

    Save 47% on cloud hosting with autoscaling that just works. Judoscale integrates with Django, FastAPI, Celery, and RQ to make autoscaling easy and reliable. Save big, and say goodbye to request timeouts and backed-up task queues.

  3. data-engineering-zoomcamp

    Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

    Project mention: Study Notes 2.2.7: Managing Schedules and Backfills with BigQuery in Kestra | dev.to | 2025-02-04

    DE Zoomcamp Resources: Data Engineering Zoomcamp

  4. data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  5. Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: The 50 best open-source alternatives to popular SaaS software | dev.to | 2024-07-10

    GitHub: Redash GitHub Repository

  6. ChuanhuChatGPT

    GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

  7. ds-cheatsheets

    List of Data Science Cheatsheets to rule the world

  8. horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

  9. InfluxDB

    InfluxDB high-performance time series database. Collect, organize, and act on massive volumes of high-resolution data to power real-time intelligent systems.

  10. Deeplearning4j

    Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...

  11. doris

    Apache Doris is an easy-to-use, high performance and unified analytics database.

    Project mention: Apache Doris: open-source data warehouse for real time data analytics | news.ycombinator.com | 2024-10-26
  12. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  13. delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Project mention: Twitter's 600-Tweet Daily Limit Crisis: Soaring GCP Costs and the Open Source Fix Elon Musk Ignored | dev.to | 2025-04-10

    Delta Lake: Delta Lake is an open-source storage layer that provides ACID transactions, scalable metadata management, and data versioning on top of existing data lakes. It aims to bring reliability and performance optimizations to big data workloads while ensuring data integrity and consistency.

  14. risingwave

    Stream processing and management platform.

    Project mention: Simplifying SQL function implementation with Rust procedural macro | dev.to | 2025-03-13

    Then, utilize declarative macros to generate various types of kernel functions, including functions with 1, 2, and 3 parameters, as well as the input/output combinations of T and Option. Common kernels like unary, binary, ternary, unary_nullable and unary_bytes are generated, partially addressing the last two issues. (For the implementation details, see RisingWave's earlier code.) Theoretically, type exercise could also be used here. For example, introducing a trait to unify (A,), (A, B) and (A, B, C), or utilizing traits of Into and AsRef to unify T, Option, and Result, etc. However, you will probably encounter some type challenges posed by rustc.

  15. sqlglot

    Python SQL Parser and Transpiler

    Project mention: Duckberg! | dev.to | 2025-03-12

    It could be a nice option to add sqlglot here, as an advanced SQL parsing library.

  16. H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

  17. Alluxio (formerly Tachyon)

    Alluxio, data orchestration for analytics and machine learning in the cloud

  18. Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: Serverless Data Processing on AWS : AWS Project | dev.to | 2024-11-13

    To do so, we will use Kinesis Data Analytics to run an Apache Flink application. To enhance our development experience, we will use Studio notebooks for Kinesis Data Analytics that are powered by Apache Zeppelin.

  19. dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

  20. SynapseML

    Simple and Distributed Machine Learning

  21. spark-nlp

    State of the Art Natural Language Processing

  22. TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

  23. HELK

    The Hunting ELK

  24. RoaringBitmap

    A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

    Project mention: Roaring Bitmap Compression | news.ycombinator.com | 2024-11-08

    There's actually a whole website about it! I found it useful when I was doing deeper research into Elasticsearch: https://roaringbitmap.org

  25. deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Project mention: Deequ: Your Data's BFF | dev.to | 2024-08-23

    Deequ GitHub Repository

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Spark related posts

  • Hybrid in-memory and disk cache in Rust

    2 projects | news.ycombinator.com | 5 Mar 2025
  • Study Note DE Zoomcamp 1.2.4 - Dockerizing the Ingestion Script

    1 project | dev.to | 4 Feb 2025
  • Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead

    7 projects | dev.to | 27 Jan 2025
  • Data Engineering Zoomcamp 2025 Cohort: Introduction - Self-Study Notes

    1 project | dev.to | 25 Jan 2025
  • Apache Zeppelin

    6 projects | news.ycombinator.com | 2 Sep 2024
  • Migrating C# to Python with Claude 3.5 Sonnet.

    1 project | dev.to | 5 Sep 2024
  • Deequ: Your Data's BFF

    3 projects | dev.to | 23 Aug 2024

Index

What are some of the best open-source Spark projects? This list will help you:

#   Project                          Stars
1   Apache Spark                     40,958
2   data-engineering-zoomcamp        30,063
3   data-science-ipython-notebooks   27,993
4   Redash                           27,219
5   ChuanhuChatGPT                   15,421
6   ds-cheatsheets                   15,110
7   horovod                          14,451
8   Deeplearning4j                   13,926
9   doris                            13,529
10  Mage                             8,245
11  delta                            7,957
12  risingwave                       7,657
13  sqlglot                          7,531
14  H2O                              7,117
15  Alluxio (formerly Tachyon)       6,977
16  Zeppelin                         6,476
17  dev-setup                        6,167
18  SynapseML                        5,119
19  spark-nlp                        3,951
20  TensorFlowOnSpark                3,874
21  HELK                             3,822
22  RoaringBitmap                    3,647
23  deequ                            3,405
